Pandas Data Analysis (1) - Data Analysis

Index-based selection
Group computation

Index-based selection

Selecting a subset of the data according to some condition.

.loc is primarily label based, but may also be used with a boolean array. .loc will raise KeyError when the items are not found.

.loc accepts four kinds of input:

* A single label, e.g. 5 or 'a' (note that 5 is interpreted as a label of the index; it is not an integer position along the index)
* A list or array of labels, e.g. ['a', 'b', 'c']
* A slice object with labels, e.g. 'a':'f' (contrary to usual Python slices, both the start and the stop are included when present in the index; a Python list slice excludes the stop)
* A boolean array (plain indexing with the mask, without .loc, also works, but writing .loc explicitly is recommended)

.iloc is primarily integer position based (from 0 to length-1 of the axis), but may also be used with a boolean array. .iloc will raise IndexError if a requested indexer is out-of-bounds, except slice indexers, which allow out-of-bounds indexing (this conforms with Python/NumPy slice semantics).

.iloc accepts four kinds of input:

* An integer, e.g. 5
* A list or array of integers, e.g. [4, 3, 0]
* A slice object with ints, e.g. 1:7 (the stop is excluded, exactly as in a Python list slice)
* A boolean array (plain indexing with the mask, without .iloc, also works, but writing .iloc explicitly is recommended)
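The difference between the two accessors is easiest to see when the integer labels do not coincide with the positions. The following is a minimal sketch; the Series s and its labels are made up for illustration and are not part of the original examples.

```python
import pandas as pd

# Integer labels that deliberately differ from the positions 0, 1, 2.
s = pd.Series(["x", "y", "z"], index=[5, 3, 1])

by_label = s.loc[5]           # label 5 -> "x"
by_position = s.iloc[2]       # third element -> "z"

label_slice = s.loc[5:3]      # label slice: both endpoints are included
position_slice = s.iloc[0:1]  # position slice: the stop is excluded

try:
    s.loc[2]                  # there is no label 2
    missing_label = False
except KeyError:
    missing_label = True

try:
    s.iloc[10]                # position 10 is out of bounds
    out_of_bounds = False
except IndexError:
    out_of_bounds = True

# An out-of-bounds slice is allowed and is simply truncated.
truncated = s.iloc[1:10]
```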

Label-based selection: .loc

Series operations

import numpy as np
import pandas as pd

s1 = pd.Series(["Hello", "ZheDa", 'x', 'o'], index=list('acbd'))
s1

Output

a    Hello
c    ZheDa
b        x
d        o
dtype: object

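Selection exercises

# A single label
s1.loc['a']                                # 'Hello'
# Several labels (the result is still a Series)
s1.loc[['a', 'c']]                        # 'Hello', 'ZheDa'
# A label slice (the slice follows the order of the index as given, not alphabetical order)
s1.loc['a':'b']                            # 'Hello', 'ZheDa', 'x'
# A boolean array
s1.loc[s1.str.endswith('o')]            # 'Hello', 'o'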

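DataFrame operations

A DataFrame is two-dimensional, so both the index and the columns can be addressed.

The selection rules are essentially the same as for a Series.
The row indexer and the column indexer are separated by a comma.
In theory this gives 4 * 4 combinations.

df = pd.DataFrame({"name": ["Jeff", "Tom", "Peter", "Amy"], "age": [21, 25, 22, 18], "sex":["female", "male", "female", "female"]})
df

Output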

	name	age	sex
0	Jeff	21	female
1	Tom	25	male
2	Peter	22	female
3	Amy	18	female

Note that in the examples below the numbers used to select rows are index labels, not positions.

# A single label, a single label
df.loc[0, 'sex']                        # 'female'
# Several labels, several labels
df.loc[[0,1], ['name', 'sex']]

	name	sex
0	Jeff	female
1	Tom	male

# Boolean array, list holding a single label (the result is a DataFrame)
df.loc[df['age'] > 22, ['name']]

	name
1	Tom

# Boolean array, boolean array
df.loc[df['age'] > 22, [False, True, True]]

	age	sex
1	25	male

# Boolean array, boolean array computed from the column names
df.loc[df['age'] > 22, df.columns.str.startswith('a')]

	age
1	25

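Position-based selection: .iloc

Boolean selection with .iloc requires a plain boolean array. A boolean Series (for example the result of a comparison) carries an index, so strip the labels first with .values or .to_numpy().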
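A short sketch of that restriction, reusing the s1 data from this section under the name s. The exact exception type is caught loosely here because it may differ between pandas versions.

```python
import pandas as pd

s = pd.Series(["Hello", "ZheDa", "x", "o"], index=list("acbd"))
mask = s.str.endswith("o")        # boolean Series, still indexed by a/c/b/d

try:
    s.iloc[mask]                  # rejected: the mask still carries labels
    series_mask_ok = True
except (ValueError, NotImplementedError):
    series_mask_ok = False

# A plain NumPy boolean array is accepted.
picked = s.iloc[mask.to_numpy()]
```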

Series operations

s1 = pd.Series(["Hello", "ZheDa", 'x', 'o'], index=list('acbd'))


# A single integer
s1.iloc[0]                                # 'Hello'

# Several integers
s1.iloc[[0, 1]]                            # 'Hello', 'ZheDa'

# An integer slice
s1.iloc[0:3]                            # 'Hello', 'ZheDa', 'x'

# A boolean array
t = s1.str.endswith('o')
s1.iloc[t.values]                        # 'Hello', 'o'

DataFrame operations

We keep using the df created above.

# A single integer
df.iloc[0]

Output

name      Jeff
age         21
sex     female
Name: 0, dtype: object

# Several integers, a single integer
df.iloc[[0,1],1]

Output

0    21
1    25
Name: age, dtype: int64

# Several integers, several integers
df.iloc[[0,1],[0,2]]

	name	sex
0	Jeff	female
1	Tom	male

# Several integers, boolean array
df.iloc[[0,1],[False, True, True]]

	age	sex
0	21	female
1	25	male

# Several integers, boolean array computed from the column names
df.iloc[[0,1],df.columns.str.startswith('a')]

	age
0	21
1	25

Random sampling

A random selection of rows or columns from a Series or DataFrame with the sample() method. The method will sample rows by default, and accepts a specific number of rows/columns to return, or a fraction of rows.

The random_state parameter acts as a seed: the same random_state returns the same sample on every call.

The frac parameter is the fraction of rows (or columns) to return.

df.sample(1)

	name	age	sex
1	Tom	25	male

df.sample(1, axis=1, random_state=1)

	name
0	Jeff
1	Tom
2	Peter
3	Amy

df.sample(frac=0.5)

	name	age	sex
1	Tom	25	male
2	Peter	22	female
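The effect of random_state, frac and axis can be checked with a small sketch like the one below. It rebuilds the same df as above; the seed values 42 and 0 are arbitrary choices for illustration.

```python
import pandas as pd

df = pd.DataFrame({"name": ["Jeff", "Tom", "Peter", "Amy"],
                   "age": [21, 25, 22, 18],
                   "sex": ["female", "male", "female", "female"]})

# The same seed gives the same sample on every call.
first = df.sample(n=2, random_state=42)
second = df.sample(n=2, random_state=42)
same_sample = first.equals(second)

# frac=0.5 of 4 rows returns 2 rows.
half = df.sample(frac=0.5, random_state=0)

# axis=1 samples columns; all rows are kept.
one_column = df.sample(n=1, axis=1, random_state=0)
```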

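Using isin()

isin() returns a boolean vector that tells, for each value of the Series, whether it is contained in the given list.

The resulting mask can then be used to filter one or more columns.

Series application

s1 = pd.Series(["sex", "speak", "tree", "trigger"])
s1

Output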

0        sex
1      speak
2       tree
3    trigger
dtype: object

Now the filtering. Note: the outputs shown here are abbreviated.

s1.isin(['speak','tree'])                    # False True True False

s1.index.isin([0,1])                        # True True False False

s1.isin([0,1])                                # False False False False

DataFrame application

df

	name	age	sex
0	Jeff	21	female
1	Tom	25	male
2	Peter	22	female
3	Amy	18	female

df_res = df.isin({"age": [22,24], "sex": ["female"]})
df_res

	name	age	sex
0	False	False	True
1	False	False	False
2	False	True	True
3	False	False	True

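A word on the any and all functions

all: every element along the given axis is True
any: at least one element along the given axis is True
To count how many True values a column holds, sum the column:
False counts as 0 and True counts as 1

df_res.any()                            # False True True # one result per column (3 columns)
df_res.any(axis=1)                        # True False True True # one result per row (4 rows)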

df_res['sex'].sum()                     # 3
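all() is described above but not exercised. A minimal sketch on the same df_res follows; the variable names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({"name": ["Jeff", "Tom", "Peter", "Amy"],
                   "age": [21, 25, 22, 18],
                   "sex": ["female", "male", "female", "female"]})
df_res = df.isin({"age": [22, 24], "sex": ["female"]})

# Per column: is every value True?
every_in_column = df_res.all()

# Per row, restricted to the two tested columns: did both conditions match?
every_in_row = df_res[["age", "sex"]].all(axis=1)

# Number of True values per column (True counts as 1).
true_per_column = df_res.sum()

# The row mask plugs straight into .loc.
matching = df.loc[every_in_row]
```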

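With these boolean vectors in hand, the natural next step is to combine them with loc/iloc for further selection.

Data filtering

.loc is powerful enough to support fairly complex operations. The first is filtering, similar to a WHERE clause in SQL.

Suppose we want the people who are 22 or 25 years old and female.

# df[(df['age'].isin([22,25])) & (df['sex'] == 'female')]
df.loc[(df['age'].isin([22,25])) & (df['sex'] == 'female')]

	name	age	sex
2	Peter	22	female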

The example above does not really show what .loc can do, so here is a different requirement: add a column named tag that follows these rules:

older than 24 and male: set to 'Adult'
older than 20, younger than 24 and female: set to 'Beauty'
everyone else: set to 'children'

df.loc[(df['age'] > 24) & (df['sex'] == 'male'), 'tag'] = 'Adult'
df.loc[(df['age'] > 20) & (df['age'] < 24) & (df['sex'] == 'female'), 'tag'] = 'Beauty'

# Negate the OR of the two conditions above (note the enclosing parentheses)
df.loc[~(((df['age'] > 20) & (df['age'] < 24) & (df['sex'] == 'female')) | ((df['age'] > 24) & (df['sex'] == 'male'))), 'tag'] = 'children'

df

	name	age	sex	tag
0	Jeff	21	female	Beauty
1	Tom	25	male	Adult
2	Peter	22	female	Beauty
3	Amy	18	female	children
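As an alternative that the original text does not use, the same three-way labelling can be sketched with numpy.select, which avoids writing the negated condition by hand. The helper names is_adult and is_beauty are made up for this sketch.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"name": ["Jeff", "Tom", "Peter", "Amy"],
                   "age": [21, 25, 22, 18],
                   "sex": ["female", "male", "female", "female"]})

is_adult = (df["age"] > 24) & (df["sex"] == "male")
is_beauty = (df["age"] > 20) & (df["age"] < 24) & (df["sex"] == "female")

# Conditions are checked in order; rows matching none get the default.
df["tag"] = np.select([is_adult, is_beauty], ["Adult", "Beauty"], default="children")
```

The trade-off is that np.select builds the whole column in one pass, while the .loc assignments update an existing column piece by piece.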

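The query() method

Filters rows with an expression string. Names in the expression refer to columns (or to the index); to use a Python variable from the enclosing scope, prefix it with @.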

df.query("age > 24 & sex == 'male'")

	name	age	sex	tag
1	Tom	25	male	Adult

# Using the index
df.query("index > 1")

	name	age	sex	tag
2	Peter	22	female	Beauty
3	Amy	18	female	children
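A minimal sketch of referencing local variables inside query() with the @ prefix. The variables min_age and wanted_sex are hypothetical names chosen for this example.

```python
import pandas as pd

df = pd.DataFrame({"name": ["Jeff", "Tom", "Peter", "Amy"],
                   "age": [21, 25, 22, 18],
                   "sex": ["female", "male", "female", "female"]})

min_age = 21
wanted_sex = "female"

# @name looks the value up in the surrounding Python scope.
result = df.query("age > @min_age and sex == @wanted_sex")
```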

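Setting the index

set_index() turns one or more columns into the index. The commonly used parameters are:

keys: the column(s) to use as the index
drop: whether to remove those columns from the columns once they become the index, default True
append: whether to keep the existing index of the DataFrame and append to it, default False
inplace: whether to modify the DataFrame in place, default False; in general we avoid modifying the original data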

df.set_index(keys='tag')

	name	age	sex
tag
Beauty	Jeff	21	female
Adult	Tom	25	male
Beauty	Peter	22	female
children	Amy	18	female

df1 = df.set_index(keys=['tag', 'name'], drop=True, append=False)
df1

		age	sex
tag	name
Beauty	Jeff	21	female
Adult	Tom	25	male
Beauty	Peter	22	female
children	Amy	18	female

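reset_index() moves the index back into the columns and installs a simple integer index.

The level parameter deserves attention: it selects which index level(s) to move back, either by integer position or by name.

df1.reset_index(level=1)    # move index level 1 (name) back into the columns; df1 itself is not modified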

	name	age	sex
tag
Beauty	Jeff	21	female
Adult	Tom	25	male
Beauty	Peter	22	female
children	Amy	18	female

# The next two calls produce the same result, see the output
df1.reset_index(level=[0,1])

# df1 has two index levels, tag and name
df1.reset_index(level=['tag', 'name'])

	tag	name	age	sex
0	Beauty	Jeff	21	female
1	Adult	Tom	25	male
2	Beauty	Peter	22	female
3	children	Amy	18	female
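The drop and append parameters are listed above but not demonstrated. A small sketch on a two-row frame made up for this purpose:

```python
import pandas as pd

df = pd.DataFrame({"name": ["Jeff", "Tom"], "age": [21, 25]})

# drop=False: name becomes the index and also stays as a column.
kept = df.set_index("name", drop=False)

# append=True: the original integer index is kept as the first level.
appended = df.set_index("name", append=True)

# reset_index on one level moves just that level back into the columns.
restored = appended.reset_index(level="name")
```

None of these calls modifies df, because inplace defaults to False.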

The where method

Selecting values from a Series with a boolean vector generally returns a subset of the data. To guarantee that selection output has the same shape as the original data, you can use the where method in Series and DataFrame. Values that do not satisfy the condition are replaced with NaN.

df.where(df['age'] >= 22)

	name	age	sex	tag
0	NaN	NaN	NaN	NaN
1	Tom	25.0	male	Adult
2	Peter	22.0	female	Beauty
3	NaN	NaN	NaN	NaN

print(s1)                            # sex speak tree trigger

s1.where(s1 == 'sex')                # sex NaN NaN NaN
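where() also accepts a replacement value through its other parameter, and mask() is its inverse. A minimal sketch on an illustrative Series of ages:

```python
import pandas as pd

s = pd.Series([21, 25, 22, 18])

# Same length as s; entries failing the condition become NaN.
kept = s.where(s >= 22)

# other= supplies a replacement instead of NaN.
filled = s.where(s >= 22, other=0)

# mask() replaces where the condition is True, the opposite of where().
masked = s.mask(s >= 22, other=0)
```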

Handling duplicate data

duplicated() returns a boolean array with one entry per row, marking whether that row is a duplicate; drop_duplicates() removes the duplicate rows. The main parameters:

The keep parameter controls which rows are kept

keep='first' (default): mark / drop duplicates except for the first occurrence.
keep='last': mark / drop duplicates except for the last occurrence.
keep=False: mark / drop all duplicates.

The subset parameter restricts the comparison to the given columns

# np.random.randn(6) draws 6 samples from the standard normal distribution
df2 = pd.DataFrame({'a':['one', 'one', 'two', 'three', 'one', 'four'], 'b':['x', 'y', 'x', 'z', 'x', 'z'],
                   'c': np.random.randn(6)})
df2

	a	b	c
0	one	x	0.106626
1	one	y	0.547766
2	two	x	-1.369171
3	three	z	0.631923
4	one	x	0.355265
5	four	z	-1.013671

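# No row is duplicated across all columns
df2.duplicated()                               # False False False False False False

# On columns a and b, rows 0 and 4 are duplicates of each other
df2.duplicated(subset=['a','b'], keep=False)   # True False False False True False

# Keep the first occurrence and drop the later duplicates
df2.drop_duplicates(subset=['a', 'b'], keep='first')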

	a	b	c
0	one	x	0.106626
1	one	y	0.547766
2	two	x	-1.369171
3	three	z	0.631923
5	four	z	-1.013671
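The three keep options can be compared side by side. This sketch uses only the a and b columns of df2, leaving out the random c column so that the result is deterministic.

```python
import pandas as pd

df2 = pd.DataFrame({"a": ["one", "one", "two", "three", "one", "four"],
                    "b": ["x", "y", "x", "z", "x", "z"]})

first = df2.duplicated(keep="first")   # only the later copy (row 4) is marked
last = df2.duplicated(keep="last")     # only the earlier copy (row 0) is marked
both = df2.duplicated(keep=False)      # rows 0 and 4 are both marked

# keep=False drops every row that has a duplicate.
unique_rows = df2.drop_duplicates(keep=False)
```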

MultiIndex

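Hierarchical indexes let us work with more complex data

Creating an index

Two approaches are common: building the MultiIndex from a list of arrays, or from a list of tuples.

# First approach
a = [['i5', 'i5', 'i7', 'i6', 'i7'], ['128G', '256G', '128G', '30G', '50G']]
index = pd.MultiIndex.from_arrays(a, names=['cpu', 'memory'])
index

Output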

MultiIndex(levels=[['i5', 'i6', 'i7'], ['128G', '256G', '30G', '50G']],
           codes=[[0, 0, 2, 1, 2], [0, 1, 0, 2, 3]],
           names=['cpu', 'memory'])

Note the use of zip below: zip(*a) unpacks a and pairs up the elements position by position, and list() collects the pairs.

# Second approach
tuples = list(zip(*a))          # [('i5', '128G'), ('i5', '256G'), ('i7', '128G'), ('i6', '30G'), ('i7', '50G')]
index = pd.MultiIndex.from_tuples(tuples, names=['cpu', 'memory'])
index

Output

MultiIndex(levels=[['i5', 'i6', 'i7'], ['128G', '256G', '30G', '50G']],
           codes=[[0, 0, 2, 1, 2], [0, 1, 0, 2, 3]],
           names=['cpu', 'memory'])

get_level_values() returns the values of a single level

index.get_level_values(0)        # Index(['i5', 'i5', 'i7', 'i6', 'i7'], dtype='object', name='cpu')

index.get_level_values(1)        # Index(['128G', '256G', '128G', '30G', '50G'], dtype='object', name='memory')
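Since both constructors start from the same data, they should produce the same index. A quick sketch to confirm this, with from_arrays_index and from_tuples_index as illustrative names:

```python
import pandas as pd

a = [["i5", "i5", "i7", "i6", "i7"], ["128G", "256G", "128G", "30G", "50G"]]

from_arrays_index = pd.MultiIndex.from_arrays(a, names=["cpu", "memory"])

# zip(*a) pairs the i-th cpu with the i-th memory value.
tuples = list(zip(*a))
from_tuples_index = pd.MultiIndex.from_tuples(tuples, names=["cpu", "memory"])

same_index = from_arrays_index.equals(from_tuples_index)
cpu_values = list(from_arrays_index.get_level_values("cpu"))
```

get_level_values accepts the level name as well as its integer position.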

A MultiIndex can be assigned to a new DataFrame or Series; the length of the index must match the number of rows

pd.Series(np.random.randn(5), index=index)

Output

cpu  memory
i5   128G     -0.006671
     256G      0.860407
i7   128G      0.314731
i6   30G       1.406275
i7   50G      -2.046360
dtype: float64

Selecting with a MultiIndex

df = pd.DataFrame(np.random.randn(5, 4), index=index, columns=list('ABCD'))
df

		A	B	C	D
cpu	memory
i5	128G	-0.930731	0.269575	-1.635472	0.879672
	256G	-0.316811	-1.217830	-0.104210	-0.218656
i7	128G	0.747805	0.582092	-0.013546	-0.856799
i6	30G	0.564064	-0.074697	1.611539	-0.607888
i7	50G	-0.820006	0.652166	-2.324307	-0.340452

Basic selection:

The only difference is that the label is now a tuple, still wrapped in square brackets, as in the example below

df.loc[('i5', '128G')]

Output

A   -0.930731
B    0.269575
C   -1.635472
D    0.879672
Name: (i5, 128G), dtype: float64

Selecting one or more columns for a given index key

df.loc[('i5', '128G'), 'A']                                # -0.930731429320895

df.loc[('i5', '128G'), ['A', 'B']]                        # -0.930731  0.269575

df.loc[[('i5', '128G'), ('i7', '128G')], ['A', 'B']]

		A	B
cpu	memory
i5	128G	-0.930731	0.269575
i7	128G	0.747805	0.582092

df.loc[('i5')]

	A	B	C	D
memory
128G	-0.930731	0.269575	-1.635472	0.879672
256G	-0.316811	-1.217830	-0.104210	-0.218656

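Selecting on a MultiIndex with slicers

Any list, tuple or boolean array can serve as an indexer
slice(None) selects everything on a level; levels that are not specified are implicitly treated as slice(None)
Every axis must be specified, meaning that both the index and the columns are stated explicitly

# The colon explicitly selects all columns; neither the comma nor the colon may be omitted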
df.loc[(slice(None), '128G'),:]

		A	B	C	D
cpu	memory
i5	128G	-0.930731	0.269575	-1.635472	0.879672
i7	128G	0.747805	0.582092	-0.013546	-0.856799

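IndexSlice offers a more natural syntax and can replace slice(); the indexer is nested inside another indexer

Note: a start:end range slice on a level of a MultiIndex requires the index to be lexsorted. The df used here is not sorted, so a range slice on one of its levels raises an error until df.sort_index() has been applied. Range slices on an ordinary (non-MultiIndex) axis, such as the columns here, work as usual.

For example, df.loc[idx['i5':'i6', '128G'], :] raises an error on this unsorted df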

idx = pd.IndexSlice
df.loc[idx[:,'128G'],:]

		A	B	C	D
cpu	memory
i5	128G	-0.930731	0.269575	-1.635472	0.879672
i7	128G	0.747805	0.582092	-0.013546	-0.856799

df.loc[idx[:,['128G', '30G']], idx['A', 'B']]

		A	B
cpu	memory
i5	128G	-0.930731	0.269575
i7	128G	0.747805	0.582092
i6	30G	0.564064	-0.074697

df.loc[idx['i5',['128G', '30G']], idx['A': 'B']]

		A	B
cpu	memory
i5	128G	-0.930731	0.269575
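Range slices on a MultiIndex level depend on the index being sorted. The sketch below assumes the same cpu/memory index as above, with fixed integers instead of random values so the result is reproducible.

```python
import numpy as np
import pandas as pd

a = [["i5", "i5", "i7", "i6", "i7"], ["128G", "256G", "128G", "30G", "50G"]]
index = pd.MultiIndex.from_arrays(a, names=["cpu", "memory"])
df = pd.DataFrame(np.arange(20).reshape(5, 4), index=index, columns=list("ABCD"))
idx = pd.IndexSlice

# On the unsorted index a label range on a level is rejected.
try:
    df.loc[idx["i5":"i6", :], :]
    unsorted_ok = True
except pd.errors.UnsortedIndexError:
    unsorted_ok = False

# After sorting, the same kind of slice works.
sorted_df = df.sort_index()
cpu_range = sorted_df.loc[idx["i5":"i6", :], :]
cpu_range_128 = sorted_df.loc[idx["i5":"i7", "128G"], :]
```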

xs() selects data on a specified index level

Its limitation is that only one value can be given per level; passing several values for one level raises an error.

df.xs('128G', level=1)

	A	B	C	D
cpu
i5	-0.930731	0.269575	-1.635472	0.879672
i7	0.747805	0.582092	-0.013546	-0.856799

df.xs(('i5','128G'), level=(0, 1))

		A	B	C	D
cpu	memory
i5	128G	-0.930731	0.269575	-1.635472	0.879672

Sorting the index

The level parameter sorts by the specified level(s)

We keep using the df created above; it has not been modified at any point

df.sort_index()

		A	B	C	D
cpu	memory
i5	128G	-0.930731	0.269575	-1.635472	0.879672
	256G	-0.316811	-1.217830	-0.104210	-0.218656
i6	30G	0.564064	-0.074697	1.611539	-0.607888
i7	128G	0.747805	0.582092	-0.013546	-0.856799
	50G	-0.820006	0.652166	-2.324307	-0.340452

df.sort_index(level=[1,0])

		A	B	C	D
cpu	memory
i5	128G	-0.930731	0.269575	-1.635472	0.879672
i7	128G	0.747805	0.582092	-0.013546	-0.856799
i5	256G	-0.316811	-1.217830	-0.104210	-0.218656
i6	30G	0.564064	-0.074697	1.611539	-0.607888
i7	50G	-0.820006	0.652166	-2.324307	-0.340452
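sort_index() returns a new object and leaves the original untouched, which can be verified with a sketch like this one. It uses the same index with placeholder integer data, and by_memory_desc is an illustrative name.

```python
import numpy as np
import pandas as pd

a = [["i5", "i5", "i7", "i6", "i7"], ["128G", "256G", "128G", "30G", "50G"]]
index = pd.MultiIndex.from_arrays(a, names=["cpu", "memory"])
df = pd.DataFrame(np.arange(20).reshape(5, 4), index=index, columns=list("ABCD"))

sorted_df = df.sort_index()

# Levels can be named instead of numbered, and the order reversed.
by_memory_desc = df.sort_index(level="memory", ascending=False)
memory_order = list(by_memory_desc.index.get_level_values("memory"))

# df itself is still in its original, unsorted order.
original_sorted = df.index.is_monotonic_increasing
result_sorted = sorted_df.index.is_monotonic_increasing
```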

Group computation

By “group by” we are referring to a process involving one or more of the following steps

* Splitting the data into groups based on some criteria
* Applying a function to each group independently
* Combining the results into a data structure

Of these, the split step is the most straightforward. In fact, in many situations you may wish to split the data set into groups and do something with those groups yourself. In the apply step, we might wish to do one of the following:

* Aggregation: computing a summary statistic (or statistics) about each group. Some examples:
  - Compute group sums or means
  - Compute group sizes / counts
* Transformation: perform some group-specific computations and return a like-indexed object. Some examples:
  - Standardizing data (zscore) within group
  - Filling NAs within groups with a value derived from each group
* Filtration: discard some groups, according to a group-wise computation that evaluates True or False. Some examples:
  - Discarding data that belongs to groups with only a few members
  - Filtering out data based on the group sum or mean
* Some combination of the above: GroupBy will examine the results of the apply step and try to return a sensibly combined result if it doesn’t fit into either of the above two categories

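This is similar to the GROUP BY statement in SQL, although pandas offers richer functionality.
We can group by the index or by columns, and by a single key or any number of keys.

groupby() has two commonly used parameters: by, which specifies what to group by, and level, which specifies which index level to group by

First we create the test data set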

df = pd.DataFrame({"name":["A", 'B', 'C', 'D', 'E', 'F','G'],
                   "sex": ["f", 'm', 'f', 'f', 'm', 'm', 'm'],
                   "age": [24, 23, 22, 21, 20, 21, 22],
                   "class": [1, 2, 3, 2, 3, 1, 2]})
df

	name	sex	age	class
0	A	f	24	1
1	B	m	23	2
2	C	f	22	3
3	D	f	21	2
4	E	m	20	3
5	F	m	21	1
6	G	m	22	2

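Basic usage

size / groups

size() returns the number of elements in each group
The groups attribute maps each group key to the index labels of its rows; the result is a dict
Note how the keys are read from this dict and how the groups are iterated (see the examples below)
Grouping by a column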

g1 = df.groupby('sex')

g2 = df.groupby(['class', 'sex'])

type(g1)                                # pandas.core.groupby.generic.DataFrameGroupBy

len(g1)                                    # 2

# Number of elements in each group
g1.size()                                # same counts as df['sex'].value_counts(), which orders them by frequency instead

Output

sex
f    3
m    4
dtype: int64

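size() counts the rows of each group, so it does not depend on which column is selected. count() excludes missing values, so it matches size() only for columns without NaN. This data has no missing values, so g1.size(), g1['class'].size() and g1['age'].count() give the same numbers here.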
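The difference between size() and count() only shows up once a value is missing. A minimal sketch on a small frame invented for this purpose:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"sex": ["f", "m", "f", "m"],
                   "age": [24, np.nan, 22, 21]})
g = df.groupby("sex")

sizes = g.size()            # rows per group, NaN included
counts = g["age"].count()   # non-missing ages per group
```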

print(list(g1.groups.keys()))                # ['f', 'm']

# Index labels of each group
g1.groups

Output

{'f': Int64Index([0, 2, 3], dtype='int64'),
 'm': Int64Index([1, 4, 5, 6], dtype='int64')}

Loop over the groups to see what each one contains

for a,b in g1:
    print(a)
    print(b)
    print(type(b))
    print(len(b))

Output

f
  name sex  age  class
0    A   f   24      1
2    C   f   22      3
3    D   f   21      2
<class 'pandas.core.frame.DataFrame'>
3
m
  name sex  age  class
1    B   m   23      2
4    E   m   20      3
5    F   m   21      1
6    G   m   22      2
<class 'pandas.core.frame.DataFrame'>
4
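When only one group is needed, get_group() avoids the loop. A short sketch on the same test data:

```python
import pandas as pd

df = pd.DataFrame({"name": ["A", "B", "C", "D", "E", "F", "G"],
                   "sex": ["f", "m", "f", "f", "m", "m", "m"],
                   "age": [24, 23, 22, 21, 20, 21, 22],
                   "class": [1, 2, 3, 2, 3, 1, 2]})
g1 = df.groupby("sex")

# Fetch a single group as a DataFrame.
females = g1.get_group("f")

# Iterating yields (key, sub-DataFrame) pairs.
group_sizes = {key: len(group) for key, group in g1}
```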

Grouping by the index

# Make the sex column the index
df1 = df.set_index('sex')
df1

	name	age	class
sex
f	A	24	1
m	B	23	2
f	C	22	3
f	D	21	2
m	E	20	3
m	F	21	1
m	G	22	2

g4 = df1.groupby('sex')
g4.groups

Output

{'f': Index(['f', 'f', 'f'], dtype='object', name='sex'),
 'm': Index(['m', 'm', 'm', 'm'], dtype='object', name='sex')}

The next two lines produce the same output as above

g4 = df1.groupby(level=0)
g4.groups

Output

{'f': Index(['f', 'f', 'f'], dtype='object', name='sex'),
 'm': Index(['m', 'm', 'm', 'm'], dtype='object', name='sex')}

Next we give the data set a hierarchical index and again practice grouping by an index level

a = [['a', 'a', 'a', 'b', 'c', 'c', 'g'], ['x', 'y', 'x', 'x', 'x', 'y', 'x']]
index = pd.MultiIndex.from_arrays(a)
df.index = index
df

		name	sex	age	class
a	x	A	f	24	1
	y	B	m	23	2
	x	C	f	22	3
b	x	D	f	21	2
c	x	E	m	20	3
	y	F	m	21	1
g	x	G	m	22	2

df.groupby(level=1).groups

Output

{'x': MultiIndex(levels=[['a', 'b', 'c', 'g'], ['x', 'y']],
            codes=[[0, 0, 1, 2, 3], [0, 0, 0, 0, 0]]),
 'y': MultiIndex(levels=[['a', 'b', 'c', 'g'], ['x', 'y']],
            codes=[[0, 2], [1, 1]])}

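Statistics

A key point of group computation: every statistical function is applied to each group, not to single rows and not to the whole data set

A single statistic applied to one or more columns: mean/sum/std
Several statistics applied to one or more columns
Different statistics for different columns

Applying a single statistic

g1.sum(numeric_only=True)    # numeric_only=True leaves out the non-numeric name column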

	age	class
sex
f	67	6
m	86	8

# Applied to one column
g1['age'].mean()

Output

sex
f    22.333333
m    21.500000
Name: age, dtype: float64

# Applied to several columns
g1[['age', 'class']].std()

	age	class
sex
f	1.527525	1.000000
m	1.290994	0.816497

Applying several statistics

# On one column
g1['age'].agg(['min', 'max', 'std'])

	min	max	std
sex
f	21	24	1.527525
m	20	23	1.290994

# On several columns
g1[['age', 'class']].agg(['min', 'max'])

	age		class
	min	max	min	max
sex
f	21	24	1	3
m	20	23	1	3

Different statistics for different columns

g1.agg({'age':['min', 'max'], 'class':['count']})

	age		class
	min	max	count
sex
f	21	24	3
m	20	23	4
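The dict form above yields two-level column names. If flat, self-chosen names are preferred, named aggregation is an option; this is a sketch of that alternative rather than something the original text covers, and the output column names are arbitrary.

```python
import pandas as pd

df = pd.DataFrame({"name": ["A", "B", "C", "D", "E", "F", "G"],
                   "sex": ["f", "m", "f", "f", "m", "m", "m"],
                   "age": [24, 23, 22, 21, 20, 21, 22],
                   "class": [1, 2, 3, 2, 3, 1, 2]})

# Each keyword is an output column: (source column, aggregation).
summary = df.groupby("sex").agg(
    min_age=("age", "min"),
    max_age=("age", "max"),
    n_rows=("class", "count"),
)
```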

Transformation

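transform() takes a function and applies it to each group separately. Note that the result has the same shape as the input.

We start with the per-group mean, which is relatively easy to follow; look closely at the output

m = lambda x: x.mean()
g1['age'].transform(m)

Output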

a  x    22.333333
   y    21.500000
   x    22.333333
b  x    22.333333
c  x    21.500000
   y    21.500000
g  x    21.500000
Name: age, dtype: float64

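Next, the z-score formula as an exercise. The z-score rescales each group to mean 0 and standard deviation 1; it does not change the shape of the distribution

z_score = lambda s: (s - s.mean()) / s.std()

# Note: the result has as many rows as the DataFrame
g1['age'].transform(z_score)

Output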

a  x    1.091089
   y    1.161895
   x   -0.218218
b  x   -0.872872
c  x   -1.161895
   y   -0.387298
g  x    0.387298
Name: age, dtype: float64
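Because the transformed result lines up row by row with the input, it can be used to fill missing values with a per-group statistic. A minimal sketch with a small frame containing one missing age, made up for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"sex": ["f", "m", "f", "m", "f"],
                   "age": [24.0, 23.0, np.nan, 21.0, 22.0]})

# One mean per row, taken from the row's own group (NaN is skipped).
group_mean = df.groupby("sex")["age"].transform("mean")

# Fill each missing age with the mean of its group.
df["age_filled"] = df["age"].fillna(group_mean)
```

Passing the string "mean" selects the built-in aggregation, which plays the same role as the lambda used above.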

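Filtration

filter() takes a function that is applied to each group separately and keeps or drops whole groups, similar to HAVING in SQL

# Two groups, f and m; only the f group has a mean age above 22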
g1.filter(lambda g: g['age'].mean() > 22)

		name	sex	age	class
a	x	A	f	24	1
	x	C	f	22	3
b	x	D	f	21	2

g1.filter(lambda g: len(g) > 3)

		name	sex	age	class
a	y	B	m	23	2
c	x	E	m	20	3
	y	F	m	21	1
g	x	G	m	22	2
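The same HAVING-style selection can also be written as a boolean mask built with transform(), shown here as a sketch of an equivalent formulation on the flat-index version of the test data.

```python
import pandas as pd

df = pd.DataFrame({"name": ["A", "B", "C", "D", "E", "F", "G"],
                   "sex": ["f", "m", "f", "f", "m", "m", "m"],
                   "age": [24, 23, 22, 21, 20, 21, 22],
                   "class": [1, 2, 3, 2, 3, 1, 2]})

# filter(): keep the rows of every group whose mean age exceeds 22.
by_filter = df.groupby("sex").filter(lambda g: g["age"].mean() > 22)

# Same result with a row-aligned group mean used as a mask.
mask = df.groupby("sex")["age"].transform("mean") > 22
by_mask = df[mask]
```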

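One open question to finish: when a DataFrame is assigned to another variable, is that a real copy or just a second name for the same object? It is easy to find out by experiment.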
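One way to run that experiment is sketched below; alias and snapshot are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({"age": [21, 25]})

alias = df            # plain assignment: a second name for the same object
snapshot = df.copy()  # an independent copy of the data

df.loc[0, "age"] = 99

same_object = alias is df
alias_value = alias.loc[0, "age"]        # follows the change made through df
snapshot_value = snapshot.loc[0, "age"]  # still the old value
```

Plain assignment never copies; use copy() whenever the original must stay unchanged.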