Hierarchical indexing on Series
Two-level indexing
```python
import pandas as pd
import numpy as np

ser = pd.Series(np.random.randn(6),
                index=[['a', 'a', 'b', 'b', 'c', 'c'], [0, 1, 2, 0, 1, 2]])
print(ser)
'''
a  0   -0.286195
   1   -0.462873
b  2   -0.697002
   0   -0.354256
c  1    0.704155
   2   -0.343872
dtype: float64
'''
print(type(ser.index))
# <class 'pandas.core.indexes.multi.MultiIndex'>

# Select by the outer level
print(ser['c'])
'''
1    0.704155
2   -0.343872
dtype: float64
'''
# Select by the inner level
print(ser[:, 2])
'''
b   -0.697002
c   -0.343872
dtype: float64
'''
```
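Beyond outer- and inner-level selection, a single element can be picked with a label tuple, and `xs()` gives a named way to take a cross-section. A minimal sketch, using fixed illustrative values in place of `randn` so the output is deterministic:

```python
import pandas as pd

# Fixed illustrative values instead of random data, so results are reproducible
ser = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
                index=[['a', 'a', 'b', 'b', 'c', 'c'], [0, 1, 2, 0, 1, 2]])

# A single element: pass the outer and inner labels together
print(ser['a', 1])        # 20.0
print(ser.loc[('b', 0)])  # 40.0

# xs() takes a cross-section on a given level, equivalent to ser[:, 2]
print(ser.xs(2, level=1))
```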
Common functions
```python
import pandas as pd
import numpy as np

ser = pd.Series(np.random.randn(6),
                index=[['a', 'a', 'b', 'b', 'c', 'c'], [0, 1, 2, 0, 1, 2]])
print(ser)
'''
a  0    0.664090
   1    0.356893
b  2    2.551393
   0    0.674561
c  1    1.029411
   2   -0.492202
dtype: float64
'''
print(ser.index.levels)  # [['a', 'b', 'c'], [0, 1, 2]]

# swaplevel(): swap the inner and outer levels of the index
print(ser.swaplevel())
'''
0  a    0.664090
1  a    0.356893
2  b    2.551393
0  b    0.674561
1  c    1.029411
2  c   -0.492202
dtype: float64
'''
# unstack(): turn a Series with a multi-level index into a DataFrame;
# the argument selects which level becomes the columns
print(ser.unstack())
'''
          0         1         2
a  0.664090  0.356893       NaN
b  0.674561       NaN  2.551393
c       NaN  1.029411 -0.492202
'''
print(ser.unstack(0))
'''
          a         b         c
0  0.664090  0.674561       NaN
1  0.356893       NaN  1.029411
2       NaN  2.551393 -0.492202
'''
# stack(): turn a DataFrame into a Series with a multi-level index;
# the argument selects which level to stack
print(ser.unstack(0).stack())
'''
0  a    0.664090
   b    0.674561
1  a    0.356893
   c    1.029411
2  b    2.551393
   c   -0.492202
dtype: float64
'''
```
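A MultiIndex can also be built explicitly, which makes the level names visible in the output. A small sketch (the names `outer`/`inner` are arbitrary choices for illustration):

```python
import pandas as pd

# from_product builds the full cartesian product of the given levels
idx = pd.MultiIndex.from_product([['a', 'b', 'c'], [0, 1]],
                                 names=['outer', 'inner'])
ser = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], index=idx)
print(ser)

# With a full product there are no NaN holes, so unstack/stack round-trips
restored = ser.unstack().stack()
print(restored.equals(ser))  # True
```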
Statistical computations

```python
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(4, 3), columns=['a', 'b', 'c'])
df.loc[0, 'a'] = np.nan
print(df)
'''
          a         b         c
0       NaN -0.351020 -0.082929
1 -0.227825 -2.144045  1.403983
2 -0.491475  0.729788  2.318608
3 -0.693459  0.006438  1.203519
'''
# NaN values are skipped by default
print(df.min())
'''
a   -0.693459
b   -2.144045
c   -0.082929
dtype: float64
'''
# skipna=False makes NaN propagate into the result
print(df.min(skipna=False))
'''
a         NaN
b   -2.144045
c   -0.082929
dtype: float64
'''
```
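The same skipna behavior applies to the other reductions, and `count()` reports how many non-NaN values each column has. A small sketch with fixed values:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [np.nan, 1.0, 2.0], 'b': [3.0, 4.0, 5.0]})

# count() counts only the non-NaN entries per column
print(df.count())      # a: 2, b: 3

# sum() and mean() skip NaN by default as well
print(df['a'].sum())   # 3.0
print(df['a'].mean())  # 1.5

# axis=1 reduces along rows instead of columns
print(df.min(axis=1))  # 3.0, 1.0, 2.0
```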
Grouping and aggregation
Basic concepts
- Statistical functions are applied to the resulting groups to analyze the data, e.g. aggregating, transforming, or filtering grouped data. The process consists of three steps:
- Splitting: partition the data into groups
- Applying: apply an aggregation function to each group and compute the result
- Combining: merge the per-group results
- groupby splits the original DataFrame into sub-DataFrames according to the groupby key, one per group; all subsequent operations (agg, apply, etc.) act on these sub-DataFrames
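The three steps above can be sketched directly: iterating a GroupBy object yields (key, sub-DataFrame) pairs, and an aggregation combines the per-group results (fixed illustrative data):

```python
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'b', 'a', 'b'],
                   'data1': [1.0, 2.0, 3.0, 4.0]})

# Splitting: each group is itself a DataFrame
for name, group in df.groupby('key1'):
    print(name, len(group))  # a 2 / b 2

# Applying + combining: aggregate each group and merge into one Series
means = df.groupby('key1')['data1'].mean()
print(means)  # a: 2.0, b: 3.0
```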
Grouping
Grouped objects come in two kinds: DataFrameGroupBy and SeriesGroupBy
Grouping operations
Calling groupby performs no actual computation; the returned object only holds the intermediate grouping data
```python
import pandas as pd
import numpy as np

var = {'key1': ['a', 'b', 'a', 'b'],
       'key2': ['one', 'one', 'two', 'two'],
       'data1': np.random.randn(4),
       'data2': np.random.randn(4)}
df = pd.DataFrame(var)
print(df)
'''
  key1 key2     data1     data2
0    a  one -0.521521 -1.497652
1    b  one  0.671180 -1.394252
2    a  two -0.884387 -0.006109
3    b  two -0.047436 -0.876259
'''
print(type(df.groupby('key1')))
# <class 'pandas.core.groupby.generic.DataFrameGroupBy'>
print(type(df['data1'].groupby(df['key1'])))
# <class 'pandas.core.groupby.generic.SeriesGroupBy'>
```
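That intermediate grouping data can be inspected directly: `.groups` maps each key to its row labels, and `get_group()` materializes one sub-DataFrame. A sketch with fixed values in place of `randn`:

```python
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'b', 'a', 'b'],
                   'key2': ['one', 'one', 'two', 'two'],
                   'data1': [1.0, 2.0, 3.0, 4.0]})
g = df.groupby('key1')

# .groups: mapping from group key to the row labels in that group
print(g.groups)  # {'a': [0, 2], 'b': [1, 3]}

# get_group() pulls out a single sub-DataFrame
print(g.get_group('a'))
```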
Group computations
Computations are then run on the GroupBy object; non-numeric columns are excluded from numeric computations (recent pandas versions require selecting the numeric columns explicitly or passing numeric_only=True rather than dropping them silently)
```python
# size() returns the number of elements in each group
print(df.groupby('key1').size())
'''
key1
a    2
b    2
dtype: int64
'''
print(df.groupby(['key1', 'key2']).size())
'''
key1  key2
a     one     1
      two     1
b     one     1
      two     1
dtype: int64
'''
print(df['data1'].groupby(df['key1']).size())
'''
key1
a    2
b    2
Name: data1, dtype: int64
'''
```
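size() and count() differ once NaN is present: size() counts every row in a group, while count() counts only the non-NaN values. A small sketch:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'key1': ['a', 'b', 'a', 'b'],
                   'data1': [1.0, np.nan, 3.0, 4.0]})

# size() counts rows, NaN included
print(df.groupby('key1').size())            # a: 2, b: 2

# count() skips the NaN in group 'b'
print(df.groupby('key1')['data1'].count())  # a: 2, b: 1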
Aggregation
Built-in aggregation functions
sum(), mean(), max(), min(), count(), size(), describe(), etc.
Custom functions can be used by passing them to the agg method
```python
import pandas as pd
import numpy as np

var = {'key1': ['a', 'b', 'a', 'b'],
       'key2': ['one', 'one', 'two', 'two'],
       'data1': np.random.randn(4),
       'data2': np.random.randn(4)}
df = pd.DataFrame(var)
print(df)
'''
  key1 key2     data1     data2
0    a  one -0.347282  0.826907
1    b  one -0.387571  1.319850
2    a  two -0.659646  0.543489
3    b  two  0.899404  0.508680
'''

def peak_range(g):
    return g.max() - g.min()

# Select the numeric columns explicitly (older pandas silently dropped
# non-numeric columns; recent versions raise instead)
print(df.groupby('key1')[['data1', 'data2']].agg(peak_range))
print(df.groupby('key1')[['data1', 'data2']].agg(lambda g: g.max() - g.min()))
'''
         data1     data2
key1
a     0.312364  0.283418
b     1.286975  0.811170
'''
```
Using multiple aggregation functions
```python
def peak_range(g):
    return g.max() - g.min()

# Select the numeric columns explicitly (older pandas silently dropped
# non-numeric columns; recent versions raise instead)
print(df.groupby('key1')[['data1', 'data2']].agg(['mean', 'count', peak_range]))
'''
          data1                        data2
           mean count peak_range       mean count peak_range
key1
a     -0.411191     2   0.204956  -0.396638     2   1.254341
b     -1.417198     2   0.579441  -1.376166     2   1.557887
'''
# Provide new column names via (name, function) tuples
print(df.groupby('key1')[['data1', 'data2']].agg(['mean', 'count', ('range', peak_range)]))
'''
          data1                     data2
           mean count     range      mean count     range
key1
a     -0.411191     2  0.204956 -0.396638     2  1.254341
b     -1.417198     2  0.579441 -1.376166     2  1.557887
'''
```
Applying different aggregation functions to different columns
Using a dict
```python
print(df.groupby('key1').agg({'data1': 'mean', 'data2': 'sum'}))
'''
         data1     data2
key1
a    -0.624912  1.954829
b     1.062752 -1.502316
'''
print(df.groupby('key1').agg({'data1': ['mean', 'max'], 'data2': 'sum'}))
'''
         data1               data2
          mean       max       sum
key1
a    -0.624912 -0.446746  1.954829
b     1.062752  2.126030 -1.502316
'''
```
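As an alternative to the dict form, pandas 0.25+ supports named aggregation, where the keyword argument names become the output columns. A sketch with fixed illustrative data (the column names `d1_mean`/`d2_sum` are arbitrary):

```python
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'b', 'a', 'b'],
                   'data1': [1.0, 2.0, 3.0, 4.0],
                   'data2': [5.0, 6.0, 7.0, 8.0]})

# Each keyword maps an output column to a (source column, function) pair
out = df.groupby('key1').agg(d1_mean=('data1', 'mean'),
                             d2_sum=('data2', 'sum'))
print(out)
```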
