2.2 Pandas - 2.2.7 MultiIndex - "Data Analysis"

import pandas as pd
position = pd.read_csv('position.csv')
company = pd.read_csv('company_sql.csv',encoding='gbk')
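# Group by city and education; on pandas 2.0+, pass numeric_only=True to mean() if the frame has non-numeric columns
position.groupby(by=['city','education']).mean()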
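How do we slice out the average salary of PhD holders ('博士') in Shanghai ('上海')?
1. (1) Using index labels
- position.groupby(by=['city','education']).mean().avg selects the avg column, which gives a Series that keeps the (city, education) MultiIndex.
- On a Series you can pass index labels directly to slice it, for example by appending ['上海']['博士'].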
position.groupby(by=['city','education']).mean().avg
position.groupby(by=['city','education']).mean().avg['上海']['博士']
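The label-based lookup above can be tried without the original CSV. The following is a minimal sketch that assumes a tiny, invented stand-in for position.csv (the rows and salary figures are hypothetical); it also shows the tuple form, which selects both levels in one step.

```python
import pandas as pd

# Hypothetical stand-in for position.csv; the values are invented for illustration.
position = pd.DataFrame({
    "city": ["上海", "北京", "上海", "上海"],
    "education": ["本科", "博士", "博士", "博士"],
    "avg": [15.0, 35.0, 30.0, 40.0],
})

grouped = position.groupby(by=["city", "education"]).mean(numeric_only=True)

# Selecting one column yields a Series that keeps the (city, education) MultiIndex.
avg_series = grouped.avg

# A level-0 label returns the sub-Series for that city, indexed by education.
shanghai = avg_series["上海"]

# Chained labels drill down one level at a time.
shanghai_phd = avg_series["上海"]["博士"]

# A tuple addresses both levels in a single lookup.
same_value = avg_series[("上海", "博士")]

print(shanghai_phd, same_value)
```

Both lookups return the same scalar; the tuple form avoids creating the intermediate sub-Series.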
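(2) Using loc: loc accepts labels for several index levels at once
- When a DataFrame has a MultiIndex and you want the values under one specific index entry, loc is the most convenient tool; to drill down level by level, pass several labels to loc.
- Selecting a column as a Series first and then looking up labels is also convenient.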
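position.groupby(by=['city','education']).mean().loc['上海']['avg']     # works: 'avg' is a column of the Shanghai sub-frame
position.groupby(by=['city','education']).mean().loc['上海']['博士']    # raises KeyError: '博士' is an index label, not a column
position.groupby(by=['city','education']).mean().loc['上海','博士']    # fix for the line above: pass both index levels to loc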
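The difference between the failing and the working loc calls can be reproduced on a small frame. This is a sketch under the assumption of an invented stand-in for position.csv (hypothetical rows and values); it uses the explicit tuple form of the row key, which leaves no doubt that both labels refer to index levels.

```python
import pandas as pd

# Hypothetical stand-in for position.csv; the values are invented for illustration.
position = pd.DataFrame({
    "city": ["上海", "北京", "上海", "上海"],
    "education": ["本科", "博士", "博士", "博士"],
    "avg": [15.0, 35.0, 30.0, 40.0],
})

grouped = position.groupby(by=["city", "education"]).mean(numeric_only=True)

# Tuple key: both labels are matched against the row MultiIndex.
row = grouped.loc[("上海", "博士")]            # Series holding that row's columns
value = grouped.loc[("上海", "博士"), "avg"]   # scalar: row key plus column label

# After .loc['上海'] the result is a DataFrame indexed by education,
# so a following ['博士'] is treated as a column lookup and fails.
try:
    grouped.loc["上海"]["博士"]
    raised_key_error = False
except KeyError:
    raised_key_error = True

print(value, raised_key_error)
```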
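2. How do we build a MultiIndex without groupby?
- set_index turns columns into index levels.
- reset_index turns index levels back into columns.
- Slicing is then more convenient.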
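position.set_index(['city','education'])    # builds the MultiIndex, but the rows are not sorted
# sort first, then set the index
position.sort_values(['city','education']).set_index(['city','education'])
# reset_index() turns the index back into columns; then filter with query or boolean slicing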
position.groupby(by=['city','education']).mean().reset_index()
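The set_index / reset_index round trip above can be sketched as follows, assuming an invented, deliberately unsorted stand-in for position.csv (hypothetical rows and values). Calling sort_index() after set_index is shown as an alternative to sorting the values first; it is an addition here, not something the original notes use.

```python
import pandas as pd

# Hypothetical stand-in for position.csv, in unsorted order; values are invented.
position = pd.DataFrame({
    "city": ["上海", "北京", "上海", "上海"],
    "education": ["本科", "博士", "博士", "博士"],
    "avg": [15.0, 35.0, 30.0, 40.0],
})

# set_index keeps the original row order, so the MultiIndex is unsorted.
indexed = position.set_index(["city", "education"])

# Sorting the columns first gives a sorted MultiIndex.
sorted_indexed = position.sort_values(["city", "education"]).set_index(["city", "education"])

# Alternative: build the index first, then sort it.
alt = position.set_index(["city", "education"]).sort_index()

# reset_index moves the index levels back into ordinary columns.
flat = sorted_indexed.reset_index()

print(indexed.index.is_monotonic_increasing, sorted_indexed.index.is_monotonic_increasing)
print(list(flat.columns))
```

A sorted index matters mostly for label-range slices, which pandas expects to run against a lexsorted MultiIndex.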
(1) Using query to find the average salary of PhD holders in Shanghai (the following three forms are equivalent):
- position.groupby(by=['city','education']).mean().reset_index().query("(city=='上海') and (education=='博士')")
- position.groupby(by=['city','education']).mean().reset_index().query("city=='上海'").query("education=='博士'")
- position.groupby(by=['city','education']).mean().reset_index().query("(city=='上海') & (education=='博士')")
position.groupby(by=['city','education']).mean().reset_index().query('city=="上海" and education=="博士"')
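To check that the query forms above really select the same rows, here is a small sketch on an invented stand-in for position.csv (hypothetical rows and values).

```python
import pandas as pd

# Hypothetical stand-in for position.csv; the values are invented for illustration.
position = pd.DataFrame({
    "city": ["上海", "北京", "上海", "上海"],
    "education": ["本科", "博士", "博士", "博士"],
    "avg": [15.0, 35.0, 30.0, 40.0],
})

# Flatten the grouped result so city and education are ordinary columns again.
flat = position.groupby(by=["city", "education"]).mean(numeric_only=True).reset_index()

# Three spellings of the same filter.
r_and = flat.query("(city=='上海') and (education=='博士')")
r_chain = flat.query("city=='上海'").query("education=='博士'")
r_amp = flat.query("(city=='上海') & (education=='博士')")

print(r_and)
print(r_and.equals(r_chain), r_and.equals(r_amp))
```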
(2) Using boolean slicing to find the average salary of PhD holders in Shanghai:
a=position.groupby(by=['city','education']).mean().reset_index()
a[(a.city=='上海')&(a.education=='博士')]
a[(a.city=='上海')&(a.education=='博士')].avg
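The boolean-mask approach can be sketched the same way, again assuming an invented stand-in for position.csv (hypothetical rows and values). The variable names mask and avg_value are introduced only for this illustration; .loc[mask, 'avg'] is shown as one way to pick the rows and the column in a single step.

```python
import pandas as pd

# Hypothetical stand-in for position.csv; the values are invented for illustration.
position = pd.DataFrame({
    "city": ["上海", "北京", "上海", "上海"],
    "education": ["本科", "博士", "博士", "博士"],
    "avg": [15.0, 35.0, 30.0, 40.0],
})

a = position.groupby(by=["city", "education"]).mean(numeric_only=True).reset_index()

# Each comparison yields a boolean Series; & combines them element-wise.
# The parentheses are required because & binds tighter than ==.
mask = (a.city == "上海") & (a.education == "博士")

rows = a[mask]                    # matching rows as a DataFrame
avg_series = a.loc[mask, "avg"]   # rows and column selected together
avg_value = avg_series.iloc[0]    # first (here: only) match as a scalar

print(avg_value)
```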