Table Joins
pandas provides a SQL-like join interface for combining multiple tables. Unlike SQL, pandas can also join on the index.
Concatenate
pd.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False,
          keys=None, levels=None, names=None, verify_integrity=False,
          copy=True)
- objs : a sequence or mapping of Series or DataFrame objects. If a dict is passed, the sorted keys will be used as the keys argument, unless it is passed, in which case the values will be selected (see below). Any None objects will be dropped silently unless they are all None in which case a ValueError will be raised.
- axis : {0, 1, …}, default 0. The axis to concatenate along.
- join : {‘inner’, ‘outer’}, default ‘outer’. How to handle indexes on other axis(es). Outer for union and inner for intersection.
- ignore_index : boolean, default False. If True, do not use the index values on the concatenation axis. The resulting axis will be labeled 0, …, n - 1. This is useful if you are concatenating objects where the concatenation axis does not have meaningful indexing information. Note the index values on the other axes are still respected in the join.
- keys : sequence, default None. Construct hierarchical index using the passed keys as the outermost level. If multiple levels passed, should contain tuples.
- levels : list of sequences, default None. Specific levels (unique values) to use for constructing a MultiIndex. Otherwise they will be inferred from the keys.
- names : list, default None. Names for the levels in the resulting hierarchical index.
- verify_integrity : boolean, default False. Check whether the new concatenated axis contains duplicates. This can be very expensive relative to the actual data concatenation.
- copy : boolean, default True. If False, do not copy data unnecessarily.
The most important parameters:
- objs is usually a sequence of Series or DataFrame objects
- axis: which axis to concatenate along, default 0
- join: two modes; 'inner' concatenates only where the indexes on the other axis overlap, while 'outer' keeps everything and fills NaN where the indexes don't match; default 'outer'
- ignore_index: whether to ignore the index on the concatenation axis; if True, the result is relabeled starting from 0
- keys / names: build a hierarchical index labelling each input block; see the sketch right after this list
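The keys and names parameters deserve a quick illustration. A minimal, self-contained sketch (the series s1 and s2 are invented for this example):
import pandas as pd

s1 = pd.Series(['a', 'b'])
s2 = pd.Series(['c', 'd'])
# keys labels each input block, producing a MultiIndex on the result;
# names names the new outer level
pd.concat([s1, s2], keys=['first', 'second'], names=['source', None])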
First, let's prepare our practice datasets:
df1 = pd.DataFrame({'B': ['B1', 'B2', 'B3'],
                    'A': ['A1', 'A2', 'A3'],
                    'C': ['C1', 'C2', 'C3']},
                   index=[1, 2, 3])
df1
B | A | C | |
---|---|---|---|
1 | B1 | A1 | C1 |
2 | B2 | A2 | C2 |
3 | B3 | A3 | C3 |
df2 = pd.DataFrame({'D': ['C3', 'C4', 'C5'],
                    'C': ['A3', 'A4', 'A5'],
                    'B': ['B3', 'B4', 'B5']},
                   index=[3, 4, 5])
df2
D | C | B | |
---|---|---|---|
3 | C3 | A3 | B3 |
4 | C4 | A4 | B4 |
5 | C5 | A5 | B5 |
pd.concat([df1, df2], sort=False)
A | B | C | D | |
---|---|---|---|---|
1 | A1 | B1 | C1 | NaN |
2 | A2 | B2 | C2 | NaN |
3 | A3 | B3 | C3 | NaN |
3 | NaN | B3 | A3 | C3 |
4 | NaN | B4 | A4 | C4 |
5 | NaN | B5 | A5 | C5 |
pd.concat([df1, df2], join='inner')
B | C | |
---|---|---|
1 | B1 | C1 |
2 | B2 | C2 |
3 | B3 | C3 |
3 | B3 | A3 |
4 | B4 | A4 |
5 | B5 | A5 |
pd.concat([df1, df2], join='inner', ignore_index=True)
B | C | |
---|---|---|
0 | B1 | C1 |
1 | B2 | C2 |
2 | B3 | C3 |
3 | B3 | A3 |
4 | B4 | A4 |
5 | B5 | A5 |
pd.concat([df1, df2], axis=1)
A | B | C | C | B | D | |
---|---|---|---|---|---|---|
1 | A1 | B1 | C1 | NaN | NaN | NaN |
2 | A2 | B2 | C2 | NaN | NaN | NaN |
3 | A3 | B3 | C3 | A3 | B3 | C3 |
4 | NaN | NaN | NaN | A4 | B4 | C4 |
5 | NaN | NaN | NaN | A5 | B5 | C5 |
pd.concat([df1, df2], axis=1, join='inner')
A | B | C | C | B | D | |
---|---|---|---|---|---|---|
3 | A3 | B3 | C3 | A3 | B3 | C3 |
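One more concat parameter worth a quick look before moving on: verify_integrity. Since df1 and df2 share the index label 3, asking concat to verify uniqueness raises an error (a small sketch, not part of the original examples):
try:
    pd.concat([df1, df2], verify_integrity=True)
except ValueError as err:
    print(err)  # complains about the overlapping index label 3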
Joining / Merging
The merge function joins two tables, much like a SQL table join. In pandas you can join not only on columns but also on the index.
pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None,
         left_index=False, right_index=False, sort=True,
         suffixes=('_x', '_y'), copy=True, indicator=False,
         validate=None)
- left: A DataFrame object
- right: Another DataFrame object
- on: Columns (names) to join on. Must be found in both the left and right DataFrame objects. If not passed and left_index and right_index are False, the intersection of the columns in the DataFrames will be inferred to be the join keys
- left_on: Columns from the left DataFrame to use as keys. Can either be column names or arrays with length equal to the length of the DataFrame
- right_on: Columns from the right DataFrame to use as keys. Can either be column names or arrays with length equal to the length of the DataFrame
- left_index: If True, use the index (row labels) from the left DataFrame as its join key(s). In the case of a DataFrame with a MultiIndex (hierarchical), the number of levels must match the number of join keys from the right DataFrame
- right_index: Same usage as left_index for the right DataFrame
- how: One of ‘left’, ‘right’, ‘outer’, ‘inner’. Defaults to inner. See below for more detailed description of each method
- sort: Sort the result DataFrame by the join keys in lexicographical order. Defaults to True, setting to False will improve performance substantially in many cases
- suffixes: A tuple of string suffixes to apply to overlapping columns. Defaults to (‘_x’, ‘_y’).
- copy: Always copy data (default True) from the passed DataFrame objects, even when reindexing is not necessary. Cannot be avoided in many cases but may improve performance / memory usage. The cases where copying can be avoided are somewhat pathological but this option is provided nonetheless.
- indicator: Add a column to the output DataFrame called _merge with information on the source of each row. _merge is Categorical-type and takes on a value of left_only for observations whose merge key only appears in ‘left’ DataFrame, right_only for observations whose merge key only appears in ‘right’ DataFrame, and both if the observation’s merge key is found in both.
Notes on the most commonly used parameters:
- left / right: specify the left and right tables
- on: the join key(s), for when the key columns have the same name in both tables
- left_on / right_on: specify the key(s) for each side separately, for when the key names differ
- left_index / right_index: whether to join on the index
- how: left / right / outer / inner, default inner
- indicator: True / False
- suffixes: rename overlapping non-key columns; a short sketch follows this list
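A minimal sketch of the suffixes parameter (the frames a and b are invented for this example):
a = pd.DataFrame({'k': [1, 2], 'v': [10, 20]})
b = pd.DataFrame({'k': [1, 2], 'v': [30, 40]})
# the overlapping non-key column 'v' receives a suffix from each side,
# yielding columns k, v_left, v_right
pd.merge(a, b, on='k', suffixes=('_left', '_right'))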
Joining on columns
# merge on a single key
pd.merge(df1, df2, on='B')
# the following statement is equivalent to the one above
# pd.merge(df1, df2, left_on='B', right_on='B')
B | A | C_x | D | C_y | |
---|---|---|---|---|---|
0 | B3 | A3 | C3 | C3 | A3 |
df2.loc[3, 'C'] = 'C3'
df2
D | C | B | |
---|---|---|---|
3 | C3 | C3 | B3 |
4 | C4 | A4 | B4 |
5 | C5 | A5 | B5 |
# merge on multiple keys
pd.merge(df1, df2, on=['B', 'C'])
B | A | C | D | |
---|---|---|---|---|
0 | B3 | A3 | C3 | C3 |
# merge on multiple keys, specifying different columns for each side
pd.merge(df1, df2, left_on=['B', 'C'], right_on=['B', 'D'])
B | A | C | D | |
---|---|---|---|---|
0 | B3 | A3 | C3 | C3 |
Notice how the output differs from the one above once the indicator parameter is added: it records whether each row's merge key appears in both tables, only the left table, or only the right table.
pd.merge(df1, df2, on=['B', 'C'], indicator=True)
B | A | C | D | _merge | |
---|---|---|---|---|---|
0 | B3 | A3 | C3 | C3 | both |
pd.merge(df1, df2, how='left', indicator=True)
B | A | C | D | _merge | |
---|---|---|---|---|---|
0 | B1 | A1 | C1 | NaN | left_only |
1 | B2 | A2 | C2 | NaN | left_only |
2 | B3 | A3 | C3 | C3 | both |
pd.merge(df1, df2, how='right', indicator=True)
B | A | C | D | _merge | |
---|---|---|---|---|---|
0 | B3 | A3 | C3 | C3 | both |
1 | B4 | NaN | A4 | C4 | right_only |
2 | B5 | NaN | A5 | C5 | right_only |
pd.merge(df1, df2, how='outer', indicator=True)
B | A | C | D | _merge | |
---|---|---|---|---|---|
0 | B1 | A1 | C1 | NaN | left_only |
1 | B2 | A2 | C2 | NaN | left_only |
2 | B3 | A3 | C3 | C3 | both |
3 | B4 | NaN | A4 | C4 | right_only |
4 | B5 | NaN | A5 | C5 | right_only |
Joining on the index
Note: joins also work on a MultiIndex, as long as the two tables share a common part of the index.
pd.merge(df1, df2, left_index=True, right_index=True)
B_x | A | C_x | D | C_y | B_y | |
---|---|---|---|---|---|---|
3 | B3 | A3 | C3 | C3 | C3 | B3 |
Mixing index and column joins
df2 = df2.reset_index()  # df2's old index becomes an ordinary column named 'index'
pd.merge(df1, df2, left_index=True, right_on='index')
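Only the index label 3 matches here: df1's index is [1, 2, 3], while df2's former index, now stored in the 'index' column, is [3, 4, 5].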
Caveat: when merging on multiple keys, the two sides must supply the same number of keys, and the keys must correspond position by position.
Finally, a comparison of concat and merge:
* concat stitches objects together along an axis, aligning on the index
* merge performs horizontal, relational-style joins
* when concat is used with axis=1, an equivalent result can be obtained with merge on the index (see the sketch below), apart from the column names of the resulting DataFrame
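A minimal sketch of that equivalence (the frames left and right are invented for this example):
left = pd.DataFrame({'x': [1, 2]}, index=['a', 'b'])
right = pd.DataFrame({'y': [3, 4]}, index=['b', 'c'])
# both keep only the shared index label 'b'
pd.concat([left, right], axis=1, join='inner')
pd.merge(left, right, left_index=True, right_index=True)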
Pivoting and Reshaping
Reshape
First, some intuition for the pivot function: pivot merely rearranges the DataFrame's layout; no aggregation happens at this stage.
index parameter: the deduplicated values of the given column become the row labels (one label per row).
columns parameter: the deduplicated values of the given column become the column labels (one label per column).
values parameter: the values of the given column fill the new DataFrame; if not specified, all remaining columns are kept, forming a MultiIndex on the columns.
df = pd.DataFrame({'foo': ['one', 'one', 'two', 'two'],
                   'baz': ['A', 'B', 'C', 'B'],
                   'bar': [1, 2, 3, 4]})
df
foo | baz | bar | |
---|---|---|---|
0 | one | A | 1 |
1 | one | B | 2 |
2 | two | C | 3 |
3 | two | B | 4 |
df.pivot(index='foo', columns='baz', values='bar')
baz | A | B | C |
---|---|---|---|
foo | |||
one | 1.0 | 2.0 | NaN |
two | NaN | 4.0 | 3.0 |
df = pd.DataFrame({'foo': ['one', 'one', 'one', 'two', 'two', 'two'],
                   'bar': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'baz': [1, 2, 3, 4, 5, 6],
                   'zoo': ['x', 'y', 'z', 'q', 'w', 't']})
df
foo | bar | baz | zoo | |
---|---|---|---|---|
0 | one | A | 1 | x |
1 | one | B | 2 | y |
2 | one | C | 3 | z |
3 | two | A | 4 | q |
4 | two | B | 5 | w |
5 | two | C | 6 | t |
df.pivot(index='foo', columns='bar')
baz | zoo | |||||
---|---|---|---|---|---|---|
bar | A | B | C | A | B | C |
foo | ||||||
one | 1 | 2 | 3 | x | y | z |
two | 4 | 5 | 6 | q | w | t |
Now let's simulate some sales data, with a date, a group, and a sales count, and run a pivot over it.
import numpy as np
from datetime import datetime, timedelta

today = datetime.now().date()
date = [today - timedelta(i) for i in range(5)] * 6  # the last 5 days, repeated for 6 groups
group = ['A'] * 5 + ['B'] * 5 + ['C'] * 5 + ['D'] * 5 + ['E'] * 5 + ['F'] * 5
sells = np.random.randint(1000, 100000, size=30)
df = pd.DataFrame({'date': date, 'group': group, 'sells': sells})
df.head(10)
date | group | sells | |
---|---|---|---|
0 | 2019-09-30 | A | 48452 |
1 | 2019-09-29 | A | 38219 |
2 | 2019-09-28 | A | 44226 |
3 | 2019-09-27 | A | 35117 |
4 | 2019-09-26 | A | 33288 |
5 | 2019-09-30 | B | 5537 |
6 | 2019-09-29 | B | 5315 |
7 | 2019-09-28 | B | 9035 |
8 | 2019-09-27 | B | 76891 |
9 | 2019-09-26 | B | 45026 |
# df.pivot(index='group', columns='date', values='sells')
# suppose we want to view the data by date instead
df.pivot(index='date', columns='group', values='sells')
group | A | B | C | D | E | F |
---|---|---|---|---|---|---|
date | ||||||
2019-09-26 | 33288 | 45026 | 24553 | 16556 | 83922 | 55761 |
2019-09-27 | 35117 | 76891 | 75131 | 92268 | 46653 | 12062 |
2019-09-28 | 44226 | 9035 | 97750 | 99789 | 48696 | 56881 |
2019-09-29 | 38219 | 5315 | 49445 | 44874 | 36867 | 90270 |
2019-09-30 | 48452 | 5537 | 12272 | 25255 | 73326 | 20004 |
Note
pivot() will error with a ValueError: Index contains duplicate entries, cannot reshape if the index/column pair is not unique. In this case, consider using pivot_table() which is a generalization of pivot that can handle duplicate values for one index/column pair.
If a (row label, column label) pair does not uniquely determine one value, pivot raises an error; in that case use pivot_table instead. When applying pivot, the value at each row/column pair must be unique. This differs from Excel, whose pivot tables aggregate duplicates directly. See the small example below.
# recreate the small foo/baz/bar frame, then introduce a duplicate pair
df = pd.DataFrame({'foo': ['one', 'one', 'two', 'two'],
                   'baz': ['A', 'B', 'C', 'B'],
                   'bar': [1, 2, 3, 4]})
df.loc[3, 'foo'] = 'one'  # ('one', 'B') now appears twice
# raises ValueError: Index contains duplicate entries, cannot reshape
df.pivot(index='foo', columns='baz', values='bar')
The stack() and unstack() functions both deal with (multi-level) indexes:
- stack(): turns columns into index levels
- unstack(): turns index levels into columns
Some intuition for the stack function:
- "stack" as in piling things up
- it pushes all of the DataFrame's column names into the index
- the data is then displayed stacked
- the result ends up with a MultiIndex
- the result has m * n rows (where m is the number of rows and n the number of columns of the original DataFrame, assuming no duplicate column names)
Some intuition for the unstack function:
- the inverse operation of stack
- by default it unstacks the innermost index level, i.e. the last level
- the unique values of that level become the new column names
- the values are filled into the new DataFrame according to their correspondence
- the MultiIndex loses one level in the process
Compared with pd.melt:
- melt reshapes one or more columns into a single displayed column
- stack turns all non-duplicated columns into index levels
- in fact, anything melt can do can also be done with stack (see the set_index + stack example further below)
Compared with pivot_table:
- pivot_table turns the unique values of one column into row labels
- stack turns all non-duplicated column names into row labels
tuples = list(zip(*[['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
                    ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
df_mul = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
df_mul
A | B | ||
---|---|---|---|
first | second | ||
bar | one | 1.607880 | -0.364222 |
two | 1.753577 | -2.514792 | |
baz | one | 1.143161 | 0.685996 |
two | -1.570172 | 0.333576 | |
foo | one | 0.099722 | 0.923258 |
two | -0.793577 | -2.215845 | |
qux | one | -0.093891 | 1.003900 |
two | 0.504794 | 1.563427 |
df_mul.stack()
Output (the data from columns A and B has been stacked; the random values below evidently come from a different run than the df_mul shown above):
first second
bar one A 0.442057
B 0.757438
two A -0.161969
B -0.028757
baz one A 0.211266
B -0.590143
two A -0.175162
B 0.517473
foo one A -1.854350
B -0.975292
two A 1.297431
B -1.077260
qux one A -1.348159
B 1.318323
two A 1.131164
B 1.410893
dtype: float64
Note here how different values of the level parameter produce different results:
df1 = df_mul.stack()
df1.unstack(level=2)
Output:
A | B | ||
---|---|---|---|
first | second | ||
bar | one | 0.442057 | 0.757438 |
two | -0.161969 | -0.028757 | |
baz | one | 0.211266 | -0.590143 |
two | -0.175162 | 0.517473 | |
foo | one | -1.854350 | -0.975292 |
two | 1.297431 | -1.077260 | |
qux | one | -1.348159 | 1.318323 |
two | 1.131164 | 1.410893 |
# df1.unstack(level=0)
df1.unstack(level=1)
Output:
second | one | two | |
---|---|---|---|
first | |||
bar | A | 0.442057 | -0.161969 |
B | 0.757438 | -0.028757 | |
baz | A | 0.211266 | -0.175162 |
B | -0.590143 | 0.517473 | |
foo | A | -1.854350 | 1.297431 |
B | -0.975292 | -1.077260 | |
qux | A | -1.348159 | 1.131164 |
B | 1.318323 | 1.410893 |
The melt function converts a DataFrame from wide format to long format: some columns are kept as identifier variables, while the others become measured variables. The function automatically generates two columns, "variable" and "value", whose names can be customized via the var_name and value_name parameters.
df = pd.DataFrame({
    'A': ['foo', 'foo', 'foo', 'bar', 'bar'],
    'B': ['one', 'two', 'two', 'one', 'one'],
    'C': ['s', 's', 'l', 'l', 's'],
    'D': [1, 2, 2, 3, 4],
    'E': [2, 4, 8, 16, 32]})
df
A | B | C | D | E | |
---|---|---|---|---|---|
0 | foo | one | s | 1 | 2 |
1 | foo | two | s | 2 | 4
2 | foo | two | l | 2 | 8 |
3 | bar | one | l | 3 | 16 |
4 | bar | one | s | 4 | 32 |
Time to convert from wide format to long format!
df.melt(id_vars=['A', 'B'], value_vars=['C', 'D'], var_name='mear', value_name='val')
A | B | mear | val | |
---|---|---|---|---|
0 | foo | one | C | s |
1 | foo | two | C | s
2 | foo | two | C | l |
3 | bar | one | C | l |
4 | bar | one | C | s |
5 | foo | one | D | 1 |
6 | foo | two | D | 2 |
7 | foo | two | D | 2 |
8 | bar | one | D | 3 |
9 | bar | one | D | 4 |
The next few lines reproduce the effect of the one-liner above by setting an index and stacking:
df1 = df.loc[:, 'A':'D']
df1 = df1.set_index(['A', 'B'])
df1 = df1.stack()        # the C and D column labels move into the index
df1 = df1.reset_index()  # the stacked level becomes the column 'level_2'
df1.rename(columns={'level_2': 'mear', 0: 'val'})
Pivot table
pivot_table provides functionality similar to Excel's pivot tables. The key parameters are as follows:
- data: A DataFrame object
- values: a column or a list of columns to aggregate
- index: a column, Grouper, array which has the same length as data, or list of them. Keys to group by on the pivot table index. If an array is passed, it is being used as the same manner as column values.
- columns: a column, Grouper, array which has the same length as data, or list of them. Keys to group by on the pivot table column. If an array is passed, it is being used as the same manner as column values.
- aggfunc: function to use for aggregation, defaulting to numpy.mean
crosstab is used to compute a cross-tabulation of two or more factors. By default it computes a frequency table of the factors, unless an array of values and an aggregation function are passed (a short sketch follows the parameter list below).
- index: array-like, values to group by in the rows
- columns: array-like, values to group by in the columns
- values: array-like, optional, array of values to aggregate according to the factors
- aggfunc: function, optional, If no values array is passed, computes a frequency table
- rownames: sequence, default None, must match number of row arrays passed
- colnames: sequence, default None, if passed, must match number of column arrays passed
- margins: boolean, default False, Add row/column margins (subtotals)
- normalize: boolean, {‘all’, ‘index’, ‘columns’}, or {0,1}, default False. Normalize by dividing all values by the sum of values.
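A minimal crosstab sketch, reusing the A-E frame df defined in the melt section (assuming df still holds that frame at this point):
# frequency table of A against B
pd.crosstab(df['A'], df['B'])
# aggregate D instead of counting, and add row/column subtotals
pd.crosstab(df['A'], df['B'], values=df['D'], aggfunc=np.sum, margins=True)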
The df.pivot_table function does what pivot cannot: it is mainly for aggregation.
values parameter: the column(s) to aggregate
index parameter: the deduplicated values of this column become the row labels
columns parameter: the deduplicated values of this column become the column labels
aggfunc parameter: can be a string, a list, or a dict
df
A | B | C | D | E | |
---|---|---|---|---|---|
0 | foo | one | s | 1 | 2 |
1 | foo | two | s | 2 | 4 |
2 | foo | two | l | 2 | 8 |
3 | bar | one | l | 3 | 16 |
4 | bar | one | s | 4 | 32 |
df.pivot_table(values='D', index='A', columns='B', aggfunc=[np.sum, np.mean])
sum | mean | |||
---|---|---|---|---|
B | one | two | one | two |
A | ||||
bar | 7.0 | NaN | 3.5 | NaN |
foo | 1.0 | 4.0 | 1.0 | 2.0 |
df.pivot_table(values=['D', 'E'], index=['A', 'B'], columns='C', aggfunc=np.sum, fill_value=0)
D | E | ||||
---|---|---|---|---|---|
C | l | s | l | s | |
A | B | ||||
bar | one | 3 | 4 | 16 | 32 |
foo | one | 0 | 1 | 0 | 2 |
two | 2 | 2 | 8 | 4 |
df.loc[1, 'C'] = 'l'
df
A | B | C | D | E | |
---|---|---|---|---|---|
0 | foo | one | s | 1 | 2 |
1 | foo | two | l | 2 | 4 |
2 | foo | two | l | 2 | 8 |
3 | bar | one | l | 3 | 16 |
4 | bar | one | s | 4 | 32 |
Applying different aggregation functions to different columns:
df.pivot_table(values=['D', 'E'], index=['A', 'B'], columns='C', aggfunc={'D': np.sum, 'E': np.mean}, fill_value=0)
D | E | ||||
---|---|---|---|---|---|
C | l | s | l | s | |
A | B | ||||
bar | one | 3 | 4 | 16 | 32 |
foo | one | 0 | 1 | 0 | 2 |
two | 4 | 0 | 6 | 0 |
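As noted in the parameter list above, aggfunc also accepts a plain string; a one-line sketch (equivalent here to passing np.sum):
df.pivot_table(values='D', index='A', aggfunc='sum')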
One-click data analysis
Note: this package has known bugs; a successful installation under Jupyter cannot be guaranteed on every machine!
https://github.com/pandas-profiling/pandas-profiling
Installation command: pip install pandas-profiling
import pandas_profiling
df.profile_report(style={'full_width':True})
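Alternatively, newer releases of pandas-profiling expose a ProfileReport class (a sketch based on the project's README; the output file name here is arbitrary):
from pandas_profiling import ProfileReport

profile = ProfileReport(df)
profile.to_file('report.html')  # write the report out as HTML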