1. 函数生成简单的DataFrame

  1. import pandas as pd
  2. import numpy as np
  3. def make_df(cols, ind):
  4. # 一个简单的DataFrame
  5. data = {c: [str(c) + str(i) for i in ind]
  6. for c in cols}
  7. return pd.DataFrame(data, ind)
  8. df = make_df('ABC', range(3))
  9. print(df)
  10. A B C
  11. 0 A0 B0 C0
  12. 1 A1 B1 C1
  13. 2 A2 B2 C2

2. Numpy数组的合并

  1. print('Numpy数组的合并')
  2. x = [1, 2, 3]
  3. y = [4, 5, 6]
  4. z = [7, 8, 9]
  5. np.concatenate([x, y, z])
  6. Numpy数组的合并
  7. array([1, 2, 3, 4, 5, 6, 7, 8, 9])

3. Pandas的合并

1、 Series合并

  1. ser1 = pd.Series(['A', 'B', 'C'], index=[1, 2, 3])
  2. ser2 = pd.Series(['D', 'E', 'F'], index=[4, 5, 6])
  3. print('按行合并')
  4. print(pd.concat([ser1, ser2], axis=0))
  5. print('按列合并')
  6. print(pd.concat([ser1, ser2], axis=1))
  7. print(pd.concat([ser1, ser2], axis='columns'))
  8. 按行合并
  9. 1 A
  10. 2 B
  11. 3 C
  12. 4 D
  13. 5 E
  14. 6 F
  15. dtype: object
  16. 按列合并
  17. 0 1
  18. 1 A NaN
  19. 2 B NaN
  20. 3 C NaN
  21. 4 NaN D
  22. 5 NaN E
  23. 6 NaN F
  24. 0 1
  25. 1 A NaN
  26. 2 B NaN
  27. 3 C NaN
  28. 4 NaN D
  29. 5 NaN E
  30. 6 NaN F

2、 DataFrame-concat合并

2.1 指定合并轴

  1. print('对DataFrame的合并,可以指定合并轴')
  2. df1 = make_df('AB', [1, 2])
  3. df2 = make_df('AB', [3, 4])
  4. print(df1)
  5. print(df2)
  6. print(pd.concat([df1, df2]))
  7. print(pd.concat([df1, df2], axis='columns')) # 也可以axis=1
  8. DataFrame的合并,可以指定合并轴
  9. A B
  10. 1 A1 B1
  11. 2 A2 B2
  12. A B
  13. 3 A3 B3
  14. 4 A4 B4
  15. A B
  16. 1 A1 B1
  17. 2 A2 B2
  18. 3 A3 B3
  19. 4 A4 B4
  20. A B A B
  21. 1 A1 B1 NaN NaN
  22. 2 A2 B2 NaN NaN
  23. 3 NaN NaN A3 B3
  24. 4 NaN NaN A4 B4

2.2 并集合并

对DataFrame的合并,sort默认值为True,此时默认进行并集合并(join=’outer’).当更改sort为False时,列的顺序维持原样.

  1. df5 = make_df('BAC', [1, 2])
  2. df6 = make_df('BCD', [3, 4])
  3. print(df5)
  4. print(df6)
  5. print("sort值为True时,列的顺序进行排序")
  6. print(pd.concat([df5, df6]))
  7. print(pd.concat([df5, df6], sort=True))
  8. print("sort值为False时,列的顺序维持原样,不进行重新排序")
  9. print(pd.concat([df5, df6], sort=False))
  10. DataFrame的合并,默认进行并集合并(join='outer'sort默认值为True,表示
  11. B A C
  12. 1 B1 A1 C1
  13. 2 B2 A2 C2
  14. B C D
  15. 3 B3 C3 D3
  16. 4 B4 C4 D4
  17. sort值为True时,列的顺序进行排序
  18. A B C D
  19. 1 A1 B1 C1 NaN
  20. 2 A2 B2 C2 NaN
  21. 3 NaN B3 C3 D3
  22. 4 NaN B4 C4 D4
  23. A B C D
  24. 1 A1 B1 C1 NaN
  25. 2 A2 B2 C2 NaN
  26. 3 NaN B3 C3 D3
  27. 4 NaN B4 C4 D4
  28. sort值为False时,列的顺序维持原样,不进行重新排序
  29. B A C D
  30. 1 B1 A1 C1 NaN
  31. 2 B2 A2 C2 NaN
  32. 3 B3 NaN C3 D3
  33. 4 B4 NaN C4 D4

2.3 交集合并

  1. print(df5)
  2. print(df6)
  3. print('交集合并')
  4. print(pd.concat([df5, df6],join='inner'))
  5. B A C
  6. 1 B1 A1 C1
  7. 2 B2 A2 C2
  8. B C D
  9. 3 B3 C3 D3
  10. 4 B4 C4 D4
  11. 交集合并
  12. B C
  13. 1 B1 C1
  14. 2 B2 C2
  15. 3 B3 C3
  16. 4 B4 C4

3、 DataFrame-merge合并

3.1 merge左合并右合并

  1. print('\n按左边df5数据的行进行合并:')
  2. print(pd.merge(df5, df6, how='left', left_index=True, right_index=True))
  3. print('\n按右边df6数据的行进行合并:')
  4. print(pd.merge(df5, df6, how='right', left_index=True, right_index=True))
  5. 按左边df5数据的行进行合并:
  6. B_x A C_x B_y C_y D
  7. 1 B1 A1 C1 NaN NaN NaN
  8. 2 B2 A2 C2 NaN NaN NaN
  9. 按右边df6数据的行进行合并:
  10. B_x A C_x B_y C_y D
  11. 3 NaN NaN NaN B3 C3 D3
  12. 4 NaN NaN NaN B4 C4 D4

3.2 一对一连接

merge默认丢弃原来的索引,按相同列进行合并,employee这一列完全一样,所以就按照这一列进行合并

  1. import pandas as pd
  2. df1 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
  3. 'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
  4. df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
  5. 'hire_date': [2004, 2008, 2012, 2014]})
  6. print(df1)
  7. print(df2)
  8. print('一对一连接')
  9. df3 = pd.merge(df1, df2)
  10. print(df3)
  11. employee group
  12. 0 Lisa Accounting
  13. 1 Bob Engineering
  14. 2 Jake Engineering
  15. 3 Sue HR
  16. employee hire_date
  17. 0 Lisa 2004
  18. 1 Bob 2008
  19. 2 Jake 2012
  20. 3 Sue 2014
  21. 一对一连接
  22. employee group hire_date
  23. 0 Lisa Accounting 2004
  24. 1 Bob Engineering 2008
  25. 2 Jake Engineering 2012
  26. 3 Sue HR 2014

3.3 一对多连接

  1. df4 = pd.DataFrame({'group': ['Accounting', 'Engineering', 'HR'],
  2. 'supervisor': ['Carly', 'Guido', 'Steve']})
  3. print(df3)
  4. print(df4)
  5. print(pd.merge(df3, df4))
  6. employee group hire_date
  7. 0 Lisa Accounting 2004
  8. 1 Bob Engineering 2008
  9. 2 Jake Engineering 2012
  10. 3 Sue HR 2014
  11. group supervisor
  12. 0 Accounting Carly
  13. 1 Engineering Guido
  14. 2 HR Steve
  15. employee group hire_date supervisor
  16. 0 Lisa Accounting 2004 Carly
  17. 1 Bob Engineering 2008 Guido
  18. 2 Jake Engineering 2012 Guido
  19. 3 Sue HR 2014 Steve

3.4 多对多连接

  1. df5 = pd.DataFrame({'group': ['Accounting', 'Accounting', 'Engineering', 'Engineering', 'HR'],
  2. 'skills': ['math', 'spreadsheets', 'coding', 'spreadsheets', 'organization']})
  3. print(df1)
  4. print(df5)
  5. print(pd.merge(df1, df5))
  6. employee group
  7. 0 Lisa Accounting
  8. 1 Bob Engineering
  9. 2 Jake Engineering
  10. 3 Sue HR
  11. group skills
  12. 0 Accounting math
  13. 1 Accounting spreadsheets
  14. 2 Engineering coding
  15. 3 Engineering spreadsheets
  16. 4 HR organization
  17. employee group skills
  18. 0 Lisa Accounting math
  19. 1 Lisa Accounting spreadsheets
  20. 2 Bob Engineering coding
  21. 3 Bob Engineering spreadsheets
  22. 4 Jake Engineering coding
  23. 5 Jake Engineering spreadsheets
  24. 6 Sue HR organization

3.5 按照不同列名的两列进行融合

  1. df3 = pd.DataFrame({'name': ['Lisa', 'Bob', 'Jake', 'Sue'],
  2. 'salary': [7000, 9000, 10000, 20000]})
  3. aa = pd.merge(df1, df3, left_on='employee', right_on='name')
  4. print(df1)
  5. print(df3)
  6. print(aa)
  7. employee group
  8. 0 Lisa Accounting
  9. 1 Bob Engineering
  10. 2 Jake Engineering
  11. 3 Sue HR
  12. name salary
  13. 0 Lisa 7000
  14. 1 Bob 9000
  15. 2 Jake 10000
  16. 3 Sue 20000
  17. employee group name salary
  18. 0 Lisa Accounting Lisa 7000
  19. 1 Bob Engineering Bob 9000
  20. 2 Jake Engineering Jake 10000
  21. 3 Sue HR Sue 20000

3.6 内连接与外连接

merge()函数默认的是使用内连接

  1. df4 = pd.DataFrame({'employee': ['Lisa', 'David', 'Ben'],
  2. 'salary': [7000, 9000, 10000]})
  3. print(df1)
  4. print(df4)
  5. print(pd.merge(df1, df4))
  6. employee group
  7. 0 Lisa Accounting
  8. 1 Bob Engineering
  9. 2 Jake Engineering
  10. 3 Sue HR
  11. employee salary
  12. 0 Lisa 7000
  13. 1 David 9000
  14. 2 Ben 10000
  15. employee group salary
  16. 0 Lisa Accounting 7000

可以指定merge()里的how参数更改连接方式为外连接

  1. print(df1)
  2. print(df4)
  3. print('连接方式改为外连接')
  4. print(pd.merge(df1, df4, how='outer'))
  5. employee group
  6. 0 Lisa Accounting
  7. 1 Bob Engineering
  8. 2 Jake Engineering
  9. 3 Sue HR
  10. employee salary
  11. 0 Lisa 7000
  12. 1 David 9000
  13. 2 Ben 10000
  14. 连接方式改为外连接
  15. employee group salary
  16. 0 Lisa Accounting 7000.0
  17. 1 Bob Engineering NaN
  18. 2 Jake Engineering NaN
  19. 3 Sue HR NaN
  20. 4 David NaN 9000.0
  21. 5 Ben NaN 10000.0

4、 DataFrame-join合并

pandas.DataFrame.join
参数说明
other:【DataFrame,或者带有名字的Series,或者DataFrame的list】如果传递的是Series,那么其name属性应当是⼀个集合,并且该集合将会作为结果DataFrame的列名
on:【列名称,或者列名称的list/tuple,或者类似形状的数组】连接的列,默认使⽤索引连接
how:【{‘left’, ‘right’, ‘outer’, ‘inner’}, default:‘left’】连接的⽅式,默认为左连接
lsuffix:【string】左DataFrame中重复列的后缀
rsuffix:【string】右DataFrame中重复列的后缀
sort:【boolean, default False】按照字典顺序对结果在连接键上排序。如果为False,连接键的顺序取决于连接类型(关键字)。

通过索引连接DataFrame

  1. import pandas as pd
  2. first=pd.DataFrame({'item_id':['a','b','c','b','d'],'item_price':[1,2,3,2,4]})
  3. other=pd.DataFrame({'item_id':['a','b','f'],'item_atr':['k1','k2','k3']})
  4. print(first)
  5. print(other)
  6. print(first.join(other, lsuffix='_left', rsuffix='_right'))

image.png

通过指定的列连接DataFrame

  1. import pandas as pd
  2. first=pd.DataFrame({'item_id':['a','b','c','b','d'],'item_price':[1,2,3,2,4]})
  3. other=pd.DataFrame({'item_id':['a','b','f'],'item_atr':['k1','k2','k3']})
  4. print(first)
  5. print(other)
  6. print(first.set_index('item_id').join(other.set_index('item_id')))

image.png

通过on参数指定连接的列

通过on参数指定连接的列,DataFrame.join总是使⽤other的索引去连接first,因此我们可以把指定的列设置为other的索引,然后⽤on去指定first的连接列,这样可以让连接结果的索引和first⼀致

  1. import pandas as pd
  2. first=pd.DataFrame({'item_id':['a','b','c','b','d'],'item_price':[1,2,3,2,4]})
  3. other=pd.DataFrame({'item_id':['a','b','f'],'item_atr':['k1','k2','k3']})
  4. print(first)
  5. print(other)
  6. print(first.join(other.set_index('item_id'),on='item_id'))

image.png

左右连接

  1. import pandas as pd
  2. first=pd.DataFrame({'item_id':['a','b','c','b','d'],'item_price':[1,2,3,2,4]})
  3. other=pd.DataFrame({'item_id':['a','b','f'],'item_atr':['k1','k2','k3']})
  4. print(first)
  5. print(other)
  6. # 左右连接
  7. print(first.join(other,how='right',lsuffix='_left',rsuffix='_right')) # 右连接
  8. print(first.join(other,how='left',lsuffix='_left',rsuffix='_right')) # 左连接 同索引连接效果相同

image.png
内外连接

  1. import pandas as pd
  2. first=pd.DataFrame({'item_id':['a','b','c','b','d'],'item_price':[1,2,3,2,4]})
  3. other=pd.DataFrame({'item_id':['a','b','f'],'item_atr':['k1','k2','k3']})
  4. print(first)
  5. print(other)
  6. # 外连接
  7. print(first.join(other.set_index('item_id'),on='item_id',how='outer'))
  8. # 内连接
  9. print(first.join(other.set_index('item_id'),on='item_id',how='inner'))

image.png
————————————————————————————
作者:绵掌骞魁2021
链接:https://wenku.baidu.com/view/036f0802a6e9856a561252d380eb6294dd882224.html
来源:百度文库
著作权归作者所有。商业转载请联系作者获得授权,非商业转载请注明出处。