90个Pandas案例

  • 如何使用列表和字典创建 Series
    • 使用列表创建 Series
    • 使用 name 参数创建 Series
    • 使用简写的列表创建 Series
    • 使用字典创建 Series
  • 如何使用 Numpy 函数创建 Series
  • 如何获取 Series 的索引和值
  • 如何在创建 Series 时指定索引
  • 如何获取 Series 的大小和形状
  • 如何获取 Series 开始或末尾几行数据
    • Head()
    • Tail()
    • Take()
  • 使用切片获取 Series 子集
  • 如何创建 DataFrame
  • 如何设置 DataFrame 的索引和列信息
  • 如何重命名 DataFrame 的列名称
  • 如何根据 Pandas 列中的值从 DataFrame 中选择或过滤行
  • 在 DataFrame 中使用“isin”过滤多行
  • 迭代 DataFrame 的行和列
  • 如何通过名称或索引删除 DataFrame 的列
  • 向 DataFrame 中新增列
  • 如何从 DataFrame 中获取列标题列表
  • 如何随机生成 DataFrame
  • 如何选择 DataFrame 的多个列
  • 如何将字典转换为 DataFrame
  • 使用 ioc 进行切片
  • 检查 DataFrame 中是否是空的
  • 在创建 DataFrame 时指定索引和列名称
  • 使用 iloc 进行切片
  • iloc 和 loc 的区别
  • 使用时间索引创建空 DataFrame
  • 如何改变 DataFrame 列的排序
  • 检查 DataFrame 列的数据类型
  • 更改 DataFrame 指定列的数据类型
  • 如何将列的数据类型转换为 DateTime 类型
  • 将 DataFrame 列从 floats 转为 ints
  • 如何把 dates 列转换为 DateTime 类型
  • 两个 DataFrame 相加
  • 在 DataFrame 末尾添加额外的行
  • 为指定索引添加新行
  • 如何使用 for 循环添加行
  • 在 DataFrame 顶部添加一行
  • 如何向 DataFrame 中动态添加行
  • 在任意位置插入行
  • 使用时间戳索引向 DataFrame 中添加行
  • 为不同的行填充缺失值
  • append, concat 和 combine_first 示例
  • 获取行和列的平均值
  • 计算行和列的总和
  • 连接两列
  • 过滤包含某字符串的行
  • 过滤索引中包含某字符串的行
  • 使用 AND 运算符过滤包含特定字符串值的行
  • 查找包含某字符串的所有行
  • 如果行中的值包含字符串,则创建与字符串相等的另一列
  • 计算 pandas group 中每组的行数
  • 检查字符串是否在 DataFrme 中
  • 从 DataFrame 列中获取唯一行值
  • 计算 DataFrame 列的不同值
  • 删除具有重复索引的行
  • 删除某些列具有重复值的行
  • 从 DataFrame 单元格中获取值
  • 使用 DataFrame 中的条件索引获取单元格上的标量值
  • 设置 DataFrame 的特定单元格值
  • 从 DataFrame 行获取单元格值
  • 用字典替换 DataFrame 列中的值
  • 统计基于某一列的一列的数值
  • 处理 DataFrame 中的缺失值
  • 删除包含任何缺失数据的行
  • 删除 DataFrame 中缺失数据的列
  • 按降序对索引值进行排序
  • 按降序对列进行排序
  • 使用 rank 方法查找 DataFrame 中元素的排名
  • 在多列上设置索引
  • 确定 DataFrame 的周期索引和列
  • 导入 CSV 指定特定索引
  • 将 DataFrame 写入 csv
  • 使用 Pandas 读取 csv 文件的特定列
  • Pandas 获取 CSV 列的列表
  • 找到列值最大的行
  • 使用查询方法进行复杂条件选择
  • 检查 Pandas 中是否存在列
  • 为特定列从 DataFrame 中查找 n-smallest 和 n-largest 值
  • 从 DataFrame 中查找所有列的最小值和最大值
  • 在 DataFrame 中找到最小值和最大值所在的索引位置
  • 计算 DataFrame Columns 的累积乘积和累积总和
  • 汇总统计
  • 查找 DataFrame 的均值、中值和众数
  • 测量 DataFrame 列的方差和标准偏差
  • 计算 DataFrame 列之间的协方差
  • 计算 Pandas 中两个 DataFrame 对象之间的相关性
  • 计算 DataFrame 列的每个单元格的百分比变化
  • 在 Pandas 中向前和向后填充 DataFrame 列的缺失值
  • 在 Pandas 中使用非分层索引使用 Stacking
  • 使用分层索引对 Pandas 进行拆分
  • Pandas 获取 HTML 页面上 table 数据

    如何使用列表和字典创建 Series

    使用列表创建 Series

    ```python import pandas as pd

ser1 = pd.Series([1.5, 2.5, 3, 4.5, 5.0, 6]) print(ser1)

  1. <a name="dY5xC"></a>
  2. #### OutPut
  3. ```python
  4. 0 1.5
  5. 1 2.5
  6. 2 3.0
  7. 3 4.5
  8. 4 5.0
  9. 5 6.0
  10. dtype: float64

使用 name 参数创建 Series

  1. import pandas as pd
  2. ser2 = pd.Series(["India", "Canada", "Germany"], name="Countries")
  3. print(ser2)

OutPut

  1. 0 India
  2. 1 Canada
  3. 2 Germany
  4. Name: Countries, dtype: object

使用简写的列表创建 Series

  1. import pandas as pd
  2. ser3 = pd.Series(["A"]*4)
  3. print(ser3)

OutPut

  1. 0 A
  2. 1 A
  3. 2 A
  4. 3 A
  5. dtype: object

使用字典创建 Series

  1. import pandas as pd
  2. ser4 = pd.Series({"India": "New Delhi",
  3. "Japan": "Tokyo",
  4. "UK": "London"})
  5. print(ser4)

OutPut

  1. India New Delhi
  2. Japan Tokyo
  3. UK London
  4. dtype: object

如何使用 Numpy 函数创建 Series

  1. import pandas as pd
  2. import numpy as np
  3. ser1 = pd.Series(np.linspace(1, 10, 5))
  4. print(ser1)
  5. ser2 = pd.Series(np.random.normal(size=5))
  6. print(ser2)

OutPut

  1. 0 1.00
  2. 1 3.25
  3. 2 5.50
  4. 3 7.75
  5. 4 10.00
  6. dtype: float64
  7. 0 -1.694452
  8. 1 -1.570006
  9. 2 1.713794
  10. 3 0.338292
  11. 4 0.803511
  12. dtype: float64

如何获取 Series 的索引和值

  1. import pandas as pd
  2. import numpy as np
  3. ser1 = pd.Series({"India": "New Delhi",
  4. "Japan": "Tokyo",
  5. "UK": "London"})
  6. print(ser1.values)
  7. print(ser1.index)
  8. print("\n")
  9. ser2 = pd.Series(np.random.normal(size=5))
  10. print(ser2.index)
  11. print(ser2.values)

OutPut

  1. ['New Delhi' 'Tokyo' 'London']
  2. Index(['India', 'Japan', 'UK'], dtype='object')
  3. RangeIndex(start=0, stop=5, step=1)
  4. [ 0.66265478 -0.72222211 0.3608642 1.40955436 1.3096732 ]

如何在创建 Series 时指定索引

  1. import pandas as pd
  2. values = ["India", "Canada", "Australia",
  3. "Japan", "Germany", "France"]
  4. code = ["IND", "CAN", "AUS", "JAP", "GER", "FRA"]
  5. ser1 = pd.Series(values, index=code)
  6. print(ser1)

OutPut

  1. IND India
  2. CAN Canada
  3. AUS Australia
  4. JAP Japan
  5. GER Germany
  6. FRA France
  7. dtype: object

如何获取 Series 的大小和形状

  1. import pandas as pd
  2. values = ["India", "Canada", "Australia",
  3. "Japan", "Germany", "France"]
  4. code = ["IND", "CAN", "AUS", "JAP", "GER", "FRA"]
  5. ser1 = pd.Series(values, index=code)
  6. print(len(ser1))
  7. print(ser1.shape)
  8. print(ser1.size)

OutPut

  1. 6
  2. (6,)
  3. 6

如何获取 Series 开始或末尾几行数据

Head()

  1. import pandas as pd
  2. values = ["India", "Canada", "Australia",
  3. "Japan", "Germany", "France"]
  4. code = ["IND", "CAN", "AUS", "JAP", "GER", "FRA"]
  5. ser1 = pd.Series(values, index=code)
  6. print("-----Head()-----")
  7. print(ser1.head())
  8. print("\n\n-----Head(2)-----")
  9. print(ser1.head(2))

OutPut

  1. -----Head()-----
  2. IND India
  3. CAN Canada
  4. AUS Australia
  5. JAP Japan
  6. GER Germany
  7. dtype: object
  8. -----Head(2)-----
  9. IND India
  10. CAN Canada
  11. dtype: object

Tail()

  1. import pandas as pd
  2. values = ["India", "Canada", "Australia",
  3. "Japan", "Germany", "France"]
  4. code = ["IND", "CAN", "AUS", "JAP", "GER", "FRA"]
  5. ser1 = pd.Series(values, index=code)
  6. print("-----Tail()-----")
  7. print(ser1.tail())
  8. print("\n\n-----Tail(2)-----")
  9. print(ser1.tail(2))

OutPut

  1. -----Tail()-----
  2. CAN Canada
  3. AUS Australia
  4. JAP Japan
  5. GER Germany
  6. FRA France
  7. dtype: object
  8. -----Tail(2)-----
  9. GER Germany
  10. FRA France
  11. dtype: object

Take()

  1. import pandas as pd
  2. values = ["India", "Canada", "Australia",
  3. "Japan", "Germany", "France"]
  4. code = ["IND", "CAN", "AUS", "JAP", "GER", "FRA"]
  5. ser1 = pd.Series(values, index=code)
  6. print("-----Take()-----")
  7. print(ser1.take([2, 4, 5]))

OutPut

  1. -----Take()-----
  2. AUS Australia
  3. GER Germany
  4. FRA France
  5. dtype: object

使用切片获取 Series 子集

  1. import pandas as pd
  2. num = [000, 100, 200, 300, 400, 500, 600, 700, 800, 900]
  3. idx = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']
  4. series = pd.Series(num, index=idx)
  5. print("\n [2:2] \n")
  6. print(series[2:4])
  7. print("\n [1:6:2] \n")
  8. print(series[1:6:2])
  9. print("\n [:6] \n")
  10. print(series[:6])
  11. print("\n [4:] \n")
  12. print(series[4:])
  13. print("\n [:4:2] \n")
  14. print(series[:4:2])
  15. print("\n [4::2] \n")
  16. print(series[4::2])
  17. print("\n [::-1] \n")
  18. print(series[::-1])

OutPut

  1. [2:2]
  2. C 200
  3. D 300
  4. dtype: int64
  5. [1:6:2]
  6. B 100
  7. D 300
  8. F 500
  9. dtype: int64
  10. [:6]
  11. A 0
  12. B 100
  13. C 200
  14. D 300
  15. E 400
  16. F 500
  17. dtype: int64
  18. [4:]
  19. E 400
  20. F 500
  21. G 600
  22. H 700
  23. I 800
  24. J 900
  25. dtype: int64
  26. [:4:2]
  27. A 0
  28. C 200
  29. dtype: int64
  30. [4::2]
  31. E 400
  32. G 600
  33. I 800
  34. dtype: int64
  35. [::-1]
  36. J 900
  37. I 800
  38. H 700
  39. G 600
  40. F 500
  41. E 400
  42. D 300
  43. C 200
  44. B 100
  45. A 0
  46. dtype: int64

如何创建 DataFrame

  1. import pandas as pd
  2. employees = pd.DataFrame({
  3. 'EmpCode': ['Emp001', 'Emp00'],
  4. 'Name': ['John Doe', 'William Spark'],
  5. 'Occupation': ['Chemist', 'Statistician'],
  6. 'Date Of Join': ['2018-01-25', '2018-01-26'],
  7. 'Age': [23, 24]})
  8. print(employees)

OutPut

  1. Age Date Of Join EmpCode Name Occupation
  2. 0 23 2018-01-25 Emp001 John Doe Chemist
  3. 1 24 2018-01-26 Emp00 William Spark Statistician

如何设置 DataFrame 的索引和列信息

  1. import pandas as pd
  2. employees = pd.DataFrame(
  3. data={'Name': ['John Doe', 'William Spark'],
  4. 'Occupation': ['Chemist', 'Statistician'],
  5. 'Date Of Join': ['2018-01-25', '2018-01-26'],
  6. 'Age': [23, 24]},
  7. index=['Emp001', 'Emp002'],
  8. columns=['Name', 'Occupation', 'Date Of Join', 'Age'])
  9. print(employees)

OutPut

  1. Name Occupation Date Of Join Age
  2. Emp001 John Doe Chemist 2018-01-25 23
  3. Emp002 William Spark Statistician 2018-01-26 24

如何重命名 DataFrame 的列名称

  1. import pandas as pd
  2. employees = pd.DataFrame({
  3. 'EmpCode': ['Emp001', 'Emp00'],
  4. 'Name': ['John Doe', 'William Spark'],
  5. 'Occupation': ['Chemist', 'Statistician'],
  6. 'Date Of Join': ['2018-01-25', '2018-01-26'],
  7. 'Age': [23, 24]})
  8. employees.columns = ['EmpCode', 'EmpName', 'EmpOccupation', 'EmpDOJ', 'EmpAge']
  9. print(employees)

OutPut

  1. EmpCode EmpName EmpOccupation EmpDOJ EmpAge
  2. 0 23 2018-01-25 Emp001 John Doe Chemist
  3. 1 24 2018-01-26 Emp00 William Spark Statistician

如何根据 Pandas 列中的值从 DataFrame 中选择或过滤行

  1. import pandas as pd
  2. employees = pd.DataFrame({
  3. 'EmpCode': ['Emp001', 'Emp002', 'Emp003', 'Emp004', 'Emp005'],
  4. 'Name': ['John', 'Doe', 'William', 'Spark', 'Mark'],
  5. 'Occupation': ['Chemist', 'Statistician', 'Statistician',
  6. 'Statistician', 'Programmer'],
  7. 'Date Of Join': ['2018-01-25', '2018-01-26', '2018-01-26', '2018-02-26',
  8. '2018-03-16'],
  9. 'Age': [23, 24, 34, 29, 40]})
  10. print("\nUse == operator\n")
  11. print(employees.loc[employees['Age'] == 23])
  12. print("\nUse < operator\n")
  13. print(employees.loc[employees['Age'] < 30])
  14. print("\nUse != operator\n")
  15. print(employees.loc[employees['Occupation'] != 'Statistician'])
  16. print("\nMultiple Conditions\n")
  17. print(employees.loc[(employees['Occupation'] != 'Statistician') &
  18. (employees['Name'] == 'John')])

OutPut

  1. Use == operator
  2. Age Date Of Join EmpCode Name Occupation
  3. 0 23 2018-01-25 Emp001 John Chemist
  4. Use < operator
  5. Age Date Of Join EmpCode Name Occupation
  6. 0 23 2018-01-25 Emp001 John Chemist
  7. 1 24 2018-01-26 Emp002 Doe Statistician
  8. 3 29 2018-02-26 Emp004 Spark Statistician
  9. Use != operator
  10. Age Date Of Join EmpCode Name Occupation
  11. 0 23 2018-01-25 Emp001 John Chemist
  12. 4 40 2018-03-16 Emp005 Mark Programmer
  13. Multiple Conditions
  14. Age Date Of Join EmpCode Name Occupation
  15. 0 23 2018-01-25 Emp001 John Chemist

在 DataFrame 中使用“isin”过滤多行

  1. import pandas as pd
  2. employees = pd.DataFrame({
  3. 'EmpCode': ['Emp001', 'Emp002', 'Emp003', 'Emp004', 'Emp005'],
  4. 'Name': ['John', 'Doe', 'William', 'Spark', 'Mark'],
  5. 'Occupation': ['Chemist', 'Statistician', 'Statistician',
  6. 'Statistician', 'Programmer'],
  7. 'Date Of Join': ['2018-01-25', '2018-01-26', '2018-01-26', '2018-02-26',
  8. '2018-03-16'],
  9. 'Age': [23, 24, 34, 29, 40]})
  10. print("\nUse isin operator\n")
  11. print(employees.loc[employees['Occupation'].isin(['Chemist','Programmer'])])
  12. print("\nMultiple Conditions\n")
  13. print(employees.loc[(employees['Occupation'] == 'Chemist') |
  14. (employees['Name'] == 'John') &
  15. (employees['Age'] < 30)])

OutPut

  1. Use isin operator
  2. Age Date Of Join EmpCode Name Occupation
  3. 0 23 2018-01-25 Emp001 John Chemist
  4. 4 40 2018-03-16 Emp005 Mark Programmer
  5. Multiple Conditions
  6. Age Date Of Join EmpCode Name Occupation
  7. 0 23 2018-01-25 Emp001 John Chemist

迭代 DataFrame 的行和列

  1. import pandas as pd
  2. employees = pd.DataFrame({
  3. 'EmpCode': ['Emp001', 'Emp002', 'Emp003', 'Emp004', 'Emp005'],
  4. 'Name': ['John', 'Doe', 'William', 'Spark', 'Mark'],
  5. 'Occupation': ['Chemist', 'Statistician', 'Statistician',
  6. 'Statistician', 'Programmer'],
  7. 'Date Of Join': ['2018-01-25', '2018-01-26', '2018-01-26', '2018-02-26',
  8. '2018-03-16'],
  9. 'Age': [23, 24, 34, 29, 40]})
  10. print("\n Example iterrows \n")
  11. for index, col in employees.iterrows():
  12. print(col['Name'], "--", col['Age'])
  13. print("\n Example itertuples \n")
  14. for row in employees.itertuples(index=True, name='Pandas'):
  15. print(getattr(row, "Name"), "--", getattr(row, "Age"))

OutPut

  1. Example iterrows
  2. John -- 23
  3. Doe -- 24
  4. William -- 34
  5. Spark -- 29
  6. Mark -- 40
  7. Example itertuples
  8. John -- 23
  9. Doe -- 24
  10. William -- 34
  11. Spark -- 29
  12. Mark -- 40

如何通过名称或索引删除 DataFrame 的列

  1. import pandas as pd
  2. employees = pd.DataFrame({
  3. 'EmpCode': ['Emp001', 'Emp002', 'Emp003', 'Emp004', 'Emp005'],
  4. 'Name': ['John', 'Doe', 'William', 'Spark', 'Mark'],
  5. 'Occupation': ['Chemist', 'Statistician', 'Statistician',
  6. 'Statistician', 'Programmer'],
  7. 'Date Of Join': ['2018-01-25', '2018-01-26', '2018-01-26', '2018-02-26',
  8. '2018-03-16'],
  9. 'Age': [23, 24, 34, 29, 40]})
  10. print(employees)
  11. print("\n Drop Column by Name \n")
  12. employees.drop('Age', axis=1, inplace=True)
  13. print(employees)
  14. print("\n Drop Column by Index \n")
  15. employees.drop(employees.columns[[0,1]], axis=1, inplace=True)
  16. print(employees)

OutPut

  1. Age Date Of Join EmpCode Name Occupation
  2. 0 23 2018-01-25 Emp001 John Chemist
  3. 1 24 2018-01-26 Emp002 Doe Statistician
  4. 2 34 2018-01-26 Emp003 William Statistician
  5. 3 29 2018-02-26 Emp004 Spark Statistician
  6. 4 40 2018-03-16 Emp005 Mark Programmer
  7. Drop Column by Name
  8. Date Of Join EmpCode Name Occupation
  9. 0 2018-01-25 Emp001 John Chemist
  10. 1 2018-01-26 Emp002 Doe Statistician
  11. 2 2018-01-26 Emp003 William Statistician
  12. 3 2018-02-26 Emp004 Spark Statistician
  13. 4 2018-03-16 Emp005 Mark Programmer
  14. Drop Column by Index
  15. Name Occupation
  16. 0 John Chemist
  17. 1 Doe Statistician
  18. 2 William Statistician
  19. 3 Spark Statistician
  20. 4 Mark Programmer

向 DataFrame 中新增列

  1. import pandas as pd
  2. employees = pd.DataFrame({
  3. 'EmpCode': ['Emp001', 'Emp002', 'Emp003', 'Emp004', 'Emp005'],
  4. 'Name': ['John', 'Doe', 'William', 'Spark', 'Mark'],
  5. 'Occupation': ['Chemist', 'Statistician', 'Statistician',
  6. 'Statistician', 'Programmer'],
  7. 'Date Of Join': ['2018-01-25', '2018-01-26', '2018-01-26', '2018-02-26',
  8. '2018-03-16'],
  9. 'Age': [23, 24, 34, 29, 40]})
  10. employees['City'] = ['London', 'Tokyo', 'Sydney', 'London', 'Toronto']
  11. print(employees)

OutPut

  1. Age Date Of Join EmpCode Name Occupation City
  2. 0 23 2018-01-25 Emp001 John Chemist London
  3. 1 24 2018-01-26 Emp002 Doe Statistician Tokyo
  4. 2 34 2018-01-26 Emp003 William Statistician Sydney
  5. 3 29 2018-02-26 Emp004 Spark Statistician London
  6. 4 40 2018-03-16 Emp005 Mark Programmer Toronto

如何从 DataFrame 中获取列标题列表

  1. import pandas as pd
  2. employees = pd.DataFrame({
  3. 'EmpCode': ['Emp001', 'Emp002', 'Emp003', 'Emp004', 'Emp005'],
  4. 'Name': ['John', 'Doe', 'William', 'Spark', 'Mark'],
  5. 'Occupation': ['Chemist', 'Statistician', 'Statistician',
  6. 'Statistician', 'Programmer'],
  7. 'Date Of Join': ['2018-01-25', '2018-01-26', '2018-01-26', '2018-02-26',
  8. '2018-03-16'],
  9. 'Age': [23, 24, 34, 29, 40]})
  10. print(list(employees))
  11. print(list(employees.columns.values))
  12. print(employees.columns.tolist())

OutPut

  1. ['Age', 'Date Of Join', 'EmpCode', 'Name', 'Occupation']
  2. ['Age', 'Date Of Join', 'EmpCode', 'Name', 'Occupation']
  3. ['Age', 'Date Of Join', 'EmpCode', 'Name', 'Occupation']

如何随机生成 DataFrame

  1. import pandas as pd
  2. import numpy as np
  3. np.random.seed(5)
  4. df_random = pd.DataFrame(np.random.randint(100, size=(10, 6)),
  5. columns=list('ABCDEF'),
  6. index=['Row-{}'.format(i) for i in range(10)])
  7. print(df_random)s

OutPut

  1. A B C D E F
  2. Row-0 99 78 61 16 73 8
  3. Row-1 62 27 30 80 7 76
  4. Row-2 15 53 80 27 44 77
  5. Row-3 75 65 47 30 84 86
  6. Row-4 18 9 41 62 1 82
  7. Row-5 16 78 5 58 0 80
  8. Row-6 4 36 51 27 31 2
  9. Row-7 68 38 83 19 18 7
  10. Row-8 30 62 11 67 65 55
  11. Row-9 3 91 78 27 29 33

如何选择 DataFrame 的多个列

  1. import pandas as pd
  2. employees = pd.DataFrame({
  3. 'EmpCode': ['Emp001', 'Emp002', 'Emp003', 'Emp004', 'Emp005'],
  4. 'Name': ['John', 'Doe', 'William', 'Spark', 'Mark'],
  5. 'Occupation': ['Chemist', 'Statistician', 'Statistician',
  6. 'Statistician', 'Programmer'],
  7. 'Date Of Join': ['2018-01-25', '2018-01-26', '2018-01-26', '2018-02-26',
  8. '2018-03-16'],
  9. 'Age': [23, 24, 34, 29, 40]})
  10. df = employees[['EmpCode', 'Age', 'Name']]
  11. print(df)

OutPut

  1. EmpCode Age Name
  2. 0 Emp001 23 John
  3. 1 Emp002 24 Doe
  4. 2 Emp003 34 William
  5. 3 Emp004 29 Spark
  6. 4 Emp005 40 Mark

如何将字典转换为 DataFrame

  1. import pandas as pd
  2. data = ({'Age': [30, 20, 22, 40, 32, 28, 39],
  3. 'Color': ['Blue', 'Green', 'Red', 'White', 'Gray', 'Black',
  4. 'Red'],
  5. 'Food': ['Steak', 'Lamb', 'Mango', 'Apple', 'Cheese',
  6. 'Melon', 'Beans'],
  7. 'Height': [165, 70, 120, 80, 180, 172, 150],
  8. 'Score': [4.6, 8.3, 9.0, 3.3, 1.8, 9.5, 2.2],
  9. 'State': ['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX']
  10. })
  11. print(data)
  12. df = pd.DataFrame(data)
  13. print(df)

OutPut

  1. {'Height': [165, 70, 120, 80, 180, 172, 150], 'Food': ['Steak', 'Lamb', 'Mango',
  2. 'Apple', 'Cheese', 'Melon', 'Beans'], 'Age': [30, 20, 22, 40, 32, 28, 39], 'Sco
  3. re': [4.6, 8.3, 9.0, 3.3, 1.8, 9.5, 2.2], 'Color': ['Blue', 'Green', 'Red', 'Whi
  4. te', 'Gray', 'Black', 'Red'], 'State': ['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX'
  5. ]}
  6. Age Color Food Height Score State
  7. 0 30 Blue Steak 165 4.6 NY
  8. 1 20 Green Lamb 70 8.3 TX
  9. 2 22 Red Mango 120 9.0 FL
  10. 3 40 White Apple 80 3.3 AL
  11. 4 32 Gray Cheese 180 1.8 AK
  12. 5 28 Black Melon 172 9.5 TX
  13. 6 39 Red Beans 150 2.2 TX

使用 ioc 进行切片

  1. import pandas as pd
  2. df = pd.DataFrame({'Age': [30, 20, 22, 40, 32, 28, 39],
  3. 'Color': ['Blue', 'Green', 'Red', 'White', 'Gray', 'Black',
  4. 'Red'],
  5. 'Food': ['Steak', 'Lamb', 'Mango', 'Apple', 'Cheese',
  6. 'Melon', 'Beans'],
  7. 'Height': [165, 70, 120, 80, 180, 172, 150],
  8. 'Score': [4.6, 8.3, 9.0, 3.3, 1.8, 9.5, 2.2],
  9. 'State': ['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX']
  10. },
  11. index=['Jane', 'Nick', 'Aaron', 'Penelope', 'Dean',
  12. 'Christina', 'Cornelia'])
  13. print("\n -- Selecting a single row with .loc with a string -- \n")
  14. print(df.loc['Penelope'])
  15. print("\n -- Selecting multiple rows with .loc with a list of strings -- \n")
  16. print(df.loc[['Cornelia', 'Jane', 'Dean']])
  17. print("\n -- Selecting multiple rows with .loc with slice notation -- \n")
  18. print(df.loc['Aaron':'Dean'])

OutPut

  1. -- Selecting a single row with .loc with a string --
  2. Age 40
  3. Color White
  4. Food Apple
  5. Height 80
  6. Score 3.3
  7. State AL
  8. Name: Penelope, dtype: object
  9. -- Selecting multiple rows with .loc with a list of strings --
  10. Age Color Food Height Score State
  11. Cornelia 39 Red Beans 150 2.2 TX
  12. Jane 30 Blue Steak 165 4.6 NY
  13. Dean 32 Gray Cheese 180 1.8 AK
  14. -- Selecting multiple rows with .loc with slice notation --
  15. Age Color Food Height Score State
  16. Aaron 22 Red Mango 120 9.0 FL
  17. Penelope 40 White Apple 80 3.3 AL
  18. Dean 32 Gray Cheese 180 1.8 AK

检查 DataFrame 中是否是空的

  1. import pandas as pd
  2. df = pd.DataFrame()
  3. if df.empty:
  4. print('DataFrame is empty!')

OutPut

  1. DataFrame is empty!

在创建 DataFrame 时指定索引和列名称

  1. import pandas as pd
  2. values = ["India", "Canada", "Australia",
  3. "Japan", "Germany", "France"]
  4. code = ["IND", "CAN", "AUS", "JAP", "GER", "FRA"]
  5. df = pd.DataFrame(values, index=code, columns=['Country'])
  6. print(df)

OutPut

  1. Country
  2. IND India
  3. CAN Canada
  4. AUS Australia
  5. JAP Japan
  6. GER Germany
  7. FRA France

使用 iloc 进行切片

  1. import pandas as pd
  2. df = pd.DataFrame({'Age': [30, 20, 22, 40, 32, 28, 39],
  3. 'Color': ['Blue', 'Green', 'Red', 'White', 'Gray', 'Black',
  4. 'Red'],
  5. 'Food': ['Steak', 'Lamb', 'Mango', 'Apple', 'Cheese',
  6. 'Melon', 'Beans'],
  7. 'Height': [165, 70, 120, 80, 180, 172, 150],
  8. 'Score': [4.6, 8.3, 9.0, 3.3, 1.8, 9.5, 2.2],
  9. 'State': ['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX']
  10. },
  11. index=['Jane', 'Nick', 'Aaron', 'Penelope', 'Dean',
  12. 'Christina', 'Cornelia'])
  13. print("\n -- Selecting a single row with .iloc with an integer -- \n")
  14. print(df.iloc[4])
  15. print("\n -- Selecting multiple rows with .iloc with a list of integers -- \n")
  16. print(df.iloc[[2, -2]])
  17. print("\n -- Selecting multiple rows with .iloc with slice notation -- \n")
  18. print(df.iloc[:5:3])

OutPut

  1. -- Selecting a single row with .iloc with an integer --
  2. Age 32
  3. Color Gray
  4. Food Cheese
  5. Height 180
  6. Score 1.8
  7. State AK
  8. Name: Dean, dtype: object
  9. -- Selecting multiple rows with .iloc with a list of integers --
  10. Age Color Food Height Score State
  11. Aaron 22 Red Mango 120 9.0 FL
  12. Christina 28 Black Melon 172 9.5 TX
  13. -- Selecting multiple rows with .iloc with slice notation --
  14. Age Color Food Height Score State
  15. Jane 30 Blue Steak 165 4.6 NY
  16. Penelope 40 White Apple 80 3.3 AL

iloc 和 loc 的区别

  • loc 索引器还可以进行布尔选择,例如,如果我们想查找 Age 小于 30 的所有行并仅返回 Color 和 Height 列,我们可以执行以下操作。我们可以用 iloc 复制它,但我们不能将它传递给一个布尔系列,必须将布尔系列转换为 numpy 数组
  • loc 从索引中获取具有特定标签的行(或列)
  • iloc 在索引中的特定位置获取行(或列)(因此它只需要整数) ```python import pandas as pd

df = pd.DataFrame({‘Age’: [30, 20, 22, 40, 32, 28, 39], ‘Color’: [‘Blue’, ‘Green’, ‘Red’, ‘White’, ‘Gray’, ‘Black’, ‘Red’], ‘Food’: [‘Steak’, ‘Lamb’, ‘Mango’, ‘Apple’, ‘Cheese’, ‘Melon’, ‘Beans’], ‘Height’: [165, 70, 120, 80, 180, 172, 150], ‘Score’: [4.6, 8.3, 9.0, 3.3, 1.8, 9.5, 2.2], ‘State’: [‘NY’, ‘TX’, ‘FL’, ‘AL’, ‘AK’, ‘TX’, ‘TX’] }, index=[‘Jane’, ‘Nick’, ‘Aaron’, ‘Penelope’, ‘Dean’, ‘Christina’, ‘Cornelia’])

print(“\n — loc — \n”) print(df.loc[df[‘Age’] < 30, [‘Color’, ‘Height’]])

print(“\n — iloc — \n”) print(df.iloc[(df[‘Age’] < 30).values, [1, 3]])

  1. <a name="Awmnw"></a>
  2. ### OutPut
  3. ```python
  4. -- loc --
  5. Color Height
  6. Nick Green 70
  7. Aaron Red 120
  8. Christina Black 172
  9. -- iloc --
  10. Color Height
  11. Nick Green 70
  12. Aaron Red 120
  13. Christina Black 172

使用时间索引创建空 DataFrame

  1. import datetime
  2. import pandas as pd
  3. todays_date = datetime.datetime.now().date()
  4. index = pd.date_range(todays_date, periods=10, freq='D')
  5. columns = ['A', 'B', 'C']
  6. df = pd.DataFrame(index=index, columns=columns)
  7. df = df.fillna(0)
  8. print(df)

OutPut

  1. A B C
  2. 2018-09-30 0 0 0
  3. 2018-10-01 0 0 0
  4. 2018-10-02 0 0 0
  5. 2018-10-03 0 0 0
  6. 2018-10-04 0 0 0
  7. 2018-10-05 0 0 0
  8. 2018-10-06 0 0 0
  9. 2018-10-07 0 0 0
  10. 2018-10-08 0 0 0
  11. 2018-10-09 0 0 0

如何改变 DataFrame 列的排序

  1. import pandas as pd
  2. df = pd.DataFrame({'Age': [30, 20, 22, 40, 32, 28, 39],
  3. 'Color': ['Blue', 'Green', 'Red', 'White', 'Gray', 'Black',
  4. 'Red'],
  5. 'Food': ['Steak', 'Lamb', 'Mango', 'Apple', 'Cheese',
  6. 'Melon', 'Beans'],
  7. 'Height': [165, 70, 120, 80, 180, 172, 150],
  8. 'Score': [4.6, 8.3, 9.0, 3.3, 1.8, 9.5, 2.2],
  9. 'State': ['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX']
  10. },
  11. index=['Jane', 'Nick', 'Aaron', 'Penelope', 'Dean',
  12. 'Christina', 'Cornelia'])
  13. print("\n -- Change order using columns -- \n")
  14. new_order = [3, 2, 1, 4, 5, 0]
  15. df = df[df.columns[new_order]]
  16. print(df)
  17. print("\n -- Change order using reindex -- \n")
  18. df = df.reindex(['State', 'Color', 'Age', 'Food', 'Score', 'Height'], axis=1)
  19. print(df)

OutPut

  1. -- Change order using columns --
  2. Height Food Color Score State Age
  3. Jane 165 Steak Blue 4.6 NY 30
  4. Nick 70 Lamb Green 8.3 TX 20
  5. Aaron 120 Mango Red 9.0 FL 22
  6. Penelope 80 Apple White 3.3 AL 40
  7. Dean 180 Cheese Gray 1.8 AK 32
  8. Christina 172 Melon Black 9.5 TX 28
  9. Cornelia 150 Beans Red 2.2 TX 39
  10. -- Change order using reindex --
  11. State Color Age Food Score Height
  12. Jane NY Blue 30 Steak 4.6 165
  13. Nick TX Green 20 Lamb 8.3 70
  14. Aaron FL Red 22 Mango 9.0 120
  15. Penelope AL White 40 Apple 3.3 80
  16. Dean AK Gray 32 Cheese 1.8 180
  17. Christina TX Black 28 Melon 9.5 172
  18. Cornelia TX Red 39 Beans 2.2 150

检查 DataFrame 列的数据类型

  1. import pandas as pd
  2. df = pd.DataFrame({'Age': [30, 20, 22, 40, 32, 28, 39],
  3. 'Color': ['Blue', 'Green', 'Red', 'White', 'Gray', 'Black',
  4. 'Red'],
  5. 'Food': ['Steak', 'Lamb', 'Mango', 'Apple', 'Cheese',
  6. 'Melon', 'Beans'],
  7. 'Height': [165, 70, 120, 80, 180, 172, 150],
  8. 'Score': [4.6, 8.3, 9.0, 3.3, 1.8, 9.5, 2.2],
  9. 'State': ['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX']
  10. },
  11. index=['Jane', 'Nick', 'Aaron', 'Penelope', 'Dean',
  12. 'Christina', 'Cornelia'])
  13. print(df.dtypes)

OutPut

  1. Age int64
  2. Color object
  3. Food object
  4. Height int64
  5. Score float64
  6. State object
  7. dtype: object

更改 DataFrame 指定列的数据类型

  1. import pandas as pd
  2. df = pd.DataFrame({'Age': [30, 20, 22, 40, 32, 28, 39],
  3. 'Color': ['Blue', 'Green', 'Red', 'White', 'Gray', 'Black',
  4. 'Red'],
  5. 'Food': ['Steak', 'Lamb', 'Mango', 'Apple', 'Cheese',
  6. 'Melon', 'Beans'],
  7. 'Height': [165, 70, 120, 80, 180, 172, 150],
  8. 'Score': [4.6, 8.3, 9.0, 3.3, 1.8, 9.5, 2.2],
  9. 'State': ['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX']
  10. },
  11. index=['Jane', 'Nick', 'Aaron', 'Penelope', 'Dean',
  12. 'Christina', 'Cornelia'])
  13. print(df.dtypes)
  14. df['Age'] = df['Age'].astype(str)
  15. print(df.dtypes)

OutPut

  1. Age int64
  2. Color object
  3. Food object
  4. Height int64
  5. Score float64
  6. State object
  7. dtype: object
  8. Age object
  9. Color object
  10. Food object
  11. Height int64
  12. Score float64
  13. State object
  14. dtype: object

如何将列的数据类型转换为 DateTime 类型

  1. import pandas as pd
  2. df = pd.DataFrame({'DateOFBirth': [1349720105, 1349806505, 1349892905,
  3. 1349979305, 1350065705, 1349792905,
  4. 1349730105],
  5. 'State': ['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX']
  6. },
  7. index=['Jane', 'Nick', 'Aaron', 'Penelope', 'Dean',
  8. 'Christina', 'Cornelia'])
  9. print("\n----------------Before---------------\n")
  10. print(df.dtypes)
  11. print(df)
  12. df['DateOFBirth'] = pd.to_datetime(df['DateOFBirth'], unit='s')
  13. print("\n----------------After----------------\n")
  14. print(df.dtypes)
  15. print(df)

OutPut

  1. ----------------Before---------------
  2. DateOFBirth int64
  3. State object
  4. dtype: object
  5. DateOFBirth State
  6. Jane 1349720105 NY
  7. Nick 1349806505 TX
  8. Aaron 1349892905 FL
  9. Penelope 1349979305 AL
  10. Dean 1350065705 AK
  11. Christina 1349792905 TX
  12. Cornelia 1349730105 TX
  13. ----------------After----------------
  14. DateOFBirth datetime64[ns]
  15. State object
  16. dtype: object
  17. DateOFBirth State
  18. Jane 2012-10-08 18:15:05 NY
  19. Nick 2012-10-09 18:15:05 TX
  20. Aaron 2012-10-10 18:15:05 FL
  21. Penelope 2012-10-11 18:15:05 AL
  22. Dean 2012-10-12 18:15:05 AK
  23. Christina 2012-10-09 14:28:25 TX
  24. Cornelia 2012-10-08 21:01:45 TX

将 DataFrame 列从 floats 转为 ints

  1. import pandas as pd
  2. df = pd.DataFrame({'DailyExp': [75.7, 56.69, 55.69, 96.5, 84.9, 110.5,
  3. 58.9],
  4. 'State': ['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX']
  5. },
  6. index=['Jane', 'Nick', 'Aaron', 'Penelope', 'Dean',
  7. 'Christina', 'Cornelia'])
  8. print("\n----------------Before---------------\n")
  9. print(df.dtypes)
  10. print(df)
  11. df['DailyExp'] = df['DailyExp'].astype(int)
  12. print("\n----------------After----------------\n")
  13. print(df.dtypes)
  14. print(df)

OutPut

  1. ----------------Before---------------
  2. DailyExp float64
  3. State object
  4. dtype: object
  5. DailyExp State
  6. Jane 75.70 NY
  7. Nick 56.69 TX
  8. Aaron 55.69 FL
  9. Penelope 96.50 AL
  10. Dean 84.90 AK
  11. Christina 110.50 TX
  12. Cornelia 58.90 TX
  13. ----------------After----------------
  14. DailyExp int32
  15. State object
  16. dtype: object
  17. DailyExp State
  18. Jane 75 NY
  19. Nick 56 TX
  20. Aaron 55 FL
  21. Penelope 96 AL
  22. Dean 84 AK
  23. Christina 110 TX
  24. Cornelia 58 TX

如何把 dates 列转换为 DateTime 类型

  1. import pandas as pd
  2. df = pd.DataFrame({'DateOfBirth': ['1986-11-11', '1999-05-12', '1976-01-01',
  3. '1986-06-01', '1983-06-04', '1990-03-07',
  4. '1999-07-09'],
  5. 'State': ['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX']
  6. },
  7. index=['Jane', 'Nick', 'Aaron', 'Penelope', 'Dean',
  8. 'Christina', 'Cornelia'])
  9. print("\n----------------Before---------------\n")
  10. print(df.dtypes)
  11. df['DateOfBirth'] = df['DateOfBirth'].astype('datetime64')
  12. print("\n----------------After----------------\n")
  13. print(df.dtypes)

OutPut

  1. ----------------Before---------------
  2. DateOfBirth object
  3. State object
  4. dtype: object
  5. ----------------After----------------
  6. DateOfBirth datetime64[ns]
  7. State object
  8. dtype: object

两个 DataFrame 相加

  1. import pandas as pd
  2. df1 = pd.DataFrame({'Age': [30, 20, 22, 40], 'Height': [165, 70, 120, 80],
  3. 'Score': [4.6, 8.3, 9.0, 3.3], 'State': ['NY', 'TX',
  4. 'FL', 'AL']},
  5. index=['Jane', 'Nick', 'Aaron', 'Penelope'])
  6. df2 = pd.DataFrame({'Age': [32, 28, 39], 'Color': ['Gray', 'Black', 'Red'],
  7. 'Food': ['Cheese', 'Melon', 'Beans'],
  8. 'Score': [1.8, 9.5, 2.2], 'State': ['AK', 'TX', 'TX']},
  9. index=['Dean', 'Christina', 'Cornelia'])
  10. df3 = df1.append(df2, sort=True)
  11. print(df3)

OutPut

  1. Age Color Food Height Score State
  2. Jane 30 NaN NaN 165.0 4.6 NY
  3. Nick 20 NaN NaN 70.0 8.3 TX
  4. Aaron 22 NaN NaN 120.0 9.0 FL
  5. Penelope 40 NaN NaN 80.0 3.3 AL
  6. Dean 32 Gray Cheese NaN 1.8 AK
  7. Christina 28 Black Melon NaN 9.5 TX
  8. Cornelia 39 Red Beans NaN 2.2 TX

在 DataFrame 末尾添加额外的行

  1. import pandas as pd
  2. employees = pd.DataFrame({
  3. 'EmpCode': ['Emp001', 'Emp002', 'Emp003', 'Emp004', 'Emp005'],
  4. 'Name': ['John', 'Doe', 'William', 'Spark', 'Mark'],
  5. 'Occupation': ['Chemist', 'Statistician', 'Statistician',
  6. 'Statistician', 'Programmer'],
  7. 'Date Of Join': ['2018-01-25', '2018-01-26', '2018-01-26', '2018-02-26',
  8. '2018-03-16'],
  9. 'Age': [23, 24, 34, 29, 40]})
  10. print("\n------------ BEFORE ----------------\n")
  11. print(employees)
  12. employees.loc[len(employees)] = [45, '2018-01-25', 'Emp006', 'Sunny',
  13. 'Programmer']
  14. print("\n------------ AFTER ----------------\n")
  15. print(employees)

OutPut

  1. ------------ BEFORE ----------------
  2. Age Date Of Join EmpCode Name Occupation
  3. 0 23 2018-01-25 Emp001 John Chemist
  4. 1 24 2018-01-26 Emp002 Doe Statistician
  5. 2 34 2018-01-26 Emp003 William Statistician
  6. 3 29 2018-02-26 Emp004 Spark Statistician
  7. 4 40 2018-03-16 Emp005 Mark Programmer
  8. ------------ AFTER ----------------
  9. Age Date Of Join EmpCode Name Occupation
  10. 0 23 2018-01-25 Emp001 John Chemist
  11. 1 24 2018-01-26 Emp002 Doe Statistician
  12. 2 34 2018-01-26 Emp003 William Statistician
  13. 3 29 2018-02-26 Emp004 Spark Statistician
  14. 4 40 2018-03-16 Emp005 Mark Programmer
  15. 5 45 2018-01-25 Emp006 Sunny Programmer

为指定索引添加新行

  1. import pandas as pd
  2. employees = pd.DataFrame(
  3. data={'Name': ['John Doe', 'William Spark'],
  4. 'Occupation': ['Chemist', 'Statistician'],
  5. 'Date Of Join': ['2018-01-25', '2018-01-26'],
  6. 'Age': [23, 24]},
  7. index=['Emp001', 'Emp002'],
  8. columns=['Name', 'Occupation', 'Date Of Join', 'Age'])
  9. print("\n------------ BEFORE ----------------\n")
  10. print(employees)
  11. employees.loc['Emp003'] = ['Sunny', 'Programmer', '2018-01-25', 45]
  12. print("\n------------ AFTER ----------------\n")
  13. print(employees)

OutPut

  1. ------------ BEFORE ----------------
  2. Name Occupation Date Of Join Age
  3. Emp001 John Doe Chemist 2018-01-25 23
  4. Emp002 William Spark Statistician 2018-01-26 24
  5. ------------ AFTER ----------------
  6. Name Occupation Date Of Join Age
  7. Emp001 John Doe Chemist 2018-01-25 23
  8. Emp002 William Spark Statistician 2018-01-26 24
  9. Emp003 Sunny Programmer 2018-01-25 45

如何使用 for 循环添加行

  1. import pandas as pd
  2. cols = ['Zip']
  3. lst = []
  4. zip = 32100
  5. for a in range(10):
  6. lst.append([zip])
  7. zip = zip + 1
  8. df = pd.DataFrame(lst, columns=cols)
  9. print(df)

OutPut

  1. Zip
  2. 0 32100
  3. 1 32101
  4. 2 32102
  5. 3 32103
  6. 4 32104
  7. 5 32105
  8. 6 32106
  9. 7 32107
  10. 8 32108
  11. 9 32109

在 DataFrame 顶部添加一行

  1. import pandas as pd
  2. employees = pd.DataFrame({
  3. 'EmpCode': ['Emp002', 'Emp003', 'Emp004'],
  4. 'Name': ['John', 'Doe', 'William'],
  5. 'Occupation': ['Chemist', 'Statistician', 'Statistician'],
  6. 'Date Of Join': ['2018-01-25', '2018-01-26', '2018-01-26'],
  7. 'Age': [23, 24, 34]})
  8. print("\n------------ BEFORE ----------------\n")
  9. print(employees)
  10. # New line
  11. line = pd.DataFrame({'Name': 'Dean', 'Age': 45, 'EmpCode': 'Emp001',
  12. 'Date Of Join': '2018-02-26', 'Occupation': 'Chemist'
  13. }, index=[0])
  14. # Concatenate two dataframe
  15. employees = pd.concat([line,employees.ix[:]]).reset_index(drop=True)
  16. print("\n------------ AFTER ----------------\n")
  17. print(employees)

OutPut

  1. ------------ BEFORE ----------------
  2. Age Date Of Join EmpCode Name Occupation
  3. 0 23 2018-01-25 Emp002 John Chemist
  4. 1 24 2018-01-26 Emp003 Doe Statistician
  5. 2 34 2018-01-26 Emp004 William Statistician
  6. ------------ AFTER ----------------
  7. Age Date Of Join EmpCode Name Occupation
  8. 0 45 2018-02-26 Emp001 Dean Chemist
  9. 1 23 2018-01-25 Emp002 John Chemist
  10. 2 24 2018-01-26 Emp003 Doe Statistician
  11. 3 34 2018-01-26 Emp004 William Statistician

如何向 DataFrame 中动态添加行

  1. import pandas as pd
  2. df = pd.DataFrame(columns=['Name', 'Age'])
  3. df.loc[1, 'Name'] = 'Rocky'
  4. df.loc[1, 'Age'] = 23
  5. df.loc[2, 'Name'] = 'Sunny'
  6. print(df)

OutPut

  1. Name Age
  2. 1 Rocky 23
  3. 2 Sunny NaN

在任意位置插入行

  1. import pandas as pd
  2. df = pd.DataFrame(columns=['Name', 'Age'])
  3. df.loc[1, 'Name'] = 'Rocky'
  4. df.loc[1, 'Age'] = 21
  5. df.loc[2, 'Name'] = 'Sunny'
  6. df.loc[2, 'Age'] = 22
  7. df.loc[3, 'Name'] = 'Mark'
  8. df.loc[3, 'Age'] = 25
  9. df.loc[4, 'Name'] = 'Taylor'
  10. df.loc[4, 'Age'] = 28
  11. print("\n------------ BEFORE ----------------\n")
  12. print(df)
  13. line = pd.DataFrame({"Name": "Jack", "Age": 24}, index=[2.5])
  14. df = df.append(line, ignore_index=False)
  15. df = df.sort_index().reset_index(drop=True)
  16. df = df.reindex(['Name', 'Age'], axis=1)
  17. print("\n------------ AFTER ----------------\n")
  18. print(df)

OutPut

  1. ------------ BEFORE ----------------
  2. Name Age
  3. 1 Rocky 21
  4. 2 Sunny 22
  5. 3 Mark 25
  6. 4 Taylor 28
  7. ------------ AFTER ----------------
  8. Name Age
  9. 0 Rocky 21
  10. 1 Sunny 22
  11. 2 Jack 24
  12. 3 Mark 25
  13. 4 Taylor 28

使用时间戳索引向 DataFrame 中添加行

  1. import pandas as pd
  2. df = pd.DataFrame(columns=['Name', 'Age'])
  3. df.loc['2014-05-01 18:47:05', 'Name'] = 'Rocky'
  4. df.loc['2014-05-01 18:47:05', 'Age'] = 21
  5. df.loc['2014-05-02 18:47:05', 'Name'] = 'Sunny'
  6. df.loc['2014-05-02 18:47:05', 'Age'] = 22
  7. df.loc['2014-05-03 18:47:05', 'Name'] = 'Mark'
  8. df.loc['2014-05-03 18:47:05', 'Age'] = 25
  9. print("\n------------ BEFORE ----------------\n")
  10. print(df)
  11. line = pd.to_datetime("2014-05-01 18:50:05", format="%Y-%m-%d %H:%M:%S")
  12. new_row = pd.DataFrame([['Bunny', 26]], columns=['Name', 'Age'], index=[line])
  13. df = pd.concat([df, pd.DataFrame(new_row)], ignore_index=False)
  14. print("\n------------ AFTER ----------------\n")
  15. print(df)

OutPut

  1. ------------ BEFORE ----------------
  2. Name Age
  3. 2014-05-01 18:47:05 Rocky 21
  4. 2014-05-02 18:47:05 Sunny 22
  5. 2014-05-03 18:47:05 Mark 25
  6. ------------ AFTER ----------------
  7. Name Age
  8. 2014-05-01 18:47:05 Rocky 21
  9. 2014-05-02 18:47:05 Sunny 22
  10. 2014-05-03 18:47:05 Mark 25
  11. 2014-05-01 18:50:05 Bunny 26

为不同的行填充缺失值

  1. import pandas as pd
  2. a = {'A': 10, 'B': 20}
  3. b = {'B': 30, 'C': 40, 'D': 50}
  4. df1 = pd.DataFrame(a, index=[0])
  5. df2 = pd.DataFrame(b, index=[1])
  6. df = pd.DataFrame()
  7. df = df.append(df1)
  8. df = df.append(df2).fillna(0)
  9. print(df)

OutPut

  1. A B C D
  2. 0 10.0 20 0.0 0.0
  3. 1 0.0 30 40.0 50.0

append, concat 和 combine_first 示例

  1. import pandas as pd
  2. a = {'A': 10, 'B': 20}
  3. b = {'B': 30, 'C': 40, 'D': 50}
  4. df1 = pd.DataFrame(a, index=[0])
  5. df2 = pd.DataFrame(b, index=[1])
  6. d1 = pd.DataFrame()
  7. d1 = d1.append(df1)
  8. d1 = d1.append(df2).fillna(0)
  9. print("\n------------ append ----------------\n")
  10. print(d1)
  11. d2 = pd.concat([df1, df2]).fillna(0)
  12. print("\n------------ concat ----------------\n")
  13. print(d2)
  14. d3 = pd.DataFrame()
  15. d3 = d3.combine_first(df1).combine_first(df2).fillna(0)
  16. print("\n------------ combine_first ----------------\n")
  17. print(d3)

OutPut

  1. ------------ append ----------------
  2. A B C D
  3. 0 10.0 20 0.0 0.0
  4. 1 0.0 30 40.0 50.0
  5. ------------ concat ----------------
  6. A B C D
  7. 0 10.0 20 0.0 0.0
  8. 1 0.0 30 40.0 50.0
  9. ------------ combine_first ----------------
  10. A B C D
  11. 0 10.0 20.0 0.0 0.0
  12. 1 0.0 30.0 40.0 50.0

获取行和列的平均值

  1. import pandas as pd
  2. df = pd.DataFrame([[10, 20, 30, 40], [7, 14, 21, 28], [5, 5, 0, 0]],
  3. columns=['Apple', 'Orange', 'Banana', 'Pear'],
  4. index=['Basket1', 'Basket2', 'Basket3'])
  5. df['Mean Basket'] = df.mean(axis=1)
  6. df.loc['Mean Fruit'] = df.mean()
  7. print(df)

OutPut

  1. Apple Orange Banana Pear Mean Basket
  2. Basket1 10.000000 20.0 30.0 40.000000 25.0
  3. Basket2 7.000000 14.0 21.0 28.000000 17.5
  4. Basket3 5.000000 5.0 0.0 0.000000 2.5
  5. Mean Fruit 7.333333 13.0 17.0 22.666667 15.0

计算行和列的总和

  1. import pandas as pd
  2. df = pd.DataFrame([[10, 20, 30, 40], [7, 14, 21, 28], [5, 5, 0, 0]],
  3. columns=['Apple', 'Orange', 'Banana', 'Pear'],
  4. index=['Basket1', 'Basket2', 'Basket3'])
  5. df['Sum Basket'] = df.sum(axis=1)
  6. df.loc['Sum Fruit'] = df.sum()
  7. print(df)

OutPut

  1. Apple Orange Banana Pear Sum Basket
  2. Basket1 10 20 30 40 100
  3. Basket2 7 14 21 28 70
  4. Basket3 5 5 0 0 10
  5. Sum Fruit 22 39 51 68 180

连接两列

  1. import pandas as pd
  2. df = pd.DataFrame(columns=['Name', 'Age'])
  3. df.loc[1, 'Name'] = 'Rocky'
  4. df.loc[1, 'Age'] = 21
  5. df.loc[2, 'Name'] = 'Sunny'
  6. df.loc[2, 'Age'] = 22
  7. df.loc[3, 'Name'] = 'Mark'
  8. df.loc[3, 'Age'] = 25
  9. df.loc[4, 'Name'] = 'Taylor'
  10. df.loc[4, 'Age'] = 28
  11. print('\n------------ BEFORE ----------------\n')
  12. print(df)
  13. df['Employee'] = df['Name'].map(str) + ' - ' + df['Age'].map(str)
  14. df = df.reindex(['Employee'], axis=1)
  15. print('\n------------ AFTER ----------------\n')
  16. print(df)

OutPut

  1. ------------ BEFORE ----------------
  2. Name Age
  3. 1 Rocky 21
  4. 2 Sunny 22
  5. 3 Mark 25
  6. 4 Taylor 28
  7. ------------ AFTER ----------------
  8. Employee
  9. 1 Rocky - 21
  10. 2 Sunny - 22
  11. 3 Mark - 25
  12. 4 Taylor - 28

过滤包含某字符串的行

  1. import pandas as pd
  2. df = pd.DataFrame({'DateOfBirth': ['1986-11-11', '1999-05-12', '1976-01-01',
  3. '1986-06-01', '1983-06-04', '1990-03-07',
  4. '1999-07-09'],
  5. 'State': ['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX']
  6. },
  7. index=['Jane', 'Nick', 'Aaron', 'Penelope', 'Dean',
  8. 'Christina', 'Cornelia'])
  9. print(df)
  10. print("\n---- Filter with State contains TX ----\n")
  11. df1 = df[df['State'].str.contains("TX")]
  12. print(df1)

OutPut

  1. DateOfBirth State
  2. Jane 1986-11-11 NY
  3. Nick 1999-05-12 TX
  4. Aaron 1976-01-01 FL
  5. Penelope 1986-06-01 AL
  6. Dean 1983-06-04 AK
  7. Christina 1990-03-07 TX
  8. Cornelia 1999-07-09 TX
  9. ---- Filter with State contains TX ----
  10. DateOfBirth State
  11. Nick 1999-05-12 TX
  12. Christina 1990-03-07 TX
  13. Cornelia 1999-07-09 TX

过滤索引中包含某字符串的行

  1. import pandas as pd
  2. df = pd.DataFrame({'DateOfBirth': ['1986-11-11', '1999-05-12', '1976-01-01',
  3. '1986-06-01', '1983-06-04', '1990-03-07',
  4. '1999-07-09'],
  5. 'State': ['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX']
  6. },
  7. index=['Jane', 'Pane', 'Aaron', 'Penelope', 'Frane',
  8. 'Christina', 'Cornelia'])
  9. print(df)
  10. print("\n---- Filter Index contains ane ----\n")
  11. df.index = df.index.astype('str')
  12. df1 = df[df.index.str.contains('ane')]
  13. print(df1)

OutPut

  1. DateOfBirth State
  2. Jane 1986-11-11 NY
  3. Pane 1999-05-12 TX
  4. Aaron 1976-01-01 FL
  5. Penelope 1986-06-01 AL
  6. Frane 1983-06-04 AK
  7. Christina 1990-03-07 TX
  8. Cornelia 1999-07-09 TX
  9. ---- Filter Index contains ane ----
  10. DateOfBirth State
  11. Jane 1986-11-11 NY
  12. Pane 1999-05-12 TX
  13. Frane 1983-06-04 AK

使用 AND 运算符过滤包含特定字符串值的行

  1. import pandas as pd
  2. df = pd.DataFrame({'DateOfBirth': ['1986-11-11', '1999-05-12', '1976-01-01',
  3. '1986-06-01', '1983-06-04', '1990-03-07',
  4. '1999-07-09'],
  5. 'State': ['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX']
  6. },
  7. index=['Jane', 'Pane', 'Aaron', 'Penelope', 'Frane',
  8. 'Christina', 'Cornelia'])
  9. print(df)
  10. print("\n---- Filter DataFrame using & ----\n")
  11. df.index = df.index.astype('str')
  12. df1 = df[df.index.str.contains('ane') & df['State'].str.contains("TX")]
  13. print(df1)

OutPut

  1. DateOfBirth State
  2. Jane 1986-11-11 NY
  3. Pane 1999-05-12 TX
  4. Aaron 1976-01-01 FL
  5. Penelope 1986-06-01 AL
  6. Frane 1983-06-04 AK
  7. Christina 1990-03-07 TX
  8. Cornelia 1999-07-09 TX
  9. ---- Filter DataFrame using & ----
  10. DateOfBirth State
  11. Pane 1999-05-12 TX

查找包含某字符串的所有行

  1. import pandas as pd
  2. df = pd.DataFrame({'DateOfBirth': ['1986-11-11', '1999-05-12', '1976-01-01',
  3. '1986-06-01', '1983-06-04', '1990-03-07',
  4. '1999-07-09'],
  5. 'State': ['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX']
  6. },
  7. index=['Jane', 'Pane', 'Aaron', 'Penelope', 'Frane',
  8. 'Christina', 'Cornelia'])
  9. print(df)
  10. print("\n---- Filter DataFrame using & ----\n")
  11. df.index = df.index.astype('str')
  12. df1 = df[df.index.str.contains('ane') | df['State'].str.contains("TX")]
  13. print(df1)

OutPut

  1. DateOfBirth State
  2. Jane 1986-11-11 NY
  3. Pane 1999-05-12 TX
  4. Aaron 1976-01-01 FL
  5. Penelope 1986-06-01 AL
  6. Frane 1983-06-04 AK
  7. Christina 1990-03-07 TX
  8. Cornelia 1999-07-09 TX
  9. ---- Filter DataFrame using & ----
  10. DateOfBirth State
  11. Jane 1986-11-11 NY
  12. Pane 1999-05-12 TX
  13. Frane 1983-06-04 AK
  14. Christina 1990-03-07 TX
  15. Cornelia 1999-07-09 TX

如果行中的值包含字符串,则创建与字符串相等的另一列

  1. import pandas as pd
  2. import numpy as np
  3. df = pd.DataFrame({
  4. 'EmpCode': ['Emp001', 'Emp002', 'Emp003', 'Emp004', 'Emp005'],
  5. 'Name': ['John', 'Doe', 'William', 'Spark', 'Mark'],
  6. 'Occupation': ['Chemist', 'Accountant', 'Statistician',
  7. 'Statistician', 'Programmer'],
  8. 'Date Of Join': ['2018-01-25', '2018-01-26', '2018-01-26', '2018-02-26',
  9. '2018-03-16'],
  10. 'Age': [23, 24, 34, 29, 40]})
  11. df['Department'] = pd.np.where(df.Occupation.str.contains("Chemist"), "Science",
  12. pd.np.where(df.Occupation.str.contains("Statistician"), "Economics",
  13. pd.np.where(df.Occupation.str.contains("Programmer"), "Computer", "General")))
  14. print(df)

OutPut

  1. Age Date Of Join EmpCode Name Occupation Department
  2. 0 23 2018-01-25 Emp001 John Chemist Science
  3. 1 24 2018-01-26 Emp002 Doe Accountant General
  4. 2 34 2018-01-26 Emp003 William Statistician Economics
  5. 3 29 2018-02-26 Emp004 Spark Statistician Economics
  6. 4 40 2018-03-16 Emp005 Mark Programmer Computer

计算 pandas group 中每组的行数

  1. import pandas as pd
  2. df = pd.DataFrame([[10, 20, 30, 40], [7, 14, 21, 28], [5, 5, 0, 0],
  3. [6, 6, 6, 6], [8, 8, 8, 8], [5, 5, 0, 0]],
  4. columns=['Apple', 'Orange', 'Rice', 'Oil'],
  5. index=['Basket1', 'Basket2', 'Basket3',
  6. 'Basket4', 'Basket5', 'Basket6'])
  7. print(df)
  8. print("\n ----------------------------- \n")
  9. print(df[['Apple', 'Orange', 'Rice', 'Oil']].
  10. groupby(['Apple']).agg(['mean', 'count']))

OutPut

  1. Apple Orange Rice Oil
  2. Basket1 10 20 30 40
  3. Basket2 7 14 21 28
  4. Basket3 5 5 0 0
  5. Basket4 6 6 6 6
  6. Basket5 8 8 8 8
  7. Basket6 5 5 0 0
  8. -----------------------------
  9. Orange Rice Oil
  10. mean count mean count mean count
  11. Apple
  12. 5 5 2 0 2 0 2
  13. 6 6 1 6 1 6 1
  14. 7 14 1 21 1 28 1
  15. 8 8 1 8 1 8 1
  16. 10 20 1 30 1 40 1

检查字符串是否在 DataFrme 中

  1. import pandas as pd
  2. df = pd.DataFrame({'DateOfBirth': ['1986-11-11', '1999-05-12', '1976-01-01',
  3. '1986-06-01', '1983-06-04', '1990-03-07',
  4. '1999-07-09'],
  5. 'State': ['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX']
  6. },
  7. index=['Jane', 'Pane', 'Aaron', 'Penelope', 'Frane',
  8. 'Christina', 'Cornelia'])
  9. if df['State'].str.contains('TX').any():
  10. print("TX is there")

OutPut

  1. TX is there

从 DataFrame 列中获取唯一行值

  1. import pandas as pd
  2. df = pd.DataFrame({'State': ['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX']
  3. },
  4. index=['Jane', 'Nick', 'Aaron', 'Penelope', 'Dean',
  5. 'Christina', 'Cornelia'])
  6. print(df)
  7. print("\n----------------\n")
  8. print(df["State"].unique())

OutPut

  1. State
  2. Jane NY
  3. Nick TX
  4. Aaron FL
  5. Penelope AL
  6. Dean AK
  7. Christina TX
  8. Cornelia TX
  9. ----------------
  10. ['NY' 'TX' 'FL' 'AL' 'AK']

计算 DataFrame 列的不同值

  1. import pandas as pd
  2. df = pd.DataFrame({'Age': [30, 20, 22, 40, 20, 30, 20, 25],
  3. 'Height': [165, 70, 120, 80, 162, 72, 124, 81],
  4. 'Score': [4.6, 8.3, 9.0, 3.3, 4, 8, 9, 3],
  5. 'State': ['NY', 'TX', 'FL', 'AL', 'NY', 'TX', 'FL', 'AL']},
  6. index=['Jane', 'Nick', 'Aaron', 'Penelope', 'Jaane', 'Nicky', 'Armour', 'Ponting'])
  7. print(df.Age.value_counts())

OutPut

  1. 20 3
  2. 30 2
  3. 25 1
  4. 22 1
  5. 40 1
  6. Name: Age, dtype: int64

删除具有重复索引的行

  1. import pandas as pd
  2. df = pd.DataFrame({'Age': [30, 30, 22, 40, 20, 30, 20, 25],
  3. 'Height': [165, 165, 120, 80, 162, 72, 124, 81],
  4. 'Score': [4.6, 4.6, 9.0, 3.3, 4, 8, 9, 3],
  5. 'State': ['NY', 'NY', 'FL', 'AL', 'NY', 'TX', 'FL', 'AL']},
  6. index=['Jane', 'Jane', 'Aaron', 'Penelope', 'Jaane', 'Nicky',
  7. 'Armour', 'Ponting'])
  8. print("\n -------- Duplicate Rows ----------- \n")
  9. print(df)
  10. df1 = df.reset_index().drop_duplicates(subset='index',
  11. keep='first').set_index('index')
  12. print("\n ------- Unique Rows ------------ \n")
  13. print(df1)

OutPut

  1. -------- Duplicate Rows -----------
  2. Age Height Score State
  3. Jane 30 165 4.6 NY
  4. Jane 30 165 4.6 NY
  5. Aaron 22 120 9.0 FL
  6. Penelope 40 80 3.3 AL
  7. Jaane 20 162 4.0 NY
  8. Nicky 30 72 8.0 TX
  9. Armour 20 124 9.0 FL
  10. Ponting 25 81 3.0 AL
  11. ------- Unique Rows ------------
  12. Age Height Score State
  13. index
  14. Jane 30 165 4.6 NY
  15. Aaron 22 120 9.0 FL
  16. Penelope 40 80 3.3 AL
  17. Jaane 20 162 4.0 NY
  18. Nicky 30 72 8.0 TX
  19. Armour 20 124 9.0 FL
  20. Ponting 25 81 3.0 AL

删除某些列具有重复值的行

  1. import pandas as pd
  2. df = pd.DataFrame({'Age': [30, 40, 30, 40, 30, 30, 20, 25],
  3. 'Height': [120, 162, 120, 120, 120, 72, 120, 81],
  4. 'Score': [4.6, 4.6, 9.0, 3.3, 4, 8, 9, 3],
  5. 'State': ['NY', 'NY', 'FL', 'AL', 'NY', 'TX', 'FL', 'AL']},
  6. index=['Jane', 'Jane', 'Aaron', 'Penelope', 'Jaane', 'Nicky',
  7. 'Armour', 'Ponting'])
  8. print("\n -------- Duplicate Rows ----------- \n")
  9. print(df)
  10. df1 = df.reset_index().drop_duplicates(subset=['Age','Height'],
  11. keep='first').set_index('index')
  12. print("\n ------- Unique Rows ------------ \n")
  13. print(df1)

OutPut

  1. -------- Duplicate Rows -----------
  2. Age Height Score State
  3. Jane 30 120 4.6 NY
  4. Jane 40 162 4.6 NY
  5. Aaron 30 120 9.0 FL
  6. Penelope 40 120 3.3 AL
  7. Jaane 30 120 4.0 NY
  8. Nicky 30 72 8.0 TX
  9. Armour 20 120 9.0 FL
  10. Ponting 25 81 3.0 AL
  11. ------- Unique Rows ------------
  12. Age Height Score State
  13. index
  14. Jane 30 120 4.6 NY
  15. Jane 40 162 4.6 NY
  16. Penelope 40 120 3.3 AL
  17. Nicky 30 72 8.0 TX
  18. Armour 20 120 9.0 FL
  19. Ponting 25 81 3.0 AL

从 DataFrame 单元格中获取值

  1. import pandas as pd
  2. df = pd.DataFrame({'Age': [30, 40, 30, 40, 30, 30, 20, 25],
  3. 'Height': [120, 162, 120, 120, 120, 72, 120, 81],
  4. 'Score': [4.6, 4.6, 9.0, 3.3, 4, 8, 9, 3],
  5. 'State': ['NY', 'NY', 'FL', 'AL', 'NY', 'TX', 'FL', 'AL']},
  6. index=['Jane', 'Jane', 'Aaron', 'Penelope', 'Jaane', 'Nicky',
  7. 'Armour', 'Ponting'])
  8. print(df.loc['Nicky', 'Age'])

OutPut

  1. 30

使用 DataFrame 中的条件索引获取单元格上的标量值

  1. import pandas as pd
  2. df = pd.DataFrame({'Age': [30, 40, 30, 40, 30, 30, 20, 25],
  3. 'Height': [120, 162, 120, 120, 120, 72, 120, 81],
  4. 'Score': [4.6, 4.6, 9.0, 3.3, 4, 8, 9, 3],
  5. 'State': ['NY', 'NY', 'FL', 'AL', 'NY', 'TX', 'FL', 'AL']},
  6. index=['Jane', 'Jane', 'Aaron', 'Penelope', 'Jaane', 'Nicky',
  7. 'Armour', 'Ponting'])
  8. print("\nGet Height where Age is 20")
  9. print(df.loc[df['Age'] == 20, 'Height'].values[0])
  10. print("\nGet State where Age is 30")
  11. print(df.loc[df['Age'] == 30, 'State'].values[0])

OutPut

  1. Get Height where Age is 20
  2. 120
  3. Get State where Age is 30
  4. NY

设置 DataFrame 的特定单元格值

  1. import pandas as pd
  2. df = pd.DataFrame({'Age': [30, 40, 30, 40, 30, 30, 20, 25],
  3. 'Height': [120, 162, 120, 120, 120, 72, 120, 81]},
  4. index=['Jane', 'Jane', 'Aaron', 'Penelope', 'Jaane', 'Nicky',
  5. 'Armour', 'Ponting'])
  6. print("\n--------------Before------------\n")
  7. print(df)
  8. df.iat[0, 0] = 90
  9. df.iat[0, 1] = 91
  10. df.iat[1, 1] = 92
  11. df.iat[2, 1] = 93
  12. df.iat[7, 1] = 99
  13. print("\n--------------After------------\n")
  14. print(df)

OutPut

  1. --------------Before------------
  2. Age Height
  3. Jane 30 120
  4. Jane 40 162
  5. Aaron 30 120
  6. Penelope 40 120
  7. Jaane 30 120
  8. Nicky 30 72
  9. Armour 20 120
  10. Ponting 25 81
  11. --------------After------------
  12. Age Height
  13. Jane 90 91
  14. Jane 40 92
  15. Aaron 30 93
  16. Penelope 40 120
  17. Jaane 30 120
  18. Nicky 30 72
  19. Armour 20 120
  20. Ponting 25 99

从 DataFrame 行获取单元格值

  1. import pandas as pd
  2. df = pd.DataFrame({'Age': [30, 40, 30, 40, 30, 30, 20, 25],
  3. 'Height': [120, 162, 120, 120, 120, 72, 120, 81]},
  4. index=['Jane', 'Jane', 'Aaron', 'Penelope', 'Jaane', 'Nicky',
  5. 'Armour', 'Ponting'])
  6. print(df.loc[df.Age == 30,'Height'].tolist())

OutPut

  1. [120, 120, 120, 72]

用字典替换 DataFrame 列中的值

  1. import pandas as pd
  2. df = pd.DataFrame({'State': ['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX']
  3. },
  4. index=['Jane', 'Nick', 'Aaron', 'Penelope', 'Dean',
  5. 'Christina', 'Cornelia'])
  6. print(df)
  7. dict = {"NY": 1, "TX": 2, "FL": 3, "AL": 4, "AK": 5}
  8. df1 = df.replace({"State": dict})
  9. print("\n\n")
  10. print(df1)

OutPut

  1. State
  2. Jane NY
  3. Nick TX
  4. Aaron FL
  5. Penelope AL
  6. Dean AK
  7. Christina TX
  8. Cornelia TX
  9. State
  10. Jane 1
  11. Nick 2
  12. Aaron 3
  13. Penelope 4
  14. Dean 5
  15. Christina 2
  16. Cornelia 2

统计基于某一列的一列的数值

  1. import pandas as pd
  2. df = pd.DataFrame({'DateOfBirth': ['1986-11-11', '1999-05-12', '1976-01-01',
  3. '1986-06-01', '1983-06-04', '1990-03-07',
  4. '1999-07-09'],
  5. 'State': ['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX']
  6. },
  7. index=['Jane', 'Nick', 'Aaron', 'Penelope', 'Dean',
  8. 'Christina', 'Cornelia'])
  9. print(df.groupby('State').DateOfBirth.nunique())

OutPut

  1. State
  2. AK 1
  3. AL 1
  4. FL 1
  5. NY 1
  6. TX 3
  7. Name: DateOfBirth, dtype: int64

处理 DataFrame 中的缺失值

  1. import pandas as pd
  2. df = pd.DataFrame([[10, 20, 30, 40], [7, 14, 21, 28], [5,]],
  3. columns=['Apple', 'Orange', 'Banana', 'Pear'],
  4. index=['Basket1', 'Basket2', 'Basket3'])
  5. print("\n--------- DataFrame ---------\n")
  6. print(df)
  7. print("\n--------- Use of isnull() ---------\n")
  8. print(df.isnull())
  9. print("\n--------- Use of notnull() ---------\n")
  10. print(df.notnull())

OutPut

  1. --------- DataFrame ---------
  2. Apple Orange Banana Pear
  3. Basket1 10 20.0 30.0 40.0
  4. Basket2 7 14.0 21.0 28.0
  5. Basket3 5 NaN NaN NaN
  6. --------- Use of isnull() ---------
  7. Apple Orange Banana Pear
  8. Basket1 False False False False
  9. Basket2 False False False False
  10. Basket3 False True True True
  11. --------- Use of notnull() ---------
  12. Apple Orange Banana Pear
  13. Basket1 True True True True
  14. Basket2 True True True True
  15. Basket3 True False False False

删除包含任何缺失数据的行

  1. import pandas as pd
  2. df = pd.DataFrame([[10, 20, 30, 40], [7, 14, 21, 28], [5,]],
  3. columns=['Apple', 'Orange', 'Banana', 'Pear'],
  4. index=['Basket1', 'Basket2', 'Basket3'])
  5. print("\n--------- DataFrame ---------\n")
  6. print(df)
  7. print("\n--------- Use of dropna() ---------\n")
  8. print(df.dropna())

OutPut

  1. --------- DataFrame ---------
  2. Apple Orange Banana Pear
  3. Basket1 10 20.0 30.0 40.0
  4. Basket2 7 14.0 21.0 28.0
  5. Basket3 5 NaN NaN NaN
  6. --------- Use of dropna() ---------
  7. Apple Orange Banana Pear
  8. Basket1 10 20.0 30.0 40.0
  9. Basket2 7 14.0 21.0 28.0

删除 DataFrame 中缺失数据的列

  1. import pandas as pd
  2. df = pd.DataFrame([[10, 20, 30, 40], [7, 14, 21, 28], [5,]],
  3. columns=['Apple', 'Orange', 'Banana', 'Pear'],
  4. index=['Basket1', 'Basket2', 'Basket3'])
  5. print("\n--------- DataFrame ---------\n")
  6. print(df)
  7. print("\n--------- Drop Columns) ---------\n")
  8. print(df.dropna(1))

OutPut

  1. --------- DataFrame ---------
  2. Apple Orange Banana Pear
  3. Basket1 10 20.0 30.0 40.0
  4. Basket2 7 14.0 21.0 28.0
  5. Basket3 5 NaN NaN NaN
  6. --------- Drop Columns) ---------
  7. Apple
  8. Basket1 10
  9. Basket2 7
  10. Basket3 5

按降序对索引值进行排序

  1. import pandas as pd
  2. df = pd.DataFrame({'DateOfBirth': ['1986-11-11', '1999-05-12', '1976-01-01',
  3. '1986-06-01', '1983-06-04', '1990-03-07',
  4. '1999-07-09'],
  5. 'State': ['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX']
  6. },
  7. index=['Jane', 'Pane', 'Aaron', 'Penelope', 'Frane',
  8. 'Christina', 'Cornelia'])
  9. print(df.sort_index(ascending=False))

OutPut

  1. DateOfBirth State
  2. Penelope 1986-06-01 AL
  3. Pane 1999-05-12 TX
  4. Jane 1986-11-11 NY
  5. Frane 1983-06-04 AK
  6. Cornelia 1999-07-09 TX
  7. Christina 1990-03-07 TX
  8. Aaron 1976-01-01 FL

按降序对列进行排序

  1. import pandas as pd
  2. employees = pd.DataFrame({
  3. 'EmpCode': ['Emp001', 'Emp002', 'Emp003', 'Emp004', 'Emp005'],
  4. 'Name': ['John', 'Doe', 'William', 'Spark', 'Mark'],
  5. 'Occupation': ['Chemist', 'Statistician', 'Statistician',
  6. 'Statistician', 'Programmer'],
  7. 'Date Of Join': ['2018-01-25', '2018-01-26', '2018-01-26', '2018-02-26',
  8. '2018-03-16'],
  9. 'Age': [23, 24, 34, 29, 40]})
  10. print(employees.sort_index(axis=1, ascending=False))

OutPut

  1. Occupation Name EmpCode Date Of Join Age
  2. 0 Chemist John Emp001 2018-01-25 23
  3. 1 Statistician Doe Emp002 2018-01-26 24
  4. 2 Statistician William Emp003 2018-01-26 34
  5. 3 Statistician Spark Emp004 2018-02-26 29
  6. 4 Programmer Mark Emp005 2018-03-16 40

使用 rank 方法查找 DataFrame 中元素的排名

  1. import pandas as pd
  2. df = pd.DataFrame([[10, 20, 30, 40], [7, 14, 21, 28], [5, 5, 0, 0]],
  3. columns=['Apple', 'Orange', 'Banana', 'Pear'],
  4. index=['Basket1', 'Basket2', 'Basket3'])
  5. print("\n--------- DataFrame Values--------\n")
  6. print(df)
  7. print("\n--------- DataFrame Values by Rank--------\n")
  8. print(df.rank())

OutPut

  1. --------- DataFrame Values--------
  2. Apple Orange Banana Pear
  3. Basket1 10 20 30 40
  4. Basket2 7 14 21 28
  5. Basket3 5 5 0 0
  6. --------- DataFrame Values by Rank--------
  7. Apple Orange Banana Pear
  8. Basket1 3.0 3.0 3.0 3.0
  9. Basket2 2.0 2.0 2.0 2.0
  10. Basket3 1.0 1.0 1.0 1.0

在多列上设置索引

  1. import pandas as pd
  2. employees = pd.DataFrame({
  3. 'EmpCode': ['Emp001', 'Emp002', 'Emp003', 'Emp004', 'Emp005'],
  4. 'Name': ['John', 'Doe', 'William', 'Spark', 'Mark'],
  5. 'Occupation': ['Chemist', 'Statistician', 'Statistician',
  6. 'Statistician', 'Programmer'],
  7. 'Date Of Join': ['2018-01-25', '2018-01-26', '2018-01-26', '2018-02-26',
  8. '2018-03-16'],
  9. 'Age': [23, 24, 34, 29, 40]})
  10. print("\n --------- Before Index ----------- \n")
  11. print(employees)
  12. print("\n --------- Multiple Indexing ----------- \n")
  13. print(employees.set_index(['Occupation', 'Age']))

OutPut

  1. Date Of Join EmpCode Name
  2. Occupation Age
  3. Chemist 23 2018-01-25 Emp001 John
  4. Statistician 24 2018-01-26 Emp002 Doe
  5. 34 2018-01-26 Emp003 William
  6. 29 2018-02-26 Emp004 Spark
  7. Programmer 40 2018-03-16 Emp005 Mark

确定 DataFrame 的周期索引和列

  1. import pandas as pd
  2. values = ["India", "Canada", "Australia",
  3. "Japan", "Germany", "France"]
  4. pidx = pd.period_range('2015-01-01', periods=6)
  5. df = pd.DataFrame(values, index=pidx, columns=['Country'])
  6. print(df)

OutPut

  1. Country
  2. 2015-01-01 India
  3. 2015-01-02 Canada
  4. 2015-01-03 Australia
  5. 2015-01-04 Japan
  6. 2015-01-05 Germany
  7. 2015-01-06 France

导入 CSV 指定特定索引

  1. import pandas as pd
  2. df = pd.read_csv('test.csv', index_col="DateTime")
  3. print(df)

OutPut

  1. Wheat Rice Oil
  2. DateTime
  3. 10/10/2016 10.500 12.500 16.500
  4. 10/11/2016 11.250 12.750 17.150
  5. 10/12/2016 10.000 13.150 15.500
  6. 10/13/2016 12.000 14.500 16.100
  7. 10/14/2016 13.000 14.825 15.600
  8. 10/15/2016 13.075 15.465 15.315
  9. 10/16/2016 13.650 16.105 15.030
  10. 10/17/2016 14.225 16.745 14.745
  11. 10/18/2016 14.800 17.385 14.460
  12. 10/19/2016 15.375 18.025 14.175

将 DataFrame 写入 csv

  1. import pandas as pd
  2. df = pd.DataFrame({'DateOfBirth': ['1986-11-11', '1999-05-12', '1976-01-01',
  3. '1986-06-01', '1983-06-04', '1990-03-07',
  4. '1999-07-09'],
  5. 'State': ['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX']
  6. },
  7. index=['Jane', 'Pane', 'Aaron', 'Penelope', 'Frane',
  8. 'Christina', 'Cornelia'])
  9. df.to_csv('test.csv', encoding='utf-8', index=True)

OutPut

  1. 检查本地文件

使用 Pandas 读取 csv 文件的特定列

  1. import pandas as pd
  2. df = pd.read_csv("test.csv", usecols = ['Wheat','Oil'])
  3. print(df)

Pandas 获取 CSV 列的列表

  1. import pandas as pd
  2. cols = list(pd.read_csv("test.csv", nrows =1))
  3. print(cols)

OutPut

  1. ['DateTime', 'Wheat', 'Rice', 'Oil']

找到列值最大的行

  1. import pandas as pd
  2. df = pd.DataFrame([[10, 20, 30, 40], [7, 14, 21, 28], [55, 15, 8, 12]],
  3. columns=['Apple', 'Orange', 'Banana', 'Pear'],
  4. index=['Basket1', 'Basket2', 'Basket3'])
  5. print(df.ix[df['Apple'].idxmax()])

OutPut

  1. Apple 55
  2. Orange 15
  3. Banana 8
  4. Pear 12
  5. Name: Basket3, dtype: int64

使用查询方法进行复杂条件选择

  1. import pandas as pd
  2. df = pd.DataFrame([[10, 20, 30, 40], [7, 14, 21, 28], [55, 15, 8, 12]],
  3. columns=['Apple', 'Orange', 'Banana', 'Pear'],
  4. index=['Basket1', 'Basket2', 'Basket3'])
  5. print(df)
  6. print("\n ----------- Filter data using query method ------------- \n")
  7. df1 = df.ix[df.query('Apple > 50 & Orange <= 15 & Banana < 15 & Pear == 12').index]
  8. print(df1)

OutPut

  1. Apple Orange Banana Pear
  2. Basket1 10 20 30 40
  3. Basket2 7 14 21 28
  4. Basket3 55 15 8 12
  5. ----------- Filter data using query method -------------
  6. Apple Orange Banana Pear
  7. Basket3 55 15 8 12

检查 Pandas 中是否存在列

  1. import pandas as pd
  2. df = pd.DataFrame([[10, 20, 30, 40], [7, 14, 21, 28], [55, 15, 8, 12]],
  3. columns=['Apple', 'Orange', 'Banana', 'Pear'],
  4. index=['Basket1', 'Basket2', 'Basket3'])
  5. if 'Apple' in df.columns:
  6. print("Yes")
  7. else:
  8. print("No")
  9. if set(['Apple','Orange']).issubset(df.columns):
  10. print("Yes")
  11. else:
  12. print("No")

为特定列从 DataFrame 中查找 n-smallest 和 n-largest 值

  1. import pandas as pd
  2. df = pd.DataFrame([[10, 20, 30, 40], [7, 14, 21, 28], [55, 15, 8, 12],
  3. [15, 14, 1, 8], [7, 1, 1, 8], [5, 4, 9, 2]],
  4. columns=['Apple', 'Orange', 'Banana', 'Pear'],
  5. index=['Basket1', 'Basket2', 'Basket3', 'Basket4',
  6. 'Basket5', 'Basket6'])
  7. print("\n----------- nsmallest -----------\n")
  8. print(df.nsmallest(2, ['Apple']))
  9. print("\n----------- nlargest -----------\n")
  10. print(df.nlargest(2, ['Apple']))

OutPut

  1. ----------- nsmallest -----------
  2. Apple Orange Banana Pear
  3. Basket6 5 4 9 2
  4. Basket2 7 14 21 28
  5. ----------- nlargest -----------
  6. Apple Orange Banana Pear
  7. Basket3 55 15 8 12
  8. Basket4 15 14 1 8

从 DataFrame 中查找所有列的最小值和最大值

  1. import pandas as pd
  2. df = pd.DataFrame([[10, 20, 30, 40], [7, 14, 21, 28], [55, 15, 8, 12],
  3. [15, 14, 1, 8], [7, 1, 1, 8], [5, 4, 9, 2]],
  4. columns=['Apple', 'Orange', 'Banana', 'Pear'],
  5. index=['Basket1', 'Basket2', 'Basket3', 'Basket4',
  6. 'Basket5', 'Basket6'])
  7. print("\n----------- Minimum -----------\n")
  8. print(df[['Apple', 'Orange', 'Banana', 'Pear']].min())
  9. print("\n----------- Maximum -----------\n")
  10. print(df[['Apple', 'Orange', 'Banana', 'Pear']].max())

OutPut

  1. ----------- Minimum -----------
  2. Apple 5
  3. Orange 1
  4. Banana 1
  5. Pear 2
  6. dtype: int64
  7. ----------- Maximum -----------
  8. Apple 55
  9. Orange 20
  10. Banana 30
  11. Pear 40
  12. dtype: int64

在 DataFrame 中找到最小值和最大值所在的索引位置

  1. import pandas as pd
  2. df = pd.DataFrame([[10, 20, 30, 40], [7, 14, 21, 28], [55, 15, 8, 12],
  3. [15, 14, 1, 8], [7, 1, 1, 8], [5, 4, 9, 2]],
  4. columns=['Apple', 'Orange', 'Banana', 'Pear'],
  5. index=['Basket1', 'Basket2', 'Basket3', 'Basket4',
  6. 'Basket5', 'Basket6'])
  7. print("\n----------- Minimum -----------\n")
  8. print(df[['Apple', 'Orange', 'Banana', 'Pear']].idxmin())
  9. print("\n----------- Maximum -----------\n")
  10. print(df[['Apple', 'Orange', 'Banana', 'Pear']].idxmax())

OutPut

  1. ----------- Minimum -----------
  2. Apple Basket6
  3. Orange Basket5
  4. Banana Basket4
  5. Pear Basket6
  6. dtype: object
  7. ----------- Maximum -----------
  8. Apple Basket3
  9. Orange Basket1
  10. Banana Basket1
  11. Pear Basket1
  12. dtype: object

计算 DataFrame Columns 的累积乘积和累积总和

  1. import pandas as pd
  2. df = pd.DataFrame([[10, 20, 30, 40], [7, 14, 21, 28], [55, 15, 8, 12],
  3. [15, 14, 1, 8], [7, 1, 1, 8], [5, 4, 9, 2]],
  4. columns=['Apple', 'Orange', 'Banana', 'Pear'],
  5. index=['Basket1', 'Basket2', 'Basket3', 'Basket4',
  6. 'Basket5', 'Basket6'])
  7. print("\n----------- Cumulative Product -----------\n")
  8. print(df[['Apple', 'Orange', 'Banana', 'Pear']].cumprod())
  9. print("\n----------- Cumulative Sum -----------\n")
  10. print(df[['Apple', 'Orange', 'Banana', 'Pear']].cumsum())

OutPut

  1. ----------- Cumulative Product -----------
  2. Apple Orange Banana Pear
  3. Basket1 10 20 30 40
  4. Basket2 70 280 630 1120
  5. Basket3 3850 4200 5040 13440
  6. Basket4 57750 58800 5040 107520
  7. Basket5 404250 58800 5040 860160
  8. Basket6 2021250 235200 45360 1720320
  9. ----------- Cumulative Sum -----------
  10. Apple Orange Banana Pear
  11. Basket1 10 20 30 40
  12. Basket2 17 34 51 68
  13. Basket3 72 49 59 80
  14. Basket4 87 63 60 88
  15. Basket5 94 64 61 96
  16. Basket6 99 68 70 98

汇总统计

  1. import pandas as pd
  2. df = pd.DataFrame([[10, 20, 30, 40], [7, 14, 21, 28], [55, 15, 8, 12],
  3. [15, 14, 1, 8], [7, 1, 1, 8], [5, 4, 9, 2]],
  4. columns=['Apple', 'Orange', 'Banana', 'Pear'],
  5. index=['Basket1', 'Basket2', 'Basket3', 'Basket4',
  6. 'Basket5', 'Basket6'])
  7. print("\n----------- Describe DataFrame -----------\n")
  8. print(df.describe())
  9. print("\n----------- Describe Column -----------\n")
  10. print(df[['Apple']].describe())

OutPut

  1. ----------- Describe DataFrame -----------
  2. Apple Orange Banana Pear
  3. count 6.000000 6.000000 6.000000 6.000000
  4. mean 16.500000 11.333333 11.666667 16.333333
  5. std 19.180719 7.257180 11.587349 14.555640
  6. min 5.000000 1.000000 1.000000 2.000000
  7. 25% 7.000000 6.500000 2.750000 8.000000
  8. 50% 8.500000 14.000000 8.500000 10.000000
  9. 75% 13.750000 14.750000 18.000000 24.000000
  10. max 55.000000 20.000000 30.000000 40.000000
  11. ----------- Describe Column -----------
  12. Apple
  13. count 6.000000
  14. mean 16.500000
  15. std 19.180719
  16. min 5.000000
  17. 25% 7.000000
  18. 50% 8.500000
  19. 75% 13.750000
  20. max 55.000000

查找 DataFrame 的均值、中值和众数

  1. import pandas as pd
  2. df = pd.DataFrame([[10, 20, 30, 40], [7, 14, 21, 28], [55, 15, 8, 12],
  3. [15, 14, 1, 8], [7, 1, 1, 8], [5, 4, 9, 2]],
  4. columns=['Apple', 'Orange', 'Banana', 'Pear'],
  5. index=['Basket1', 'Basket2', 'Basket3', 'Basket4',
  6. 'Basket5', 'Basket6'])
  7. print("\n----------- Calculate Mean -----------\n")
  8. print(df.mean())
  9. print("\n----------- Calculate Median -----------\n")
  10. print(df.median())
  11. print("\n----------- Calculate Mode -----------\n")
  12. print(df.mode())

OutPut

  1. ----------- Calculate Mean -----------
  2. Apple 16.500000
  3. Orange 11.333333
  4. Banana 11.666667
  5. Pear 16.333333
  6. dtype: float64
  7. ----------- Calculate Median -----------
  8. Apple 8.5
  9. Orange 14.0
  10. Banana 8.5
  11. Pear 10.0
  12. dtype: float64
  13. ----------- Calculate Mode -----------
  14. Apple Orange Banana Pear
  15. 0 7 14 1 8

测量 DataFrame 列的方差和标准偏差

  1. import pandas as pd
  2. df = pd.DataFrame([[10, 20, 30, 40], [7, 14, 21, 28], [55, 15, 8, 12],
  3. [15, 14, 1, 8], [7, 1, 1, 8], [5, 4, 9, 2]],
  4. columns=['Apple', 'Orange', 'Banana', 'Pear'],
  5. index=['Basket1', 'Basket2', 'Basket3', 'Basket4',
  6. 'Basket5', 'Basket6'])
  7. print("\n----------- Calculate Mean -----------\n")
  8. print(df.mean())
  9. print("\n----------- Calculate Median -----------\n")
  10. print(df.median())
  11. print("\n----------- Calculate Mode -----------\n")
  12. print(df.mode())

OutPut

  1. ----------- Measure Variance -----------
  2. Apple 367.900000
  3. Orange 52.666667
  4. Banana 134.266667
  5. Pear 211.866667
  6. dtype: float64
  7. ----------- Standard Deviation -----------
  8. Apple 19.180719
  9. Orange 7.257180
  10. Banana 11.587349
  11. Pear 14.555640
  12. dtype: float64

计算 DataFrame 列之间的协方差

  1. import pandas as pd
  2. df = pd.DataFrame([[10, 20, 30, 40], [7, 14, 21, 28], [55, 15, 8, 12],
  3. [15, 14, 1, 8], [7, 1, 1, 8], [5, 4, 9, 2]],
  4. columns=['Apple', 'Orange', 'Banana', 'Pear'],
  5. index=['Basket1', 'Basket2', 'Basket3', 'Basket4',
  6. 'Basket5', 'Basket6'])
  7. print("\n----------- Calculating Covariance -----------\n")
  8. print(df.cov())
  9. print("\n----------- Between 2 columns -----------\n")
  10. # Covariance of Apple vs Orange
  11. print(df.Apple.cov(df.Orange))

OutPut

  1. ----------- Calculating Covariance -----------
  2. Apple Orange Banana Pear
  3. Apple 367.9 47.600000 -40.200000 -35.000000
  4. Orange 47.6 52.666667 54.333333 77.866667
  5. Banana -40.2 54.333333 134.266667 154.933333
  6. Pear -35.0 77.866667 154.933333 211.866667
  7. ----------- Between 2 columns -----------
  8. 47.60000000000001

计算 Pandas 中两个 DataFrame 对象之间的相关性

  1. import pandas as pd
  2. df1 = pd.DataFrame([[10, 20, 30, 40], [7, 14, 21, 28], [55, 15, 8, 12],
  3. [15, 14, 1, 8], [7, 1, 1, 8], [5, 4, 9, 2]],
  4. columns=['Apple', 'Orange', 'Banana', 'Pear'],
  5. index=['Basket1', 'Basket2', 'Basket3', 'Basket4',
  6. 'Basket5', 'Basket6'])
  7. print("\n------ Calculating Correlation of one DataFrame Columns -----\n")
  8. print(df1.corr())
  9. df2 = pd.DataFrame([[52, 54, 58, 41], [14, 24, 51, 78], [55, 15, 8, 12],
  10. [15, 14, 1, 8], [7, 17, 18, 98], [15, 34, 29, 52]],
  11. columns=['Apple', 'Orange', 'Banana', 'Pear'],
  12. index=['Basket1', 'Basket2', 'Basket3', 'Basket4',
  13. 'Basket5', 'Basket6'])
  14. print("\n----- Calculating correlation between two DataFrame -------\n")
  15. print(df2.corrwith(other=df1))

OutPut

  1. ------ Calculating Correlation of one DataFrame Columns -----
  2. Apple Orange Banana Pear
  3. Apple 1.000000 0.341959 -0.180874 -0.125364
  4. Orange 0.341959 1.000000 0.646122 0.737144
  5. Banana -0.180874 0.646122 1.000000 0.918606
  6. Pear -0.125364 0.737144 0.918606 1.000000
  7. ----- Calculating correlation between two DataFrame -------
  8. Apple 0.678775
  9. Orange 0.354993
  10. Banana 0.920872
  11. Pear 0.076919
  12. dtype: float64

计算 DataFrame 列的每个单元格的百分比变化

  1. import pandas as pd
  2. df = pd.DataFrame([[10, 20, 30, 40], [7, 14, 21, 28], [55, 15, 8, 12],
  3. [15, 14, 1, 8], [7, 1, 1, 8], [5, 4, 9, 2]],
  4. columns=['Apple', 'Orange', 'Banana', 'Pear'],
  5. index=['Basket1', 'Basket2', 'Basket3', 'Basket4',
  6. 'Basket5', 'Basket6'])
  7. print("\n------ Percent change at each cell of a Column -----\n")
  8. print(df[['Apple']].pct_change()[:3])
  9. print("\n------ Percent change at each cell of a DataFrame -----\n")
  10. print(df.pct_change()[:5])

OutPut

  1. ------ Percent change at each cell of a Column -----
  2. Apple
  3. Basket1 NaN
  4. Basket2 -0.300000
  5. Basket3 6.857143
  6. ------ Percent change at each cell of a DataFrame -----
  7. Apple Orange Banana Pear
  8. Basket1 NaN NaN NaN NaN
  9. Basket2 -0.300000 -0.300000 -0.300000 -0.300000
  10. Basket3 6.857143 0.071429 -0.619048 -0.571429
  11. Basket4 -0.727273 -0.066667 -0.875000 -0.333333
  12. Basket5 -0.533333 -0.928571 0.000000 0.000000

在 Pandas 中向前和向后填充 DataFrame 列的缺失值

  1. import pandas as pd
  2. df = pd.DataFrame([[10, 30, 40], [], [15, 8, 12],
  3. [15, 14, 1, 8], [7, 8], [5, 4, 1]],
  4. columns=['Apple', 'Orange', 'Banana', 'Pear'],
  5. index=['Basket1', 'Basket2', 'Basket3', 'Basket4',
  6. 'Basket5', 'Basket6'])
  7. print("\n------ DataFrame with NaN -----\n")
  8. print(df)
  9. print("\n------ DataFrame with Forward Filling -----\n")
  10. print(df.ffill())
  11. print("\n------ DataFrame with Forward Filling -----\n")
  12. print(df.bfill())

OutPut

  1. ------ DataFrame with NaN -----
  2. Apple Orange Banana Pear
  3. Basket1 10.0 30.0 40.0 NaN
  4. Basket2 NaN NaN NaN NaN
  5. Basket3 15.0 8.0 12.0 NaN
  6. Basket4 15.0 14.0 1.0 8.0
  7. Basket5 7.0 8.0 NaN NaN
  8. Basket6 5.0 4.0 1.0 NaN
  9. ------ DataFrame with Forward Filling -----
  10. Apple Orange Banana Pear
  11. Basket1 10.0 30.0 40.0 NaN
  12. Basket2 10.0 30.0 40.0 NaN
  13. Basket3 15.0 8.0 12.0 NaN
  14. Basket4 15.0 14.0 1.0 8.0
  15. Basket5 7.0 8.0 1.0 8.0
  16. Basket6 5.0 4.0 1.0 8.0
  17. ------ DataFrame with Forward Filling -----
  18. Apple Orange Banana Pear
  19. Basket1 10.0 30.0 40.0 8.0
  20. Basket2 15.0 8.0 12.0 8.0
  21. Basket3 15.0 8.0 12.0 8.0
  22. Basket4 15.0 14.0 1.0 8.0
  23. Basket5 7.0 8.0 1.0 NaN
  24. Basket6 5.0 4.0 1.0 NaN

在 Pandas 中使用非分层索引使用 Stacking

  1. import pandas as pd
  2. df = pd.DataFrame([[10, 30, 40], [], [15, 8, 12],
  3. [15, 14, 1, 8], [7, 8], [5, 4, 1]],
  4. columns=['Apple', 'Orange', 'Banana', 'Pear'],
  5. index=['Basket1', 'Basket2', 'Basket3', 'Basket4',
  6. 'Basket5', 'Basket6'])
  7. print("\n------ DataFrame-----\n")
  8. print(df)
  9. print("\n------ Stacking DataFrame -----\n")
  10. print(df.stack(level=-1))

OutPut

  1. ------ DataFrame-----
  2. Apple Orange Banana Pear
  3. Basket1 10.0 30.0 40.0 NaN
  4. Basket2 NaN NaN NaN NaN
  5. Basket3 15.0 8.0 12.0 NaN
  6. Basket4 15.0 14.0 1.0 8.0
  7. Basket5 7.0 8.0 NaN NaN
  8. Basket6 5.0 4.0 1.0 NaN
  9. ------ Stacking DataFrame -----
  10. Basket1 Apple 10.0
  11. Orange 30.0
  12. Banana 40.0
  13. Basket3 Apple 15.0
  14. Orange 8.0
  15. Banana 12.0
  16. Basket4 Apple 15.0
  17. Orange 14.0
  18. Banana 1.0
  19. Pear 8.0
  20. Basket5 Apple 7.0
  21. Orange 8.0
  22. Basket6 Apple 5.0
  23. Orange 4.0
  24. Banana 1.0
  25. dtype: float64

使用分层索引对 Pandas 进行拆分

  1. import pandas as pd
  2. df = pd.DataFrame([[10, 30, 40], [], [15, 8, 12],
  3. [15, 14, 1, 8], [7, 8], [5, 4, 1]],
  4. columns=['Apple', 'Orange', 'Banana', 'Pear'],
  5. index=['Basket1', 'Basket2', 'Basket3', 'Basket4',
  6. 'Basket5', 'Basket6'])
  7. print("\n------ DataFrame-----\n")
  8. print(df)
  9. print("\n------ Unstacking DataFrame -----\n")
  10. print(df.unstack(level=-1))

OutPut

  1. ------ DataFrame-----
  2. Apple Orange Banana Pear
  3. Basket1 10.0 30.0 40.0 NaN
  4. Basket2 NaN NaN NaN NaN
  5. Basket3 15.0 8.0 12.0 NaN
  6. Basket4 15.0 14.0 1.0 8.0
  7. Basket5 7.0 8.0 NaN NaN
  8. Basket6 5.0 4.0 1.0 NaN
  9. ------ Unstacking DataFrame -----
  10. Apple Basket1 10.0
  11. Basket2 NaN
  12. Basket3 15.0
  13. Basket4 15.0
  14. Basket5 7.0
  15. Basket6 5.0
  16. Orange Basket1 30.0
  17. Basket2 NaN
  18. Basket3 8.0
  19. Basket4 14.0
  20. Basket5 8.0
  21. Basket6 4.0
  22. Banana Basket1 40.0
  23. Basket2 NaN
  24. Basket3 12.0
  25. Basket4 1.0
  26. Basket5 NaN
  27. Basket6 1.0
  28. Pear Basket1 NaN
  29. Basket2 NaN
  30. Basket3 NaN
  31. Basket4 8.0
  32. Basket5 NaN
  33. Basket6 NaN
  34. dtype: float64

Pandas 获取 HTML 页面上 table 数据

  1. import pandas as pd
  2. df pd.read_html("url")