- 90个Pandas案例">90个Pandas案例
- 在 DataFrame 中使用“isin”过滤多行
- 迭代 DataFrame 的行和列
- 如何通过名称或索引删除 DataFrame 的列
- 向 DataFrame 中新增列
- 如何从 DataFrame 中获取列标题列表
- 如何随机生成 DataFrame
- 如何选择 DataFrame 的多个列
- 如何将字典转换为 DataFrame
- 使用 ioc 进行切片
- 检查 DataFrame 中是否是空的
- 在创建 DataFrame 时指定索引和列名称
- 使用 iloc 进行切片
- iloc 和 loc 的区别
- 使用时间索引创建空 DataFrame
- 如何改变 DataFrame 列的排序
- 检查 DataFrame 列的数据类型
- 更改 DataFrame 指定列的数据类型
- 如何将列的数据类型转换为 DateTime 类型
- 将 DataFrame 列从 floats 转为 ints
- 如何把 dates 列转换为 DateTime 类型
- 两个 DataFrame 相加
- 在 DataFrame 末尾添加额外的行
- 为指定索引添加新行
- 如何使用 for 循环添加行
- 在 DataFrame 顶部添加一行
- 如何向 DataFrame 中动态添加行
- 在任意位置插入行
- 使用时间戳索引向 DataFrame 中添加行
- 为不同的行填充缺失值
- append, concat 和 combine_first 示例
- 获取行和列的平均值
- 计算行和列的总和
- 连接两列
- 过滤包含某字符串的行
- 过滤索引中包含某字符串的行
- 使用 AND 运算符过滤包含特定字符串值的行
- 查找包含某字符串的所有行
- 如果行中的值包含字符串,则创建与字符串相等的另一列
- 计算 pandas group 中每组的行数
- 检查字符串是否在 DataFrme 中
- 从 DataFrame 列中获取唯一行值
- 计算 DataFrame 列的不同值
- 删除具有重复索引的行
- 删除某些列具有重复值的行
- 从 DataFrame 单元格中获取值
- 使用 DataFrame 中的条件索引获取单元格上的标量值
- 设置 DataFrame 的特定单元格值
- 从 DataFrame 行获取单元格值
- 用字典替换 DataFrame 列中的值
- 统计基于某一列的一列的数值
- 处理 DataFrame 中的缺失值
- 删除包含任何缺失数据的行
- 删除 DataFrame 中缺失数据的列
- 按降序对索引值进行排序
- 按降序对列进行排序
- 使用 rank 方法查找 DataFrame 中元素的排名
- 在多列上设置索引
- 确定 DataFrame 的周期索引和列
- 导入 CSV 指定特定索引
- 将 DataFrame 写入 csv
- 使用 Pandas 读取 csv 文件的特定列
- Pandas 获取 CSV 列的列表
- 找到列值最大的行
- 使用查询方法进行复杂条件选择
- 检查 Pandas 中是否存在列
- 为特定列从 DataFrame 中查找 n-smallest 和 n-largest 值
- 从 DataFrame 中查找所有列的最小值和最大值
- 在 DataFrame 中找到最小值和最大值所在的索引位置
- 计算 DataFrame Columns 的累积乘积和累积总和
- 汇总统计
- 查找 DataFrame 的均值、中值和众数
- 测量 DataFrame 列的方差和标准偏差
- 计算 DataFrame 列之间的协方差
- 计算 Pandas 中两个 DataFrame 对象之间的相关性
- 计算 DataFrame 列的每个单元格的百分比变化
- 在 Pandas 中向前和向后填充 DataFrame 列的缺失值
- 在 Pandas 中使用非分层索引使用 Stacking
- 使用分层索引对 Pandas 进行拆分
- Pandas 获取 HTML 页面上 table 数据
90个Pandas案例
- 如何使用列表和字典创建 Series
- 使用列表创建 Series
- 使用 name 参数创建 Series
- 使用简写的列表创建 Series
- 使用字典创建 Series
- 如何使用 Numpy 函数创建 Series
- 如何获取 Series 的索引和值
- 如何在创建 Series 时指定索引
- 如何获取 Series 的大小和形状
- 如何获取 Series 开始或末尾几行数据
- Head()
- Tail()
- Take()
- 使用切片获取 Series 子集
- 如何创建 DataFrame
- 如何设置 DataFrame 的索引和列信息
- 如何重命名 DataFrame 的列名称
- 如何根据 Pandas 列中的值从 DataFrame 中选择或过滤行
- 在 DataFrame 中使用“isin”过滤多行
- 迭代 DataFrame 的行和列
- 如何通过名称或索引删除 DataFrame 的列
- 向 DataFrame 中新增列
- 如何从 DataFrame 中获取列标题列表
- 如何随机生成 DataFrame
- 如何选择 DataFrame 的多个列
- 如何将字典转换为 DataFrame
- 使用 ioc 进行切片
- 检查 DataFrame 中是否是空的
- 在创建 DataFrame 时指定索引和列名称
- 使用 iloc 进行切片
- iloc 和 loc 的区别
- 使用时间索引创建空 DataFrame
- 如何改变 DataFrame 列的排序
- 检查 DataFrame 列的数据类型
- 更改 DataFrame 指定列的数据类型
- 如何将列的数据类型转换为 DateTime 类型
- 将 DataFrame 列从 floats 转为 ints
- 如何把 dates 列转换为 DateTime 类型
- 两个 DataFrame 相加
- 在 DataFrame 末尾添加额外的行
- 为指定索引添加新行
- 如何使用 for 循环添加行
- 在 DataFrame 顶部添加一行
- 如何向 DataFrame 中动态添加行
- 在任意位置插入行
- 使用时间戳索引向 DataFrame 中添加行
- 为不同的行填充缺失值
- append, concat 和 combine_first 示例
- 获取行和列的平均值
- 计算行和列的总和
- 连接两列
- 过滤包含某字符串的行
- 过滤索引中包含某字符串的行
- 使用 AND 运算符过滤包含特定字符串值的行
- 查找包含某字符串的所有行
- 如果行中的值包含字符串,则创建与字符串相等的另一列
- 计算 pandas group 中每组的行数
- 检查字符串是否在 DataFrme 中
- 从 DataFrame 列中获取唯一行值
- 计算 DataFrame 列的不同值
- 删除具有重复索引的行
- 删除某些列具有重复值的行
- 从 DataFrame 单元格中获取值
- 使用 DataFrame 中的条件索引获取单元格上的标量值
- 设置 DataFrame 的特定单元格值
- 从 DataFrame 行获取单元格值
- 用字典替换 DataFrame 列中的值
- 统计基于某一列的一列的数值
- 处理 DataFrame 中的缺失值
- 删除包含任何缺失数据的行
- 删除 DataFrame 中缺失数据的列
- 按降序对索引值进行排序
- 按降序对列进行排序
- 使用 rank 方法查找 DataFrame 中元素的排名
- 在多列上设置索引
- 确定 DataFrame 的周期索引和列
- 导入 CSV 指定特定索引
- 将 DataFrame 写入 csv
- 使用 Pandas 读取 csv 文件的特定列
- Pandas 获取 CSV 列的列表
- 找到列值最大的行
- 使用查询方法进行复杂条件选择
- 检查 Pandas 中是否存在列
- 为特定列从 DataFrame 中查找 n-smallest 和 n-largest 值
- 从 DataFrame 中查找所有列的最小值和最大值
- 在 DataFrame 中找到最小值和最大值所在的索引位置
- 计算 DataFrame Columns 的累积乘积和累积总和
- 汇总统计
- 查找 DataFrame 的均值、中值和众数
- 测量 DataFrame 列的方差和标准偏差
- 计算 DataFrame 列之间的协方差
- 计算 Pandas 中两个 DataFrame 对象之间的相关性
- 计算 DataFrame 列的每个单元格的百分比变化
- 在 Pandas 中向前和向后填充 DataFrame 列的缺失值
- 在 Pandas 中使用非分层索引使用 Stacking
- 使用分层索引对 Pandas 进行拆分
- Pandas 获取 HTML 页面上 table 数据
如何使用列表和字典创建 Series
使用列表创建 Series
```python import pandas as pd
ser1 = pd.Series([1.5, 2.5, 3, 4.5, 5.0, 6]) print(ser1)
<a name="dY5xC"></a>#### OutPut```python0 1.51 2.52 3.03 4.54 5.05 6.0dtype: float64
使用 name 参数创建 Series
import pandas as pdser2 = pd.Series(["India", "Canada", "Germany"], name="Countries")print(ser2)
OutPut
0 India1 Canada2 GermanyName: Countries, dtype: object
使用简写的列表创建 Series
import pandas as pdser3 = pd.Series(["A"]*4)print(ser3)
OutPut
0 A1 A2 A3 Adtype: object
使用字典创建 Series
import pandas as pdser4 = pd.Series({"India": "New Delhi","Japan": "Tokyo","UK": "London"})print(ser4)
OutPut
India New DelhiJapan TokyoUK Londondtype: object
如何使用 Numpy 函数创建 Series
import pandas as pdimport numpy as npser1 = pd.Series(np.linspace(1, 10, 5))print(ser1)ser2 = pd.Series(np.random.normal(size=5))print(ser2)
OutPut
0 1.001 3.252 5.503 7.754 10.00dtype: float640 -1.6944521 -1.5700062 1.7137943 0.3382924 0.803511dtype: float64
如何获取 Series 的索引和值
import pandas as pdimport numpy as npser1 = pd.Series({"India": "New Delhi","Japan": "Tokyo","UK": "London"})print(ser1.values)print(ser1.index)print("\n")ser2 = pd.Series(np.random.normal(size=5))print(ser2.index)print(ser2.values)
OutPut
['New Delhi' 'Tokyo' 'London']Index(['India', 'Japan', 'UK'], dtype='object')RangeIndex(start=0, stop=5, step=1)[ 0.66265478 -0.72222211 0.3608642 1.40955436 1.3096732 ]
如何在创建 Series 时指定索引
import pandas as pdvalues = ["India", "Canada", "Australia","Japan", "Germany", "France"]code = ["IND", "CAN", "AUS", "JAP", "GER", "FRA"]ser1 = pd.Series(values, index=code)print(ser1)
OutPut
IND IndiaCAN CanadaAUS AustraliaJAP JapanGER GermanyFRA Francedtype: object
如何获取 Series 的大小和形状
import pandas as pdvalues = ["India", "Canada", "Australia","Japan", "Germany", "France"]code = ["IND", "CAN", "AUS", "JAP", "GER", "FRA"]ser1 = pd.Series(values, index=code)print(len(ser1))print(ser1.shape)print(ser1.size)
OutPut
6(6,)6
如何获取 Series 开始或末尾几行数据
Head()
import pandas as pdvalues = ["India", "Canada", "Australia","Japan", "Germany", "France"]code = ["IND", "CAN", "AUS", "JAP", "GER", "FRA"]ser1 = pd.Series(values, index=code)print("-----Head()-----")print(ser1.head())print("\n\n-----Head(2)-----")print(ser1.head(2))
OutPut
-----Head()-----IND IndiaCAN CanadaAUS AustraliaJAP JapanGER Germanydtype: object-----Head(2)-----IND IndiaCAN Canadadtype: object
Tail()
import pandas as pdvalues = ["India", "Canada", "Australia","Japan", "Germany", "France"]code = ["IND", "CAN", "AUS", "JAP", "GER", "FRA"]ser1 = pd.Series(values, index=code)print("-----Tail()-----")print(ser1.tail())print("\n\n-----Tail(2)-----")print(ser1.tail(2))
OutPut
-----Tail()-----CAN CanadaAUS AustraliaJAP JapanGER GermanyFRA Francedtype: object-----Tail(2)-----GER GermanyFRA Francedtype: object
Take()
import pandas as pdvalues = ["India", "Canada", "Australia","Japan", "Germany", "France"]code = ["IND", "CAN", "AUS", "JAP", "GER", "FRA"]ser1 = pd.Series(values, index=code)print("-----Take()-----")print(ser1.take([2, 4, 5]))
OutPut
-----Take()-----AUS AustraliaGER GermanyFRA Francedtype: object
使用切片获取 Series 子集
import pandas as pdnum = [000, 100, 200, 300, 400, 500, 600, 700, 800, 900]idx = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']series = pd.Series(num, index=idx)print("\n [2:2] \n")print(series[2:4])print("\n [1:6:2] \n")print(series[1:6:2])print("\n [:6] \n")print(series[:6])print("\n [4:] \n")print(series[4:])print("\n [:4:2] \n")print(series[:4:2])print("\n [4::2] \n")print(series[4::2])print("\n [::-1] \n")print(series[::-1])
OutPut
[2:2]C 200D 300dtype: int64[1:6:2]B 100D 300F 500dtype: int64[:6]A 0B 100C 200D 300E 400F 500dtype: int64[4:]E 400F 500G 600H 700I 800J 900dtype: int64[:4:2]A 0C 200dtype: int64[4::2]E 400G 600I 800dtype: int64[::-1]J 900I 800H 700G 600F 500E 400D 300C 200B 100A 0dtype: int64
如何创建 DataFrame
import pandas as pdemployees = pd.DataFrame({'EmpCode': ['Emp001', 'Emp00'],'Name': ['John Doe', 'William Spark'],'Occupation': ['Chemist', 'Statistician'],'Date Of Join': ['2018-01-25', '2018-01-26'],'Age': [23, 24]})print(employees)
OutPut
Age Date Of Join EmpCode Name Occupation0 23 2018-01-25 Emp001 John Doe Chemist1 24 2018-01-26 Emp00 William Spark Statistician
如何设置 DataFrame 的索引和列信息
import pandas as pdemployees = pd.DataFrame(data={'Name': ['John Doe', 'William Spark'],'Occupation': ['Chemist', 'Statistician'],'Date Of Join': ['2018-01-25', '2018-01-26'],'Age': [23, 24]},index=['Emp001', 'Emp002'],columns=['Name', 'Occupation', 'Date Of Join', 'Age'])print(employees)
OutPut
Name Occupation Date Of Join AgeEmp001 John Doe Chemist 2018-01-25 23Emp002 William Spark Statistician 2018-01-26 24
如何重命名 DataFrame 的列名称
import pandas as pdemployees = pd.DataFrame({'EmpCode': ['Emp001', 'Emp00'],'Name': ['John Doe', 'William Spark'],'Occupation': ['Chemist', 'Statistician'],'Date Of Join': ['2018-01-25', '2018-01-26'],'Age': [23, 24]})employees.columns = ['EmpCode', 'EmpName', 'EmpOccupation', 'EmpDOJ', 'EmpAge']print(employees)
OutPut
EmpCode EmpName EmpOccupation EmpDOJ EmpAge0 23 2018-01-25 Emp001 John Doe Chemist1 24 2018-01-26 Emp00 William Spark Statistician
如何根据 Pandas 列中的值从 DataFrame 中选择或过滤行
import pandas as pdemployees = pd.DataFrame({'EmpCode': ['Emp001', 'Emp002', 'Emp003', 'Emp004', 'Emp005'],'Name': ['John', 'Doe', 'William', 'Spark', 'Mark'],'Occupation': ['Chemist', 'Statistician', 'Statistician','Statistician', 'Programmer'],'Date Of Join': ['2018-01-25', '2018-01-26', '2018-01-26', '2018-02-26','2018-03-16'],'Age': [23, 24, 34, 29, 40]})print("\nUse == operator\n")print(employees.loc[employees['Age'] == 23])print("\nUse < operator\n")print(employees.loc[employees['Age'] < 30])print("\nUse != operator\n")print(employees.loc[employees['Occupation'] != 'Statistician'])print("\nMultiple Conditions\n")print(employees.loc[(employees['Occupation'] != 'Statistician') &(employees['Name'] == 'John')])
OutPut
Use == operatorAge Date Of Join EmpCode Name Occupation0 23 2018-01-25 Emp001 John ChemistUse < operatorAge Date Of Join EmpCode Name Occupation0 23 2018-01-25 Emp001 John Chemist1 24 2018-01-26 Emp002 Doe Statistician3 29 2018-02-26 Emp004 Spark StatisticianUse != operatorAge Date Of Join EmpCode Name Occupation0 23 2018-01-25 Emp001 John Chemist4 40 2018-03-16 Emp005 Mark ProgrammerMultiple ConditionsAge Date Of Join EmpCode Name Occupation0 23 2018-01-25 Emp001 John Chemist
在 DataFrame 中使用“isin”过滤多行
import pandas as pdemployees = pd.DataFrame({'EmpCode': ['Emp001', 'Emp002', 'Emp003', 'Emp004', 'Emp005'],'Name': ['John', 'Doe', 'William', 'Spark', 'Mark'],'Occupation': ['Chemist', 'Statistician', 'Statistician','Statistician', 'Programmer'],'Date Of Join': ['2018-01-25', '2018-01-26', '2018-01-26', '2018-02-26','2018-03-16'],'Age': [23, 24, 34, 29, 40]})print("\nUse isin operator\n")print(employees.loc[employees['Occupation'].isin(['Chemist','Programmer'])])print("\nMultiple Conditions\n")print(employees.loc[(employees['Occupation'] == 'Chemist') |(employees['Name'] == 'John') &(employees['Age'] < 30)])
OutPut
Use isin operatorAge Date Of Join EmpCode Name Occupation0 23 2018-01-25 Emp001 John Chemist4 40 2018-03-16 Emp005 Mark ProgrammerMultiple ConditionsAge Date Of Join EmpCode Name Occupation0 23 2018-01-25 Emp001 John Chemist
迭代 DataFrame 的行和列
import pandas as pdemployees = pd.DataFrame({'EmpCode': ['Emp001', 'Emp002', 'Emp003', 'Emp004', 'Emp005'],'Name': ['John', 'Doe', 'William', 'Spark', 'Mark'],'Occupation': ['Chemist', 'Statistician', 'Statistician','Statistician', 'Programmer'],'Date Of Join': ['2018-01-25', '2018-01-26', '2018-01-26', '2018-02-26','2018-03-16'],'Age': [23, 24, 34, 29, 40]})print("\n Example iterrows \n")for index, col in employees.iterrows():print(col['Name'], "--", col['Age'])print("\n Example itertuples \n")for row in employees.itertuples(index=True, name='Pandas'):print(getattr(row, "Name"), "--", getattr(row, "Age"))
OutPut
Example iterrowsJohn -- 23Doe -- 24William -- 34Spark -- 29Mark -- 40Example itertuplesJohn -- 23Doe -- 24William -- 34Spark -- 29Mark -- 40
如何通过名称或索引删除 DataFrame 的列
import pandas as pdemployees = pd.DataFrame({'EmpCode': ['Emp001', 'Emp002', 'Emp003', 'Emp004', 'Emp005'],'Name': ['John', 'Doe', 'William', 'Spark', 'Mark'],'Occupation': ['Chemist', 'Statistician', 'Statistician','Statistician', 'Programmer'],'Date Of Join': ['2018-01-25', '2018-01-26', '2018-01-26', '2018-02-26','2018-03-16'],'Age': [23, 24, 34, 29, 40]})print(employees)print("\n Drop Column by Name \n")employees.drop('Age', axis=1, inplace=True)print(employees)print("\n Drop Column by Index \n")employees.drop(employees.columns[[0,1]], axis=1, inplace=True)print(employees)
OutPut
Age Date Of Join EmpCode Name Occupation0 23 2018-01-25 Emp001 John Chemist1 24 2018-01-26 Emp002 Doe Statistician2 34 2018-01-26 Emp003 William Statistician3 29 2018-02-26 Emp004 Spark Statistician4 40 2018-03-16 Emp005 Mark ProgrammerDrop Column by NameDate Of Join EmpCode Name Occupation0 2018-01-25 Emp001 John Chemist1 2018-01-26 Emp002 Doe Statistician2 2018-01-26 Emp003 William Statistician3 2018-02-26 Emp004 Spark Statistician4 2018-03-16 Emp005 Mark ProgrammerDrop Column by IndexName Occupation0 John Chemist1 Doe Statistician2 William Statistician3 Spark Statistician4 Mark Programmer
向 DataFrame 中新增列
import pandas as pdemployees = pd.DataFrame({'EmpCode': ['Emp001', 'Emp002', 'Emp003', 'Emp004', 'Emp005'],'Name': ['John', 'Doe', 'William', 'Spark', 'Mark'],'Occupation': ['Chemist', 'Statistician', 'Statistician','Statistician', 'Programmer'],'Date Of Join': ['2018-01-25', '2018-01-26', '2018-01-26', '2018-02-26','2018-03-16'],'Age': [23, 24, 34, 29, 40]})employees['City'] = ['London', 'Tokyo', 'Sydney', 'London', 'Toronto']print(employees)
OutPut
Age Date Of Join EmpCode Name Occupation City0 23 2018-01-25 Emp001 John Chemist London1 24 2018-01-26 Emp002 Doe Statistician Tokyo2 34 2018-01-26 Emp003 William Statistician Sydney3 29 2018-02-26 Emp004 Spark Statistician London4 40 2018-03-16 Emp005 Mark Programmer Toronto
如何从 DataFrame 中获取列标题列表
import pandas as pdemployees = pd.DataFrame({'EmpCode': ['Emp001', 'Emp002', 'Emp003', 'Emp004', 'Emp005'],'Name': ['John', 'Doe', 'William', 'Spark', 'Mark'],'Occupation': ['Chemist', 'Statistician', 'Statistician','Statistician', 'Programmer'],'Date Of Join': ['2018-01-25', '2018-01-26', '2018-01-26', '2018-02-26','2018-03-16'],'Age': [23, 24, 34, 29, 40]})print(list(employees))print(list(employees.columns.values))print(employees.columns.tolist())
OutPut
['Age', 'Date Of Join', 'EmpCode', 'Name', 'Occupation']['Age', 'Date Of Join', 'EmpCode', 'Name', 'Occupation']['Age', 'Date Of Join', 'EmpCode', 'Name', 'Occupation']
如何随机生成 DataFrame
import pandas as pdimport numpy as npnp.random.seed(5)df_random = pd.DataFrame(np.random.randint(100, size=(10, 6)),columns=list('ABCDEF'),index=['Row-{}'.format(i) for i in range(10)])print(df_random)s
OutPut
A B C D E FRow-0 99 78 61 16 73 8Row-1 62 27 30 80 7 76Row-2 15 53 80 27 44 77Row-3 75 65 47 30 84 86Row-4 18 9 41 62 1 82Row-5 16 78 5 58 0 80Row-6 4 36 51 27 31 2Row-7 68 38 83 19 18 7Row-8 30 62 11 67 65 55Row-9 3 91 78 27 29 33
如何选择 DataFrame 的多个列
import pandas as pdemployees = pd.DataFrame({'EmpCode': ['Emp001', 'Emp002', 'Emp003', 'Emp004', 'Emp005'],'Name': ['John', 'Doe', 'William', 'Spark', 'Mark'],'Occupation': ['Chemist', 'Statistician', 'Statistician','Statistician', 'Programmer'],'Date Of Join': ['2018-01-25', '2018-01-26', '2018-01-26', '2018-02-26','2018-03-16'],'Age': [23, 24, 34, 29, 40]})df = employees[['EmpCode', 'Age', 'Name']]print(df)
OutPut
EmpCode Age Name0 Emp001 23 John1 Emp002 24 Doe2 Emp003 34 William3 Emp004 29 Spark4 Emp005 40 Mark
如何将字典转换为 DataFrame
import pandas as pddata = ({'Age': [30, 20, 22, 40, 32, 28, 39],'Color': ['Blue', 'Green', 'Red', 'White', 'Gray', 'Black','Red'],'Food': ['Steak', 'Lamb', 'Mango', 'Apple', 'Cheese','Melon', 'Beans'],'Height': [165, 70, 120, 80, 180, 172, 150],'Score': [4.6, 8.3, 9.0, 3.3, 1.8, 9.5, 2.2],'State': ['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX']})print(data)df = pd.DataFrame(data)print(df)
OutPut
{'Height': [165, 70, 120, 80, 180, 172, 150], 'Food': ['Steak', 'Lamb', 'Mango','Apple', 'Cheese', 'Melon', 'Beans'], 'Age': [30, 20, 22, 40, 32, 28, 39], 'Score': [4.6, 8.3, 9.0, 3.3, 1.8, 9.5, 2.2], 'Color': ['Blue', 'Green', 'Red', 'White', 'Gray', 'Black', 'Red'], 'State': ['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX']}Age Color Food Height Score State0 30 Blue Steak 165 4.6 NY1 20 Green Lamb 70 8.3 TX2 22 Red Mango 120 9.0 FL3 40 White Apple 80 3.3 AL4 32 Gray Cheese 180 1.8 AK5 28 Black Melon 172 9.5 TX6 39 Red Beans 150 2.2 TX
使用 ioc 进行切片
import pandas as pddf = pd.DataFrame({'Age': [30, 20, 22, 40, 32, 28, 39],'Color': ['Blue', 'Green', 'Red', 'White', 'Gray', 'Black','Red'],'Food': ['Steak', 'Lamb', 'Mango', 'Apple', 'Cheese','Melon', 'Beans'],'Height': [165, 70, 120, 80, 180, 172, 150],'Score': [4.6, 8.3, 9.0, 3.3, 1.8, 9.5, 2.2],'State': ['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX']},index=['Jane', 'Nick', 'Aaron', 'Penelope', 'Dean','Christina', 'Cornelia'])print("\n -- Selecting a single row with .loc with a string -- \n")print(df.loc['Penelope'])print("\n -- Selecting multiple rows with .loc with a list of strings -- \n")print(df.loc[['Cornelia', 'Jane', 'Dean']])print("\n -- Selecting multiple rows with .loc with slice notation -- \n")print(df.loc['Aaron':'Dean'])
OutPut
-- Selecting a single row with .loc with a string --Age 40Color WhiteFood AppleHeight 80Score 3.3State ALName: Penelope, dtype: object-- Selecting multiple rows with .loc with a list of strings --Age Color Food Height Score StateCornelia 39 Red Beans 150 2.2 TXJane 30 Blue Steak 165 4.6 NYDean 32 Gray Cheese 180 1.8 AK-- Selecting multiple rows with .loc with slice notation --Age Color Food Height Score StateAaron 22 Red Mango 120 9.0 FLPenelope 40 White Apple 80 3.3 ALDean 32 Gray Cheese 180 1.8 AK
检查 DataFrame 中是否是空的
import pandas as pddf = pd.DataFrame()if df.empty:print('DataFrame is empty!')
OutPut
DataFrame is empty!
在创建 DataFrame 时指定索引和列名称
import pandas as pdvalues = ["India", "Canada", "Australia","Japan", "Germany", "France"]code = ["IND", "CAN", "AUS", "JAP", "GER", "FRA"]df = pd.DataFrame(values, index=code, columns=['Country'])print(df)
OutPut
CountryIND IndiaCAN CanadaAUS AustraliaJAP JapanGER GermanyFRA France
使用 iloc 进行切片
import pandas as pddf = pd.DataFrame({'Age': [30, 20, 22, 40, 32, 28, 39],'Color': ['Blue', 'Green', 'Red', 'White', 'Gray', 'Black','Red'],'Food': ['Steak', 'Lamb', 'Mango', 'Apple', 'Cheese','Melon', 'Beans'],'Height': [165, 70, 120, 80, 180, 172, 150],'Score': [4.6, 8.3, 9.0, 3.3, 1.8, 9.5, 2.2],'State': ['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX']},index=['Jane', 'Nick', 'Aaron', 'Penelope', 'Dean','Christina', 'Cornelia'])print("\n -- Selecting a single row with .iloc with an integer -- \n")print(df.iloc[4])print("\n -- Selecting multiple rows with .iloc with a list of integers -- \n")print(df.iloc[[2, -2]])print("\n -- Selecting multiple rows with .iloc with slice notation -- \n")print(df.iloc[:5:3])
OutPut
-- Selecting a single row with .iloc with an integer --Age 32Color GrayFood CheeseHeight 180Score 1.8State AKName: Dean, dtype: object-- Selecting multiple rows with .iloc with a list of integers --Age Color Food Height Score StateAaron 22 Red Mango 120 9.0 FLChristina 28 Black Melon 172 9.5 TX-- Selecting multiple rows with .iloc with slice notation --Age Color Food Height Score StateJane 30 Blue Steak 165 4.6 NYPenelope 40 White Apple 80 3.3 AL
iloc 和 loc 的区别
- loc 索引器还可以进行布尔选择,例如,如果我们想查找 Age 小于 30 的所有行并仅返回 Color 和 Height 列,我们可以执行以下操作。我们可以用 iloc 复制它,但我们不能将它传递给一个布尔系列,必须将布尔系列转换为 numpy 数组
- loc 从索引中获取具有特定标签的行(或列)
- iloc 在索引中的特定位置获取行(或列)(因此它只需要整数) ```python import pandas as pd
df = pd.DataFrame({‘Age’: [30, 20, 22, 40, 32, 28, 39], ‘Color’: [‘Blue’, ‘Green’, ‘Red’, ‘White’, ‘Gray’, ‘Black’, ‘Red’], ‘Food’: [‘Steak’, ‘Lamb’, ‘Mango’, ‘Apple’, ‘Cheese’, ‘Melon’, ‘Beans’], ‘Height’: [165, 70, 120, 80, 180, 172, 150], ‘Score’: [4.6, 8.3, 9.0, 3.3, 1.8, 9.5, 2.2], ‘State’: [‘NY’, ‘TX’, ‘FL’, ‘AL’, ‘AK’, ‘TX’, ‘TX’] }, index=[‘Jane’, ‘Nick’, ‘Aaron’, ‘Penelope’, ‘Dean’, ‘Christina’, ‘Cornelia’])
print(“\n — loc — \n”) print(df.loc[df[‘Age’] < 30, [‘Color’, ‘Height’]])
print(“\n — iloc — \n”) print(df.iloc[(df[‘Age’] < 30).values, [1, 3]])
<a name="Awmnw"></a>### OutPut```python-- loc --Color HeightNick Green 70Aaron Red 120Christina Black 172-- iloc --Color HeightNick Green 70Aaron Red 120Christina Black 172
使用时间索引创建空 DataFrame
import datetimeimport pandas as pdtodays_date = datetime.datetime.now().date()index = pd.date_range(todays_date, periods=10, freq='D')columns = ['A', 'B', 'C']df = pd.DataFrame(index=index, columns=columns)df = df.fillna(0)print(df)
OutPut
A B C2018-09-30 0 0 02018-10-01 0 0 02018-10-02 0 0 02018-10-03 0 0 02018-10-04 0 0 02018-10-05 0 0 02018-10-06 0 0 02018-10-07 0 0 02018-10-08 0 0 02018-10-09 0 0 0
如何改变 DataFrame 列的排序
import pandas as pddf = pd.DataFrame({'Age': [30, 20, 22, 40, 32, 28, 39],'Color': ['Blue', 'Green', 'Red', 'White', 'Gray', 'Black','Red'],'Food': ['Steak', 'Lamb', 'Mango', 'Apple', 'Cheese','Melon', 'Beans'],'Height': [165, 70, 120, 80, 180, 172, 150],'Score': [4.6, 8.3, 9.0, 3.3, 1.8, 9.5, 2.2],'State': ['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX']},index=['Jane', 'Nick', 'Aaron', 'Penelope', 'Dean','Christina', 'Cornelia'])print("\n -- Change order using columns -- \n")new_order = [3, 2, 1, 4, 5, 0]df = df[df.columns[new_order]]print(df)print("\n -- Change order using reindex -- \n")df = df.reindex(['State', 'Color', 'Age', 'Food', 'Score', 'Height'], axis=1)print(df)
OutPut
-- Change order using columns --Height Food Color Score State AgeJane 165 Steak Blue 4.6 NY 30Nick 70 Lamb Green 8.3 TX 20Aaron 120 Mango Red 9.0 FL 22Penelope 80 Apple White 3.3 AL 40Dean 180 Cheese Gray 1.8 AK 32Christina 172 Melon Black 9.5 TX 28Cornelia 150 Beans Red 2.2 TX 39-- Change order using reindex --State Color Age Food Score HeightJane NY Blue 30 Steak 4.6 165Nick TX Green 20 Lamb 8.3 70Aaron FL Red 22 Mango 9.0 120Penelope AL White 40 Apple 3.3 80Dean AK Gray 32 Cheese 1.8 180Christina TX Black 28 Melon 9.5 172Cornelia TX Red 39 Beans 2.2 150
检查 DataFrame 列的数据类型
import pandas as pddf = pd.DataFrame({'Age': [30, 20, 22, 40, 32, 28, 39],'Color': ['Blue', 'Green', 'Red', 'White', 'Gray', 'Black','Red'],'Food': ['Steak', 'Lamb', 'Mango', 'Apple', 'Cheese','Melon', 'Beans'],'Height': [165, 70, 120, 80, 180, 172, 150],'Score': [4.6, 8.3, 9.0, 3.3, 1.8, 9.5, 2.2],'State': ['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX']},index=['Jane', 'Nick', 'Aaron', 'Penelope', 'Dean','Christina', 'Cornelia'])print(df.dtypes)
OutPut
Age int64Color objectFood objectHeight int64Score float64State objectdtype: object
更改 DataFrame 指定列的数据类型
import pandas as pddf = pd.DataFrame({'Age': [30, 20, 22, 40, 32, 28, 39],'Color': ['Blue', 'Green', 'Red', 'White', 'Gray', 'Black','Red'],'Food': ['Steak', 'Lamb', 'Mango', 'Apple', 'Cheese','Melon', 'Beans'],'Height': [165, 70, 120, 80, 180, 172, 150],'Score': [4.6, 8.3, 9.0, 3.3, 1.8, 9.5, 2.2],'State': ['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX']},index=['Jane', 'Nick', 'Aaron', 'Penelope', 'Dean','Christina', 'Cornelia'])print(df.dtypes)df['Age'] = df['Age'].astype(str)print(df.dtypes)
OutPut
Age int64Color objectFood objectHeight int64Score float64State objectdtype: objectAge objectColor objectFood objectHeight int64Score float64State objectdtype: object
如何将列的数据类型转换为 DateTime 类型
import pandas as pddf = pd.DataFrame({'DateOFBirth': [1349720105, 1349806505, 1349892905,1349979305, 1350065705, 1349792905,1349730105],'State': ['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX']},index=['Jane', 'Nick', 'Aaron', 'Penelope', 'Dean','Christina', 'Cornelia'])print("\n----------------Before---------------\n")print(df.dtypes)print(df)df['DateOFBirth'] = pd.to_datetime(df['DateOFBirth'], unit='s')print("\n----------------After----------------\n")print(df.dtypes)print(df)
OutPut
----------------Before---------------DateOFBirth int64State objectdtype: objectDateOFBirth StateJane 1349720105 NYNick 1349806505 TXAaron 1349892905 FLPenelope 1349979305 ALDean 1350065705 AKChristina 1349792905 TXCornelia 1349730105 TX----------------After----------------DateOFBirth datetime64[ns]State objectdtype: objectDateOFBirth StateJane 2012-10-08 18:15:05 NYNick 2012-10-09 18:15:05 TXAaron 2012-10-10 18:15:05 FLPenelope 2012-10-11 18:15:05 ALDean 2012-10-12 18:15:05 AKChristina 2012-10-09 14:28:25 TXCornelia 2012-10-08 21:01:45 TX
将 DataFrame 列从 floats 转为 ints
import pandas as pddf = pd.DataFrame({'DailyExp': [75.7, 56.69, 55.69, 96.5, 84.9, 110.5,58.9],'State': ['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX']},index=['Jane', 'Nick', 'Aaron', 'Penelope', 'Dean','Christina', 'Cornelia'])print("\n----------------Before---------------\n")print(df.dtypes)print(df)df['DailyExp'] = df['DailyExp'].astype(int)print("\n----------------After----------------\n")print(df.dtypes)print(df)
OutPut
----------------Before---------------DailyExp float64State objectdtype: objectDailyExp StateJane 75.70 NYNick 56.69 TXAaron 55.69 FLPenelope 96.50 ALDean 84.90 AKChristina 110.50 TXCornelia 58.90 TX----------------After----------------DailyExp int32State objectdtype: objectDailyExp StateJane 75 NYNick 56 TXAaron 55 FLPenelope 96 ALDean 84 AKChristina 110 TXCornelia 58 TX
如何把 dates 列转换为 DateTime 类型
import pandas as pddf = pd.DataFrame({'DateOfBirth': ['1986-11-11', '1999-05-12', '1976-01-01','1986-06-01', '1983-06-04', '1990-03-07','1999-07-09'],'State': ['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX']},index=['Jane', 'Nick', 'Aaron', 'Penelope', 'Dean','Christina', 'Cornelia'])print("\n----------------Before---------------\n")print(df.dtypes)df['DateOfBirth'] = df['DateOfBirth'].astype('datetime64')print("\n----------------After----------------\n")print(df.dtypes)
OutPut
----------------Before---------------DateOfBirth objectState objectdtype: object----------------After----------------DateOfBirth datetime64[ns]State objectdtype: object
两个 DataFrame 相加
import pandas as pddf1 = pd.DataFrame({'Age': [30, 20, 22, 40], 'Height': [165, 70, 120, 80],'Score': [4.6, 8.3, 9.0, 3.3], 'State': ['NY', 'TX','FL', 'AL']},index=['Jane', 'Nick', 'Aaron', 'Penelope'])df2 = pd.DataFrame({'Age': [32, 28, 39], 'Color': ['Gray', 'Black', 'Red'],'Food': ['Cheese', 'Melon', 'Beans'],'Score': [1.8, 9.5, 2.2], 'State': ['AK', 'TX', 'TX']},index=['Dean', 'Christina', 'Cornelia'])df3 = df1.append(df2, sort=True)print(df3)
OutPut
Age Color Food Height Score StateJane 30 NaN NaN 165.0 4.6 NYNick 20 NaN NaN 70.0 8.3 TXAaron 22 NaN NaN 120.0 9.0 FLPenelope 40 NaN NaN 80.0 3.3 ALDean 32 Gray Cheese NaN 1.8 AKChristina 28 Black Melon NaN 9.5 TXCornelia 39 Red Beans NaN 2.2 TX
在 DataFrame 末尾添加额外的行
import pandas as pdemployees = pd.DataFrame({'EmpCode': ['Emp001', 'Emp002', 'Emp003', 'Emp004', 'Emp005'],'Name': ['John', 'Doe', 'William', 'Spark', 'Mark'],'Occupation': ['Chemist', 'Statistician', 'Statistician','Statistician', 'Programmer'],'Date Of Join': ['2018-01-25', '2018-01-26', '2018-01-26', '2018-02-26','2018-03-16'],'Age': [23, 24, 34, 29, 40]})print("\n------------ BEFORE ----------------\n")print(employees)employees.loc[len(employees)] = [45, '2018-01-25', 'Emp006', 'Sunny','Programmer']print("\n------------ AFTER ----------------\n")print(employees)
OutPut
------------ BEFORE ----------------Age Date Of Join EmpCode Name Occupation0 23 2018-01-25 Emp001 John Chemist1 24 2018-01-26 Emp002 Doe Statistician2 34 2018-01-26 Emp003 William Statistician3 29 2018-02-26 Emp004 Spark Statistician4 40 2018-03-16 Emp005 Mark Programmer------------ AFTER ----------------Age Date Of Join EmpCode Name Occupation0 23 2018-01-25 Emp001 John Chemist1 24 2018-01-26 Emp002 Doe Statistician2 34 2018-01-26 Emp003 William Statistician3 29 2018-02-26 Emp004 Spark Statistician4 40 2018-03-16 Emp005 Mark Programmer5 45 2018-01-25 Emp006 Sunny Programmer
为指定索引添加新行
import pandas as pdemployees = pd.DataFrame(data={'Name': ['John Doe', 'William Spark'],'Occupation': ['Chemist', 'Statistician'],'Date Of Join': ['2018-01-25', '2018-01-26'],'Age': [23, 24]},index=['Emp001', 'Emp002'],columns=['Name', 'Occupation', 'Date Of Join', 'Age'])print("\n------------ BEFORE ----------------\n")print(employees)employees.loc['Emp003'] = ['Sunny', 'Programmer', '2018-01-25', 45]print("\n------------ AFTER ----------------\n")print(employees)
OutPut
------------ BEFORE ----------------Name Occupation Date Of Join AgeEmp001 John Doe Chemist 2018-01-25 23Emp002 William Spark Statistician 2018-01-26 24------------ AFTER ----------------Name Occupation Date Of Join AgeEmp001 John Doe Chemist 2018-01-25 23Emp002 William Spark Statistician 2018-01-26 24Emp003 Sunny Programmer 2018-01-25 45
如何使用 for 循环添加行
import pandas as pdcols = ['Zip']lst = []zip = 32100for a in range(10):lst.append([zip])zip = zip + 1df = pd.DataFrame(lst, columns=cols)print(df)
OutPut
Zip0 321001 321012 321023 321034 321045 321056 321067 321078 321089 32109
在 DataFrame 顶部添加一行
import pandas as pdemployees = pd.DataFrame({'EmpCode': ['Emp002', 'Emp003', 'Emp004'],'Name': ['John', 'Doe', 'William'],'Occupation': ['Chemist', 'Statistician', 'Statistician'],'Date Of Join': ['2018-01-25', '2018-01-26', '2018-01-26'],'Age': [23, 24, 34]})print("\n------------ BEFORE ----------------\n")print(employees)# New lineline = pd.DataFrame({'Name': 'Dean', 'Age': 45, 'EmpCode': 'Emp001','Date Of Join': '2018-02-26', 'Occupation': 'Chemist'}, index=[0])# Concatenate two dataframeemployees = pd.concat([line,employees.ix[:]]).reset_index(drop=True)print("\n------------ AFTER ----------------\n")print(employees)
OutPut
------------ BEFORE ----------------Age Date Of Join EmpCode Name Occupation0 23 2018-01-25 Emp002 John Chemist1 24 2018-01-26 Emp003 Doe Statistician2 34 2018-01-26 Emp004 William Statistician------------ AFTER ----------------Age Date Of Join EmpCode Name Occupation0 45 2018-02-26 Emp001 Dean Chemist1 23 2018-01-25 Emp002 John Chemist2 24 2018-01-26 Emp003 Doe Statistician3 34 2018-01-26 Emp004 William Statistician
如何向 DataFrame 中动态添加行
import pandas as pddf = pd.DataFrame(columns=['Name', 'Age'])df.loc[1, 'Name'] = 'Rocky'df.loc[1, 'Age'] = 23df.loc[2, 'Name'] = 'Sunny'print(df)
OutPut
Name Age1 Rocky 232 Sunny NaN
在任意位置插入行
import pandas as pddf = pd.DataFrame(columns=['Name', 'Age'])df.loc[1, 'Name'] = 'Rocky'df.loc[1, 'Age'] = 21df.loc[2, 'Name'] = 'Sunny'df.loc[2, 'Age'] = 22df.loc[3, 'Name'] = 'Mark'df.loc[3, 'Age'] = 25df.loc[4, 'Name'] = 'Taylor'df.loc[4, 'Age'] = 28print("\n------------ BEFORE ----------------\n")print(df)line = pd.DataFrame({"Name": "Jack", "Age": 24}, index=[2.5])df = df.append(line, ignore_index=False)df = df.sort_index().reset_index(drop=True)df = df.reindex(['Name', 'Age'], axis=1)print("\n------------ AFTER ----------------\n")print(df)
OutPut
------------ BEFORE ----------------Name Age1 Rocky 212 Sunny 223 Mark 254 Taylor 28------------ AFTER ----------------Name Age0 Rocky 211 Sunny 222 Jack 243 Mark 254 Taylor 28
使用时间戳索引向 DataFrame 中添加行
import pandas as pddf = pd.DataFrame(columns=['Name', 'Age'])df.loc['2014-05-01 18:47:05', 'Name'] = 'Rocky'df.loc['2014-05-01 18:47:05', 'Age'] = 21df.loc['2014-05-02 18:47:05', 'Name'] = 'Sunny'df.loc['2014-05-02 18:47:05', 'Age'] = 22df.loc['2014-05-03 18:47:05', 'Name'] = 'Mark'df.loc['2014-05-03 18:47:05', 'Age'] = 25print("\n------------ BEFORE ----------------\n")print(df)line = pd.to_datetime("2014-05-01 18:50:05", format="%Y-%m-%d %H:%M:%S")new_row = pd.DataFrame([['Bunny', 26]], columns=['Name', 'Age'], index=[line])df = pd.concat([df, pd.DataFrame(new_row)], ignore_index=False)print("\n------------ AFTER ----------------\n")print(df)
OutPut
------------ BEFORE ----------------Name Age2014-05-01 18:47:05 Rocky 212014-05-02 18:47:05 Sunny 222014-05-03 18:47:05 Mark 25------------ AFTER ----------------Name Age2014-05-01 18:47:05 Rocky 212014-05-02 18:47:05 Sunny 222014-05-03 18:47:05 Mark 252014-05-01 18:50:05 Bunny 26
为不同的行填充缺失值
import pandas as pda = {'A': 10, 'B': 20}b = {'B': 30, 'C': 40, 'D': 50}df1 = pd.DataFrame(a, index=[0])df2 = pd.DataFrame(b, index=[1])df = pd.DataFrame()df = df.append(df1)df = df.append(df2).fillna(0)print(df)
OutPut
A B C D0 10.0 20 0.0 0.01 0.0 30 40.0 50.0
append, concat 和 combine_first 示例
import pandas as pda = {'A': 10, 'B': 20}b = {'B': 30, 'C': 40, 'D': 50}df1 = pd.DataFrame(a, index=[0])df2 = pd.DataFrame(b, index=[1])d1 = pd.DataFrame()d1 = d1.append(df1)d1 = d1.append(df2).fillna(0)print("\n------------ append ----------------\n")print(d1)d2 = pd.concat([df1, df2]).fillna(0)print("\n------------ concat ----------------\n")print(d2)d3 = pd.DataFrame()d3 = d3.combine_first(df1).combine_first(df2).fillna(0)print("\n------------ combine_first ----------------\n")print(d3)
OutPut
------------ append ----------------A B C D0 10.0 20 0.0 0.01 0.0 30 40.0 50.0------------ concat ----------------A B C D0 10.0 20 0.0 0.01 0.0 30 40.0 50.0------------ combine_first ----------------A B C D0 10.0 20.0 0.0 0.01 0.0 30.0 40.0 50.0
获取行和列的平均值
import pandas as pddf = pd.DataFrame([[10, 20, 30, 40], [7, 14, 21, 28], [5, 5, 0, 0]],columns=['Apple', 'Orange', 'Banana', 'Pear'],index=['Basket1', 'Basket2', 'Basket3'])df['Mean Basket'] = df.mean(axis=1)df.loc['Mean Fruit'] = df.mean()print(df)
OutPut
Apple Orange Banana Pear Mean BasketBasket1 10.000000 20.0 30.0 40.000000 25.0Basket2 7.000000 14.0 21.0 28.000000 17.5Basket3 5.000000 5.0 0.0 0.000000 2.5Mean Fruit 7.333333 13.0 17.0 22.666667 15.0
计算行和列的总和
import pandas as pddf = pd.DataFrame([[10, 20, 30, 40], [7, 14, 21, 28], [5, 5, 0, 0]],columns=['Apple', 'Orange', 'Banana', 'Pear'],index=['Basket1', 'Basket2', 'Basket3'])df['Sum Basket'] = df.sum(axis=1)df.loc['Sum Fruit'] = df.sum()print(df)
OutPut
Apple Orange Banana Pear Sum BasketBasket1 10 20 30 40 100Basket2 7 14 21 28 70Basket3 5 5 0 0 10Sum Fruit 22 39 51 68 180
连接两列
import pandas as pddf = pd.DataFrame(columns=['Name', 'Age'])df.loc[1, 'Name'] = 'Rocky'df.loc[1, 'Age'] = 21df.loc[2, 'Name'] = 'Sunny'df.loc[2, 'Age'] = 22df.loc[3, 'Name'] = 'Mark'df.loc[3, 'Age'] = 25df.loc[4, 'Name'] = 'Taylor'df.loc[4, 'Age'] = 28print('\n------------ BEFORE ----------------\n')print(df)df['Employee'] = df['Name'].map(str) + ' - ' + df['Age'].map(str)df = df.reindex(['Employee'], axis=1)print('\n------------ AFTER ----------------\n')print(df)
OutPut
------------ BEFORE ----------------Name Age1 Rocky 212 Sunny 223 Mark 254 Taylor 28------------ AFTER ----------------Employee1 Rocky - 212 Sunny - 223 Mark - 254 Taylor - 28
过滤包含某字符串的行
import pandas as pddf = pd.DataFrame({'DateOfBirth': ['1986-11-11', '1999-05-12', '1976-01-01','1986-06-01', '1983-06-04', '1990-03-07','1999-07-09'],'State': ['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX']},index=['Jane', 'Nick', 'Aaron', 'Penelope', 'Dean','Christina', 'Cornelia'])print(df)print("\n---- Filter with State contains TX ----\n")df1 = df[df['State'].str.contains("TX")]print(df1)
OutPut
DateOfBirth StateJane 1986-11-11 NYNick 1999-05-12 TXAaron 1976-01-01 FLPenelope 1986-06-01 ALDean 1983-06-04 AKChristina 1990-03-07 TXCornelia 1999-07-09 TX---- Filter with State contains TX ----DateOfBirth StateNick 1999-05-12 TXChristina 1990-03-07 TXCornelia 1999-07-09 TX
过滤索引中包含某字符串的行
import pandas as pddf = pd.DataFrame({'DateOfBirth': ['1986-11-11', '1999-05-12', '1976-01-01','1986-06-01', '1983-06-04', '1990-03-07','1999-07-09'],'State': ['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX']},index=['Jane', 'Pane', 'Aaron', 'Penelope', 'Frane','Christina', 'Cornelia'])print(df)print("\n---- Filter Index contains ane ----\n")df.index = df.index.astype('str')df1 = df[df.index.str.contains('ane')]print(df1)
OutPut
DateOfBirth StateJane 1986-11-11 NYPane 1999-05-12 TXAaron 1976-01-01 FLPenelope 1986-06-01 ALFrane 1983-06-04 AKChristina 1990-03-07 TXCornelia 1999-07-09 TX---- Filter Index contains ane ----DateOfBirth StateJane 1986-11-11 NYPane 1999-05-12 TXFrane 1983-06-04 AK
使用 AND 运算符过滤包含特定字符串值的行
import pandas as pddf = pd.DataFrame({'DateOfBirth': ['1986-11-11', '1999-05-12', '1976-01-01','1986-06-01', '1983-06-04', '1990-03-07','1999-07-09'],'State': ['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX']},index=['Jane', 'Pane', 'Aaron', 'Penelope', 'Frane','Christina', 'Cornelia'])print(df)print("\n---- Filter DataFrame using & ----\n")df.index = df.index.astype('str')df1 = df[df.index.str.contains('ane') & df['State'].str.contains("TX")]print(df1)
OutPut
DateOfBirth StateJane 1986-11-11 NYPane 1999-05-12 TXAaron 1976-01-01 FLPenelope 1986-06-01 ALFrane 1983-06-04 AKChristina 1990-03-07 TXCornelia 1999-07-09 TX---- Filter DataFrame using & ----DateOfBirth StatePane 1999-05-12 TX
查找包含某字符串的所有行
import pandas as pddf = pd.DataFrame({'DateOfBirth': ['1986-11-11', '1999-05-12', '1976-01-01','1986-06-01', '1983-06-04', '1990-03-07','1999-07-09'],'State': ['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX']},index=['Jane', 'Pane', 'Aaron', 'Penelope', 'Frane','Christina', 'Cornelia'])print(df)print("\n---- Filter DataFrame using & ----\n")df.index = df.index.astype('str')df1 = df[df.index.str.contains('ane') | df['State'].str.contains("TX")]print(df1)
OutPut
DateOfBirth StateJane 1986-11-11 NYPane 1999-05-12 TXAaron 1976-01-01 FLPenelope 1986-06-01 ALFrane 1983-06-04 AKChristina 1990-03-07 TXCornelia 1999-07-09 TX---- Filter DataFrame using & ----DateOfBirth StateJane 1986-11-11 NYPane 1999-05-12 TXFrane 1983-06-04 AKChristina 1990-03-07 TXCornelia 1999-07-09 TX
如果行中的值包含字符串,则创建与字符串相等的另一列
import pandas as pdimport numpy as npdf = pd.DataFrame({'EmpCode': ['Emp001', 'Emp002', 'Emp003', 'Emp004', 'Emp005'],'Name': ['John', 'Doe', 'William', 'Spark', 'Mark'],'Occupation': ['Chemist', 'Accountant', 'Statistician','Statistician', 'Programmer'],'Date Of Join': ['2018-01-25', '2018-01-26', '2018-01-26', '2018-02-26','2018-03-16'],'Age': [23, 24, 34, 29, 40]})df['Department'] = pd.np.where(df.Occupation.str.contains("Chemist"), "Science",pd.np.where(df.Occupation.str.contains("Statistician"), "Economics",pd.np.where(df.Occupation.str.contains("Programmer"), "Computer", "General")))print(df)
OutPut
Age Date Of Join EmpCode Name Occupation Department0 23 2018-01-25 Emp001 John Chemist Science1 24 2018-01-26 Emp002 Doe Accountant General2 34 2018-01-26 Emp003 William Statistician Economics3 29 2018-02-26 Emp004 Spark Statistician Economics4 40 2018-03-16 Emp005 Mark Programmer Computer
计算 pandas group 中每组的行数
import pandas as pddf = pd.DataFrame([[10, 20, 30, 40], [7, 14, 21, 28], [5, 5, 0, 0],[6, 6, 6, 6], [8, 8, 8, 8], [5, 5, 0, 0]],columns=['Apple', 'Orange', 'Rice', 'Oil'],index=['Basket1', 'Basket2', 'Basket3','Basket4', 'Basket5', 'Basket6'])print(df)print("\n ----------------------------- \n")print(df[['Apple', 'Orange', 'Rice', 'Oil']].groupby(['Apple']).agg(['mean', 'count']))
OutPut
Apple Orange Rice OilBasket1 10 20 30 40Basket2 7 14 21 28Basket3 5 5 0 0Basket4 6 6 6 6Basket5 8 8 8 8Basket6 5 5 0 0-----------------------------Orange Rice Oilmean count mean count mean countApple5 5 2 0 2 0 26 6 1 6 1 6 17 14 1 21 1 28 18 8 1 8 1 8 110 20 1 30 1 40 1
检查字符串是否在 DataFrme 中
import pandas as pddf = pd.DataFrame({'DateOfBirth': ['1986-11-11', '1999-05-12', '1976-01-01','1986-06-01', '1983-06-04', '1990-03-07','1999-07-09'],'State': ['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX']},index=['Jane', 'Pane', 'Aaron', 'Penelope', 'Frane','Christina', 'Cornelia'])if df['State'].str.contains('TX').any():print("TX is there")
OutPut
TX is there
从 DataFrame 列中获取唯一行值
import pandas as pddf = pd.DataFrame({'State': ['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX']},index=['Jane', 'Nick', 'Aaron', 'Penelope', 'Dean','Christina', 'Cornelia'])print(df)print("\n----------------\n")print(df["State"].unique())
OutPut
StateJane NYNick TXAaron FLPenelope ALDean AKChristina TXCornelia TX----------------['NY' 'TX' 'FL' 'AL' 'AK']
计算 DataFrame 列的不同值
import pandas as pddf = pd.DataFrame({'Age': [30, 20, 22, 40, 20, 30, 20, 25],'Height': [165, 70, 120, 80, 162, 72, 124, 81],'Score': [4.6, 8.3, 9.0, 3.3, 4, 8, 9, 3],'State': ['NY', 'TX', 'FL', 'AL', 'NY', 'TX', 'FL', 'AL']},index=['Jane', 'Nick', 'Aaron', 'Penelope', 'Jaane', 'Nicky', 'Armour', 'Ponting'])print(df.Age.value_counts())
OutPut
20 330 225 122 140 1Name: Age, dtype: int64
删除具有重复索引的行
import pandas as pddf = pd.DataFrame({'Age': [30, 30, 22, 40, 20, 30, 20, 25],'Height': [165, 165, 120, 80, 162, 72, 124, 81],'Score': [4.6, 4.6, 9.0, 3.3, 4, 8, 9, 3],'State': ['NY', 'NY', 'FL', 'AL', 'NY', 'TX', 'FL', 'AL']},index=['Jane', 'Jane', 'Aaron', 'Penelope', 'Jaane', 'Nicky','Armour', 'Ponting'])print("\n -------- Duplicate Rows ----------- \n")print(df)df1 = df.reset_index().drop_duplicates(subset='index',keep='first').set_index('index')print("\n ------- Unique Rows ------------ \n")print(df1)
OutPut
-------- Duplicate Rows -----------Age Height Score StateJane 30 165 4.6 NYJane 30 165 4.6 NYAaron 22 120 9.0 FLPenelope 40 80 3.3 ALJaane 20 162 4.0 NYNicky 30 72 8.0 TXArmour 20 124 9.0 FLPonting 25 81 3.0 AL------- Unique Rows ------------Age Height Score StateindexJane 30 165 4.6 NYAaron 22 120 9.0 FLPenelope 40 80 3.3 ALJaane 20 162 4.0 NYNicky 30 72 8.0 TXArmour 20 124 9.0 FLPonting 25 81 3.0 AL
删除某些列具有重复值的行
import pandas as pddf = pd.DataFrame({'Age': [30, 40, 30, 40, 30, 30, 20, 25],'Height': [120, 162, 120, 120, 120, 72, 120, 81],'Score': [4.6, 4.6, 9.0, 3.3, 4, 8, 9, 3],'State': ['NY', 'NY', 'FL', 'AL', 'NY', 'TX', 'FL', 'AL']},index=['Jane', 'Jane', 'Aaron', 'Penelope', 'Jaane', 'Nicky','Armour', 'Ponting'])print("\n -------- Duplicate Rows ----------- \n")print(df)df1 = df.reset_index().drop_duplicates(subset=['Age','Height'],keep='first').set_index('index')print("\n ------- Unique Rows ------------ \n")print(df1)
OutPut
-------- Duplicate Rows -----------Age Height Score StateJane 30 120 4.6 NYJane 40 162 4.6 NYAaron 30 120 9.0 FLPenelope 40 120 3.3 ALJaane 30 120 4.0 NYNicky 30 72 8.0 TXArmour 20 120 9.0 FLPonting 25 81 3.0 AL------- Unique Rows ------------Age Height Score StateindexJane 30 120 4.6 NYJane 40 162 4.6 NYPenelope 40 120 3.3 ALNicky 30 72 8.0 TXArmour 20 120 9.0 FLPonting 25 81 3.0 AL
从 DataFrame 单元格中获取值
import pandas as pddf = pd.DataFrame({'Age': [30, 40, 30, 40, 30, 30, 20, 25],'Height': [120, 162, 120, 120, 120, 72, 120, 81],'Score': [4.6, 4.6, 9.0, 3.3, 4, 8, 9, 3],'State': ['NY', 'NY', 'FL', 'AL', 'NY', 'TX', 'FL', 'AL']},index=['Jane', 'Jane', 'Aaron', 'Penelope', 'Jaane', 'Nicky','Armour', 'Ponting'])print(df.loc['Nicky', 'Age'])
OutPut
30
使用 DataFrame 中的条件索引获取单元格上的标量值
import pandas as pddf = pd.DataFrame({'Age': [30, 40, 30, 40, 30, 30, 20, 25],'Height': [120, 162, 120, 120, 120, 72, 120, 81],'Score': [4.6, 4.6, 9.0, 3.3, 4, 8, 9, 3],'State': ['NY', 'NY', 'FL', 'AL', 'NY', 'TX', 'FL', 'AL']},index=['Jane', 'Jane', 'Aaron', 'Penelope', 'Jaane', 'Nicky','Armour', 'Ponting'])print("\nGet Height where Age is 20")print(df.loc[df['Age'] == 20, 'Height'].values[0])print("\nGet State where Age is 30")print(df.loc[df['Age'] == 30, 'State'].values[0])
OutPut
Get Height where Age is 20120Get State where Age is 30NY
设置 DataFrame 的特定单元格值
import pandas as pddf = pd.DataFrame({'Age': [30, 40, 30, 40, 30, 30, 20, 25],'Height': [120, 162, 120, 120, 120, 72, 120, 81]},index=['Jane', 'Jane', 'Aaron', 'Penelope', 'Jaane', 'Nicky','Armour', 'Ponting'])print("\n--------------Before------------\n")print(df)df.iat[0, 0] = 90df.iat[0, 1] = 91df.iat[1, 1] = 92df.iat[2, 1] = 93df.iat[7, 1] = 99print("\n--------------After------------\n")print(df)
OutPut
--------------Before------------Age HeightJane 30 120Jane 40 162Aaron 30 120Penelope 40 120Jaane 30 120Nicky 30 72Armour 20 120Ponting 25 81--------------After------------Age HeightJane 90 91Jane 40 92Aaron 30 93Penelope 40 120Jaane 30 120Nicky 30 72Armour 20 120Ponting 25 99
从 DataFrame 行获取单元格值
import pandas as pddf = pd.DataFrame({'Age': [30, 40, 30, 40, 30, 30, 20, 25],'Height': [120, 162, 120, 120, 120, 72, 120, 81]},index=['Jane', 'Jane', 'Aaron', 'Penelope', 'Jaane', 'Nicky','Armour', 'Ponting'])print(df.loc[df.Age == 30,'Height'].tolist())
OutPut
[120, 120, 120, 72]
用字典替换 DataFrame 列中的值
import pandas as pddf = pd.DataFrame({'State': ['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX']},index=['Jane', 'Nick', 'Aaron', 'Penelope', 'Dean','Christina', 'Cornelia'])print(df)dict = {"NY": 1, "TX": 2, "FL": 3, "AL": 4, "AK": 5}df1 = df.replace({"State": dict})print("\n\n")print(df1)
OutPut
StateJane NYNick TXAaron FLPenelope ALDean AKChristina TXCornelia TXStateJane 1Nick 2Aaron 3Penelope 4Dean 5Christina 2Cornelia 2
统计基于某一列的一列的数值
import pandas as pddf = pd.DataFrame({'DateOfBirth': ['1986-11-11', '1999-05-12', '1976-01-01','1986-06-01', '1983-06-04', '1990-03-07','1999-07-09'],'State': ['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX']},index=['Jane', 'Nick', 'Aaron', 'Penelope', 'Dean','Christina', 'Cornelia'])print(df.groupby('State').DateOfBirth.nunique())
OutPut
StateAK 1AL 1FL 1NY 1TX 3Name: DateOfBirth, dtype: int64
处理 DataFrame 中的缺失值
import pandas as pddf = pd.DataFrame([[10, 20, 30, 40], [7, 14, 21, 28], [5,]],columns=['Apple', 'Orange', 'Banana', 'Pear'],index=['Basket1', 'Basket2', 'Basket3'])print("\n--------- DataFrame ---------\n")print(df)print("\n--------- Use of isnull() ---------\n")print(df.isnull())print("\n--------- Use of notnull() ---------\n")print(df.notnull())
OutPut
--------- DataFrame ---------Apple Orange Banana PearBasket1 10 20.0 30.0 40.0Basket2 7 14.0 21.0 28.0Basket3 5 NaN NaN NaN--------- Use of isnull() ---------Apple Orange Banana PearBasket1 False False False FalseBasket2 False False False FalseBasket3 False True True True--------- Use of notnull() ---------Apple Orange Banana PearBasket1 True True True TrueBasket2 True True True TrueBasket3 True False False False
删除包含任何缺失数据的行
import pandas as pddf = pd.DataFrame([[10, 20, 30, 40], [7, 14, 21, 28], [5,]],columns=['Apple', 'Orange', 'Banana', 'Pear'],index=['Basket1', 'Basket2', 'Basket3'])print("\n--------- DataFrame ---------\n")print(df)print("\n--------- Use of dropna() ---------\n")print(df.dropna())
OutPut
--------- DataFrame ---------Apple Orange Banana PearBasket1 10 20.0 30.0 40.0Basket2 7 14.0 21.0 28.0Basket3 5 NaN NaN NaN--------- Use of dropna() ---------Apple Orange Banana PearBasket1 10 20.0 30.0 40.0Basket2 7 14.0 21.0 28.0
删除 DataFrame 中缺失数据的列
import pandas as pddf = pd.DataFrame([[10, 20, 30, 40], [7, 14, 21, 28], [5,]],columns=['Apple', 'Orange', 'Banana', 'Pear'],index=['Basket1', 'Basket2', 'Basket3'])print("\n--------- DataFrame ---------\n")print(df)print("\n--------- Drop Columns) ---------\n")print(df.dropna(1))
OutPut
--------- DataFrame ---------Apple Orange Banana PearBasket1 10 20.0 30.0 40.0Basket2 7 14.0 21.0 28.0Basket3 5 NaN NaN NaN--------- Drop Columns) ---------AppleBasket1 10Basket2 7Basket3 5
按降序对索引值进行排序
import pandas as pddf = pd.DataFrame({'DateOfBirth': ['1986-11-11', '1999-05-12', '1976-01-01','1986-06-01', '1983-06-04', '1990-03-07','1999-07-09'],'State': ['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX']},index=['Jane', 'Pane', 'Aaron', 'Penelope', 'Frane','Christina', 'Cornelia'])print(df.sort_index(ascending=False))
OutPut
DateOfBirth StatePenelope 1986-06-01 ALPane 1999-05-12 TXJane 1986-11-11 NYFrane 1983-06-04 AKCornelia 1999-07-09 TXChristina 1990-03-07 TXAaron 1976-01-01 FL
按降序对列进行排序
import pandas as pdemployees = pd.DataFrame({'EmpCode': ['Emp001', 'Emp002', 'Emp003', 'Emp004', 'Emp005'],'Name': ['John', 'Doe', 'William', 'Spark', 'Mark'],'Occupation': ['Chemist', 'Statistician', 'Statistician','Statistician', 'Programmer'],'Date Of Join': ['2018-01-25', '2018-01-26', '2018-01-26', '2018-02-26','2018-03-16'],'Age': [23, 24, 34, 29, 40]})print(employees.sort_index(axis=1, ascending=False))
OutPut
Occupation Name EmpCode Date Of Join Age0 Chemist John Emp001 2018-01-25 231 Statistician Doe Emp002 2018-01-26 242 Statistician William Emp003 2018-01-26 343 Statistician Spark Emp004 2018-02-26 294 Programmer Mark Emp005 2018-03-16 40
使用 rank 方法查找 DataFrame 中元素的排名
import pandas as pddf = pd.DataFrame([[10, 20, 30, 40], [7, 14, 21, 28], [5, 5, 0, 0]],columns=['Apple', 'Orange', 'Banana', 'Pear'],index=['Basket1', 'Basket2', 'Basket3'])print("\n--------- DataFrame Values--------\n")print(df)print("\n--------- DataFrame Values by Rank--------\n")print(df.rank())
OutPut
--------- DataFrame Values--------Apple Orange Banana PearBasket1 10 20 30 40Basket2 7 14 21 28Basket3 5 5 0 0--------- DataFrame Values by Rank--------Apple Orange Banana PearBasket1 3.0 3.0 3.0 3.0Basket2 2.0 2.0 2.0 2.0Basket3 1.0 1.0 1.0 1.0
在多列上设置索引
import pandas as pdemployees = pd.DataFrame({'EmpCode': ['Emp001', 'Emp002', 'Emp003', 'Emp004', 'Emp005'],'Name': ['John', 'Doe', 'William', 'Spark', 'Mark'],'Occupation': ['Chemist', 'Statistician', 'Statistician','Statistician', 'Programmer'],'Date Of Join': ['2018-01-25', '2018-01-26', '2018-01-26', '2018-02-26','2018-03-16'],'Age': [23, 24, 34, 29, 40]})print("\n --------- Before Index ----------- \n")print(employees)print("\n --------- Multiple Indexing ----------- \n")print(employees.set_index(['Occupation', 'Age']))
OutPut
Date Of Join EmpCode NameOccupation AgeChemist 23 2018-01-25 Emp001 JohnStatistician 24 2018-01-26 Emp002 Doe34 2018-01-26 Emp003 William29 2018-02-26 Emp004 SparkProgrammer 40 2018-03-16 Emp005 Mark
确定 DataFrame 的周期索引和列
import pandas as pdvalues = ["India", "Canada", "Australia","Japan", "Germany", "France"]pidx = pd.period_range('2015-01-01', periods=6)df = pd.DataFrame(values, index=pidx, columns=['Country'])print(df)
OutPut
Country2015-01-01 India2015-01-02 Canada2015-01-03 Australia2015-01-04 Japan2015-01-05 Germany2015-01-06 France
导入 CSV 指定特定索引
import pandas as pddf = pd.read_csv('test.csv', index_col="DateTime")print(df)
OutPut
Wheat Rice OilDateTime10/10/2016 10.500 12.500 16.50010/11/2016 11.250 12.750 17.15010/12/2016 10.000 13.150 15.50010/13/2016 12.000 14.500 16.10010/14/2016 13.000 14.825 15.60010/15/2016 13.075 15.465 15.31510/16/2016 13.650 16.105 15.03010/17/2016 14.225 16.745 14.74510/18/2016 14.800 17.385 14.46010/19/2016 15.375 18.025 14.175
将 DataFrame 写入 csv
import pandas as pddf = pd.DataFrame({'DateOfBirth': ['1986-11-11', '1999-05-12', '1976-01-01','1986-06-01', '1983-06-04', '1990-03-07','1999-07-09'],'State': ['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX']},index=['Jane', 'Pane', 'Aaron', 'Penelope', 'Frane','Christina', 'Cornelia'])df.to_csv('test.csv', encoding='utf-8', index=True)
OutPut
检查本地文件
使用 Pandas 读取 csv 文件的特定列
import pandas as pddf = pd.read_csv("test.csv", usecols = ['Wheat','Oil'])print(df)
Pandas 获取 CSV 列的列表
import pandas as pdcols = list(pd.read_csv("test.csv", nrows =1))print(cols)
OutPut
['DateTime', 'Wheat', 'Rice', 'Oil']
找到列值最大的行
import pandas as pddf = pd.DataFrame([[10, 20, 30, 40], [7, 14, 21, 28], [55, 15, 8, 12]],columns=['Apple', 'Orange', 'Banana', 'Pear'],index=['Basket1', 'Basket2', 'Basket3'])print(df.ix[df['Apple'].idxmax()])
OutPut
Apple 55Orange 15Banana 8Pear 12Name: Basket3, dtype: int64
使用查询方法进行复杂条件选择
import pandas as pddf = pd.DataFrame([[10, 20, 30, 40], [7, 14, 21, 28], [55, 15, 8, 12]],columns=['Apple', 'Orange', 'Banana', 'Pear'],index=['Basket1', 'Basket2', 'Basket3'])print(df)print("\n ----------- Filter data using query method ------------- \n")df1 = df.ix[df.query('Apple > 50 & Orange <= 15 & Banana < 15 & Pear == 12').index]print(df1)
OutPut
Apple Orange Banana PearBasket1 10 20 30 40Basket2 7 14 21 28Basket3 55 15 8 12----------- Filter data using query method -------------Apple Orange Banana PearBasket3 55 15 8 12
检查 Pandas 中是否存在列
import pandas as pddf = pd.DataFrame([[10, 20, 30, 40], [7, 14, 21, 28], [55, 15, 8, 12]],columns=['Apple', 'Orange', 'Banana', 'Pear'],index=['Basket1', 'Basket2', 'Basket3'])if 'Apple' in df.columns:print("Yes")else:print("No")if set(['Apple','Orange']).issubset(df.columns):print("Yes")else:print("No")
为特定列从 DataFrame 中查找 n-smallest 和 n-largest 值
import pandas as pddf = pd.DataFrame([[10, 20, 30, 40], [7, 14, 21, 28], [55, 15, 8, 12],[15, 14, 1, 8], [7, 1, 1, 8], [5, 4, 9, 2]],columns=['Apple', 'Orange', 'Banana', 'Pear'],index=['Basket1', 'Basket2', 'Basket3', 'Basket4','Basket5', 'Basket6'])print("\n----------- nsmallest -----------\n")print(df.nsmallest(2, ['Apple']))print("\n----------- nlargest -----------\n")print(df.nlargest(2, ['Apple']))
OutPut
----------- nsmallest -----------Apple Orange Banana PearBasket6 5 4 9 2Basket2 7 14 21 28----------- nlargest -----------Apple Orange Banana PearBasket3 55 15 8 12Basket4 15 14 1 8
从 DataFrame 中查找所有列的最小值和最大值
import pandas as pddf = pd.DataFrame([[10, 20, 30, 40], [7, 14, 21, 28], [55, 15, 8, 12],[15, 14, 1, 8], [7, 1, 1, 8], [5, 4, 9, 2]],columns=['Apple', 'Orange', 'Banana', 'Pear'],index=['Basket1', 'Basket2', 'Basket3', 'Basket4','Basket5', 'Basket6'])print("\n----------- Minimum -----------\n")print(df[['Apple', 'Orange', 'Banana', 'Pear']].min())print("\n----------- Maximum -----------\n")print(df[['Apple', 'Orange', 'Banana', 'Pear']].max())
OutPut
----------- Minimum -----------Apple 5Orange 1Banana 1Pear 2dtype: int64----------- Maximum -----------Apple 55Orange 20Banana 30Pear 40dtype: int64
在 DataFrame 中找到最小值和最大值所在的索引位置
import pandas as pddf = pd.DataFrame([[10, 20, 30, 40], [7, 14, 21, 28], [55, 15, 8, 12],[15, 14, 1, 8], [7, 1, 1, 8], [5, 4, 9, 2]],columns=['Apple', 'Orange', 'Banana', 'Pear'],index=['Basket1', 'Basket2', 'Basket3', 'Basket4','Basket5', 'Basket6'])print("\n----------- Minimum -----------\n")print(df[['Apple', 'Orange', 'Banana', 'Pear']].idxmin())print("\n----------- Maximum -----------\n")print(df[['Apple', 'Orange', 'Banana', 'Pear']].idxmax())
OutPut
----------- Minimum -----------Apple Basket6Orange Basket5Banana Basket4Pear Basket6dtype: object----------- Maximum -----------Apple Basket3Orange Basket1Banana Basket1Pear Basket1dtype: object
计算 DataFrame Columns 的累积乘积和累积总和
import pandas as pddf = pd.DataFrame([[10, 20, 30, 40], [7, 14, 21, 28], [55, 15, 8, 12],[15, 14, 1, 8], [7, 1, 1, 8], [5, 4, 9, 2]],columns=['Apple', 'Orange', 'Banana', 'Pear'],index=['Basket1', 'Basket2', 'Basket3', 'Basket4','Basket5', 'Basket6'])print("\n----------- Cumulative Product -----------\n")print(df[['Apple', 'Orange', 'Banana', 'Pear']].cumprod())print("\n----------- Cumulative Sum -----------\n")print(df[['Apple', 'Orange', 'Banana', 'Pear']].cumsum())
OutPut
----------- Cumulative Product -----------Apple Orange Banana PearBasket1 10 20 30 40Basket2 70 280 630 1120Basket3 3850 4200 5040 13440Basket4 57750 58800 5040 107520Basket5 404250 58800 5040 860160Basket6 2021250 235200 45360 1720320----------- Cumulative Sum -----------Apple Orange Banana PearBasket1 10 20 30 40Basket2 17 34 51 68Basket3 72 49 59 80Basket4 87 63 60 88Basket5 94 64 61 96Basket6 99 68 70 98
汇总统计
import pandas as pddf = pd.DataFrame([[10, 20, 30, 40], [7, 14, 21, 28], [55, 15, 8, 12],[15, 14, 1, 8], [7, 1, 1, 8], [5, 4, 9, 2]],columns=['Apple', 'Orange', 'Banana', 'Pear'],index=['Basket1', 'Basket2', 'Basket3', 'Basket4','Basket5', 'Basket6'])print("\n----------- Describe DataFrame -----------\n")print(df.describe())print("\n----------- Describe Column -----------\n")print(df[['Apple']].describe())
OutPut
----------- Describe DataFrame -----------Apple Orange Banana Pearcount 6.000000 6.000000 6.000000 6.000000mean 16.500000 11.333333 11.666667 16.333333std 19.180719 7.257180 11.587349 14.555640min 5.000000 1.000000 1.000000 2.00000025% 7.000000 6.500000 2.750000 8.00000050% 8.500000 14.000000 8.500000 10.00000075% 13.750000 14.750000 18.000000 24.000000max 55.000000 20.000000 30.000000 40.000000----------- Describe Column -----------Applecount 6.000000mean 16.500000std 19.180719min 5.00000025% 7.00000050% 8.50000075% 13.750000max 55.000000
查找 DataFrame 的均值、中值和众数
import pandas as pddf = pd.DataFrame([[10, 20, 30, 40], [7, 14, 21, 28], [55, 15, 8, 12],[15, 14, 1, 8], [7, 1, 1, 8], [5, 4, 9, 2]],columns=['Apple', 'Orange', 'Banana', 'Pear'],index=['Basket1', 'Basket2', 'Basket3', 'Basket4','Basket5', 'Basket6'])print("\n----------- Calculate Mean -----------\n")print(df.mean())print("\n----------- Calculate Median -----------\n")print(df.median())print("\n----------- Calculate Mode -----------\n")print(df.mode())
OutPut
----------- Calculate Mean -----------Apple 16.500000Orange 11.333333Banana 11.666667Pear 16.333333dtype: float64----------- Calculate Median -----------Apple 8.5Orange 14.0Banana 8.5Pear 10.0dtype: float64----------- Calculate Mode -----------Apple Orange Banana Pear0 7 14 1 8
测量 DataFrame 列的方差和标准偏差
import pandas as pddf = pd.DataFrame([[10, 20, 30, 40], [7, 14, 21, 28], [55, 15, 8, 12],[15, 14, 1, 8], [7, 1, 1, 8], [5, 4, 9, 2]],columns=['Apple', 'Orange', 'Banana', 'Pear'],index=['Basket1', 'Basket2', 'Basket3', 'Basket4','Basket5', 'Basket6'])print("\n----------- Calculate Mean -----------\n")print(df.mean())print("\n----------- Calculate Median -----------\n")print(df.median())print("\n----------- Calculate Mode -----------\n")print(df.mode())
OutPut
----------- Measure Variance -----------Apple 367.900000Orange 52.666667Banana 134.266667Pear 211.866667dtype: float64----------- Standard Deviation -----------Apple 19.180719Orange 7.257180Banana 11.587349Pear 14.555640dtype: float64
计算 DataFrame 列之间的协方差
import pandas as pddf = pd.DataFrame([[10, 20, 30, 40], [7, 14, 21, 28], [55, 15, 8, 12],[15, 14, 1, 8], [7, 1, 1, 8], [5, 4, 9, 2]],columns=['Apple', 'Orange', 'Banana', 'Pear'],index=['Basket1', 'Basket2', 'Basket3', 'Basket4','Basket5', 'Basket6'])print("\n----------- Calculating Covariance -----------\n")print(df.cov())print("\n----------- Between 2 columns -----------\n")# Covariance of Apple vs Orangeprint(df.Apple.cov(df.Orange))
OutPut
----------- Calculating Covariance -----------Apple Orange Banana PearApple 367.9 47.600000 -40.200000 -35.000000Orange 47.6 52.666667 54.333333 77.866667Banana -40.2 54.333333 134.266667 154.933333Pear -35.0 77.866667 154.933333 211.866667----------- Between 2 columns -----------47.60000000000001
计算 Pandas 中两个 DataFrame 对象之间的相关性
import pandas as pddf1 = pd.DataFrame([[10, 20, 30, 40], [7, 14, 21, 28], [55, 15, 8, 12],[15, 14, 1, 8], [7, 1, 1, 8], [5, 4, 9, 2]],columns=['Apple', 'Orange', 'Banana', 'Pear'],index=['Basket1', 'Basket2', 'Basket3', 'Basket4','Basket5', 'Basket6'])print("\n------ Calculating Correlation of one DataFrame Columns -----\n")print(df1.corr())df2 = pd.DataFrame([[52, 54, 58, 41], [14, 24, 51, 78], [55, 15, 8, 12],[15, 14, 1, 8], [7, 17, 18, 98], [15, 34, 29, 52]],columns=['Apple', 'Orange', 'Banana', 'Pear'],index=['Basket1', 'Basket2', 'Basket3', 'Basket4','Basket5', 'Basket6'])print("\n----- Calculating correlation between two DataFrame -------\n")print(df2.corrwith(other=df1))
OutPut
------ Calculating Correlation of one DataFrame Columns -----Apple Orange Banana PearApple 1.000000 0.341959 -0.180874 -0.125364Orange 0.341959 1.000000 0.646122 0.737144Banana -0.180874 0.646122 1.000000 0.918606Pear -0.125364 0.737144 0.918606 1.000000----- Calculating correlation between two DataFrame -------Apple 0.678775Orange 0.354993Banana 0.920872Pear 0.076919dtype: float64
计算 DataFrame 列的每个单元格的百分比变化
import pandas as pddf = pd.DataFrame([[10, 20, 30, 40], [7, 14, 21, 28], [55, 15, 8, 12],[15, 14, 1, 8], [7, 1, 1, 8], [5, 4, 9, 2]],columns=['Apple', 'Orange', 'Banana', 'Pear'],index=['Basket1', 'Basket2', 'Basket3', 'Basket4','Basket5', 'Basket6'])print("\n------ Percent change at each cell of a Column -----\n")print(df[['Apple']].pct_change()[:3])print("\n------ Percent change at each cell of a DataFrame -----\n")print(df.pct_change()[:5])
OutPut
------ Percent change at each cell of a Column -----AppleBasket1 NaNBasket2 -0.300000Basket3 6.857143------ Percent change at each cell of a DataFrame -----Apple Orange Banana PearBasket1 NaN NaN NaN NaNBasket2 -0.300000 -0.300000 -0.300000 -0.300000Basket3 6.857143 0.071429 -0.619048 -0.571429Basket4 -0.727273 -0.066667 -0.875000 -0.333333Basket5 -0.533333 -0.928571 0.000000 0.000000
在 Pandas 中向前和向后填充 DataFrame 列的缺失值
import pandas as pddf = pd.DataFrame([[10, 30, 40], [], [15, 8, 12],[15, 14, 1, 8], [7, 8], [5, 4, 1]],columns=['Apple', 'Orange', 'Banana', 'Pear'],index=['Basket1', 'Basket2', 'Basket3', 'Basket4','Basket5', 'Basket6'])print("\n------ DataFrame with NaN -----\n")print(df)print("\n------ DataFrame with Forward Filling -----\n")print(df.ffill())print("\n------ DataFrame with Forward Filling -----\n")print(df.bfill())
OutPut
------ DataFrame with NaN -----Apple Orange Banana PearBasket1 10.0 30.0 40.0 NaNBasket2 NaN NaN NaN NaNBasket3 15.0 8.0 12.0 NaNBasket4 15.0 14.0 1.0 8.0Basket5 7.0 8.0 NaN NaNBasket6 5.0 4.0 1.0 NaN------ DataFrame with Forward Filling -----Apple Orange Banana PearBasket1 10.0 30.0 40.0 NaNBasket2 10.0 30.0 40.0 NaNBasket3 15.0 8.0 12.0 NaNBasket4 15.0 14.0 1.0 8.0Basket5 7.0 8.0 1.0 8.0Basket6 5.0 4.0 1.0 8.0------ DataFrame with Forward Filling -----Apple Orange Banana PearBasket1 10.0 30.0 40.0 8.0Basket2 15.0 8.0 12.0 8.0Basket3 15.0 8.0 12.0 8.0Basket4 15.0 14.0 1.0 8.0Basket5 7.0 8.0 1.0 NaNBasket6 5.0 4.0 1.0 NaN
在 Pandas 中使用非分层索引使用 Stacking
import pandas as pddf = pd.DataFrame([[10, 30, 40], [], [15, 8, 12],[15, 14, 1, 8], [7, 8], [5, 4, 1]],columns=['Apple', 'Orange', 'Banana', 'Pear'],index=['Basket1', 'Basket2', 'Basket3', 'Basket4','Basket5', 'Basket6'])print("\n------ DataFrame-----\n")print(df)print("\n------ Stacking DataFrame -----\n")print(df.stack(level=-1))
OutPut
------ DataFrame-----Apple Orange Banana PearBasket1 10.0 30.0 40.0 NaNBasket2 NaN NaN NaN NaNBasket3 15.0 8.0 12.0 NaNBasket4 15.0 14.0 1.0 8.0Basket5 7.0 8.0 NaN NaNBasket6 5.0 4.0 1.0 NaN------ Stacking DataFrame -----Basket1 Apple 10.0Orange 30.0Banana 40.0Basket3 Apple 15.0Orange 8.0Banana 12.0Basket4 Apple 15.0Orange 14.0Banana 1.0Pear 8.0Basket5 Apple 7.0Orange 8.0Basket6 Apple 5.0Orange 4.0Banana 1.0dtype: float64
使用分层索引对 Pandas 进行拆分
import pandas as pddf = pd.DataFrame([[10, 30, 40], [], [15, 8, 12],[15, 14, 1, 8], [7, 8], [5, 4, 1]],columns=['Apple', 'Orange', 'Banana', 'Pear'],index=['Basket1', 'Basket2', 'Basket3', 'Basket4','Basket5', 'Basket6'])print("\n------ DataFrame-----\n")print(df)print("\n------ Unstacking DataFrame -----\n")print(df.unstack(level=-1))
OutPut
------ DataFrame-----Apple Orange Banana PearBasket1 10.0 30.0 40.0 NaNBasket2 NaN NaN NaN NaNBasket3 15.0 8.0 12.0 NaNBasket4 15.0 14.0 1.0 8.0Basket5 7.0 8.0 NaN NaNBasket6 5.0 4.0 1.0 NaN------ Unstacking DataFrame -----Apple Basket1 10.0Basket2 NaNBasket3 15.0Basket4 15.0Basket5 7.0Basket6 5.0Orange Basket1 30.0Basket2 NaNBasket3 8.0Basket4 14.0Basket5 8.0Basket6 4.0Banana Basket1 40.0Basket2 NaNBasket3 12.0Basket4 1.0Basket5 NaNBasket6 1.0Pear Basket1 NaNBasket2 NaNBasket3 NaNBasket4 8.0Basket5 NaNBasket6 NaNdtype: float64
Pandas 获取 HTML 页面上 table 数据
import pandas as pddf pd.read_html("url")
