一、创建Excel文件

  1. import pandas as pd
  2. df = pd.DataFrame({'ID':[1,2,3], 'Name':['Tim','Victor','Nick']})
  3. df = df.set_index('ID')
  4. print(df)
  5. df.to_excel('D:\Python\output.xlsx')
  6. print('Done!')
  • 使用pd.DataFrame()创建一个DataFrame
  • 使用df.set_index()方法设置“ID”列为索引列
  • df.to_excel()方法存储为Excel文件

    二、读取文件

People.xlsx

(一)查看数据信息

  1. import pandas as pd
  2. people = pd.read_excel('D:\Python\pandas\pandas_excel\People.xlsx')
  3. print(people.shape)
  4. print(people.head())
  5. print(people.columns)
  6. print('*'*100)
  7. print(people.tail())
  • shape()方法查看DataFrame形状,即几行几列
  • head()方法查看前几行
  • columns()方法查看列标题
  • tail()方法查看最后几列数据

    (二)特殊标题行的处理

    1、标题行前行中数据为空或有脏数据

    ```python import pandas as pd

people = pd.read_excel(‘D:\Python\pandas\pandas_excel\People.xlsx’,header= 1)

  1. - 如果为空,pandas 会自动跳过空行或使用header= 1,指定标题行所在行
  2. - 如有脏数据,使用header= 1,指定标题行所在行
  3. <a name="Ga60c"></a>
  4. ### 2、原始数据无标题(无columns)
  5. ```python
  6. import pandas as pd
  7. people = pd.read_excel('D:\Python\pandas\pandas_excel\People.xlsx',header=None)
  8. people.columns= ['ID','Type','Title','Fullname','Middlename', 'Lastname']
  9. people.set_index('ID',inplace=True)
  10. people.to_excel('D:\Python\pandas\pandas_excel\output.xlsx')
  11. print(people.columns)
  12. print('Done!')
  • header = None,pandas 自动生成columns
  • columns()方法为数据添加列标题
  • set_index()方法设置“ID”列为索引列,参数inplace=True,表示原地修改原数据,如省略则生成一个新的DataFrame。

    (三)读取文件时,直接设置索引列

  1. import pandas as pd
  2. people = pd.read_excel('D:\Python\pandas\pandas_excel\output.xlsx',index_col='ID')
  3. print(people.head())
  • index_col参数可指定特定列为索引列

    三、行、列、单元格

    (一)创建series

    1、使用dict创建series

  1. import pandas as pd
  2. d = {'x':1, 'y':2, 'z':3}
  3. df = pd.Series(d)
  4. print(df)
  5. print(df.name)
  6. print(df.index)

2、使用list创建series

  1. import pandas as pd
  2. l1 = ['x', 'y', 'z']
  3. l2 = [1, 2, 3]
  4. df = pd.Series(l2, index = l1)
  5. print(df)

3、使用series创建DataFrame

  1. 以列的形式创建 ```python import pandas as pd

s1 = pd.Series([1, 2, 3],index = [1, 2, 3], name = ‘A’) s2 = pd.Series([10, 20, 30], index = [1, 2, 3], name =’B’) s3 = pd.Series([100, 200, 300], index = [1, 2, 3], name = ‘C’)

df = pd.DataFrame({s1.name:s1, s2.name:s2, s3.name:s3})

print(df) print(df.dtypes)

  1. 2. 以行的形式创建
  2. ```python
  3. import pandas as pd
  4. s1 = pd.Series([1, 2, 3],index = [1, 2, 3], name = 'A')
  5. s2 = pd.Series([10, 20, 30], index = [1, 2, 3], name ='B')
  6. s3 = pd.Series([100, 200, 300], index = [1, 2, 3], name = 'C')
  7. df = pd.DataFrame({s1, s2, s3})
  8. print(df)
  9. print(df.dtypes)

四、数据区域读取及数字填充

(一)跳过空行空列,填充空值

Books.xlsx

  1. # 数据区域读取,数字填充
  2. import pandas as pd
  3. from datetime import date,timedelta
  4. books = pd.read_excel('D:\Python\pandas\pandas_excel\Books.xlsx',skiprows=3,usecols='C:F',index_col=None,dtype={'ID':str,'InStore':str})
  5. start = date(2018,1,1)
  6. for i in books.index:
  7. books['ID'].at[i] =i +1
  8. books.InStore.at[i] = 'Yes' if i%2 ==0 else 'No'
  9. books['Date'].at[i] = start
  10. print(books)
  • 跳过空行:参数skiprow = 3,跳过三行
  • 跳过空列:参数usecols = ‘C:F’,使用C:F列的数据
  • 原始数据中,ID、InStore、Date三列数据为空,默认读取为Nan。Nan数据类型为float,如需填充为数字或字符串,需先转换数据类型,参dtype={‘ID’:str,’InStore’:str}),指定数据类型。
  • 日期类型需导入datetime 模块
  • 使用for 循环填充空值

    (二)日期序列填充

  1. import pandas as pd
  2. from datetime import date,timedelta
  3. books = pd.read_excel('D:\Python\pandas\pandas_excel\Books.xlsx',skiprows=3,usecols='C:F',index_col=None,dtype={'ID':str,'InStore':str})
  4. start = date(2018,1,1)
  5. for i in books.index:
  6. books['ID'].at[i] =i +1
  7. books.InStore.at[i] = 'Yes' if i%2 ==0 else 'No'
  8. books['Date'].at[i] = start + timedelta(days=i)
  9. print(books)
  • timedelta(days=i)按天填充
  1. #%%
  2. # 数据区域读取,数字填充
  3. import pandas as pd
  4. from datetime import date,timedelta
  5. books = pd.read_excel('D:\Python\pandas\pandas_excel\Books.xlsx',skiprows=3,usecols='C:F',index_col=None,dtype={'ID':str,'InStore':str})
  6. start = date(2018,1,1)
  7. for i in books.index:
  8. books['ID'].at[i] =i +1
  9. books.InStore.at[i] = 'Yes' if i%2 ==0 else 'No'
  10. books['Date'].at[i] = date(start.year +i,start.month, start.day)
  11. print(books)
  • date(start.year +i,start.month, start.day)按年填充
  1. from datetime import date,timedelta
  2. def add_month(d, md):
  3. yd = md // 12
  4. m = d.month + md % 12
  5. if m != 12:
  6. yd += m // 12
  7. m = m % 12
  8. return date(d.year + yd, m, d.day)
  9. books = pd.read_excel('D:\Python\pandas\pandas_excel\Books.xlsx',skiprows=3,usecols='C:F',index_col=None,dtype={'ID':str,'InStore':str})
  10. start = date(2018,1,1)
  11. for i in books.index:
  12. books['ID'].at[i] =i +1
  13. books.InStore.at[i] = 'Yes' if i%2 ==0 else 'No'
  14. books['Date'].at[i] = add_month(start,i)
  15. print(books)
  • 按月填充

    五、函数填充

    (一)整列填充

    ```python import pandas as pd

books = pd.read_excel(‘D:\CodeProjects\pandas_vs_Excel\Books.xlsx’,index_col=’ID’) books[‘Price’] = books[‘ListPrice’] * books[‘Discount’] print(books)

  1. - books['Price'] = books['ListPrice'] * books['Discount']整列填充
  2. <a name="jqXXe"></a>
  3. ## (二)循环
  4. ```python
  5. import pandas as pd
  6. books = pd.read_excel('D:\CodeProjects\pandas_vs_Excel\Books.xlsx',index_col='ID')
  7. for i in books.index:
  8. books['Price'].at[i] = books['ListPrice'].at[i] * books['Discount'].at[i]
  9. print(books)

(三)apply方法,定义函数

  1. import pandas as pd
  2. books = pd.read_excel('D:\CodeProjects\pandas_vs_Excel\Books.xlsx',index_col='ID')
  3. books['ListPrice'] = books['ListPrice'].apply(lambda x: x+2)
  4. print(books)

六、排序

List.xlsx

(一)单条件排序

  1. import pandas as pd
  2. products = pd.read_excel(r'D:\CodeProjects\pandas_vs_Excel\007\List.xlsx',index_col='ID')
  3. products.sort_values(by='Price',inplace= True,ascending= False)
  4. print(products)
  • 按值排序使用方法sort_values
  • 参数by,指定排序列
  • 参数inplace=True,原地替换
  • 参数ascending= False,按降序排列,默认为True。

    (二)多条件排序

  1. import pandas as pd
  2. products = pd.read_excel(r'D:\CodeProjects\pandas_vs_Excel\007\List.xlsx',index_col='ID')
  3. products.sort_values(by=['Worthy','Price'],inplace= True,ascending= [True,False])
  4. print(products)
  • 多个条件排序使用list

    七、筛选

Students.xlsx

  1. # %%
  2. import pandas as pd
  3. def age_18_to_30(a):
  4. return 18<= a < 30
  5. def level_a(s):
  6. return 85<= s <= 100
  7. students = pd.read_excel(r'D:\CodeProjects\pandas_vs_Excel\008\Students.xlsx',index_col='ID')
  8. students = students.loc[students['Score'].apply(level_a)].loc[students['Age'].apply(age_18_to_30)]
  9. print(students)
  • loc方法筛选
  • 自定义筛选函数
  • 通过apply方法应用函数
  • 自定义函数还可换为lambda表达式 : lambda a: 18 <= a < 30

    八、多表合并(vlookup)

Student_Score.xlsx

(一)merge方法

  1. import pandas as pd
  2. students = pd.read_excel(r'D:\CodeProjects\pandas_vs_Excel\016\Student_Score.xlsx')
  3. scores = pd.read_excel(r'D:\CodeProjects\pandas_vs_Excel\016\Student_Score.xlsx',sheetname = 'Scores')
  4. table = students.merge(scores,how='left',on='ID').fillna(0)
  5. table.Score = table.Score.astype(int)
  6. print(table)
  • 参数sheetname指定读取工作表
  • merge方法多表合并,参数how指定合并方式,参数on指定用于连接的列名,参数on可省略,pandas会自动匹配
  • fillna方法填充Nan
  • astype方法转换数据类型

    (二)join方法

  1. import pandas as pd
  2. students = pd.read_excel(r'D:\CodeProjects\pandas_vs_Excel\016\Student_Score.xlsx',index_col='ID')
  3. scores = pd.read_excel(r'D:\CodeProjects\pandas_vs_Excel\016\Student_Score.xlsx',sheetname = 'Scores',index_col='ID')
  4. table = students.join(scores,how='left',on='ID').fillna(0)
  5. table.Score = table.Score.astype(int)
  6. print(table)
  • join方法使用index 作为连接列