一、创建Excel文件
import pandas as pd
df = pd.DataFrame({'ID':[1,2,3], 'Name':['Tim','Victor','Nick']})
df = df.set_index('ID')
print(df)
df.to_excel('D:\Python\output.xlsx')
print('Done!')
(一)查看数据信息
import pandas as pd
people = pd.read_excel('D:\Python\pandas\pandas_excel\People.xlsx')
print(people.shape)
print(people.head())
print(people.columns)
print('*'*100)
print(people.tail())
- shape()方法查看DataFrame形状,即几行几列
- head()方法查看前几行
- columns()方法查看列标题
- tail()方法查看最后几列数据
(二)特殊标题行的处理
1、标题行前行中数据为空或有脏数据
```python import pandas as pd
people = pd.read_excel(‘D:\Python\pandas\pandas_excel\People.xlsx’,header= 1)
- 如果为空,pandas 会自动跳过空行或使用header= 1,指定标题行所在行
- 如有脏数据,使用header= 1,指定标题行所在行
<a name="Ga60c"></a>
### 2、原始数据无标题(无columns)
```python
import pandas as pd
people = pd.read_excel('D:\Python\pandas\pandas_excel\People.xlsx',header=None)
people.columns= ['ID','Type','Title','Fullname','Middlename', 'Lastname']
people.set_index('ID',inplace=True)
people.to_excel('D:\Python\pandas\pandas_excel\output.xlsx')
print(people.columns)
print('Done!')
- header = None,pandas 自动生成columns
- columns()方法为数据添加列标题
- set_index()方法设置“ID”列为索引列,参数inplace=True,表示原地修改原数据,如省略则生成一个新的DataFrame。
(三)读取文件时,直接设置索引列
import pandas as pd
people = pd.read_excel('D:\Python\pandas\pandas_excel\output.xlsx',index_col='ID')
print(people.head())
import pandas as pd
d = {'x':1, 'y':2, 'z':3}
df = pd.Series(d)
print(df)
print(df.name)
print(df.index)
2、使用list创建series
import pandas as pd
l1 = ['x', 'y', 'z']
l2 = [1, 2, 3]
df = pd.Series(l2, index = l1)
print(df)
3、使用series创建DataFrame
- 以列的形式创建 ```python import pandas as pd
s1 = pd.Series([1, 2, 3],index = [1, 2, 3], name = ‘A’) s2 = pd.Series([10, 20, 30], index = [1, 2, 3], name =’B’) s3 = pd.Series([100, 200, 300], index = [1, 2, 3], name = ‘C’)
df = pd.DataFrame({s1.name:s1, s2.name:s2, s3.name:s3})
print(df) print(df.dtypes)
2. 以行的形式创建
```python
import pandas as pd
s1 = pd.Series([1, 2, 3],index = [1, 2, 3], name = 'A')
s2 = pd.Series([10, 20, 30], index = [1, 2, 3], name ='B')
s3 = pd.Series([100, 200, 300], index = [1, 2, 3], name = 'C')
df = pd.DataFrame({s1, s2, s3})
print(df)
print(df.dtypes)
四、数据区域读取及数字填充
(一)跳过空行空列,填充空值
# 数据区域读取,数字填充
import pandas as pd
from datetime import date,timedelta
books = pd.read_excel('D:\Python\pandas\pandas_excel\Books.xlsx',skiprows=3,usecols='C:F',index_col=None,dtype={'ID':str,'InStore':str})
start = date(2018,1,1)
for i in books.index:
books['ID'].at[i] =i +1
books.InStore.at[i] = 'Yes' if i%2 ==0 else 'No'
books['Date'].at[i] = start
print(books)
- 跳过空行:参数skiprow = 3,跳过三行
- 跳过空列:参数usecols = ‘C:F’,使用C:F列的数据
- 原始数据中,ID、InStore、Date三列数据为空,默认读取为Nan。Nan数据类型为float,如需填充为数字或字符串,需先转换数据类型,参dtype={‘ID’:str,’InStore’:str}),指定数据类型。
- 日期类型需导入datetime 模块
- 使用for 循环填充空值
(二)日期序列填充
import pandas as pd
from datetime import date,timedelta
books = pd.read_excel('D:\Python\pandas\pandas_excel\Books.xlsx',skiprows=3,usecols='C:F',index_col=None,dtype={'ID':str,'InStore':str})
start = date(2018,1,1)
for i in books.index:
books['ID'].at[i] =i +1
books.InStore.at[i] = 'Yes' if i%2 ==0 else 'No'
books['Date'].at[i] = start + timedelta(days=i)
print(books)
- timedelta(days=i)按天填充
#%%
# 数据区域读取,数字填充
import pandas as pd
from datetime import date,timedelta
books = pd.read_excel('D:\Python\pandas\pandas_excel\Books.xlsx',skiprows=3,usecols='C:F',index_col=None,dtype={'ID':str,'InStore':str})
start = date(2018,1,1)
for i in books.index:
books['ID'].at[i] =i +1
books.InStore.at[i] = 'Yes' if i%2 ==0 else 'No'
books['Date'].at[i] = date(start.year +i,start.month, start.day)
print(books)
- date(start.year +i,start.month, start.day)按年填充
from datetime import date,timedelta
def add_month(d, md):
yd = md // 12
m = d.month + md % 12
if m != 12:
yd += m // 12
m = m % 12
return date(d.year + yd, m, d.day)
books = pd.read_excel('D:\Python\pandas\pandas_excel\Books.xlsx',skiprows=3,usecols='C:F',index_col=None,dtype={'ID':str,'InStore':str})
start = date(2018,1,1)
for i in books.index:
books['ID'].at[i] =i +1
books.InStore.at[i] = 'Yes' if i%2 ==0 else 'No'
books['Date'].at[i] = add_month(start,i)
print(books)
books = pd.read_excel(‘D:\CodeProjects\pandas_vs_Excel\Books.xlsx’,index_col=’ID’) books[‘Price’] = books[‘ListPrice’] * books[‘Discount’] print(books)
- books['Price'] = books['ListPrice'] * books['Discount']整列填充
<a name="jqXXe"></a>
## (二)循环
```python
import pandas as pd
books = pd.read_excel('D:\CodeProjects\pandas_vs_Excel\Books.xlsx',index_col='ID')
for i in books.index:
books['Price'].at[i] = books['ListPrice'].at[i] * books['Discount'].at[i]
print(books)
(三)apply方法,定义函数
import pandas as pd
books = pd.read_excel('D:\CodeProjects\pandas_vs_Excel\Books.xlsx',index_col='ID')
books['ListPrice'] = books['ListPrice'].apply(lambda x: x+2)
print(books)
六、排序
(一)单条件排序
import pandas as pd
products = pd.read_excel(r'D:\CodeProjects\pandas_vs_Excel\007\List.xlsx',index_col='ID')
products.sort_values(by='Price',inplace= True,ascending= False)
print(products)
import pandas as pd
products = pd.read_excel(r'D:\CodeProjects\pandas_vs_Excel\007\List.xlsx',index_col='ID')
products.sort_values(by=['Worthy','Price'],inplace= True,ascending= [True,False])
print(products)
# %%
import pandas as pd
def age_18_to_30(a):
return 18<= a < 30
def level_a(s):
return 85<= s <= 100
students = pd.read_excel(r'D:\CodeProjects\pandas_vs_Excel\008\Students.xlsx',index_col='ID')
students = students.loc[students['Score'].apply(level_a)].loc[students['Age'].apply(age_18_to_30)]
print(students)
(一)merge方法
import pandas as pd
students = pd.read_excel(r'D:\CodeProjects\pandas_vs_Excel\016\Student_Score.xlsx')
scores = pd.read_excel(r'D:\CodeProjects\pandas_vs_Excel\016\Student_Score.xlsx',sheetname = 'Scores')
table = students.merge(scores,how='left',on='ID').fillna(0)
table.Score = table.Score.astype(int)
print(table)
- 参数sheetname指定读取工作表
- merge方法多表合并,参数how指定合并方式,参数on指定用于连接的列名,参数on可省略,pandas会自动匹配
- fillna方法填充Nan
- astype方法转换数据类型
(二)join方法
import pandas as pd
students = pd.read_excel(r'D:\CodeProjects\pandas_vs_Excel\016\Student_Score.xlsx',index_col='ID')
scores = pd.read_excel(r'D:\CodeProjects\pandas_vs_Excel\016\Student_Score.xlsx',sheetname = 'Scores',index_col='ID')
table = students.join(scores,how='left',on='ID').fillna(0)
table.Score = table.Score.astype(int)
print(table)
- join方法使用index 作为连接列