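Pandas is a core data-analysis library for Python. It provides fast, flexible, and expressive data structures designed to make working with relational and labeled data simple and intuitive. Pandas aims to be the essential high-level tool for practical, real-world data analysis in Python, and its longer-term goal is to become the most powerful and flexible open-source data analysis tool available in any language. After years of sustained work, it is well on its way toward that goal.
Pandas is well suited to the following kinds of data:
- Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet;
- Ordered and unordered (not necessarily fixed-frequency) time series data;
- Matrix data with row and column labels, whether homogeneously or heterogeneously typed;
- Any other form of observational or statistical data set. The data does not need to be labeled before it is placed into a pandas data structure.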
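Data structures
| Dimensions | Name | Description |
|---|---|---|
| 1 | Series | Labeled one-dimensional homogeneously-typed array |
| 2 | DataFrame | Labeled, size-mutable two-dimensional table with heterogeneously-typed columns |
The Series data structure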
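
    # A Series is a one-dimensional labeled array that can hold any data type
    # (integers, strings, floats, Python objects, etc.).
    # The axis labels are collectively called the index.
    import numpy as np
    import pandas as pd
    # Import the numpy and pandas modules

    ar = np.random.rand(5)
    print(ar)
    s = pd.Series(ar)
    print(s)
    print(type(s))
    # Inspect the data and its type

    print(s.index, type(s.index))
    print(s.values, type(s.values))
    # .index returns the index of the Series; here its type is RangeIndex
    # .values returns the values of the Series; the type is ndarray
    # Key point: compared with an ndarray, a Series is an array that carries
    # its own index → one-dimensional array + matching index
    # So the values of a Series, taken on their own, are just an ndarray
    # Series and ndarray are similar; indexing and slicing work in much the same way
    # Compared with a dict, a Series behaves like an ordered dictionary, and lookup
    # follows a similar principle (a dict uses keys, a Series uses index labels)

Output:

    [0.57879139 0.74160843 0.96094125 0.51813092 0.47592715]
    0    0.578791
    1    0.741608
    2    0.960941
    3    0.518131
    4    0.475927
    dtype: float64
    <class 'pandas.core.series.Series'>
    RangeIndex(start=0, stop=5, step=1) <class 'pandas.core.indexes.range.RangeIndex'>
    [0.57879139 0.74160843 0.96094125 0.51813092 0.47592715] <class 'numpy.ndarray'>
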
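Creating data
pd.Series()
Parameters
- data: array-like, iterable, dict, or scalar value; holds the data stored in the Series.
- index: array-like or Index (1-D). Values must be hashable and have the same length as data. Non-unique index values are allowed. Defaults to RangeIndex (0, 1, 2, …, n) if not provided. If both a dict and an index are passed, the index takes precedence over the keys of the dict.
- dtype: data type of the resulting Series; inferred from data if not given.
- name: the name to give to the Series.
- copy: whether to copy the input data.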
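
As a minimal sketch of how data, index, and name interact (the sample dict and variable names are illustrative, not from the original notes): when a dict is passed together with an explicit index, the result follows the index, and labels that are not keys of the dict come out as NaN.

```python
import pandas as pd

prices = {"a": 1, "b": 2, "c": 3}  # hypothetical sample data

# Without an index, the dict keys become the index labels.
s_default = pd.Series(prices)

# With an explicit index, the result is aligned to that index:
# 'd' is not a key of the dict, so its value is NaN, and 'a' is left out.
s_indexed = pd.Series(prices, index=["c", "b", "d"], name="demo")

print(s_default)
print(s_indexed)
```

Because NaN is a floating-point value, the second Series ends up with a float dtype even though the dict holds integers.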

    # Series name attribute: name
    s1 = pd.Series(np.random.randn(5))
    print(s1)
    print('-----')
    s2 = pd.Series(np.random.randn(5), name = 'test')
    print(s2)
    print(s1.name, s2.name, type(s2.name))
    # name is a parameter of Series that gives the array a name
    # .name attribute: returns the name of the array as a str;
    # if no name was defined, it returns None

    s3 = s2.rename('hehehe')
    print(s3)
    print(s3.name, s2.name)
    # .rename() sets a new name and returns a new array; the original array is unchanged

Output:

    0   -0.403084
    1    1.369383
    2    1.134319
    3   -0.635050
    4    1.680211
    dtype: float64
    -----
    0   -0.120014
    1    1.967648
    2    1.142626
    3    0.234079
    4    0.761357
    Name: test, dtype: float64
    None test <class 'str'>
    0   -0.120014
    1    1.967648
    2    1.142626
    3    0.234079
    4    0.761357
    Name: hehehe, dtype: float64
    hehehe test

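Series creation method 1: from a dict. The dict keys become the index and the dict values become the values.

    dic = {'a':1 ,'b':2 , 'c':3, '4':4, '5':5}
    s = pd.Series(dic)
    print(s)
    # Note: the keys here are all strings. What happens if the values are of more
    # than one type? → try dic = {'a':1 ,'b':'hello' , 'c':3, '4':4, '5':5}

Output (recorded with an older pandas release that sorted the dict keys; pandas 0.23 and later on Python 3.6+ keep the insertion order a, b, c, 4, 5):

    4    4
    5    5
    a    1
    b    2
    c    3
    dtype: int64
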
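
The question raised in the comment above can be checked directly. The sketch below builds the Series from the mixed-type dict suggested there; the variable names are only illustrative.

```python
import pandas as pd

dic_mixed = {"a": 1, "b": "hello", "c": 3, "4": 4, "5": 5}
s_mixed = pd.Series(dic_mixed)

print(s_mixed)
print(s_mixed.dtype)        # a single dtype has to cover both ints and strings
print(type(s_mixed["b"]))   # each element keeps its own Python type
```

With integers and a string mixed together, pandas cannot pick a numeric dtype, so the Series falls back to the generic object dtype.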
Series creation method 2: from a (one-dimensional) array

    arr = np.random.randn(5)
    s = pd.Series(arr)
    print(arr)
    print(s)
    # The default index is a sequence of integers starting at 0 with step 1

    # np.object was removed in NumPy 1.24; the built-in object is used instead
    s = pd.Series(arr, index = ['a','b','c','d','e'], dtype = object)
    print(s)
    # index parameter: sets the index; its length must match the data
    # dtype parameter: sets the data type of the values

Output:

    [ 0.11206121  0.1324684   0.59930544  0.34707543 -0.15652941]
    0    0.112061
    1    0.132468
    2    0.599305
    3    0.347075
    4   -0.156529
    dtype: float64
    a    0.112061
    b    0.132468
    c    0.599305
    d    0.347075
    e   -0.156529
    dtype: object

Series creation method 3: from a scalar

    s = pd.Series(10, index = range(4))
    print(s)
    # If data is a scalar value, an index must be provided.
    # The value is repeated to match the length of the index

Output:

    0    10
    1    10
    2    10
    3    10
    dtype: int64

Positional indexing
Positional subscripts, similar to a sequence

    s = pd.Series(np.random.rand(5))
    print(s)
    print(s[0], type(s[0]), s[0].dtype)
    print(float(s[0]), type(float(s[0])))
    #print(s[-1])
    # Positions start at 0
    # The result is a numpy.float64 value,
    # which float() converts to a built-in Python float
    # numpy.float64 and float differ in memory footprint
    # What does s[-1] give?

Output:

    0    0.924575
    1    0.988654
    2    0.426333
    3    0.216504
    4    0.453570
    dtype: float64
    0.924575004833 <class 'numpy.float64'> float64
    0.9245750048328816 <class 'float'>

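
To answer the s[-1] question, here is a small sketch (not part of the original notes). With the default integer index, s[...] looks the key up as a label, so -1 is not found; .iloc is the accessor that always works by position.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.random.rand(5))  # default RangeIndex with labels 0..4

try:
    s[-1]            # treated as the label -1, which does not exist
    raised = False
except KeyError:
    raised = True
print("s[-1] raised KeyError:", raised)

last = s.iloc[-1]    # .iloc is purely positional, so -1 means the last element
print(last == s[4])
```

Here s[0] only looks positional because the labels happen to be 0, 1, 2, …; once the labels differ from the positions, .iloc is the unambiguous choice.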
Label indexing

    s = pd.Series(np.random.rand(5), index = ['a','b','c','d','e'])
    print(s)
    print(s['a'], type(s['a']), s['a'].dtype)
    # Works like positional indexing: write the label inside [];
    # note that the index labels here are strings

    sci = s[['a','b','e']]
    print(sci, type(sci))
    # To select several labels, use [[]] (a list inside the [])
    # Selecting several labels returns a new array

Output:

    a    0.714630
    b    0.213957
    c    0.172188
    d    0.972158
    e    0.875175
    dtype: float64
    0.714630383451 <class 'numpy.float64'> float64
    a    0.714630
    b    0.213957
    e    0.875175
    dtype: float64 <class 'pandas.core.series.Series'>

Slice indexing

    s1 = pd.Series(np.random.rand(5))
    s2 = pd.Series(np.random.rand(5), index = ['a','b','c','d','e'])
    print(s1[1:4], s1[4])
    print(s2['a':'c'], s2['c'])
    # A single position on a labeled Series is read with .iloc
    # (s2[3] relied on a positional fallback that newer pandas deprecates)
    print(s2[0:3], s2.iloc[3])
    print('-----')
    # Note: slicing by index label includes the end point

    print(s2[:-1])
    print(s2[::2])
    # Slicing by position is written the same way as for a list

Output:

    1    0.865967
    2    0.114500
    3    0.369301
    dtype: float64 0.411702342342
    a    0.717378
    b    0.642561
    c    0.391091
    dtype: float64 0.39109096261
    a    0.717378
    b    0.642561
    c    0.391091
    dtype: float64 0.998978363818
    -----
    a    0.717378
    b    0.642561
    c    0.391091
    d    0.998978
    dtype: float64
    a    0.717378
    c    0.391091
    e    0.957639
    dtype: float64

Boolean indexing

    s = pd.Series(np.random.rand(3)*100)
    s[4] = None  # add a null value
    print(s)
    bs1 = s > 50
    bs2 = s.isnull()
    bs3 = s.notnull()
    print(bs1, type(bs1), bs1.dtype)
    print(bs2, type(bs2), bs2.dtype)
    print(bs3, type(bs3), bs3.dtype)
    print('-----')
    # A comparison on an array returns a new array made of boolean values
    # .isnull() / .notnull() test for null values (None stands for a missing value,
    # NaN for an invalid number; both are recognized as null)

    print(s[s > 50])
    print(s[bs3])
    # Boolean indexing: write [condition], where the condition can be
    # an expression or a boolean array

Output:

    0    2.03802
    1    40.3989
    2    25.2001
    4       None
    dtype: object
    0    False
    1    False
    2    False
    4    False
    dtype: bool <class 'pandas.core.series.Series'> bool
    0    False
    1    False
    2    False
    4     True
    dtype: bool <class 'pandas.core.series.Series'> bool
    0     True
    1     True
    2     True
    4    False
    dtype: bool <class 'pandas.core.series.Series'> bool
    -----
    Series([], dtype: object)
    0    2.03802
    1    40.3989
    2    25.2001
    dtype: object

Viewing data
Reindexing: reindex
.reindex() reorders the data according to the new index; a label that does not exist in the current index introduces a missing value

    s = pd.Series(np.random.rand(3), index = ['a','b','c'])
    print(s)
    s1 = s.reindex(['c','b','a','d'])
    print(s1)
    # .reindex() also takes a list
    # The label 'd' does not exist here, so its value is NaN

    s2 = s.reindex(['c','b','a','d'], fill_value = 0)
    print(s2)
    # fill_value parameter: the value used to fill missing entries

Output:

    a    0.343718
    b    0.322228
    c    0.746720
    dtype: float64
    c    0.746720
    b    0.322228
    a    0.343718
    d         NaN
    dtype: float64
    c    0.746720
    b    0.322228
    a    0.343718
    d    0.000000
    dtype: float64

Series alignment

    s1 = pd.Series(np.random.rand(3), index = ['Jack','Marry','Tom'])
    s2 = pd.Series(np.random.rand(3), index = ['Wang','Jack','Marry'])
    print(s1)
    print(s2)
    print(s1+s2)
    # The main difference between Series and ndarray is that operations
    # on Series align automatically on the labels
    # The order of the index does not affect the calculation; values are matched by label
    # A null value combined with any value is still null

Output:

    Jack     0.753732
    Marry    0.180223
    Tom      0.283704
    dtype: float64
    Wang     0.309128
    Jack     0.533997
    Marry    0.626126
    dtype: float64
    Jack     1.287729
    Marry    0.806349
    Tom           NaN
    Wang          NaN
    dtype: float64

Deleting: drop

    s = pd.Series(np.random.rand(5), index = list('ngjur'))
    print(s)
    s1 = s.drop('n')
    s2 = s.drop(['g','j'])
    print(s1)
    print(s2)
    print(s)
    # drop returns a copy with the elements removed (inplace=False)

Output:

    n    0.876587
    g    0.594053
    j    0.628232
    u    0.360634
    r    0.454483
    dtype: float64
    g    0.594053
    j    0.628232
    u    0.360634
    r    0.454483
    dtype: float64
    n    0.876587
    u    0.360634
    r    0.454483
    dtype: float64
    n    0.876587
    g    0.594053
    j    0.628232
    u    0.360634
    r    0.454483
    dtype: float64

Adding

    s1 = pd.Series(np.random.rand(5))
    s2 = pd.Series(np.random.rand(5), index = list('ngjur'))
    print(s1)
    print(s2)
    s1[5] = 100
    s2['a'] = 100
    print(s1)
    print(s2)
    print('-----')
    # Add a value directly by assigning to a new position or index label

    # Series.append was removed in pandas 2.0; pd.concat does the same job
    s3 = pd.concat([s1, s2])
    print(s3)
    print(s1)
    # Concatenation appends a whole array
    # It produces a new array and does not change the existing ones

Output:

    0    0.516447
    1    0.699382
    2    0.469513
    3    0.589821
    4    0.402188
    dtype: float64
    n    0.615641
    g    0.451192
    j    0.022328
    u    0.977568
    r    0.902041
    dtype: float64
    0      0.516447
    1      0.699382
    2      0.469513
    3      0.589821
    4      0.402188
    5    100.000000
    dtype: float64
    n      0.615641
    g      0.451192
    j      0.022328
    u      0.977568
    r      0.902041
    a    100.000000
    dtype: float64
    -----
    0      0.516447
    1      0.699382
    2      0.469513
    3      0.589821
    4      0.402188
    5    100.000000
    n      0.615641
    g      0.451192
    j      0.022328
    u      0.977568
    r      0.902041
    a    100.000000
    dtype: float64
    0      0.516447
    1      0.699382
    2      0.469513
    3      0.589821
    4      0.402188
    5    100.000000
    dtype: float64

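
Series.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on current releases joining two Series is done with pd.concat. The sketch below uses small fixed values chosen for illustration and also shows the ignore_index option.

```python
import pandas as pd

s1 = pd.Series([1.0, 2.0], index=[0, 1])
s2 = pd.Series([3.0, 4.0], index=["n", "g"])

s3 = pd.concat([s1, s2])                      # keeps the original labels
s4 = pd.concat([s1, s2], ignore_index=True)   # renumbers the result 0..n-1

print(s3)
print(s4)
print(s1)  # the inputs are left unchanged
```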
Modifying

    s = pd.Series(np.random.rand(3), index = ['a','b','c'])
    print(s)
    s['a'] = 100
    s[['b','c']] = 200
    print(s)
    # Modify values directly through the index, as with a sequence

Output:

    a    0.873604
    b    0.244707
    c    0.888685
    dtype: float64
    a    100.0
    b    200.0
    c    200.0
    dtype: float64

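DataFrame
The DataFrame data structure
A DataFrame is a tabular data structure: a "two-dimensional array with labels".
A DataFrame has an index (row labels) and columns (column labels)

    data = {'name':['Jack','Tom','Mary'],
            'age':[18,19,20],
            'gender':['m','m','w']}
    frame = pd.DataFrame(data)
    print(frame)
    print(type(frame))
    print(frame.index, '\nThe type is:', type(frame.index))
    print(frame.columns, '\nThe type is:', type(frame.columns))
    print(frame.values, '\nThe type is:', type(frame.values))
    # Inspect the data; its type is DataFrame
    # .index returns the row labels
    # .columns returns the column labels
    # .values returns the values; the type is ndarray

Output (recorded with an older pandas release, where the columns come out sorted alphabetically and the index classes live under pandas.indexes; current releases keep the insertion order of the dict and use pandas.core.indexes):

       age gender  name
    0   18      m  Jack
    1   19      m   Tom
    2   20      w  Mary
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex(start=0, stop=3, step=1)
    The type is: <class 'pandas.indexes.range.RangeIndex'>
    Index(['age', 'gender', 'name'], dtype='object')
    The type is: <class 'pandas.indexes.base.Index'>
    [[18 'm' 'Jack']
     [19 'm' 'Tom']
     [20 'w' 'Mary']]
    The type is: <class 'numpy.ndarray'>
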
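Basic queries

Sorting
Syntax

    df.sort_values(by="Count_AnimalName", ascending=True)

by names the column to sort on, here Count_AnimalName. ascending sets the sort order: the default True sorts in ascending order, False sorts in descending order.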
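
A short runnable sketch of the call above. The table contents and the Row_Labels column are made-up sample data; only the Count_AnimalName column name comes from the notes.

```python
import pandas as pd

# Hypothetical sample data for illustration only
df = pd.DataFrame({
    "Row_Labels": ["BELLA", "ROCKY", "MAX", "LUCY"],
    "Count_AnimalName": [1195, 823, 1153, 710],
})

asc = df.sort_values(by="Count_AnimalName")                     # ascending (default)
desc = df.sort_values(by="Count_AnimalName", ascending=False)   # descending

print(asc)
print(desc)
print(df)  # sort_values returns a new DataFrame; df keeps its original order
```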
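Indexing
Selecting rows and columns
Notes on selecting rows or columns
- A slice inside the square brackets (for example df[:2]) selects rows and operates on rows
- A string inside the square brackets selects a column by its label and operates on columns
There are further selection methods optimized by pandas:
- df.loc selects rows by label
- df.iloc selects rows by position

Changing data by assignment: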
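
As a hedged illustration of assignment through these accessors (the small table and its values are invented for this sketch), .loc and .iloc can appear on the left-hand side of an assignment to change the selected cells in place.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2], [3, 4], [5, 6]],
    index=["one", "two", "three"],
    columns=["a", "b"],
)

df.loc["one", "a"] = 100        # by label: row 'one', column 'a'
df.iloc[1, 1] = 200             # by position: second row, second column
df.loc[df["a"] > 4, "b"] = 0    # boolean mask on rows, then column 'b'

print(df)
```

The last line combines a row condition with a column label, so only column b changes in the rows where a is greater than 4.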
```python
df = pd.DataFrame(np.random.rand(12).reshape(3,4)*100,
                  index = ['one','two','three'],
                  columns = ['a','b','c','d'])
print(df)
print('-----')

data1 = df['a']
data2 = df[['a','c']]
print(data1, type(data1))
print(data2, type(data2))
print('-----')
# Select columns by column name: one column returns a Series,
# several columns return a DataFrame

data3 = df.loc['one']
data4 = df.loc[['one','two']]
print(data3, type(data3))
print(data4, type(data4))
# Select rows by index: one row returns a Series,
# several rows return a DataFrame
```

Partial output (the table, then data1 and data2):

```
               a          b          c          d
one    72.615321  49.816987  57.485645  84.226944
two    46.295674  34.480439  92.267989  17.111412
three  14.699591  92.754997  39.683577  93.255880
one      72.615321
two      46.295674
three    14.699591
Name: a, dtype: float64
               a          c
one    72.615321  57.485645
two    46.295674  92.267989
three  14.699591  39.683577
```

<a name="PJjUO"></a>
#### df[] - selecting columns
Generally used to select columns; it can also select rows

```python
df = pd.DataFrame(np.random.rand(12).reshape(3,4)*100,
                  index = ['one','two','three'],
                  columns = ['a','b','c','d'])
print(df)
print('-----')

data1 = df['a']
data2 = df[['b','c']]  # try data2 = df[['b','c','e']]
print(data1)
print(data2)
# df[] selects columns by default; write the column names inside [] (so columns are
# usually named explicitly instead of using default integers, to avoid confusion with the index)
# Selecting one column returns a Series, printed in Series format
# Selecting several columns returns a DataFrame, printed in DataFrame format

data3 = df[:1]
#data3 = df[0]
#data3 = df['one']
print(data3, type(data3))
# With numbers inside df[], rows are selected, and only by slice;
# a single row cannot be selected this way (df[0])
# The result is a DataFrame even when only one row is selected
# df[] cannot select rows by index label (df['one'])
# Key note: df[col] is generally used to select columns; write column names inside []
```

Output:

```
               a          b          c          d
one    88.490183  93.588825   1.605172  74.610087
two    45.905361  49.257001  87.852426  97.490521
three  95.801001  97.991028  74.451954  64.290587
-----
one      88.490183
two      45.905361
three    95.801001
Name: a, dtype: float64
               b          c
one    93.588825   1.605172
two    49.257001  87.852426
three  97.991028  74.451954
             a          b         c          d
one  88.490183  93.588825  1.605172  74.610087 <class 'pandas.core.frame.DataFrame'>
```
df.loc[] - selecting rows by index

    df1 = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,
                       index = ['one','two','three','four'],
                       columns = ['a','b','c','d'])
    df2 = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,
                       columns = ['a','b','c','d'])
    print(df1)
    print(df2)
    print('-----')

    data1 = df1.loc['one']
    data2 = df2.loc[1]
    print(data1)
    print(data2)
    print('Single-label indexing\n-----')
    # A single label returns a Series

    # In pandas 1.0 and later, .loc raises a KeyError when a listed label is missing,
    # so .reindex is used here to get a NaN row for the missing label 'five'
    data3 = df1.reindex(['two','three','five'])
    data4 = df2.loc[[3,2,1]]
    print(data3)
    print(data4)
    print('Multi-label indexing\n-----')
    # Several labels return a DataFrame
    # The order can be changed

    data5 = df1.loc['one':'three']
    data6 = df2.loc[1:3]
    print(data5)
    print(data6)
    print('Slice indexing')
    # Slices are accepted
    # The end point is included
    # Key note: df.loc[label] selects rows by index, and works with both
    # explicitly specified indexes and the default integer index

Output:

                   a          b          c          d
    one    73.070679   7.169884  80.820532  62.299367
    two    34.025462  77.849955  96.160170  55.159017
    three  27.897582  39.595687  69.280955  49.477429
    four   76.723039  44.995970  22.408450  23.273089
               a          b          c          d
    0  93.871055  28.031989  57.093181  34.695293
    1  22.882809  47.499852  86.466393  86.140909
    2  80.840336  98.120735  84.495414   8.413039
    3  59.695834   1.478707  15.069485  48.775008
    -----
    a    73.070679
    b     7.169884
    c    80.820532
    d    62.299367
    Name: one, dtype: float64
    a    22.882809
    b    47.499852
    c    86.466393
    d    86.140909
    Name: 1, dtype: float64
    Single-label indexing
    -----
                   a          b          c          d
    two    34.025462  77.849955  96.160170  55.159017
    three  27.897582  39.595687  69.280955  49.477429
    five         NaN        NaN        NaN        NaN
               a          b          c          d
    3  59.695834   1.478707  15.069485  48.775008
    2  80.840336  98.120735  84.495414   8.413039
    1  22.882809  47.499852  86.466393  86.140909
    Multi-label indexing
    -----
                   a          b          c          d
    one    73.070679   7.169884  80.820532  62.299367
    two    34.025462  77.849955  96.160170  55.159017
    three  27.897582  39.595687  69.280955  49.477429
               a          b          c          d
    1  22.882809  47.499852  86.466393  86.140909
    2  80.840336  98.120735  84.495414   8.413039
    3  59.695834   1.478707  15.069485  48.775008
    Slice indexing

df.iloc[] - selecting rows by integer position (from 0 to length-1 of the axis)
Works like list indexing: the order is the integer position in the DataFrame, counting from 0

    df = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,
                      index = ['one','two','three','four'],
                      columns = ['a','b','c','d'])
    print(df)
    print('------')

    print(df.iloc[0])
    print(df.iloc[-1])
    #print(df.iloc[4])
    print('Single-position indexing\n-----')
    # Single-position indexing
    # A position beyond the number of rows (such as df.iloc[4]) raises an IndexError

    print(df.iloc[[0,2]])
    print(df.iloc[[3,2,1]])
    print('Multi-position indexing\n-----')
    # Multi-position indexing
    # The order can be changed

    print(df.iloc[1:3])
    print(df.iloc[::2])
    print('Slice indexing')
    # Slice indexing
    # The end point is not included

Output:

                   a          b          c          d
    one    21.848926   2.482328  17.338355  73.014166
    two    99.092794   0.601173  18.598736  61.166478
    three  87.183015  85.973426  48.839267  99.930097
    four   75.007726  84.208576  69.445779  75.546038
    ------
    a    21.848926
    b     2.482328
    c    17.338355
    d    73.014166
    Name: one, dtype: float64
    a    75.007726
    b    84.208576
    c    69.445779
    d    75.546038
    Name: four, dtype: float64
    Single-position indexing
    -----
                   a          b          c          d
    one    21.848926   2.482328  17.338355  73.014166
    three  87.183015  85.973426  48.839267  99.930097
                   a          b          c          d
    four   75.007726  84.208576  69.445779  75.546038
    three  87.183015  85.973426  48.839267  99.930097
    two    99.092794   0.601173  18.598736  61.166478
    Multi-position indexing
    -----
                   a          b          c          d
    two    99.092794   0.601173  18.598736  61.166478
    three  87.183015  85.973426  48.839267  99.930097
                   a          b          c          d
    one    21.848926   2.482328  17.338355  73.014166
    three  87.183015  85.973426  48.839267  99.930097
    Slice indexing

Boolean indexing
Works on the same principle as for Series

    df = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,
                      index = ['one','two','three','four'],
                      columns = ['a','b','c','d'])
    print(df)
    print('------')

    b1 = df < 20
    print(b1, type(b1))
    print(df[b1])  # can also be written as df[df < 20]
    print('------')
    # Without selecting first, every value in the data is tested
    # The result keeps the full shape: True keeps the original value, False becomes NaN

    b2 = df['a'] > 50
    print(b2, type(b2))
    print(df[b2])  # can also be written as df[df['a'] > 50]
    print('------')
    # Test on a single column
    # The result keeps the rows where the column test is True, including the other columns

    b3 = df[['a','b']] > 50
    print(b3, type(b3))
    print(df[b3])  # can also be written as df[df[['a','b']] > 50]
    print('------')
    # Test on several columns
    # The result keeps the full shape: True keeps the original value, False becomes NaN

    b4 = df.loc[['one','three']] < 50
    print(b4, type(b4))
    print(df[b4])  # can also be written as df[df.loc[['one','three']] < 50]
    print('------')
    # Test on several rows
    # The result keeps the full shape: True keeps the original value, False becomes NaN

Output:

                   a          b          c          d
    one    19.185849  20.303217  21.800384  45.189534
    two    50.105112  28.478878  93.669529  90.029489
    three  35.496053  19.248457  74.811841  20.711431
    four   24.604478  57.731456  49.682717  82.132866
    ------
               a      b      c      d
    one     True  False  False  False
    two    False  False  False  False
    three  False   True  False  False
    four   False  False  False  False <class 'pandas.core.frame.DataFrame'>
                   a          b   c   d
    one    19.185849        NaN NaN NaN
    two          NaN        NaN NaN NaN
    three        NaN  19.248457 NaN NaN
    four         NaN        NaN NaN NaN
    ------
    one      False
    two       True
    three    False
    four     False
    Name: a, dtype: bool <class 'pandas.core.series.Series'>
                 a          b          c          d
    two  50.105112  28.478878  93.669529  90.029489
    ------
               a      b
    one    False  False
    two     True  False
    three  False  False
    four   False   True <class 'pandas.core.frame.DataFrame'>
                   a          b   c   d
    one          NaN        NaN NaN NaN
    two    50.105112        NaN NaN NaN
    three        NaN        NaN NaN NaN
    four         NaN  57.731456 NaN NaN
    ------
              a     b      c     d
    one    True  True   True  True
    three  True  True  False  True <class 'pandas.core.frame.DataFrame'>
                   a          b          c          d
    one    19.185849  20.303217  21.800384  45.189534
    two          NaN        NaN        NaN        NaN
    three  35.496053  19.248457        NaN  20.711431
    four         NaN        NaN        NaN        NaN
    ------

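
Row filters are often built from more than one condition. The following sketch goes slightly beyond the notes and uses a small fixed table invented for illustration. Conditions are combined with & (and), | (or) and ~ (not), and each comparison needs its own parentheses because these operators bind more tightly than > and <.

```python
import pandas as pd

df = pd.DataFrame(
    {"a": [10, 60, 35, 80], "b": [20, 30, 90, 70]},
    index=["one", "two", "three", "four"],
)

both = df[(df["a"] > 30) & (df["b"] > 50)]     # rows where both tests hold
either = df[(df["a"] > 70) | (df["b"] < 25)]   # rows where at least one test holds
negated = df[~(df["a"] > 30)]                  # rows where the test fails

print(both)
print(either)
print(negated)
```

Python's own and / or keywords do not work here, because they expect a single truth value and not an element-wise boolean Series.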
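Combined indexing: for example, indexing rows and columns at the same time
Select the columns first and then the rows. This amounts to first filtering the fields of a data set and then choosing the records.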
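
A minimal sketch of this "columns first, then rows" pattern, with a small fixed table made up for the example. It also shows the single-call .loc[rows, columns] form, which is an addition to the notes and produces the same selection.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]],
    index=["one", "two", "three"],
    columns=["a", "b", "c", "d"],
)

# Columns first, then rows (chained selection)
step = df[["a", "c"]].loc[["one", "three"]]

# The same selection in one .loc call: row labels, then column labels
direct = df.loc[["one", "three"], ["a", "c"]]

# Filter rows by a condition, then take the first two columns by position
filtered = df[df["a"] > 1].iloc[:, :2]

print(step)
print(direct)
print(filtered)
```

Chaining is fine for reading data. For assignment, the single .loc[rows, columns] form is usually preferred, because a chained selection may operate on a temporary copy.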
