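Pandas is a core data analysis library for Python. It provides fast, flexible, and expressive data structures designed to make working with relational and labeled data simple and intuitive. Pandas aims to be the fundamental high-level building block for practical, real-world data analysis in Python, and its broader goal is to become the most powerful and flexible open-source data analysis tool available in any language. After years of sustained effort, it is well on its way toward that goal.
Pandas is well suited to the following kinds of data:

Tabular data with heterogeneously typed columns, as in an SQL table or Excel spreadsheet;
Ordered and unordered (not necessarily fixed-frequency) time series data;
Matrix data with row and column labels, homogeneously or heterogeneously typed;
Any other form of observational or statistical data set. The data does not need to be labeled before it is placed into a Pandas data structure.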

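Data structures

Dimensions	Name	Description
1	Series	Labeled one-dimensional, homogeneously typed array
2	DataFrame	Labeled, size-mutable, two-dimensional table with potentially heterogeneously typed columns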

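The Series data structure

# A Series is a labeled one-dimensional array that can hold any data type (integers, strings, floats, Python objects, etc.); the axis labels are collectively called the index
import numpy as np
import pandas as pd
# Import the numpy and pandas modules
ar = np.random.rand(5)
print(ar)
s = pd.Series(ar)
print(s)
print(type(s))
# Inspect the data and its type
print(s.index,type(s.index))
print(s.values,type(s.values))
# .index returns the Series index; here it is a RangeIndex
# .values returns the Series values as an ndarray
# Key point: compared with an ndarray, a Series is an array that carries its own index → one-dimensional array + matching index
# So looking only at the values of a Series gives you an ndarray
# A Series behaves much like an ndarray; indexing and slicing work almost the same way
# Compared with a dict, a Series resembles an ordered dictionary that is looked up by index label instead of by key (a plain dict only guarantees insertion order from Python 3.7 onward)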
------------------------------------
[0.57879139 0.74160843 0.96094125 0.51813092 0.47592715]
0    0.578791
1    0.741608
2    0.960941
3    0.518131
4    0.475927
dtype: float64
<class 'pandas.core.series.Series'>
RangeIndex(start=0, stop=5, step=1) <class 'pandas.core.indexes.range.RangeIndex'>
[0.57879139 0.74160843 0.96094125 0.51813092 0.47592715] <class 'numpy.ndarray'>

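Creating data

pd.Series()
Parameters

data: array-like, iterable, dict, or scalar value. The data stored in the Series.
index: array-like or Index (1d)

Values must be hashable and have the same length as data. Non-unique index values are allowed. Defaults to RangeIndex(0, 1, 2, ..., n) if not provided. If data is a dict and index is also given, the index takes precedence: only the dict keys listed in index are kept, and labels missing from the dict get NaN.

dtype: data type of the resulting Series; inferred from data if not specified.
name: the name to give to the Series.
copy: whether to copy the input data.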
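
The interaction between a dict and an explicit index is easy to get wrong, so here is a minimal sketch of it. The variable names and sample values are made up for illustration; only `pd.Series` and its `index` and `name` parameters come from the text above.

```python
import pandas as pd

prices = {'a': 1, 'b': 2, 'c': 3}  # hypothetical sample data

# Only labels listed in `index` are kept, in that order.
# 'z' is not a key of the dict, so its value becomes NaN
# (which also turns the dtype into float64).
s = pd.Series(prices, index=['c', 'a', 'z'], name='demo')

print(s)
print(s.index.tolist(), s.name)
```

Note that the key 'b' is dropped silently because it is not listed in the index.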

# The Series name attribute: name
s1 = pd.Series(np.random.randn(5))
print(s1)
print('-----')
s2 = pd.Series(np.random.randn(5),name = 'test')
print(s2)
print(s1.name, s2.name,type(s2.name))
# name is a parameter of Series that gives the Series a name
# The .name attribute returns that name (a str here); if no name was set, it is None
s3 = s2.rename('hehehe')
print(s3)
print(s3.name, s2.name)
# .rename() returns a new Series with the new name; the original Series is unchanged
--------------------------------------
0   -0.403084
1    1.369383
2    1.134319
3   -0.635050
4    1.680211
dtype: float64
-----
0   -0.120014
1    1.967648
2    1.142626
3    0.234079
4    0.761357
Name: test, dtype: float64
None test <class 'str'>
0   -0.120014
1    1.967648
2    1.142626
3    0.234079
4    0.761357
Name: hehehe, dtype: float64
hehehe test

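Creating a Series, method 1: from a dict. The dict keys become the index and the dict values become the values

dic = {'a':1 ,'b':2 , 'c':3, '4':4, '5':5}
s = pd.Series(dic)
print(s)
# Note: the keys here are all strings. What happens if the values have more than one type? → dic = {'a':1 ,'b':'hello' , 'c':3, '4':4, '5':5} gives dtype object
# Recent pandas versions keep the dict's insertion order; older versions sorted the keys
-------------------------------------
a    1
b    2
c    3
4    4
5    5
dtype: int64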

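Creating a Series, method 2: from a one-dimensional array

arr = np.random.randn(5)
s = pd.Series(arr)
print(arr)
print(s)
# The default index is integers starting at 0 with step 1
s = pd.Series(arr, index = ['a','b','c','d','e'],dtype = object)
print(s)
# index parameter: sets the index; its length must match the data
# dtype parameter: sets the data type (the np.object alias was removed in NumPy 1.24, so the built-in object is used)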
-------------------------------------
[ 0.11206121  0.1324684   0.59930544  0.34707543 -0.15652941]
0    0.112061
1    0.132468
2    0.599305
3    0.347075
4   -0.156529
dtype: float64
a    0.112061
b    0.132468
c    0.599305
d    0.347075
e   -0.156529
dtype: object

Creating a Series, method 3: from a scalar

s = pd.Series(10, index = range(4))
print(s)
# If data is a scalar, an index must be provided; the value is repeated to match the length of the index
---------------------------
0    10
1    10
2    10
3    10
dtype: int64

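Positional indexing

Integer positions, as in a sequence

s = pd.Series(np.random.rand(5))
print(s)
print(s[0],type(s[0]),s[0].dtype)
print(float(s[0]),type(float(s[0])))
#print(s[-1])
# Positions start at 0
# The result is a numpy.float64,
# which can be converted to a Python float with float()
# numpy.float64 and float both hold a 64-bit double, but they are distinct types
# What does s[-1] give? With the default integer index, [] looks up labels, so s[-1] raises KeyError; use s.iloc[-1] for the last element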
-----------------------------
0    0.924575
1    0.988654
2    0.426333
3    0.216504
4    0.453570
dtype: float64
0.924575004833 <class 'numpy.float64'> float64
0.9245750048328816 <class 'float'>
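
To answer the `s[-1]` question concretely, the sketch below contrasts label lookup with position lookup on a Series that has the default integer index. The sample values are made up; `.iloc` is the standard pandas positional indexer.

```python
import pandas as pd

s = pd.Series([10.0, 20.0, 30.0])  # default RangeIndex: 0, 1, 2

# .iloc is always position-based, so -1 means "last element"
last = s.iloc[-1]

# Plain [] on an integer index is label-based, and there is no label -1
try:
    s[-1]
    raised = False
except KeyError:
    raised = True

print(last, raised)
```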

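Label indexing

s = pd.Series(np.random.rand(5), index = ['a','b','c','d','e'])
print(s)
print(s['a'],type(s['a']),s['a'].dtype)
# Works like positional indexing: put the index label inside []; note that the labels here are strings
sci = s[['a','b','e']]
print(sci,type(sci))
# To select several labels, use [[]] (that is, a list inside [])
# Selecting several labels returns a new Series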
--------------------------
a    0.714630
b    0.213957
c    0.172188
d    0.972158
e    0.875175
dtype: float64
0.714630383451 <class 'numpy.float64'> float64
a    0.714630
b    0.213957
e    0.875175
dtype: float64 <class 'pandas.core.series.Series'>

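Slicing

s1 = pd.Series(np.random.rand(5))
s2 = pd.Series(np.random.rand(5), index = ['a','b','c','d','e'])
print(s1[1:4],s1[4])
print(s2['a':'c'],s2['c'])
print(s2[0:3],s2.iloc[3])  # s2[3] relied on a positional fallback that newer pandas deprecates
print('-----')
# Note: slicing by label includes the end label
print(s2[:-1])
print(s2[::2])
# Slicing by position is written the same way as for a list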
-----------------------------------
1    0.865967
2    0.114500
3    0.369301
dtype: float64 0.411702342342
a    0.717378
b    0.642561
c    0.391091
dtype: float64 0.39109096261
a    0.717378
b    0.642561
c    0.391091
dtype: float64 0.998978363818
-----
a    0.717378
b    0.642561
c    0.391091
d    0.998978
dtype: float64
a    0.717378
c    0.391091
e    0.957639
dtype: float64

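Boolean indexing

s = pd.Series(np.random.rand(3)*100)
s[4] = None  # add a missing value
print(s)
bs1 = s > 50
bs2 = s.isnull()
bs3 = s.notnull()
print(bs1, type(bs1), bs1.dtype)
print(bs2, type(bs2), bs2.dtype)
print(bs3, type(bs3), bs3.dtype)
print('-----')
# Comparing a Series returns a new Series of boolean values
# .isnull() / .notnull() test for missing values (None is Python's null object and NaN is the floating-point "not a number" value; both count as missing)
print(s[s > 50])
print(s[bs3])
# Boolean indexing: write the condition inside []; the condition can be an expression or a boolean Series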
--------------------------------------
0    2.03802
1    40.3989
2    25.2001
4       None
dtype: object
0    False
1    False
2    False
4    False
dtype: bool <class 'pandas.core.series.Series'> bool
0    False
1    False
2    False
4     True
dtype: bool <class 'pandas.core.series.Series'> bool
0     True
1     True
2     True
4    False
dtype: bool <class 'pandas.core.series.Series'> bool
-----
Series([], dtype: object)
0    2.03802
1    40.3989
2    25.2001
dtype: object
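
Conditions can also be combined. The sketch below uses made-up sample values and shows the element-wise operators `&`, `|` and `~`; each condition needs its own parentheses because these operators bind more tightly than comparisons.

```python
import numpy as np
import pandas as pd

s = pd.Series([5.0, 60.0, np.nan, 80.0], index=list('abcd'))

# Element-wise "and": values strictly between 50 and 70
in_range = s[(s > 50) & (s < 70)]

# Element-wise "not": keep everything that is not missing
present = s[~s.isnull()]

print(in_range)
print(present)
```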

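Inspecting data

Reindexing with reindex

Series.reindex conforms the data to a new index, reordering it to match; labels that do not exist in the current index are introduced as missing values

s = pd.Series(np.random.rand(3), index = ['a','b','c'])
print(s)
s1 = s.reindex(['c','b','a','d'])
print(s1)
# .reindex() also takes a list
# The label 'd' does not exist here, so its value is NaN
s2 = s.reindex(['c','b','a','d'], fill_value = 0)
print(s2)
# fill_value parameter: the value used to fill missing entries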
-----------------------------------
a    0.343718
b    0.322228
c    0.746720
dtype: float64
c    0.746720
b    0.322228
a    0.343718
d         NaN
dtype: float64
c    0.746720
b    0.322228
a    0.343718
d    0.000000
dtype: float64

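Series alignment

s1 = pd.Series(np.random.rand(3), index = ['Jack','Marry','Tom'])
s2 = pd.Series(np.random.rand(3), index = ['Wang','Jack','Marry'])
print(s1)
print(s2)
print(s1+s2)
# The main difference between a Series and an ndarray is that operations on Series align automatically on labels
# The order of the index does not affect the calculation; values are matched by label
# Any arithmetic involving a missing value yields a missing value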
--------------------------------------
Jack     0.753732
Marry    0.180223
Tom      0.283704
dtype: float64
Wang     0.309128
Jack     0.533997
Marry    0.626126
dtype: float64
Jack     1.287729
Marry    0.806349
Tom           NaN
Wang          NaN
dtype: float64
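
When the NaN results for unmatched labels are not wanted, `Series.add` accepts a `fill_value`. The sketch below reuses the labels from the example above with made-up numbers.

```python
import pandas as pd

s1 = pd.Series([1.0, 2.0, 3.0], index=['Jack', 'Marry', 'Tom'])
s2 = pd.Series([10.0, 20.0, 30.0], index=['Wang', 'Jack', 'Marry'])

# A label missing from one side is treated as 0 instead of producing NaN
total = s1.add(s2, fill_value=0)
print(total)
```

Whether a missing label should count as 0 depends on the data, so treat this as an option and not as a default.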

Deleting with drop

s = pd.Series(np.random.rand(5), index = list('ngjur'))
print(s)
s1 = s.drop('n')
s2 = s.drop(['g','j'])
print(s1)
print(s2)
print(s)
# drop returns a copy with the elements removed and leaves the original unchanged (inplace=False by default)
----------------------------------------
n    0.876587
g    0.594053
j    0.628232
u    0.360634
r    0.454483
dtype: float64
g    0.594053
j    0.628232
u    0.360634
r    0.454483
dtype: float64
n    0.876587
u    0.360634
r    0.454483
dtype: float64
n    0.876587
g    0.594053
j    0.628232
u    0.360634
r    0.454483
dtype: float64
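
By default `drop` raises KeyError when a label does not exist. A small sketch with made-up data shows how `errors='ignore'` skips such labels:

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=list('abc'))

# 'z' is not in the index; errors='ignore' skips it instead of raising KeyError
kept = s.drop(['a', 'z'], errors='ignore')

print(kept)
print(s)  # the original still has all three elements
```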

Adding

s1 = pd.Series(np.random.rand(5))
s2 = pd.Series(np.random.rand(5), index = list('ngjur'))
print(s1)
print(s2)
s1[5] = 100
s2['a'] = 100
print(s1)
print(s2)
print('-----')
# Add a value directly by assigning to a new index label
s3 = pd.concat([s1, s2])
print(s3)
print(s1)
# pd.concat joins the two Series (Series.append was removed in pandas 2.0)
# pd.concat returns a new Series and does not change the originals
----------------------------------
0    0.516447
1    0.699382
2    0.469513
3    0.589821
4    0.402188
dtype: float64
n    0.615641
g    0.451192
j    0.022328
u    0.977568
r    0.902041
dtype: float64
0      0.516447
1      0.699382
2      0.469513
3      0.589821
4      0.402188
5    100.000000
dtype: float64
n      0.615641
g      0.451192
j      0.022328
u      0.977568
r      0.902041
a    100.000000
dtype: float64
-----
0      0.516447
1      0.699382
2      0.469513
3      0.589821
4      0.402188
5    100.000000
n      0.615641
g      0.451192
j      0.022328
u      0.977568
r      0.902041
a    100.000000
dtype: float64
0      0.516447
1      0.699382
2      0.469513
3      0.589821
4      0.402188
5    100.000000
dtype: float64

Modifying

s = pd.Series(np.random.rand(3), index = ['a','b','c'])
print(s)
s['a'] = 100
s[['b','c']] = 200
print(s)
# Modify values directly through the index, as with a sequence
----------------------------
a    0.873604
b    0.244707
c    0.888685
dtype: float64
a    100.0
b    200.0
c    200.0
dtype: float64

DataFrame

The DataFrame data structure
A DataFrame is a tabular data structure: a "labeled two-dimensional array".
A DataFrame has an index (row labels) and columns (column labels)

data = {'name':['Jack','Tom','Mary'],
        'age':[18,19,20],
        'gender':['m','m','w']}
frame = pd.DataFrame(data)
print(frame)
print(type(frame))
print(frame.index,'\nType:',type(frame.index))
print(frame.columns,'\nType:',type(frame.columns))
print(frame.values,'\nType:',type(frame.values))
# Inspect the data; its type is DataFrame
# .index returns the row labels
# .columns returns the column labels
# .values returns the values as an ndarray
-----------------------------------------------
   name  age gender
0  Jack   18      m
1   Tom   19      m
2  Mary   20      w
<class 'pandas.core.frame.DataFrame'>
RangeIndex(start=0, stop=3, step=1) 
Type: <class 'pandas.core.indexes.range.RangeIndex'>
Index(['name', 'age', 'gender'], dtype='object') 
Type: <class 'pandas.core.indexes.base.Index'>
[['Jack' 18 'm']
 ['Tom' 19 'm']
 ['Mary' 20 'w']] 
Type: <class 'numpy.ndarray'>
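
A dict of lists is only one way to build a DataFrame. As a rough sketch with made-up data, the two constructions below show how to set row and column labels explicitly and how missing keys are handled:

```python
import numpy as np
import pandas as pd

# From a 2-D ndarray, with explicit row and column labels
df_arr = pd.DataFrame(np.arange(6).reshape(2, 3),
                      index=['r1', 'r2'], columns=['a', 'b', 'c'])

# From a list of dicts: keys become columns, missing keys become NaN
df_rec = pd.DataFrame([{'name': 'Jack', 'age': 18}, {'name': 'Tom'}])

print(df_arr)
print(df_rec)
```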

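Basic queries

Sorting

Syntax

df.sort_values(by="Count_AnimalName",ascending=True)

by names the column to sort on (here Count_AnimalName). ascending sets the sort order: True (the default) sorts in ascending order, False sorts in descending order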
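
A small runnable sketch of the call above. Only the column name `Count_AnimalName` comes from the text; the `name` column and all values are hypothetical.

```python
import pandas as pd

# Hypothetical data; only the column name Count_AnimalName comes from the text
df = pd.DataFrame({'name': ['BELLA', 'MAX', 'CHARLIE'],
                   'Count_AnimalName': [1195, 1153, 856]})

ascending_df = df.sort_values(by='Count_AnimalName', ascending=True)
descending_df = df.sort_values(by='Count_AnimalName', ascending=False)

print(ascending_df)
print(descending_df)
```

`sort_values` returns a new DataFrame, so `df` itself keeps its original row order.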

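Indexing

Selecting rows and columns

Notes on selecting rows or columns

A slice inside square brackets selects rows and operates on rows
A string (a column name) inside square brackets selects a column and operates on columns

pandas also provides optimized selection methods:

df.loc selects rows by label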
df.iloc selects rows by integer position

The process of selecting data:

```python
df = pd.DataFrame(np.random.rand(12).reshape(3,4)*100,
                  index = ['one','two','three'],
                  columns = ['a','b','c','d'])
print(df)
print('-----')

data1 = df['a']
data2 = df[['a','c']]
print(data1,type(data1))
print(data2,type(data2))
print('-----')
# Selecting by column name: one column returns a Series, several columns return a DataFrame

data3 = df.loc['one']
data4 = df.loc[['one','two']]
print(data3,type(data3))
print(data4,type(data4))
# Selecting rows by index label: one row returns a Series, several rows return a DataFrame
```
-----------------------------------
               a          b          c          d
one    72.615321  49.816987  57.485645  84.226944
two    46.295674  34.480439  92.267989  17.111412
three  14.699591  92.754997  39.683577  93.255880
-----
one      72.615321
two      46.295674
three    14.699591
Name: a, dtype: float64 <class 'pandas.core.series.Series'>
               a          c
one    72.615321  57.485645
two    46.295674  92.267989
three  14.699591  39.683577 <class 'pandas.core.frame.DataFrame'>
-----
a    72.615321
b    49.816987
c    57.485645
d    84.226944
Name: one, dtype: float64 <class 'pandas.core.series.Series'>
             a          b          c          d
one  72.615321  49.816987  57.485645  84.226944
two  46.295674  34.480439  92.267989  17.111412 <class 'pandas.core.frame.DataFrame'>

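#### df[] - select columns
Generally used to select columns; it can also select rows
```python
df = pd.DataFrame(np.random.rand(12).reshape(3,4)*100,
                   index = ['one','two','three'],
                   columns = ['a','b','c','d'])
print(df)
print('-----')
data1 = df['a']
data2 = df[['b','c']]  # try data2 = df[['b','c','e']]: it raises KeyError because 'e' is not a column
print(data1)
print(data2)
# df[] selects columns by default; put column names inside [] (so columns are usually named explicitly, not left as default integers, to avoid confusion with the index)
# Selecting one column returns a Series, and it prints as a Series
# Selecting several columns returns a DataFrame, and it prints as a DataFrame
data3 = df[:1]
#data3 = df[0]
#data3 = df['one']
print(data3,type(data3))
# With an integer slice, df[] selects rows; only slices work, not a single position (df[0] raises KeyError)
# The result is a DataFrame even when only one row is selected
# df[] cannot select a row by its index label (df['one'] raises KeyError)
# Key takeaway: df[col] is generally used to select columns; put the column name(s) inside []
```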
-----------------------------------
               a          b          c          d
one    88.490183  93.588825   1.605172  74.610087
two    45.905361  49.257001  87.852426  97.490521
three  95.801001  97.991028  74.451954  64.290587
-----
one      88.490183
two      45.905361
three    95.801001
Name: a, dtype: float64
               b          c
one    93.588825   1.605172
two    49.257001  87.852426
three  97.991028  74.451954
             a          b         c          d
one  88.490183  93.588825  1.605172  74.610087 <class 'pandas.core.frame.DataFrame'>

df.loc[] - select rows by index label

df1 = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,
                   index = ['one','two','three','four'],
                   columns = ['a','b','c','d'])
df2 = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,
                   columns = ['a','b','c','d'])
print(df1)
print(df2)
print('-----')
data1 = df1.loc['one']
data2 = df2.loc[1]
print(data1)
print(data2)
print('Single-label indexing\n-----')
# A single label returns a Series
data3 = df1.reindex(['two','three','five'])
data4 = df2.loc[[3,2,1]]
print(data3)
print(data4)
print('Multi-label indexing\n-----')
# Several labels return a DataFrame; since pandas 1.0, .loc raises KeyError for a missing label such as 'five', so reindex is used here to get a NaN row instead
# The order of the labels can be changed
data5 = df1.loc['one':'three']
data6 = df2.loc[1:3]
print(data5)
print(data6)
print('Slice indexing')
# Slices are supported
# The end label is included
# Key takeaway: df.loc[label] selects rows by index label; it works with both a custom index and the default integer index
--------------------------------
               a          b          c          d
one    73.070679   7.169884  80.820532  62.299367
two    34.025462  77.849955  96.160170  55.159017
three  27.897582  39.595687  69.280955  49.477429
four   76.723039  44.995970  22.408450  23.273089
           a          b          c          d
0  93.871055  28.031989  57.093181  34.695293
1  22.882809  47.499852  86.466393  86.140909
2  80.840336  98.120735  84.495414   8.413039
3  59.695834   1.478707  15.069485  48.775008
-----
a    73.070679
b     7.169884
c    80.820532
d    62.299367
Name: one, dtype: float64
a    22.882809
b    47.499852
c    86.466393
d    86.140909
Name: 1, dtype: float64
Single-label indexing
-----
               a          b          c          d
two    34.025462  77.849955  96.160170  55.159017
three  27.897582  39.595687  69.280955  49.477429
five         NaN        NaN        NaN        NaN
           a          b          c          d
3  59.695834   1.478707  15.069485  48.775008
2  80.840336  98.120735  84.495414   8.413039
1  22.882809  47.499852  86.466393  86.140909
Multi-label indexing
-----
               a          b          c          d
one    73.070679   7.169884  80.820532  62.299367
two    34.025462  77.849955  96.160170  55.159017
three  27.897582  39.595687  69.280955  49.477429
           a          b          c          d
1  22.882809  47.499852  86.466393  86.140909
2  80.840336  98.120735  84.495414   8.413039
3  59.695834   1.478707  15.069485  48.775008
Slice indexing

df.iloc[] - select rows by integer position (from 0 to length-1 along the axis)

Works like list indexing: the order is the integer position within the DataFrame, counting from 0


df = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,
                   index = ['one','two','three','four'],
                   columns = ['a','b','c','d'])
print(df)
print('------')
print(df.iloc[0])
print(df.iloc[-1])
#print(df.iloc[4])
print('Single-position indexing\n-----')
# Single-position indexing
# An integer position beyond the number of rows cannot be used (df.iloc[4] raises IndexError)
print(df.iloc[[0,2]])
print(df.iloc[[3,2,1]])
print('Multi-position indexing\n-----')
# Multi-position indexing
# The order of the positions can be changed
print(df.iloc[1:3])
print(df.iloc[::2])
print('Slice indexing')
# Slice indexing
# The end position is not included
---------------------------------------
               a          b          c          d
one    21.848926   2.482328  17.338355  73.014166
two    99.092794   0.601173  18.598736  61.166478
three  87.183015  85.973426  48.839267  99.930097
four   75.007726  84.208576  69.445779  75.546038
------
a    21.848926
b     2.482328
c    17.338355
d    73.014166
Name: one, dtype: float64
a    75.007726
b    84.208576
c    69.445779
d    75.546038
Name: four, dtype: float64
Single-position indexing
-----
               a          b          c          d
one    21.848926   2.482328  17.338355  73.014166
three  87.183015  85.973426  48.839267  99.930097
               a          b          c          d
four   75.007726  84.208576  69.445779  75.546038
three  87.183015  85.973426  48.839267  99.930097
two    99.092794   0.601173  18.598736  61.166478
Multi-position indexing
-----
               a          b          c          d
two    99.092794   0.601173  18.598736  61.166478
three  87.183015  85.973426  48.839267  99.930097
               a          b          c          d
one    21.848926   2.482328  17.338355  73.014166
three  87.183015  85.973426  48.839267  99.930097
Slice indexing

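Boolean indexing

Works on the same principle as for Series

df = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,
                   index = ['one','two','three','four'],
                   columns = ['a','b','c','d'])
print(df)
print('------')
b1 = df < 20
print(b1,type(b1))
print(df[b1])  # can also be written as df[df < 20]
print('------')
# Without selecting a column first, the condition is evaluated on every value
# The result keeps the full shape: True keeps the original value, False becomes NaN
b2 = df['a'] > 50
print(b2,type(b2))
print(df[b2])  # can also be written as df[df['a'] > 50]
print('------')
# Condition on a single column
# The result keeps the rows where the condition is True, including their other columns
b3 = df[['a','b']] > 50
print(b3,type(b3))
print(df[b3])  # can also be written as df[df[['a','b']] > 50]
print('------')
# Condition on several columns
# The result keeps the full shape: True keeps the original value; False, or a column not tested, becomes NaN
b4 = df.loc[['one','three']] < 50
print(b4,type(b4))
print(df[b4])  # can also be written as df[df.loc[['one','three']] < 50]
print('------')
# Condition on several rows
# The result keeps the full shape: True keeps the original value; False, or a row not tested, becomes NaN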
-------------------------------
               a          b          c          d
one    19.185849  20.303217  21.800384  45.189534
two    50.105112  28.478878  93.669529  90.029489
three  35.496053  19.248457  74.811841  20.711431
four   24.604478  57.731456  49.682717  82.132866
------
           a      b      c      d
one     True  False  False  False
two    False  False  False  False
three  False   True  False  False
four   False  False  False  False <class 'pandas.core.frame.DataFrame'>
               a          b   c   d
one    19.185849        NaN NaN NaN
two          NaN        NaN NaN NaN
three        NaN  19.248457 NaN NaN
four         NaN        NaN NaN NaN
------
one      False
two       True
three    False
four     False
Name: a, dtype: bool <class 'pandas.core.series.Series'>
             a          b          c          d
two  50.105112  28.478878  93.669529  90.029489
------
           a      b
one    False  False
two     True  False
three  False  False
four   False   True <class 'pandas.core.frame.DataFrame'>
               a          b   c   d
one          NaN        NaN NaN NaN
two    50.105112        NaN NaN NaN
three        NaN        NaN NaN NaN
four         NaN  57.731456 NaN NaN
------
          a     b      c     d
one    True  True   True  True
three  True  True  False  True <class 'pandas.core.frame.DataFrame'>
               a          b          c          d
one    19.185849  20.303217  21.800384  45.189534
two          NaN        NaN        NaN        NaN
three  35.496053  19.248457        NaN  20.711431
four         NaN        NaN        NaN        NaN
------

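Combined indexing: for example, selecting rows and columns at the same time

Select columns first, then rows: this is like first filtering the fields of a data set and then choosing which records to keep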
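
A minimal sketch of that idea, with made-up data: the chained form selects columns and then rows, and a single `.loc` call with both row and column labels gives the same result.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(16).reshape(4, 4),
                  index=['one', 'two', 'three', 'four'],
                  columns=['a', 'b', 'c', 'd'])

# Columns first, then rows (chained selection)
chained = df[['a', 'c']].loc[['one', 'three']]

# The same selection in a single call: row labels, then column labels
single = df.loc[['one', 'three'], ['a', 'c']]

print(chained)
print(single)
```

For reading data the two forms are interchangeable. For assignment the single `.loc` call is the safer choice, because chained selection may operate on a temporary copy.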