Series
- Introduction
- Creating a Series
  - From a list
  - From a dict
DataFrame
- Introduction
- Creating a DataFrame
  - From a list
  - From dicts
  - From ndarrays
- Simple whole-column operations
Indexing
- Introduction
- Indexing a Series
- Indexing a DataFrame
- Advanced use of loc and iloc
Aligned arithmetic
- Series alignment
- DataFrame alignment
Applying functions
Sorting
Handling missing data

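## Series

### Introduction

A Series is a one-dimensional, array-like object made up of a sequence of values (of any NumPy data type) and an associated sequence of index labels.
When a Series is printed, the index is shown on the left and the values on the right. If no index is given, a default integer index starting at 0 is created automatically.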

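### Creating a Series

Syntax: `pandas.Series(data, index, dtype, name, copy)`

Parameters:
- data: the data (an ndarray or other array-like object, a dict, or a scalar)
- index: the index labels; defaults to 0, 1, 2, ... if not given
- dtype: the data type; inferred from the data by default
- name: a name for the Series
- copy: whether to copy the input data; defaults to False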
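As a rough illustration of the parameters listed above, the sketch below passes each of them explicitly. The names `arr` and `s` and the label values are made up for this example only.

```python
import numpy as np
import pandas as pd

# Hypothetical input data, used only for this illustration
arr = np.array([1, 2, 3])

# Spell out index, dtype and name instead of relying on the defaults
s = pd.Series(data=arr, index=['a', 'b', 'c'], dtype='float64', name='score')

print(s.index.tolist())  # ['a', 'b', 'c']
print(s.dtype)           # float64
print(s.name)            # score
```

Leaving out `index`, `dtype` and `name` would give a 0-based integer index, an inferred integer dtype, and no name.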
#### From a list

```python
import pandas as pd

var1 = pd.Series(range(100, 103), name='hhhh')
print(var1.head(2))
'''
0    100
1    101
Name: hhhh, dtype: int64
'''
print(var1[0])      # 100
print(var1.index)   # RangeIndex(start=0, stop=3, step=1)
print(var1.values)  # [100 101 102]

var2 = pd.Series(['Alan', 'Bob', 'Cindy'], index=['x', 'y', 'z'])
print(var2)
'''
x     Alan
y      Bob
z    Cindy
dtype: object
'''
print(var2['x'])  # Alan
```

<a name="rcV9o"></a>
#### From a dict

```python
import pandas as pd

# The dict keys become the index labels
var1 = pd.Series({'x': "Google", 'y': "Runoob", 'z': "Wiki"})
print(var1)
'''
x    Google
y    Runoob
z      Wiki
dtype: object
'''

# When an index is also given, only the matching keys are kept
var2 = pd.Series({'x': "Google", 'y': "Runoob", 'z': "Wiki"}, index=['x', 'z'])
print(var2)
'''
x    Google
z      Wiki
dtype: object
'''
```

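## DataFrame

### Introduction

A DataFrame is a tabular data structure made up of an ordered set of columns, each of which can hold a different type of value. A DataFrame has both a row index and a column index. It can be thought of as a dict of Series that all share the same index, with the data stored in a two-dimensional layout.

### Creating a DataFrame

Syntax: `pandas.DataFrame(data, index, columns, dtype, copy)`

Parameters:
- data: the data (ndarray, Series, list, dict, and similar types)
- index: the row labels; defaults to RangeIndex(0, 1, 2, ..., n-1)
- columns: the column labels; defaults to RangeIndex(0, 1, 2, ..., n-1)
- dtype: the data type
- copy: whether to copy the input data; ndarray input is not copied by default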
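#### From a list

```python
import numpy as np
import pandas as pd

print(pd.DataFrame(np.arange(6).reshape(2, 3)))
'''
   0  1  2
0  0  1  2
1  3  4  5
'''

data = [['Google'], ['Runoob', 12], ['Wiki', 13]]
# Built row by row; positions with no data become NaN.
# dtype=float is not passed here: recent pandas versions raise an error
# when a column (here 'Site') cannot be cast to the requested dtype.
print(pd.DataFrame(data, columns=['Site', 'Age']))
'''
     Site   Age
0  Google   NaN
1  Runoob  12.0
2    Wiki  13.0
'''
```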

<a name="spm7c"></a>
#### From dicts

```python
import pandas as pd

data = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
print(pd.DataFrame(data))
# Built row by row (one dict per row); keys missing from a row become NaN
'''
   a   b     c
0  1   2   NaN
1  5  10  20.0
'''
```

#### From ndarrays

The DataFrame is built column by column: each dict value (a list or an ndarray) becomes one column, so all of them must have the same length.

```python
import numpy as np
import pandas as pd

data = {
    'calories': [420, 380, 390],
    'duration': [50, 40, 45]
}
print(pd.DataFrame(data, index=['day1', 'day2', 'day3']))
'''
      calories  duration
day1       420        50
day2       380        40
day3       390        45
'''

# Scalar values are broadcast to the length of the other columns
data = {'A': 1,
        'B': pd.Timestamp('20220709'),
        'C': pd.Series(1, index=list(range(4)), dtype='float32'),
        'D': np.array([1, 2, 3, 4], dtype='int32'),
        'E': ["Python", "Java", "C++", "C"],
        'F': 'program'}
print(pd.DataFrame(data))
'''
   A          B    C  D       E        F
0  1 2022-07-09  1.0  1  Python  program
1  1 2022-07-09  1.0  2    Java  program
2  1 2022-07-09  1.0  3     C++  program
3  1 2022-07-09  1.0  4       C  program
'''
```

### Simple whole-column operations

```python
import pandas as pd

data = {
    'calories': [420, 380, 390],
    'duration': [50, 40, 45]
}
df = pd.DataFrame(data, index=['day1', 'day2', 'day3'])
print(df)
'''
      calories  duration
day1       420        50
day2       380        40
day3       390        45
'''

# Operation 1: get a whole column by its column label
print(df['calories'])  # same as print(df.calories)
'''
day1    420
day2    380
day3    390
Name: calories, dtype: int64
'''
print(type(df['calories']))  # <class 'pandas.core.series.Series'>

# Operation 2: add a whole column
df['A'] = df['calories'] + 100
print(df)
'''
      calories  duration    A
day1       420        50  520
day2       380        40  480
day3       390        45  490
'''

# Operation 3: delete a whole column
del df['A']
print(df)
'''
      calories  duration
day1       420        50
day2       380        40
day3       390        45
'''
```

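## Indexing

### Introduction

First, note that the index of both a Series and a DataFrame is an Index object.

Index objects are immutable, which protects the data from accidental relabelling.
Common index types:
- Int64Index: integer index (pandas 2.0 and later represent this as a plain Index with an int64 dtype)
- MultiIndex: hierarchical index
- DatetimeIndex: timestamp index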
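To make the immutability point concrete, here is a small sketch (the Series `ser_demo` and its labels are invented for this example). Changing one label in place fails, while assigning a whole new index is allowed.

```python
import pandas as pd

# Hypothetical Series used only for this illustration
ser_demo = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

# Changing a single label in place is rejected with a TypeError
try:
    ser_demo.index[0] = 'z'
    mutated = True
except TypeError:
    mutated = False

# Replacing the whole index with a new object is allowed
ser_demo.index = ['x', 'y', 'z']

print(mutated)               # False
print(list(ser_demo.index))  # ['x', 'y', 'z']
```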
### Indexing a Series

Note: loc indexes by label; iloc indexes by integer position.

```python
import pandas as pd

ser = pd.Series(range(5), index=['a', 'b', 'c', 'd', 'e'])
print(ser)
'''
a    0
b    1
c    2
d    3
e    4
dtype: int64
'''

# Method 1: single-element indexing
print(ser.iloc[0])  # 0; older pandas also accepts ser[0], now deprecated for a non-integer index
print(ser['a'])     # 0, same as ser.loc['a']

# Method 2: slicing
print(ser[1:3])  # same as ser.iloc[1:3]; the end position is excluded
'''
b    1
c    2
dtype: int64
'''
print(ser['b':'d'])  # same as ser.loc['b':'d']; the end label is included
'''
b    1
c    2
d    3
dtype: int64
'''

# Method 3: non-contiguous indexing
print(ser.iloc[[0, 2, 4]])  # older pandas also accepts ser[[0, 2, 4]]
'''
a    0
c    2
e    4
dtype: int64
'''
print(ser[['a', 'e']])  # same as ser.loc[['a', 'e']]
'''
a    0
e    4
dtype: int64
'''

# Method 4: Boolean indexing
print(ser > 2)
'''
a    False
b    False
c    False
d     True
e     True
dtype: bool
'''
print(ser[ser > 2])
'''
d    3
e    4
dtype: int64
'''
```

<a name="qiPHd"></a>
### Indexing a DataFrame

Note: loc indexes by label; iloc indexes by integer position.

```python
import pandas as pd
import numpy as np

df = pd.DataFrame(np.arange(12).reshape(3, 4), index=['a', 'b', 'c'], columns=['A', 'B', 'C', 'D'])
print(df)
'''
   A  B   C   D
a  0  1   2   3
b  4  5   6   7
c  8  9  10  11
'''

# Method 1: column indexing; a single label in [] selects a column, never a row
print(df['A'])  # df[0] does not work
# The dtype shown is int32 or int64 depending on the platform
'''
a    0
b    4
c    8
Name: A, dtype: int32
'''
print(type(df['A']))  # <class 'pandas.core.series.Series'>

print(df[['A']])  # df[[0]] does not work
'''
   A
a  0
b  4
c  8
'''
print(type(df[['A']]))  # <class 'pandas.core.frame.DataFrame'>

# Method 2: slicing; applies to rows only, not to columns
print(df['a':'a'])  # df['a':'a'] is equivalent to df[0:1]
'''
   A  B  C  D
a  0  1  2  3
'''

print(df['a':'b'])  # df['a':'b'] is equivalent to df[0:2]
'''
   A  B  C  D
a  0  1  2  3
b  4  5  6  7
'''
print(type(df['a':'b']))  # <class 'pandas.core.frame.DataFrame'>

# Method 3: non-contiguous indexing; applies to columns only, not to rows
print(df[['A', 'C']])  # df[[0, 2]] does not work
'''
   A   C
a  0   2
b  4   6
c  8  10
'''
print(type(df[['A', 'C']]))  # <class 'pandas.core.frame.DataFrame'>
```
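The Series section lists Boolean indexing as a fourth method. The same idea carries over to a DataFrame, where a Boolean Series selects rows. The sketch below rebuilds the same 3x4 frame under the hypothetical name `df_bool`.

```python
import numpy as np
import pandas as pd

df_bool = pd.DataFrame(np.arange(12).reshape(3, 4),
                       index=['a', 'b', 'c'], columns=['A', 'B', 'C', 'D'])

# One True/False value per row, computed from column A
mask = df_bool['A'] > 0

# Keep only the rows where the mask is True
selected = df_bool[mask]

print(selected.index.tolist())  # ['b', 'c']
print(selected['D'].tolist())   # [7, 11]
```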

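### Advanced use of loc and iloc

On a DataFrame, loc and iloc are used in the same way, except that loc takes labels (and label slices include the end label) while iloc takes integer positions. Only iloc is shown here.

```python
import pandas as pd
import numpy as np

df = pd.DataFrame(np.arange(12).reshape(3, 4), index=['a', 'b', 'c'], columns=['A', 'B', 'C', 'D'])
print(df)
'''
   A  B   C   D
a  0  1   2   3
b  4  5   6   7
c  8  9  10  11
'''
# A list of row positions plus a slice of column positions
print(df.iloc[[0, 1], 0:3])
'''
   A  B  C
a  0  1  2
b  4  5  6
'''
# Slices on both axes
print(df.iloc[0:2, 0:3])
'''
   A  B  C
a  0  1  2
b  4  5  6
'''
# Non-contiguous rows, all columns
print(df.iloc[[0, 2]])
'''
   A  B   C   D
a  0  1   2   3
c  8  9  10  11
'''
# A one-column slice returns a DataFrame
print(df.iloc[:, 0:1])
'''
   A
a  0
b  4
c  8
'''
# A single column position returns a Series
print(df.iloc[:, 0])
'''
a    0
b    4
c    8
Name: A, dtype: int32
'''
# A single row position returns a Series
print(df.iloc[0])
'''
A    0
B    1
C    2
D    3
Name: a, dtype: int32
'''
# One row, a slice of columns
print(df.iloc[0, 0:2])
'''
A    0
B    1
Name: a, dtype: int32
'''
```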
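For comparison, here is a sketch of label-based selection with loc on the same data (rebuilt here as `df_loc`, a name chosen for this example). A label slice includes its end label, so `'a':'b'` and `'A':'C'` cover the same cells as the positional slices `0:2` and `0:3`.

```python
import numpy as np
import pandas as pd

df_loc = pd.DataFrame(np.arange(12).reshape(3, 4),
                      index=['a', 'b', 'c'], columns=['A', 'B', 'C', 'D'])

# Label slices include both ends; position slices exclude the end
by_label = df_loc.loc['a':'b', 'A':'C']
by_position = df_loc.iloc[0:2, 0:3]
print(by_label.equals(by_position))  # True

# A list of row labels with a single column label returns a Series
print(df_loc.loc[['a', 'c'], 'B'].tolist())  # [1, 9]

# A single row label and a single column label return one value
print(df_loc.loc['b', 'D'])  # 7
```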

## Aligned arithmetic

### Series alignment

```python
import pandas as pd

s1 = pd.Series(range(0, 3), index=range(3))
s2 = pd.Series(range(10, 15), index=range(5))
print(s1)
'''
0    0
1    1
2    2
dtype: int64
'''
print(s2)
'''
0    10
1    11
2    12
3    13
4    14
dtype: int64
'''
# Labels present in only one Series produce NaN
print(s1 + s2)
'''
0    10.0
1    12.0
2    14.0
3     NaN
4     NaN
dtype: float64
'''
# fill_value stands in for the missing side before adding
print(s1.add(s2, fill_value=0))
'''
0    10.0
1    12.0
2    14.0
3    13.0
4    14.0
dtype: float64
'''
```
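The example above uses integer labels that happen to match positions. The sketch below (with the made-up names `left` and `right`) uses string labels in a different order to show that values are paired by label rather than by position.

```python
import pandas as pd

left = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
right = pd.Series([10, 20, 30], index=['c', 'b', 'a'])

# Values are matched by label: 'a' pairs with 'a', wherever it sits
total = left + right

print(total['a'])  # 31  (1 + 30)
print(total['b'])  # 22  (2 + 20)
print(total['c'])  # 13  (3 + 10)
```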

### DataFrame alignment

```python
import pandas as pd
import numpy as np

df1 = pd.DataFrame(np.ones((2, 2)), columns=['a', 'b'])
df2 = pd.DataFrame(np.ones((3, 3)), columns=['a', 'b', 'c'])
print(df1)
'''
     a    b
0  1.0  1.0
1  1.0  1.0
'''
print(df2)
'''
     a    b    c
0  1.0  1.0  1.0
1  1.0  1.0  1.0
2  1.0  1.0  1.0
'''
# Alignment happens on both the row index and the column index
print(df1 + df2)
'''
     a    b   c
0  2.0  2.0 NaN
1  2.0  2.0 NaN
2  NaN  NaN NaN
'''
print(df1.add(df2, fill_value=0))
'''
     a    b    c
0  2.0  2.0  1.0
1  2.0  2.0  1.0
2  1.0  1.0  1.0
'''
```

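## Applying functions

Series and DataFrame behave in much the same way here, so the examples use a DataFrame.

### Using NumPy functions directly

```python
import pandas as pd
import numpy as np

# The values are random, so the printed numbers differ on every run
df = pd.DataFrame(np.random.randn(2, 3) - 1)
print(df)
'''
          0         1         2
0 -1.249877  1.732052 -2.438650
1 -1.240503 -0.739373 -1.734294
'''
print(np.abs(df))
'''
          0         1         2
0  1.249877  1.732052  2.438650
1  1.240503  0.739373  1.734294
'''
```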

### Using apply to run a function on rows or columns

```python
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(2, 3) - 1)
print(df)
'''
          0         1         2
0 -0.961372 -1.083071 -0.950709
1  0.259989  1.168447  0.265873
'''
print(df.apply(lambda x: x.max()))  # axis 0 by default: the function receives each column (top to bottom)
'''
0    0.259989
1    1.168447
2    0.265873
dtype: float64
'''
print(df.apply(lambda x: x.max(), axis=1))  # axis=1: the function receives each row (left to right)
'''
0   -0.950709
1    1.168447
dtype: float64
'''
```
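Because the example above uses random numbers, its output changes on every run. The sketch below uses fixed values (the frame `df_apply` is invented for this example) so the two axis settings can be checked by hand.

```python
import pandas as pd

df_apply = pd.DataFrame({'x': [1, 5, 3], 'y': [10, 2, 8]})

# Default axis=0: the function is called once per column
col_range = df_apply.apply(lambda col: col.max() - col.min())

# axis=1: the function is called once per row
row_range = df_apply.apply(lambda row: row.max() - row.min(), axis=1)

print(col_range.tolist())  # [4, 8]
print(row_range.tolist())  # [9, 3, 5]
```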

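### Using applymap to run a function on every element

Note: applymap is deprecated as of pandas 2.1, where DataFrame.map does the same job.

```python
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(2, 3) - 1)
print(df)
'''
          0         1         2
0 -0.171078 -0.951804 -2.561237
1 -0.944916 -2.440723 -1.218804
'''
# Format every element as a string with two decimal places
print(df.applymap(lambda x: '%.2f' % x))
'''
       0      1      2
0  -0.17  -0.95  -2.56
1  -0.94  -2.44  -1.22
'''
# Take the absolute value of every element
print(df.applymap(lambda x: np.abs(x)))
'''
          0         1         2
0  0.171078  0.951804  2.561237
1  0.944916  2.440723  1.218804
'''
```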
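Since pandas 2.1 deprecates `applymap` in favour of `DataFrame.map`, one possible way to write element-wise code that runs on both older and newer versions is sketched below. The frame `df_fmt` and the helper name `elementwise` are assumptions made for this example.

```python
import pandas as pd

df_fmt = pd.DataFrame({'a': [1.234, -5.678], 'b': [0.5, 2.0]})

# Prefer DataFrame.map when it exists (pandas 2.1+), else fall back to applymap
elementwise = df_fmt.map if hasattr(df_fmt, 'map') else df_fmt.applymap

# Format every element as a string with two decimal places
formatted = elementwise(lambda x: '%.2f' % x)

print(formatted['a'].tolist())  # ['1.23', '-5.68']
print(formatted['b'].tolist())  # ['0.50', '2.00']
```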

## Sorting

### Sorting by index

Use sort_index(). By default it sorts in ascending order along axis 0; pass ascending=False for descending order.

```python
import pandas as pd
import numpy as np

# Random values and random labels, so the output differs on every run
df = pd.DataFrame(np.random.randn(3, 3),
                  index=np.random.randint(3, size=3),
                  columns=np.random.randint(3, size=3))
print(df)
'''
          1         2         0
2  0.497309 -2.689734 -0.653441
0  1.245357 -0.563550  0.057072
2 -0.954434 -1.055197 -0.875219
'''
# Sort the rows by row label
print(df.sort_index())
'''
          1         2         0
0  1.245357 -0.563550  0.057072
2  0.497309 -2.689734 -0.653441
2 -0.954434 -1.055197 -0.875219
'''
# Sort the columns by column label, in descending order
print(df.sort_index(axis=1, ascending=False))
'''
          2         1         0
2 -2.689734  0.497309 -0.653441
0 -0.563550  1.245357  0.057072
2 -1.055197 -0.954434 -0.875219
'''
```

### Sorting by value

Use sort_values(). The by argument names the label to sort on: a column label when sorting rows (axis 0, the default), or a row label when sorting columns (axis=1). The order is ascending by default; pass ascending=False for descending order.

```python
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(3, 3), columns=['A', 'B', 'C'])
print(df)
'''
          A         B         C
0 -0.042903 -0.862099  1.253066
1  0.372508  0.492648 -0.416308
2  0.222194 -1.288317 -0.340111
'''
# Reorder the rows by the values in column A
print(df.sort_values(by='A'))
'''
          A         B         C
0 -0.042903 -0.862099  1.253066
2  0.222194 -1.288317 -0.340111
1  0.372508  0.492648 -0.416308
'''
# Reorder the columns by the values in row 2, in descending order
print(df.sort_values(by=2, axis=1, ascending=False))
'''
          A         C         B
0 -0.042903  1.253066 -0.862099
1  0.372508 -0.416308  0.492648
2  0.222194 -0.340111 -1.288317
'''
```
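The meaning of `by` changes with `axis`, which is easy to get wrong. The sketch below uses fixed values (the frame `df_sort` is invented for this example) so the result can be verified by hand.

```python
import pandas as pd

df_sort = pd.DataFrame({'A': [3, 1, 2], 'B': [9, 7, 8], 'C': [4, 6, 5]},
                       index=[0, 1, 2])

# axis=0 (default): 'by' is a column label, and the rows are reordered
by_col = df_sort.sort_values(by='A')

# axis=1: 'by' is a row label, and the columns are reordered.
# Row 0 holds A=3, B=9, C=4, so descending order is B, C, A.
by_row = df_sort.sort_values(by=0, axis=1, ascending=False)

print(by_col.index.tolist())    # [1, 2, 0]
print(by_row.columns.tolist())  # ['B', 'C', 'A']
```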

## Handling missing data

- isnull(): tests whether each value is NaN
- dropna(axis=0): drops the rows (axis=0) or columns (axis=1) that contain NaN
- fillna(num): replaces each NaN with num

```python
import pandas as pd
import numpy as np

df = pd.DataFrame([np.random.randn(3), [1, 2, np.nan], [np.nan, 4, np.nan]])

print(df)
'''
          0         1         2
0 -1.628242 -0.226639  1.508853
1  1.000000  2.000000       NaN
2       NaN  4.000000       NaN
'''

print(df.isnull())
'''
       0      1      2
0  False  False  False
1  False  False   True
2   True  False   True
'''

# Drop every row that contains a NaN
print(df.dropna())
'''
          0         1         2
0 -1.628242 -0.226639  1.508853
'''

# Drop every column that contains a NaN
print(df.dropna(axis=1))
'''
          1
0 -0.226639
1  2.000000
2  4.000000
'''

print(df.fillna(100))
'''
            0         1           2
0   -1.628242 -0.226639    1.508853
1    1.000000  2.000000  100.000000
2  100.000000  4.000000  100.000000
'''
```
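The three methods are often combined. As a minimal sketch with fixed values (the frame `df_nan` is invented for this example), the code below counts the missing values per column before dropping or filling them.

```python
import numpy as np
import pandas as pd

df_nan = pd.DataFrame({'x': [1.0, np.nan, 3.0],
                       'y': [np.nan, np.nan, 6.0]})

# isnull() gives a True/False frame; summing counts the NaN values per column
n_missing = df_nan.isnull().sum()

# dropna() keeps only the rows that have no NaN at all
complete_rows = df_nan.dropna()

# fillna(0) replaces every NaN with 0 and returns a new frame
filled = df_nan.fillna(0)

print(n_missing.tolist())            # [1, 2]
print(complete_rows.index.tolist())  # [2]
print(filled['y'].tolist())          # [0.0, 0.0, 6.0]
```

None of these calls changes `df_nan` itself; each one returns a new object.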
