1. Basic Data Structures

1.1 Series

A Series is a fixed-length, ordered sequence that behaves like a dict.
Internally it is backed by two ndarrays: one for the values and one for the index.

1.1.1 Three ways to initialize

```
Parameters
----------
data : array-like, Iterable, dict, or scalar value
    Contains data stored in Series.

    .. versionchanged :: 0.23.0
       If data is a dict, argument order is maintained for Python 3.6
       and later.
index : array-like or Index (1d)
    Values must be hashable and have the same length as `data`.
    Non-unique index values are allowed. Will default to
    RangeIndex (0, 1, 2, ..., n) if not provided. If both a dict and index
    sequence are used, the index will override the keys found in the dict.
dtype : str, numpy.dtype, or ExtensionDtype, optional
    dtype for the output Series. If not specified, this will be
    inferred from `data`.
    See the :ref:`user guide <basics.dtypes>` for more usages.
copy : bool, default False
    Copy input data.
```
```python
from pandas import Series
import numpy as np

# Method 1: initialize from a list; the list entries become the values
li = [1, 2, 3, 4, 5]
x1 = Series(li)  # the index defaults to 0, 1, 2, ...
# ---OUTPUT---
0    1
1    2
2    3
3    4
4    5
dtype: int64

# Method 2: initialize from a numpy.ndarray
# (the index must have the same length as the data)
n = np.random.randn(4)
x1 = Series(n, index=['a', 'b', 'c', 'd'])
# ---OUTPUT---
a    0.377831
b    0.629109
c    0.841532
d   -0.429905
dtype: float64

# Method 3: initialize from a dict; the keys become the index
d = {'a': 5, 'b': 6, 'c': 7, 'd': 8}
x2 = Series(d)
# ---OUTPUT---
a    5
b    6
c    7
d    8
dtype: int64
```

1.1.2 Modifying the index

```python
import pandas as pd

s1 = pd.Series([1, 2, 3])
# ---OUTPUT---
0    1
1    2
2    3
dtype: int64

# assigning to .index replaces the index labels in place
s1.index = ['a', 'b', 'c']
# ---OUTPUT---
a    1
b    2
c    3
dtype: int64
```

1.1.3 Vertical concatenation

```python
s1.append(s2)  # concatenates s2 below s1, returning a new Series
```
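Note that `Series.append` was deprecated in pandas 1.4 and removed in pandas 2.0; `pd.concat` performs the same vertical concatenation. A minimal sketch with made-up data:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([4, 5], index=['d', 'e'])

# pd.concat stacks the Series vertically; index labels are kept as-is
# (duplicate labels are allowed, just as with append)
s3 = pd.concat([s1, s2])
print(s3)
```

The result has the five labels 'a' through 'e' with values 1 through 5.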

1.1.4 Dropping elements by index label

```python
si.drop('e')  # returns a Series with the element labeled 'e' removed
```

1.1.5 Accessing elements by index label

```python
# set the value at index label 'A' to 6
si['A'] = 6
```

1.1.6 Slicing a Series

```python
si[:3]  # the first three elements; the result is still a Series
```
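One subtlety worth remembering: positional slices exclude the end, while label-based slices include it. A small sketch with a made-up `si`:

```python
import pandas as pd

si = pd.Series([10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])

head3 = si[:3]     # positional slice: first three elements, end position excluded
lab = si['a':'c']  # label slice: the end label 'c' is INCLUDED

# both select the rows labeled 'a', 'b', 'c' here
print(head3.equals(lab))
```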

1.1.7 Series arithmetic

```python
# S3
a    1
b    2
c    3
d    4
e    5
dtype: int64
# S4
a    1
b    2
c    3
d    4
A    6
B    1
C    2
D    3
E    4
dtype: int64
```

  • Addition/subtraction/multiplication/division align on the index; a label present in only one operand is filled with NaN in the result.

```python
S4.add(S3)
# ---OUTPUT---
A    NaN
B    NaN
C    NaN
D    NaN
E    NaN
a    2.0
b    4.0
c    6.0
d    8.0
e    NaN
dtype: float64
```

```python
S4.sub(S3)
# ---OUTPUT---
A    NaN
B    NaN
C    NaN
D    NaN
E    NaN
a    0.0
b    0.0
c    0.0
d    0.0
e    NaN
dtype: float64
```

1.1.8 ufuncs and reductions

```python
# median
si.median()
# sum
si.sum()
# maximum
si.max()
# minimum
si.min()
```
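These reductions skip missing values by default (`skipna=True`), which matters once a Series contains NaN. A quick sketch with made-up data:

```python
import numpy as np
import pandas as pd

si = pd.Series([1.0, np.nan, 3.0, 5.0])

# NaN is ignored by default, so the statistics are computed over [1, 3, 5]
print(si.sum())     # 9.0
print(si.median())  # 3.0
print(si.max())     # 5.0
print(si.min())     # 1.0
```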

1.2 DataFrame: a two-dimensional table

A DataFrame is similar to a database table: it has both a row index and column labels, and can be viewed as a dict of Series that share the same index.

1.2.1 Two ways to initialize

```
Parameters
----------
data : ndarray (structured or homogeneous), Iterable, dict, or DataFrame
    Dict can contain Series, arrays, constants, or list-like objects

    .. versionchanged :: 0.23.0
       If data is a dict, argument order is maintained for Python 3.6
       and later.
index : Index or array-like
    Index to use for resulting frame. Will default to RangeIndex if
    no indexing information part of input data and no index provided
columns : Index or array-like
    Column labels to use for resulting frame. Will default to
    RangeIndex (0, 1, 2, ..., n) if no column labels are provided
dtype : dtype, default None
    Data type to force. Only a single dtype is allowed. If None, infer
copy : boolean, default False
    Copy data from inputs. Only affects DataFrame / 2d ndarray input
```
```python
>>> d = {'col1': [1, 2], 'col2': [3, 4]}
# Method 1: from a dict; each key-value pair becomes a column, keys become the column labels
>>> df = pd.DataFrame(data=d)
>>> df
   col1  col2
0     1     3
1     2     4
# Method 2: from a numpy.ndarray of values, with columns as the column labels
# and index as the row labels
df2 = pd.DataFrame(data=np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                   columns=['a', 'b', 'c'], index=['1st', '2nd', '3rd'])
# ---OUTPUT---
     a  b  c
1st  1  2  3
2nd  4  5  6
3rd  7  8  9
```

1.2.2 DataFrame dtypes

```python
df.dtypes
# ---OUTPUT---
animal       object
age         float64
visits        int64
priority     object
dtype: object
```

1.2.3 Basic DataFrame operations

```python
df
# ---OUTPUT---
  animal  age  visits priority
a    cat  2.5       1      yes
b    cat  3.0       3      yes
c  snake  0.5       2       no
d    dog  NaN       3      yes
e    dog  5.0       2       no
f    cat  2.0       3       no
g  snake  4.5       1       no
h    cat  NaN       1      yes
i    dog  7.0       2       no
j    dog  3.0       1       no
```

1.2.3.1 Preview the first N rows (default 5)

```python
df.head(N)
```

1.2.3.2 Preview the last N rows (default 5)

```python
df.tail(N)
```

1.2.3.3 Inspecting the index

```python
df.index
# ---OUTPUT---
Index(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'], dtype='object')
```

1.2.3.4 Inspecting the column labels

```python
df.columns
# ---OUTPUT---
Index(['animal', 'age', 'visits', 'priority'], dtype='object')
```

1.2.3.5 Inspecting the values

```python
df.values
# ---OUTPUT---
array([['cat', 2.5, 1, 'yes'],
       ['cat', 3.0, 3, 'yes'],
       ['snake', 0.5, 2, 'no'],
       ['dog', nan, 3, 'yes'],
       ['dog', 5.0, 2, 'no'],
       ['cat', 2.0, 3, 'no'],
       ['snake', 4.5, 1, 'no'],
       ['cat', nan, 1, 'yes'],
       ['dog', 7.0, 2, 'no'],
       ['dog', 3.0, 1, 'no']], dtype=object)
```

1.2.3.6 Summary statistics

```python
df.describe()
# ---OUTPUT---
            age     visits
count  8.000000  10.000000
mean   3.437500   1.900000
std    2.007797   0.875595
min    0.500000   1.000000
25%    2.375000   1.000000
50%    3.000000   2.000000
75%    4.625000   2.750000
max    7.000000   3.000000
```

1.2.3.7 Transposing a DataFrame (swapping rows and columns)

```python
df.T  # the result is still a pandas.core.frame.DataFrame
```

1.2.3.8 Sorting a DataFrame

```python
df.sort_values(by='age')
```

sort_values(by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last') method of pandas.core.frame.DataFrame instance

```
Parameters
----------
by : str or list of str
    Name or list of names to sort by.
    - if axis is 0 or 'index' then by may contain index
      levels and/or column labels
    - if axis is 1 or 'columns' then by may contain column
      levels and/or index labels
axis : {0 or 'index', 1 or 'columns'}, default 0
    Axis to be sorted
ascending : bool or list of bool, default True
    Sort ascending vs. descending. Specify list for multiple sort
    orders. If this is a list of bools, must match the length of
    the by.
inplace : bool, default False
    if True, perform operation in-place
kind : {'quicksort', 'mergesort', 'heapsort'}, default 'quicksort'
    Choice of sorting algorithm. See also ndarray.np.sort for more
    information. mergesort is the only stable algorithm. For
    DataFrames, this option is only applied when sorting on a single
    column or label.
na_position : {'first', 'last'}, default 'last'
    first puts NaNs at the beginning, last puts NaNs at the end
```

TIPS:
With axis=0, rows are sorted by the values of a column, so `by` should be a column label;
with axis=1, columns are sorted by the values of a row, so `by` should be an index label.
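To make the `by`/`na_position` behavior concrete, a small sketch with made-up data (note that rows with NaN in the sort column go last by default):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'age': [2.5, np.nan, 0.5], 'visits': [1, 3, 2]},
                  index=['a', 'b', 'c'])

# axis=0 (default): rows are reordered by the 'age' column; NaN rows go last
by_age = df.sort_values(by='age')
print(by_age.index.tolist())   # ['c', 'a', 'b']

# na_position='first' moves the NaN rows to the top instead
nan_first = df.sort_values(by='age', na_position='first')
print(nan_first.index.tolist())  # ['b', 'c', 'a']
```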

1.2.3.9 Slicing rows of a DataFrame

```python
df[1:3]  # the rows at positions 1 and 2; the result is still a DataFrame
```

1.2.3.10 Selecting rows and columns with DataFrame.loc[input, col]

Access a group of rows and columns by label(s) or a boolean array.

input:

  • a single label: 'a' …
  • a list or array of labels: ['a', 'b', 'c'] …
  • a slice object with labels: 'a':'f' (both endpoints included)
  • a boolean array of the same length as the axis being sliced: [True, True, False]
  • a callable with one argument (the calling Series or DataFrame) that returns valid output for indexing (one of the above)

col (optional):

  • a single column name — returns a Series
  • a list of column names — returns a DataFrame
```python
data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
        'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
        'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
        'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df = pd.DataFrame(data, index=labels)
# ---OUTPUT---
  animal  age  visits priority
a    cat  2.5       1      yes
b    cat  3.0       3      yes
c  snake  0.5       2       no
d    dog  NaN       3      yes
e    dog  5.0       2       no
f    cat  2.0       3       no
g  snake  4.5       1       no
h    cat  NaN       1      yes
i    dog  7.0       2       no
j    dog  3.0       1       no
# input: a single label
df.loc['a']
# ---OUTPUT---
animal      cat
age         2.5
visits        1
priority    yes
Name: a, dtype: object
# input: a list of labels
df.loc[['a', 'b']]
# ---OUTPUT---
  animal  age  visits priority
a    cat  2.5       1      yes
b    cat  3.0       3      yes
# input: a slice object with labels (both endpoints included)
df.loc['b':'c']
# ---OUTPUT---
  animal  age  visits priority
b    cat  3.0       3      yes
c  snake  0.5       2       no
# input: a boolean array
df.loc[df['age'] > 4]
# ---OUTPUT---
  animal  age  visits priority
e    dog  5.0       2       no
g  snake  4.5       1       no
i    dog  7.0       2       no
# input: a callable
def func(frame):
    return 'a'
df.loc[func]
# ---OUTPUT---
animal      cat
age         2.5
visits        1
priority    yes
Name: a, dtype: object
# col: a single column name
df.loc['a':'c', 'visits']
# ---OUTPUT---
a    1
b    3
c    2
Name: visits, dtype: int64
# col: a list of column names
df.loc['a':'c', ['visits', 'age']]
# ---OUTPUT---
   visits  age
a       1  2.5
b       3  3.0
c       2  0.5
```

1.2.3.11 Selecting rows and columns with DataFrame.iloc[input, col]

Purely integer-location based indexing: positions run from 0 to length-1 of the axis.

input:

  • an integer: 3
  • a list or array of integers: [2, 4, 7]
  • a slice object with ints: 2:5 (end excluded)
  • a boolean array (same length as the axis)
  • a callable with one argument (the calling Series or DataFrame) that returns valid output for indexing (one of the above)

col:

  • an integer
  • a list or array of integers
  • a slice object with ints
```python
# reusing the DataFrame from the previous example
# input: an integer
df.iloc[2]
# ---OUTPUT---
animal      snake
age           0.5
visits          2
priority       no
Name: c, dtype: object
# input: a list of integers
df.iloc[[2, 4]]
# ---OUTPUT---
  animal  age  visits priority
c  snake  0.5       2       no
e    dog  5.0       2       no
# input: a slice object with ints (end excluded)
df.iloc[2:4]
# ---OUTPUT---
  animal  age  visits priority
c  snake  0.5       2       no
d    dog  NaN       3      yes
# input: a boolean array -- it must be a plain list/array whose length
# matches the axis; a boolean Series such as df['age'] > 6 is not accepted
# here (use .values, or df.loc, for that)
df.iloc[[False, True, False, False, False, False, False, False, False, False]]
# ---OUTPUT---
  animal  age  visits priority
b    cat  3.0       3      yes
# input: a callable
df.iloc[lambda df: [4, 5]]
# ---OUTPUT---
  animal  age  visits priority
e    dog  5.0       2       no
f    cat  2.0       3       no
# col: an integer
df.iloc[0:3, 1]
# ---OUTPUT---
a    2.5
b    3.0
c    0.5
Name: age, dtype: float64
# col: a list of integers
df.iloc[0:3, [1, 2]]
# ---OUTPUT---
   age  visits
a  2.5       1
b  3.0       3
c  0.5       2
# col: a slice object with ints
df.iloc[0:3, 1:3]
# ---OUTPUT---
   age  visits
a  2.5       1
b  3.0       3
c  0.5       2
```

1.2.3.12 Making a copy of a DataFrame

```python
df_bak = df.copy()
```

1.2.3.13 Testing for missing values

```python
df.isnull()
```

Returns a boolean DataFrame with the same shape as df: True where the element is None or numpy.NaN, otherwise False.
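A common follow-up is to count the missing values per column by summing the boolean mask. A small sketch with made-up data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'age': [2.5, np.nan, 5.0], 'visits': [1, 3, 2]})

mask = df.isnull()   # boolean DataFrame, same shape as df
print(mask)

# True counts as 1, so summing the mask gives missing values per column
print(df.isnull().sum())  # age: 1, visits: 0
```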

1.2.3.14 Adding a column

```python
col_add = pd.Series([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], index=df.index)
df['No.'] = col_add
```

1.2.3.15 Modifying values

Two ways to modify a value in a DataFrame:

  1. by absolute (integer) position: df.iat[x, y]
  2. by index/column labels: df.loc[index, col]

```python
# a. by integer position
df.iat[0, 1] = 3
# b. by index/column labels
df.loc['f', 'age'] = 15
```

DataFrame.iat : Access a single value for a row/column pair by integer position.
DataFrame.at : Access a single value for a row/column label pair.

DataFrame.iloc : Access a group of rows and columns by integer position(s).
DataFrame.loc : Access a group of rows and columns by label(s).
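The four accessors can be compared side by side. A minimal sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'age': [2.5, 3.0], 'visits': [1, 3]}, index=['a', 'b'])

df.iat[0, 0] = 4.0         # single value by integer position (row 0, column 0)
df.at['b', 'visits'] = 9   # single value by label pair
df.loc['a', 'visits'] = 2  # loc also accepts a row/column label pair

print(df)
```

`at`/`iat` are the scalar fast paths; `loc`/`iloc` additionally accept lists, slices, and boolean arrays.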

1.2.3.16 Merging DataFrames

```
merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False,
      right_index=False, sort=False, suffixes=('_x', '_y'), copy=True,
      indicator=False, validate=None)

Merge DataFrame or named Series objects with a database-style join.

The join is done on columns or indexes. If joining columns on
columns, the DataFrame indexes *will be ignored*. Otherwise if joining indexes
on indexes or indexes on a column or columns, the index will be passed on.

Parameters
----------
left : DataFrame
right : DataFrame or named Series
    Object to merge with.
how : {'left', 'right', 'outer', 'inner'}, default 'inner'
    Type of merge to be performed.
    * left: use only keys from left frame, similar to a SQL left outer join;
      preserve key order.
    * right: use only keys from right frame, similar to a SQL right outer join;
      preserve key order.
    * outer: use union of keys from both frames, similar to a SQL full outer
      join; sort keys lexicographically.
    * inner: use intersection of keys from both frames, similar to a SQL inner
      join; preserve the order of the left keys.
on : label or list
    Column or index level names to join on. These must be found in both
    DataFrames. If `on` is None and not merging on indexes then this defaults
    to the intersection of the columns in both DataFrames.
left_on : label or list, or array-like
    Column or index level names to join on in the left DataFrame. Can also
    be an array or list of arrays of the length of the left DataFrame.
    These arrays are treated as if they are columns.
right_on : label or list, or array-like
    Column or index level names to join on in the right DataFrame. Can also
    be an array or list of arrays of the length of the right DataFrame.
    These arrays are treated as if they are columns.
left_index : bool, default False
    Use the index from the left DataFrame as the join key(s). If it is a
    MultiIndex, the number of keys in the other DataFrame (either the index
    or a number of columns) must match the number of levels.
right_index : bool, default False
    Use the index from the right DataFrame as the join key. Same caveats as
    left_index.
sort : bool, default False
    Sort the join keys lexicographically in the result DataFrame. If False,
    the order of the join keys depends on the join type (how keyword).
suffixes : tuple of (str, str), default ('_x', '_y')
    Suffix to apply to overlapping column names in the left and right
    side, respectively. To raise an exception on overlapping columns use
    (False, False).
copy : bool, default True
    If False, avoid copy if possible.
indicator : bool or str, default False
    If True, adds a column to output DataFrame called "_merge" with
    information on the source of each row.
    If string, column with information on source of each row will be added to
    output DataFrame, and column will be named value of string.
    Information column is Categorical-type and takes on a value of "left_only"
    for observations whose merge key only appears in 'left' DataFrame,
    "right_only" for observations whose merge key only appears in 'right'
    DataFrame, and "both" if the observation's merge key is found in both.
validate : str, optional
    If specified, checks if merge is of specified type.
    * "one_to_one" or "1:1": check if merge keys are unique in both
      left and right datasets.
    * "one_to_many" or "1:m": check if merge keys are unique in left
      dataset.
    * "many_to_one" or "m:1": check if merge keys are unique in right
      dataset.
    * "many_to_many" or "m:m": allowed, but does not result in checks.

Returns
-------
DataFrame
    A DataFrame of the two merged objects.

See Also
--------
merge_ordered : Merge with optional filling/interpolation.
merge_asof : Merge on nearest keys.
DataFrame.join : Similar method using indices.
```
```python
left = pd.DataFrame({'key': ['foo1', 'foo2'], 'one': [1, 2]})
right = pd.DataFrame({'key': ['foo2', 'foo3'], 'two': [4, 5]})
print(left)
print(right)
# ---OUTPUT---
    key  one
0  foo1    1
1  foo2    2
    key  two
0  foo2    4
1  foo3    5
pd.merge(left, right, on='key')
    key  one  two
0  foo2    2    4
pd.merge(left, right, on='key', how='outer')
    key  one  two
0  foo1  1.0  NaN
1  foo2  2.0  4.0
2  foo3  NaN  5.0
pd.merge(left, right, on='key', how='left')
    key  one  two
0  foo1    1  NaN
1  foo2    2  4.0
pd.merge(left, right, on='key', how='right')
    key  one  two
0  foo2  2.0    4
1  foo3  NaN    5
```
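When comparing the join types above, the `indicator` parameter is handy for seeing where each row came from. A short sketch using the same `left`/`right` frames:

```python
import pandas as pd

left = pd.DataFrame({'key': ['foo1', 'foo2'], 'one': [1, 2]})
right = pd.DataFrame({'key': ['foo2', 'foo3'], 'two': [4, 5]})

# indicator=True adds a categorical '_merge' column:
# 'left_only', 'right_only', or 'both'
m = pd.merge(left, right, on='key', how='outer', indicator=True)
print(m[['key', '_merge']])
```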

1.2.3.17 分组groupby

  • 单一维度,列有重复值,按列分类 ```python

    数据

    df = pd.DataFrame({‘Animal’ : [‘Falcon’, ‘Falcon’,
    1. 'Parrot', 'Parrot'],
    2. 'Max Speed' : [380., 370., 24., 26.]})

    —-OUTPUT—-

    Animal Max Speed 0 Falcon 380.0 1 Falcon 370.0 2 Parrot 24.0 3 Parrot 26.0

df.groupby(‘Animal’).size()

—-OUTPUT—-

Animal Falcon 2 Parrot 2 dtype: int64

  1. - 多维度Index的分组
  2. ```python
  3. # 数据
  4. arrays = [['Falcon', 'Falcon', 'Parrot', 'Parrot'],
  5. ['Capitve', 'Wild', 'Capitve', 'Wild']]
  6. index = pd.MultiIndex.from_arrays(arrays, names=('Animal', 'Type'))
  7. df = pd.DataFrame({'Max Speed' : [390., 350., 30., 20.]}, index=index)
  8. # ---OUTPUT---
  9. Max Speed
  10. Animal Type
  11. Falcon Capitve 390.0
  12. Wild 350.0
  13. Parrot Capitve 30.0
  14. Wild 20.0
  15. df.groupby(level=0).size()
  16. # ---OUTPUT---
  17. Animal
  18. Falcon 2
  19. Parrot 2
  20. dtype: int64
  21. df.groupby(level=1).size()
  22. # ---OUTPUT---
  23. Type
  24. Capitve 2
  25. Wild 2
  26. dtype: int64
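`size()` only counts rows; a grouped column also supports the usual reductions (`mean`, `sum`, `max`, …). A quick sketch reusing the first example's data:

```python
import pandas as pd

df = pd.DataFrame({'Animal': ['Falcon', 'Falcon', 'Parrot', 'Parrot'],
                   'Max Speed': [380., 370., 24., 26.]})

# mean speed per group: Falcon (380+370)/2, Parrot (24+26)/2
means = df.groupby('Animal')['Max Speed'].mean()
print(means)  # Falcon 375.0, Parrot 25.0
```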

1.2.4 Handling missing values

  • Filling missing values

fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None, **kwargs) method of pandas.core.series.Series instance

```
Parameters
----------
inplace : boolean, default False
```

```python
# fill NaN in the age column with 3
df['age'].fillna(3)
# fill every NaN in the DataFrame with 0
df.fillna(0)
# inplace=True replaces the values in place
# fill each column with a different value
df.fillna({'A': 0, 'B': 1, 'C': 2, 'D': 3})
```
  • Dropping rows or columns with missing values

dropna(axis=0, how='any', thresh=None, subset=None, inplace=False) method of pandas.core.frame.DataFrame instance

```
Remove missing values.

Parameters
----------
axis : {0 or 'index', 1 or 'columns'}, default 0
    Determine if rows or columns which contain missing values are
    removed.
    * 0, or 'index' : Drop rows which contain missing values.
    * 1, or 'columns' : Drop columns which contain missing values.
    .. deprecated:: 0.23.0: Pass tuple or list to drop on multiple axes.
how : {'any', 'all'}, default 'any'
    Determine if row or column is removed from DataFrame, when we have
    at least one NA or all NA.
    * 'any' : If any NA values are present, drop that row or column.
    * 'all' : If all values are NA, drop that row or column.
subset : array-like, optional
    Labels along other axis to consider, e.g. if you are dropping rows
    these would be a list of columns to include.
inplace : bool, default False
    If True, do operation inplace and return None.
```

```python
# drop every row that contains any NaN
df.dropna(how='any')
# drop every column that contains any NaN
df.dropna(axis=1)
# drop rows whose values are all NaN
df.dropna(how='all')
# drop rows with NaN, considering only the 'name' and 'born' columns
df.dropna(axis=0, subset=['name', 'born'])
# drop columns with NaN, considering only rows 0 and 1
df.dropna(axis=1, subset=[0, 1])
# drop in place
df.dropna(inplace=True)
```

See also:
DataFrame.isna : Indicate missing values.
DataFrame.notna : Indicate existing (non-missing) values.
DataFrame.fillna : Replace missing values.
Series.dropna : Drop missing values.
Index.dropna : Drop missing indices.
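The `thresh` parameter, listed in the signature but not exercised above, keeps only rows with at least that many non-NaN values. A small sketch with made-up data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, np.nan],
                   'b': [2.0, 3.0, np.nan],
                   'c': [4.0, 5.0, np.nan]})

# row 0 has 3 non-NaN values, row 1 has 2, row 2 has 0;
# thresh=2 keeps rows with at least 2 non-NaN values
kept = df.dropna(thresh=2)
print(kept.index.tolist())  # [0, 1]
```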

1.2.5 Reading and writing files

Pandas can write a DataFrame to an Excel/CSV file, and likewise read one back.

```python
# write a csv file
df.to_csv('animal.csv')
# read a csv file (read_csv is a top-level pandas function, not a DataFrame method)
pd.read_csv('animal.csv')
# write an excel file
writer = pd.ExcelWriter('animal.xlsx')
df3.to_excel(writer, sheet_name='Sheet1')
df3.to_excel(writer, sheet_name='Sheet2')
df3.to_excel(writer, sheet_name='Sheet3')
writer.save()
# read an excel file (here: the third sheet)
pd.read_excel('animal.xlsx', 2, index_col=None, na_values=['NA'])
```

For the full parameter lists, see the documentation.

2. Practical Topics

2.1 pandas.core.indexes.datetimes.DatetimeIndex

Build a Series indexed by every day of 2018, with random values:

```python
dti = pd.date_range(start='2018-01-01', end='2018-12-31', freq='D')
s = pd.Series(np.random.rand(len(dti)), index=dti)
```

Sum the values that fall on Wednesdays:

```python
s[s.index.weekday == 2].sum()
```

Average the values by month:

```python
s.resample('M').mean()
```

Average the values by minute:

```python
s = pd.date_range(start='2018/1/1', end='2018/1/2', closed='left', freq='S')
ts = pd.Series(np.random.randint(0, 500, len(s)), index=s)
ts.resample('Min').mean()
```

resample is meant for going from a fine granularity to a coarser one; resampling from coarse to fine produces NaN for the newly created slots.
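A small sketch of that coarse-to-fine caveat, using a three-day series upsampled to half-day bins (the `'12h'` alias is the assumption here; older pandas spelled it `'12H'`):

```python
import pandas as pd

idx = pd.date_range('2018-01-01', periods=3, freq='D')
s = pd.Series([1.0, 2.0, 3.0], index=idx)

# Upsampling from daily to 12-hourly inserts new 12:00 slots;
# since no data falls in them, they come back as NaN
up = s.resample('12h').mean()
print(up)
```

Use `.ffill()` or `.interpolate()` on the resampler if the new slots should be filled.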

Attaching a time zone (tz):

```python
s = pd.date_range('today', periods=1, freq='D')  # the current time
ts = pd.Series(np.random.randn(len(s)), index=s)  # a random value
ts.tz_localize('UTC')  # label the naive timestamp as UTC
# ---OUTPUT---
2019-05-28 09:17:29.204571+00:00    1.583188
Freq: D, dtype: float64
ts.tz_localize('Asia/Shanghai')
# ---OUTPUT---
2019-05-28 17:08:55.604510+08:00    0.245186
Freq: D, dtype: float64
```

2.2 Multi-level indexing for Series

```python
# create a Series with a MultiIndex
letters = ['A', 'B', 'C']
numbers = list(range(3))
# names belongs to the MultiIndex, not to pd.Series
mi = pd.MultiIndex.from_product([letters, numbers], names=['ca', 'idx'])
s = pd.Series(np.random.rand(len(letters) * len(numbers)), index=mi)
# ---OUTPUT---
ca  idx
A   0      0.430402
    1      0.990126
    2      0.084091
B   0      0.878942
    1      0.209060
    2      0.717289
C   0      0.143833
    1      0.780968
    2      0.297559
dtype: float64
```

Label-based lookup:

```python
# everything in the first level, values 0 and 2 of the second level
s.loc[:, [0, 2]]
# ---OUTPUT---
A  0    0.470544
   2    0.907349
B  0    0.153506
   2    0.875122
C  0    0.680005
   2    0.169241
dtype: float64
# 'A' in the first level, 0 and 2 in the second level
s.loc['A', [0, 2]]
# ---OUTPUT---
A  0    0.470544
   2    0.907349
dtype: float64
# 'A' and 'C' in the first level, 0 and 2 in the second level
s.loc[['A', 'C'], [0, 2]]
# ---OUTPUT---
A  0    0.470544
   2    0.907349
C  0    0.680005
   2    0.169241
dtype: float64
```

Slicing a MultiIndex:

```python
s.loc[pd.IndexSlice[:'B', 1:]]
# ---OUTPUT---
ca  idx
A   1      0.945881
    2      0.119563
B   1      0.979727
    2      0.531048
dtype: float64
```

2.3 Multi-level indexing for DataFrames

```python
# create a DataFrame with a MultiIndex
frame = pd.DataFrame(np.arange(18).reshape(6, 3),
                     index=[['A', 'A', 'A', 'B', 'B', 'B'], list('123123')],
                     columns=['hello', 'Leo', 'Pansy'])
# ---OUTPUT---
     hello  Leo  Pansy
A 1      0    1      2
  2      3    4      5
  3      6    7      8
B 1      9   10     11
  2     12   13     14
  3     15   16     17
```

Group-and-sum over an index level:

```python
frame.index.names = ['first', 'second']
# ---OUTPUT---
              hello  Leo  Pansy
first second
A     1           0    1      2
      2           3    4      5
      3           6    7      8
B     1           9   10     11
      2          12   13     14
      3          15   16     17
frame.groupby('first').sum()
# ---OUTPUT---
       hello  Leo  Pansy
first
A          9   12     15
B         36   39     42
```

2.4 Pivoting rows and columns

```python
frame.stack()  # each row of three columns becomes three rows of one value
# ---OUTPUT---
first  second
A      1       hello     0
               Leo       1
               Pansy     2
       2       hello     3
               Leo       4
               Pansy     5
       3       hello     6
               Leo       7
               Pansy     8
B      1       hello     9
               Leo      10
               Pansy    11
       2       hello    12
               Leo      13
               Pansy    14
       3       hello    15
               Leo      16
               Pansy    17
dtype: int64
```
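stack() moves the column labels into the innermost index level, and unstack() moves a level back out to columns. A short sketch reusing the frame from above:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(18).reshape(6, 3),
                     index=[['A', 'A', 'A', 'B', 'B', 'B'], list('123123')],
                     columns=['hello', 'Leo', 'Pansy'])

stacked = frame.stack()  # Series with a 3-level index of length 18
print(stacked.loc[('A', '1', 'hello')])  # 0
print(stacked.loc[('B', '3', 'Pansy')])  # 17

# unstack() is the inverse: the innermost level goes back to columns
wide = stacked.unstack()
print(wide.shape)  # (6, 3)
```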

2.5 DataFrame条件查找

  1. data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
  2. 'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
  3. 'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
  4. 'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}
  5. labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
  6. df = pd.DataFrame(data, index=labels)
  7. #---OUTPUT---
  8. animal age visits priority
  9. a cat 2.5 1 yes
  10. b cat 3.0 3 yes
  11. c snake 0.5 2 no
  12. d dog NaN 3 yes
  13. e dog 5.0 2 no
  14. f cat 2.0 3 no
  15. g snake 4.5 1 no
  16. h cat NaN 1 yes
  17. i dog 7.0 2 no
  18. j dog 3.0 1 no
  19. df['age']
  20. # ---OUTPUT
  21. a 2.5
  22. b 3.0
  23. c 0.5
  24. d NaN
  25. e 5.0
  26. f 2.0
  27. g 4.5
  28. h NaN
  29. i 7.0
  30. j 3.0
  31. Name: age, dtype: float64
  32. df['age'] > 3
  33. # ---OUTPUT---
  34. a False
  35. b False
  36. c False
  37. d False
  38. e True
  39. f False
  40. g True
  41. h False
  42. i True
  43. j False
  44. Name: age, dtype: bool
  45. df[df['age'] > 3]
  46. # ---OUTPUT---
  47. animal age visits priority
  48. e dog 5.0 2 no
  49. g snake 4.5 1 no
  50. i dog 7.0 2 no
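Several conditions can be combined with `&` (and) and `|` (or); each comparison must be parenthesized because `&`/`|` bind tighter than `>`/`==`. A sketch using the same data:

```python
import numpy as np
import pandas as pd

data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
        'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
        'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
        'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}
df = pd.DataFrame(data, index=list('abcdefghij'))

# dogs older than 3: rows e (5.0) and i (7.0); NaN comparisons are False
sel = df[(df['animal'] == 'dog') & (df['age'] > 3)]
print(sel.index.tolist())  # ['e', 'i']
```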