- 1. 基本数据结构
- 1.1 Series
- 1.2 DataFrame — 二维表结构
- 1.2.1 初始化的2种方式
- 1.2.2 DataFrame的类型
- 1.2.3 DataFrame的基本操作
- 1.2.3.1 预览头N(默认5)行数据
- 1.2.3.2 预览尾N(默认5)行数据
- 1.2.3.3 查看DataFrame的索引
- 1.2.3.4 查看DataFrame的列名
- 1.2.3.5 查看DataFrame的数值
- 1.2.3.6 查看DataFrame的统计数据
- 1.2.3.7 DataFrame的转置操作(横列互换)
- 1.2.3.8 对DataFrame进行排序
- 1.2.3.9 DataFrame的切片 — N排
- 1.2.3.10 DataFrame 获取rows of columns — DataFrame.loc[input, col]
- 1.2.3.11 DataFrame 获取rows of columns — DataFrame.iloc[input, col]
- 1.2.3.12 生成DataFrame副本
- 1.2.3.13 判别DataFrame元素为空
- 1.2.3.14 添加列数据
- 1.2.3.15 修改value
- 1.2.3.16 DataFrame合并
- 1.2.3.17 分组groupby
- 数据
- —-OUTPUT—-
- —-OUTPUT—-
- 2. 实用进阶
1. 基本数据结构
1.1 Series
Series是个定长的字典序列。
存储时相当于两个ndarray。
1.1.1 初始化的3种方式
Parameters
----------
data : array-like, Iterable, dict, or scalar value
Contains data stored in Series.
.. versionchanged :: 0.23.0
If data is a dict, argument order is maintained for Python 3.6
and later.
index : array-like or Index (1d)
Values must be hashable and have the same length as `data`.
Non-unique index values are allowed. Will default to
RangeIndex (0, 1, 2, ..., n) if not provided. If both a dict and index
sequence are used, the index will override the keys found in the
dict.
dtype : str, numpy.dtype, or ExtensionDtype, optional
dtype for the output Series. If not specified, this will be
inferred from `data`.
See the :ref:`user guide <basics.dtypes>` for more usages.
copy : bool, default False
Copy input data.
from pandas import Series
import numpy as np
# 方式1:以列表进行初始化,列表值为列值
li = [1,2,3,4,5]
x1 = Series(li) # index 默认从0起始
# ---OUTPUT---
0 1
1 2
2 3
3 4
4 5
dtype: int64
# 方式2:以 numpy.ndarray 进行初始化
n = np.random.random(5)
x1 = Series(n, index=['a', 'b', 'c', 'd'], dtype=str)
# ---OUTPUT---
a 0.377831
b 0.629109
c 0.841532
d -0.429905
dtype: float64
# 方式3:以字典方式创建Series
d = {'a':5, 'b':6, 'c':7, 'd':8}
x2 = Series(d)
# ---OUTPUT---
a 5
b 6
c 7
d 8
dtype: int64
1.1.2 修改Index
# 修改index
s1 = pd.Series([1,2,3])
# ---OUTPUT---
0 1
1 2
2 3
dtype: int64
s1.index = ['a', 'b', 'c']
# ---OUTPUT---
a 1
b 2
c 3
dtype: int64
1.1.3 纵向拼接
s1.append(s2)
1.1.4 按指定索引删除元素
si.drop('e') # 删除index为'e'的元素信息
1.1.5 按索引值索引元素
# 修改索引值为'A'的值为6
si['A'] = 6
1.1.6 Series的切片操作
si[:3] # si的前三个索引元素的切片,类型仍然为Series
1.1.7 Series的运算符
# S3
a 1
b 2
c 3
d 4
e 5
dtype: int64
# S4
a 1
b 2
c 3
d 4
A 6
B 1
C 2
D 3
E 4
dtype: int64
- 加法:加法/减法/乘法/除法运算是按照索引计算,如果索引不同则填充为
NaN
(空值).
S4.add(S3)
# --OUTPUT---
A NaN
B NaN
C NaN
D NaN
E NaN
a 2.0
b 4.0
c 6.0
d 8.0
e NaN
dtype: float64
S4.sub(S3)
# --OUTPUT---
A NaN
B NaN
C NaN
D NaN
E NaN
a 0.0
b 0.0
c 0.0
d 0.0
e NaN
dtype: float64
1.1.8 ufunc
# 中位数
si.median()
# 求和
si.sum()
# 最大值
si.max()
# 最小值
si.min()
1.2 DataFrame — 二维表结构
DataFrame数据结构类似于数据库表,包含了行索引和列索引,可以将其视为由相同索引的Series组成的字典类型。
1.2.1 初始化的2种方式
Parameters
----------
data : ndarray (structured or homogeneous), Iterable, dict, or DataFrame
Dict can contain Series, arrays, constants, or list-like objects
.. versionchanged :: 0.23.0
If data is a dict, argument order is maintained for Python 3.6
and later.
index : Index or array-like
Index to use for resulting frame. Will default to RangeIndex if
no indexing information part of input data and no index provided
columns : Index or array-like
Column labels to use for resulting frame. Will default to
RangeIndex (0, 1, 2, ..., n) if no column labels are provided
dtype : dtype, default None
Data type to force. Only a single dtype is allowed. If None, infer
copy : boolean, default False
Copy data from inputs. Only affects DataFrame / 2d ndarray input
>>> d = {'col1': [1, 2], 'col2': [3, 4]}
# 方式1:由字典创造DataFrame, 字典的每个key-value对,为一列数据, 列头值已定
>>> df = pd.DataFrame(data=d)
>>> df
col1 col2
0 1 3
1 2 4
# 由numpy.ndarray创造DataFrame, ndarray仅为值, columns实为列头值, index为索引值
df2 = pd.DataFrame(data=np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
columns=['a', 'b', 'c'], index=['1st', '2nd', '3rd'])
# ---OUTPUT---
a b c
1st 1 2 3
2nd 4 5 6
3rd 7 8 9
1.2.2 DataFrame的类型
df.dtypes
# ---OUTPUT---
animal object
age float64
visits int64
priority object
dtype: object
1.2.3 DataFrame的基本操作
df
# ---OUTPUT---
animal age visits priority
a cat 2.5 1 yes
b cat 3.0 3 yes
c snake 0.5 2 no
d dog NaN 3 yes
e dog 5.0 2 no
f cat 2.0 3 no
g snake 4.5 1 no
h cat NaN 1 yes
i dog 7.0 2 no
j dog 3.0 1 no
1.2.3.1 预览头N(默认5)行数据
df.head(N)
1.2.3.2 预览尾N(默认5)行数据
df.tail(N)
1.2.3.3 查看DataFrame的索引
df.index
# ---OUTPUT---
Index(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'], dtype='object')
1.2.3.4 查看DataFrame的列名
df.columns
# ---OUTPUT---
Index(['animal', 'age', 'visits', 'priority'], dtype='object')
1.2.3.5 查看DataFrame的数值
df.values
# ---OUTPUT---
array([['cat', 2.5, 1, 'yes'],
['cat', 3.0, 3, 'yes'],
['snake', 0.5, 2, 'no'],
['dog', nan, 3, 'yes'],
['dog', 5.0, 2, 'no'],
['cat', 2.0, 3, 'no'],
['snake', 4.5, 1, 'no'],
['cat', nan, 1, 'yes'],
['dog', 7.0, 2, 'no'],
['dog', 3.0, 1, 'no']], dtype=object)
1.2.3.6 查看DataFrame的统计数据
df.describe()
# ---OUTPUT---
age visits
count 8.000000 10.000000
mean 3.437500 1.900000
std 2.007797 0.875595
min 0.500000 1.000000
25% 2.375000 1.000000
50% 3.000000 2.000000
75% 4.625000 2.750000
max 7.000000 3.000000
1.2.3.7 DataFrame的转置操作(横列互换)
df.T # 类型仍为 pandas.core.frame.DataFrame
1.2.3.8 对DataFrame进行排序
df.sort_values(by='age')
sort_values(by, axis=0, ascending=True, inplace=False, kind=’quicksort’, na_position=’last’) method of pandas.core.frame.DataFrame instance.
Parameters:
———————
by : str or list of str
Name or list of names to sort by.
- if axis
is 0 or 'index'
then by
may contain index
levels and/or column labels
- if axis
is 1 or 'columns'
then by
may contain column
levels and/or index labels
axis : {0 or ‘index’, 1 or ‘columns’}, default 0
Axis to be sorted
ascending : bool or list of bool, default True
Sort ascending vs. descending. Specify list for multiple sort
orders. If this is a list of bools, must match the length of
the by.
inplace : bool, default False
if True, perform operation in-place
kind : {‘quicksort’, ‘mergesort’, ‘heapsort’}, default ‘quicksort’
Choice of sorting algorithm. See also ndarray.np.sort for more
information. mergesort
is the only stable algorithm. For
DataFrames, this option is only applied when sorting on a single
column or label.
na_position : {‘first’, ‘last’}, default ‘last’
first
puts NaNs at the beginning, last
puts NaNs at the end
TIPS:
若参数axis为0,则是依据纵列值进行排序,by取值应为column的label;
若参数axis为1,则是依据横排值进行排序,by取值应为index的label。
1.2.3.9 DataFrame的切片 — N排
df[1:3]
1.2.3.10 DataFrame 获取rows of columns — DataFrame.loc[input, col]
Access a group of rows and columns by label(s) or a boolean array.
input:
- single label : ‘a’ …
- A list or array of labels : [‘a’, ‘b’, ‘c’]…
- A slice object with labels : ‘a’ : ‘f’ (including)
- A boolean array of the same length as the axis being sliced : [True, True, False]
- A callable function with one argument (the calling Series, DataFrame or Panel) and that returns valid output for indexing (one of the above)
col : 可选
- single col name — Series type
- a list of col names — DataFrame
data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df = pd.DataFrame(data, index=labels)
# ---OUTPUT---
animal age visits priority
a cat 2.5 1 yes
b cat 3.0 3 yes
c snake 0.5 2 no
d dog NaN 3 yes
e dog 5.0 2 no
f cat 2.0 3 no
g snake 4.5 1 no
h cat NaN 1 yes
i dog 7.0 2 no
j dog 3.0 1 no
# input : single label
df.loc['a']
# ---OUTPUT---
animal cat
age 2.5
visits 1
priority yes
Name: a, dtype: object
# input : a list of labels
df.loc[['a', 'b']]
# ---OUTPUT---
animal age visits priority
a cat 2.5 1 yes
b cat 3.0 3 yes
# input : a slice object with labels
df.loc['b' : 'c']
# ---OUTPUT---
animal age visits priority
b cat 3.0 3 yes
c snake 0.5 2 no
# A boolean array
df.loc[df['age'] > 4]
# ---OUTPUT---
animal age visits priority
e dog 5.0 2 no
g snake 4.5 1 no
i dog 7.0 2 no
# input : A callable function
def func(DataFrame):
return 'a'
df.loc[func]
# ---OUTPUT---
animal cat
age 2.5
visits 1
priority yes
Name: a, dtype: object
# col : single col name
df.loc['a':'c', 'visits']
# ---OUTPUT---
a 1
b 3
c 2
Name: visits, dtype: int64
# col : a list of col names
df.loc['a':'c', ['visits', 'age']]
visits age
a 1 2.5
b 3 3.0
c 2 0.5
1.2.3.11 DataFrame 获取rows of columns — DataFrame.iloc[input, col]
Purely integer-location based indexing for selection by position from 0 to length-1 of the axis.
input :
- An integer : 3
- A list or array of integers : [2,4,7]
- A slice object with ints : 2 : 5 (excluding)
- a boolean array
- A callable function with one argument (the calling Series, DataFrame or Panel) and that returns valid output for indexing (one of the above)
col :
- An integer
- a list of array of integers
- a slice object with ints
# 沿用上例的数据
# input : An integer
df.iloc[2]
# ---OUTPUT---
animal snake
age 0.5
visits 2
priority no
Name: c, dtype: object
# input : A list of array of integers
df.iloc[[2, 4]]
# ---OUTPUT---
animal age visits priority
c snake 0.5 2 no
e dog 5.0 2 no
# input : A slice object with ints
df.iloc[2:4]
# ---OUTPUT---
animal age visits priority
c snake 0.5 2 no
d dog NaN 3 yes
# intput : a boolean array -- 不能使用 df['age']>6这种方式
df.iloc[[False, True]]
# ---OUTPUT---
animal age visits priority
b cat 3.0 3 yes
# input : func
df.iloc[lambda df : [4, 5]]
# ---OUTPUT---
animal age visits priority
e dog 5.0 2 no
f cat 2.0 3 no
# loc : An integer
df.iloc[0:3, 1]
# ---OUTPUT---
a 2.5
b 3.0
c 0.5
Name: age, dtype: float64
a 2.5
b 3.0
c 0.5
Name: age, dtype: float64
# loc : A list of integers
df.iloc[0:3, [1,2]]
# ---OUTPUT---
age visits
a 2.5 1
b 3.0 3
c 0.5 2
# loc : a slice object with ints
df.iloc[0:3, 1:3]
# ---OUTPUT---
age visits
a 2.5 1
b 3.0 3
c 0.5 2
1.2.3.12 生成DataFrame副本
df_bak = df.copy()
1.2.3.13 判别DataFrame元素为空
df.isnull()
返回一个与df相同size的DataFrame,若元素之为None或numpy.NaN,则返回True,否则返回Flase
1.2.3.14 添加列数据
col_add = pd.Series([0,1,2,3,4,5,6,7,8,9], index=df.index)
df['No.'] = col_add
1.2.3.15 修改value
修改DataFrame的value的方法:
- 通过绝对坐标进行修改
df.iat(x,y)
- 通过索引 index,col 进行修改
df.loc(index, col)
# a. 通过绝对坐标进行修改
df.iat[0, 1] = 3
# b. 通过索引 index,col 进行修改
df['f', 'age'] = 15
DataFrame.iat : Access a single value for a row/column pair by integer position.
DataFrame.at : Access a single value for a row/column label pair
DataFrame.iloc : Access a group of rows and columns by integer position(s)
DataFrame.loc : Access a group of rows and columns by label(s)
1.2.3.16 DataFrame合并
merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False,
right_index=False, sort=False, suffixes=('_x', '_y'), copy=True,
indicator=False, validate=None)
Merge DataFrame or named Series objects with a database-style join.
The join is done on columns or indexes. If joining columns on
columns, the DataFrame indexes *will be ignored*. Otherwise if joining indexes
on indexes or indexes on a column or columns, the index will be passed on.
Parameters
----------
left : DataFrame
right : DataFrame or named Series
Object to merge with.
how : {'left', 'right', 'outer', 'inner'}, default 'inner'
Type of merge to be performed.
* left: use only keys from left frame, similar to a SQL left outer join;
preserve key order.
* right: use only keys from right frame, similar to a SQL right outer join;
preserve key order.
* outer: use union of keys from both frames, similar to a SQL full outer
join; sort keys lexicographically.
* inner: use intersection of keys from both frames, similar to a SQL inner
join; preserve the order of the left keys.
on : label or list
Column or index level names to join on. These must be found in both
DataFrames. If `on` is None and not merging on indexes then this defaults
to the intersection of the columns in both DataFrames.
left_on : label or list, or array-like
Column or index level names to join on in the left DataFrame. Can also
be an array or list of arrays of the length of the left DataFrame.
These arrays are treated as if they are columns.
right_on : label or list, or array-like
Column or index level names to join on in the right DataFrame. Can also
be an array or list of arrays of the length of the right DataFrame.
These arrays are treated as if they are columns.
left_index : bool, default False
Use the index from the left DataFrame as the join key(s). If it is a
MultiIndex, the number of keys in the other DataFrame (either the index
or a number of columns) must match the number of levels.
right_index : bool, default False
Use the index from the right DataFrame as the join key. Same caveats as
left_index.
sort : bool, default False
Sort the join keys lexicographically in the result DataFrame. If False,
the order of the join keys depends on the join type (how keyword).
suffixes : tuple of (str, str), default ('_x', '_y')
Suffix to apply to overlapping column names in the left and right
side, respectively. To raise an exception on overlapping columns use
(False, False).
copy : bool, default True
If False, avoid copy if possible.
indicator : bool or str, default False
If True, adds a column to output DataFrame called "_merge" with
information on the source of each row.
If string, column with information on source of each row will be added to
output DataFrame, and column will be named value of string.
Information column is Categorical-type and takes on a value of "left_only"
for observations whose merge key only appears in 'left' DataFrame,
"right_only" for observations whose merge key only appears in 'right'
DataFrame, and "both" if the observation's merge key is found in both.
validate : str, optional
If specified, checks if merge is of specified type.
* "one_to_one" or "1:1": check if merge keys are unique in both
left and right datasets.
* "one_to_many" or "1:m": check if merge keys are unique in left
dataset.
* "many_to_one" or "m:1": check if merge keys are unique in right
dataset.
* "many_to_many" or "m:m": allowed, but does not result in checks.
Returns
-------
DataFrame
A DataFrame of the two merged objects.
See Also
--------
merge_ordered : Merge with optional filling/interpolation.
merge_asof : Merge on nearest keys.
DataFrame.join : Similar method using indices.
left = pd.DataFrame({'key': ['foo1', 'foo2'], 'one': [1, 2]})
right = pd.DataFrame({'key': ['foo2', 'foo3'], 'two': [4, 5]})
print(left)
print(right)
# ---OUTPUT---
key one
0 foo1 1
1 foo2 2
key two
0 foo2 4
1 foo3 5
pd.merge(left, right, on='key')
key one two
0 foo2 2 4
pd.merge(left, right, on='key', how='outer')
key one two
0 foo1 1.0 NaN
1 foo2 2.0 4.0
2 foo3 NaN 5.0
pd.merge(left, right, on='key', how='left')
key one two
0 foo1 1 NaN
1 foo2 2 4.0
pd.merge(left, right, on='key', how='right')
key one two
0 foo2 2.0 4
1 foo3 NaN 5
1.2.3.17 分组groupby
- 单一维度,列有重复值,按列分类
```python
数据
df = pd.DataFrame({‘Animal’ : [‘Falcon’, ‘Falcon’,'Parrot', 'Parrot'],
'Max Speed' : [380., 370., 24., 26.]})
—-OUTPUT—-
Animal Max Speed 0 Falcon 380.0 1 Falcon 370.0 2 Parrot 24.0 3 Parrot 26.0
df.groupby(‘Animal’).size()
—-OUTPUT—-
Animal Falcon 2 Parrot 2 dtype: int64
- 多维度Index的分组
```python
# 数据
arrays = [['Falcon', 'Falcon', 'Parrot', 'Parrot'],
['Capitve', 'Wild', 'Capitve', 'Wild']]
index = pd.MultiIndex.from_arrays(arrays, names=('Animal', 'Type'))
df = pd.DataFrame({'Max Speed' : [390., 350., 30., 20.]}, index=index)
# ---OUTPUT---
Max Speed
Animal Type
Falcon Capitve 390.0
Wild 350.0
Parrot Capitve 30.0
Wild 20.0
df.groupby(level=0).size()
# ---OUTPUT---
Animal
Falcon 2
Parrot 2
dtype: int64
df.groupby(level=1).size()
# ---OUTPUT---
Type
Capitve 2
Wild 2
dtype: int64
1.2.4 DataFrame对缺失值的操作
- 对缺失值填充
fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None, **kwargs) method of pandas.core.series.Series instance
Parameters:
——————
inplace : boolean, default False
# 将age列NaN填充为3
df['age'].fillna(3)
# 将整个DataFrame的NaN填充为0
df.fillna(0)
# inplace=True 执行原地替换
# 各列替换为不同的值
df.fillna({'A':0, 'B':1, 'C':2, 'D':3})
- 删除存在缺失的行
dropna(axis=0, how=’any’, thresh=None, subset=None, inplace=False) method of pandas.core.frame.DataFrame instance
Remove missing values.<br />Parameters:<br />------------<br />**axis** : {0 or 'index', 1 or 'columns'}, default 0<br /> Determine if rows or columns which contain missing values are<br /> removed.<br /> <br /> * 0, or 'index' : Drop rows which contain missing values.<br /> * 1, or 'columns' : Drop columns which contain missing value.<br />.. deprecated:: 0.23.0: Pass tuple or list to drop on multiple<br /> axes.
how : {‘any’, ‘all’}, default ‘any’
Determine if row or column is removed from DataFrame, when we have
at least one NA or all NA.
‘any’ : If any NA values are present, drop that row or column.
‘all’ : If all values are NA, drop that row or column.
subset : array-like, optional
Labels along other axis to consider, e.g. if you are dropping rows
these would be a list of columns to include.
inplace : bool, default False
If True, do operation inplace and return None.
# 删除任意 NaN 所在的行
df.dropna(how='any')
# 删除任意 NaN 所在的列
df.dropna(axis=1)
# 删除整行皆为 NaN 的行
df.dropna(how='all')
# 删除 NAN 所在行,范围锁定在 name 和 bore 列
df.dropna(axis=0, subset=['name', 'born'])
# 删除 NAN 所在列,范围锁定在 第0和第1行
df.dropna(axis=1, subset=[0, 1])
# 原地删除
df.dropna(inplace=True)
DataFrame.isna: Indicate missing values.<br /> DataFrame.notna : Indicate existing (non-missing) values.<br /> DataFrame.fillna : Replace missing values.<br /> Series.dropna : Drop missing values.<br /> Index.dropna : Drop missing indices.
1.2.5 DataFrame文件操作
Pandas可以将DataFrame写入Excel/CSV文件中,同样也可以从Excel/CSV文件中读取。
# 写入csv文件
df.to_csv('animal.csv')
# 读取csv文件
df.read_csv('animal.csv')
# 写入excel
writer = pd.ExcelWriter('animal.xlsx')
df3.to_excel(writer, sheet_name='Sheet1')
df3.to_excel(writer, sheet_name='Sheet2')
df3.to_excel(writer, sheet_name='Sheet3')
writer.save()
# 读取excel
pd.read_excel('animal.xlsx', 2, index_col=None, na_values=['NA'])
具体的参数值函数,请参考doc。
2. 实用进阶
2.1 pandas.core.indexes.datetimes.DatetimeIndex
建立一个以 2018 年每一天为索引,值为随机数的 Series
dti = pd.date_range(start = '2018-01-01', end='2018-12-31', freq='D')
s = pd.Series(np.random.rand(len(dti)), index=dti)
统计每个周三的值的和
s[s.index.weekday == 2].sum()
统计每个月的平均值
s.resample('M').mean()
统计每分钟的平均值
s = pd.date_range(start='2018/1/1', end='2018/1/2', closed='left', freq='S')
ts = pd.Series(np.random.randint(0, 500, len(s)), index=s)
ts.resample('Min').mean()
resample 适合小粒度向大粒度转换, 否则大粒度向小粒度的转换会产生NaN
转换 tz
s = pd.date_range('today', periods=1, freq='D') # 获取当前时间
ts = pd.Series(np.random.randn(len(s)), index=s) # 随机数值
ts.tz_localize('UTC') # 转换为 UTC 时间
2019-05-28 09:17:29.204571+00:00 1.583188
Freq: D, dtype: float64
ts.tz_localize('Asia/Shanghai')
# ---OUTPUT---
2019-05-28 17:08:55.604510+08:00 0.245186
Freq: D, dtype: float64
2.2 Series 多重索引
# 创建多重索引Series
letters = ['A', 'B', 'C']
numbers = list(range(3))
mi = pd.MultiIndex.from_product([letters, numbers]) # 设置多重索引
s = pd.Series(np.random.rand(len(letters) * len(numbers)),
index=mi,
names=['ca', 'idx']) # 随机数
ca idx
A 0 0.430402
1 0.990126
2 0.084091
B 0 0.878942
1 0.209060
2 0.717289
C 0 0.143833
1 0.780968
2 0.297559
dtype: float64
索引查询
# 一级索引全部,二级索引的0和2
s.loc[:, [0,2]]
# ---OUTPUT---
A 0 0.470544
2 0.907349
B 0 0.153506
2 0.875122
C 0 0.680005
2 0.169241
dtype: float64
# 一级索引的'A', 二级索引的0和2
s.loc['A', [0,2]]
# ---OUTPUT---
A 0 0.470544
2 0.907349
dtype: float64
# 一级索引的'A'和'C', 二级索引的0和2
s.loc[['A', 'C'], [0, 2]]
# ---OUTPUT---
A 0 0.470544
2 0.907349
C 0 0.680005
2 0.169241
dtype: float64
多重索引的切片
s.loc[pd.IndexSlice[:'B', 1:]]
# ---OUTPUT---
ca idx
A 1 0.945881
2 0.119563
B 1 0.979727
2 0.531048
dtype: float64
2.3 DataFrame多重索引
# 创建多重索引 DataFrame
frame = pd.DataFrame(np.arange(18).reshape(6, 3),
index=[['A','A', 'A', 'B', 'B', 'B'], list('123123')],
columns=['hello', 'Leo', 'Pansy'])
# ---OUTPUT---
hello Leo Pansy
A 1 0 1 2
2 3 4 5
3 6 7 8
B 1 9 10 11
2 12 13 14
3 15 16 17
多重索引分组求和
frame.index.names = ['first', 'second']
# ---OUTPUT---
hello Leo Pansy
first second
A 1 0 1 2
2 3 4 5
3 6 7 8
B 1 9 10 11
2 12 13 14
3 15 16 17
frame.groupby('first').sum()
# ---OUTPUT---
hello Leo Pansy
first
A 9 12 15
B 36 39 42
2.4 DataFrame行列互换
frame.stack() # 一行三列变三行一列
# ---OUTPUT---
first second
A 1 hello 0
Leo 1
Pansy 2
2 hello 3
Leo 4
Pansy 5
3 hello 6
Leo 7
Pansy 8
B 1 hello 9
Leo 10
Pansy 11
2 hello 12
Leo 13
Pansy 14
3 hello 15
Leo 16
Pansy 17
dtype: int64
2.5 DataFrame条件查找
data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df = pd.DataFrame(data, index=labels)
#---OUTPUT---
animal age visits priority
a cat 2.5 1 yes
b cat 3.0 3 yes
c snake 0.5 2 no
d dog NaN 3 yes
e dog 5.0 2 no
f cat 2.0 3 no
g snake 4.5 1 no
h cat NaN 1 yes
i dog 7.0 2 no
j dog 3.0 1 no
df['age']
# ---OUTPUT
a 2.5
b 3.0
c 0.5
d NaN
e 5.0
f 2.0
g 4.5
h NaN
i 7.0
j 3.0
Name: age, dtype: float64
df['age'] > 3
# ---OUTPUT---
a False
b False
c False
d False
e True
f False
g True
h False
i True
j False
Name: age, dtype: bool
df[df['age'] > 3]
# ---OUTPUT---
animal age visits priority
e dog 5.0 2 no
g snake 4.5 1 no
i dog 7.0 2 no