- 1. 基本数据结构
- 1.1 Series
- 1.2 DataFrame — 二维表结构
- 1.2.1 初始化的2种方式
- 1.2.2 DataFrame的类型
- 1.2.3 DataFrame的基本操作
- 1.2.3.1 预览头N(默认5)行数据
- 1.2.3.2 预览尾N(默认5)行数据
- 1.2.3.3 查看DataFrame的索引
- 1.2.3.4 查看DataFrame的列名
- 1.2.3.5 查看DataFrame的数值
- 1.2.3.6 查看DataFrame的统计数据
- 1.2.3.7 DataFrame的转置操作(横列互换)
- 1.2.3.8 对DataFrame进行排序
- 1.2.3.9 DataFrame的切片 — N排
- 1.2.3.10 DataFrame 获取rows of columns — DataFrame.loc[input, col]
- 1.2.3.11 DataFrame 获取rows of columns — DataFrame.iloc[input, col]
- 1.2.3.12 生成DataFrame副本
- 1.2.3.13 判别DataFrame元素为空
- 1.2.3.14 添加列数据
- 1.2.3.15 修改value
- 1.2.3.16 DataFrame合并
- 1.2.3.17 分组groupby
- 数据
- —-OUTPUT—-
- —-OUTPUT—-
- 2. 实用进阶
1. 基本数据结构
1.1 Series
Series是个定长的字典序列。
存储时相当于两个ndarray。
1.1.1 初始化的3种方式
Parameters----------data : array-like, Iterable, dict, or scalar valueContains data stored in Series... versionchanged :: 0.23.0If data is a dict, argument order is maintained for Python 3.6and later.index : array-like or Index (1d)Values must be hashable and have the same length as `data`.Non-unique index values are allowed. Will default toRangeIndex (0, 1, 2, ..., n) if not provided. If both a dict and indexsequence are used, the index will override the keys found in thedict.dtype : str, numpy.dtype, or ExtensionDtype, optionaldtype for the output Series. If not specified, this will beinferred from `data`.See the :ref:`user guide <basics.dtypes>` for more usages.copy : bool, default FalseCopy input data.
from pandas import Seriesimport numpy as np# 方式1:以列表进行初始化,列表值为列值li = [1,2,3,4,5]x1 = Series(li) # index 默认从0起始# ---OUTPUT---0 11 22 33 44 5dtype: int64# 方式2:以 numpy.ndarray 进行初始化n = np.random.random(5)x1 = Series(n, index=['a', 'b', 'c', 'd'], dtype=str)# ---OUTPUT---a 0.377831b 0.629109c 0.841532d -0.429905dtype: float64# 方式3:以字典方式创建Seriesd = {'a':5, 'b':6, 'c':7, 'd':8}x2 = Series(d)# ---OUTPUT---a 5b 6c 7d 8dtype: int64
1.1.2 修改Index
# 修改indexs1 = pd.Series([1,2,3])# ---OUTPUT---0 11 22 3dtype: int64s1.index = ['a', 'b', 'c']# ---OUTPUT---a 1b 2c 3dtype: int64
1.1.3 纵向拼接
s1.append(s2)
1.1.4 按指定索引删除元素
si.drop('e') # 删除index为'e'的元素信息
1.1.5 按索引值索引元素
# 修改索引值为'A'的值为6si['A'] = 6
1.1.6 Series的切片操作
si[:3] # si的前三个索引元素的切片,类型仍然为Series
1.1.7 Series的运算符
# S3a 1b 2c 3d 4e 5dtype: int64# S4a 1b 2c 3d 4A 6B 1C 2D 3E 4dtype: int64
- 加法:加法/减法/乘法/除法运算是按照索引计算,如果索引不同则填充为
NaN(空值).
S4.add(S3)# --OUTPUT---A NaNB NaNC NaND NaNE NaNa 2.0b 4.0c 6.0d 8.0e NaNdtype: float64
S4.sub(S3)# --OUTPUT---A NaNB NaNC NaND NaNE NaNa 0.0b 0.0c 0.0d 0.0e NaNdtype: float64
1.1.8 ufunc
# 中位数si.median()# 求和si.sum()# 最大值si.max()# 最小值si.min()
1.2 DataFrame — 二维表结构
DataFrame数据结构类似于数据库表,包含了行索引和列索引,可以将其视为由相同索引的Series组成的字典类型。
1.2.1 初始化的2种方式
Parameters----------data : ndarray (structured or homogeneous), Iterable, dict, or DataFrameDict can contain Series, arrays, constants, or list-like objects.. versionchanged :: 0.23.0If data is a dict, argument order is maintained for Python 3.6and later.index : Index or array-likeIndex to use for resulting frame. Will default to RangeIndex ifno indexing information part of input data and no index providedcolumns : Index or array-likeColumn labels to use for resulting frame. Will default toRangeIndex (0, 1, 2, ..., n) if no column labels are provideddtype : dtype, default NoneData type to force. Only a single dtype is allowed. If None, infercopy : boolean, default FalseCopy data from inputs. Only affects DataFrame / 2d ndarray input
>>> d = {'col1': [1, 2], 'col2': [3, 4]}# 方式1:由字典创造DataFrame, 字典的每个key-value对,为一列数据, 列头值已定>>> df = pd.DataFrame(data=d)>>> dfcol1 col20 1 31 2 4# 由numpy.ndarray创造DataFrame, ndarray仅为值, columns实为列头值, index为索引值df2 = pd.DataFrame(data=np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),columns=['a', 'b', 'c'], index=['1st', '2nd', '3rd'])# ---OUTPUT---a b c1st 1 2 32nd 4 5 63rd 7 8 9
1.2.2 DataFrame的类型
df.dtypes# ---OUTPUT---animal objectage float64visits int64priority objectdtype: object
1.2.3 DataFrame的基本操作
df# ---OUTPUT---animal age visits prioritya cat 2.5 1 yesb cat 3.0 3 yesc snake 0.5 2 nod dog NaN 3 yese dog 5.0 2 nof cat 2.0 3 nog snake 4.5 1 noh cat NaN 1 yesi dog 7.0 2 noj dog 3.0 1 no
1.2.3.1 预览头N(默认5)行数据
df.head(N)
1.2.3.2 预览尾N(默认5)行数据
df.tail(N)
1.2.3.3 查看DataFrame的索引
df.index# ---OUTPUT---Index(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'], dtype='object')
1.2.3.4 查看DataFrame的列名
df.columns# ---OUTPUT---Index(['animal', 'age', 'visits', 'priority'], dtype='object')
1.2.3.5 查看DataFrame的数值
df.values# ---OUTPUT---array([['cat', 2.5, 1, 'yes'],['cat', 3.0, 3, 'yes'],['snake', 0.5, 2, 'no'],['dog', nan, 3, 'yes'],['dog', 5.0, 2, 'no'],['cat', 2.0, 3, 'no'],['snake', 4.5, 1, 'no'],['cat', nan, 1, 'yes'],['dog', 7.0, 2, 'no'],['dog', 3.0, 1, 'no']], dtype=object)
1.2.3.6 查看DataFrame的统计数据
df.describe()# ---OUTPUT---age visitscount 8.000000 10.000000mean 3.437500 1.900000std 2.007797 0.875595min 0.500000 1.00000025% 2.375000 1.00000050% 3.000000 2.00000075% 4.625000 2.750000max 7.000000 3.000000
1.2.3.7 DataFrame的转置操作(横列互换)
df.T # 类型仍为 pandas.core.frame.DataFrame
1.2.3.8 对DataFrame进行排序
df.sort_values(by='age')
sort_values(by, axis=0, ascending=True, inplace=False, kind=’quicksort’, na_position=’last’) method of pandas.core.frame.DataFrame instance.
Parameters:
———————
by : str or list of str
Name or list of names to sort by.
- if axis is 0 or 'index' then by may contain index
levels and/or column labels
- if axis is 1 or 'columns' then by may contain column
levels and/or index labels
axis : {0 or ‘index’, 1 or ‘columns’}, default 0
Axis to be sorted
ascending : bool or list of bool, default True
Sort ascending vs. descending. Specify list for multiple sort
orders. If this is a list of bools, must match the length of
the by.
inplace : bool, default False
if True, perform operation in-place
kind : {‘quicksort’, ‘mergesort’, ‘heapsort’}, default ‘quicksort’
Choice of sorting algorithm. See also ndarray.np.sort for more
information. mergesort is the only stable algorithm. For
DataFrames, this option is only applied when sorting on a single
column or label.
na_position : {‘first’, ‘last’}, default ‘last’
first puts NaNs at the beginning, last puts NaNs at the end
TIPS:
若参数axis为0,则是依据纵列值进行排序,by取值应为column的label;
若参数axis为1,则是依据横排值进行排序,by取值应为index的label。
1.2.3.9 DataFrame的切片 — N排
df[1:3]
1.2.3.10 DataFrame 获取rows of columns — DataFrame.loc[input, col]
Access a group of rows and columns by label(s) or a boolean array.
input:
- single label : ‘a’ …
- A list or array of labels : [‘a’, ‘b’, ‘c’]…
- A slice object with labels : ‘a’ : ‘f’ (including)
- A boolean array of the same length as the axis being sliced : [True, True, False]
- A callable function with one argument (the calling Series, DataFrame or Panel) and that returns valid output for indexing (one of the above)
col : 可选
- single col name — Series type
- a list of col names — DataFrame
data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']df = pd.DataFrame(data, index=labels)# ---OUTPUT---animal age visits prioritya cat 2.5 1 yesb cat 3.0 3 yesc snake 0.5 2 nod dog NaN 3 yese dog 5.0 2 nof cat 2.0 3 nog snake 4.5 1 noh cat NaN 1 yesi dog 7.0 2 noj dog 3.0 1 no# input : single labeldf.loc['a']# ---OUTPUT---animal catage 2.5visits 1priority yesName: a, dtype: object# input : a list of labelsdf.loc[['a', 'b']]# ---OUTPUT---animal age visits prioritya cat 2.5 1 yesb cat 3.0 3 yes# input : a slice object with labelsdf.loc['b' : 'c']# ---OUTPUT---animal age visits priorityb cat 3.0 3 yesc snake 0.5 2 no# A boolean arraydf.loc[df['age'] > 4]# ---OUTPUT---animal age visits prioritye dog 5.0 2 nog snake 4.5 1 noi dog 7.0 2 no# input : A callable functiondef func(DataFrame):return 'a'df.loc[func]# ---OUTPUT---animal catage 2.5visits 1priority yesName: a, dtype: object# col : single col namedf.loc['a':'c', 'visits']# ---OUTPUT---a 1b 3c 2Name: visits, dtype: int64# col : a list of col namesdf.loc['a':'c', ['visits', 'age']]visits agea 1 2.5b 3 3.0c 2 0.5
1.2.3.11 DataFrame 获取rows of columns — DataFrame.iloc[input, col]
Purely integer-location based indexing for selection by position from 0 to length-1 of the axis.
input :
- An integer : 3
- A list or array of integers : [2,4,7]
- A slice object with ints : 2 : 5 (excluding)
- a boolean array
- A callable function with one argument (the calling Series, DataFrame or Panel) and that returns valid output for indexing (one of the above)
col :
- An integer
- a list of array of integers
- a slice object with ints
# 沿用上例的数据# input : An integerdf.iloc[2]# ---OUTPUT---animal snakeage 0.5visits 2priority noName: c, dtype: object# input : A list of array of integersdf.iloc[[2, 4]]# ---OUTPUT---animal age visits priorityc snake 0.5 2 noe dog 5.0 2 no# input : A slice object with intsdf.iloc[2:4]# ---OUTPUT---animal age visits priorityc snake 0.5 2 nod dog NaN 3 yes# intput : a boolean array -- 不能使用 df['age']>6这种方式df.iloc[[False, True]]# ---OUTPUT---animal age visits priorityb cat 3.0 3 yes# input : funcdf.iloc[lambda df : [4, 5]]# ---OUTPUT---animal age visits prioritye dog 5.0 2 nof cat 2.0 3 no# loc : An integerdf.iloc[0:3, 1]# ---OUTPUT---a 2.5b 3.0c 0.5Name: age, dtype: float64a 2.5b 3.0c 0.5Name: age, dtype: float64# loc : A list of integersdf.iloc[0:3, [1,2]]# ---OUTPUT---age visitsa 2.5 1b 3.0 3c 0.5 2# loc : a slice object with intsdf.iloc[0:3, 1:3]# ---OUTPUT---age visitsa 2.5 1b 3.0 3c 0.5 2
1.2.3.12 生成DataFrame副本
df_bak = df.copy()
1.2.3.13 判别DataFrame元素为空
df.isnull()返回一个与df相同size的DataFrame,若元素之为None或numpy.NaN,则返回True,否则返回Flase
1.2.3.14 添加列数据
col_add = pd.Series([0,1,2,3,4,5,6,7,8,9], index=df.index)df['No.'] = col_add
1.2.3.15 修改value
修改DataFrame的value的方法:
- 通过绝对坐标进行修改
df.iat(x,y) - 通过索引 index,col 进行修改
df.loc(index, col)
# a. 通过绝对坐标进行修改df.iat[0, 1] = 3# b. 通过索引 index,col 进行修改df['f', 'age'] = 15
DataFrame.iat : Access a single value for a row/column pair by integer position.
DataFrame.at : Access a single value for a row/column label pair
DataFrame.iloc : Access a group of rows and columns by integer position(s)
DataFrame.loc : Access a group of rows and columns by label(s)
1.2.3.16 DataFrame合并
merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False,right_index=False, sort=False, suffixes=('_x', '_y'), copy=True,indicator=False, validate=None)Merge DataFrame or named Series objects with a database-style join.The join is done on columns or indexes. If joining columns oncolumns, the DataFrame indexes *will be ignored*. Otherwise if joining indexeson indexes or indexes on a column or columns, the index will be passed on.Parameters----------left : DataFrameright : DataFrame or named SeriesObject to merge with.how : {'left', 'right', 'outer', 'inner'}, default 'inner'Type of merge to be performed.* left: use only keys from left frame, similar to a SQL left outer join;preserve key order.* right: use only keys from right frame, similar to a SQL right outer join;preserve key order.* outer: use union of keys from both frames, similar to a SQL full outerjoin; sort keys lexicographically.* inner: use intersection of keys from both frames, similar to a SQL innerjoin; preserve the order of the left keys.on : label or listColumn or index level names to join on. These must be found in bothDataFrames. If `on` is None and not merging on indexes then this defaultsto the intersection of the columns in both DataFrames.left_on : label or list, or array-likeColumn or index level names to join on in the left DataFrame. Can alsobe an array or list of arrays of the length of the left DataFrame.These arrays are treated as if they are columns.right_on : label or list, or array-likeColumn or index level names to join on in the right DataFrame. Can alsobe an array or list of arrays of the length of the right DataFrame.These arrays are treated as if they are columns.left_index : bool, default FalseUse the index from the left DataFrame as the join key(s). If it is aMultiIndex, the number of keys in the other DataFrame (either the indexor a number of columns) must match the number of levels.right_index : bool, default FalseUse the index from the right DataFrame as the join key. Same caveats asleft_index.sort : bool, default FalseSort the join keys lexicographically in the result DataFrame. If False,the order of the join keys depends on the join type (how keyword).suffixes : tuple of (str, str), default ('_x', '_y')Suffix to apply to overlapping column names in the left and rightside, respectively. To raise an exception on overlapping columns use(False, False).copy : bool, default TrueIf False, avoid copy if possible.indicator : bool or str, default FalseIf True, adds a column to output DataFrame called "_merge" withinformation on the source of each row.If string, column with information on source of each row will be added tooutput DataFrame, and column will be named value of string.Information column is Categorical-type and takes on a value of "left_only"for observations whose merge key only appears in 'left' DataFrame,"right_only" for observations whose merge key only appears in 'right'DataFrame, and "both" if the observation's merge key is found in both.validate : str, optionalIf specified, checks if merge is of specified type.* "one_to_one" or "1:1": check if merge keys are unique in bothleft and right datasets.* "one_to_many" or "1:m": check if merge keys are unique in leftdataset.* "many_to_one" or "m:1": check if merge keys are unique in rightdataset.* "many_to_many" or "m:m": allowed, but does not result in checks.Returns-------DataFrameA DataFrame of the two merged objects.See Also--------merge_ordered : Merge with optional filling/interpolation.merge_asof : Merge on nearest keys.DataFrame.join : Similar method using indices.
left = pd.DataFrame({'key': ['foo1', 'foo2'], 'one': [1, 2]})right = pd.DataFrame({'key': ['foo2', 'foo3'], 'two': [4, 5]})print(left)print(right)# ---OUTPUT---key one0 foo1 11 foo2 2key two0 foo2 41 foo3 5pd.merge(left, right, on='key')key one two0 foo2 2 4pd.merge(left, right, on='key', how='outer')key one two0 foo1 1.0 NaN1 foo2 2.0 4.02 foo3 NaN 5.0pd.merge(left, right, on='key', how='left')key one two0 foo1 1 NaN1 foo2 2 4.0pd.merge(left, right, on='key', how='right')key one two0 foo2 2.0 41 foo3 NaN 5
1.2.3.17 分组groupby
- 单一维度,列有重复值,按列分类
```python
数据
df = pd.DataFrame({‘Animal’ : [‘Falcon’, ‘Falcon’,'Parrot', 'Parrot'],'Max Speed' : [380., 370., 24., 26.]})
—-OUTPUT—-
Animal Max Speed 0 Falcon 380.0 1 Falcon 370.0 2 Parrot 24.0 3 Parrot 26.0
df.groupby(‘Animal’).size()
—-OUTPUT—-
Animal Falcon 2 Parrot 2 dtype: int64
- 多维度Index的分组```python# 数据arrays = [['Falcon', 'Falcon', 'Parrot', 'Parrot'],['Capitve', 'Wild', 'Capitve', 'Wild']]index = pd.MultiIndex.from_arrays(arrays, names=('Animal', 'Type'))df = pd.DataFrame({'Max Speed' : [390., 350., 30., 20.]}, index=index)# ---OUTPUT---Max SpeedAnimal TypeFalcon Capitve 390.0Wild 350.0Parrot Capitve 30.0Wild 20.0df.groupby(level=0).size()# ---OUTPUT---AnimalFalcon 2Parrot 2dtype: int64df.groupby(level=1).size()# ---OUTPUT---TypeCapitve 2Wild 2dtype: int64
1.2.4 DataFrame对缺失值的操作
- 对缺失值填充
fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None, **kwargs) method of pandas.core.series.Series instance
Parameters:
——————
inplace : boolean, default False
# 将age列NaN填充为3df['age'].fillna(3)# 将整个DataFrame的NaN填充为0df.fillna(0)# inplace=True 执行原地替换# 各列替换为不同的值df.fillna({'A':0, 'B':1, 'C':2, 'D':3})
- 删除存在缺失的行
dropna(axis=0, how=’any’, thresh=None, subset=None, inplace=False) method of pandas.core.frame.DataFrame instance
Remove missing values.<br />Parameters:<br />------------<br />**axis** : {0 or 'index', 1 or 'columns'}, default 0<br /> Determine if rows or columns which contain missing values are<br /> removed.<br /> <br /> * 0, or 'index' : Drop rows which contain missing values.<br /> * 1, or 'columns' : Drop columns which contain missing value.<br />.. deprecated:: 0.23.0: Pass tuple or list to drop on multiple<br /> axes.
how : {‘any’, ‘all’}, default ‘any’
Determine if row or column is removed from DataFrame, when we have
at least one NA or all NA.
‘any’ : If any NA values are present, drop that row or column.
‘all’ : If all values are NA, drop that row or column.
subset : array-like, optional
Labels along other axis to consider, e.g. if you are dropping rows
these would be a list of columns to include.
inplace : bool, default False
If True, do operation inplace and return None.
# 删除任意 NaN 所在的行df.dropna(how='any')# 删除任意 NaN 所在的列df.dropna(axis=1)# 删除整行皆为 NaN 的行df.dropna(how='all')# 删除 NAN 所在行,范围锁定在 name 和 bore 列df.dropna(axis=0, subset=['name', 'born'])# 删除 NAN 所在列,范围锁定在 第0和第1行df.dropna(axis=1, subset=[0, 1])# 原地删除df.dropna(inplace=True)
DataFrame.isna: Indicate missing values.<br /> DataFrame.notna : Indicate existing (non-missing) values.<br /> DataFrame.fillna : Replace missing values.<br /> Series.dropna : Drop missing values.<br /> Index.dropna : Drop missing indices.
1.2.5 DataFrame文件操作
Pandas可以将DataFrame写入Excel/CSV文件中,同样也可以从Excel/CSV文件中读取。
# 写入csv文件df.to_csv('animal.csv')# 读取csv文件df.read_csv('animal.csv')# 写入excelwriter = pd.ExcelWriter('animal.xlsx')df3.to_excel(writer, sheet_name='Sheet1')df3.to_excel(writer, sheet_name='Sheet2')df3.to_excel(writer, sheet_name='Sheet3')writer.save()# 读取excelpd.read_excel('animal.xlsx', 2, index_col=None, na_values=['NA'])
具体的参数值函数,请参考doc。
2. 实用进阶
2.1 pandas.core.indexes.datetimes.DatetimeIndex
建立一个以 2018 年每一天为索引,值为随机数的 Series
dti = pd.date_range(start = '2018-01-01', end='2018-12-31', freq='D')s = pd.Series(np.random.rand(len(dti)), index=dti)
统计每个周三的值的和
s[s.index.weekday == 2].sum()
统计每个月的平均值
s.resample('M').mean()
统计每分钟的平均值
s = pd.date_range(start='2018/1/1', end='2018/1/2', closed='left', freq='S')ts = pd.Series(np.random.randint(0, 500, len(s)), index=s)ts.resample('Min').mean()
resample 适合小粒度向大粒度转换, 否则大粒度向小粒度的转换会产生NaN
转换 tz
s = pd.date_range('today', periods=1, freq='D') # 获取当前时间ts = pd.Series(np.random.randn(len(s)), index=s) # 随机数值ts.tz_localize('UTC') # 转换为 UTC 时间2019-05-28 09:17:29.204571+00:00 1.583188Freq: D, dtype: float64ts.tz_localize('Asia/Shanghai')# ---OUTPUT---2019-05-28 17:08:55.604510+08:00 0.245186Freq: D, dtype: float64
2.2 Series 多重索引
# 创建多重索引Seriesletters = ['A', 'B', 'C']numbers = list(range(3))mi = pd.MultiIndex.from_product([letters, numbers]) # 设置多重索引s = pd.Series(np.random.rand(len(letters) * len(numbers)),index=mi,names=['ca', 'idx']) # 随机数ca idxA 0 0.4304021 0.9901262 0.084091B 0 0.8789421 0.2090602 0.717289C 0 0.1438331 0.7809682 0.297559dtype: float64
索引查询
# 一级索引全部,二级索引的0和2s.loc[:, [0,2]]# ---OUTPUT---A 0 0.4705442 0.907349B 0 0.1535062 0.875122C 0 0.6800052 0.169241dtype: float64# 一级索引的'A', 二级索引的0和2s.loc['A', [0,2]]# ---OUTPUT---A 0 0.4705442 0.907349dtype: float64# 一级索引的'A'和'C', 二级索引的0和2s.loc[['A', 'C'], [0, 2]]# ---OUTPUT---A 0 0.4705442 0.907349C 0 0.6800052 0.169241dtype: float64
多重索引的切片
s.loc[pd.IndexSlice[:'B', 1:]]# ---OUTPUT---ca idxA 1 0.9458812 0.119563B 1 0.9797272 0.531048dtype: float64
2.3 DataFrame多重索引
# 创建多重索引 DataFrameframe = pd.DataFrame(np.arange(18).reshape(6, 3),index=[['A','A', 'A', 'B', 'B', 'B'], list('123123')],columns=['hello', 'Leo', 'Pansy'])# ---OUTPUT---hello Leo PansyA 1 0 1 22 3 4 53 6 7 8B 1 9 10 112 12 13 143 15 16 17
多重索引分组求和
frame.index.names = ['first', 'second']# ---OUTPUT---hello Leo Pansyfirst secondA 1 0 1 22 3 4 53 6 7 8B 1 9 10 112 12 13 143 15 16 17frame.groupby('first').sum()# ---OUTPUT---hello Leo PansyfirstA 9 12 15B 36 39 42
2.4 DataFrame行列互换
frame.stack() # 一行三列变三行一列# ---OUTPUT---first secondA 1 hello 0Leo 1Pansy 22 hello 3Leo 4Pansy 53 hello 6Leo 7Pansy 8B 1 hello 9Leo 10Pansy 112 hello 12Leo 13Pansy 143 hello 15Leo 16Pansy 17dtype: int64
2.5 DataFrame条件查找
data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']df = pd.DataFrame(data, index=labels)#---OUTPUT---animal age visits prioritya cat 2.5 1 yesb cat 3.0 3 yesc snake 0.5 2 nod dog NaN 3 yese dog 5.0 2 nof cat 2.0 3 nog snake 4.5 1 noh cat NaN 1 yesi dog 7.0 2 noj dog 3.0 1 nodf['age']# ---OUTPUTa 2.5b 3.0c 0.5d NaNe 5.0f 2.0g 4.5h NaNi 7.0j 3.0Name: age, dtype: float64df['age'] > 3# ---OUTPUT---a Falseb Falsec Falsed Falsee Truef Falseg Trueh Falsei Truej FalseName: age, dtype: booldf[df['age'] > 3]# ---OUTPUT---animal age visits prioritye dog 5.0 2 nog snake 4.5 1 noi dog 7.0 2 no
