1. Basic Data Structures

1.1 Series

A Series is a fixed-length, ordered sequence that behaves like a dict.
Internally it is backed by two ndarrays: one for the values and one for the index.

1.1.1 Three ways to initialize

```
Parameters
----------
data : array-like, Iterable, dict, or scalar value
    Contains data stored in Series.

    .. versionchanged :: 0.23.0
       If data is a dict, argument order is maintained for Python 3.6
       and later.
index : array-like or Index (1d)
    Values must be hashable and have the same length as `data`.
    Non-unique index values are allowed. Will default to
    RangeIndex (0, 1, 2, ..., n) if not provided. If both a dict and index
    sequence are used, the index will override the keys found in the dict.
dtype : str, numpy.dtype, or ExtensionDtype, optional
    dtype for the output Series. If not specified, this will be
    inferred from `data`.
    See the :ref:`user guide <basics.dtypes>` for more usages.
copy : bool, default False
    Copy input data.
```
```python
from pandas import Series
import numpy as np

# Method 1: initialize from a list; the list entries become the values
li = [1, 2, 3, 4, 5]
x1 = Series(li)  # the index defaults to 0, 1, 2, ...
# ---OUTPUT---
0    1
1    2
2    3
3    4
4    5
dtype: int64

# Method 2: initialize from a numpy.ndarray
# (the index must have the same length as the data)
n = np.random.randn(4)
x1 = Series(n, index=['a', 'b', 'c', 'd'])
# ---OUTPUT---
a    0.377831
b    0.629109
c    0.841532
d   -0.429905
dtype: float64

# Method 3: initialize from a dict; the keys become the index
d = {'a': 5, 'b': 6, 'c': 7, 'd': 8}
x2 = Series(d)
# ---OUTPUT---
a    5
b    6
c    7
d    8
dtype: int64
```

1.1.2 Modifying the index

```python
import pandas as pd

s1 = pd.Series([1, 2, 3])
# ---OUTPUT---
0    1
1    2
2    3
dtype: int64

# assigning to .index replaces the index labels in place
s1.index = ['a', 'b', 'c']
# ---OUTPUT---
a    1
b    2
c    3
dtype: int64
```

1.1.3 Vertical concatenation

```python
s1.append(s2)  # concatenates s2 below s1, returning a new Series
```
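Note that `Series.append` was deprecated in pandas 1.4 and removed in pandas 2.0; `pd.concat` performs the same vertical concatenation. A minimal sketch with made-up data:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([4, 5], index=['d', 'e'])

# pd.concat stacks the Series vertically; index labels are kept as-is
# (duplicate labels are allowed, just as with append)
s3 = pd.concat([s1, s2])
print(s3)
```

The result has the five labels 'a' through 'e' with values 1 through 5.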

1.1.4 Dropping elements by index label

```python
si.drop('e')  # returns a Series with the element labeled 'e' removed
```

1.1.5 Accessing elements by index label

```python
# set the value at index label 'A' to 6
si['A'] = 6
```

1.1.6 Slicing a Series

```python
si[:3]  # the first three elements; the result is still a Series
```
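One subtlety worth remembering: positional slices exclude the end, while label-based slices include it. A small sketch with a made-up `si`:

```python
import pandas as pd

si = pd.Series([10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])

head3 = si[:3]     # positional slice: first three elements, end position excluded
lab = si['a':'c']  # label slice: the end label 'c' is INCLUDED

# both select the rows labeled 'a', 'b', 'c' here
print(head3.equals(lab))
```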

1.1.7 Series arithmetic

```python
# S3
a    1
b    2
c    3
d    4
e    5
dtype: int64
# S4
a    1
b    2
c    3
d    4
A    6
B    1
C    2
D    3
E    4
dtype: int64
```

  • Addition/subtraction/multiplication/division align on the index; a label present in only one operand is filled with NaN in the result.

```python
S4.add(S3)
# ---OUTPUT---
A    NaN
B    NaN
C    NaN
D    NaN
E    NaN
a    2.0
b    4.0
c    6.0
d    8.0
e    NaN
dtype: float64
```

```python
S4.sub(S3)
# ---OUTPUT---
A    NaN
B    NaN
C    NaN
D    NaN
E    NaN
a    0.0
b    0.0
c    0.0
d    0.0
e    NaN
dtype: float64
```

1.1.8 ufuncs and reductions

```python
# median
si.median()
# sum
si.sum()
# maximum
si.max()
# minimum
si.min()
```
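These reductions skip missing values by default (`skipna=True`), which matters once a Series contains NaN. A quick sketch with made-up data:

```python
import numpy as np
import pandas as pd

si = pd.Series([1.0, np.nan, 3.0, 5.0])

# NaN is ignored by default, so the statistics are computed over [1, 3, 5]
print(si.sum())     # 9.0
print(si.median())  # 3.0
print(si.max())     # 5.0
print(si.min())     # 1.0
```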

1.2 DataFrame: a two-dimensional table

A DataFrame is similar to a database table: it has both a row index and column labels, and can be viewed as a dict of Series that share the same index.

1.2.1 Two ways to initialize

```
Parameters
----------
data : ndarray (structured or homogeneous), Iterable, dict, or DataFrame
    Dict can contain Series, arrays, constants, or list-like objects

    .. versionchanged :: 0.23.0
       If data is a dict, argument order is maintained for Python 3.6
       and later.
index : Index or array-like
    Index to use for resulting frame. Will default to RangeIndex if
    no indexing information part of input data and no index provided
columns : Index or array-like
    Column labels to use for resulting frame. Will default to
    RangeIndex (0, 1, 2, ..., n) if no column labels are provided
dtype : dtype, default None
    Data type to force. Only a single dtype is allowed. If None, infer
copy : boolean, default False
    Copy data from inputs. Only affects DataFrame / 2d ndarray input
```
```python
>>> d = {'col1': [1, 2], 'col2': [3, 4]}
# Method 1: from a dict; each key-value pair becomes a column, keys become the column labels
>>> df = pd.DataFrame(data=d)
>>> df
   col1  col2
0     1     3
1     2     4
# Method 2: from a numpy.ndarray of values, with columns as the column labels
# and index as the row labels
df2 = pd.DataFrame(data=np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                   columns=['a', 'b', 'c'], index=['1st', '2nd', '3rd'])
# ---OUTPUT---
     a  b  c
1st  1  2  3
2nd  4  5  6
3rd  7  8  9
```

1.2.2 DataFrame dtypes

```python
df.dtypes
# ---OUTPUT---
animal       object
age         float64
visits        int64
priority     object
dtype: object
```

1.2.3 Basic DataFrame operations

```python
df
# ---OUTPUT---
  animal  age  visits priority
a    cat  2.5       1      yes
b    cat  3.0       3      yes
c  snake  0.5       2       no
d    dog  NaN       3      yes
e    dog  5.0       2       no
f    cat  2.0       3       no
g  snake  4.5       1       no
h    cat  NaN       1      yes
i    dog  7.0       2       no
j    dog  3.0       1       no
```

1.2.3.1 Preview the first N rows (default 5)

```python
df.head(N)
```

1.2.3.2 Preview the last N rows (default 5)

```python
df.tail(N)
```

1.2.3.3 Inspecting the index

```python
df.index
# ---OUTPUT---
Index(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'], dtype='object')
```

1.2.3.4 Inspecting the column labels

```python
df.columns
# ---OUTPUT---
Index(['animal', 'age', 'visits', 'priority'], dtype='object')
```

1.2.3.5 Inspecting the values

```python
df.values
# ---OUTPUT---
array([['cat', 2.5, 1, 'yes'],
       ['cat', 3.0, 3, 'yes'],
       ['snake', 0.5, 2, 'no'],
       ['dog', nan, 3, 'yes'],
       ['dog', 5.0, 2, 'no'],
       ['cat', 2.0, 3, 'no'],
       ['snake', 4.5, 1, 'no'],
       ['cat', nan, 1, 'yes'],
       ['dog', 7.0, 2, 'no'],
       ['dog', 3.0, 1, 'no']], dtype=object)
```

1.2.3.6 Summary statistics

```python
df.describe()
# ---OUTPUT---
            age     visits
count  8.000000  10.000000
mean   3.437500   1.900000
std    2.007797   0.875595
min    0.500000   1.000000
25%    2.375000   1.000000
50%    3.000000   2.000000
75%    4.625000   2.750000
max    7.000000   3.000000
```

1.2.3.7 Transposing a DataFrame (swapping rows and columns)

```python
df.T  # the result is still a pandas.core.frame.DataFrame
```

1.2.3.8 Sorting a DataFrame

```python
df.sort_values(by='age')
```

sort_values(by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last') method of pandas.core.frame.DataFrame instance

```
Parameters
----------
by : str or list of str
    Name or list of names to sort by.
    - if axis is 0 or 'index' then by may contain index
      levels and/or column labels
    - if axis is 1 or 'columns' then by may contain column
      levels and/or index labels
axis : {0 or 'index', 1 or 'columns'}, default 0
    Axis to be sorted
ascending : bool or list of bool, default True
    Sort ascending vs. descending. Specify list for multiple sort
    orders. If this is a list of bools, must match the length of
    the by.
inplace : bool, default False
    if True, perform operation in-place
kind : {'quicksort', 'mergesort', 'heapsort'}, default 'quicksort'
    Choice of sorting algorithm. See also ndarray.np.sort for more
    information. mergesort is the only stable algorithm. For
    DataFrames, this option is only applied when sorting on a single
    column or label.
na_position : {'first', 'last'}, default 'last'
    first puts NaNs at the beginning, last puts NaNs at the end
```

TIPS:
With axis=0, rows are sorted by the values of a column, so `by` should be a column label;
with axis=1, columns are sorted by the values of a row, so `by` should be an index label.
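To make the `by`/`na_position` behavior concrete, a small sketch with made-up data (note that rows with NaN in the sort column go last by default):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'age': [2.5, np.nan, 0.5], 'visits': [1, 3, 2]},
                  index=['a', 'b', 'c'])

# axis=0 (default): rows are reordered by the 'age' column; NaN rows go last
by_age = df.sort_values(by='age')
print(by_age.index.tolist())   # ['c', 'a', 'b']

# na_position='first' moves the NaN rows to the top instead
nan_first = df.sort_values(by='age', na_position='first')
print(nan_first.index.tolist())  # ['b', 'c', 'a']
```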

1.2.3.9 Slicing rows of a DataFrame

```python
df[1:3]  # the rows at positions 1 and 2; the result is still a DataFrame
```

1.2.3.10 Selecting rows and columns with DataFrame.loc[input, col]

Access a group of rows and columns by label(s) or a boolean array.

input:

  • a single label: 'a' …
  • a list or array of labels: ['a', 'b', 'c'] …
  • a slice object with labels: 'a':'f' (both endpoints included)
  • a boolean array of the same length as the axis being sliced: [True, True, False]
  • a callable with one argument (the calling Series or DataFrame) that returns valid output for indexing (one of the above)

col (optional):

  • a single column name — returns a Series
  • a list of column names — returns a DataFrame
```python
data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
        'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
        'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
        'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df = pd.DataFrame(data, index=labels)
# ---OUTPUT---
  animal  age  visits priority
a    cat  2.5       1      yes
b    cat  3.0       3      yes
c  snake  0.5       2       no
d    dog  NaN       3      yes
e    dog  5.0       2       no
f    cat  2.0       3       no
g  snake  4.5       1       no
h    cat  NaN       1      yes
i    dog  7.0       2       no
j    dog  3.0       1       no
# input: a single label
df.loc['a']
# ---OUTPUT---
animal      cat
age         2.5
visits        1
priority    yes
Name: a, dtype: object
# input: a list of labels
df.loc[['a', 'b']]
# ---OUTPUT---
  animal  age  visits priority
a    cat  2.5       1      yes
b    cat  3.0       3      yes
# input: a slice object with labels (both endpoints included)
df.loc['b':'c']
# ---OUTPUT---
  animal  age  visits priority
b    cat  3.0       3      yes
c  snake  0.5       2       no
# input: a boolean array
df.loc[df['age'] > 4]
# ---OUTPUT---
  animal  age  visits priority
e    dog  5.0       2       no
g  snake  4.5       1       no
i    dog  7.0       2       no
# input: a callable
def func(frame):
    return 'a'
df.loc[func]
# ---OUTPUT---
animal      cat
age         2.5
visits        1
priority    yes
Name: a, dtype: object
# col: a single column name
df.loc['a':'c', 'visits']
# ---OUTPUT---
a    1
b    3
c    2
Name: visits, dtype: int64
# col: a list of column names
df.loc['a':'c', ['visits', 'age']]
# ---OUTPUT---
   visits  age
a       1  2.5
b       3  3.0
c       2  0.5
```

1.2.3.11 Selecting rows and columns with DataFrame.iloc[input, col]

Purely integer-location based indexing: positions run from 0 to length-1 of the axis.

input:

  • an integer: 3
  • a list or array of integers: [2, 4, 7]
  • a slice object with ints: 2:5 (end excluded)
  • a boolean array (same length as the axis)
  • a callable with one argument (the calling Series or DataFrame) that returns valid output for indexing (one of the above)

col:

  • an integer
  • a list or array of integers
  • a slice object with ints
```python
# reusing the DataFrame from the previous example
# input: an integer
df.iloc[2]
# ---OUTPUT---
animal      snake
age           0.5
visits          2
priority       no
Name: c, dtype: object
# input: a list of integers
df.iloc[[2, 4]]
# ---OUTPUT---
  animal  age  visits priority
c  snake  0.5       2       no
e    dog  5.0       2       no
# input: a slice object with ints (end excluded)
df.iloc[2:4]
# ---OUTPUT---
  animal  age  visits priority
c  snake  0.5       2       no
d    dog  NaN       3      yes
# input: a boolean array -- it must be a plain list/array whose length
# matches the axis; a boolean Series such as df['age'] > 6 is not accepted
# here (use .values, or df.loc, for that)
df.iloc[[False, True, False, False, False, False, False, False, False, False]]
# ---OUTPUT---
  animal  age  visits priority
b    cat  3.0       3      yes
# input: a callable
df.iloc[lambda df: [4, 5]]
# ---OUTPUT---
  animal  age  visits priority
e    dog  5.0       2       no
f    cat  2.0       3       no
# col: an integer
df.iloc[0:3, 1]
# ---OUTPUT---
a    2.5
b    3.0
c    0.5
Name: age, dtype: float64
# col: a list of integers
df.iloc[0:3, [1, 2]]
# ---OUTPUT---
   age  visits
a  2.5       1
b  3.0       3
c  0.5       2
# col: a slice object with ints
df.iloc[0:3, 1:3]
# ---OUTPUT---
   age  visits
a  2.5       1
b  3.0       3
c  0.5       2
```

1.2.3.12 Making a copy of a DataFrame

```python
df_bak = df.copy()
```

1.2.3.13 Testing for missing values

```python
df.isnull()
```

Returns a boolean DataFrame with the same shape as df: True where the element is None or numpy.NaN, otherwise False.
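A common follow-up is to count the missing values per column by summing the boolean mask. A small sketch with made-up data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'age': [2.5, np.nan, 5.0], 'visits': [1, 3, 2]})

mask = df.isnull()   # boolean DataFrame, same shape as df
print(mask)

# True counts as 1, so summing the mask gives missing values per column
print(df.isnull().sum())  # age: 1, visits: 0
```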

1.2.3.14 Adding a column

```python
col_add = pd.Series([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], index=df.index)
df['No.'] = col_add
```

1.2.3.15 Modifying values

Two ways to modify a value in a DataFrame:

  1. by absolute (integer) position: df.iat[x, y]
  2. by index/column labels: df.loc[index, col]

```python
# a. by integer position
df.iat[0, 1] = 3
# b. by index/column labels
df.loc['f', 'age'] = 15
```

DataFrame.iat : Access a single value for a row/column pair by integer position.
DataFrame.at : Access a single value for a row/column label pair.

DataFrame.iloc : Access a group of rows and columns by integer position(s).
DataFrame.loc : Access a group of rows and columns by label(s).
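The four accessors can be compared side by side. A minimal sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'age': [2.5, 3.0], 'visits': [1, 3]}, index=['a', 'b'])

df.iat[0, 0] = 4.0         # single value by integer position (row 0, column 0)
df.at['b', 'visits'] = 9   # single value by label pair
df.loc['a', 'visits'] = 2  # loc also accepts a row/column label pair

print(df)
```

`at`/`iat` are the scalar fast paths; `loc`/`iloc` additionally accept lists, slices, and boolean arrays.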

1.2.3.16 Merging DataFrames

```
merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False,
      right_index=False, sort=False, suffixes=('_x', '_y'), copy=True,
      indicator=False, validate=None)

Merge DataFrame or named Series objects with a database-style join.

The join is done on columns or indexes. If joining columns on
columns, the DataFrame indexes *will be ignored*. Otherwise if joining indexes
on indexes or indexes on a column or columns, the index will be passed on.

Parameters
----------
left : DataFrame
right : DataFrame or named Series
    Object to merge with.
how : {'left', 'right', 'outer', 'inner'}, default 'inner'
    Type of merge to be performed.
    * left: use only keys from left frame, similar to a SQL left outer join;
      preserve key order.
    * right: use only keys from right frame, similar to a SQL right outer join;
      preserve key order.
    * outer: use union of keys from both frames, similar to a SQL full outer
      join; sort keys lexicographically.
    * inner: use intersection of keys from both frames, similar to a SQL inner
      join; preserve the order of the left keys.
on : label or list
    Column or index level names to join on. These must be found in both
    DataFrames. If `on` is None and not merging on indexes then this defaults
    to the intersection of the columns in both DataFrames.
left_on : label or list, or array-like
    Column or index level names to join on in the left DataFrame. Can also
    be an array or list of arrays of the length of the left DataFrame.
    These arrays are treated as if they are columns.
right_on : label or list, or array-like
    Column or index level names to join on in the right DataFrame. Can also
    be an array or list of arrays of the length of the right DataFrame.
    These arrays are treated as if they are columns.
left_index : bool, default False
    Use the index from the left DataFrame as the join key(s). If it is a
    MultiIndex, the number of keys in the other DataFrame (either the index
    or a number of columns) must match the number of levels.
right_index : bool, default False
    Use the index from the right DataFrame as the join key. Same caveats as
    left_index.
sort : bool, default False
    Sort the join keys lexicographically in the result DataFrame. If False,
    the order of the join keys depends on the join type (how keyword).
suffixes : tuple of (str, str), default ('_x', '_y')
    Suffix to apply to overlapping column names in the left and right
    side, respectively. To raise an exception on overlapping columns use
    (False, False).
copy : bool, default True
    If False, avoid copy if possible.
indicator : bool or str, default False
    If True, adds a column to output DataFrame called "_merge" with
    information on the source of each row.
    If string, column with information on source of each row will be added to
    output DataFrame, and column will be named value of string.
    Information column is Categorical-type and takes on a value of "left_only"
    for observations whose merge key only appears in 'left' DataFrame,
    "right_only" for observations whose merge key only appears in 'right'
    DataFrame, and "both" if the observation's merge key is found in both.
validate : str, optional
    If specified, checks if merge is of specified type.
    * "one_to_one" or "1:1": check if merge keys are unique in both
      left and right datasets.
    * "one_to_many" or "1:m": check if merge keys are unique in left
      dataset.
    * "many_to_one" or "m:1": check if merge keys are unique in right
      dataset.
    * "many_to_many" or "m:m": allowed, but does not result in checks.

Returns
-------
DataFrame
    A DataFrame of the two merged objects.

See Also
--------
merge_ordered : Merge with optional filling/interpolation.
merge_asof : Merge on nearest keys.
DataFrame.join : Similar method using indices.
```
```python
left = pd.DataFrame({'key': ['foo1', 'foo2'], 'one': [1, 2]})
right = pd.DataFrame({'key': ['foo2', 'foo3'], 'two': [4, 5]})
print(left)
print(right)
# ---OUTPUT---
    key  one
0  foo1    1
1  foo2    2
    key  two
0  foo2    4
1  foo3    5
pd.merge(left, right, on='key')
    key  one  two
0  foo2    2    4
pd.merge(left, right, on='key', how='outer')
    key  one  two
0  foo1  1.0  NaN
1  foo2  2.0  4.0
2  foo3  NaN  5.0
pd.merge(left, right, on='key', how='left')
    key  one  two
0  foo1    1  NaN
1  foo2    2  4.0
pd.merge(left, right, on='key', how='right')
    key  one  two
0  foo2  2.0    4
1  foo3  NaN    5
```
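When comparing the join types above, the `indicator` parameter is handy for seeing where each row came from. A short sketch using the same `left`/`right` frames:

```python
import pandas as pd

left = pd.DataFrame({'key': ['foo1', 'foo2'], 'one': [1, 2]})
right = pd.DataFrame({'key': ['foo2', 'foo3'], 'two': [4, 5]})

# indicator=True adds a categorical '_merge' column:
# 'left_only', 'right_only', or 'both'
m = pd.merge(left, right, on='key', how='outer', indicator=True)
print(m[['key', '_merge']])
```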

1.2.3.17 分组groupby

  • 单一维度,列有重复值,按列分类 ```python

    数据

    df = pd.DataFrame({‘Animal’ : [‘Falcon’, ‘Falcon’,
    1. 'Parrot', 'Parrot'],
    2. 'Max Speed' : [380., 370., 24., 26.]})

    —-OUTPUT—-

    Animal Max Speed 0 Falcon 380.0 1 Falcon 370.0 2 Parrot 24.0 3 Parrot 26.0

df.groupby(‘Animal’).size()

—-OUTPUT—-

Animal Falcon 2 Parrot 2 dtype: int64

  1. - 多维度Index的分组
  2. ```python
  3. # 数据
  4. arrays = [['Falcon', 'Falcon', 'Parrot', 'Parrot'],
  5. ['Capitve', 'Wild', 'Capitve', 'Wild']]
  6. index = pd.MultiIndex.from_arrays(arrays, names=('Animal', 'Type'))
  7. df = pd.DataFrame({'Max Speed' : [390., 350., 30., 20.]}, index=index)
  8. # ---OUTPUT---
  9. Max Speed
  10. Animal Type
  11. Falcon Capitve 390.0
  12. Wild 350.0
  13. Parrot Capitve 30.0
  14. Wild 20.0
  15. df.groupby(level=0).size()
  16. # ---OUTPUT---
  17. Animal
  18. Falcon 2
  19. Parrot 2
  20. dtype: int64
  21. df.groupby(level=1).size()
  22. # ---OUTPUT---
  23. Type
  24. Capitve 2
  25. Wild 2
  26. dtype: int64
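`size()` only counts rows; a grouped column also supports the usual reductions (`mean`, `sum`, `max`, …). A quick sketch reusing the first example's data:

```python
import pandas as pd

df = pd.DataFrame({'Animal': ['Falcon', 'Falcon', 'Parrot', 'Parrot'],
                   'Max Speed': [380., 370., 24., 26.]})

# mean speed per group: Falcon (380+370)/2, Parrot (24+26)/2
means = df.groupby('Animal')['Max Speed'].mean()
print(means)  # Falcon 375.0, Parrot 25.0
```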

1.2.4 Handling missing values

  • Filling missing values

fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None, **kwargs) method of pandas.core.series.Series instance

```
Parameters
----------
inplace : boolean, default False
```

```python
# fill NaN in the age column with 3
df['age'].fillna(3)
# fill every NaN in the DataFrame with 0
df.fillna(0)
# inplace=True replaces the values in place
# fill each column with a different value
df.fillna({'A': 0, 'B': 1, 'C': 2, 'D': 3})
```
  • Dropping rows or columns with missing values

dropna(axis=0, how='any', thresh=None, subset=None, inplace=False) method of pandas.core.frame.DataFrame instance

```
Remove missing values.

Parameters
----------
axis : {0 or 'index', 1 or 'columns'}, default 0
    Determine if rows or columns which contain missing values are
    removed.
    * 0, or 'index' : Drop rows which contain missing values.
    * 1, or 'columns' : Drop columns which contain missing values.
    .. deprecated:: 0.23.0: Pass tuple or list to drop on multiple axes.
how : {'any', 'all'}, default 'any'
    Determine if row or column is removed from DataFrame, when we have
    at least one NA or all NA.
    * 'any' : If any NA values are present, drop that row or column.
    * 'all' : If all values are NA, drop that row or column.
subset : array-like, optional
    Labels along other axis to consider, e.g. if you are dropping rows
    these would be a list of columns to include.
inplace : bool, default False
    If True, do operation inplace and return None.
```

```python
# drop every row that contains any NaN
df.dropna(how='any')
# drop every column that contains any NaN
df.dropna(axis=1)
# drop rows whose values are all NaN
df.dropna(how='all')
# drop rows with NaN, considering only the 'name' and 'born' columns
df.dropna(axis=0, subset=['name', 'born'])
# drop columns with NaN, considering only rows 0 and 1
df.dropna(axis=1, subset=[0, 1])
# drop in place
df.dropna(inplace=True)
```

See also:
DataFrame.isna : Indicate missing values.
DataFrame.notna : Indicate existing (non-missing) values.
DataFrame.fillna : Replace missing values.
Series.dropna : Drop missing values.
Index.dropna : Drop missing indices.
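The `thresh` parameter, listed in the signature but not exercised above, keeps only rows with at least that many non-NaN values. A small sketch with made-up data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, np.nan],
                   'b': [2.0, 3.0, np.nan],
                   'c': [4.0, 5.0, np.nan]})

# row 0 has 3 non-NaN values, row 1 has 2, row 2 has 0;
# thresh=2 keeps rows with at least 2 non-NaN values
kept = df.dropna(thresh=2)
print(kept.index.tolist())  # [0, 1]
```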

1.2.5 Reading and writing files

Pandas can write a DataFrame to an Excel/CSV file, and likewise read one back.

```python
# write a csv file
df.to_csv('animal.csv')
# read a csv file (read_csv is a top-level pandas function, not a DataFrame method)
pd.read_csv('animal.csv')
# write an excel file
writer = pd.ExcelWriter('animal.xlsx')
df3.to_excel(writer, sheet_name='Sheet1')
df3.to_excel(writer, sheet_name='Sheet2')
df3.to_excel(writer, sheet_name='Sheet3')
writer.save()
# read an excel file (here: the third sheet)
pd.read_excel('animal.xlsx', 2, index_col=None, na_values=['NA'])
```

For the full parameter lists, see the documentation.

2. Practical Topics

2.1 pandas.core.indexes.datetimes.DatetimeIndex

Build a Series indexed by every day of 2018, with random values:

```python
dti = pd.date_range(start='2018-01-01', end='2018-12-31', freq='D')
s = pd.Series(np.random.rand(len(dti)), index=dti)
```

Sum the values that fall on Wednesdays:

```python
s[s.index.weekday == 2].sum()
```

Average the values by month:

```python
s.resample('M').mean()
```

Average the values by minute:

```python
s = pd.date_range(start='2018/1/1', end='2018/1/2', closed='left', freq='S')
ts = pd.Series(np.random.randint(0, 500, len(s)), index=s)
ts.resample('Min').mean()
```

resample is meant for going from a fine granularity to a coarser one; resampling from coarse to fine produces NaN for the newly created slots.
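A small sketch of that coarse-to-fine caveat, using a three-day series upsampled to half-day bins (the `'12h'` alias is the assumption here; older pandas spelled it `'12H'`):

```python
import pandas as pd

idx = pd.date_range('2018-01-01', periods=3, freq='D')
s = pd.Series([1.0, 2.0, 3.0], index=idx)

# Upsampling from daily to 12-hourly inserts new 12:00 slots;
# since no data falls in them, they come back as NaN
up = s.resample('12h').mean()
print(up)
```

Use `.ffill()` or `.interpolate()` on the resampler if the new slots should be filled.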

Attaching a time zone (tz):

```python
s = pd.date_range('today', periods=1, freq='D')  # the current time
ts = pd.Series(np.random.randn(len(s)), index=s)  # a random value
ts.tz_localize('UTC')  # label the naive timestamp as UTC
# ---OUTPUT---
2019-05-28 09:17:29.204571+00:00    1.583188
Freq: D, dtype: float64
ts.tz_localize('Asia/Shanghai')
# ---OUTPUT---
2019-05-28 17:08:55.604510+08:00    0.245186
Freq: D, dtype: float64
```

2.2 Multi-level indexing for Series

```python
# create a Series with a MultiIndex
letters = ['A', 'B', 'C']
numbers = list(range(3))
# names belongs to the MultiIndex, not to pd.Series
mi = pd.MultiIndex.from_product([letters, numbers], names=['ca', 'idx'])
s = pd.Series(np.random.rand(len(letters) * len(numbers)), index=mi)
# ---OUTPUT---
ca  idx
A   0      0.430402
    1      0.990126
    2      0.084091
B   0      0.878942
    1      0.209060
    2      0.717289
C   0      0.143833
    1      0.780968
    2      0.297559
dtype: float64
```

Label-based lookup:

```python
# everything in the first level, values 0 and 2 of the second level
s.loc[:, [0, 2]]
# ---OUTPUT---
A  0    0.470544
   2    0.907349
B  0    0.153506
   2    0.875122
C  0    0.680005
   2    0.169241
dtype: float64
# 'A' in the first level, 0 and 2 in the second level
s.loc['A', [0, 2]]
# ---OUTPUT---
A  0    0.470544
   2    0.907349
dtype: float64
# 'A' and 'C' in the first level, 0 and 2 in the second level
s.loc[['A', 'C'], [0, 2]]
# ---OUTPUT---
A  0    0.470544
   2    0.907349
C  0    0.680005
   2    0.169241
dtype: float64
```

Slicing a MultiIndex:

```python
s.loc[pd.IndexSlice[:'B', 1:]]
# ---OUTPUT---
ca  idx
A   1      0.945881
    2      0.119563
B   1      0.979727
    2      0.531048
dtype: float64
```

2.3 Multi-level indexing for DataFrames

```python
# create a DataFrame with a MultiIndex
frame = pd.DataFrame(np.arange(18).reshape(6, 3),
                     index=[['A', 'A', 'A', 'B', 'B', 'B'], list('123123')],
                     columns=['hello', 'Leo', 'Pansy'])
# ---OUTPUT---
     hello  Leo  Pansy
A 1      0    1      2
  2      3    4      5
  3      6    7      8
B 1      9   10     11
  2     12   13     14
  3     15   16     17
```

Group-and-sum over an index level:

```python
frame.index.names = ['first', 'second']
# ---OUTPUT---
              hello  Leo  Pansy
first second
A     1           0    1      2
      2           3    4      5
      3           6    7      8
B     1           9   10     11
      2          12   13     14
      3          15   16     17
frame.groupby('first').sum()
# ---OUTPUT---
       hello  Leo  Pansy
first
A          9   12     15
B         36   39     42
```

2.4 Pivoting rows and columns

```python
frame.stack()  # each row of three columns becomes three rows of one value
# ---OUTPUT---
first  second
A      1       hello     0
               Leo       1
               Pansy     2
       2       hello     3
               Leo       4
               Pansy     5
       3       hello     6
               Leo       7
               Pansy     8
B      1       hello     9
               Leo      10
               Pansy    11
       2       hello    12
               Leo      13
               Pansy    14
       3       hello    15
               Leo      16
               Pansy    17
dtype: int64
```
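stack() moves the column labels into the innermost index level, and unstack() moves a level back out to columns. A short sketch reusing the frame from above:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(18).reshape(6, 3),
                     index=[['A', 'A', 'A', 'B', 'B', 'B'], list('123123')],
                     columns=['hello', 'Leo', 'Pansy'])

stacked = frame.stack()  # Series with a 3-level index of length 18
print(stacked.loc[('A', '1', 'hello')])  # 0
print(stacked.loc[('B', '3', 'Pansy')])  # 17

# unstack() is the inverse: the innermost level goes back to columns
wide = stacked.unstack()
print(wide.shape)  # (6, 3)
```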

2.5 DataFrame条件查找

  1. data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
  2. 'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
  3. 'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
  4. 'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}
  5. labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
  6. df = pd.DataFrame(data, index=labels)
  7. #---OUTPUT---
  8. animal age visits priority
  9. a cat 2.5 1 yes
  10. b cat 3.0 3 yes
  11. c snake 0.5 2 no
  12. d dog NaN 3 yes
  13. e dog 5.0 2 no
  14. f cat 2.0 3 no
  15. g snake 4.5 1 no
  16. h cat NaN 1 yes
  17. i dog 7.0 2 no
  18. j dog 3.0 1 no
  19. df['age']
  20. # ---OUTPUT
  21. a 2.5
  22. b 3.0
  23. c 0.5
  24. d NaN
  25. e 5.0
  26. f 2.0
  27. g 4.5
  28. h NaN
  29. i 7.0
  30. j 3.0
  31. Name: age, dtype: float64
  32. df['age'] > 3
  33. # ---OUTPUT---
  34. a False
  35. b False
  36. c False
  37. d False
  38. e True
  39. f False
  40. g True
  41. h False
  42. i True
  43. j False
  44. Name: age, dtype: bool
  45. df[df['age'] > 3]
  46. # ---OUTPUT---
  47. animal age visits priority
  48. e dog 5.0 2 no
  49. g snake 4.5 1 no
  50. i dog 7.0 2 no
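Several conditions can be combined with `&` (and) and `|` (or); each comparison must be parenthesized because `&`/`|` bind tighter than `>`/`==`. A sketch using the same data:

```python
import numpy as np
import pandas as pd

data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
        'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
        'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
        'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}
df = pd.DataFrame(data, index=list('abcdefghij'))

# dogs older than 3: rows e (5.0) and i (7.0); NaN comparisons are False
sel = df[(df['animal'] == 'dog') & (df['age'] > 3)]
print(sel.index.tolist())  # ['e', 'i']
```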