Beginner essentials: learn pandas in 10 minutes

1. Import packages

Import NumPy and pandas under their conventional aliases:

  import numpy as np
  import pandas as pd

2. Create a Series (a one-dimensional vector)

Pass a list of values to pd.Series and pandas creates a default integer index:

  import numpy as np
  import pandas as pd
  s = pd.Series([1, 3, 5, np.nan, 6, 8])
  print(s)

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

  import numpy as np
  import pandas as pd
  s = pd.Series([1, 3, 5, np.nan, 6, 8])
  print(s[0], s[2])  # look up single values by index label

1.0 5.0
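The float64 dtype in the output above comes from the missing value: np.nan is a float, so the integers are stored as floats too. The sketch below illustrates this, and also shows a Series with explicit labels; the labels 'a', 'b', 'c' are made up for this example and are not part of the original tutorial.

```python
import numpy as np
import pandas as pd

# np.nan is a float, so the whole Series is stored as float64.
s = pd.Series([1, 3, 5, np.nan, 6, 8])

# Hypothetical labels, chosen only for this illustration.
labeled = pd.Series([1, 3, 5], index=['a', 'b', 'c'])

print(s.dtype)          # float64
print(list(s.index))    # [0, 1, 2, 3, 4, 5]
print(s.isna().sum())   # 1
print(labeled['b'])     # 3
```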

3. Create a DataFrame

Create a DataFrame from a NumPy array, with a datetime index and labeled columns. First build the dates:

  import numpy as np
  import pandas as pd
  dates = pd.date_range('20130101', periods=6)  # date_range creates a run of consecutive dates
  print(dates)

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

  import numpy as np
  import pandas as pd
  dates = pd.date_range('20130101', periods=6)
  print(dates[2])  # the third date in the range

2013-01-03 00:00:00
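As a small sanity check on what date_range returns, the sketch below confirms that each element is a pd.Timestamp and that the default frequency is one calendar day. It only uses the same call as the tutorial.

```python
import pandas as pd

dates = pd.date_range('20130101', periods=6)

print(len(dates))                               # 6
print(dates[2] == pd.Timestamp('2013-01-03'))   # True
print(dates[0].date(), dates[-1].date())        # 2013-01-01 2013-01-06
```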

Creating a DataFrame, approach 1: from a NumPy array

  import numpy as np
  import pandas as pd
  dates = pd.date_range('20130101', periods=6)
  # index supplies the row labels, columns supplies the column labels
  df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
  print(df)

                   A         B         C         D
2013-01-01  0.652211  0.351307 -0.047034  0.233530
2013-01-02 -0.997980  0.292630 -1.312628  0.222933
2013-01-03 -0.733029  0.878319  0.789346  0.547204
2013-01-04 -2.428614 -1.820740 -0.884209  0.454380
2013-01-05  0.500799  0.293417  0.384024  1.294105
2013-01-06  0.257280 -0.692720  0.376611 -1.718606

Creating a DataFrame, approach 2: from a dictionary

  import numpy as np
  import pandas as pd
  # dates = pd.date_range('20130101', periods=6)
  # df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
  # build the DataFrame from a dictionary: each key becomes a column
  df = pd.DataFrame({
      'A': 1,
      'B': pd.Timestamp('20130102'),
      'C': pd.Series(1, index=list(range(4)), dtype='float32'),
      'D': np.array([3]*4, dtype='int32'),
      'E': pd.Categorical(['test', 'train', 'test', 'train']),
      'F': 'foo'
  })
  print(df)

   A          B    C  D      E    F
0  1 2013-01-02  1.0  3   test  foo
1  1 2013-01-02  1.0  3  train  foo
2  1 2013-01-02  1.0  3   test  foo
3  1 2013-01-02  1.0  3  train  foo

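The output shows that the row labels are integers (here they come from the index of the Series in column C), while the dictionary keys become the column labels.
Check the data type of each column with df.dtypes:

A             int64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object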
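The dictionary form mixes scalars with array-like values: scalars are repeated for every row, and each column keeps its own dtype. Here is a reduced sketch of the same frame (columns B and E are left out only to keep it short).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': 1,                                                    # scalar, repeated for every row
    'C': pd.Series(1, index=list(range(4)), dtype='float32'),  # sets the row labels 0..3
    'D': np.array([3] * 4, dtype='int32'),
    'F': 'foo',                                                # scalar string, also repeated
})

print(df.shape)            # (4, 4)
print(df['A'].tolist())    # [1, 1, 1, 1]
print(df.dtypes['C'])      # float32
print(df.dtypes['D'])      # int32
```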

4. View data

df.head()

  import numpy as np
  import pandas as pd
  # dates = pd.date_range('20130101', periods=6)
  # df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
  df = pd.DataFrame({
      'A': 1,
      'B': pd.Timestamp('20130102'),
      'C': pd.Series(1, index=list(range(4)), dtype='float32'),
      'D': np.array([3]*4, dtype='int32'),
      'E': pd.Categorical(['test', 'train', 'test', 'train']),
      'F': 'foo'
  })
  print(df.head())  # show the first rows (5 by default; this frame only has 4)

   A          B    C  D      E    F
0  1 2013-01-02  1.0  3   test  foo
1  1 2013-01-02  1.0  3  train  foo
2  1 2013-01-02  1.0  3   test  foo
3  1 2013-01-02  1.0  3  train  foo

df.tail()

  print(df.tail(2))  # show the last two rows

   A          B    C  D      E    F
2  1 2013-01-02  1.0  3   test  foo
3  1 2013-01-02  1.0  3  train  foo

df.index

  print(df.index)  # the row labels of the DataFrame

Int64Index([0, 1, 2, 3], dtype='int64')

df.columns

  print(df.columns)  # the column labels of the DataFrame

Index(['A', 'B', 'C', 'D', 'E', 'F'], dtype='object')

df.to_numpy()

Note: converting a DataFrame with df.to_numpy() can be expensive. A NumPy array has a single dtype for the whole array, whereas every column of a DataFrame can have its own dtype. When the column dtypes differ, pandas has to find one dtype that can hold every value, and that often ends up being the most general one: object.

  import numpy as np
  import pandas as pd
  # dates = pd.date_range('20130101', periods=6)
  # df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
  df = pd.DataFrame({
      'A': 1,
      'B': pd.Timestamp('20130102'),
      'C': pd.Series(1, index=list(range(4)), dtype='float32'),
      'D': np.array([3]*4, dtype='int32'),
      'E': pd.Categorical(['test', 'train', 'test', 'train']),
      'F': 'foo'
  })
  nf = df.to_numpy()  # the NumPy array carries no row or column labels
  print(nf)

[[1 Timestamp('2013-01-02 00:00:00') 1.0 3 'test' 'foo']
 [1 Timestamp('2013-01-02 00:00:00') 1.0 3 'train' 'foo']
 [1 Timestamp('2013-01-02 00:00:00') 1.0 3 'test' 'foo']
 [1 Timestamp('2013-01-02 00:00:00') 1.0 3 'train' 'foo']]
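To make the dtype point concrete, the sketch below compares a frame whose columns are all floats with one that mixes floats and strings. The column names are invented for this example.

```python
import pandas as pd

# All columns share one dtype, so the array keeps it.
numeric = pd.DataFrame({'x': [1.0, 2.0], 'y': [3.0, 4.0]})

# Floats and strings have no common numeric dtype, so the array falls back to object.
mixed = pd.DataFrame({'x': [1.0, 2.0], 'label': ['a', 'b']})

print(numeric.to_numpy().dtype)   # float64
print(mixed.to_numpy().dtype)     # object
print(numeric.to_numpy().shape)   # (2, 2)
```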

describe()

describe() shows summary statistics for the numeric columns:

  import numpy as np
  import pandas as pd
  # dates = pd.date_range('20130101', periods=6)
  # df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
  df = pd.DataFrame({
      'A': 1,
      'B': pd.Timestamp('20130102'),
      'C': pd.Series(1, index=list(range(4)), dtype='float32'),
      'D': np.array([3]*4, dtype='int32'),
      'E': pd.Categorical(['test', 'train', 'test', 'train']),
      'F': 'foo'
  })
  print(df.describe())

         A    C    D
count  4.0  4.0  4.0
mean   1.0  1.0  3.0
std    0.0  0.0  0.0
min    1.0  1.0  3.0
25%    1.0  1.0  3.0
50%    1.0  1.0  3.0
75%    1.0  1.0  3.0
max    1.0  1.0  3.0
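Because every value in the frame above is constant, the statistics are not very telling. Here is a small sketch with made-up data (the column names score and group are assumptions for illustration) that shows describe() skipping the string column and computing real statistics for the numeric one.

```python
import pandas as pd

df = pd.DataFrame({
    'score': [1.0, 2.0, 3.0, 4.0],
    'group': ['a', 'b', 'a', 'b'],   # non-numeric, ignored by the default describe()
})

summary = df.describe()

print(list(summary.columns))           # ['score']
print(summary.loc['count', 'score'])   # 4.0
print(summary.loc['mean', 'score'])    # 2.5
print(summary.loc['50%', 'score'])     # 2.5
```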

df.T

df.T transposes the DataFrame, swapping rows and columns:

  import numpy as np
  import pandas as pd
  # dates = pd.date_range('20130101', periods=6)
  # df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
  df = pd.DataFrame({
      'A': 1,
      'B': pd.Timestamp('20130102'),
      'C': pd.Series(1, index=list(range(4)), dtype='float32'),
      'D': np.array([3]*4, dtype='int32'),
      'E': pd.Categorical(['test', 'train', 'test', 'train']),
      'F': 'foo'
  })
  print(df.T)

                     0  ...                    3
A                    1  ...                    1
B  2013-01-02 00:00:00  ...  2013-01-02 00:00:00
C                    1  ...                    1
D                    3  ...                    3
E                 test  ...                train
F                  foo  ...                  foo

[6 rows x 4 columns]
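A tiny frame makes the effect of the transpose easier to follow: the column labels become the row labels and the shape is reversed. The data below is invented for this sketch.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
t = df.T

print(df.shape, t.shape)    # (3, 2) (2, 3)
print(list(t.index))        # ['A', 'B']
print(list(t.columns))      # [0, 1, 2]
print(t.loc['B', 2])        # 6
```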

sort_index()

  import numpy as np
  import pandas as pd
  # dates = pd.date_range('20130101', periods=6)
  # df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
  df = pd.DataFrame({
      'A': 1,
      'B': pd.Timestamp('20130102'),
      'C': pd.Series(1, index=list(range(4)), dtype='float32'),
      'D': np.array([3]*4, dtype='int32'),
      'E': pd.Categorical(['test', 'train', 'test', 'train']),
      'F': 'foo'
  })
  print(df)
  # sort the columns (axis=1) by label, in descending order
  print(df.sort_index(axis=1, ascending=False))

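sort_index() orders a frame by its labels, not by its values. The sketch below uses a deliberately shuffled frame (invented for this example) to show both directions: axis=1 sorts the column labels, and the default axis=0 sorts the row labels.

```python
import pandas as pd

# Columns and rows are deliberately out of order.
df = pd.DataFrame({'B': [1, 2], 'A': [3, 4], 'C': [5, 6]}, index=[1, 0])

by_cols = df.sort_index(axis=1, ascending=False)  # column labels, descending
by_rows = df.sort_index()                         # row labels, ascending

print(list(by_cols.columns))   # ['C', 'B', 'A']
print(list(by_rows.index))     # [0, 1]
print(by_rows['B'].tolist())   # [2, 1]
```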

sort_values()

  import numpy as np
  import pandas as pd
  dates = pd.date_range('20130101', periods=6)
  df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
  df = pd.DataFrame({
      'A': 1,
      'B': pd.Timestamp('20130102'),
      'C': pd.Series(1, index=list(range(4)), dtype='float32'),
      'D': np.array(np.random.rand(4), dtype='float32'),
      'E': pd.Categorical(['test', 'train', 'test', 'train']),
      'F': 'foo'
  })
  print(df)
  # sort the rows by the values in column D, in ascending order
  print(df.sort_values(by='D'))

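Since column D above is random, the result changes on every run. A fixed, made-up frame gives a reproducible view of what sort_values() does: rows are reordered by the chosen column and keep their original index labels.

```python
import pandas as pd

df = pd.DataFrame({'name': ['a', 'b', 'c'], 'D': [0.7, 0.1, 0.4]})

ordered = df.sort_values(by='D')                      # ascending by default
reverse = df.sort_values(by='D', ascending=False)

print(ordered['name'].tolist())   # ['b', 'c', 'a']
print(ordered.index.tolist())     # [1, 2, 0]
print(reverse['name'].tolist())   # ['a', 'c', 'b']
```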

5. Select data

  import numpy as np
  import pandas as pd
  dates = pd.date_range('20130101', periods=6)
  df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
  df = pd.DataFrame({
      'A': 1,
      'B': pd.Timestamp('20130102'),
      'C': pd.Series(1, index=list(range(4)), dtype='float32'),
      'D': np.array(np.random.rand(4), dtype='float32'),
      'E': pd.Categorical(['test', 'train', 'test', 'train']),
      'F': 'foo'
  })
  print(df)
  print(df['A'])  # select a single column, returned as a Series


  import numpy as np
  import pandas as pd
  dates = pd.date_range('20130101', periods=6)
  df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
  df = pd.DataFrame({
      'A': 1,
      'B': pd.Timestamp('20130102'),
      'C': pd.Series(1, index=list(range(4)), dtype='float32'),
      'D': np.array(np.random.rand(4), dtype='float32'),
      'E': pd.Categorical(['test', 'train', 'test', 'train']),
      'F': 'foo'
  })
  print(df)
  print(df[0:2])  # slice the first two rows


loc (label-based selection)

The first row, selected by its index label:

  import numpy as np
  import pandas as pd
  dates = pd.date_range('20130101', periods=6)
  df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
  # df = pd.DataFrame({
  #     'A': 1,
  #     'B': pd.Timestamp('20130102'),
  #     'C': pd.Series(1, index=list(range(4)), dtype='float32'),
  #     'D': np.array(np.random.rand(4), dtype='float32'),
  #     'E': pd.Categorical(['test', 'train', 'test', 'train']),
  #     'F': 'foo'
  # })
  print(df)
  print(df.loc[dates[0]])  # the row labeled 2013-01-01, returned as a Series


Selected columns, all rows:

  import numpy as np
  import pandas as pd
  dates = pd.date_range('20130101', periods=6)
  df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
  print(df)
  print(df.loc[:, ['A', 'B']])  # every row, columns A and B only

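One detail worth seeing with loc is that label slices include both endpoints. The sketch below swaps the random values for np.arange(24) (an assumption made only so the results are predictable) and combines a row slice with a column list.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
# Deterministic values 0..23 instead of random numbers, so the output is predictable.
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list('ABCD'))

# Label slices include both endpoints: three rows are returned.
subset = df.loc['20130102':'20130104', ['A', 'B']]

print(subset.shape)             # (3, 2)
print(subset['A'].tolist())     # [4, 8, 12]
print(df.loc[dates[0], 'A'])    # 0
```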

iloc (position-based selection)

  import numpy as np
  import pandas as pd
  dates = pd.date_range('20130101', periods=6)
  df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
  print(df)
  print(df.iloc[3])  # the fourth row, by integer position


  import numpy as np
  import pandas as pd
  dates = pd.date_range('20130101', periods=6)
  df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
  print(df)
  print(df.iloc[2:4, 0:2])  # rows 2-3 and columns 0-1; the end of each slice is excluded


  import numpy as np
  import pandas as pd
  dates = pd.date_range('20130101', periods=6)
  df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
  print(df)
  print(df.iloc[[2, 4], [0, 2]])  # rows 2 and 4, columns 0 and 2

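The three iloc forms are easier to verify on predictable data. This sketch again fills the frame with np.arange(24) (an assumption for illustration, not the tutorial's random data) and shows that position slices exclude their end, unlike label slices.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list('ABCD'))

row = df.iloc[3]                  # a single row as a Series
block = df.iloc[2:4, 0:2]         # position slices exclude the end
picked = df.iloc[[2, 4], [0, 2]]  # explicit lists of positions

print(row.tolist())                 # [12, 13, 14, 15]
print(block.to_numpy().tolist())    # [[8, 9], [12, 13]]
print(picked.to_numpy().tolist())   # [[8, 10], [16, 18]]
```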

iat (fast access to a single value by position)

  import numpy as np
  import pandas as pd
  dates = pd.date_range('20130101', periods=6)
  df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
  print(df)
  print(df.iat[1, 1])  # the value at row position 1, column position 1

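iat has a label-based counterpart, at. The sketch below, using the same predictable np.arange(24) data as an assumption, shows that both reach the same cell and agree with iloc.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list('ABCD'))

by_position = df.iat[1, 1]        # row position 1, column position 1
by_label = df.at[dates[1], 'B']   # the same cell, addressed by labels

print(by_position)                        # 5
print(by_label)                           # 5
print(by_position == df.iloc[1, 1])       # True
```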

Boolean indexing: find the data that satisfies a condition

  import numpy as np
  import pandas as pd
  dates = pd.date_range('20130101', periods=6)
  df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
  print(df)
  print(df[df['A'] > 0])  # keep only the rows where column A is positive


  import numpy as np
  import pandas as pd
  dates = pd.date_range('20130101', periods=6)
  df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
  print(df)
  print(df > 0)      # a DataFrame of True/False values
  print(df[df > 0])  # values that fail the condition are replaced by NaN

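The two boolean forms behave differently: a condition on one column drops whole rows, while a condition on the whole frame keeps the shape and masks individual cells with NaN. A small fixed frame (invented for this sketch) makes the difference visible.

```python
import pandas as pd

df = pd.DataFrame({'A': [1.0, -2.0, 3.0], 'B': [-1.0, 5.0, -6.0]})

rows = df[df['A'] > 0]   # a boolean Series filters rows
masked = df[df > 0]      # a boolean DataFrame masks cells with NaN

print(rows.index.tolist())            # [0, 2]
print(masked.shape)                   # (3, 2)
print(masked.isna().sum().tolist())   # [1, 2]
```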

isin

  import numpy as np
  import pandas as pd
  dates = pd.date_range('20130101', periods=6)
  df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
  df2 = df.copy()
  df2['E'] = ['one', 'one', 'two', 'three', 'four', 'three']
  print(df2)
  # keep the rows whose E value is 'two' or 'three'
  print(df2[df2['E'].isin(['two', 'three'])])

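isin() returns a boolean Series, which is then used as a row filter like any other condition. Here is a minimal sketch with made-up values that prints the intermediate mask as well as the filtered result.

```python
import pandas as pd

df2 = pd.DataFrame({
    'value': [10, 20, 30, 40],
    'E': ['one', 'two', 'three', 'four'],
})

mask = df2['E'].isin(['two', 'three'])   # True where E is one of the listed values
selected = df2[mask]

print(mask.tolist())                # [False, True, True, False]
print(selected['value'].tolist())   # [20, 30]
```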

6. Modify DataFrame data

  import numpy as np
  import pandas as pd
  dates = pd.date_range('20130101', periods=6)
  df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
  s1 = pd.Series(np.arange(6)+1, index=pd.date_range('20130102', periods=6))
  print(df)
  df['F'] = s1                             # add column F; values are aligned by index, so 2013-01-01 gets NaN
  df.at[dates[0], 'A'] = 0                 # set one value by label
  df.iat[0, 1] = 0                         # set one value by position
  df.loc[:, 'D'] = np.array([5] * len(df)) # overwrite column D with a NumPy array
  print(df)

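The least obvious step above is the column assignment: a Series is matched to the frame by index label, not by position, so labels missing from the Series become NaN. The sketch below shrinks the example to three rows with fixed values (an assumption for readability) to show this, along with the label and position setters.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=3)
df = pd.DataFrame({'A': [1.0, 2.0, 3.0]}, index=dates)

# s1 starts one day later than df, so 2013-01-01 has no matching value.
s1 = pd.Series([10, 20, 30], index=pd.date_range('20130102', periods=3))
df['F'] = s1

df.at[dates[0], 'A'] = 0   # set by label
df.iat[1, 0] = -1          # set by position

print(df['F'].tolist())    # [nan, 10.0, 20.0]
print(df['A'].tolist())    # [0.0, -1.0, 3.0]
```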