• Creating data
  • Viewing data
  • Selecting data
    • Selecting data with [ ]
    • Selecting data by label
    • Selecting data by position
    • Boolean indexing
  • Data visualization

    Importing libraries

        import numpy as np
        import pandas as pd
        import matplotlib.pyplot as plt

    Creating data

    Create a Series object with pd.Series

        s = pd.Series([1, 3, 5, np.nan, 6, 8])
        s
        >>>
        0    1.0
        1    3.0
        2    5.0
        3    NaN
        4    6.0
        5    8.0
        dtype: float64
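
    The float64 dtype in the output above follows from the np.nan entry: a missing value cannot be stored in a plain integer Series, so pandas upcasts the whole Series to float. A minimal sketch of checking this, using only the Series from the example:

```python
import numpy as np
import pandas as pd

# np.nan forces the integer inputs to be stored as float64
s = pd.Series([1, 3, 5, np.nan, 6, 8])

print(s.dtype)              # float64
print(s.isna().sum())       # 1 missing value
print(s.dropna().tolist())  # [1.0, 3.0, 5.0, 6.0, 8.0]
```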

    Create a DataFrame object from a NumPy array

        dates = pd.date_range('20130101', periods=6)
        dates
        >>>
        DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
                       '2013-01-05', '2013-01-06'],
                      dtype='datetime64[ns]', freq='D')
        df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
        print(df)
        >>>
                           A         B         C         D
        2013-01-01  0.342275 -0.333060 -0.294502  1.808311
        2013-01-02 -0.010251 -0.322083 -0.992557 -0.960891
        2013-01-03 -0.344072 -1.185725  0.674009 -0.716058
        2013-01-04 -0.235446 -1.721794 -1.265767  0.242253
        2013-01-05  3.074955  1.848873  1.813445 -0.795627
        2013-01-06 -0.039975  1.090794 -0.605099 -1.111459
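
    Because np.random.randn draws fresh values on every run, the numbers printed in this article will not match your own output, and they also differ from one example to the next. If reproducible values are wanted, one option is a seeded generator. The sketch below assumes NumPy's Generator API (np.random.default_rng); the seed 42 is an arbitrary choice and is not part of the original example.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)

# A seeded generator yields the same "random" values on every run
rng = np.random.default_rng(42)  # 42 is an arbitrary, illustrative seed
df = pd.DataFrame(rng.standard_normal((6, 4)), index=dates, columns=list('ABCD'))

# Re-creating the generator with the same seed reproduces the same frame
rng_again = np.random.default_rng(42)
df_again = pd.DataFrame(rng_again.standard_normal((6, 4)), index=dates, columns=list('ABCD'))

print(df.equals(df_again))  # True
```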

    Create a DataFrame object from a dictionary

        df2 = pd.DataFrame({'A': 1.,
                            'B': pd.Timestamp('20130102'),
                            'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                            'D': np.array([3] * 4, dtype='int32'),
                            'E': pd.Categorical(["test", "train", "test", "train"]),
                            'F': 'foo'})
        print(df2)
        >>>
             A          B    C  D      E    F
        0  1.0 2013-01-02  1.0  3   test  foo
        1  1.0 2013-01-02  1.0  3  train  foo
        2  1.0 2013-01-02  1.0  3   test  foo
        3  1.0 2013-01-02  1.0  3  train  foo
        print(df2.dtypes)
        >>>
        A           float64
        B    datetime64[ns]
        C           float32
        D             int32
        E          category
        F            object
        dtype: object
        print(dir(df2))  # prints a list of the attribute and method names of df2
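
    In the dictionary form, scalar values such as 1. and 'foo' are repeated for every row, while the Series and array entries determine the number of rows. A reduced sketch of that behavior, keeping only three of the columns from the example above (select_dtypes is used here just to list the numeric columns):

```python
import pandas as pd

df2 = pd.DataFrame({
    'A': 1.,                                                   # scalar, repeated on every row
    'C': pd.Series(1, index=list(range(4)), dtype='float32'),  # fixes the 4-row index
    'F': 'foo',                                                # scalar string, repeated
})

print(df2.shape)                                             # (4, 3)
print(df2.select_dtypes(include='number').columns.tolist())  # ['A', 'C']
```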

    Viewing data

    These are basic methods that you should know well; for more ways to inspect data, see the official documentation.
    The following shows how to view the top and bottom rows of the data.

        dates = pd.date_range('20130101', periods=6)
        df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
        print(df.head(2))  # first two rows; head() returns 5 rows by default
        >>>
                           A         B         C         D
        2013-01-01  0.342275 -0.333060 -0.294502  1.808311
        2013-01-02 -0.010251 -0.322083 -0.992557 -0.960891
        print(df.tail(3))  # last three rows
        >>>
                           A         B         C         D
        2013-01-04 -0.235446 -1.721794 -1.265767  0.242253
        2013-01-05  3.074955  1.848873  1.813445 -0.795627
        2013-01-06 -0.039975  1.090794 -0.605099 -1.111459

    View the index, column names, and underlying data of a DataFrame

        print(df.index)  # the row index
        >>>
        DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
                       '2013-01-05', '2013-01-06'],
                      dtype='datetime64[ns]', freq='D')
        print(df.columns)  # the column names
        >>>
        Index(['A', 'B', 'C', 'D'], dtype='object')
        print(df.values)  # the data as a NumPy array
        >>>
        [[ 0.34227537 -0.33306022 -0.29450173  1.80831125]
         [-0.01025096 -0.3220833  -0.99255656 -0.96089093]
         [-0.34407203 -1.18572491  0.67400852 -0.71605802]
         [-0.2354458  -1.7217938  -1.26576668  0.24225255]
         [ 3.07495472  1.84887323  1.81344527 -0.79562727]
         [-0.0399747   1.0907938  -0.60509926 -1.11145858]]
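
    Besides the .values attribute shown above, pandas also provides DataFrame.to_numpy(), which returns the same kind of NumPy array. A small sketch on a hypothetical 3x2 frame (the frame itself is made up for illustration):

```python
import numpy as np
import pandas as pd

# Small illustrative frame with known values
df = pd.DataFrame(np.arange(6.0).reshape(3, 2), columns=['A', 'B'])

arr = df.to_numpy()
print(type(arr).__name__)              # ndarray
print(arr.shape)                       # (3, 2)
print(np.array_equal(arr, df.values))  # True
```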

    Descriptive statistics

        print(df.describe())
        >>>
                      A         B         C         D
        count  6.000000  6.000000  6.000000  6.000000
        mean   0.896009  0.507334  0.520300 -0.604691
        std    1.226597  0.871341  1.047377  1.338247
        min   -0.654441 -0.562629 -1.176366 -2.093140
        25%   -0.189769 -0.114563  0.228438 -1.450522
        50%    1.394002  0.541984  0.574028 -0.900380
        75%    1.781452  0.899909  0.927062  0.030616
        max    2.049583  1.836860  1.992128  1.558711
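
    Each row of the describe() table corresponds to an ordinary column statistic: the 50% row is the median, and std is the sample standard deviation. A short sketch on a hypothetical frame with hand-checkable values:

```python
import numpy as np
import pandas as pd

# Illustrative data chosen so the statistics are easy to verify by hand
df = pd.DataFrame({'A': [1.0, 2.0, 3.0, 4.0], 'B': [10.0, 20.0, 30.0, 40.0]})
stats = df.describe()

print(stats.loc['mean', 'A'])  # 2.5
print(stats.loc['50%', 'A'])   # 2.5, the median of column A
# The std row agrees with Series.std(), which uses the sample formula (ddof=1)
print(np.isclose(stats.loc['std', 'A'], df['A'].std()))  # True
```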

    Transposing the data

        print(df.T)
        >>>
           2013-01-01  2013-01-02  2013-01-03  2013-01-04  2013-01-05  2013-01-06
        A    0.106123   -0.102469    2.113397    0.231633   -0.483121   -0.433374
        B    0.313585   -0.490534    1.031808   -0.209964   -0.679466   -0.562606
        C   -0.030473    0.006813   -2.225570   -0.246044    0.066495   -0.371361
        D    0.416181   -1.344376   -1.508693   -1.496924   -0.236305    0.136705

    Sort by column name

        print(df.sort_index(axis=1, ascending=False))  # ascending defaults to True; pass False for descending order
        >>>
                           D         C         B         A
        2013-01-01 -1.743265 -2.436682 -0.126209 -0.909977
        2013-01-02 -1.237405 -0.479658  0.289598  0.386831
        2013-01-03  1.222879 -0.611954  0.466847 -0.964970
        2013-01-04  3.709656 -0.859897  0.626873  1.386637
        2013-01-05  0.372616 -0.668504 -0.824841 -0.577940
        2013-01-06 -0.172015  1.126194 -0.763443 -0.343243

    Sort by the values in column B

        print(df.sort_values(by='B'))
        >>>
                           A         B         C         D
        2013-01-06  1.693180 -1.613240 -0.146807 -1.550596
        2013-01-01  0.156765 -1.120396  0.447468 -1.014416
        2013-01-03  0.775879 -1.093397  0.227219  0.169062
        2013-01-05 -1.279614 -1.088507 -0.327012 -0.519709
        2013-01-02  0.936605 -0.299676 -1.370837 -0.425901
        2013-01-04  0.812881  0.881192 -0.350331 -2.037380
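
    Both sorting methods return a new object and leave the original DataFrame untouched unless the result is assigned back. A small sketch with a hypothetical three-row frame, so the resulting order can be checked by eye:

```python
import pandas as pd

# Illustrative frame; the labels x, y, z are made up for this sketch
df = pd.DataFrame({'A': [3, 1, 2], 'B': [0.5, -1.0, 2.0]}, index=['x', 'y', 'z'])

by_b_desc = df.sort_values(by='B', ascending=False)  # largest B first
cols_desc = df.sort_index(axis=1, ascending=False)   # columns in reverse name order

print(by_b_desc.index.tolist())    # ['z', 'x', 'y']
print(cols_desc.columns.tolist())  # ['B', 'A']
print(df.index.tolist())           # ['x', 'y', 'z'], the original is unchanged
```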

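    Selecting data

    The official documentation recommends the optimized pandas data access methods .at, .iat, .loc and .iloc. Older pandas versions also offered .ix, which was deprecated in pandas 0.20 and removed in pandas 1.0.
    Related article: "Learn the differences between iloc, loc and ix in pandas in 5 minutes"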

    Selecting data with [ ]

    Select a single column; this is equivalent to df.A:

        print(df['A'])  # same as df.A
        >>>
        2013-01-01    0.342275
        2013-01-02   -0.010251
        2013-01-03   -0.344072
        2013-01-04   -0.235446
        2013-01-05    3.074955
        2013-01-06   -0.039975
        Freq: D, Name: A, dtype: float64

    Select rows by slicing with [ ]

        print(df[0:3])  # first 3 rows; the start position is included, the end position is excluded
        >>>
                           A         B         C         D
        2013-01-01 -1.889500 -1.413149 -0.039584  0.031551
        2013-01-02  1.480268  0.108239 -0.005645  0.536260
        2013-01-03  1.385717  0.227386 -0.098316  1.056272
        print(df['20130102':'20130104'])  # label slice; both endpoints are included
        >>>
                           A         B         C         D
        2013-01-02 -0.175396 -0.608281  1.472997 -0.842902
        2013-01-03  1.073921  0.536321 -1.062791  0.778709
        2013-01-04  0.144927 -0.107287 -0.594705  0.644814
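
    The two slices above follow different endpoint rules: an integer slice excludes its end position, while a label slice on the index includes its end label. A minimal sketch, assuming a frame filled with np.arange instead of random numbers so the result is deterministic:

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list('ABCD'))

by_position = df[0:3]                 # positions 0, 1, 2; position 3 is excluded
by_label = df['20130102':'20130104']  # labels 2013-01-02 through 2013-01-04, inclusive

print(len(by_position))    # 3
print(len(by_label))       # 3
print(by_label.index[-1])  # 2013-01-04 00:00:00
```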

    Selecting data by label

        print(df.loc[dates[0]])  # select the first row by its label
        >>>
        A    0.342275
        B   -0.333060
        C   -0.294502
        D    1.808311
        Name: 2013-01-01 00:00:00, dtype: float64

        print(df.loc[:, ['A', 'B']])  # all rows, columns A and B
        >>>
                           A         B
        2013-01-01  1.105010 -0.320929
        2013-01-02 -1.204395 -0.570691
        2013-01-03 -0.786688  0.424701
        2013-01-04 -0.121843  0.127801
        2013-01-05 -0.035029  0.293037
        2013-01-06 -0.603599 -1.931956
        print(df.loc['20130102':'20130104', ['A', 'B']])  # label slice of rows (both ends included), columns A and B
        >>>
                           A         B
        2013-01-02  0.467955 -2.148467
        2013-01-03  0.887625  1.035388
        2013-01-04  0.055645 -0.018191
        print(df.loc['20130102', ['A', 'B']])  # one row, columns A and B
        >>>
        A   -1.138735
        B   -0.100542
        Name: 2013-01-02 00:00:00, dtype: float64
        print(df.loc[dates[0], 'A'])  # a single value
        >>>
        0.7391924458347098
        print(df.at[dates[0], 'A'])  # the same single value, using the scalar accessor .at
        >>>
        0.7391924458347098
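
    The shape of what .loc returns depends on the selectors: a single row label gives a Series, a slice plus a column list gives a DataFrame, and a row label plus a column label gives a scalar, which .at returns as well. A deterministic sketch, again assuming an np.arange-filled frame rather than random data:

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list('ABCD'))

row = df.loc[dates[0]]                           # Series holding the first row
sub = df.loc['20130102':'20130104', ['A', 'B']]  # 3x2 DataFrame
val_loc = df.loc[dates[0], 'A']                  # scalar via .loc
val_at = df.at[dates[0], 'A']                    # same scalar via .at

print(row.tolist())       # [0, 1, 2, 3]
print(sub.shape)          # (3, 2)
print(val_loc == val_at)  # True
```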

    Selecting data by position

        print(df.iloc[3])  # the row at position 3 (the 4th row)
        >>>
        A   -0.381260
        B    0.868501
        C   -1.668756
        D    0.839632
        Name: 2013-01-04 00:00:00, dtype: float64
        print(df.iloc[3:5, 0:2])  # start positions are included, end positions are excluded
        >>>
                           A         B
        2013-01-04  0.482456 -1.216927
        2013-01-05  1.008627  0.427897
        print(df.iloc[[1, 2, 4], [0, 2]])  # rows at positions 1, 2, 4 and columns at positions 0, 2
        >>>
                           A         C
        2013-01-02  1.112508  0.969343
        2013-01-03 -0.164053 -1.322557
        2013-01-05 -1.073691 -0.356547
        print(df.iloc[1:3])  # rows at positions 1 and 2 (position 3 is excluded)
        >>>
                           A         B         C         D
        2013-01-02  0.800642 -0.504769  0.519685  1.978916
        2013-01-03  0.137714  0.540270  0.374199 -1.224552
        print(df.iloc[:, 1:3])  # all rows, columns at positions 1 and 2 (position 3 is excluded)
        >>>
                           B         C
        2013-01-01 -0.333060 -0.294502
        2013-01-02 -0.322083 -0.992557
        2013-01-03 -1.185725  0.674009
        2013-01-04 -1.721794 -1.265767
        2013-01-05  1.848873  1.813445
        2013-01-06  1.090794 -0.605099
        print(df.iloc[1, 1])  # the single value at row 1, column 1
        >>>
        1.265304932997343
        print(df.iat[1, 1])  # the same cell via .iat; the two printed values differ only because they come from separate runs with different random data
        >>>
        -0.4083246435680777
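
    On a single DataFrame, .iloc[1, 1] and .iat[1, 1] always return the same cell, and that cell is also reachable by label through .loc. A deterministic sketch, assuming an np.arange-filled frame so the values are known in advance:

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list('ABCD'))

print(df.iloc[3:5, 0:2].shape)  # (2, 2): rows 3 and 4, columns 0 and 1
print(df.iloc[1, 1])            # 5
print(df.iat[1, 1])             # 5
# Position (1, 1) is the cell labelled (dates[1], 'B')
print(df.iloc[1, 1] == df.loc[dates[1], 'B'])  # True
```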

    Boolean indexing

        print(df[df.A > 0])  # rows where column A is greater than 0
        >>>
                           A         B         C         D
        2013-01-01  0.342275 -0.333060 -0.294502  1.808311
        2013-01-05  3.074955  1.848873  1.813445 -0.795627
        print(df[df > 0])  # keep values greater than 0; everything else becomes NaN
        >>>
                           A         B         C         D
        2013-01-01       NaN  0.362070       NaN  2.037941
        2013-01-02  1.500178       NaN  0.168748  0.620532
        2013-01-03       NaN       NaN  0.922889       NaN
        2013-01-04  1.698255  0.025913       NaN       NaN
        2013-01-05  0.152086  0.348645  0.237132  0.868915
        2013-01-06       NaN  0.050824  0.075995       NaN
        df2 = df.copy()  # copy df
        df2['E'] = ['one', 'one', 'two', 'three', 'four', 'three']  # add a new column E to df2
        print(df2)
        >>>
                           A         B         C         D      E
        2013-01-01 -0.897222 -1.714759  0.358384 -1.475133    one
        2013-01-02 -1.707707 -0.444518 -2.838489 -2.436182    one
        2013-01-03 -0.955428  0.005758 -0.264125 -0.045104    two
        2013-01-04  1.037277  0.255815  0.180912 -0.311802  three
        2013-01-05  1.631085  3.236270 -0.039909 -0.280554   four
        2013-01-06  1.758670  1.209860  0.948103  0.129601  three
        print(df2[df2['E'].isin(['two', 'four'])])  # rows of df2 whose E value is 'two' or 'four'
        >>>
                           A         B         C         D     E
        2013-01-03 -0.269128 -0.624533 -1.616405 -0.678576   two
        2013-01-05  0.203549 -0.853705 -0.523561  1.429644  four
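
    Boolean masks can also be combined. The sketch below goes slightly beyond the examples above: it uses the element-wise operators & (and), | (or) and ~ (not) on a hypothetical four-row frame. Each comparison needs its own parentheses because these operators bind more tightly than > and <.

```python
import pandas as pd

# Illustrative frame with known values
df = pd.DataFrame({
    'A': [1, -2, 3, -4],
    'B': [-1, 2, 3, -4],
    'E': ['one', 'two', 'three', 'four'],
})

both_positive = df[(df['A'] > 0) & (df['B'] > 0)]    # A and B both positive
either_positive = df[(df['A'] > 0) | (df['B'] > 0)]  # at least one positive
not_listed = df[~df['E'].isin(['two', 'four'])]      # E is neither 'two' nor 'four'

print(both_positive.index.tolist())    # [2]
print(either_positive.index.tolist())  # [0, 1, 2]
print(not_listed['E'].tolist())        # ['one', 'three']
```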

    Data visualization

        ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))  # create a time series of 1000 random values
        print(ts.head())
        ts = ts.cumsum()  # cumulative sum