Cleaning data: deleting specified data, handling missing data, etc.


1. Data preview: tail() and head()

    import numpy as np
    import pandas as pd

    # 5 rows x 4 columns of random values
    df_obj = pd.DataFrame(np.random.randn(5, 4), columns=['a', 'b', 'c', 'd'])
    print(df_obj.tail())  # preview the last rows (5 by default)
    print(df_obj.head())  # preview the first rows (5 by default)

Output:

              a         b         c         d
    0 -0.507788  0.213237  0.003150 -0.777312
    1 -0.896653 -2.188016 -0.114848  0.167057
    2 -1.131242 -0.142287 -1.027330  1.861814
    3  0.369608  0.823453  1.030830 -0.041778
    4 -0.647625  0.056791 -0.394078 -1.347718
              a         b         c         d
    0 -0.507788  0.213237  0.003150 -0.777312
    1 -0.896653 -2.188016 -0.114848  0.167057
    2 -1.131242 -0.142287 -1.027330  1.861814
    3  0.369608  0.823453  1.030830 -0.041778
    4 -0.647625  0.056791 -0.394078 -1.347718
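
Because df_obj has exactly five rows, tail() and head() print the same table here. The following minimal sketch uses a larger frame so the difference becomes visible; df_demo and its contents are hypothetical and not part of the original example.

```python
import numpy as np
import pandas as pd

# Hypothetical 10-row frame, larger than the default preview size of 5
df_demo = pd.DataFrame(np.arange(40).reshape(10, 4), columns=['a', 'b', 'c', 'd'])

first_three = df_demo.head(3)    # first 3 rows
last_two = df_demo.tail(2)       # last 2 rows
default_head = df_demo.head()    # no argument: first 5 rows

print(first_three)
print(last_two)
print(len(default_head))
```

Passing an explicit row count is useful when the default of five rows is too many or too few for a quick look.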

2. Data description: shape and info()

    # shape is a (rows, columns) tuple
    print('The dataset has %i rows and %i columns' % (df_obj.shape[0], df_obj.shape[1]))

Output:

    The dataset has 5 rows and 4 columns

    # info() prints its report directly and returns None, so it needs no print() wrapper
    df_obj.info()

Output:

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 5 entries, 0 to 4
    Data columns (total 4 columns):
    a    5 non-null float64
    b    5 non-null float64
    c    5 non-null float64
    d    5 non-null float64
    dtypes: float64(4)
    memory usage: 288.0 bytes

3. Summary statistics: describe()

    print(df_obj.describe())

Output:

                  a         b         c         d
    count  5.000000  5.000000  5.000000  5.000000
    mean  -0.562740 -0.247365 -0.100455 -0.027587
    std    0.573191  1.143294  0.747673  1.215808
    min   -1.131242 -2.188016 -1.027330 -1.347718
    25%   -0.896653 -0.142287 -0.394078 -0.777312
    50%   -0.647625  0.056791 -0.114848 -0.041778
    75%   -0.507788  0.213237  0.003150  0.167057
    max    0.369608  0.823453  1.030830  1.861814
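
By default describe() summarizes only numeric columns and reports the 25%, 50% and 75% quantiles. As a hedged sketch, the percentiles and include parameters can change both; the frame df_demo and its column names are assumptions made up for illustration.

```python
import pandas as pd

# Hypothetical frame with one numeric and one text column
df_demo = pd.DataFrame({'score': [1.0, 2.0, 3.0, 4.0, 5.0],
                        'lang': ['Python', 'Java', 'Python', 'C++', 'Python']})

# Request the 10% and 90% quantiles instead of the default quartiles
custom_stats = df_demo.describe(percentiles=[0.1, 0.9])
print(custom_stats)

# include='all' also summarizes non-numeric columns (unique, top, freq)
all_stats = df_demo.describe(include='all')
print(all_stats)
```

For the text column, top is the most frequent value and freq is how often it occurs.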

4. Showing all rows and columns when pandas truncates the display

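    pd.set_option('display.max_rows', 100)       # maximum number of rows shown (avoids showing only part of the rows)
    pd.set_option('display.max_columns', 1000)   # maximum number of columns shown (avoids hidden columns)
    pd.set_option('display.max_colwidth', 1000)  # maximum width of each column (avoids truncated values or column names)
    pd.set_option('display.width', 1000)         # width of each output line (avoids line wrapping)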
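
pd.set_option changes the setting for the rest of the session. When the wider display is needed for a single print only, pd.option_context is one alternative. The sketch below assumes a hypothetical 50-column frame named wide.

```python
import numpy as np
import pandas as pd

# Hypothetical wide frame that would normally be truncated when printed
wide = pd.DataFrame(np.zeros((3, 50)))

before = pd.get_option('display.max_columns')

# Options set here apply only inside the with block; None means "no limit"
with pd.option_context('display.max_columns', None, 'display.width', 1000):
    inside = pd.get_option('display.max_columns')
    print(wide)

# On leaving the block the previous value is restored automatically
after = pd.get_option('display.max_columns')
print(before, inside, after)
```

To undo a global pd.set_option call, pd.reset_option('display.max_rows') restores that option to its default.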

5. Deleting specified rows and columns

    import pandas as pd
    import numpy as np

    # Scalar values ('A', 'B', 'F') are broadcast to the length of the array-like columns (4 rows)
    dict_data = {'A': 1.,
                 'B': pd.Timestamp('20161217'),
                 'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                 'D': np.array([3] * 4, dtype='int32'),
                 'E': pd.Categorical(["Python", "Java", "C++", "C#"]),
                 'F': 'ChinaHadoop'}
    df_obj2 = pd.DataFrame(dict_data)
    print(df_obj2)

Output:

         A          B    C  D       E            F
    0  1.0 2016-12-17  1.0  3  Python  ChinaHadoop
    1  1.0 2016-12-17  1.0  3    Java  ChinaHadoop
    2  1.0 2016-12-17  1.0  3     C++  ChinaHadoop
    3  1.0 2016-12-17  1.0  3      C#  ChinaHadoop

del

Delete a column

    del df_obj2['A']  # removes column 'A' from df_obj2 in place
    print(df_obj2.head())

Output:

               B    C  D       E            F
    0 2016-12-17  1.0  3  Python  ChinaHadoop
    1 2016-12-17  1.0  3    Java  ChinaHadoop
    2 2016-12-17  1.0  3     C++  ChinaHadoop
    3 2016-12-17  1.0  3      C#  ChinaHadoop
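
Since del modifies the DataFrame in place, the removed column cannot be recovered afterwards. A minimal sketch of two common safeguards follows: copying the frame first, or using pop() to keep the removed column. The frame df_demo is hypothetical.

```python
import pandas as pd

# Hypothetical frame for illustration
df_demo = pd.DataFrame({'A': [1.0, 1.0], 'B': ['x', 'y'], 'C': [3, 3]})

backup = df_demo.copy()     # untouched copy, taken before deleting anything
del df_demo['A']            # removes column 'A' in place and returns nothing
popped = df_demo.pop('C')   # removes column 'C' in place and returns it as a Series

print(df_demo.columns.tolist())
print(popped.tolist())
print(backup.columns.tolist())
```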

drop

Delete rows or columns

    dict_data = {'A': 1.,
                 'B': pd.Timestamp('20161217'),
                 'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                 'D': np.array([3] * 4, dtype='int32'),
                 'E': pd.Categorical(["Python", "Java", "C++", "C#"]),
                 'F': 'ChinaHadoop'}
    # The Series in 'C' is indexed 0-3, which does not match the string index
    # given below, so column C comes out as NaN
    df_obj3 = pd.DataFrame(dict_data, index=['sfd', 'sdfd', 'wer', 'rwer'])
    print(df_obj3.head(7))
    print(df_obj3.drop('wer'))        # delete a row; returns a new DataFrame
    print(df_obj3.drop('F', axis=1))  # delete a column; returns a new DataFrame

Output:

            A          B   C  D       E            F
    sfd   1.0 2016-12-17 NaN  3  Python  ChinaHadoop
    sdfd  1.0 2016-12-17 NaN  3    Java  ChinaHadoop
    wer   1.0 2016-12-17 NaN  3     C++  ChinaHadoop
    rwer  1.0 2016-12-17 NaN  3      C#  ChinaHadoop
            A          B   C  D       E            F
    sfd   1.0 2016-12-17 NaN  3  Python  ChinaHadoop
    sdfd  1.0 2016-12-17 NaN  3    Java  ChinaHadoop
    rwer  1.0 2016-12-17 NaN  3      C#  ChinaHadoop
            A          B   C  D       E
    sfd   1.0 2016-12-17 NaN  3  Python
    sdfd  1.0 2016-12-17 NaN  3    Java
    wer   1.0 2016-12-17 NaN  3     C++
    rwer  1.0 2016-12-17 NaN  3      C#
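
Unlike del, drop() leaves the original DataFrame untouched and returns a new one. The sketch below shows the index= and columns= keywords, which some readers find clearer than axis=0/1, along with dropping several labels at once. The frame df_demo and the label 'missing' are hypothetical.

```python
import pandas as pd

# Hypothetical frame with string row labels
df_demo = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]},
                       index=['x', 'y', 'z'])

rows_dropped = df_demo.drop(index=['x', 'z'])    # same as drop(['x', 'z'], axis=0)
cols_dropped = df_demo.drop(columns=['B', 'C'])  # same as drop(['B', 'C'], axis=1)

# errors='ignore' skips labels that do not exist instead of raising KeyError
safe = df_demo.drop(columns=['missing'], errors='ignore')

print(rows_dropped)
print(cols_dropped)
print(df_demo.shape)  # the original frame is unchanged
```

To keep the result, assign it back, for example df_demo = df_demo.drop(columns=['B']).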

6. Handling missing data

    # First row is random; the other rows contain NaN (missing) values
    df_data = pd.DataFrame([np.random.randn(3), [1., np.nan, np.nan],
                            [4., np.nan, np.nan], [1., np.nan, 2.]])
    df_data.head()

Output:

              0         1         2
    0 -0.702713 -0.991383 -1.058464
    1  1.000000       NaN       NaN
    2  4.000000       NaN       NaN
    3  1.000000       NaN  2.000000

Check for missing values

    df_data.isnull()  # True wherever a value is missing

Output:

           0      1      2
    0  False  False  False
    1  False   True   True
    2  False   True   True
    3  False   True  False
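
The boolean mask from isnull() is usually reduced to counts before deciding how to clean. A minimal sketch follows, using a hypothetical frame df_demo with fixed values so that the results are reproducible (the original first row is random).

```python
import numpy as np
import pandas as pd

# Hypothetical frame with the same NaN pattern as df_data, but fixed values
df_demo = pd.DataFrame([[0.5, 0.1, 0.2],
                        [1.0, np.nan, np.nan],
                        [4.0, np.nan, np.nan],
                        [1.0, np.nan, 2.0]])

mask = df_demo.isnull()

missing_per_column = mask.sum()        # number of NaN values in each column
rows_with_missing = mask.any(axis=1)   # True for rows with at least one NaN
total_missing = int(mask.sum().sum())  # NaN count for the whole frame

print(missing_per_column)
print(rows_with_missing)
print(total_missing)
```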

Drop missing data

    # axis=0 drops rows that contain NaN; axis=1 drops columns that contain NaN
    print(df_data.dropna(axis=0))

Output:

              0         1         2
    0 -0.702713 -0.991383 -1.058464
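
Dropping every row with any NaN keeps only one row here, which is often too aggressive. As a hedged sketch, the how, thresh and subset parameters of dropna() give finer control. The frame df_demo is hypothetical and uses fixed values for reproducibility.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with the same NaN pattern as df_data, but fixed values
df_demo = pd.DataFrame([[0.5, 0.1, 0.2],
                        [1.0, np.nan, np.nan],
                        [4.0, np.nan, np.nan],
                        [1.0, np.nan, 2.0]])

rows_complete = df_demo.dropna(axis=0)     # keep rows without any NaN
cols_complete = df_demo.dropna(axis=1)     # keep columns without any NaN
not_all_nan = df_demo.dropna(how='all')    # drop only rows that are entirely NaN
at_least_two = df_demo.dropna(thresh=2)    # keep rows with at least 2 non-NaN values
col2_present = df_demo.dropna(subset=[2])  # only consider column 2 when dropping rows

print(rows_complete)
print(cols_complete)
print(at_least_two)
```

Like drop(), dropna() returns a new DataFrame and leaves df_demo unchanged.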

Fill missing data

    df_data.fillna(-100.)  # returns a new DataFrame with every NaN replaced by -100.0

Output:

              0           1           2
    0 -0.702713   -0.991383   -1.058464
    1  1.000000 -100.000000 -100.000000
    2  4.000000 -100.000000 -100.000000
    3  1.000000 -100.000000    2.000000
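
A single sentinel such as -100 is rarely the right value for every column. The following minimal sketch shows two alternatives: a different fill value per column, and filling with each column's mean. The frame df_demo and the fill values are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with the same NaN pattern as df_data, but fixed values
df_demo = pd.DataFrame([[0.5, 0.1, 0.2],
                        [1.0, np.nan, np.nan],
                        [4.0, np.nan, np.nan],
                        [1.0, np.nan, 2.0]])

# A dict maps column label -> fill value for that column
per_column = df_demo.fillna({1: 0.0, 2: -1.0})

# df_demo.mean() is a Series of column means (NaN values are skipped),
# so each column is filled with its own mean
with_mean = df_demo.fillna(df_demo.mean())

print(per_column)
print(with_mean)
```

fillna() also returns a new DataFrame, so df_demo still contains its NaN values afterwards.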