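Review: We have finished the first step of data analysis, loading the data. Once the data is in front of us, the first thing to do is get to know it. Today we learn what each field means and take a first look at the data.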

1 Chapter 1: Data Loading and Initial Exploration

1.4 Know what your data is called

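We are learning the basic operations of pandas. So what is the data type of the data we loaded with pandas in the previous section? Before starting, import numpy and pandas:

    import numpy as np
    import pandas as pd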

1.4.1 Task 1: pandas has two data types, DataFrame and Series. Look them up to get a basic understanding, then write a small example of each yourself [open-ended]

Write your code

    s = pd.Series(data=[1, 2, 3], index=pd.Index([0, 1, 2]), name="myseries")
    s

    0    1
    1    2
    2    3
    Name: myseries, dtype: int64

    data = [[1, 'a', 1.2], [2, 'b', 2.2], [3, 'c', 3.2]]
    df = pd.DataFrame(data=data, index=['row_%d' % i for i in range(3)], columns=['col_0', 'col_1', 'col_2'])
    df

| | col_0 | col_1 | col_2 |
| --- | --- | --- | --- |
| row_0 | 1 | a | 1.2 |
| row_1 | 2 | b | 2.2 |
| row_2 | 3 | c | 3.2 |

    '''
    # Our example
    sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
    example_1 = pd.Series(sdata)
    example_1
    '''

    '''
    # Our example
    data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
            'year': [2000, 2001, 2002, 2001, 2002, 2003],
            'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
    example_2 = pd.DataFrame(data)
    example_2
    '''
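The relationship between the two types shows up when a column is pulled out of a DataFrame. The following is a minimal sketch; the variable names and the three-row toy data are made up for illustration.

```python
import pandas as pd

# Toy data, made up for illustration only.
df_demo = pd.DataFrame(
    {"state": ["Ohio", "Texas", "Utah"], "pop": [1.5, 2.4, 0.9]}
)

# Selecting one column gives a Series that shares the DataFrame's index.
col = df_demo["pop"]
print(type(df_demo).__name__)  # DataFrame
print(type(col).__name__)      # Series
print(col.name)                # pop

# Selecting with a list of column names gives a DataFrame again.
sub = df_demo[["pop"]]
print(type(sub).__name__)      # DataFrame
```

In other words, a DataFrame can be read as a collection of Series that share one index.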

1.4.2 Task 2: Load the "train.csv" file using the method from the previous lesson

Write your code

    df = pd.read_csv("train.csv")
    df.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |

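You can also load the "train_chinese.csv" file saved in the previous lesson. Once the translated version, train_chinese.csv, has made you familiar with the dataset, we carry out the operations on train.csv.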
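For readers who do not have the file at hand, the loading step can be tried on an in-memory stand-in. This is only a sketch: `csv_text` and `sample` are hypothetical names, and the two rows are a small subset of the columns and rows shown in the output above.

```python
import io
import pandas as pd

# A two-row stand-in for train.csv (subset of the columns shown above).
csv_text = """PassengerId,Survived,Pclass,Sex,Age
1,0,3,male,22.0
2,1,1,female,38.0
"""

# read_csv accepts a file path or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)    # (number of rows, number of columns)
print(sample.dtypes)   # inferred type of each column
print(sample.head(1))  # first row only
```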

1.4.3 Task 3: View the name of each column of the DataFrame

Write your code

    df.columns

    Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
           'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
          dtype='object')

1.4.4 Task 4: View all entries in the "Cabin" column [there are several ways]

Write your code

    df.Cabin

    0       NaN
    1       C85
    2       NaN
    3      C123
    4       NaN
           ...
    886     NaN
    887     B42
    888     NaN
    889    C148
    890     NaN
    Name: Cabin, Length: 891, dtype: object

Write your code

    df["Cabin"]

    0       NaN
    1       C85
    2       NaN
    3      C123
    4       NaN
           ...
    886     NaN
    887     B42
    888     NaN
    889    C148
    890     NaN
    Name: Cabin, Length: 891, dtype: object

1.4.5 Task 5: Load the file "test_1.csv", compare it with "train.csv" to see which columns are extra, then delete the extra columns

Inspection shows that the test set test_1.csv has one redundant column, which we need to delete.
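One way to find extra columns without scanning them by eye is to compare the two column indexes. The sketch below uses two tiny stand-in frames, since the files themselves are not reproduced here; `train_demo` and `test_demo` are hypothetical names.

```python
import pandas as pd

# Stand-ins for train.csv and test_1.csv (hypothetical, tiny frames).
train_demo = pd.DataFrame({"PassengerId": [1, 2], "Survived": [0, 1]})
test_demo = pd.DataFrame(
    {"PassengerId": [1, 2], "Survived": [0, 1], "a": [100, 100]}
)

# Index.difference returns the labels present on the left but not on the right.
extra = test_demo.columns.difference(train_demo.columns)
print(list(extra))  # ['a']
```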

Write your code

    test_1 = pd.read_csv("test_1.csv")
    del test_1["a"]
    test_1

| | Unnamed: 0 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 886 | 886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S |
| 887 | 887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
| 888 | 888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
| 889 | 889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |
| 890 | 890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.7500 | NaN | Q |

891 rows × 13 columns

[Think] Is there any other way to delete the extra column?

Answer

Write your code

    test_1 = pd.read_csv("test_1.csv")
    test_1.drop(columns='a')

| | Unnamed: 0 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 886 | 886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S |
| 887 | 887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
| 888 | 888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
| 889 | 889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |
| 890 | 890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.7500 | NaN | Q |

891 rows × 13 columns
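The common ways of removing a column can be compared side by side. This is a sketch on a tiny stand-in frame rather than on test_1.csv itself; `make_frame` is a hypothetical helper and the values in column "a" are made up.

```python
import pandas as pd

def make_frame():
    # Small stand-in for test_1.csv; column "a" plays the extra column.
    return pd.DataFrame({"PassengerId": [1, 2], "a": [100, 100]})

# 1. del removes the column from the object itself.
t1 = make_frame()
del t1["a"]

# 2. drop returns a new DataFrame and leaves the original untouched.
t2 = make_frame()
t2_dropped = t2.drop(columns="a")

# 3. pop removes the column in place and returns it as a Series.
t3 = make_frame()
popped = t3.pop("a")

print(list(t1.columns))          # column "a" is gone
print(list(t2.columns))          # column "a" is still there
print(list(t2_dropped.columns))  # column "a" is gone in the copy
print(list(t3.columns))          # column "a" is gone
```

Because `drop` returns a new object, its result has to be assigned to a name if the shortened frame is needed later.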

1.4.6 Task 6: Hide the columns ['PassengerId', 'Name', 'Age', 'Ticket'] and look only at the remaining columns

Write your code

    df.drop(['PassengerId', 'Name', 'Age', 'Ticket'], axis=1).head(3)

| | Survived | Pclass | Sex | SibSp | Parch | Fare | Cabin | Embarked |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 3 | male | 1 | 0 | 7.2500 | NaN | S |
| 1 | 1 | 1 | female | 1 | 0 | 71.2833 | C85 | C |
| 2 | 1 | 3 | female | 0 | 0 | 7.9250 | NaN | S |

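[Think] Compare Task 5 and Task 6: did they use different methods (functions)? If you used the same function for both, how would you meet the two different requirements? Use the drop function.
[Answer] If you want to remove the columns from the original DataFrame permanently, pass inplace=True. Because inplace=True overwrites the original data, it is not used here, where the columns are only meant to be hidden.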
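The difference between hiding and deleting can be checked on a small frame. This is a minimal sketch; `df_demo` and `view_only` are hypothetical names and the two-row data is a cut-down stand-in for the Titanic table.

```python
import pandas as pd

df_demo = pd.DataFrame(
    {"Name": ["A", "B"], "Age": [22.0, 38.0], "Fare": [7.25, 71.2833]}
)

# Hiding: without inplace, drop returns a new frame and df_demo is unchanged.
view_only = df_demo.drop(["Name", "Age"], axis=1)
columns_after_hiding = list(df_demo.columns)
print(columns_after_hiding)   # all three columns are still present

# Deleting: with inplace=True, df_demo itself loses the columns.
df_demo.drop(["Name", "Age"], axis=1, inplace=True)
print(list(df_demo.columns))  # only Fare remains
```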

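1.5 The logic of filtering

One of the most important capabilities when working with tabular data is filtering: selecting the information you need and discarding what is useless.
Below we again learn this pandas feature through hands-on practice.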

1.5.1 Task 1: Using "Age" as the filter condition, display the information of passengers younger than 10.

Write your code

    df[df["Age"] < 10].head(3)

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.075 | NaN | S |
| 10 | 11 | 1 | 3 | Sandstrom, Miss. Marguerite Rut | female | 4.0 | 1 | 1 | PP 9549 | 16.700 | G6 | S |
| 16 | 17 | 0 | 3 | Rice, Master. Eugene | male | 2.0 | 4 | 1 | 382652 | 29.125 | NaN | Q |
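What happens inside the square brackets can be made visible by building the condition separately. The sketch below uses four passengers taken from the tables in this section, one of them with a missing age; `df_demo`, `mask` and `children` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Four passengers from the tables in this section; the last has no recorded age.
df_demo = pd.DataFrame(
    {"PassengerId": [1, 8, 11, 889], "Age": [22.0, 2.0, 4.0, np.nan]}
)

# The comparison yields a boolean Series aligned with df_demo's index.
mask = df_demo["Age"] < 10
print(mask.tolist())  # [False, True, True, False]

# Indexing with the mask keeps the rows where it is True.
# A missing Age compares as False, so that passenger is left out.
children = df_demo[mask]
print(children["PassengerId"].tolist())  # [8, 11]
```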

1.5.2 Task 2: Using "Age" as the condition, display the passengers older than 10 and younger than 50, and name this data midage

Write your code

    midage = df.query("Age<50 & Age>10")
    midage.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |

[Hint] Learn how pandas does conditional filtering and how to use intersection and union operations
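Intersection and union of conditions can be sketched as follows. The five ages are made up for illustration, and `df_demo`, `both`, `both_q` and `either` are hypothetical names.

```python
import pandas as pd

# Made-up ages for illustration.
df_demo = pd.DataFrame({"Age": [2.0, 22.0, 38.0, 54.0, 10.0]})

# Intersection (and): each condition needs its own parentheses,
# because & binds more tightly than the comparison operators.
both = df_demo[(df_demo["Age"] > 10) & (df_demo["Age"] < 50)]

# The same selection written with query.
both_q = df_demo.query("Age > 10 & Age < 50")

# Union (or): rows that satisfy at least one condition.
either = df_demo[(df_demo["Age"] <= 10) | (df_demo["Age"] >= 50)]

print(both["Age"].tolist())    # [22.0, 38.0]
print(either["Age"].tolist())  # [2.0, 54.0, 10.0]
```

Note that Python's `and` and `or` keywords do not work element-wise on a Series; the operators `&` and `|` are the ones to use in boolean indexing.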

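1.5.3 Task 3: Display the "Pclass" and "Sex" data of row 100 of midage

Write your code

    midage.iloc[99][['Pclass', 'Sex']]

    Pclass       2
    Sex       male
    Name: 148, dtype: object

    midage = midage.reset_index(drop=True)
    midage.loc[[100], ['Pclass', 'Sex']]

[Think] What does the reset_index() function do? If you do not use it, what happens in the tasks below?
The rows would be mismatched: midage keeps the index labels it inherited from df, so a label-based lookup with loc would not correspond to the row positions within midage.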
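The effect is easy to reproduce on a small frame. The sketch below is an illustration with made-up ages; `df_demo`, `mid` and `mid_reset` are hypothetical names.

```python
import pandas as pd

# Made-up ages for illustration.
df_demo = pd.DataFrame({"Age": [22.0, 2.0, 38.0, 54.0, 26.0]})

# Filtering keeps the original index labels of the surviving rows.
mid = df_demo[(df_demo["Age"] > 10) & (df_demo["Age"] < 50)]
print(mid.index.tolist())        # [0, 2, 4]

# iloc is positional: this is the second row of mid.
print(mid.iloc[1]["Age"])        # 38.0
# loc is label-based: label 1 was filtered out, so mid.loc[1] would raise KeyError.

# After reset_index(drop=True), labels and positions coincide again.
mid_reset = mid.reset_index(drop=True)
print(mid_reset.index.tolist())  # [0, 1, 2]
print(mid_reset.loc[1, "Age"])   # 38.0
```

Passing `drop=True` discards the old labels; without it, they are kept as a new column named `index`.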

1.5.4 Task 4: Display the "Pclass", "Name" and "Sex" data of rows 100, 105 and 108 of midage

Write your code

    midage.loc[[100, 105, 108], ['Pclass', 'Name', 'Sex']]

| | Pclass | Name | Sex |
| --- | --- | --- | --- |
| 100 | 2 | Byles, Rev. Thomas Roussel Davids | male |
| 105 | 3 | Cribb, Mr. John Hatfield | male |
| 108 | 3 | Calic, Mr. Jovo | male |

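[Hint] Use the simple approach that pandas provides; take a look at the loc method.
Comparing with the row positions in the full dataset, do you notice a problem? How can it be solved?
Adjust the index.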

1.5.5 Task 5: Use the iloc method to display the "Pclass", "Name" and "Sex" data of rows 100, 105 and 108 of midage

Write your code

    midage.iloc[[100, 105, 108], [2, 3, 4]]

| | Pclass | Name | Sex |
| --- | --- | --- | --- |
| 100 | 2 | Byles, Rev. Thomas Roussel Davids | male |
| 105 | 3 | Cribb, Mr. John Hatfield | male |
| 108 | 3 | Calic, Mr. Jovo | male |
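Hard-coded column positions such as 2, 3 and 4 break silently if the column order changes. One possible alternative, sketched here on three rows taken from the earlier output, is to look the positions up by name first; `df_demo`, `col_pos`, `by_position` and `by_label` are hypothetical names.

```python
import pandas as pd

# Three passengers from the train.csv output shown earlier.
df_demo = pd.DataFrame(
    {
        "PassengerId": [1, 3, 4],
        "Survived": [0, 1, 1],
        "Pclass": [3, 3, 1],
        "Name": [
            "Braund, Mr. Owen Harris",
            "Heikkinen, Miss. Laina",
            "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
        ],
        "Sex": ["male", "female", "female"],
    }
)

# Translate column names into integer positions for iloc.
col_pos = df_demo.columns.get_indexer(["Pclass", "Name", "Sex"])
print(col_pos.tolist())  # [2, 3, 4]

# With a default 0..n-1 index, iloc and loc select the same cells.
by_position = df_demo.iloc[[0, 2], col_pos]
by_label = df_demo.loc[[0, 2], ["Pclass", "Name", "Sex"]]
print(by_position.equals(by_label))  # True
```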