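Review: we have finished the first step of data analysis, loading the data. Once the data is in front of us, the first thing to do is get to know it. Today we learn what each field means and take a first look at the data.
1 Chapter 1: Loading Data and First Observations
1.4 Know What Your Data Is Called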
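We are learning the basic operations of pandas. So what is the data type of the data we loaded with pandas in the previous section? Before starting, import numpy and pandas:

    import numpy as np
    import pandas as pd

1.4.1 Task 1: pandas has two data types, DataFrame and Series. Look them up to get a basic understanding, then write a small example of your own for each type 🌰 [open-ended]
Write your code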
    s = pd.Series(data=[1, 2, 3], index=pd.Index([0, 1, 2]), name="myseries")
    s

    0    1
    1    2
    2    3
    Name: myseries, dtype: int64

    data = [[1, 'a', 1.2], [2, 'b', 2.2], [3, 'c', 3.2]]
    df = pd.DataFrame(data=data, index=['row_%d' % i for i in range(3)], columns=['col_0', 'col_1', 'col_2'])
    df

|  | col_0 | col_1 | col_2 |
| --- | --- | --- | --- |
| row_0 | 1 | a | 1.2 |
| row_1 | 2 | b | 2.2 |
| row_2 | 3 | c | 3.2 |
Our examples:

    # Our example: a Series built from a dict
    sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
    example_1 = pd.Series(sdata)
    example_1

    # Our example: a DataFrame built from a dict of lists
    data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
            'year': [2000, 2001, 2002, 2001, 2002, 2003],
            'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
    example_2 = pd.DataFrame(data)
    example_2
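To come back to the question that opened this section, here is a minimal sketch of how the two types relate. It uses a small made-up table rather than the Titanic data, and the names `toy`, `pop_col` and `first_row` are placeholders. A table is a `DataFrame`, and a single column or a single row taken from it is a `Series`.

```python
import pandas as pd

# Hypothetical toy data; not part of the course dataset.
toy = pd.DataFrame({"state": ["Ohio", "Nevada"], "pop": [1.5, 2.4]})

pop_col = toy["pop"]       # selecting one column gives a Series
first_row = toy.iloc[0]    # selecting one row also gives a Series

print(type(toy).__name__)      # DataFrame
print(type(pop_col).__name__)  # Series
print(pop_col.name)            # pop
print(list(first_row.index))   # ['state', 'pop']
```

Bracket access is used for the `pop` column because `toy.pop` would refer to the DataFrame method of the same name.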
1.4.2 Task 2: Load the "train.csv" file using the method from the previous section
Write your code
    df = pd.read_csv("train.csv")
    df.head()

|  | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
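You can also load the "train_chinese.csv" file saved in the previous section. Having become familiar with the dataset through the translated train_chinese.csv, we now work with train.csv.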
1.4.3 Task 3: View the column names of the DataFrame
Write your code

    df.columns

    Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
           'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
          dtype='object')

1.4.4 Task 4: View all entries in the "Cabin" column [there are several ways]
Write your code
    df.Cabin

    0       NaN
    1       C85
    2       NaN
    3      C123
    4       NaN
           ...
    886     NaN
    887     B42
    888     NaN
    889    C148
    890     NaN
    Name: Cabin, Length: 891, dtype: object

    # Write your code
    df["Cabin"]

    0       NaN
    1       C85
    2       NaN
    3      C123
    4       NaN
           ...
    886     NaN
    887     B42
    888     NaN
    889    C148
    890     NaN
    Name: Cabin, Length: 891, dtype: object

1.4.5 Task 5: Load the file "test_1.csv", compare it with "train.csv" to see which columns are extra, then delete the extra columns
Looking at the data, we find that the test set test_1.csv has one redundant column, which we need to delete.
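One way to spot a surplus column without reading the headers by eye is to compare the two column indexes. This is a minimal sketch, assuming two small made-up frames (`train_like` and `test_like`) in place of the real files. `Index.difference` returns the labels that appear in the first index but not in the second.

```python
import pandas as pd

# Hypothetical stand-ins for train.csv and test_1.csv.
train_like = pd.DataFrame({"PassengerId": [1, 2], "Survived": [0, 1]})
test_like = pd.DataFrame({"PassengerId": [1, 2], "Survived": [0, 1], "a": [100, 100]})

# Column labels present in test_like but absent from train_like.
extra_cols = test_like.columns.difference(train_like.columns)
print(list(extra_cols))  # ['a']
```

On real files the result can contain more than one label, so it is worth inspecting before deleting anything.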
Write your code

    test_1 = pd.read_csv("test_1.csv")
    del test_1["a"]
    test_1

|  | Unnamed: 0 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 886 | 886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S |
| 887 | 887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
| 888 | 888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
| 889 | 889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |
| 890 | 890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.7500 | NaN | Q |

891 rows × 13 columns

[Think] Is there another way to delete the extra column?
Answer
Write your code
    test_1 = pd.read_csv("test_1.csv")
    test_1.drop(columns='a')

|  | Unnamed: 0 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 886 | 886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S |
| 887 | 887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
| 888 | 888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
| 889 | 889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |
| 890 | 890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.7500 | NaN | Q |

891 rows × 13 columns
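As a hedged side-by-side of the options for removing a column, the sketch below applies `del`, `drop(columns=...)` and `pop` to a made-up frame. The helper `make_frame` and the variables `f1`, `f2`, `f3` are illustrative names only.

```python
import pandas as pd

def make_frame():
    # Hypothetical frame with one surplus column named "a".
    return pd.DataFrame({"PassengerId": [1, 2], "a": [100, 100]})

f1 = make_frame()
del f1["a"]                          # removes the column from f1 itself

f2 = make_frame().drop(columns="a")  # returns a new frame without "a"

f3 = make_frame()
removed = f3.pop("a")                # removes from f3 and returns the column

print(list(f1.columns), list(f2.columns), list(f3.columns))
print(removed.tolist())              # [100, 100]
```

`del` and `pop` change the frame they are applied to, while `drop` leaves its input untouched unless the result is assigned back.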
1.4.6 Task 6: Hide the columns ['PassengerId', 'Name', 'Age', 'Ticket'] and look only at the remaining columns
Write your code
    df.drop(['PassengerId', 'Name', 'Age', 'Ticket'], axis=1).head(3)

|  | Survived | Pclass | Sex | SibSp | Parch | Fare | Cabin | Embarked |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 3 | male | 1 | 0 | 7.2500 | NaN | S |
| 1 | 1 | 1 | female | 1 | 0 | 71.2833 | C85 | C |
| 2 | 1 | 3 | female | 0 | 0 | 7.9250 | NaN | S |
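[Think] Compare Task 5 and Task 6: did they use different methods (functions)? If the same function were used for both, how would you meet the two different requirements?
Both can be done with the drop function.
[Answer]
To remove the columns from the DataFrame permanently, pass inplace=True. Because inplace=True overwrites the original data, it is not used here, where the columns only need to be hidden from view.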
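The difference between hiding and deleting can be sketched on a made-up frame (`people` and `view` are placeholder names). Without `inplace=True`, `drop` returns a new object and the original keeps all its columns. With `inplace=True`, the original is modified and the call returns `None`.

```python
import pandas as pd

# Hypothetical toy data standing in for the passenger table.
people = pd.DataFrame({"Name": ["A", "B"], "Age": [22.0, 38.0], "Fare": [7.25, 71.28]})

view = people.drop(["Name", "Age"], axis=1)   # new object; people is untouched
print(list(view.columns))     # ['Fare']
print(list(people.columns))   # ['Name', 'Age', 'Fare']

result = people.drop(columns=["Name"], inplace=True)  # modifies people
print(result)                 # None
print(list(people.columns))   # ['Age', 'Fare']
```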
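1.5 The Logic of Filtering
One of the most important capabilities of tabular data is filtering: selecting the information you need and discarding what you do not.
Below we again learn this pandas feature through hands-on practice.
1.5.1 Task 1: Using "Age" as the filter condition, display the passengers younger than 10
Write your code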
    df[df["Age"] < 10].head(3)

|  | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.075 | NaN | S |
| 10 | 11 | 1 | 3 | Sandstrom, Miss. Marguerite Rut | female | 4.0 | 1 | 1 | PP 9549 | 16.700 | G6 | S |
| 16 | 17 | 0 | 3 | Rice, Master. Eugene | male | 2.0 | 4 | 1 | 382652 | 29.125 | NaN | Q |
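The filter above works in two steps, which a small sketch on made-up data can make visible (`ages`, `mask` and `children` are illustrative names). The comparison produces a boolean Series, and indexing with that Series keeps only the rows marked `True`. A missing age compares as `False`, so such rows are dropped by the filter.

```python
import pandas as pd

# Hypothetical toy data; None becomes NaN in the float column.
ages = pd.DataFrame({"Name": ["A", "B", "C", "D"], "Age": [2.0, 35.0, None, 4.0]})

mask = ages["Age"] < 10         # boolean Series aligned with the rows
print(mask.tolist())            # [True, False, False, True]

children = ages[mask]           # keeps rows where the mask is True
print(children.index.tolist())  # [0, 3]
```

The kept rows retain their original row labels, which is why the filtered output can start at a label other than 0.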
1.5.2 Task 2: Using "Age" as the condition, display the passengers older than 10 and younger than 50, and name this data midage
Write your code
    midage = df.query("Age<50 & Age>10")
    midage.head()

|  | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
[Hint] Learn how pandas filters by condition and how to combine conditions with intersection (and) and union (or).
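A minimal sketch of intersection and union on a made-up `ages` frame follows. With boolean indexing, `&` means "and" and `|` means "or", and each comparison needs its own parentheses. `query` can express the same intersection as a string.

```python
import pandas as pd

# Hypothetical toy data.
ages = pd.DataFrame({"Age": [5.0, 22.0, 38.0, 60.0]})

# Intersection: both conditions must hold. Parentheses are required
# because & binds more tightly than the comparison operators.
both = ages[(ages["Age"] > 10) & (ages["Age"] < 50)]

# Union: either condition may hold.
either = ages[(ages["Age"] <= 10) | (ages["Age"] >= 50)]

# query() expresses the same intersection as a string.
same_as_both = ages.query("Age > 10 and Age < 50")

print(both["Age"].tolist())       # [22.0, 38.0]
print(either["Age"].tolist())     # [5.0, 60.0]
print(both.equals(same_as_both))  # True
```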
1.5.3 Task 3: Display the "Pclass" and "Sex" values of row 100 of midage
Write your code
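    # Row at position 99 (the 100th row), selected before the index is reset
    midage.iloc[99][['Pclass', 'Sex']]
    # Renumber the rows from 0, discarding the old labels
    midage = midage.reset_index(drop=True)
    # Row labelled 100 after renumbering
    midage.loc[[100], ['Pclass', 'Sex']]

    Pclass       2
    Sex       male
    Name: 148, dtype: object

[Think] What does the reset_index() function do? What would happen in the following tasks if it were not used?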
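The rows would get mixed up: the filtered data keeps its original row labels, so a label such as 100 would no longer refer to the row at that position.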
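The effect can be seen on a made-up five-row frame (`ages`, `kept` and `renumbered` are illustrative names). After filtering, the surviving rows keep their old labels, so the labels have gaps. `reset_index(drop=True)` renumbers them from 0 and discards the old labels.

```python
import pandas as pd

# Hypothetical toy data.
ages = pd.DataFrame({"Age": [5.0, 22.0, 38.0, 60.0, 30.0]})

kept = ages[ages["Age"] > 10]
kept = kept[kept["Age"] < 50]
print(kept.index.tolist())        # [1, 2, 4]  original labels, with gaps

renumbered = kept.reset_index(drop=True)
print(renumbered.index.tolist())  # [0, 1, 2]

# The same label points at different rows before and after the reset.
print(kept.loc[2, "Age"])         # 38.0
print(renumbered.loc[2, "Age"])   # 30.0
```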
1.5.4 Task 4: Display the "Pclass", "Name" and "Sex" values of rows 100, 105 and 108 of midage
Write your code
    midage.loc[[100, 105, 108], ['Pclass', 'Name', 'Sex']]

|  | Pclass | Name | Sex |
| --- | --- | --- | --- |
| 100 | 2 | Byles, Rev. Thomas Roussel Davids | male |
| 105 | 3 | Cribb, Mr. John Hatfield | male |
| 108 | 3 | Calic, Mr. Jovo | male |
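[Hint] pandas provides a simple way to do this; take a look at the loc method.
Compare these rows with their positions in the full dataset. Do you notice a problem? How would you solve it?
Reset the index.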
1.5.5 Task 5: Use the iloc method to display the "Pclass", "Name" and "Sex" values of rows 100, 105 and 108 of midage
Write your code
    midage.iloc[[100, 105, 108], [2, 3, 4]]

|  | Pclass | Name | Sex |
| --- | --- | --- | --- |
| 100 | 2 | Byles, Rev. Thomas Roussel Davids | male |
| 105 | 3 | Cribb, Mr. John Hatfield | male |
| 108 | 3 | Calic, Mr. Jovo | male |
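The hard-coded positions `[2, 3, 4]` depend on the column order. As a hedged sketch on a made-up frame `toy`, `columns.get_indexer` can translate column names into integer positions, so that `loc` (labels) and `iloc` (positions) return the same selection.

```python
import pandas as pd

# Hypothetical toy data with non-default row labels.
toy = pd.DataFrame(
    {"Pclass": [3, 1, 3], "Name": ["A", "B", "C"], "Sex": ["male", "female", "female"]},
    index=[10, 20, 30],
)

by_label = toy.loc[[10, 30], ["Pclass", "Sex"]]       # row labels, column names
col_pos = toy.columns.get_indexer(["Pclass", "Sex"])  # names -> integer positions
by_position = toy.iloc[[0, 2], col_pos]               # row and column positions

print(col_pos.tolist())              # [0, 2]
print(by_label.equals(by_position))  # True
```

In the task above the two approaches line up only because the index was reset, which makes row labels and row positions coincide.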