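In this article, you will learn about Pandas, a popular data analysis library for Python.
We recommend working with some real data of your own as you follow the tutorial.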

Getting Started

To use pandas, you typically import it at the top of your code with the following statement:

    import pandas as pd

Creating Data

Pandas has two core data types: DataFrame and Series.

DataFrame

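A DataFrame is a table. It contains an array of individual entries, each of which holds a value and corresponds to one row and one column. For example, the following statements build a simple DataFrame: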

    import pandas as pd
    dfInt = pd.DataFrame({'Yes': [50, 21], 'No': [131, 2]})
    print(dfInt)

Output:

       Yes   No
    0   50  131
    1   21    2

In this example, the "0, No" entry has the value 131, the "0, Yes" entry has the value 50, and so on.
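
As a minimal sketch of how such entries might be read back out, the snippet below uses the `.loc` accessor, which this article has not introduced yet. The variable names `no_0` and `yes_0` are illustrative only.

```python
import pandas as pd

# Same data as the example above
dfInt = pd.DataFrame({'Yes': [50, 21], 'No': [131, 2]})

# .loc[row_label, column_label] returns a single entry
no_0 = dfInt.loc[0, 'No']
yes_0 = dfInt.loc[0, 'Yes']
print(no_0)   # 131
print(yes_0)  # 50
```
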
DataFrame entries are not limited to integers. For example, here is a DataFrame whose values are strings:

    import pandas as pd
    dfStr = pd.DataFrame({'Bob': ['I liked it.', 'It was awful.'], 'Sue': ['Pretty good.', 'Bland.']})
    print(dfStr)

Output:

                 Bob           Sue
    0    I liked it.  Pretty good.
    1  It was awful.        Bland.

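We use the pd.DataFrame() constructor to generate these DataFrame objects. We pass it a dictionary whose keys are the column names and whose values are lists of entries. This is the standard and most common way to construct a new DataFrame.
The dictionary-of-lists constructor assigns the column labels, but for the row labels it simply uses an ascending count starting from 0 (0, 1, 2, 3, ...). That is often good enough, but we frequently want to assign these labels ourselves. The list of row labels used in a DataFrame is called the index, and we can assign it through the index parameter of the constructor.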

    import pandas as pd
    dfIndex = pd.DataFrame({'Bob': ['I liked it.', 'It was awful.'],
                            'Sue': ['Pretty good.', 'Bland.']},
                           index=['Product A', 'Product B'])
    print(dfIndex)

Output:

                         Bob           Sue
    Product A    I liked it.  Pretty good.
    Product B  It was awful.        Bland.
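
To see where these labels end up, one can inspect the `index` and `columns` attributes of the DataFrame. The sketch below rebuilds the same DataFrame. The names `row_labels` and `column_labels` are illustrative, and the `.loc` lookup at the end goes slightly beyond what the article covers.

```python
import pandas as pd

dfIndex = pd.DataFrame({'Bob': ['I liked it.', 'It was awful.'],
                        'Sue': ['Pretty good.', 'Bland.']},
                       index=['Product A', 'Product B'])

# Row labels live in .index, column labels in .columns
row_labels = list(dfIndex.index)
column_labels = list(dfIndex.columns)
print(row_labels)     # ['Product A', 'Product B']
print(column_labels)  # ['Bob', 'Sue']

# With a custom index, an entry can be looked up by its row label
print(dfIndex.loc['Product B', 'Sue'])  # Bland.
```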

Series

A Series, structurally, is a sequence of data values. If a DataFrame is a table, a Series is a list. In fact, you can create one from nothing more than a list:

    import pandas as pd
    s = pd.Series([1, 2, 3, 4, 5])
    print(s)

Output:

    0    1
    1    2
    2    3
    3    4
    4    5
    dtype: int64

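In essence, a Series is a single column of a DataFrame. You can therefore assign row labels to a Series the same way as before, using the index parameter. The name parameter gives the Series one overall name.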

    import pandas as pd
    s = pd.Series([30, 35, 40], index=['2015 Sales', '2016 Sales', '2017 Sales'], name='Product A')
    print(s)

Output:

    2015 Sales    30
    2016 Sales    35
    2017 Sales    40
    Name: Product A, dtype: int64
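
The relationship between the two types can be checked directly. The following sketch assumes a small `sales` DataFrame (a hypothetical name) holding the same numbers. It shows that selecting one column yields a Series, and that `to_frame()` turns a Series back into a one-column DataFrame.

```python
import pandas as pd

sales = pd.DataFrame({'Product A': [30, 35, 40]},
                     index=['2015 Sales', '2016 Sales', '2017 Sales'])

# Selecting one column of a DataFrame gives back a Series
col = sales['Product A']
print(type(col).__name__)  # Series
print(col.name)            # Product A

# Going the other way, to_frame() wraps a Series in a one-column DataFrame
s = pd.Series([30, 35, 40],
              index=['2015 Sales', '2016 Sales', '2017 Sales'],
              name='Product A')
frame = s.to_frame()
print(frame.shape)  # (3, 1)
```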

Reading Data Files

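Creating a DataFrame or Series by hand is easy, but most of the time we are not working with simple hand-made data. Far more often we work with real data that already exists.
Data can be stored in many different forms and formats. The most basic and most common is the CSV file. When you open a CSV file, you see data laid out like this: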

    Product A,Product B,Product C,
    30,21,9,
    35,34,1,
    41,11,11

A CSV file is a table of values separated by commas, hence the name "Comma-Separated Values", or CSV for short.
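
As a rough illustration of how such text maps onto a DataFrame, the sketch below parses a small CSV string in memory with `io.StringIO` instead of reading a file from disk. The sample text is modelled on the listing above with the trailing commas left out, which is an assumption made to keep the column count unambiguous.

```python
import io
import pandas as pd

# Sample CSV text modelled on the listing above (trailing commas omitted by assumption)
csv_text = """Product A,Product B,Product C
30,21,9
35,34,1
41,11,11
"""

# read_csv accepts a file-like object as well as a file path
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)          # (3, 3)
print(list(df.columns))  # ['Product A', 'Product B', 'Product C']
```
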
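Now let's set the toy dataset aside and see what a real dataset looks like once it is read into a DataFrame. We use the pd.read_csv() function to read the data into a DataFrame, like this: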

    wine_reviews = pd.read_csv("../data/winemag-data-130k-v2.csv")
    # Reads winemag-data-130k-v2.csv from the data directory one level above the current directory

You can check the size of the DataFrame with the shape attribute:

    print(wine_reviews.shape)  # (129971, 14)
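
The `shape` attribute is a `(rows, columns)` tuple, and the total number of entries is the product of the two. Here is a small sketch using a stand-in DataFrame with made-up values, since the wine file itself may not be at hand.

```python
import pandas as pd

# Small stand-in DataFrame (hypothetical values, not the wine data)
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

n_rows, n_cols = df.shape
print(n_rows, n_cols)  # 3 2
print(df.size)         # 6, i.e. rows * columns

# The same arithmetic applied to the shape reported above
total_entries = 129971 * 14
print(total_entries)   # 1819594
```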

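The new DataFrame contains 129,971 records across 14 columns, which comes to roughly 1.8 million entries in total. We can inspect its contents with the head() method, which returns only the first five rows.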

    import pandas as pd
    wine_reviews = pd.read_csv("../data/winemag-data-130k-v2.csv")
    print(wine_reviews.shape)  # (129971, 14)
    pd.set_option('display.max_columns', None)  # show all columns
    pd.set_option('display.max_rows', None)  # show all rows
    pd.set_option('display.width', None)  # None: let pandas auto-detect the terminal width
    print(wine_reviews.head())  # returns the first 5 rows

Output:

    Unnamed: 0 country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
    0 0 Italy Aromas include tropical fruit, broom, brimston... Vulkà Bianco 87 NaN Sicily & Sardinia Etna NaN Kerin O’Keefe @kerinokeefe Nicosia 2013 Vulkà Bianco (Etna) White Blend Nicosia
    1 1 Portugal This is ripe and fruity, a wine that is smooth... Avidagos 87 15.0 Douro NaN NaN Roger Voss @vossroger Quinta dos Avidagos 2011 Avidagos Red (Douro) Portuguese Red Quinta dos Avidagos
    2 2 US Tart and snappy, the flavors of lime flesh and... NaN 87 14.0 Oregon Willamette Valley Willamette Valley Paul Gregutt @paulgwine Rainstorm 2013 Pinot Gris (Willamette Valley) Pinot Gris Rainstorm
    3 3 US Pineapple rind, lemon pith and orange blossom ... Reserve Late Harvest 87 13.0 Michigan Lake Michigan Shore NaN Alexander Peartree NaN St. Julian 2013 Reserve Late Harvest Riesling ... Riesling St. Julian
    4 4 US Much like the regular bottling from 2012, this... Vintner's Reserve Wild Child Block 87 65.0 Oregon Willamette Valley Willamette Valley Paul Gregutt @paulgwine Sweet Cheeks 2012 Vintner's Reserve Wild Child... Pinot Noir Sweet Cheeks

The full set of columns is shown below:

| | Unnamed: 0 | country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Italy | Aromas include tropical fruit, broom, brimston… | Vulkà Bianco | 87 | NaN | Sicily & Sardinia | Etna | NaN | Kerin O’Keefe | @kerinokeefe | Nicosia 2013 Vulkà Bianco (Etna) | White Blend | Nicosia |
| 1 | 1 | Portugal | This is ripe and fruity, a wine that is smooth… | Avidagos | 87 | 15.0 | Douro | NaN | NaN | Roger Voss | @vossroger | Quinta dos Avidagos 2011 Avidagos Red (Douro) | Portuguese Red | Quinta dos Avidagos |
| 2 | 2 | US | Tart and snappy, the flavors of lime flesh and… | NaN | 87 | 14.0 | Oregon | Willamette Valley | Willamette Valley | Paul Gregutt | @paulgwine | Rainstorm 2013 Pinot Gris (Willamette Valley) | Pinot Gris | Rainstorm |
| 3 | 3 | US | Pineapple rind, lemon pith and orange blossom … | Reserve Late Harvest | 87 | 13.0 | Michigan | Lake Michigan Shore | NaN | Alexander Peartree | NaN | St. Julian 2013 Reserve Late Harvest Riesling … | Riesling | St. Julian |
| 4 | 4 | US | Much like the regular bottling from 2012, this… | Vintner’s Reserve Wild Child Block | 87 | 65.0 | Oregon | Willamette Valley | Willamette Valley | Paul Gregutt | @paulgwine | Sweet Cheeks 2012 Vintner’s Reserve Wild Child… | Pinot Noir | Sweet Cheeks |
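
For reference, `head()` also accepts a row count and returns a new DataFrame rather than printing anything itself. A minimal sketch on a hypothetical ten-row DataFrame:

```python
import pandas as pd

# Hypothetical ten-row DataFrame standing in for a larger dataset
df = pd.DataFrame({'n': range(10)})

first_five = df.head()    # default is the first 5 rows
first_three = df.head(3)  # an explicit row count
print(len(first_five))    # 5
print(len(first_three))   # 3
```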

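The pd.read_csv() function is richly parameterized, with more than 30 optional parameters. For example, this CSV file has a built-in index, which pandas did not pick up automatically. To make pandas use that column as the index instead of creating a new one from scratch, we can specify index_col explicitly.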

    import pandas as pd
    pd.set_option('display.max_columns', None)  # show all columns
    pd.set_option('display.max_rows', None)  # show all rows
    pd.set_option('display.width', None)  # None: let pandas auto-detect the terminal width
    wine_reviews = pd.read_csv("../data/winemag-data-130k-v2.csv", index_col=0)
    print(wine_reviews.head())

Output:

    country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
    0 Italy Aromas include tropical fruit, broom, brimston... Vulkà Bianco 87 NaN Sicily & Sardinia Etna NaN Kerin O’Keefe @kerinokeefe Nicosia 2013 Vulkà Bianco (Etna) White Blend Nicosia
    1 Portugal This is ripe and fruity, a wine that is smooth... Avidagos 87 15.0 Douro NaN NaN Roger Voss @vossroger Quinta dos Avidagos 2011 Avidagos Red (Douro) Portuguese Red Quinta dos Avidagos
    2 US Tart and snappy, the flavors of lime flesh and... NaN 87 14.0 Oregon Willamette Valley Willamette Valley Paul Gregutt @paulgwine Rainstorm 2013 Pinot Gris (Willamette Valley) Pinot Gris Rainstorm
    3 US Pineapple rind, lemon pith and orange blossom ... Reserve Late Harvest 87 13.0 Michigan Lake Michigan Shore NaN Alexander Peartree NaN St. Julian 2013 Reserve Late Harvest Riesling ... Riesling St. Julian
    4 US Much like the regular bottling from 2012, this... Vintner's Reserve Wild Child Block 87 65.0 Oregon Willamette Valley Willamette Valley Paul Gregutt @paulgwine Sweet Cheeks 2012 Vintner's Reserve Wild Child... Pinot Noir Sweet Cheeks
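
The effect of `index_col` can be reproduced without the full dataset. The sketch below assumes a tiny in-memory CSV whose first header cell is empty, mimicking the layout of the wine file. The sample rows are cut down to two columns for illustration.

```python
import io
import pandas as pd

# Tiny CSV mimicking the wine file's layout: the first header cell is empty
# (assumed sample, reduced to two data columns)
csv_text = """,country,points
0,Italy,87
1,Portugal,87
"""

default = pd.read_csv(io.StringIO(csv_text))               # no index_col
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)  # first column becomes the index

print(list(default.columns))  # ['Unnamed: 0', 'country', 'points']
print(list(indexed.columns))  # ['country', 'points']
print(list(indexed.index))    # [0, 1]
```

Without `index_col`, the unnamed first column is read as ordinary data and labelled `Unnamed: 0`, which is the extra column seen in the earlier output.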