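A DataFrame is a tabular data structure, similar to an Excel spreadsheet or a SQL table. It holds an ordered collection of columns, and each column can have a different value type (numeric, string, boolean, and so on).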

A DataFrame has both a row index and a column index. It can be thought of as a dictionary of Series that all share the same index.
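To make the "dictionary of Series" picture concrete, here is a minimal sketch. The column names and sample values are illustrative assumptions, not taken from the text. It shows that selecting a column yields a Series and that every column carries the same row index as the frame.

```python
import pandas as pd

# Illustrative data; the names and values are assumptions for this sketch.
frame = pd.DataFrame({
    'name': ['a', 'b', 'c'],
    'score': [1.5, 2.0, 3.5],
    'passed': [True, True, False],
})

name_col = frame['name']    # selecting one column yields a Series
score_col = frame['score']

print(type(name_col).__name__)                 # Series
print(name_col.index.equals(frame.index))      # True: same row index as the frame
print(score_col.index.equals(name_col.index))  # True: all columns share one index
print(frame.dtypes)                            # each column keeps its own dtype
```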

Creating a DataFrame

Create a DataFrame from a dict of arrays or a dict of lists:

    import pandas as pd

    data = {
        'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]
    }
    frame = pd.DataFrame(data)
    print(frame)

Output:

        state  year  pop
    0    Ohio  2000  1.5
    1    Ohio  2001  1.7
    2    Ohio  2002  3.6
    3  Nevada  2001  2.4
    4  Nevada  2002  2.9

Specifying row labels

Use index to give each row a label:

    frame = pd.DataFrame(data, index=['one', 'two', 'three', 'four', 'five'])
    print(frame)

Output:

            state  year  pop
    one      Ohio  2000  1.5
    two      Ohio  2001  1.7
    three    Ohio  2002  3.6
    four   Nevada  2001  2.4
    five   Nevada  2002  2.9

:::warning Note: the number of index labels must match the number of rows in the data; otherwise an error (a ValueError) is raised. :::
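The warning above can be reproduced with a short sketch. It reuses the same data dict and passes one label too few. The exact error message varies between pandas versions, so only the exception type is relied on here.

```python
import pandas as pd

data = {
    'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
    'year': [2000, 2001, 2002, 2001, 2002],
    'pop': [1.5, 1.7, 3.6, 2.4, 2.9],
}

# Five rows of data but only four index labels: construction fails.
try:
    pd.DataFrame(data, index=['one', 'two', 'three', 'four'])
    mismatch_raised = False
except ValueError as exc:
    mismatch_raised = True
    print('ValueError:', exc)
```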

Specifying the column order

If a column order is specified, the DataFrame's columns are arranged in that order:

    data = {
        'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]
    }
    frame = pd.DataFrame(data, columns=['year', 'state', 'pop'])
    print(frame)

Output:

       year   state  pop
    0  2000    Ohio  1.5
    1  2001    Ohio  1.7
    2  2002    Ohio  3.6
    3  2001  Nevada  2.4
    4  2002  Nevada  2.9

As with Series, if a requested column is not found in the data, it is filled with NaN values:

    frame = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                         index=['one', 'two', 'three', 'four', 'five'])
    print(frame)

Output:

           year   state  pop debt
    one    2000    Ohio  1.5  NaN
    two    2001    Ohio  1.7  NaN
    three  2002    Ohio  3.6  NaN
    four   2001  Nevada  2.4  NaN
    five   2002  Nevada  2.9  NaN

Creating a DataFrame from Series

A DataFrame can also be built from Series. Likewise, wherever the indexes do not line up, NaN values are produced:

    d = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
         'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
    print(pd.DataFrame(d))

Output:

       one  two
    a  1.0  1.0
    b  2.0  2.0
    c  3.0  3.0
    d  NaN  4.0
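As a small follow-up sketch, the alignment can be inspected directly: the resulting row index is the union of the two Series indexes, and isna() marks the cell that had no matching label. The variable name df is an assumption for this example.

```python
import pandas as pd

d = {
    'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
    'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd']),
}
df = pd.DataFrame(d)

# The row index is the union of the two Series indexes.
print(list(df.index))

# 'one' has no label 'd', so that cell is NaN.
print(df['one'].isna())
```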

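Retrieving data from a DataFrame

A column of a DataFrame can be retrieved as a Series, either with dict-like notation or as an attribute. The returned Series has the same index as the original DataFrame: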

    frame = pd.DataFrame(data, columns=['year', 'state', 'pop'],
                         index=['one', 'two', 'three', 'four', 'five'])
    print(frame['state'])

Output:

    one        Ohio
    two        Ohio
    three      Ohio
    four     Nevada
    five     Nevada
    Name: state, dtype: object
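The listing above uses only the dict-like form. The sketch below, built on the same data, shows the attribute form as well, together with one caveat: a column whose name collides with an existing DataFrame method (here pop) cannot be reached as an attribute, so the dict-like form is the safer default.

```python
import pandas as pd

data = {
    'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
    'year': [2000, 2001, 2002, 2001, 2002],
    'pop': [1.5, 1.7, 3.6, 2.4, 2.9],
}
frame = pd.DataFrame(data, columns=['year', 'state', 'pop'],
                     index=['one', 'two', 'three', 'four', 'five'])

# Attribute access returns the same Series as dict-like access.
print(frame.state.equals(frame['state']))

# Caveat: 'pop' is also the name of a DataFrame method, so frame.pop
# refers to that method rather than to the column.
print(callable(frame.pop))

# Dict-like access always reaches the column.
print(frame['pop'])
```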

Assigning an array

An array, such as the result of np.arange(), can be assigned to a column so that each row gets its own value:

    import numpy as np

    frame['debt'] = np.arange(5.)
    print(frame)

Output:

           year   state  pop  debt
    one    2000    Ohio  1.5   0.0
    two    2001    Ohio  1.7   1.0
    three  2002    Ohio  3.6   2.0
    four   2001  Nevada  2.4   3.0
    five   2002  Nevada  2.9   4.0

Alternatively, assign a list directly so that each row gets its own value:

    frame['debt'] = [1, 2, 3, 4, 5]
    print(frame)

Output:

           year   state  pop  debt
    one    2000    Ohio  1.5     1
    two    2001    Ohio  1.7     2
    three  2002    Ohio  3.6     3
    four   2001  Nevada  2.4     4
    five   2002  Nevada  2.9     5

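Modifying a DataFrame

Columns can be modified by assignment. For example, the "debt" column can be set to a scalar value or to an array of values. Assigning a scalar fills every row with that value: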

    frame['debt'] = 16.5
    print(frame)

Output:

           year   state  pop  debt
    one    2000    Ohio  1.5  16.5
    two    2001    Ohio  1.7  16.5
    three  2002    Ohio  3.6  16.5
    four   2001  Nevada  2.4  16.5
    five   2002  Nevada  2.9  16.5

In the same way, assigning to a column name that does not exist yet creates a new column:

    frame['new'] = frame['debt'] * frame['pop']
    print(frame)

Output:

           year   state  pop  debt    new
    one    2000    Ohio  1.5  16.5  24.75
    two    2001    Ohio  1.7  16.5  28.05
    three  2002    Ohio  3.6  16.5  59.40
    four   2001  Nevada  2.4  16.5  39.60
    five   2002  Nevada  2.9  16.5  47.85

Transposing a DataFrame

The T attribute returns the transpose of a DataFrame:

    frame.T
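As a minimal sketch of what transposition does, the example below uses a small made-up frame. The row labels become the column labels and vice versa, and the shape flips accordingly.

```python
import pandas as pd

# Small illustrative frame; the values are assumptions for this sketch.
frame = pd.DataFrame(
    {'year': [2000, 2001, 2002], 'state': ['Ohio', 'Ohio', 'Nevada']},
    index=['one', 'two', 'three'],
)
transposed = frame.T

print(frame.shape, transposed.shape)  # rows and columns swap places
print(list(transposed.index))         # the former column labels
print(list(transposed.columns))       # the former row labels
print(transposed)
```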


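Converting to NumPy

Use to_numpy to convert a DataFrame to a NumPy array. For example, with frame holding the columns year, state, pop and debt, where debt is set to [1, 2, 3, 4, 5]: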

    frame.to_numpy()

Output:

    array([[2000, 'Ohio', 1.5, 1],
           [2001, 'Ohio', 1.7, 2],
           [2002, 'Ohio', 3.6, 3],
           [2001, 'Nevada', 2.4, 4],
           [2002, 'Nevada', 2.9, 5]], dtype=object)

:::info The output of DataFrame.to_numpy() does not include the row index or the column labels. :::
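The dtype=object in the output above follows from the mixed column types. The sketch below uses a small made-up frame to contrast that case with a numeric-only selection, which converts to a numeric array instead.

```python
import pandas as pd

# Small illustrative frame; the values are assumptions for this sketch.
frame = pd.DataFrame(
    {'year': [2000, 2001], 'state': ['Ohio', 'Nevada'], 'pop': [1.5, 2.4]},
    index=['one', 'two'],
)

mixed = frame.to_numpy()                     # numbers and strings together
numeric = frame[['year', 'pop']].to_numpy()  # numeric columns only

print(mixed.dtype)    # object, because the columns have mixed types
print(numeric.dtype)  # float64, the common type of int64 and float64
print(mixed.shape)    # (2, 3): values only, no index or column labels
```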