通常情况下,我们需要找出两个数据帧之间的共同行,或者找到一个数据帧中的行而第二个数据帧中缺少的行。在这篇文章中,我们将看到如何使用 pandas 来实现这一点。

这里有两个数据框,我们将使用它们来查找公共行,数据框 1 中的行和数据框 2 中的行

  1. import pandas as pd
  2. df1 = pd.DataFrame({'City': ['New York', 'Chicago', 'Tokyo', 'Paris','New Delhi'],
  3. 'Temp': [59, 29, 73, 56,48]})
  4. df2 = pd.DataFrame({'City': ['London', 'New York', 'Tokyo', 'New Delhi','Paris'],
  5. 'Temp': [55, 55, 73, 85,56]})

Pandas 两个数据框之间的区别 - kanoki - 图1

使用合并函数查找两个数据框之间的公共行

使用合并功能,您可以获得两个数据帧之间的匹配行。因此,我们将 dataframe(df1) 与 dataframe(df2) 合并,要执行的合并类型是inner,它使用两个帧的键的交集,类似于 SQL 内连接。

  1. df = df1.merge(df2, how = 'inner' ,indicator=False)
  2. df

Pandas 两个数据框之间的区别 - kanoki - 图2

所以我们得到的是东京和巴黎,这两个数据帧之间很常见

使用 Concat函数查找两个数据帧之间的公共行

concat 函数沿特定轴连接第一个数据帧(df1)下方的第二个数据帧(df2),并沿其他轴连接可选的设置逻辑。因此,在这里我们将两个数据帧连接起来,然后对所有列进行分组,并找到计数大于 1 的行,因为这些行是两个数据帧共有的行。这是代码片段:

  1. df = pd.concat([df1, df2])
  2. df = df.reset_index(drop=True)
  3. df_gpby = df.groupby(list(df.columns))
  4. idx = [x[0] for x in df_gpby.groups.values() if len(x) != 1]
  5. df.reindex(idx)

Pandas 两个数据框之间的区别 - kanoki - 图3

在 DF1 中查找 DF2 中不可用的行

We will see how to get all the rows in dataframe(df1) which are not available in dataframe(df2). We can use the same merge function as used above only the parameter indicator is set to true, which adds a column to output DataFrame called “_merge” with information on the source of each row. If string, column with information on source of each row will be added to output DataFrame, and column will be named value of string. Information column is Categorical-type and takes on a value of “left_only” for observations whose merge key only appears in ‘left’ DataFrame, “right_only” for observations whose merge key only appears in ‘right’ DataFrame, and “both” if the observation’s merge key is found in both

  1. df = df1.merge(df2, how = 'outer' ,indicator=True).loc[lambda x : x['_merge']=='left_only']
  2. df

Pandas 两个数据框之间的区别 - kanoki - 图4

Using the lambda function we have filtered the rows with _merge value “left_only” to get all the rows in df1 which are missing from df2

Find Rows in DF2 Which Are Not Available in DF1

Just change the filter value on _merge column to right_only to get all the rows which are available in dataframe(df2) only and missing from df1

Just see the type of merge i.e. parameter how is changed to outer which is basically union of keys from both frames, similar to a SQL full outer join

  1. df = df1.merge(df2, how = 'outer' ,indicator=True).loc[lambda x : x['_merge']=='right_only']
  2. df

Pandas 两个数据框之间的区别 - kanoki - 图5

Check If Two Dataframes Are Exactly Same

In order to check if two dataframes are equal we can use equals function, which llows two Series or DataFrames to be compared against each other to see if they have the same shape and elements. NaNs in the same location are considered equal. The column headers do not need to have the same type, but the elements within the columns must be the same dtype

Check If Columns of Two Dataframes Are Exactly Same

Using equals you can also compare if the columns of two dataframes are equal or not

  1. df2['Temp'].equals(df1['Temp'])
  2. False

Find Rows Which Are Not common Between Two dataframes

So far we have seen all the ways to find common rows between two dataframes or rows available in one and missing from another dataframe. Now if we have to get all the rows which are not common between the two dataframe or we want to see all the unique un-matched rows between two dataframe then we can use the concat function with drop_duplicate.

  1. pd.concat([df1,df2]).drop_duplicates(keep=False)

Pandas 两个数据框之间的区别 - kanoki - 图6

Find All Values in a Column Between Two Dataframes Which Are Not Common

We will see how to get the set of values between columns of two dataframes which aren’t common between them. So here we are finding the symmetric difference also known as the disjunctive union, of two sets is the set of elements which are in either of the sets and not in their intersection

  1. set(df1.Temp).symmetric_difference(df2.Temp)
  2. {29, 48, 55, 59, 85}

上面的代码行给出了两个数据帧和同一列之间不常见的温度值。检查 df1 和 df2 并查看不常见的值是否相同。

结论

所以我们已经看到使用 Pandas - Merge、Concat 和 Equals 我们可以轻松找到存储在数据框中的两个 excel、csv 之间的差异。它还提供了一种直观的方法来比较数据帧并找到两个数据帧之间常见或不常见的行。您还可以阅读这篇文章,了解如何使用 pandas 逐个单元格比较两个 exel 文件,并将结果存储在 Excel 报告中。
https://kanoki.org/2019/07/04/pandas-difference-between-two-dataframes/