Joining Data

Joining Data

介绍

你拥有从某个表中以任何格式获取数据的工具。但是，如果你想要的数据分布在多个表中呢？

这就是JOIN的作用！JOIN在实际工作流中非常重要。开始我们的学习吧。

例子

我们将使用假想的pets表，它有三列：

ID-宠物的ID号码
Name-宠物的名字
Animal-宠物的类型

我们还将添加另一个表，称为owners。它也有三列：

ID-主人的ID号（与宠物的ID号不同）
Name-主人的名字
Pet_ID-属于所有者的宠物的ID号（与pets表中宠物的ID号匹配）

为了获取适用于特定宠物的信息，我们将pets表中的ID列与owners表中的Pet_ID列匹配。

例如，

pets表显示，Dr. Harris Bonkers是ID为1的宠物。
owners表显示，Aubrey Little是ID为1的宠物的所有者。

把这两个事实放在一起，Dr. Harris Bonkers是归Aubrey Little所拥有。

幸运的是，我们不需要手工来确定哪个主人和哪个宠物和哪个主人一起。在接下来的一部分，你将学习如何使用JOIN创建一个新表，将pets表和owners表中的信息组合在一起。

JOIN

使用JOIN，我们可以编写一个查询来创建一个只有两列的表：名称为pet和名称为owners。

我们通过匹配pets表中的ID列与owners表中的Pet_ID列匹配的行来组合来自两个表的信息。

在查询中，ON确定每个表中的哪一列用于组合表。注意，由于ID列存在于这两个表中，我们必须明确使用哪个列。我们使用p.ID来引用pets表中的ID列，以及o.Pet_ID引用owners表中的Pet_ID列。

通常情况下，在连接表时，指定每个列来自哪个表是个好习惯。这样的话，你就不必每次返回读取查询时都打开模式。

我们今天使用的连接类型称为内部连接。这意味着，只有当用于组合它们的列中的值出现在要连接的两个表中时，才会将一行放入最终的输出表中。例如，如果在pets表中不存在Tom的ID数字4，那么我们只会从这个查询中返回3行，还有其他类型的连接，但是内部连接使用非常广泛，所有这是一个很好的开始。

例子：每种类型的软件许可证涵盖多少个文件？

GitHub是最流行的软件项目协作平台。一个GitHub存储库（或repo）是与特定项目关联的文件集合。

GitHub上的大多数repos都是在特定的法律许可下共享的，这就决定了如何使用它们的法律限制。对于我们的示例，我们将查看在每个许可证下发布了多少个不同的文件。

我们将使用数据库中的两个表。第一个表是licenses表，它提供了每个GitHub repo的名称（在repo_name列中）及其相应的许可证。这是前五行的视图。

In [1]

from google.cloud import bigquery
# Create a "Client" object
client = bigquery.Client()
# Construct a reference to the "github_repos" dataset
dataset_ref = client.dataset("github_repos", project="bigquery-public-data")
# API request - fetch the dataset
dataset = client.get_dataset(dataset_ref)
# Construct a reference to the "licenses" table
licenses_ref = dataset_ref.table("licenses")
# API request - fetch the table
licenses_table = client.get_table(licenses_ref)
# Preview the first five lines of the "licenses" table
client.list_rows(licenses_table, max_results=5).to_dataframe()

Using Kaggle's public dataset BigQuery integration.

Out [1]

	repo_name	license
0	azuredream/chat_server-client	artistic-2.0
1	Egyptian19/JemCraft	artistic-2.0
2	ZioRiP/cookie	artistic-2.0
3	ajs/perl6-log	artistic-2.0
4	JohanPotgieter/Internet	artistic-2.0

第二个表是sample_files表，其中提供了每个文件所属的GitHub repo（在repo_name列中）。下面打印了该表的前几行。

In [2]

# Construct a reference to the "sample_files" table
files_ref = dataset_ref.table("sample_files")
# API request - fetch the table
files_table = client.get_table(files_ref)
# Preview the first five lines of the "sample_files" table
client.list_rows(files_table, max_results=5).to_dataframe()

Out [2]

	repo_name	ref	path	mode	id	symlink_target
0	git/git	refs/heads/master	RelNotes	40960	62615ffa4e97803da96aefbc798ab50f949a8db7	Documentation/RelNotes/2.10.0.txt
1	np/ling	refs/heads/master	tests/success/plug_compose.t/plug_compose.ll	40960	0c1605e4b447158085656487dc477f7670c4bac1	../../../fixtures/all/plug_compose.ll
2	np/ling	refs/heads/master	fixtures/strict-par-success/parallel_assoc_lef…	40960	b59bff84ec03d12fabd3b51a27ed7e39a180097e	../all/parallel_assoc_left.ll
3	np/ling	refs/heads/master	fixtures/sequence/parallel_assoc_2tensor2_left.ll	40960	f29523e3fb65702d99478e429eac6f801f32152b	../all/parallel_assoc_2tensor2_left.ll
4	np/ling	refs/heads/master	fixtures/success/my_dual.ll	40960	38a3af095088f90dfc956cb990e893909c3ab286	../all/my_dual.ll

接下来我们编写一个查询，它使用两个表中的信息来确定每个许可证中释放了多少文件。

In [3]

# Query to determine the number of files per license, sorted by number of files
query = """
        SELECT L.license, COUNT(1) AS number_of_files
        FROM `bigquery-public-data.github_repos.sample_files` AS sf
        INNER JOIN `bigquery-public-data.github_repos.licenses` AS L 
            ON sf.repo_name = L.repo_name
        GROUP BY L.license
        ORDER BY number_of_files DESC
        """
# Set up the query (cancel the query if it would use too much of 
# your quota, with the limit set to 10 GB)
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
query_job = client.query(query, job_config=safe_config)
# API request - run the query, and convert the results to a pandas DataFrame
file_count_by_license = query_job.to_dataframe()

这是一个很大的查询，所以我们将分别研究每个部分。

我们将从JOIN（上面用蓝色高亮显示）开始。这指定了数据的来源以及如何连接它们。我们使用ON指定通过匹配表中的repo_name列中的值来组合表。

接下来我们将讨论SELECT和GROUP BY（用黄色突出显示）。在统计sample_files表中对应于每个许可证的行数之前，GROUP BY将每个许可证的数据分成不同的组。（记住，可以使用COUNT(1)计算行数。）

最后，ORDER BY（用紫色突出显示）对结果进行排序，以便先处理包含更多文件的许可证。

这是一个很大的查询，但它给了我们一个很好的表格，总结了每个协议下允许下提交多少文件：

In [4]

# Print the DataFrame
file_count_by_license

Out [4]

	license	number_of_files
0	mit	20615554
1	gpl-2.0	17112844
2	apache-2.0	7225328
3	gpl-3.0	4954512
4	bsd-3-clause	2945518
5	agpl-3.0	1297463
6	lgpl-2.1	800634
7	bsd-2-clause	700337
8	lgpl-3.0	567189
9	mpl-2.0	469300
10	cc0-1.0	404796
11	epl-1.0	322484
12	unlicense	209289
13	artistic-2.0	148414
14	isc	117807

你将会经常使用JOIN子句，并且随着你的实验，你将会非常高效地使用它们。

Your turn

现在你就剩最后一步了。通过解决这些练习来完成它。