原文:Data Wrangling

译者:飞龙

协议:CC BY-NC-SA 4.0(原文协议:Apache License 2.0

数据流

7.2 数据整理 - 图1

直接从 GitHub 挖掘数据,VizGitHub API 提供支持,并利用以下内容:

将来,Google BigQueryGitHub Archive 也可以补充 GitHub API。

导入

  1. import re
  2. import pandas as pd

准备仓库数据

加载仓库数据并删除重复:

  1. repos = pd.read_csv("data/2017/repos-dump.csv", quotechar='"', skipinitialspace=True)
  2. print('Shape before dropping duplicates', repos.shape)
  3. repos = repos.drop_duplicates(subset='full_name', keep='last')
  4. print('Shape after dropping duplicates', repos.shape)
  5. repos.head()
  6. '''
  7. Shape before dropping duplicates (8697, 5)
  8. Shape after dropping duplicates (8697, 5)
  9. '''
full_name stars forks description language
0 thedaviddias/Front-End-Checklist 24267 2058 🗂 The perfect Front-End Checklist for modern w… JavaScript
1 GoogleChrome/puppeteer 21976 1259 Headless Chrome Node API JavaScript
2 parcel-bundler/parcel 13981 463 📦🚀 Blazing fast, zero configuration web applic… JavaScript
3 Chalarangelo/30-seconds-of-code 13466 1185 Curated collection of useful Javascript snippe… JavaScript
4 wearehive/project-guidelines 11279 970 A set of best practices for JavaScript projects JavaScript

userrepofull_name分离,变成新列:

  1. def extract_user(line):
  2. return line.split('/')[0]
  3. def extract_repo(line):
  4. return line.split('/')[1]
  5. repos['user'] = repos['full_name'].str[:].apply(extract_user)
  6. repos['repo'] = repos['full_name'].str[:].apply(extract_repo)
  7. print(repos.shape)
  8. repos.head()
  9. # (8697, 7)
full_name stars forks description language user repo
0 thedaviddias/Front-End-Checklist 24267 2058 🗂 The perfect Front-End Checklist for modern w… JavaScript thedaviddias Front-End-Checklist
1 GoogleChrome/puppeteer 21976 1259 Headless Chrome Node API JavaScript GoogleChrome puppeteer
2 parcel-bundler/parcel 13981 463 📦🚀 Blazing fast, zero configuration web applic… JavaScript parcel-bundler parcel
3 Chalarangelo/30-seconds-of-code 13466 1185 Curated collection of useful Javascript snippe… JavaScript Chalarangelo 30-seconds-of-code
4 wearehive/project-guidelines 11279 970 A set of best practices for JavaScript projects JavaScript wearehive project-guidelines

准备用户数据

加载用户数据并删除重复:

  1. users = pd.read_csv("data/2017/user-geocodes-dump.csv", quotechar='"', skipinitialspace=True)
  2. print('Shape before dropping duplicates', users.shape)
  3. users = users.drop_duplicates(subset='id', keep='last')
  4. print('Shape after dropping duplicates', users.shape)
  5. users.head()
  6. '''
  7. Shape before dropping duplicates (6426, 8)
  8. Shape after dropping duplicates (6426, 8)
  9. '''
id name type location lat long city country
0 dns-violations NaN Organization NaN NaN NaN NaN NaN
1 hannob Hanno Böck User Berlin 52.520007 13.404954 Berlin Germany
2 takecian Takeshi Fujiki User Tokyo, Japan 35.689487 139.691706 Tokyo Japan
3 jtomschroeder Tom Schroeder User Chicago 41.878114 -87.629798 Chicago United States
4 wapiflapi Wannes Rombouts User France 46.227638 2.213749 NaN France

id列重命名为user

  1. users.rename(columns={'id': 'user'}, inplace=True)
  2. users.head()
user name type location lat long city country
0 dns-violations NaN Organization NaN NaN NaN NaN NaN
1 hannob Hanno Böck User Berlin 52.520007 13.404954 Berlin Germany
2 takecian Takeshi Fujiki User Tokyo, Japan 35.689487 139.691706 Tokyo Japan
3 jtomschroeder Tom Schroeder User Chicago 41.878114 -87.629798 Chicago United States
4 wapiflapi Wannes Rombouts User France 46.227638 2.213749 NaN France

合并仓库和用户数据

左连接仓库和用户:

  1. repos_users = pd.merge(repos, users, on='user', how='left')
  2. print('Shape repos:', repos.shape)
  3. print('Shape users:', users.shape)
  4. print('Shape repos_users:', repos_users.shape)
  5. repos_users.head()
  6. '''
  7. Shape repos: (8697, 7)
  8. Shape users: (6426, 8)
  9. Shape repos_users: (8697, 14)
  10. '''
full_name stars forks description language user repo name type location lat long city country
0 thedaviddias/Front-End-Checklist 24267 2058 🗂 The perfect Front-End Checklist for modern w… JavaScript thedaviddias Front-End-Checklist David Dias User France, Mauritius, Canada NaN NaN NaN NaN
1 GoogleChrome/puppeteer 21976 1259 Headless Chrome Node API JavaScript GoogleChrome puppeteer NaN Organization NaN NaN NaN NaN NaN
2 parcel-bundler/parcel 13981 463 📦🚀 Blazing fast, zero configuration web applic… JavaScript parcel-bundler parcel Parcel Organization NaN NaN NaN NaN NaN
3 Chalarangelo/30-seconds-of-code 13466 1185 Curated collection of useful Javascript snippe… JavaScript Chalarangelo 30-seconds-of-code Angelos Chalaris User Athens, Greece 37.983810 23.727539 Athens Greece
4 wearehive/project-guidelines 11279 970 A set of best practices for JavaScript projects JavaScript wearehive project-guidelines Hive Organization London 51.507351 -0.127758 London United Kingdom

整理仓库和用户数据

重新排序列:

  1. repos_users = repos_users.reindex_axis(['full_name',
  2. 'repo',
  3. 'description',
  4. 'stars',
  5. 'forks',
  6. 'language',
  7. 'user',
  8. 'name',
  9. 'type',
  10. 'location',
  11. 'lat',
  12. 'long',
  13. 'city',
  14. 'country'], axis=1)
  15. print(repos_users.shape)
  16. repos_users.head()
  17. # (8697, 14)
full_name repo description stars forks language user name type location lat long city country
0 thedaviddias/Front-End-Checklist Front-End-Checklist 🗂 The perfect Front-End Checklist for modern w… 24267 2058 JavaScript thedaviddias David Dias User France, Mauritius, Canada NaN NaN NaN NaN
1 GoogleChrome/puppeteer puppeteer Headless Chrome Node API 21976 1259 JavaScript GoogleChrome NaN Organization NaN NaN NaN NaN NaN
2 parcel-bundler/parcel parcel 📦🚀 Blazing fast, zero configuration web applic… 13981 463 JavaScript parcel-bundler Parcel Organization NaN NaN NaN NaN NaN
3 Chalarangelo/30-seconds-of-code 30-seconds-of-code Curated collection of useful Javascript snippe… 13466 1185 JavaScript Chalarangelo Angelos Chalaris User Athens, Greece 37.983810 23.727539 Athens Greece
4 wearehive/project-guidelines project-guidelines A set of best practices for JavaScript projects 11279 970 JavaScript wearehive Hive Organization London 51.507351 -0.127758 London United Kingdom

添加整体排名

根据星数对每个元素排名:

  1. repos_users['rank'] = repos_users['stars'].rank(ascending=False)
  2. print(repos_users.shape)
  3. repos_users.head()
  4. # (8697, 15)
full_name repo description stars forks language user name type location lat long city country rank
0 thedaviddias/Front-End-Checklist Front-End-Checklist 🗂 The perfect Front-End Checklist for modern w… 24267 2058 JavaScript thedaviddias David Dias User France, Mauritius, Canada NaN NaN NaN NaN 3
1 GoogleChrome/puppeteer puppeteer Headless Chrome Node API 21976 1259 JavaScript GoogleChrome NaN Organization NaN NaN NaN NaN NaN 4
2 parcel-bundler/parcel parcel 📦🚀 Blazing fast, zero configuration web applic… 13981 463 JavaScript parcel-bundler Parcel Organization NaN NaN NaN NaN NaN 11
3 Chalarangelo/30-seconds-of-code 30-seconds-of-code Curated collection of useful Javascript snippe… 13466 1185 JavaScript Chalarangelo Angelos Chalaris User Athens, Greece 37.983810 23.727539 Athens Greece 13
4 wearehive/project-guidelines project-guidelines A set of best practices for JavaScript projects 11279 970 JavaScript wearehive Hive Organization London 51.507351 -0.127758 London United Kingdom 16

验证结果:用户

等价于 GitHub 搜索查询created:2017-01-01..2017-12-31 stars:> = 100 user:donnemartin

注意:数据可能稍微差了一些,因为搜索查询将考虑执行查询时的数据。 此笔记本中的数据于 2017 年 1 月 1 日采集,来“冻结” 2017 年的结果。从 2017 年 1 月 1 日开始,执行搜索的时间越长,差异越大。

  1. repos_users[repos_users['user'] == 'donnemartin']
full_name repo description stars forks language user name type location lat long city country rank
3308 donnemartin/system-design-primer system-design-primer Learn how to design large-scale systems. Prep … 21780 2633 Python donnemartin Donne Martin User Washington, D.C. 38.907192 -77.036871 Washington United States 5

验证结果:Python 仓库

等价于 GitHub 搜索查询created:2017-01-01..2017-12-31 stars:>=100 language:python

注意:数据可能稍微差了一些,因为搜索查询将考虑执行查询时的数据。 此笔记本中的数据于 2017 年 1 月 1 日采集,来“冻结” 2017 年的结果。从 2017 年 1 月 1 日开始,执行搜索的时间越长,差异越大。

  1. print(repos_users[repos_users['language'] == 'Python'].shape)
  2. repos_users[repos_users['language'] == 'Python'].head()
  3. # (1357, 15)
full_name repo description stars forks language user name type location lat long city country rank
3308 donnemartin/system-design-primer system-design-primer Learn how to design large-scale systems. Prep … 21780 2633 Python donnemartin Donne Martin User Washington, D.C. 38.907192 -77.036871 Washington United States 5
3309 python/cpython cpython The Python programming language 15060 3779 Python python Python Organization NaN NaN NaN NaN NaN 9
3310 ageitgey/face_recognition face_recognition The world’s simplest facial recognition api fo… 8487 1691 Python ageitgey Adam Geitgey User Various places NaN NaN NaN NaN 31
3311 tonybeltramelli/pix2code pix2code pix2code: Generating Code from a Graphical Use… 8037 605 Python tonybeltramelli Tony Beltramelli User Denmark NaN NaN NaN NaN 34
3312 google/python-fire python-fire Python Fire is a library for automatically gen… 7663 386 Python google Google Organization NaN NaN NaN NaN NaN 36

验证结果:所有仓库

等价于 GitHub 搜索查询created:2017-01-01..2017-12-31 stars:>=100

注意:数据可能稍微差了一些,因为搜索查询将考虑执行查询时的数据。 此笔记本中的数据于 2017 年 1 月 1 日采集,来“冻结” 2017 年的结果。从 2017 年 1 月 1 日开始,执行搜索的时间越长,差异越大。

  1. print(repos_users.shape)
  2. repos_users.head()
  3. # (8697, 15)
full_name repo description stars forks language user name type location lat long city country rank
0 thedaviddias/Front-End-Checklist Front-End-Checklist 🗂 The perfect Front-End Checklist for modern w… 24267 2058 JavaScript thedaviddias David Dias User France, Mauritius, Canada NaN NaN NaN NaN 3
1 GoogleChrome/puppeteer puppeteer Headless Chrome Node API 21976 1259 JavaScript GoogleChrome NaN Organization NaN NaN NaN NaN NaN 4
2 parcel-bundler/parcel parcel 📦🚀 Blazing fast, zero configuration web applic… 13981 463 JavaScript parcel-bundler Parcel Organization NaN NaN NaN NaN NaN 11
3 Chalarangelo/30-seconds-of-code 30-seconds-of-code Curated collection of useful Javascript snippe… 13466 1185 JavaScript Chalarangelo Angelos Chalaris User Athens, Greece 37.983810 23.727539 Athens Greece 13
4 wearehive/project-guidelines project-guidelines A set of best practices for JavaScript projects 11279 970 JavaScript wearehive Hive Organization London 51.507351 -0.127758 London United Kingdom 16

输出结果

将结果写出到 csv 来在 Tableau 中可视化:

  1. users.to_csv('data/2017/users.csv', index=False)
  2. repos_users.to_csv('data/2017/repos-users-geocodes.csv', index=False)
  3. repos_users.to_csv('data/2017/repos-users.csv', index=False)
  4. repos_rank = repos_users.reindex_axis(['full_name', 'rank'], axis=1)
  5. repos_rank.to_csv('data/2017/repos-ranks.csv', index=False)