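In earlier articles we explained what a graph network (Graph Network) and a graph node (Graph Node) are. Here we use a concrete, real-world example to show the structure of a graph network.
The data we use is the Netflix titles dataset, which records each title's name, director, country, cast, dates, and so on. We can use this data to build a graph network about films and TV shows. The dataset has also been used in a public Kaggle competition, and it can be downloaded from Kaggle.
We load the data with pandas, which lets us quickly see its structure and the attributes recorded for each title. The columns have the following meanings: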
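Attribute | Meaning (multiple values are comma-separated) |
---|---|
show_id | Unique ID of the title |
type | Whether the title is a movie or a TV show |
title | Name of the title |
director | Director(s) |
cast | Actor(s) |
country | Country or countries where the title was produced |
date_added | Date the title was added to Netflix |
release_year | Year the title was originally released |
rating | Rating of the title (in this dataset, a maturity rating such as TV-MA or PG-13) |
duration | Length of the title |
listed_in | Categories the title is listed in |
description | Short summary of the title |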
Here we select five attributes (title, director, cast, country, and listed_in) and build a new DataFrame from them.
In [1]:
import pandas as pd
# Load the data
df = pd.read_csv('/home/jovyan/kernel/netflix_titles.csv')
# Directors: an empty list [] if the value is missing, otherwise the list of all directors
df['directors'] = df['director'].apply(lambda l: [] if pd.isna(l) else [i.strip() for i in l.split(",")])
# Actors: an empty list [] if the value is missing, otherwise the list of all actors
df['actors'] = df['cast'].apply(lambda l: [] if pd.isna(l) else [i.strip() for i in l.split(",")])
# Categories: an empty list [] if the value is missing, otherwise the list of all categories
df['categories'] = df['listed_in'].apply(lambda l: [] if pd.isna(l) else [i.strip() for i in l.split(",")])
# Countries: an empty list [] if the value is missing, otherwise the list of all countries
df['countries'] = df['country'].apply(lambda l: [] if pd.isna(l) else [i.strip() for i in l.split(",")])
# Keep only the five attributes we need
df = df[["title", "directors", "actors", "categories", "countries"]]
df
Out[1]:
 | title | directors | actors | categories | countries |
---|---|---|---|---|---|
0 | Norm of the North: King Sized Adventure | [Richard Finn, Tim Maltby] | [Alan Marriott, Andrew Toth, Brian Dobson, Col… | [Children & Family Movies, Comedies] | [United States, India, South Korea, China] |
1 | Jandino: Whatever it Takes | [] | [Jandino Asporaat] | [Stand-Up Comedy] | [United Kingdom] |
2 | Transformers Prime | [] | [Peter Cullen, Sumalee Montano, Frank Welker, … | [Kids’ TV] | [United States] |
3 | Transformers: Robots in Disguise | [] | [Will Friedle, Darren Criss, Constance Zimmer,… | [Kids’ TV] | [United States] |
4 | #realityhigh | [Fernando Lebrija] | [Nesta Cooper, Kate Walsh, John Michael Higgin… | [Comedies] | [United States] |
5 | Apaches | [] | [Alberto Ammann, Eloy Azorín, Verónica Echegui… | [Crime TV Shows, International TV Shows, Spani… | [Spain] |
6 | Automata | [Gabe Ibáñez] | [Antonio Banderas, Dylan McDermott, Melanie Gr… | [International Movies, Sci-Fi & Fantasy, Thril… | [Bulgaria, United States, Spain, Canada] |
7 | Fabrizio Copano: Solo pienso en mi | [Rodrigo Toro, Francisco Schultz] | [Fabrizio Copano] | [Stand-Up Comedy] | [Chile] |
8 | Fire Chasers | [] | [] | [Docuseries, Science & Nature TV] | [United States] |
9 | Good People | [Henrik Ruben Genz] | [James Franco, Kate Hudson, Tom Wilkinson, Oma… | [Action & Adventure, Thrillers] | [United States, United Kingdom, Denmark, Sweden] |
10 | Joaquín Reyes: Una y no más | [José Miguel Contreras] | [Joaquín Reyes] | [Stand-Up Comedy] | [] |
11 | Kidnapping Mr. Heineken | [Daniel Alfredson] | [Jim Sturgess, Sam Worthington, Ryan Kwanten, … | [Action & Adventure, Dramas, International Mov… | [Netherlands, Belgium, United Kingdom, United … |
12 | Krish Trish and Baltiboy | [] | [Damandeep Singh Baggan, Smita Malhotra, Baba … | [Children & Family Movies] | [] |
13 | Krish Trish and Baltiboy: Battle of Wits | [Munjal Shroff, Tilak Shetty] | [Damandeep Singh Baggan, Smita Malhotra, Baba … | [Children & Family Movies] | [] |
14 | Krish Trish and Baltiboy: Best Friends Forever | [Munjal Shroff, Tilak Shetty] | [Damandeep Singh Baggan, Smita Malhotra, Deepa… | [Children & Family Movies] | [] |
15 | Krish Trish and Baltiboy: Comics of India | [Tilak Shetty] | [Damandeep Singh Baggan, Smita Malhotra, Baba … | [Children & Family Movies] | [] |
16 | Krish Trish and Baltiboy: Oversmartness Never … | [Tilak Shetty] | [Rishi Gambhir, Smita Malhotra, Deepak Chachra] | [Children & Family Movies] | [] |
17 | Krish Trish and Baltiboy: Part II | [] | [Damandeep Singh Baggan, Smita Malhotra, Baba … | [Children & Family Movies] | [] |
18 | Krish Trish and Baltiboy: The Greatest Trick | [Munjal Shroff, Tilak Shetty] | [Damandeep Singh Baggan, Smita Malhotra, Baba … | [Children & Family Movies] | [] |
19 | Love | [Gaspar Noé] | [Karl Glusman, Klara Kristin, Aomi Muyock, Ugo… | [Cult Movies, Dramas, Independent Movies] | [France, Belgium] |
20 | Manhattan Romance | [Tom O’Brien] | [Tom O’Brien, Katherine Waterston, Caitlin Fit… | [Comedies, Independent Movies, Romantic Movies] | [United States] |
21 | Moonwalkers | [Antoine Bardou-Jacquet] | [Ron Perlman, Rupert Grint, Robert Sheehan, St… | [Action & Adventure, Comedies, International M… | [France, Belgium] |
22 | Rolling Papers | [Mitch Dickman] | [] | [Documentaries] | [United States, Uruguay] |
23 | Stonehearst Asylum | [Brad Anderson] | [Kate Beckinsale, Jim Sturgess, David Thewlis,… | [Horror Movies, Thrillers] | [United States] |
24 | The Runner | [Austin Stark] | [Nicolas Cage, Sarah Paulson, Connie Nielsen, … | [Dramas, Independent Movies] | [United States] |
25 | 6 Years | [Hannah Fidell] | [Taissa Farmiga, Ben Rosenfield, Lindsay Burdg… | [Dramas, Independent Movies, Romantic Movies] | [United States] |
26 | Castle of Stars | [] | [Chaiyapol Pupart, Jintanutda Lummakanon, Worr… | [International TV Shows, Romantic TV Shows, TV… | [] |
27 | City of Joy | [Madeleine Gavin] | [] | [Documentaries] | [United States, ] |
28 | First and Last | [] | [] | [Docuseries] | [] |
29 | Laddaland | [Sopon Sukdapisit] | [Saharat Sangkapreecha, Pok Piyatida Woramusik… | [Horror Movies, International Movies] | [Thailand] |
… | … | … | … | … | … |
6204 | Cuckoo | [] | [Andy Samberg, Taylor Lautner, Greg Davies, He… | [British TV Shows, International TV Shows, TV … | [United Kingdom] |
6205 | Pororo - The Little Penguin | [] | [] | [Kids’ TV, Korean TV Shows] | [South Korea] |
6206 | Samantha! | [] | [Emmanuelle Araújo, Douglas Silva, Sabrina Non… | [International TV Shows, TV Comedies] | [Brazil] |
6207 | Murderous Affairs | [] | [] | [Crime TV Shows, Docuseries] | [United States] |
6208 | Lost Girl | [] | [Anna Silk, Kris Holden-Ried, Ksenia Solo, Ric… | [TV Dramas, TV Horror, TV Mysteries] | [Canada] |
6209 | Mr. Young | [] | [Brendan Meyer, Matreya Fedor, Gig Morton, Kur… | [Kids’ TV, TV Comedies] | [Canada] |
6210 | Psiconautas | [] | [Guillermo Toledo, Gabriel Goity, Florencia Pe… | [International TV Shows, Spanish-Language TV S… | [Argentina] |
6211 | The Minimighty Kids | [] | [] | [Kids’ TV, TV Comedies] | [France] |
6212 | Filinta | [] | [Onur Tuna, Serhat Tutumluer, Mehmet Özgür, Na… | [Crime TV Shows, International TV Shows, TV Ac… | [Turkey] |
6213 | Leyla and Mecnun | [Onur Ünlü] | [Ali Atay, Melis Birkan, Serkan Keskin, Ahmet … | [International TV Shows, Romantic TV Shows, TV… | [Turkey] |
6214 | Chelsea | [] | [] | [Stand-Up Comedy & Talk Shows, TV Comedies] | [United States] |
6215 | Crazy Ex-Girlfriend | [] | [Rachel Bloom, Vincent Rodriguez III, Santino … | [Romantic TV Shows, TV Comedies] | [United States] |
6216 | The Magic School Bus Rides Again | [] | [Kate McKinnon, Miles Koseleci-Vieira, Mikaela… | [Kids’ TV] | [United States] |
6217 | New Girl | [] | [Zooey Deschanel, Jake Johnson, Max Greenfield… | [Romantic TV Shows, TV Comedies] | [United States] |
6218 | Talking Tom and Friends | [] | [Colin Hanks, Tom Kenny, James Adomian, Lisa S… | [Kids’ TV, TV Comedies] | [Cyprus, Austria, Thailand] |
6219 | Pokémon the Series | [] | [Sarah Natochenny, Laurie Hymes, Jessica Paque… | [Anime Series, Kids’ TV] | [Japan] |
6220 | Justin Time | [] | [Gage Munroe, Scott McCord, Jenna Warren] | [Kids’ TV] | [Canada] |
6221 | Terrace House: Boys & Girls in the City | [] | [You, Reina Triendl, Ryota Yamasato, Yoshimi T… | [International TV Shows, Reality TV] | [Japan] |
6222 | Weeds | [] | [Mary-Louise Parker, Hunter Parrish, Alexander… | [TV Comedies, TV Dramas] | [United States] |
6223 | Gunslinger Girl | [] | [Yuuka Nanri, Kanako Mitsuhashi, Eri Sendai, A… | [Anime Series, Crime TV Shows] | [Japan] |
6224 | Anthony Bourdain: Parts Unknown | [] | [Anthony Bourdain] | [Docuseries] | [United States] |
6225 | Frasier | [] | [Kelsey Grammer, Jane Leeves, David Hyde Pierc… | [Classic & Cult TV, TV Comedies] | [United States] |
6226 | La Familia P. Luche | [] | [Eugenio Derbez, Consuelo Duval, Luis Manuel Á… | [International TV Shows, Spanish-Language TV S… | [United States] |
6227 | The Adventures of Figaro Pho | [] | [Luke Jurevicius, Craig Behenna, Charlotte Ham… | [Kids’ TV, TV Comedies] | [Australia] |
6228 | Kikoriki | [] | [Igor Dmitriev] | [Kids’ TV] | [] |
6229 | Red vs. Blue | [] | [Burnie Burns, Jason Saldaña, Gustavo Sorola, … | [TV Action & Adventure, TV Comedies, TV Sci-Fi… | [United States] |
6230 | Maron | [] | [Marc Maron, Judd Hirsch, Josh Brener, Nora Ze… | [TV Comedies] | [United States] |
6231 | Little Baby Bum: Nursery Rhyme Friends | [] | [] | [Movies] | [] |
6232 | A Young Doctor’s Notebook and Other Stories | [] | [Daniel Radcliffe, Jon Hamm, Adam Godley, Chri… | [British TV Shows, TV Comedies, TV Dramas] | [United Kingdom] |
6233 | Friends | [] | [Jennifer Aniston, Courteney Cox, Lisa Kudrow,… | [Classic & Cult TV, TV Comedies] | [United States] |
6234 rows × 5 columns
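Before building the graph, it can help to estimate how many distinct people, categories, and countries the list columns contain, since each distinct value later becomes one node. The following is a minimal sketch on a made-up toy DataFrame (the titles and names are placeholders, not rows from the dataset); it assumes the same list-valued columns as `df` above.

```python
import pandas as pd

# Toy stand-in for the prepared DataFrame; all values are invented for illustration
toy = pd.DataFrame({
    "title": ["Movie A", "Movie B"],
    "directors": [["Dir X"], []],
    "actors": [["Actor 1", "Actor 2"], ["Actor 2"]],
    "categories": [["Comedies"], ["Comedies", "Dramas"]],
    "countries": [["United States"], []],
})

# explode() puts each list element on its own row; an empty list becomes NaN,
# which nunique() ignores by default
distinct_counts = {
    col: toy[col].explode().nunique()
    for col in ["directors", "actors", "categories", "countries"]
}
print(distinct_counts)
```

Running the same dictionary comprehension on the real `df` would give a rough upper bound on the number of non-title nodes, because a name that appears in more than one column is counted once per column here.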
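Next, we use the data above to build the graph network.
As we already know, a graph network consists of graph nodes and the edges that connect them.
From the five attributes above we can create title nodes, category nodes, person nodes, and country nodes. How, then, should we construct the edges that connect these nodes?
We can express the relations between the attributes as follows: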
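- ACTED_IN: the actor appears in the title
- DERECTED_BY: the director directed the title
- CATEGORY_IN: the title belongs to the category
- COUNTRY_IN: the title was produced in the country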
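To make the schema concrete, here is a small sketch of one title with two of these relations attached. The node names are placeholders invented for illustration; the `label` attribute convention mirrors the one used in this article.

```python
import networkx as nx

# One made-up title with one actor and one category
g = nx.Graph()
g.add_node("Movie A", label="MOVIE")
g.add_node("Actor 1", label="PERSON")
g.add_node("Comedies", label="CATEGORY")

# The relation name is stored as an edge attribute called "label"
g.add_edge("Movie A", "Actor 1", label="ACTED_IN")
g.add_edge("Movie A", "Comedies", label="CATEGORY_IN")

# Iterate over the edges together with their relation name
for u, v, rel in g.edges(data="label"):
    print(u, v, rel)
```

Because `nx.Graph` is undirected, the edge does not record which endpoint is the title; the relation name alone carries that meaning.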
With the nodes and edges defined, we use networkx to build the graph network.
In [2]:
import networkx as nx
# Initialize a graph network (gn)
gn = nx.Graph(label="Netflix")
# Iterate over the data to add nodes and edges to the graph network
for i, row in df.iterrows():
    # Add the title node
    gn.add_node(row['title'], label="MOVIE")
    # Iterate over the actors
    for actor in row['actors']:
        # Add the person node
        gn.add_node(actor, label="PERSON")
        # Add an edge between the person and the title, with relation ACTED_IN
        gn.add_edge(row['title'], actor, label="ACTED_IN")
    # Iterate over the directors
    for director in row['directors']:
        # Add the person node
        gn.add_node(director, label="PERSON")
        # Add an edge between the person and the title, with relation DERECTED_BY
        gn.add_edge(row['title'], director, label="DERECTED_BY")
    # Iterate over the categories
    for cat in row['categories']:
        # Add the category node
        gn.add_node(cat, label="CATEGORY")
        # Add an edge between the category and the title, with relation CATEGORY_IN
        gn.add_edge(row['title'], cat, label="CATEGORY_IN")
    # Iterate over the countries
    for cou in row['countries']:
        # Add the country node
        gn.add_node(cou, label="COUNTRY")
        # Add an edge between the country and the title, with relation COUNTRY_IN
        gn.add_edge(row['title'], cou, label="COUNTRY_IN")
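One property of this construction is worth checking on a toy graph: nodes are keyed by name, and an undirected `nx.Graph` holds at most one edge per node pair. So when the same person both acts in and directs a title (as in row 20 of the table above), the second `add_edge` call updates the existing edge rather than adding a new one. The sketch below uses invented placeholder names and also shows one way to count nodes per type.

```python
from collections import Counter
import networkx as nx

g = nx.Graph()
g.add_node("Movie A", label="MOVIE")
g.add_node("Person P", label="PERSON")

# The same person acts in and directs the same title
g.add_edge("Movie A", "Person P", label="ACTED_IN")
g.add_edge("Movie A", "Person P", label="DERECTED_BY")  # overwrites the label

# Count nodes per type from the "label" node attribute
label_counts = Counter(nx.get_node_attributes(g, "label").values())
print(label_counts)

# Only one edge exists, and it carries the label that was written last
print(g.number_of_edges(), g.edges["Movie A", "Person P"]["label"])
```

If both relations need to be kept, `nx.MultiGraph` allows parallel edges between the same pair of nodes; whether that is worth the extra complexity depends on the analysis.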
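Once the graph network is built, we can visualize it to get a more intuitive view of its structure.
Because the graph gn is fairly large and has many nodes, drawing all of it would be cluttered, so we draw subgraphs (Sub-graph) to look at the relations among a subset of nodes.
To make drawing subgraphs easier, we define the following two functions:
- get_adjacent_nodes: given some graph nodes, collect all of their adjacent nodes to form a subgraph
- draw_sub_graph: draw the subgraph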
In [3]:
import matplotlib.pyplot as plt
# Note: on matplotlib 3.6 and later this style is named 'seaborn-v0_8'
plt.style.use('seaborn')
plt.rcParams['figure.figsize'] = [14, 14]

def get_adjacent_nodes(G, nodes):
    # Collect the given nodes together with all of their direct neighbors
    sub_graph = set()
    for n in nodes:
        sub_graph.add(n)
        for e in G.neighbors(n):
            sub_graph.add(e)
    return list(sub_graph)

def draw_sub_graph(G, sub_graph):
    # Extract the subgraph sub_graph from the graph network G
    subgraph = G.subgraph(sub_graph)
    pos = nx.spring_layout(subgraph)
    # Assign a color and a size to each type of node
    node_colors = []
    node_sizes = []
    for n in subgraph.nodes():
        if G.nodes[n]['label'] == "MOVIE":
            node_colors.append('blue')
            node_sizes.append(700)
        elif G.nodes[n]['label'] == "PERSON":
            node_colors.append('red')
            node_sizes.append(600)
        elif G.nodes[n]['label'] == "CATEGORY":
            node_colors.append('green')
            node_sizes.append(500)
        elif G.nodes[n]['label'] == "COUNTRY":
            node_colors.append('yellow')
            node_sizes.append(400)
    nx.draw(subgraph, pos, with_labels=True, node_color=node_colors, node_size=node_sizes, width=2, font_size=15)
    # Draw the relation name (the edge's 'label' attribute) on every edge
    edge_labels = {e: G.edges[e]['label'] for e in subgraph.edges()}
    nx.draw_networkx_edge_labels(subgraph, pos, edge_labels=edge_labels, font_color="red")
Now we can use this graph network to analyze the titles.
Let's start with an example: the subgraph that contains Ocean's Twelve and Ocean's Thirteen.
In [4]:
nodes = ["Ocean's Twelve", "Ocean's Thirteen"]
sub_graph = get_adjacent_nodes(gn, nodes)
draw_sub_graph(gn, sub_graph)
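The figure above shows very directly that Ocean's Twelve and Ocean's Thirteen are closely related titles: they are DERECTED_BY the same director, CATEGORY_IN the same categories, COUNTRY_IN the same country, and most of the same actors ACTED_IN both of them.
The subgraph presents the information about these two titles very clearly.
Let's look at another example and analyze the information related to Superman Returns and Tom and Jerry: The Magic Ring.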
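The visual impression of "closely related" can also be put into numbers by comparing the neighbor sets of two title nodes. The sketch below is one possible way to do that, not something the original notebook computes: `neighbor_overlap` is a hypothetical helper, the graph is a made-up toy, and the Jaccard ratio is just one common choice of overlap measure.

```python
import networkx as nx

# Toy graph: two titles sharing a director and an actor, but not a category
g = nx.Graph()
g.add_edges_from([
    ("Movie A", "Dir X"), ("Movie A", "Actor 1"), ("Movie A", "Comedies"),
    ("Movie B", "Dir X"), ("Movie B", "Actor 1"), ("Movie B", "Dramas"),
])

def neighbor_overlap(G, u, v):
    """Return the shared neighbors of u and v and their Jaccard ratio."""
    nu, nv = set(G.neighbors(u)), set(G.neighbors(v))
    shared = nu & nv
    union = nu | nv
    # Guard against two isolated nodes, where the union is empty
    ratio = len(shared) / len(union) if union else 0.0
    return shared, ratio

shared, jaccard = neighbor_overlap(g, "Movie A", "Movie B")
print(sorted(shared), jaccard)
```

On the real graph one would call `neighbor_overlap(gn, "Ocean's Twelve", "Ocean's Thirteen")`; a ratio near 1 means the two titles share almost all of their people, categories, and countries.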
In [5]:
nodes = ["Superman Returns", "Tom and Jerry: The Magic Ring"]
sub_graph = get_adjacent_nodes(gn, nodes)
draw_sub_graph(gn, sub_graph)
We can see that Superman Returns and Tom and Jerry: The Magic Ring have no neighboring node in common.
To explore further how these two titles are related, we can use Dijkstra's shortest-path algorithm to find a shortest path from the Superman Returns node to the Tom and Jerry: The Magic Ring node.
In [6]:
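# Shortest path between the two titles (Dijkstra; every edge counts as weight 1 here)
path = nx.dijkstra_path(gn, "Superman Returns", "Tom and Jerry: The Magic Ring")
print("Shortest-path nodes:", path)
print("Shortest-path nodes and the relations between them:")
for u, v in zip(path, path[1:]):
    print(u, v, gn.edges[u, v]['label'])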
Shortest-path nodes: ['Superman Returns', 'Brandon Routh', 'Scott Pilgrim vs. the World', 'Comedies', 'Tom and Jerry: The Magic Ring']
Shortest-path nodes and the relations between them:
Superman Returns Brandon Routh ACTED_IN
Brandon Routh Scott Pilgrim vs. the World ACTED_IN
Scott Pilgrim vs. the World Comedies CATEGORY_IN
Comedies Tom and Jerry: The Magic Ring CATEGORY_IN
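Not every pair of titles is guaranteed to be connected. When no path exists, `nx.dijkstra_path` raises `nx.NetworkXNoPath`, so a lookup over arbitrary pairs may want to handle that case. Below is a minimal sketch on a made-up toy graph; `describe_path` is a hypothetical helper that returns the path as (node, node, relation) steps in the same shape as the output above.

```python
import networkx as nx

# Toy graph: two titles linked through one actor, plus an isolated title
g = nx.Graph()
g.add_edge("Movie A", "Actor 1", label="ACTED_IN")
g.add_edge("Movie B", "Actor 1", label="ACTED_IN")
g.add_node("Movie C", label="MOVIE")

def describe_path(G, source, target):
    """Return the shortest path as (u, v, relation) steps, or None if unreachable."""
    try:
        # Edges have no 'weight' attribute, so each edge counts as 1
        path = nx.dijkstra_path(G, source, target)
    except nx.NetworkXNoPath:
        return None
    return [(u, v, G.edges[u, v]["label"]) for u, v in zip(path, path[1:])]

print(describe_path(g, "Movie A", "Movie B"))
print(describe_path(g, "Movie A", "Movie C"))
```

Since every edge here has the same weight, a breadth-first search such as `nx.shortest_path` would find a path of the same length; Dijkstra becomes necessary only once edges carry different weights.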