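In previous articles we explained what a graph network is and what a graph node is. Here we use a concrete case to show the structure of a graph network intuitively.
    The data we use is the Netflix movies and TV shows dataset, which contains each title's name, director, country, cast, dates, and other information, so we can use it to build a graph network of film and TV data. The dataset has also been used in a public competition on Kaggle, and it can be downloaded from Kaggle.
    We load the data with pandas, which lets us quickly see its structure and the attributes of each title. The meaning of each column is as follows: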

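    Attribute     Meaning (multiple values are separated by commas)
    show_id       Unique ID of each title
    type          Whether the title is a movie or a TV show
    title         Name of the title
    director      Director(s)
    cast          Cast members
    country       Country or countries where the title was produced
    date_added    Date the title was added to Netflix
    release_year  Year the title was actually released
    rating        Content (maturity) rating, e.g. TV-MA or PG-13
    duration      Duration of the title
    listed_in     Categories the title is listed in
    description   Synopsis of the title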

    Here we select five attributes (title, director, cast, country, listed_in) and build a new data frame from them.
    In [1]:

    import pandas as pd
    # Load the data
    df = pd.read_csv('/home/jovyan/kernel/netflix_titles.csv')
    # Director column: return an empty list [] if the value is missing, otherwise the list of all directors
    df['directors'] = df['director'].apply(lambda l: [] if pd.isna(l) else [i.strip() for i in l.split(",")])
    # Cast column: return an empty list [] if the value is missing, otherwise the list of all actors
    df['actors'] = df['cast'].apply(lambda l: [] if pd.isna(l) else [i.strip() for i in l.split(",")])
    # Category column: return an empty list [] if the value is missing, otherwise the list of all categories
    df['categories'] = df['listed_in'].apply(lambda l: [] if pd.isna(l) else [i.strip() for i in l.split(",")])
    # Country column: return an empty list [] if the value is missing, otherwise the list of all countries
    df['countries'] = df['country'].apply(lambda l: [] if pd.isna(l) else [i.strip() for i in l.split(",")])
    # Keep the five attributes we need
    df = df[["title", "directors", "actors", "categories", "countries"]]
    df

    Out[1]:

    title directors actors categories countries
    0 Norm of the North: King Sized Adventure [Richard Finn, Tim Maltby] [Alan Marriott, Andrew Toth, Brian Dobson, Col… [Children & Family Movies, Comedies] [United States, India, South Korea, China]
    1 Jandino: Whatever it Takes [] [Jandino Asporaat] [Stand-Up Comedy] [United Kingdom]
    2 Transformers Prime [] [Peter Cullen, Sumalee Montano, Frank Welker, … [Kids’ TV] [United States]
    3 Transformers: Robots in Disguise [] [Will Friedle, Darren Criss, Constance Zimmer,… [Kids’ TV] [United States]
    4 #realityhigh [Fernando Lebrija] [Nesta Cooper, Kate Walsh, John Michael Higgin… [Comedies] [United States]
    5 Apaches [] [Alberto Ammann, Eloy Azorín, Verónica Echegui… [Crime TV Shows, International TV Shows, Spani… [Spain]
    6 Automata [Gabe Ibáñez] [Antonio Banderas, Dylan McDermott, Melanie Gr… [International Movies, Sci-Fi & Fantasy, Thril… [Bulgaria, United States, Spain, Canada]
    7 Fabrizio Copano: Solo pienso en mi [Rodrigo Toro, Francisco Schultz] [Fabrizio Copano] [Stand-Up Comedy] [Chile]
    8 Fire Chasers [] [] [Docuseries, Science & Nature TV] [United States]
    9 Good People [Henrik Ruben Genz] [James Franco, Kate Hudson, Tom Wilkinson, Oma… [Action & Adventure, Thrillers] [United States, United Kingdom, Denmark, Sweden]
    10 Joaquín Reyes: Una y no más [José Miguel Contreras] [Joaquín Reyes] [Stand-Up Comedy] []
    11 Kidnapping Mr. Heineken [Daniel Alfredson] [Jim Sturgess, Sam Worthington, Ryan Kwanten, … [Action & Adventure, Dramas, International Mov… [Netherlands, Belgium, United Kingdom, United …
    12 Krish Trish and Baltiboy [] [Damandeep Singh Baggan, Smita Malhotra, Baba … [Children & Family Movies] []
    13 Krish Trish and Baltiboy: Battle of Wits [Munjal Shroff, Tilak Shetty] [Damandeep Singh Baggan, Smita Malhotra, Baba … [Children & Family Movies] []
    14 Krish Trish and Baltiboy: Best Friends Forever [Munjal Shroff, Tilak Shetty] [Damandeep Singh Baggan, Smita Malhotra, Deepa… [Children & Family Movies] []
    15 Krish Trish and Baltiboy: Comics of India [Tilak Shetty] [Damandeep Singh Baggan, Smita Malhotra, Baba … [Children & Family Movies] []
    16 Krish Trish and Baltiboy: Oversmartness Never … [Tilak Shetty] [Rishi Gambhir, Smita Malhotra, Deepak Chachra] [Children & Family Movies] []
    17 Krish Trish and Baltiboy: Part II [] [Damandeep Singh Baggan, Smita Malhotra, Baba … [Children & Family Movies] []
    18 Krish Trish and Baltiboy: The Greatest Trick [Munjal Shroff, Tilak Shetty] [Damandeep Singh Baggan, Smita Malhotra, Baba … [Children & Family Movies] []
    19 Love [Gaspar Noé] [Karl Glusman, Klara Kristin, Aomi Muyock, Ugo… [Cult Movies, Dramas, Independent Movies] [France, Belgium]
    20 Manhattan Romance [Tom O’Brien] [Tom O’Brien, Katherine Waterston, Caitlin Fit… [Comedies, Independent Movies, Romantic Movies] [United States]
    21 Moonwalkers [Antoine Bardou-Jacquet] [Ron Perlman, Rupert Grint, Robert Sheehan, St… [Action & Adventure, Comedies, International M… [France, Belgium]
    22 Rolling Papers [Mitch Dickman] [] [Documentaries] [United States, Uruguay]
    23 Stonehearst Asylum [Brad Anderson] [Kate Beckinsale, Jim Sturgess, David Thewlis,… [Horror Movies, Thrillers] [United States]
    24 The Runner [Austin Stark] [Nicolas Cage, Sarah Paulson, Connie Nielsen, … [Dramas, Independent Movies] [United States]
    25 6 Years [Hannah Fidell] [Taissa Farmiga, Ben Rosenfield, Lindsay Burdg… [Dramas, Independent Movies, Romantic Movies] [United States]
    26 Castle of Stars [] [Chaiyapol Pupart, Jintanutda Lummakanon, Worr… [International TV Shows, Romantic TV Shows, TV… []
    27 City of Joy [Madeleine Gavin] [] [Documentaries] [United States, ]
    28 First and Last [] [] [Docuseries] []
    29 Laddaland [Sopon Sukdapisit] [Saharat Sangkapreecha, Pok Piyatida Woramusik… [Horror Movies, International Movies] [Thailand]
    6204 Cuckoo [] [Andy Samberg, Taylor Lautner, Greg Davies, He… [British TV Shows, International TV Shows, TV … [United Kingdom]
    6205 Pororo - The Little Penguin [] [] [Kids’ TV, Korean TV Shows] [South Korea]
    6206 Samantha! [] [Emmanuelle Araújo, Douglas Silva, Sabrina Non… [International TV Shows, TV Comedies] [Brazil]
    6207 Murderous Affairs [] [] [Crime TV Shows, Docuseries] [United States]
    6208 Lost Girl [] [Anna Silk, Kris Holden-Ried, Ksenia Solo, Ric… [TV Dramas, TV Horror, TV Mysteries] [Canada]
    6209 Mr. Young [] [Brendan Meyer, Matreya Fedor, Gig Morton, Kur… [Kids’ TV, TV Comedies] [Canada]
    6210 Psiconautas [] [Guillermo Toledo, Gabriel Goity, Florencia Pe… [International TV Shows, Spanish-Language TV S… [Argentina]
    6211 The Minimighty Kids [] [] [Kids’ TV, TV Comedies] [France]
    6212 Filinta [] [Onur Tuna, Serhat Tutumluer, Mehmet Özgür, Na… [Crime TV Shows, International TV Shows, TV Ac… [Turkey]
    6213 Leyla and Mecnun [Onur Ünlü] [Ali Atay, Melis Birkan, Serkan Keskin, Ahmet … [International TV Shows, Romantic TV Shows, TV… [Turkey]
    6214 Chelsea [] [] [Stand-Up Comedy & Talk Shows, TV Comedies] [United States]
    6215 Crazy Ex-Girlfriend [] [Rachel Bloom, Vincent Rodriguez III, Santino … [Romantic TV Shows, TV Comedies] [United States]
    6216 The Magic School Bus Rides Again [] [Kate McKinnon, Miles Koseleci-Vieira, Mikaela… [Kids’ TV] [United States]
    6217 New Girl [] [Zooey Deschanel, Jake Johnson, Max Greenfield… [Romantic TV Shows, TV Comedies] [United States]
    6218 Talking Tom and Friends [] [Colin Hanks, Tom Kenny, James Adomian, Lisa S… [Kids’ TV, TV Comedies] [Cyprus, Austria, Thailand]
    6219 Pokémon the Series [] [Sarah Natochenny, Laurie Hymes, Jessica Paque… [Anime Series, Kids’ TV] [Japan]
    6220 Justin Time [] [Gage Munroe, Scott McCord, Jenna Warren] [Kids’ TV] [Canada]
    6221 Terrace House: Boys & Girls in the City [] [You, Reina Triendl, Ryota Yamasato, Yoshimi T… [International TV Shows, Reality TV] [Japan]
    6222 Weeds [] [Mary-Louise Parker, Hunter Parrish, Alexander… [TV Comedies, TV Dramas] [United States]
    6223 Gunslinger Girl [] [Yuuka Nanri, Kanako Mitsuhashi, Eri Sendai, A… [Anime Series, Crime TV Shows] [Japan]
    6224 Anthony Bourdain: Parts Unknown [] [Anthony Bourdain] [Docuseries] [United States]
    6225 Frasier [] [Kelsey Grammer, Jane Leeves, David Hyde Pierc… [Classic & Cult TV, TV Comedies] [United States]
    6226 La Familia P. Luche [] [Eugenio Derbez, Consuelo Duval, Luis Manuel Á… [International TV Shows, Spanish-Language TV S… [United States]
    6227 The Adventures of Figaro Pho [] [Luke Jurevicius, Craig Behenna, Charlotte Ham… [Kids’ TV, TV Comedies] [Australia]
    6228 Kikoriki [] [Igor Dmitriev] [Kids’ TV] []
    6229 Red vs. Blue [] [Burnie Burns, Jason Saldaña, Gustavo Sorola, … [TV Action & Adventure, TV Comedies, TV Sci-Fi… [United States]
    6230 Maron [] [Marc Maron, Judd Hirsch, Josh Brener, Nora Ze… [TV Comedies] [United States]
    6231 Little Baby Bum: Nursery Rhyme Friends [] [] [Movies] []
    6232 A Young Doctor’s Notebook and Other Stories [] [Daniel Radcliffe, Jon Hamm, Adam Godley, Chri… [British TV Shows, TV Comedies, TV Dramas] [United Kingdom]
    6233 Friends [] [Jennifer Aniston, Courteney Cox, Lisa Kudrow,… [Classic & Cult TV, TV Comedies] [United States]

    6234 rows × 5 columns
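
    Row 27 of the output above contains an empty country name, [United States, ], which is what splitting a value with a trailing comma would produce. As a minimal sketch, assuming we would rather drop such empty pieces, the splitting step could be written as a small helper. The name split_names is hypothetical and not part of the original notebook, and plain Python values stand in for pandas cells here.

```python
import math


def split_names(value):
    """Split a comma-separated cell into a list of clean names."""
    # Missing cells: None, or float NaN (the value pandas uses for empty cells)
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return []
    # Strip whitespace around each piece and drop pieces that end up empty
    return [piece.strip() for piece in value.split(",") if piece.strip()]


print(split_names("United States, India, South Korea"))
print(split_names("United States,"))  # trailing comma, no empty name is kept
print(split_names(float("nan")))      # missing value gives an empty list
```

    With pandas, the same helper could be passed to apply in place of the lambda, for example df['country'].apply(split_names).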

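    Next, we use the data above to build the graph network.
    As we already know, a graph network consists of graph nodes and the edges that connect them.
    From the five attributes above we can create four kinds of nodes: movie nodes (one per title, whether movie or TV show), category nodes, person nodes, and country nodes. How, then, should we construct the edges that connect them?
    We can express the relations between these attributes as follows: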

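    • ACTED_IN: the actor appears in the title
    • DERECTED_BY: the director directed the title
    • CATEGORY_IN: the title belongs to the category
    • COUNTRY_IN: the title was produced in the country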
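
    To make these four relations concrete, here is a minimal sketch that turns one row into (title, node, relation) triples, one per edge. The helper row_to_edges and the sample row are hypothetical and only illustrate the scheme; the field names follow the data frame built above, and the labels are spelled as in this article.

```python
# Which list column produces which relation label
RELATIONS = {
    "actors": "ACTED_IN",
    "directors": "DERECTED_BY",
    "categories": "CATEGORY_IN",
    "countries": "COUNTRY_IN",
}


def row_to_edges(row):
    """Return one (title, other_node, relation) triple per edge of a row."""
    edges = []
    for column, relation in RELATIONS.items():
        for name in row[column]:
            edges.append((row["title"], name, relation))
    return edges


# Hypothetical sample row, not taken from the dataset
sample = {
    "title": "Movie A",
    "directors": ["Director X"],
    "actors": ["Actor Y", "Actor Z"],
    "categories": ["Comedies"],
    "countries": [],
}
for edge in row_to_edges(sample):
    print(edge)
```

    An empty list, such as countries in the sample row, simply contributes no edges.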

    Having defined the graph nodes and edges, we use networkx to build our graph network.
    In [2]:

    import networkx as nx
    # Initialize a graph network (gn)
    gn = nx.Graph(label="Netflix")
    # Iterate over the data to add nodes and edges to the graph network
    for i, row in df.iterrows():
        # Add the movie node
        gn.add_node(row['title'], label="MOVIE")
        # Iterate over the actor list
        for actor in row['actors']:
            # Add a person node
            gn.add_node(actor, label="PERSON")
            # Add the edge between this person and this title, with relation ACTED_IN
            gn.add_edge(row['title'], actor, label="ACTED_IN")
        # Iterate over the director list
        for director in row['directors']:
            # Add a person node
            gn.add_node(director, label="PERSON")
            # Add the edge between this person and this title, with relation DERECTED_BY
            gn.add_edge(row['title'], director, label="DERECTED_BY")
        # Iterate over the category list
        for cat in row['categories']:
            # Add a category node
            gn.add_node(cat, label="CATEGORY")
            # Add the edge between this category and this title, with relation CATEGORY_IN
            gn.add_edge(row['title'], cat, label="CATEGORY_IN")
        # Iterate over the list of countries involved
        for cou in row['countries']:
            # Add a country node
            gn.add_node(cou, label="COUNTRY")
            # Add the edge between this country and this title, with relation COUNTRY_IN
            gn.add_edge(row['title'], cou, label="COUNTRY_IN")
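
    One detail of the construction above is worth checking: nx.Graph stores at most one edge per pair of nodes, so when the same person both acts in and directs a title (Tom O'Brien in Manhattan Romance in the table above), the second add_edge call updates the existing edge and its label instead of adding another edge. The sketch below shows this on a two-node graph and, as one possible alternative rather than what this article does, uses nx.MultiGraph to keep both relations.

```python
import networkx as nx

title, person = "Manhattan Romance", "Tom O'Brien"

# Simple graph: the second call overwrites the label of the existing edge
g = nx.Graph()
g.add_edge(title, person, label="ACTED_IN")
g.add_edge(title, person, label="DERECTED_BY")
print(g.number_of_edges(), g.edges[title, person]["label"])  # prints: 1 DERECTED_BY

# Multigraph: parallel edges are allowed, so both relations survive
mg = nx.MultiGraph()
mg.add_edge(title, person, label="ACTED_IN")
mg.add_edge(title, person, label="DERECTED_BY")
labels = sorted(data["label"] for _, _, data in mg.edges(data=True))
print(mg.number_of_edges(), labels)  # prints: 2 ['ACTED_IN', 'DERECTED_BY']
```

    Switching to a multigraph would also change how edges are looked up and drawn, so the simple graph is a reasonable choice when one label per pair of nodes is enough.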

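    Once the graph network is built, we can visualize it by plotting to see its structure more intuitively.
    Because the graph gn built here is fairly large, with many nodes, plotting all of it looks cluttered, so we draw sub-graphs to look at the relations among a subset of nodes.
    To make drawing sub-graphs easier, we define the following two functions:

    • get_adjacent_nodes: given some graph nodes, collect all of their adjacent nodes to form a sub-graph
    • draw_sub_graph: draw the sub-graph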

    In [3]:

    import matplotlib.pyplot as plt
    # The 'seaborn' style was renamed to 'seaborn-v0_8' in matplotlib 3.6, so pick whichever name is available
    plt.style.use('seaborn' if 'seaborn' in plt.style.available else 'seaborn-v0_8')
    plt.rcParams['figure.figsize'] = [14, 14]

    def get_adjacent_nodes(G, nodes):
        # Collect the given nodes together with all of their direct neighbors
        sub_graph = set()
        for n in nodes:
            sub_graph.add(n)
            for e in G.neighbors(n):
                sub_graph.add(e)
        return list(sub_graph)

    def draw_sub_graph(G, sub_graph):
        # Extract the sub-graph sub_graph from the graph network G
        subgraph = G.subgraph(sub_graph)
        pos = nx.spring_layout(subgraph)
        # Assign a color and a size to each kind of node
        node_colors = []
        node_sizes = []
        for n in subgraph.nodes():
            if G.nodes[n]['label'] == "MOVIE":
                node_colors.append('blue')
                node_sizes.append(700)
            elif G.nodes[n]['label'] == "PERSON":
                node_colors.append('red')
                node_sizes.append(600)
            elif G.nodes[n]['label'] == "CATEGORY":
                node_colors.append('green')
                node_sizes.append(500)
            elif G.nodes[n]['label'] == "COUNTRY":
                node_colors.append('yellow')
                node_sizes.append(400)
        nx.draw(subgraph, pos, with_labels=True, node_color=node_colors, node_size=node_sizes, width=2, font_size=15)
        # Draw a label on each edge, taken from the edge's 'label' attribute
        edge_labels = nx.get_edge_attributes(subgraph, 'label')
        nx.draw_networkx_edge_labels(subgraph, pos, edge_labels=edge_labels, font_color="red")

    Next, we can use this graph network to analyze our titles.
    Let us first look at an example: the sub-graph containing Ocean's Twelve and Ocean's Thirteen.
    In [4]:
    nodes = ["Ocean's Twelve", "Ocean's Thirteen"]
    sub_graph = get_adjacent_nodes(gn, nodes)
    draw_sub_graph(gn, sub_graph)

    [Figure: sub-graph of Ocean's Twelve and Ocean's Thirteen with their adjacent nodes (image.png)]

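    The figure above shows very directly that Ocean's Twelve and Ocean's Thirteen are two closely related titles: they are DERECTED_BY the same director, CATEGORY_IN the same categories, COUNTRY_IN the same country, and most of the same actors ACTED_IN both.
    This sub-graph presents the information about the two titles very clearly.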
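
    The similarity seen in the figure can also be read off the graph directly: the nodes that two titles share are exactly their common neighbors. A minimal sketch on a hypothetical toy graph (the node names below are placeholders, not dataset values), using nx.common_neighbors:

```python
import networkx as nx

# Hypothetical toy graph with the same shape as gn: titles linked to people and categories
g = nx.Graph()
g.add_edges_from([
    ("Movie A", "Director X"), ("Movie B", "Director X"),
    ("Movie A", "Comedies"), ("Movie B", "Comedies"),
    ("Movie A", "Actor Y"), ("Movie B", "Actor Z"),
])

# Nodes adjacent to both titles; sorted for a stable order
shared = sorted(nx.common_neighbors(g, "Movie A", "Movie B"))
print(shared)  # prints: ['Comedies', 'Director X']
```

    On the full graph, the same call with gn and the two Ocean's titles should list the shared director, categories, country, and actors seen in the figure.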
    Let us look at another example and analyze the information related to Superman Returns and Tom and Jerry: The Magic Ring.
    In [5]:

    nodes = ["Superman Returns", "Tom and Jerry: The Magic Ring"]
    sub_graph = get_adjacent_nodes(gn, nodes)
    draw_sub_graph(gn, sub_graph)

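    [Figure: sub-graph of Superman Returns and Tom and Jerry: The Magic Ring with their adjacent nodes (image.png)]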

    We can see that Superman Returns and Tom and Jerry: The Magic Ring have no node that connects them directly.
    To analyze how the two titles are related anyway, we can use Dijkstra's shortest-path algorithm to find a shortest path between the Superman Returns node and the Tom and Jerry: The Magic Ring node.
    In [6]:

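    source, target = "Superman Returns", "Tom and Jerry: The Magic Ring"
    # No edge has a weight, so every edge counts as 1 and Dijkstra returns a path with the fewest edges
    path = nx.dijkstra_path(gn, source, target)
    print("Nodes on the shortest path:", path)
    print("Nodes on the shortest path and the relations between them:")
    for u, v in zip(path, path[1:]):
        print(u, v, gn.edges[u, v]['label'])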

    Nodes on the shortest path: ['Superman Returns', 'Brandon Routh', 'Scott Pilgrim vs. the World', 'Comedies', 'Tom and Jerry: The Magic Ring']
    Nodes on the shortest path and the relations between them:
    Superman Returns Brandon Routh ACTED_IN
    Brandon Routh Scott Pilgrim vs. the World ACTED_IN
    Scott Pilgrim vs. the World Comedies CATEGORY_IN
    Comedies Tom and Jerry: The Magic Ring CATEGORY_IN
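
    Since none of the edges in gn carries a weight, every edge counts as length 1, and a shortest path is simply one with the fewest edges. In that setting Dijkstra's algorithm finds the same path lengths as a breadth-first search. The following is a minimal pure-Python sketch of that search on a hypothetical toy graph; bfs_shortest_path and the node names are illustrative and not part of networkx or the dataset.

```python
from collections import deque


def bfs_shortest_path(adjacency, source, target):
    """Return a path with the fewest edges from source to target, or None."""
    parent = {source: None}  # also serves as the visited set
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            # Walk back through the parents to rebuild the path
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for neighbor in adjacency.get(node, ()):
            if neighbor not in parent:
                parent[neighbor] = node
                queue.append(neighbor)
    return None


# Hypothetical undirected toy graph as an adjacency dict
adjacency = {
    "Movie A": ["Actor X"],
    "Actor X": ["Movie A", "Movie B"],
    "Movie B": ["Actor X", "Comedies"],
    "Comedies": ["Movie B", "Movie C"],
    "Movie C": ["Comedies"],
}
print(bfs_shortest_path(adjacency, "Movie A", "Movie C"))
```

    The toy path has the same shape as the result above: two titles joined through a shared actor, a third title, and a shared category.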