说到 GNN, 各位炼丹师会想到哪些算法呢? 不管想到哪些算法, 我们真正用到过哪些呢? 确实我们可能都看过 GNN 相关论文, 但是还是缺乏实战经验的. 特别是对于推荐系统而言, 我们又该如何应用这些模型呢? 下面我们就从 DeepWalk 这篇论文开始, 先讲原理, 再讲实战, 最后讲应用.

GNN 相关背景知识

GNN 的本质, 是要学习网络中每个节点的表达的, 这些潜在的表达对图中每个节点的 “社交” 关系进行了编码, 把离散值的节点编码成稠密向量, 后续可用于分类回归, 或者作为下游任务的特征. Deepwalk 充分利用了随机游走提取的“句子”, 去学习句子中每个单词的表达. Deepwalk 原文就提到了在数据稀疏的情况下可以把 F1-scores 提升 10%, 在一些实验中, 能够用更少的训练数据获得了更好的效果. 看下图的例子:

大有可为的GNN:DeepWalk - 图1

Deepwalk

问题定义:

先把问题定义为给社交网络中的每个节点进行分类, 图可以表达为 G=,V 就是图上所有节点, E 是所有边. 有一部分有 label 的数据 GL=(V,E,X,Y),X 就是节点的特征, Y 就是分类的 label. 在传统机器学习算法中, 我们都是直接学习 (X,Y), 并没有充分利用节点间的依赖关系. Deepwalk 可以捕捉图上的拓扑关系, 通过无监督方法学习每个节点的特征, 学到的图特征和标签的分布是相互独立的.

随机游走:

选定一个根节点,“随机” 走出一条路径, 基于相邻的节点必然相似, 我们就可以用这种策略去挖掘网络的社群信息. 随机游走很方便并行, 可以同时提取一张图上各个部分的信息. 即使图有小的改动, 这些路径也不需要重新计算. 和 word 的出现频率类似, 通过随机游走得到的节点访问频率都符合幂律分布, 所以我们就可以用 NLP 相关方法对随机游走结果做相似处理, 如下图所示:

大有可为的GNN:DeepWalk - 图2

所以在随机游走后, 我们只需要通过下公式, 学习节点向量即可:

大有可为的GNN:DeepWalk - 图3

该公式就是 skip-gram, 通过某个节点学习它左右的节点. 我们都知道 skip-gram 用于文本时的语料库就是一个个句子, 现在我们提取图的句子. 如下所示:

大有可为的GNN:DeepWalk - 图4
大有可为的GNN:DeepWalk - 图5

算法很简单, 把所有节点顺序打乱 (加速收敛), 然后遍历这些节点随机游走出序列, 再通过 skipgram 算法去拟合每个节点的向量. 如此反复. 注: 这里的随机是均匀分布去随机. 当然有些图会有些 “副产物”, 比如用户浏览网页的顺序, 可以直接输入到模型.

接下来我们看下 deepwalks 的核心代码:

  1. # 代码来源
  2. # phanein/deepwalk
  3. # Random walk
  4. with open(f, 'w') as fout:
  5. for walk in graph.build_deepwalk_corpus_iter(G=G, # 图
  6. num_paths=num_paths, # 路径数
  7. path_length=path_length, # 路径长度
  8. alpha=alpha, #
  9. rand=rand): #
  10. fout.write(u"{}\n".format(u" ".join(v for v in walk)))
  11. class Graph(defaultdict):
  12. """
  13. Efficient basic implementation of nx
  14. 这里我们看到,
  15. 该类继承defaultdict,
  16. 图其实可以简单的表示为dict,
  17. key为节点,value为与之相连的节点
  18. """
  19. def __init__(self):
  20. super(Graph, self).__init__(list)
  21. def nodes(self):
  22. return self.keys()
  23. def adjacency_iter(self):
  24. return self.iteritems()
  25. def subgraph(self, nodes={}):
  26. # 提取子图
  27. subgraph = Graph()
  28. for n in nodes:
  29. if n in self:
  30. subgraph[n] = [x for x in self[n] if x in nodes]
  31. return subgraph
  32. def make_undirected(self):
  33. #因为是无向图,所以v in self[u]并且 u in self[v]
  34. t0 = time()
  35. for v in list(self):
  36. for other in self[v]:
  37. if v != other:
  38. self[other].append(v)
  39. t1 = time()
  40. logger.info('make_directed: added missing edges {}s'.format(t1-t0))
  41. self.make_consistent()
  42. return self
  43. def make_consistent(self):
  44. # 去重
  45. t0 = time()
  46. for k in iterkeys(self):
  47. self[k] = list(sorted(set(self[k])))
  48. t1 = time()
  49. logger.info('make_consistent: made consistent in {}s'.format(t1-t0))
  50. self.remove_self_loops()
  51. return self
  52. def remove_self_loops(self):
  53. # 自已不会与自己有连接
  54. removed = 0
  55. t0 = time()
  56. for x in self:
  57. if x in self[x]:
  58. self[x].remove(x)
  59. removed += 1
  60. t1 = time()
  61. logger.info('remove_self_loops: removed {} loops in {}s'.format(removed, (t1-t0)))
  62. return self
  63. def check_self_loops(self):
  64. for x in self:
  65. for y in self[x]:
  66. if x == y:
  67. return True
  68. return False
  69. def has_edge(self, v1, v2):
  70. # 两个节点是否有边
  71. if v2 in self[v1] or v1 in self[v2]:
  72. return True
  73. return False
  74. def degree(self, nodes=None):
  75. # 节点的度数
  76. if isinstance(nodes, Iterable):
  77. return {v:len(self[v]) for v in nodes}
  78. else:
  79. return len(self[nodes])
  80. def order(self):
  81. "Returns the number of nodes in the graph"
  82. return len(self)
  83. def number_of_edges(self):
  84. # 图中边的数量
  85. "Returns the number of nodes in the graph"
  86. return sum([self.degree(x) for x in self.keys()])/2
  87. def number_of_nodes(self):
  88. # 图中结点数量
  89. "Returns the number of nodes in the graph"
  90. return self.order()
  91. # 核心代码
  92. def random_walk(self, path_length, alpha=0, rand=random.Random(), start=None):
  93. """ Returns a truncated random walk.
  94. path_length: Length of the random walk.
  95. alpha: probability of restarts.
  96. start: the start node of the random walk.
  97. """
  98. G = self
  99. if start:
  100. path = [start]
  101. else:
  102. # Sampling is uniform w.r.t V, and not w.r.t E
  103. path = [rand.choice(list(G.keys()))]
  104. while len(path) < path_length:
  105. cur = path[-1]
  106. if len(G[cur]) > 0:
  107. if rand.random() >= alpha:
  108. path.append(rand.choice(G[cur])) # 相邻节点随机选
  109. else:
  110. path.append(path[0]) # 有一定概率选择回到起点
  111. else:
  112. break
  113. return [str(node) for node in path]
  114. # TODO add build_walks in here
  115. def build_deepwalk_corpus(G, num_paths, path_length, alpha=0,
  116. rand=random.Random(0)):
  117. walks = []
  118. nodes = list(G.nodes())
  119. # 这里和上面论文算法流程对应
  120. for cnt in range(num_paths): # 外循环,相当于要迭代多少epoch
  121. rand.shuffle(nodes) # 打乱nodes顺序,加速收敛
  122. for node in nodes: # 每个node都会产生一条路径
  123. walks.append(G.random_walk(path_length, rand=rand, alpha=alpha, start=node))
  124. return walks
  125. def build_deepwalk_corpus_iter(G, num_paths, path_length, alpha=0,
  126. rand=random.Random(0)):
  127. # 流式处理用
  128. walks = []
  129. nodes = list(G.nodes())
  130. for cnt in range(num_paths):
  131. rand.shuffle(nodes)
  132. for node in nodes:
  133. yield G.random_walk(path_length, rand=rand, alpha=alpha, start=node)
  134. def clique(size):
  135. return from_adjlist(permutations(range(1,size+1)))
  136. # How do you split a list into evenly sized chunks?
  137. def grouper(n, iterable, padvalue=None):
  138. "grouper(3, 'abcdefg', 'x') --> ('a','b','c'), ('d','e','f'), ('g','x','x')"
  139. return zip_longest(*[iter(iterable)]*n, fillvalue=padvalue)
  140. def parse_adjacencylist(f):
  141. adjlist = []
  142. for l in f:
  143. if l and l[0] != "#":
  144. introw = [int(x) for x in l.strip().split()]
  145. row = [introw[0]]
  146. row.extend(set(sorted(introw[1:])))
  147. adjlist.extend([row])
  148. return adjlist
  149. def parse_adjacencylist_unchecked(f):
  150. adjlist = []
  151. for l in f:
  152. if l and l[0] != "#":
  153. adjlist.extend([[int(x) for x in l.strip().split()]])
  154. return adjlist
  155. def load_adjacencylist(file_, undirected=False, chunksize=10000, unchecked=True):
  156. if unchecked:
  157. parse_func = parse_adjacencylist_unchecked
  158. convert_func = from_adjlist_unchecked
  159. else:
  160. parse_func = parse_adjacencylist
  161. convert_func = from_adjlist
  162. adjlist = []
  163. t0 = time()
  164. total = 0
  165. with open(file_) as f:
  166. for idx, adj_chunk in enumerate(map(parse_func, grouper(int(chunksize), f))):
  167. adjlist.extend(adj_chunk)
  168. total += len(adj_chunk)
  169. t1 = time()
  170. logger.info('Parsed {} edges with {} chunks in {}s'.format(total, idx, t1-t0))
  171. t0 = time()
  172. G = convert_func(adjlist)
  173. t1 = time()
  174. logger.info('Converted edges to graph in {}s'.format(t1-t0))
  175. if undirected:
  176. t0 = time()
  177. G = G.make_undirected()
  178. t1 = time()
  179. logger.info('Made graph undirected in {}s'.format(t1-t0))
  180. return G
  181. def load_edgelist(file_, undirected=True):
  182. G = Graph()
  183. with open(file_) as f:
  184. for l in f:
  185. x, y = l.strip().split()[:2]
  186. x = int(x)
  187. y = int(y)
  188. G[x].append(y)
  189. if undirected:
  190. G[y].append(x)
  191. G.make_consistent()
  192. return G
  193. def load_matfile(file_, variable_name="network", undirected=True):
  194. mat_varables = loadmat(file_)
  195. mat_matrix = mat_varables[variable_name]
  196. return from_numpy(mat_matrix, undirected)
  197. def from_networkx(G_input, undirected=True):
  198. G = Graph()
  199. for idx, x in enumerate(G_input.nodes()):
  200. for y in iterkeys(G_input[x]):
  201. G[x].append(y)
  202. if undirected:
  203. G.make_undirected()
  204. return G
  205. def from_numpy(x, undirected=True):
  206. G = Graph()
  207. if issparse(x):
  208. cx = x.tocoo()
  209. for i,j,v in zip(cx.row, cx.col, cx.data):
  210. G[i].append(j)
  211. else:
  212. raise Exception("Dense matrices not yet supported.")
  213. if undirected:
  214. G.make_undirected()
  215. G.make_consistent()
  216. return G
  217. def from_adjlist(adjlist):
  218. G = Graph()
  219. for row in adjlist:
  220. node = row[0]
  221. neighbors = row[1:]
  222. G[node] = list(sorted(set(neighbors)))
  223. return G
  224. def from_adjlist_unchecked(adjlist):
  225. G = Graph()
  226. for row in adjlist:
  227. node = row[0]
  228. neighbors = row[1:]
  229. G[node] = neighbors
  230. return G

至于 skipgram, 大家可以直接用 gensim 工具即可.

from gensim.models import Word2Vec
from gensim.models.word2vec import Vocab

logger = logging.getLogger("deepwalk")

class Skipgram(Word2Vec):
    """A subclass to allow more customization of the Word2Vec internals."""

    def __init__(self, vocabulary_counts=None, **kwargs):

        self.vocabulary_counts = None

        kwargs["min_count"] = kwargs.get("min_count", 0)
        kwargs["workers"] = kwargs.get("workers", cpu_count())
        kwargs["size"] = kwargs.get("size", 128)
        kwargs["sentences"] = kwargs.get("sentences", None)
        kwargs["window"] = kwargs.get("window", 10)
        kwargs["sg"] = 1
        kwargs["hs"] = 1

        if vocabulary_counts != None:
          self.vocabulary_counts = vocabulary_counts

        super(Skipgram, self).__init__(**kwargs)

应用

在推荐场景中, 无论是推荐商品还是广告, 用户和 item 其实都可以通过点击 / 转化 / 购买等行为构建二部图, 在此二部图中进行随机游走, 学习每个节点的向量, 在特定场景, 缺乏特征和标签的情况下, 可以通过 user2user 或者 item2iterm 的方式, 很好的泛化到其他的标签. GNN 提取的向量也可以用于下游双塔召回模型或者排序模型. 如果有社交网络, 通过挖掘人与人直接的关系提取特征, 供下游任务也是个不错的选择. 当然大家也可以尝试在一些推荐比赛中用于丰富特征.

大有可为的GNN:DeepWalk - 图6

听说 GNN 大有可为, 从这篇开始学以致用mp.weixin.qq.com大有可为的GNN:DeepWalk - 图7

https://zhuanlan.zhihu.com/p/389218052