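Spark GraphX official documentation
GraphX extends the Spark RDD by introducing a new abstraction
(a directed multigraph with properties attached to each vertex and edge)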
Property Graph:
1. A directed multigraph is a directed graph with potentially multiple parallel edges
sharing the same source and destination vertex.
2. Each vertex is keyed by a unique 64-bit long identifier (VertexId).
GraphX does not impose any ordering constraints on the vertex identifiers.
3. GraphX optimizes the storage of vertex and edge attributes when they are primitive data types.
4. Like RDDs, property graphs are immutable, distributed, and fault-tolerant. A transformation produces a new graph, and the unaffected parts of the original graph are reused in it.
5. The classes VertexRDD[VD] and EdgeRDD[ED] extend, and are optimized versions of, RDD[(VertexId, VD)] and RDD[Edge[ED]] respectively;
for now they can be thought of as simply RDDs of the form RDD[(VertexId, VD)] and RDD[Edge[ED]].
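The points above can be sketched on a two-vertex graph. This is a minimal illustration, not an excerpt from the GraphX guide: the object name PropertyGraphBasics, the vertex IDs, and the attribute values are made up for the example, and a local Spark installation is assumed.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, EdgeRDD, Graph, VertexRDD}

object PropertyGraphBasics {
  // Returns (number of edges, sum of the original vertex attributes,
  // sum of the vertex attributes after mapVertices).
  def run(sc: SparkContext): (Long, Int, Int) = {
    // Vertex IDs are plain 64-bit Longs; they need not be contiguous or ordered.
    val vertices = sc.parallelize(Seq((100L, 10), (7L, 20)))
    // Two parallel edges share the same source and destination,
    // which a multigraph allows.
    val edges = sc.parallelize(Seq(Edge(100L, 7L, "first"), Edge(100L, 7L, "second")))
    val graph: Graph[Int, String] = Graph(vertices, edges)

    // The graph exposes its data through the optimized RDD subclasses.
    val vs: VertexRDD[Int] = graph.vertices
    val es: EdgeRDD[String] = graph.edges

    // Graphs are immutable: mapVertices returns a new graph and leaves `graph` unchanged.
    val doubled: Graph[Int, String] = graph.mapVertices((_, attr) => attr * 2)

    (es.count(), vs.map(_._2).reduce(_ + _), doubled.vertices.map(_._2).reduce(_ + _))
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("PropertyGraphBasics")
    val sc = new SparkContext(conf)
    try println(run(sc)) finally sc.stop()
  }
}
```

Both parallel edges survive graph construction; they would only be merged by an explicit call such as groupEdges.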
Example Property Graph:
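import org.apache.spark.graphx.VertexId
import org.apache.spark.rdd.RDD
// Vertices: (id, (username, occupation)); sc is an existing SparkContext
val users: RDD[(VertexId, (String, String))] =
  sc.parallelize(Array((3L, ("rxin", "student")), (7L, ("jgonzal", "postdoc")),
                       (5L, ("franklin", "prof")), (2L, ("istoica", "prof"))))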
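The vertex RDD alone is not yet a graph. One way to complete the example is sketched below. The edge list, the default user, and the object name ExamplePropertyGraph are assumptions added for illustration; only the users data comes from the notes above.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

object ExamplePropertyGraph {
  def build(sc: SparkContext): Graph[(String, String), String] = {
    // Vertices: (id, (username, occupation))
    val users: RDD[(VertexId, (String, String))] =
      sc.parallelize(Seq((3L, ("rxin", "student")), (7L, ("jgonzal", "postdoc")),
                         (5L, ("franklin", "prof")), (2L, ("istoica", "prof"))))
    // Assumed edges for illustration: Edge(srcId, dstId, relationship)
    val relationships: RDD[Edge[String]] =
      sc.parallelize(Seq(Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"),
                         Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))
    // Attribute given to any vertex that an edge mentions but `users` does not define
    val defaultUser = ("John Doe", "Missing")
    Graph(users, relationships, defaultUser)
  }

  // Counts the vertices whose occupation is "postdoc".
  def countPostdocs(graph: Graph[(String, String), String]): Long =
    graph.vertices.filter { case (_, (_, occupation)) => occupation == "postdoc" }.count()

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("ExamplePropertyGraph")
    val sc = new SparkContext(conf)
    try {
      val graph = build(sc)
      println(s"postdocs: ${countPostdocs(graph)}")
      // Each triplet joins an edge with the attributes of both endpoints.
      graph.triplets
        .map(t => s"${t.srcAttr._1} is the ${t.attr} of ${t.dstAttr._1}")
        .collect()
        .foreach(println)
    } finally sc.stop()
  }
}
```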
Graph Operators:
Some of the supported algorithms:
PageRank
Connected components
Label propagation
SVD++ (singular value decomposition)
Strongly connected components
Triangle count
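Most of the algorithms listed above can be called directly on a Graph. The sketch below runs several of them on a five-vertex graph; the graph, the object name BuiltInAlgorithms, the tolerance, and the iteration limits are arbitrary choices for illustration, and Spark 2.x or later is assumed (earlier versions place extra requirements on the input of triangleCount). SVD++ is omitted because it takes a rating edge list and its own configuration object.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.graphx.lib.LabelPropagation

object BuiltInAlgorithms {
  // A directed triangle 1 -> 2 -> 3 -> 1 plus a separate edge 4 -> 5.
  def sample(sc: SparkContext): Graph[Int, Int] =
    Graph.fromEdges(
      sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(3L, 1L, 1), Edge(4L, 5L, 1))),
      defaultValue = 0)

  // Returns (connected components, strongly connected components,
  // triangles through vertex 1).
  def run(sc: SparkContext): (Long, Long, Int) = {
    val graph = sample(sc).cache()

    // PageRank until the ranks change by less than the given tolerance.
    val ranks = graph.pageRank(0.001).vertices
    ranks.collect().foreach { case (id, rank) => println(s"rank($id) = $rank") }

    // Each vertex is labelled with the smallest vertex ID in its component.
    val components = graph.connectedComponents().vertices.map(_._2).distinct().count()

    // Strongly connected components, limited to 5 iterations.
    val strong = graph.stronglyConnectedComponents(5).vertices.map(_._2).distinct().count()

    // Number of triangles passing through vertex 1.
    val triangles = graph.triangleCount().vertices.filter(_._1 == 1L).map(_._2).first()

    // Label propagation for at most 5 steps; the labels are community IDs.
    val labels = LabelPropagation.run(graph, 5).vertices
    println(s"label propagation labelled ${labels.count()} vertices")

    (components, strong, triangles)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("BuiltInAlgorithms")
    val sc = new SparkContext(conf)
    try println(run(sc)) finally sc.stop()
  }
}
```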
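For most algorithms, the input RDD should have at least as many partitions as the cluster has CPU cores; with fewer partitions some cores stay idle and the computation is not fully parallel.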
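One way to apply this rule is to check the partition count before building the graph. In the sketch below, atLeast is a hypothetical helper written for this example, and sc.defaultParallelism is used as a stand-in for the number of cores, which is an assumption that may not hold for every cluster configuration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.rdd.RDD

object PartitionSizing {
  // Hypothetical helper: repartition only when the RDD has fewer
  // partitions than requested, since repartitioning causes a shuffle.
  def atLeast[T](rdd: RDD[T], minPartitions: Int): RDD[T] =
    if (rdd.getNumPartitions < minPartitions) rdd.repartition(minPartitions) else rdd

  // Builds a graph from a deliberately under-partitioned edge RDD and
  // returns the number of edge partitions the graph ends up with.
  def run(sc: SparkContext): Int = {
    // Assumption: defaultParallelism approximates the cores available to the application.
    val cores = sc.defaultParallelism
    val edges = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1)), numSlices = 1)
    val graph = Graph.fromEdges(atLeast(edges, cores), defaultValue = 0)
    graph.edges.getNumPartitions
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("PartitionSizing")
    val sc = new SparkContext(conf)
    try println(s"edge partitions: ${run(sc)}") finally sc.stop()
  }
}
```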
A Spark component for graphs and graph-parallel computation
The Property Graph
A directed multigraph with user-defined objects attached to each vertex and edge
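The attached objects can be of any type, including the user's own classes. The sketch below uses two hypothetical case classes, Person and Link, invented for this example, as the vertex and edge attribute types.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

// Hypothetical user-defined property types for vertices and edges.
case class Person(name: String, role: String)
case class Link(kind: String, weight: Double)

object UserDefinedProperties {
  // Builds a Graph[Person, Link] and renders each edge together with
  // the attributes of its two endpoints.
  def describe(sc: SparkContext): Array[String] = {
    val people = sc.parallelize(Seq((1L, Person("alice", "prof")), (2L, Person("bob", "student"))))
    val links = sc.parallelize(Seq(Edge(1L, 2L, Link("advisor", 1.0))))
    val graph: Graph[Person, Link] = Graph(people, links)
    graph.triplets
      .map(t => s"${t.srcAttr.name} -[${t.attr.kind}]-> ${t.dstAttr.name}")
      .collect()
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("UserDefinedProperties")
    val sc = new SparkContext(conf)
    try describe(sc).foreach(println) finally sc.stop()
  }
}
```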