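Why use broadcast variables
In Spark, the actual logic of an operator is shipped to the Executors and runs there. When that logic references a variable defined outside it on the Driver, a broadcast variable is the efficient way to share it, especially when the value is large. If the Executor side uses a Driver variable: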
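- Without a broadcast variable: every task gets its own copy of the Driver variable, so an Executor holds as many copies as it has tasks
- With a broadcast variable: each Executor holds only one copy of the Driver variable, shared by all of its tasks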
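The difference can be sketched as follows. This is a minimal example, assuming a local Spark session; the object name `BroadcastVsClosure`, the lookup table `countryNames`, and the sample data are hypothetical and only serve the illustration.

```scala
import org.apache.spark.sql.SparkSession

object BroadcastVsClosure {
  // Returns (result via closure capture, result via broadcast) so the two can be compared.
  def run(): (Seq[String], Seq[String]) = {
    // Local session for illustration only; master and app name are assumptions.
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("broadcast-vs-closure")
      .getOrCreate()
    val sc = spark.sparkContext
    try {
      // Hypothetical Driver-side lookup table.
      val countryNames = Map("CN" -> "China", "US" -> "United States", "FR" -> "France")
      val codes = sc.parallelize(Seq("CN", "US", "FR", "CN"), numSlices = 4)

      // Without broadcast: countryNames is captured by the closure,
      // so it is serialized and shipped with every task.
      val viaClosure = codes.map(code => countryNames.getOrElse(code, "unknown")).collect()

      // With broadcast: tasks carry only the lightweight handle and read
      // the copy cached on their Executor through .value.
      val namesBc = sc.broadcast(countryNames)
      val viaBroadcast = codes.map(code => namesBc.value.getOrElse(code, "unknown")).collect()

      (viaClosure.toSeq, viaBroadcast.toSeq)
    } finally {
      spark.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    val (viaClosure, viaBroadcast) = run()
    println(viaClosure.mkString(", "))
    println(viaBroadcast.mkString(", "))
  }
}
```

Both versions compute the same result. For a table this small the cost is negligible either way; the saving matters when the value is large and a stage runs many tasks.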
Definition
A broadcast variable. Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used, for example, to give every node a copy of a large input dataset in an efficient manner. Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost.
Broadcast variables are created from a variable v by calling SparkContext.broadcast(T, scala.reflect.ClassTag). The broadcast variable is a wrapper around v, and its value can be accessed by calling the value method. The interpreter session below shows this:
scala> val broadcastVar = sc.broadcast(Array(1, 2, 3))
broadcastVar: org.apache.spark.broadcast.Broadcast[Array[Int]] = Broadcast(0)
scala> broadcastVar.value
res0: Array[Int] = Array(1, 2, 3)
Workflow
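A typical life cycle can be sketched in three steps: the Driver creates the broadcast variable, tasks on the Executors read it through `value`, and the Driver releases it when it is no longer needed. The sketch below assumes a local Spark session; the object name `BroadcastLifecycle` and the stop-word data are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object BroadcastLifecycle {
  // Returns the number of words left after filtering out the stop words.
  def run(): Long = {
    // Local session for illustration only; master and app name are assumptions.
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("broadcast-lifecycle")
      .getOrCreate()
    val sc = spark.sparkContext
    try {
      // 1. Driver: wrap a local value in a broadcast variable.
      val stopWords = sc.broadcast(Set("a", "the", "of"))

      // 2. Executors: tasks read the value through .value.
      val words = sc.parallelize(Seq("the", "art", "of", "a", "deal"), numSlices = 2)
      val kept = words.filter(w => !stopWords.value.contains(w)).count()

      // 3. Driver: release resources once the jobs that need the value are done.
      stopWords.unpersist() // drops the cached copies on the Executors
      stopWords.destroy()   // releases everything; the variable cannot be used afterwards
      kept
    } finally {
      spark.stop()
    }
  }

  def main(args: Array[String]): Unit = println(run())
}
```

`unpersist` alone is enough if the variable may be used again later, since Spark re-sends the value on demand; `destroy` is for when it is finished for good.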
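Notes
- An RDD itself cannot be broadcast, because an RDD does not store data. Broadcast the result of the RDD instead, for example after collecting it to the Driver
- A broadcast variable can only be created on the Driver, not on an Executor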
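- The value of a broadcast variable is read-only on the Executors and cannot be modified there. A change can only come from the Driver, which broadcasts the updated value as a new broadcast variable; the original object should not be mutated after it has been broadcast, so that all nodes see the same value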
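The notes above can be sketched as follows: collect the RDD result to the Driver before broadcasting it, and publish a changed value as a new broadcast variable. This is a minimal example, assuming a local Spark session and a result small enough to fit in Driver memory; the object name `BroadcastNotes` and the price data are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object BroadcastNotes {
  // Returns order totals computed with the original and with the updated price table.
  def run(): (Seq[Int], Seq[Int]) = {
    // Local session for illustration only; master and app name are assumptions.
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("broadcast-notes")
      .getOrCreate()
    val sc = spark.sparkContext
    try {
      // An RDD cannot be broadcast directly: materialize its result on the Driver first.
      val pricesRdd = sc.parallelize(Seq("apple" -> 3, "pear" -> 5))
      val prices = pricesRdd.collectAsMap().toMap // assumed to fit in Driver memory
      val pricesBc = sc.broadcast(prices)

      // Hypothetical orders: (product, quantity).
      val orders = sc.parallelize(Seq("apple" -> 2, "pear" -> 1))
      val before = orders.map { case (item, qty) => qty * pricesBc.value(item) }.collect().toSeq

      // "Updating" on the Driver: build a new value and broadcast it as a new variable
      // instead of mutating the object that was already broadcast.
      pricesBc.unpersist()
      val updatedBc = sc.broadcast(prices + ("apple" -> 4))
      val after = orders.map { case (item, qty) => qty * updatedBc.value(item) }.collect().toSeq

      (before, after)
    } finally {
      spark.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    val (before, after) = run()
    println(before.mkString(", "))
    println(after.mkString(", "))
  }
}
```

Jobs submitted after the update must reference `updatedBc`; closures that still hold `pricesBc` keep seeing the old table.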