Why use broadcast variables

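In Spark, the actual logic of an operator is shipped to the Executors and runs there. When that logic references a variable defined outside it, on the Driver, a broadcast variable is the efficient way to make the value available. If Executor-side code uses a Driver variable: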

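  • Without a broadcast variable: each Executor holds as many copies of the Driver-side variable as it has tasks
  • With a broadcast variable: each Executor holds a single copy of the Driver-side variable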
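
The difference between the two cases can be sketched as follows. This is a minimal illustration rather than a definitive pattern: the object name BroadcastLookup, the countryNames table, and the local[2] master are all assumptions made up for the example. Both versions return the same result; what differs is how the lookup table reaches the Executors.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object BroadcastLookup {
  // Returns (result via closure capture, result via broadcast variable).
  def run(): (Array[String], Array[String]) = {
    // Assumption: a local master with 2 threads, just so the sketch runs anywhere.
    val conf = new SparkConf().setAppName("broadcast-lookup").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical Driver-side lookup table.
      val countryNames = Map("CN" -> "China", "US" -> "United States", "DE" -> "Germany")
      val codes = sc.parallelize(Seq("CN", "US", "DE", "US"), numSlices = 4)

      // Without a broadcast variable: countryNames is captured by the closure,
      // so it is serialized and shipped along with every task.
      val viaClosure = codes.map(c => countryNames.getOrElse(c, "unknown")).collect()

      // With a broadcast variable: tasks only carry a small handle, and the
      // table is cached once per Executor and read through .value.
      val namesBc = sc.broadcast(countryNames)
      val viaBroadcast = codes.map(c => namesBc.value.getOrElse(c, "unknown")).collect()

      (viaClosure, viaBroadcast)
    } finally {
      sc.stop()
    }
  }
}
```

With a table this small the saving is negligible; the broadcast version pays off when the Driver-side value is large or when many tasks run on each Executor.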

    Definition

    A broadcast variable. Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used, for example, to give every node a copy of a large input dataset in an efficient manner. Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost.
    Broadcast variables are created from a variable v by calling SparkContext.broadcast(T, scala.reflect.ClassTag). The broadcast variable is a wrapper around v, and its value can be accessed by calling the value method. The interpreter session below shows this:

    scala> val broadcastVar = sc.broadcast(Array(1, 2, 3))
    broadcastVar: org.apache.spark.broadcast.Broadcast[Array[Int]] = Broadcast(0)
    scala> broadcastVar.value
    res0: Array[Int] = Array(1, 2, 3)

    Workflow

    [Figure: broadcast variable workflow]

    Notes

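  1. An RDD itself cannot be broadcast, because an RDD does not store data; the result computed from an RDD (for example, after collecting it to the Driver) can be broadcast.
  2. A broadcast variable can only be defined on the Driver side, not on the Executor side.
  3. The value behind a broadcast variable can be changed on the Driver side (in practice by broadcasting the updated value again); on the Executor side it is read-only and cannot be modified.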
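
The notes above can be sketched in code. This is a hedged illustration under stated assumptions: the object name BroadcastNotes, the rates and orders data, and the local[2] master are hypothetical, and re-broadcasting is only one way to "modify" a value from the Driver. It uses collectAsMap() to bring the RDD's result to the Driver before broadcasting, and unpersist() to drop the Executor-side copies of the old value.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object BroadcastNotes {
  // Returns (totals using the first broadcast, totals using the re-broadcast value).
  def run(): (Array[Int], Array[Int]) = {
    // Assumption: a local master, so the sketch is self-contained.
    val conf = new SparkConf().setAppName("broadcast-notes").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical data: a small RDD of multipliers and an RDD of orders.
      val rates = sc.parallelize(Seq("EUR" -> 2, "GBP" -> 3))
      val orders = sc.parallelize(Seq("EUR" -> 10, "GBP" -> 20))

      // Note 1: broadcast the RDD's result, not the RDD itself.
      val rateMap: Map[String, Int] = rates.collectAsMap().toMap
      // Note 2: the broadcast variable is created here, on the Driver.
      val ratesBc = sc.broadcast(rateMap)
      val first = orders.map { case (cur, amount) => amount * ratesBc.value(cur) }.collect()

      // Note 3: Executors only read ratesBc.value. To change the value, build
      // the new map on the Driver and broadcast it again.
      val newRatesBc = sc.broadcast(rateMap.updated("EUR", 4))
      ratesBc.unpersist() // drop cached copies of the old value on the Executors
      val second = orders.map { case (cur, amount) => amount * newRatesBc.value(cur) }.collect()

      (first, second)
    } finally {
      sc.stop()
    }
  }
}
```

Collecting to the Driver only makes sense when the RDD's result is small enough to fit in Driver memory, which is also the condition under which broadcasting it is worthwhile.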