Spark Broadcast Variables

2023-11-23 13:37:54


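Why use broadcast variables

In Spark, the actual logic of an operator is shipped to the Executors and runs there. When that logic references a variable defined outside it on the Driver, the variable has to travel to the Executors too, and a broadcast variable is the efficient way to send it, especially when it is large. If the Executor side uses a Driver-side variable:

Without a broadcast variable: every task receives its own copy, so an Executor holds as many copies of the Driver-side variable as it runs tasks
With a broadcast variable: each Executor holds a single copy of the Driver-side variable, shared by all of its tasks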
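
The difference between the two cases can be sketched as follows. This is a minimal illustration, not a benchmark: the `countryNames` lookup table, the object name, and the `local[2]` master are assumptions made up for the example, and in local mode everything runs in one JVM, so the saving only becomes visible on a real cluster with a large variable.

```scala
import org.apache.spark.sql.SparkSession

object BroadcastLookupExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("broadcast-lookup")
      .master("local[2]") // assumption: local run for illustration only
      .getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical Driver-side lookup table (small here, imagine it being large).
    val countryNames = Map("CN" -> "China", "US" -> "United States", "FR" -> "France")

    val codes = sc.parallelize(Seq("CN", "US", "FR", "US", "DE"), numSlices = 4)

    // Without broadcast: countryNames is captured by the closure
    // and serialized along with every task.
    val viaClosure = codes.map(c => countryNames.getOrElse(c, "unknown")).collect()

    // With broadcast: the table is shipped to each Executor once,
    // and tasks read it through .value.
    val bcNames = sc.broadcast(countryNames)
    val viaBroadcast = codes.map(c => bcNames.value.getOrElse(c, "unknown")).collect()

    // Both approaches compute the same result; only the shipping cost differs.
    println(viaClosure.mkString(", "))
    println(viaBroadcast.mkString(", "))

    spark.stop()
  }
}
```

Both `println` calls print the same line; the point of the broadcast version is how the data reaches the Executors, not what is computed.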

Definition
A broadcast variable. Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used, for example, to give every node a copy of a large input dataset in an efficient manner. Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost.
Broadcast variables are created from a variable v by calling SparkContext.broadcast(T, scala.reflect.ClassTag). The broadcast variable is a wrapper around v, and its value can be accessed by calling the value method. The interpreter session below shows this:
```
scala> val broadcastVar = sc.broadcast(Array(1, 2, 3))
broadcastVar: org.apache.spark.broadcast.Broadcast[Array[Int]] = Broadcast(0)
scala> broadcastVar.value
res0: Array[Int] = Array(1, 2, 3)
```
Workflow

Notes

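An RDD itself cannot be broadcast, because an RDD does not store data; it only describes a computation. Broadcast the result of the RDD instead, for example after collecting it to the Driver
A broadcast variable can only be created on the Driver, not on an Executor
A broadcast variable is read-only on the Executor side. Only the Driver can change what is broadcast, and it should do so by creating a new broadcast variable rather than mutating the original object, since changes made after broadcasting are not guaranteed to reach every Executor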
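
These notes can be sketched in code as follows. The example is a hedged illustration: the `priceRdd` and `orders` data, the object name, and the `local[2]` master are invented for the sketch, and re-broadcasting after `unpersist()` is shown as one possible way for the Driver to publish a new value, not as the only one.

```scala
import org.apache.spark.sql.SparkSession

object BroadcastRddResultExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("broadcast-rdd-result")
      .master("local[2]") // assumption: local run for illustration only
      .getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical data: an RDD of (item, unit price).
    val priceRdd = sc.parallelize(Seq("apple" -> 3.0, "pear" -> 2.5))

    // The RDD itself is not broadcast. Its result is first
    // materialized on the Driver, then broadcast.
    val priceMap = priceRdd.collectAsMap()
    val bcPrices = sc.broadcast(priceMap)

    val orders = sc.parallelize(Seq("apple" -> 2, "pear" -> 4, "kiwi" -> 1))

    // Executors only read the broadcast value through .value.
    val totals = orders.map { case (item, qty) =>
      item -> bcPrices.value.getOrElse(item, 0.0) * qty
    }.collect()
    totals.foreach(println)

    // To publish a changed value, the Driver releases the old
    // broadcast and creates a new one.
    bcPrices.unpersist()
    val bcUpdated = sc.broadcast(priceMap.toMap + ("kiwi" -> 1.2))
    val updatedTotals = orders.map { case (item, qty) =>
      item -> bcUpdated.value.getOrElse(item, 0.0) * qty
    }.collect()
    updatedTotals.foreach(println)

    // destroy() frees the broadcast for good; it cannot be used afterwards.
    bcUpdated.destroy()
    spark.stop()
  }
}
```

Collecting to the Driver only makes sense when the result fits in Driver memory, which is the same constraint that makes a value suitable for broadcasting in the first place.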
