6.1 Lineage
RDDs support only coarse-grained transformations, i.e., a single operation applied to a large set of records at once. Every such operation records the Lineage that produced the RDD. The Lineage records:
- the RDD's metadata
- the transformation that produced it
- To inspect the Lineage and the dependency types:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sparkConf = new SparkConf().setAppName("lineageDemo").setMaster("local[*]")
val sparkContext = new SparkContext(sparkConf)

// textFile + flatMap are narrow transformations; reduceByKey triggers a shuffle
val wordAndOne = sparkContext.textFile("text.txt").flatMap(_.split("\t").map((_, 1)))
val wordAndCount = wordAndOne.reduceByKey(_ + _)

println(wordAndOne.toDebugString)
println(wordAndCount.toDebugString)
println(wordAndOne.dependencies.toString)
println(wordAndCount.dependencies.toString)
```
Sample output (the leading `(2)` is the partition count, the indentation shows the lineage chain, and the `+-` marker indicates a shuffle boundary):

```
(2) MapPartitionsRDD[2] at flatMap at lineageDemo.scala:10 []
 |  text.txt MapPartitionsRDD[1] at textFile at lineageDemo.scala:10 []
 |  text.txt HadoopRDD[0] at textFile at lineageDemo.scala:10 []
(2) ShuffledRDD[3] at reduceByKey at lineageDemo.scala:11 []
 +-(2) MapPartitionsRDD[2] at flatMap at lineageDemo.scala:10 []
    |  text.txt MapPartitionsRDD[1] at textFile at lineageDemo.scala:10 []
    |  text.txt HadoopRDD[0] at textFile at lineageDemo.scala:10 []
List(org.apache.spark.OneToOneDependency@5762658b)
List(org.apache.spark.ShuffleDependency@2629d5dc)
```
6.2 Narrow Dependency
Each partition of the parent RDD is used by at most one partition of the child RDD; this is a OneToOneDependency.
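A minimal sketch of verifying this, assuming a fresh local SparkContext (the app name and RDD contents here are illustrative): a `map` over a two-partition RDD produces a OneToOneDependency.

```scala
import org.apache.spark.{OneToOneDependency, SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("narrowDepDemo").setMaster("local[*]"))

// map is narrow: each child partition reads exactly one parent partition
val parent = sc.parallelize(1 to 10, numSlices = 2)
val child  = parent.map(_ * 2)

println(child.dependencies)
// List(org.apache.spark.OneToOneDependency@...)
assert(child.dependencies.head.isInstanceOf[OneToOneDependency[_]])
```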
6.3 Wide Dependency
Multiple partitions of the child RDD depend on the same partition of the parent RDD, which requires a shuffle; this is a ShuffleDependency.
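A matching sketch for the wide case, again assuming a fresh local SparkContext with illustrative data: `reduceByKey` must bring records with the same key from different parent partitions into one child partition, so it yields a ShuffleDependency.

```scala
import org.apache.spark.{ShuffleDependency, SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("wideDepDemo").setMaster("local[*]"))

// reduceByKey is wide: same-key records may live in different parent
// partitions, so they must be shuffled to a single child partition
val pairs   = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)), numSlices = 2)
val reduced = pairs.reduceByKey(_ + _)

println(reduced.dependencies)
// List(org.apache.spark.ShuffleDependency@...)
assert(reduced.dependencies.head.isInstanceOf[ShuffleDependency[_, _, _]])
```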
6.4 DAG
A DAG (directed acyclic graph) is what the original RDDs form after a series of transformations. The DAG is split into multiple Stages according to the dependencies between RDDs: only wide dependencies introduce Stage boundaries, while narrow dependencies need no shuffle and are therefore completed within a single Stage.
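A minimal sketch of where the Stage boundary falls, reusing the word-count pipeline (and the `sparkContext`) from 6.1: the narrow `textFile`/`flatMap` chain is pipelined into one Stage, and `reduceByKey`, behind the `+-` shuffle marker in the debug string, starts a second Stage.

```scala
// Stage 0: textFile + flatMap (narrow dependencies, pipelined together)
// Stage 1: reduceByKey (begins after the shuffle boundary)
val wordAndCount = sparkContext
  .textFile("text.txt")
  .flatMap(_.split("\t").map((_, 1)))
  .reduceByKey(_ + _)

// In the debug string, the "+-" marker is exactly the Stage boundary
println(wordAndCount.toDebugString)
wordAndCount.collect()  // the action submits one job containing both Stages
```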
6.5 Spark Task Division
An RDD job is divided into: Application, Job, Stage, and Task (illustrated by the sketch after this list).
- An Application corresponds one-to-one with a SparkContext.
- A Job is generated by each Action operator.
- A Job is split into Stages at its wide-dependency boundaries.
- A Task is the unit of execution within a Stage; a Stage is in effect a TaskSet, with one Task per partition.
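A minimal sketch of how these counts fall out, assuming a local run against the same `text.txt` (file name and partition counts are illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// One SparkContext => one Application
val sc = new SparkContext(
  new SparkConf().setAppName("taskDivisionDemo").setMaster("local[*]"))

val counts = sc
  .textFile("text.txt", minPartitions = 2)
  .flatMap(_.split("\t").map((_, 1)))
  .reduceByKey(_ + _)  // wide dependency => Stage boundary

counts.collect()  // Action #1 => Job 0, two Stages, one Task per partition in each Stage
counts.count()    // Action #2 => Job 1 (Spark may skip Stages whose shuffle output already exists)
```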