6.1 Lineage

RDDs support only coarse-grained transformations, i.e., a single operation applied to many records at once. Each transformation records the Lineage that produced the RDD, so a lost partition can be recomputed from its ancestors. The Lineage records:

  • the RDD's metadata
  • the transformation behavior that produced it
  • To view an RDD's Lineage and the type of each dependency:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sparkConf = new SparkConf().setAppName("lineageDemo").setMaster("local[*]")
val sparkContext = new SparkContext(sparkConf)

// Build a word-count lineage: one narrow transformation (flatMap),
// then one wide transformation (reduceByKey).
val wordAndOne = sparkContext.textFile("text.txt").flatMap(_.split("\t").map((_, 1)))
val wordAndCount = wordAndOne.reduceByKey(_ + _)

println(wordAndOne.toDebugString)
println(wordAndCount.toDebugString)
println(wordAndOne.dependencies.toString)
println(wordAndCount.dependencies.toString)
```
The output:

```
(2) MapPartitionsRDD[2] at flatMap at lineageDemo.scala:10 []
 |  text.txt MapPartitionsRDD[1] at textFile at lineageDemo.scala:10 []
 |  text.txt HadoopRDD[0] at textFile at lineageDemo.scala:10 []
(2) ShuffledRDD[3] at reduceByKey at lineageDemo.scala:11 []
 +-(2) MapPartitionsRDD[2] at flatMap at lineageDemo.scala:10 []
    |  text.txt MapPartitionsRDD[1] at textFile at lineageDemo.scala:10 []
    |  text.txt HadoopRDD[0] at textFile at lineageDemo.scala:10 []
List(org.apache.spark.OneToOneDependency@5762658b)
List(org.apache.spark.ShuffleDependency@2629d5dc)
```
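In `toDebugString` output, the number in parentheses is the RDD's partition count, `|` marks a narrow parent within the same stage, and the `+-` indentation shift marks a shuffle boundary. The last two lines print the dependency objects themselves; a minimal sketch for classifying them programmatically (the helper name `describeDeps` is hypothetical):

```scala
import org.apache.spark.{NarrowDependency, ShuffleDependency}
import org.apache.spark.rdd.RDD

// Classify each immediate dependency of an RDD as narrow or wide.
def describeDeps(rdd: RDD[_]): Unit =
  rdd.dependencies.foreach {
    case _: ShuffleDependency[_, _, _] => println("wide (shuffle) dependency")
    case _: NarrowDependency[_]        => println("narrow dependency")
    case other                         => println(s"other: $other")
  }

describeDeps(wordAndOne)   // narrow dependency
describeDeps(wordAndCount) // wide (shuffle) dependency
```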

6.2 Narrow Dependency

A narrow dependency means each partition of the parent RDD is used by at most one partition of the child RDD, i.e., a OneToOneDependency.
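For example, `map` is a narrow transformation. A minimal sketch (hypothetical data, reusing the `sparkContext` from section 6.1):

```scala
// map preserves partitioning: each child partition reads exactly one
// parent partition, so the dependency is OneToOneDependency.
val nums    = sparkContext.parallelize(1 to 10, 2)
val doubled = nums.map(_ * 2)
println(doubled.dependencies) // List(org.apache.spark.OneToOneDependency@...)
```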

6.3 Wide Dependency

A wide dependency means multiple child RDD partitions depend on the same parent RDD partition, which triggers a shuffle, i.e., a ShuffleDependency.
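For example, `groupByKey` is a wide transformation. A minimal sketch (hypothetical data, reusing `sparkContext`):

```scala
// groupByKey must gather all values for a key into one child partition,
// so every child partition may read from every parent partition: a shuffle.
val pairs   = sparkContext.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), 2)
val grouped = pairs.groupByKey()
println(grouped.dependencies) // List(org.apache.spark.ShuffleDependency@...)
```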

6.4 DAG

A DAG is a directed acyclic graph: the original RDDs pass through a series of transformations that form a DAG, and the DAG is divided into Stages according to the dependencies between RDDs. Only wide dependencies introduce Stage boundaries; narrow dependencies need no shuffle, so they complete within the same Stage.
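As a sketch of how the split falls out of the graph (hypothetical data, reusing `sparkContext`): two wide dependencies divide the lineage below into three Stages, visible as the `+-` indentation levels in `toDebugString`:

```scala
// flatMap and map are narrow, so they stay in the current stage;
// reduceByKey and sortByKey each shuffle, so each starts a new stage.
val staged = sparkContext
  .parallelize(Seq("a b", "b c"), 2)
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)            // shuffle #1 -> new stage
  .map { case (w, n) => (n, w) }
  .sortByKey()                   // shuffle #2 -> new stage
println(staged.toDebugString)    // three indentation levels = three stages
```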

6.5 Spark Task Division

RDD work is divided into four levels: Application, Job, Stage, and Task (see the sketch after the list):

  • Application: corresponds one-to-one with a SparkContext
  • Job: one is generated per Action operator
  • Stage: a Job is split into Stages at wide-dependency (shuffle) boundaries
  • Task: the execution unit within a Stage; a Stage is essentially a TaskSet, with one Task per partition
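A minimal sketch tying the four levels together (hypothetical data; the counts below assume the 2 input partitions set in `parallelize`):

```scala
// One SparkContext        -> one Application.
// Each action below       -> one Job.
// The reduceByKey shuffle -> 2 Stages per job,
// and each stage runs one Task per partition.
val counts = sparkContext
  .parallelize(Seq("a b", "b c"), 2) // 2 partitions
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)

counts.collect() // Job 0: 2 stages x 2 partitions = 4 tasks
counts.count()   // Job 1: Spark can skip the map stage and reuse its shuffle output
```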