07-源码-SparkSubmit.txt
08-源码-ApplicationMaster.txt
09-源码-ExecutorBackend.txt
While reading the source code, some YARN-related sources may be missing. Adding the following dependency to the pom file resolves this:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-yarn_2.12</artifactId>
    <version>3.0.0</version>
</dependency>
- Environment
(1) Local[*]
(2) Standalone
(3) YARN
- Components
(1) SparkSubmit
(2) ApplicationMaster
(3) Driver(Thread)
(4) SparkContext
(5) ExecutorBackend
(6) Executor
- Communication
(1) Actor (mailbox messaging)
(2) Netty (NIO, Epoll)
(3) Endpoint (reply, receive)
(4) EndpointRef (send, ask)
(5) Inbox (1 per endpoint)
(6) Outbox (N, one per remote address)
- Computation
(1) RDD, accumulators, broadcast variables
(2) Operators (transformations, actions)
(3) Lineage (dependencies)
(4) Persistence (cache, persist, checkpoint)
(5) Stage, Task
- Shuffle
(1) Wide dependencies
(2) Optimization: pre-aggregation (AppendOnlyMap)
(3) Implementations: bypass, sort (external sort, mergeSort)
(4) Principles
- Memory
(1) Memory categories
    - Storage memory, execution memory, other memory
    - On-heap, off-heap
(2) Dynamic occupancy mechanism
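The dynamic occupancy mechanism listed above can be illustrated with a toy model. The sketch below is a simplified illustration, not Spark's implementation: UnifiedMemorySketch and all of its methods are hypothetical names, and sizes are arbitrary units. It assumes the behavior described in Spark's tuning documentation: storage and execution share one region, either side may use the other's free space, execution may evict cached storage down to a protected threshold, and storage never evicts execution. Eviction of cached blocks by other storage requests is left out.

```java
public class UnifiedMemorySketch {
    private final long total;         // size of the shared region
    private final long storageFloor;  // storage below this level is never evicted
    private long storageUsed = 0;
    private long executionUsed = 0;

    public UnifiedMemorySketch(long total, long storageFloor) {
        if (total < 0 || storageFloor < 0 || storageFloor > total) {
            throw new IllegalArgumentException("invalid region sizes");
        }
        this.total = total;
        this.storageFloor = storageFloor;
    }

    /** Storage may take any free memory but never evicts execution. */
    public boolean acquireStorage(long amount) {
        if (amount < 0) throw new IllegalArgumentException("negative amount");
        long free = total - storageUsed - executionUsed;
        if (amount > free) {
            return false; // not enough free space, request is refused
        }
        storageUsed += amount;
        return true;
    }

    /** Execution takes free memory first, then evicts storage above the floor. */
    public long acquireExecution(long amount) {
        if (amount < 0) throw new IllegalArgumentException("negative amount");
        long free = total - storageUsed - executionUsed;
        if (amount > free) {
            long evictable = Math.max(0, storageUsed - storageFloor);
            long evicted = Math.min(evictable, amount - free);
            storageUsed -= evicted; // cached data is dropped to make room
            free += evicted;
        }
        long granted = Math.min(amount, free);
        executionUsed += granted;
        return granted; // may be less than requested
    }

    public long getStorageUsed() { return storageUsed; }
    public long getExecutionUsed() { return executionUsed; }

    public static void main(String[] args) {
        UnifiedMemorySketch m = new UnifiedMemorySketch(100, 50);
        m.acquireStorage(80);                  // storage borrows beyond its floor
        long granted = m.acquireExecution(40); // forces eviction of 20 units
        System.out.println("granted=" + granted
                + " storage=" + m.getStorageUsed()
                + " execution=" + m.getExecutionUsed());
    }
}
```

In this model a storage request is all-or-nothing while an execution request may be partially granted. That is a design choice of the sketch, made to keep the asymmetry between the two sides visible.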
- SparkSQL
(1) RDD, DataFrame, Dataset
(2) SparkSession
(3) SQL & DSL
(4) UDF & UDAF
- SparkStreaming (Flink)
(1) Principles
(2) Kafka + SparkStreaming
(3) Window
(4) Graceful shutdown
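
The Window item in the list above can be sketched without any Spark dependency. In Spark Streaming a window is defined by a window length and a slide interval, both multiples of the batch interval. The example below is a minimal illustration over per-batch record counts: WindowSketch and windowSums are hypothetical names, not Spark APIs, and as a simplification only full windows are emitted.

```java
import java.util.ArrayList;
import java.util.List;

public class WindowSketch {
    /**
     * Sums per-batch counts over sliding windows.
     * windowLen and slide are measured in batches, not in time.
     */
    public static List<Integer> windowSums(int[] batchCounts, int windowLen, int slide) {
        if (windowLen <= 0 || slide <= 0) {
            throw new IllegalArgumentException("windowLen and slide must be positive");
        }
        List<Integer> sums = new ArrayList<>();
        // 'end' is the exclusive index of the last batch in the current window
        for (int end = windowLen; end <= batchCounts.length; end += slide) {
            int sum = 0;
            for (int i = end - windowLen; i < end; i++) {
                sum += batchCounts[i];
            }
            sums.add(sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        int[] counts = {1, 2, 3, 4, 5, 6};
        // window of 3 batches, sliding by 2 batches
        System.out.println(windowSums(counts, 3, 2)); // prints [6, 12]
    }
}
```

With a slide smaller than the window length, consecutive windows overlap, so a batch can contribute to more than one result (batch value 3 in the example is counted in both windows).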