A0621.1 - Checkpointing设置目录应该本地还是远程?
看的基本书中没有交代这个问题,只是说有缓存提效的作用,但我想这对于spark的并行计算来讲是问题,我做实验的时候都是一个机子,没关注这个问题,后面看《Definitive Guide》想到这个确实需要弄清楚一下。
Checkpointing can be local or reliable which defines how reliable the checkpoint directory is. Local checkpointing uses executor storage to write checkpoint files to and due to the executor lifecycle is considered unreliable. Reliable checkpointing uses a reliable data storage like Hadoop HDFS. ref
RDD Checkpointing is a process of truncating RDD lineage graph and saving it to a reliable distributed (HDFS) or local file system. ref
两种都支持,但从最上层暴露的接口,看不出什么时候是Local什么时候是用Reliable?内部根据指定的文件协议决定?
A0627.1 - RDD的分区转换后没有数据,是否保留?
例如rdd有3个分区,但使用filter后,只有2个分区有数据?新rdd的分区是2还是3了?
有下面这个结果可知,rdd还是保持了分区个数,说明允许有些分区无数据。
val arr0 = Array((1, 'a'), (2, 'b'), (3, 'c'), (4, 'd'), (2, 'e'), (3, 'f'), (2, 'g'), (1, 'h'))
val rdd1 = sc.parallelize(arr0, 3)
rdd1.getNumPartitions // 3
rdd1.mapPartitionsWithIndex((idx, it) => it.toSeq.map(i => s"At ${idx}, ${i._1}_${i._2}").toIterator).foreach(println)
/*
At 1, 3_c
At 2, 3_f
At 2, 2_g
At 2, 1_h
At 0, 1_a
At 0, 2_b
At 1, 4_d
At 1, 2_e
*/
val rdd2 = rdd1.filter(r => r._1 == 1)
rdd2.getNumPartitions // 3
rdd2.mapPartitionsWithIndex((idx, it) => it.toSeq.map(i => s"At ${idx}, ${i._1}_${i._2}").toIterator).foreach(println)
/*
At 0, 1_a
At 2, 1_h
*/