交集、并集、差集

交集会产生shuffle操作

差集

  1. /**
  2. * 交集
  3. */
  4. @Test
  5. def intersection(): Unit ={
  6. val rdd1 = sc.parallelize(List(1,2,3,4,5))
  7. val rdd2 = sc.parallelize(List(4,5,6,7,8,9))
  8. val rdd3 = rdd1.intersection(rdd2)
  9. println(rdd3.collect().toList)
  10. }
  11. /**
  12. * 差集
  13. */
  14. @Test
  15. def subtract(): Unit ={
  16. val rdd1 = sc.parallelize(List(1,2,3,4,5))
  17. val rdd2 = sc.parallelize(List(4,5,6,7,8,9))
  18. val rdd3 = rdd1.subtract(rdd2)
  19. println(rdd3.collect().toList)
  20. }
  21. @Test
  22. def union(): Unit ={
  23. val rdd1 = sc.parallelize(List(1,2,3,4,5))
  24. val rdd2 = sc.parallelize(List(4,5,6,7,8,9))
  25. val rdd3 = rdd1.union(rdd2)
  26. println(rdd3.partitions.length)
  27. println(rdd3.collect().toList)
  28. }

zip拉链

  1. /**
  2. * spark的拉链必须要求两个RDD的分区数以及元素个数一致
  3. */
  4. @Test
  5. def zip(): Unit ={
  6. val rdd1 = sc.parallelize(List("zhangsan","lisi","wangwu"),5)
  7. val rdd2 = sc.parallelize(List(20,30,40))
  8. val rdd3 = rdd1.zip(rdd2)
  9. println(rdd3.collect().toList)
  10. }

join

一个join会产生两个shuffler,用广播变量减少shuffler

union

union并集没有shuffle,RDD分区数=union的两个rdd分区数总和