JOIN types

sort merge join

broadcast join

Trigger condition:
One side of the join is smaller than the value of spark.sql.autoBroadcastJoinThreshold (default 10 MB).
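The broadcast condition can be sketched in Scala as follows. This is a minimal illustration, not a tuned setup: the `orders` and `countries` data, the object name, and the helper `joinPlan` are hypothetical, and the session runs in local mode. Besides relying on the size threshold, the `broadcast` function from `org.apache.spark.sql.functions` can be used to request a broadcast explicitly.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

object BroadcastJoinSketch {
  // Builds a small join and returns its physical plan as text (hypothetical helper).
  def joinPlan(spark: SparkSession): String = {
    import spark.implicits._

    // Hypothetical sample data: a larger fact table and a small dimension table.
    val orders = Seq((1, "A", 10.0), (2, "B", 20.0), (3, "A", 5.0))
      .toDF("order_id", "country_code", "amount")
    val countries = Seq(("A", "Austria"), ("B", "Belgium"))
      .toDF("country_code", "country_name")

    // Explicit hint: ask Spark to broadcast the small side.
    val joined = orders.join(broadcast(countries), Seq("country_code"))
    joined.queryExecution.executedPlan.toString
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("BroadcastJoinSketch")
      .master("local[*]")
      // Threshold in bytes; 10 MB matches the default, -1 disables automatic broadcasting.
      .config("spark.sql.autoBroadcastJoinThreshold", 10L * 1024 * 1024)
      .getOrCreate()

    // The plan is expected to mention BroadcastHashJoin.
    println(joinPlan(spark))
    spark.stop()
  }
}
```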

shuffle hash join

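Trigger conditions:
1. The average partition size does not exceed the value of spark.sql.autoBroadcastJoinThreshold (default 10 MB).
2. The base (preserved) table cannot serve as the build side; for example, in a left outer join only the right table can.
3. One side must be significantly smaller than the other; the smaller side is the one used to build the hash table ("significantly smaller" means at least 3 times smaller, which is an empirical value).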
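The three conditions above can be modelled as a small decision function. This is a simplified sketch in plain Scala, not Spark's actual planner code: all names (`ShuffleHashJoinSketch`, `chooseBuildSide`, and so on) are hypothetical, table sizes are passed in as plain byte counts, and "build side" here means the side whose rows are loaded into the per-partition hash table.

```scala
// Simplified model of the shuffle hash join conditions; not Spark's planner code.
object ShuffleHashJoinSketch {
  sealed trait JoinType
  case object Inner extends JoinType
  case object LeftOuter extends JoinType
  case object RightOuter extends JoinType

  sealed trait Side
  case object LeftSide extends Side
  case object RightSide extends Side

  // Condition 2: the preserved (base) side of an outer join cannot be the build side.
  def canBuild(side: Side, joinType: JoinType): Boolean = (side, joinType) match {
    case (_, Inner)              => true
    case (RightSide, LeftOuter)  => true
    case (LeftSide, RightOuter)  => true
    case _                       => false
  }

  // Condition 1: the average partition size must not exceed the threshold.
  def fitsPerPartition(sizeInBytes: Long, numPartitions: Int, threshold: Long): Boolean =
    sizeInBytes / numPartitions <= threshold

  // Condition 3: "significantly smaller" is taken as at least 3 times smaller.
  def muchSmaller(small: Long, large: Long): Boolean = small * 3 <= large

  // Returns the side that could be used as the build side, if any.
  def chooseBuildSide(leftSize: Long, rightSize: Long, joinType: JoinType,
                      numPartitions: Int, threshold: Long): Option[Side] = {
    def ok(side: Side, own: Long, other: Long): Boolean =
      canBuild(side, joinType) &&
        fitsPerPartition(own, numPartitions, threshold) &&
        muchSmaller(own, other)

    if (ok(RightSide, rightSize, leftSize)) Some(RightSide)
    else if (ok(LeftSide, leftSize, rightSize)) Some(LeftSide)
    else None
  }
}
```

For instance, with 200 shuffle partitions and a 10 MB threshold, a 400 MB right table joined to a 3000 MB left table in a left outer join passes all three checks in this model, while swapping the two sizes fails because the left (base) side may not be built.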

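Small files problem

For jobs triggered through SQL, lowering spark.sql.shuffle.partitions reduces the number of output files, because each shuffle partition is written out by its own task.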
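A minimal sketch of the effect of this setting, assuming a local session and made-up data. The object name and the helper `shufflePartitionCount` are hypothetical. Adaptive query execution is switched off here only so that the resulting partition count is deterministic; with it enabled, Spark may coalesce partitions further on its own.

```scala
import org.apache.spark.sql.SparkSession

object ShufflePartitionsSketch {
  // Runs a shuffle with the given setting and returns the resulting partition count.
  def shufflePartitionCount(spark: SparkSession, partitions: Int): Int = {
    import spark.implicits._

    // Lower the shuffle partition count (the default is 200).
    spark.conf.set("spark.sql.shuffle.partitions", partitions.toString)

    // Hypothetical data; groupBy forces a shuffle.
    val events = (1 to 1000).map(i => (i % 10, i)).toDF("key", "value")
    val counts = events.groupBy("key").count()

    // A subsequent write would produce at most one file per partition.
    counts.rdd.getNumPartitions
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ShufflePartitionsSketch")
      .master("local[*]")
      // Disabled only to keep the partition count deterministic in this sketch.
      .config("spark.sql.adaptive.enabled", false)
      .getOrCreate()

    println(shufflePartitionCount(spark, 4))
    spark.stop()
  }
}
```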

Spark on Hive

Code sample

Place hive-site.xml in the project's resources directory.

  import org.apache.hadoop.fs.Path
  import org.apache.spark.sql.SparkSession

  val spark = SparkSession
    .builder()
    .appName("WordCount")
    .config("spark.master", "local")
    .config("spark.sql.shuffle.partitions", 4)
    .config("spark.sql.adaptive.enabled", true)
    // Allow dynamic partition inserts into Hive tables
    .config("hive.exec.dynamic.partition", true)
    .config("hive.exec.dynamic.partition.mode", "nonstrict")
    .enableHiveSupport()
    .getOrCreate()
  spark.sparkContext.setLogLevel("DEBUG")
  val sc = spark.sparkContext
  // Load the Hadoop client configuration from the local config directory
  val path = "/etc/hadoop"
  sc.hadoopConfiguration.addResource(new Path(s"${path}/core-site.xml"))
  sc.hadoopConfiguration.addResource(new Path(s"${path}/hdfs-site.xml"))