JOIN types
sort merge join
broadcast join
Trigger condition:
The estimated size of the table is below the value of spark.sql.autoBroadcastJoinThreshold (default 10 MB).
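As a rough illustration of the broadcast condition, the sketch below sets the threshold explicitly and also requests a broadcast through the broadcast hint from org.apache.spark.sql.functions. The object name, the DataFrames (orders, countries) and their contents are hypothetical and exist only for this example; it assumes Spark SQL is on the classpath and runs in local mode.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.broadcast

object BroadcastJoinSketch {
  // Joins a hypothetical fact table with a small lookup table,
  // asking Spark to broadcast the lookup side.
  def joinOrders(spark: SparkSession): DataFrame = {
    import spark.implicits._

    // 10 MB is the default; it is set here only to make the threshold visible.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10L * 1024 * 1024)

    // Made-up sample data.
    val orders = Seq((1, "CN", 30.0), (2, "US", 12.5), (3, "CN", 7.0))
      .toDF("order_id", "country_code", "amount")
    val countries = Seq(("CN", "China"), ("US", "United States"))
      .toDF("country_code", "country_name")

    // The hint marks the small side for broadcast regardless of the size estimate.
    orders.join(broadcast(countries), Seq("country_code"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("BroadcastJoinSketch")
      .master("local[*]")
      .getOrCreate()
    val joined = joinOrders(spark)
    joined.explain() // inspect the physical plan for the chosen join operator
    joined.show()
    spark.stop()
  }
}
```

Calling explain() on the result is the usual way to check which join operator the planner actually picked.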
shuffle hash join
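Trigger conditions:
1. The average partition size does not exceed the value of spark.sql.autoBroadcastJoinThreshold (default 10 MB).
2. The base table cannot be the build side; for example, in a left outer join only the right table can be used to build the hash table.
3. One side must be significantly smaller than the other; the smaller side becomes the build side, whose partitions are loaded into in-memory hash tables ("significantly smaller" means at least 3 times smaller, which is an empirical value).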
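The three conditions can be modelled with a few lines of plain Scala. This is a simplified sketch for reasoning about the rules, not Spark's planner code: the object, the TableStats type and all function names are hypothetical, and the per-partition check assumes that the average partition size is approximated as total size divided by the number of shuffle partitions.

```scala
object ShuffleHashJoinCheck {
  sealed trait JoinType
  case object Inner extends JoinType
  case object LeftOuter extends JoinType
  case object RightOuter extends JoinType

  // Hypothetical summary of one join input: its estimated size in bytes.
  final case class TableStats(sizeInBytes: Long)

  // The "3 times smaller" rule of thumb from the list above.
  val SizeRatio: Long = 3L

  // Condition 1: the average partition size stays under the threshold,
  // expressed as total size < threshold * number of shuffle partitions.
  def fitsPerPartition(t: TableStats, threshold: Long, shufflePartitions: Int): Boolean =
    t.sizeInBytes < threshold * shufflePartitions

  // Condition 2: the preserved (base) side of an outer join cannot be the build side.
  def canBuildRight(joinType: JoinType): Boolean = joinType match {
    case Inner | LeftOuter => true
    case RightOuter        => false
  }

  // Condition 3: the build side must be at least SizeRatio times smaller.
  def muchSmaller(small: TableStats, large: TableStats): Boolean =
    small.sizeInBytes * SizeRatio <= large.sizeInBytes

  // All three conditions, with the right table as the candidate build side.
  def canBuildRightSide(left: TableStats, right: TableStats, joinType: JoinType,
                        threshold: Long, shufflePartitions: Int): Boolean =
    canBuildRight(joinType) &&
      fitsPerPartition(right, threshold, shufflePartitions) &&
      muchSmaller(right, left)
}
```

For example, with a 10 MB threshold and 200 shuffle partitions, a 500 MB right table joined to a 3000 MB left table passes all three checks for a left outer join, but fails for a right outer join because the right table is then the base table.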
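Small files problem
For jobs triggered through SQL, lower spark.sql.shuffle.partitions (default 200). Each shuffle partition is processed by its own task, so fewer partitions generally means fewer, larger output files.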
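A minimal sketch of the effect, assuming a local SparkSession: the setting is changed with a SQL SET statement and the number of partitions after a shuffle is read back. The object and function names are hypothetical. Adaptive query execution is switched off here only so that the configured value is used as-is; with it enabled, Spark may coalesce shuffle partitions further on its own.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object ShufflePartitionsSketch {
  // Returns how many partitions an aggregation produces under the given setting.
  def partitionsAfterShuffle(spark: SparkSession, shufflePartitions: Int): Int = {
    // Disable AQE so the shuffle uses exactly the configured partition count.
    spark.sql("SET spark.sql.adaptive.enabled=false")
    spark.sql(s"SET spark.sql.shuffle.partitions=$shufflePartitions")

    // groupBy forces a shuffle; the data itself is made up.
    val counts = spark.range(1000).toDF("n")
      .groupBy((col("n") % 10).as("bucket"))
      .count()

    counts.rdd.getNumPartitions
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ShufflePartitionsSketch")
      .master("local[*]")
      .getOrCreate()
    println(partitionsAfterShuffle(spark, 4))
    spark.stop()
  }
}
```

Writing the aggregated result out at this point would use one task per shuffle partition, which is why a smaller value tends to reduce the number of files.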
Spark on Hive
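Code sample
Place hive-site.xml in the project's resources directory (for example src/main/resources in a standard sbt or Maven layout) so that it is on the classpath.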
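import org.apache.hadoop.fs.Path
import org.apache.spark.sql.SparkSession

// Build a local session with Hive support; hive-site.xml is read from the classpath.
val spark = SparkSession
  .builder()
  .appName("WordCount")
  .config("spark.master", "local")
  .config("spark.sql.shuffle.partitions", 4)
  .config("spark.sql.adaptive.enabled", true)
  // Allow dynamic-partition inserts without a static partition column.
  .config("hive.exec.dynamic.partition", true)
  .config("hive.exec.dynamic.partition.mode", "nonstrict")
  .enableHiveSupport()
  .getOrCreate()
spark.sparkContext.setLogLevel("DEBUG")

// Load the Hadoop client configuration so that HDFS paths resolve against the cluster.
val sc = spark.sparkContext
val path = "/etc/hadoop"
sc.hadoopConfiguration.addResource(new Path(s"${path}/core-site.xml"))
sc.hadoopConfiguration.addResource(new Path(s"${path}/hdfs-site.xml"))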