This section lists some of the reasons the map stage can run very slowly. Some of the cases are taken from the web.

1. Data skew caused by joining columns of different data types

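Scenario: the user_id column in the users table is int, while the user_id column in the log table (logs) holds both string and int values. When the two tables are joined on user_id, the default hash step distributes rows by the int-typed id, so every record whose id is a string is sent to a single Reducer.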
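
Before changing the join, it can help to measure how many log rows carry an id that cannot be read as a number, since those are the rows that would pile up on one Reducer. The query below is only a rough sketch: it assumes logs.user_id is declared as STRING, and the aliases total_rows and non_numeric_rows are made up for illustration.

```sql
-- Sketch only: assumes logs.user_id is declared as STRING.
-- In Hive, casting a non-numeric string to BIGINT yields NULL, so the second
-- column counts rows whose key cannot be read as a number (NULL keys included).
SELECT
  COUNT(*) AS total_rows,
  SUM(CASE WHEN CAST(user_id AS BIGINT) IS NULL THEN 1 ELSE 0 END) AS non_numeric_rows
FROM logs;
```

If non_numeric_rows is a large share of total_rows, the skew described above is a likely suspect.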

Solution: cast the numeric column to string so that both sides of the join are compared as strings.

select * from users a left outer join logs b on cast(a.user_id as string) = b.user_id


A real-world case (map join is the default here):

In fin_ihotel_ceq_external_ctrip (the large table), countryid and cityid are int,

while country_id in dim_ihotel_country_ctrip (a small table) and city_id in dim_ihotel_city_ctrip (a small table) are both string.
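
A quick way to confirm this kind of mismatch is to look at the declared column types on both sides of the join. A minimal sketch using Hive's DESCRIBE statement on the three tables named above:

```sql
-- Lists each column with its declared data type; compare countryid/cityid
-- in the large table against country_id/city_id in the dimension tables.
DESCRIBE fin_ihotel_ceq_external_ctrip;
DESCRIBE dim_ihotel_country_ctrip;
DESCRIBE dim_ihotel_city_ctrip;
```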

from fin_ihotel_ceq_external_ctrip t1
left join dim_ihotel_country_ctrip t2
on t1.countryid=t2.country_id
left join dim_ihotel_city_ctrip t3
on t1.cityid=t3.city_id and t1.countryid=t2.country_id

Changed to:

from fin_ihotel_ceq_external_ctrip t1
left join dim_ihotel_country_ctrip t2
on cast(t1.countryid as string)=t2.country_id
left join dim_ihotel_city_ctrip t3
on cast(t1.cityid as string)=t3.city_id and cast(t1.countryid as string)=t2.country_id

The runtime dropped from an hour and a half to 3 minutes.

[Note]: From now on, columns that will be used as join keys are best defined as string.
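
As a rough illustration of that note, the DDL below declares the join key as STRING on both sides, so no cast is needed at join time. The table names users_demo and logs_demo and the non-key columns are hypothetical and do not come from the cases above.

```sql
-- Hypothetical tables: the join key has the same declared type on both sides.
CREATE TABLE IF NOT EXISTS users_demo (
  user_id   STRING COMMENT 'join key, stored as string',
  user_name STRING
);

CREATE TABLE IF NOT EXISTS logs_demo (
  user_id    STRING COMMENT 'join key, same type as users_demo.user_id',
  event_time STRING
);

-- The join then compares string to string directly.
SELECT a.user_id, b.event_time
FROM users_demo a
LEFT OUTER JOIN logs_demo b
  ON a.user_id = b.user_id;
```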