Here are some of the causes of a slow map phase. Some of the cases are taken from the web.
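1. Data skew caused by joining columns of different data types
Scenario: the user_id column in the users table is int, while the user_id column in the logs table holds both string-type and int-type values. When the two tables are joined on user_id, the default hash operation distributes rows by the int-type id, so every record with a string-type id is sent to the same Reducer.
Solution: cast the numeric type to string.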
select * from users a left outer join logs b on cast(a.user_id as string) = b.user_id
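Before changing the join, it can help to confirm how many log rows actually carry a non-numeric id. The query below is only a minimal sketch, assuming logs.user_id is stored as string as in the scenario above; the regular expression and the id_kind and row_cnt aliases are illustrative and not part of the original case.

```sql
-- Hypothetical check: split logs rows by whether user_id looks purely numeric.
-- Rows with a NULL user_id fall into the 'non_numeric' group here.
select
  case when b.user_id rlike '^[0-9]+$' then 'numeric' else 'non_numeric' end as id_kind,
  count(*) as row_cnt
from logs b
group by
  case when b.user_id rlike '^[0-9]+$' then 'numeric' else 'non_numeric' end;
```

A large non_numeric count suggests that many rows may end up on a single Reducer when the join compares the keys as int.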
Real-world case (map join is used by default):
In fin_ihotel_ceq_external_ctrip (large table), countryid and cityid are int,
while country_id in dim_ihotel_country_ctrip (small table) and city_id in dim_ihotel_city_ctrip (small table) are both string.
from fin_ihotel_ceq_external_ctrip t1
left join dim_ihotel_country_ctrip t2
on t1.countryid=t2.country_id
left join dim_ihotel_city_ctrip t3
on t1.cityid=t3.city_id and t1.countryid=t2.country_id
Changed to:
from fin_ihotel_ceq_external_ctrip t1
left join dim_ihotel_country_ctrip t2
on cast(t1.countryid as string)=t2.country_id
left join dim_ihotel_city_ctrip t3
on cast(t1.cityid as string)=t3.city_id and cast(t1.countryid as string)=t2.country_id
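Run time dropped from one and a half hours to 3 minutes.
[Note]: going forward, columns that will be used as join keys are best defined as string.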
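One way to spot mismatched join keys before running a query is to inspect the column types on each side of the join. A minimal sketch using the table names from the case above:

```sql
-- List column names and types for each table taking part in the join.
describe fin_ihotel_ceq_external_ctrip;
describe dim_ihotel_country_ctrip;
describe dim_ihotel_city_ctrip;
```

If the two sides of a join condition report different types, an explicit cast on the int side, as in the rewritten query, keeps the comparison in string form.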