一、行列转换
描述:表中记录了各年份各部门的平均绩效考核成绩。
\
表名:t1
\
表结构:
a -- 年份b -- 部门c -- 绩效得分
表内容:
a b c2014 B 92015 A 82014 A 102015 B 7
问题一:多行转多列
问题描述:将上述表内容转为如下输出结果所示:
a col_A col_B2014 10 92015 8 7
参考答案:
selecta,max(case when b="A" then c end) col_A,max(case when b="B" then c end) col_Bfrom t1group by a;
问题二:如何将结果转成源表?(多列转多行)
问题描述:将问题一的结果转成源表,问题一结果表名为t1_2。
参考答案:
selecta,b,cfrom (select a,"A" as b,col_a as c from t1_2union allselect a,"B" as b,col_b as c from t1_2)tmp;
问题三:同一部门会有多个绩效,求多行转多列结果
问题描述:2014年公司组织架构调整,导致部门出现多个绩效,业务及人员不同,无法合并算绩效,源表内容如下:
2014 B 92015 A 82014 A 102015 B 72014 B 6
输出结果如下所示:
a col_A col_B2014 10 6,92015 8 7
参考答案:
selecta,max(case when b="A" then c end) col_A,max(case when b="B" then c end) col_Bfrom (selecta,b,concat_ws(",",collect_set(cast(c as string))) as cfrom t1group by a,b)tmpgroup by a;
二、排名中取他值
表名:t2
\
表字段及内容:
a b c2014 A 32014 B 12014 C 22015 A 42015 D 3
问题一:按a分组取b字段最小时对应的c字段
输出结果如下所示:
a min_c2014 32015 4
参考答案:
selecta,c as min_cfrom(selecta,b,c,row_number() over(partition by a order by b) as rnfrom t2)awhere rn = 1;
问题二:按a分组取b字段排第二时对应的c字段
输出结果如下所示:
a second_c2014 12015 3
参考答案:
selecta,c as second_cfrom(selecta,b,c,row_number() over(partition by a order by b) as rnfrom t2)awhere rn = 2;
问题三:按a分组取b字段最小和最大时对应的c字段
输出结果如下所示:
a min_c max_c2014 3 22015 4 3
参考答案:
selecta,min(if(asc_rn = 1, c, null)) as min_c,max(if(desc_rn = 1, c, null)) as max_cfrom(selecta,b,c,row_number() over(partition by a order by b) as asc_rn,row_number() over(partition by a order by b desc) as desc_rnfrom t2)awhere asc_rn = 1 or desc_rn = 1group by a;
问题四:按a分组取b字段第二小和第二大时对应的c字段
输出结果如下所示:
a min_c max_c2014 1 12015 3 4
参考答案:
selectret.a,max(case when ret.rn_min = 2 then ret.c else null end) as min_c,max(case when ret.rn_max = 2 then ret.c else null end) as max_cfrom (select*,row_number() over(partition by t2.a order by t2.b) as rn_min,row_number() over(partition by t2.a order by t2.b desc) as rn_maxfrom t2) as retwhere ret.rn_min = 2or ret.rn_max = 2group by ret.a;
问题五:按a分组取b字段前两小和前两大时对应的c字段
注意:需保持b字段最小、最大排首位
输出结果如下所示:
a min_c max_c2014 3,1 2,12015 4,3 3,4
参考答案:
selecttmp1.a as a,min_c,max_cfrom(selecta,concat_ws(',', collect_list(c)) as min_cfrom(selecta,b,c,row_number() over(partition by a order by b) as asc_rnfrom t2)awhere asc_rn <= 2group by a)tmp1join(selecta,concat_ws(',', collect_list(c)) as max_cfrom(selecta,b,c,row_number() over(partition by a order by b desc) as desc_rnfrom t2)awhere desc_rn <= 2group by a)tmp2on tmp1.a = tmp2.a;
三、累计求值
表名:t3
\
表字段及内容:
a b c2014 A 32014 B 12014 C 22015 A 42015 D 3
问题一:按a分组按b字段排序,对c累计求和
输出结果如下所示:
a b sum_c2014 A 32014 B 42014 C 62015 A 42015 D 7
参考答案:
selecta,b,c,sum(c) over(partition by a order by b) as sum_cfrom t3;
问题二:按a分组按b字段排序,对c取累计平均值
输出结果如下所示:
a b avg_c2014 A 32014 B 22014 C 22015 A 42015 D 3.5
参考答案:
selecta,b,c,avg(c) over(partition by a order by b) as avg_cfrom t3;
问题三:按a分组按b字段排序,对b取累计排名比例
输出结果如下所示:
a b ratio_c2014 A 0.332014 B 0.672014 C 1.002015 A 0.502015 D 1.00
参考答案:
selecta,b,c,round(row_number() over(partition by a order by b) / (count(c) over(partition by a)),2) as ratio_cfrom t3order by a,b;
问题四:按a分组按b字段排序,对b取累计求和比例
输出结果如下所示:
a b ratio_c2014 A 0.502014 B 0.672014 C 1.002015 A 0.572015 D 1.00
参考答案:
selecta,b,c,round(sum(c) over(partition by a order by b) / (sum(c) over(partition by a)),2) as ratio_cfrom t3order by a,b;
四、窗口大小控制
表名:t4
\
表字段及内容:
a b c2014 A 32014 B 12014 C 22015 A 42015 D 3
问题一:按a分组按b字段排序,对c取前后各一行的和
输出结果如下所示:
a b sum_c2014 A 12014 B 52014 C 12015 A 32015 D 4
参考答案:
selecta,b,lag(c,1,0) over(partition by a order by b)+lead(c,1,0) over(partition by a order by b) as sum_cfrom t4;
问题二:按a分组按b字段排序,对c取平均值
问题描述:前一行与当前行的均值!
输出结果如下所示:
a b avg_c2014 A 32014 B 22014 C 1.52015 A 42015 D 3.5
参考答案:
selecta,b,case when lag_c is null then celse (c+lag_c)/2 end as avg_cfrom(selecta,b,c,lag(c,1) over(partition by a order by b) as lag_cfrom t4)temp;
五、产生连续数值
输出结果如下所示:
12345...100
参考答案:
\
不借助其他任何外表,实现产生连续数值
\
此处给出两种解法,其一:
selectid_start+pos as idfrom(select1 as id_start,1000000 as id_end) m lateral view posexplode(split(space(id_end-id_start), '')) t as pos, val
其二:
selectrow_number() over() as idfrom(select split(space(99), ' ') as x) tlateral viewexplode(x) ex;
那如何产生1至1000000连续数值?
参考答案:
selectrow_number() over() as idfrom(select split(space(999999), ' ') as x) tlateral viewexplode(x) ex;
六、数据扩充与收缩
表名:t6
\
表字段及内容:
a324
问题一:数据扩充
输出结果如下所示:
a b3 3、2、12 2、14 4、3、2、1
参考答案:
selectt.a,concat_ws('、',collect_set(cast(t.rn as string))) as bfrom(selectt6.a,b.rnfrom t6left join(selectrow_number() over() as rnfrom(select split(space(5), ' ') as x) t -- space(5)可根据t6表的最大值灵活调整lateral viewexplode(x) pe) bon 1 = 1where t6.a >= b.rnorder by t6.a, b.rn desc) tgroup by t.a;
问题二:数据扩充,排除偶数
输出结果如下所示:
a b3 3、12 14 3、1
参考答案:
selectt.a,concat_ws('、',collect_set(cast(t.rn as string))) as bfrom(selectt6.a,b.rnfrom t6left join(selectrow_number() over() as rnfrom(select split(space(5), ' ') as x) tlateral viewexplode(x) pe) bon 1 = 1where t6.a >= b.rn and b.rn % 2 = 1order by t6.a, b.rn desc) tgroup by t.a;
问题三:如何处理字符串累计拼接
问题描述:将小于等于a字段的值聚合拼接起来
输出结果如下所示:
a b3 2、32 24 2、3、4
参考答案:
selectt.a,concat_ws('、',collect_set(cast(t.a1 as string))) as bfrom(selectt6.a,b.a1from t6left join(select a as a1from t6) bon 1 = 1where t6.a >= b.a1order by t6.a, b.a1) tgroup by t.a;
问题四:如果a字段有重复,如何实现字符串累计拼接
输出结果如下所示:
a b2 23 2、33 2、3、34 2、3、3、4
参考答案:
selecta,bfrom(selectt.a,t.rn,concat_ws('、',collect_list(cast(t.a1 as string))) as bfrom(selecta.a,a.rn,b.a1from(selecta,row_number() over(order by a ) as rnfrom t6) aleft join(select a as a1,row_number() over(order by a ) as rnfrom t6) bon 1 = 1where a.a >= b.a1 and a.rn >= b.rnorder by a.a, b.a1) tgroup by t.a,t.rnorder by t.a,t.rn) tt;
问题五:数据展开
问题描述:如何将字符串”1-5,16,11-13,9”扩展成”1,2,3,4,5,16,11,12,13,9”?注意顺序不变。
参考答案:
selectconcat_ws(',',collect_list(cast(rn as string)))from(selecta.rn,b.num,b.posfrom(selectrow_number() over() as rnfrom (select split(space(20), ' ') as x) t -- space(20)可灵活调整lateral viewexplode(x) pe) a lateral view outerposexplode(split('1-5,16,11-13,9', ',')) b as pos, numwhere a.rn between cast(split(num, '-')[0] as int) and cast(split(num, '-')[1] as int) or a.rn = numorder by pos, rn) t;
七、合并与拆分
表名:t7
\
表字段及内容:
a b2014 A2014 B2015 B2015 D
问题一:合并
输出结果如下所示:
2014 A、B2015 B、D
参考答案:
selecta,concat_ws('、', collect_set(t.b)) bfrom t7group by a;
问题二:拆分
问题描述:将分组合并的结果拆分出来
参考答案:
selectt.a,dfrom(selecta,concat_ws('、', collect_set(t7.b)) bfrom t7group by a)tlateral viewexplode(split(t.b, '、')) table_tmp as d;
八、模拟循环操作
表名:t8
\
表字段及内容:
a10110101
问题一:如何将字符’1’的位置提取出来
输出结果如下所示:
1,3,42,4
参考答案:
selecta,concat_ws(",",collect_list(cast(index as string))) as resfrom (selecta,index+1 as index,chrfrom (selecta,concat_ws(",",substr(a,1,1),substr(a,2,1),substr(a,3,1),substr(a,-1)) strfrom t8) tmp1lateral view posexplode(split(str,",")) t as index,chrwhere chr = "1") tmp2group by a;
九、不使用distinct或group by去重
表名:t9
\
表字段及内容:
a b c d2014 2016 2014 A2014 2015 2015 B
问题一:不使用distinct或group by去重
输出结果如下所示:
2014 A2016 A2014 B2015 B
参考答案:
selectt2.year,t2.numfrom(select*,row_number() over (partition by t1.year,t1.num) as rank_1from(selecta as year,d as numfrom t9union allselectb as year,d as numfrom t9union allselectc as year,d as numfrom t9)t1)t2where rank_1=1order by num;
十、容器—反转内容
表名:t10
\
表字段及内容:
aAB,CA,BADBD,EA
问题一:反转逗号分隔的数据:改变顺序,内容不变
输出结果如下所示:
BAD,CA,ABEA,BD
参考答案:
selecta,concat_ws(",",collect_list(reverse(str)))from(selecta,strfrom t10lateral view explode(split(reverse(a),",")) t as str) tmp1group by a;
问题二:反转逗号分隔的数据:改变内容,顺序不变
输出结果如下所示:
BA,AC,DABDB,AE
参考答案:
selecta,concat_ws(",",collect_list(reverse(str)))from(selecta,strfrom t10lateral view explode(split(a,",")) t as str) tmp1group by a;
十一、多容器—成对提取数据
表名:t11
\
表字段及内容:
a bA/B 1/3B/C/D 4/5/2
问题一:成对提取数据,字段一一对应
输出结果如下所示:
a bA 1B 3B 4C 5D 2
参考答案:
selecta_inx,b_inxfrom(selecta,b,a_id,a_inx,b_id,b_inxfrom t11lateral view posexplode(split(a,'/')) t as a_id,a_inxlateral view posexplode(split(b,'/')) t as b_id,b_inx) tmpwhere a_id=b_id;
十二、多容器—转多行
表名:t12
\
表字段及内容:
a b c001 A/B 1/3/5002 B/C/D 4/5
问题一:转多行
输出结果如下所示:
a d e001 type_b A001 type_b B001 type_c 1001 type_c 3001 type_c 5002 type_b B002 type_b C002 type_b D002 type_c 4002 type_c 5
参考答案:
selecta,d,efrom(selecta,"type_b" as d,str as efrom t12lateral view explode(split(b,"/")) t as strunion allselecta,"type_c" as d,str as efrom t12lateral view explode(split(c,"/")) t as str) tmporder by a,d;
十三、抽象分组—断点排序
表名:t13
\
表字段及内容:
a b2014 12015 12016 12017 02018 02019 -12020 -12021 -12022 12023 1
问题一:断点排序
输出结果如下所示:
a b c2014 1 12015 1 22016 1 32017 0 12018 0 22019 -1 12020 -1 22021 -1 32022 1 12023 1 2
参考答案:
selecta,b,row_number() over( partition by b,repair_a order by a asc) as c--按照b列和[b的组首]分组,排序from(selecta,b,a-b_rn as repair_a--根据b列值出现的次序,修复a列值为b首次出现的a列值,称为b的[组首]from(selecta,b,row_number() over( partition by b order by a asc ) as b_rn--按b列分组,按a列排序,得到b列各值出现的次序from t13)tmp1)tmp2--注意,如果不同的b列值,可能出现同样的组首值,但组首值需要和a列值 一并参与分组,故并不影响排序。order by a asc;
十四、业务逻辑的分类与抽象—时效
日期表:d_date
\
表字段及内容:
date_id is_work2017-04-13 12017-04-14 12017-04-15 02017-04-16 02017-04-17 1
工作日:周一至周五09:30-18:30
客户申请表:t14
\
表字段及内容:
a b c1 申请 2017-04-14 18:03:001 通过 2017-04-17 09:43:002 申请 2017-04-13 17:02:002 通过 2017-04-15 09:42:00
问题一:计算上表中从申请到通过占用的工作时长
输出结果如下所示:
a d1 0.67h2 10.67h
参考答案:
selecta,round(sum(diff)/3600,2) as dfrom (selecta,apply_time,pass_time,dates,rn,ct,is_work,case when is_work=1 and rn=1 then unix_timestamp(concat(dates,' 18:30:00'),'yyyy-MM-dd HH:mm:ss')-unix_timestamp(apply_time,'yyyy-MM-dd HH:mm:ss')when is_work=0 then 0when is_work=1 and rn=ct then unix_timestamp(pass_time,'yyyy-MM-dd HH:mm:ss')-unix_timestamp(concat(dates,' 09:30:00'),'yyyy-MM-dd HH:mm:ss')when is_work=1 and rn!=ct then 9*3600end difffrom (selecta,apply_time,pass_time,time_diff,day_diff,rn,ct,date_add(start,rn-1) datesfrom (selecta,apply_time,pass_time,time_diff,day_diff,strs,start,row_number() over(partition by a) as rn,count(*) over(partition by a) as ctfrom (selecta,apply_time,pass_time,time_diff,day_diff,substr(repeat(concat(substr(apply_time,1,10),','),day_diff+1),1,11*(day_diff+1)-1) strsfrom (selecta,apply_time,pass_time,unix_timestamp(pass_time,'yyyy-MM-dd HH:mm:ss')-unix_timestamp(apply_time,'yyyy-MM-dd HH:mm:ss') time_diff,datediff(substr(pass_time,1,10),substr(apply_time,1,10)) day_difffrom (selecta,max(case when b='申请' then c end) apply_time,max(case when b='通过' then c end) pass_timefrom t14group by a) tmp1) tmp2) tmp3lateral view explode(split(strs,",")) t as start) tmp4) tmp5join d_dateon tmp5.dates = d_date.date_id) tmp6group by a;
十五、时间序列—进度及剩余
表名:t15
\
表字段及内容:
date_id is_work2017-07-30 02017-07-31 12017-08-01 12017-08-02 12017-08-03 12017-08-04 12017-08-05 02017-08-06 02017-08-07 1
问题一:求每天的累计周工作日,剩余周工作日
输出结果如下所示:
date_id week_to_work week_left_work2017-07-31 1 42017-08-01 2 32017-08-02 3 22017-08-03 4 12017-08-04 5 02017-08-05 5 02017-08-06 5 0
参考答案:
\
此处给出两种解法,其一:
selectdate_id,case date_format(date_id,'u')when 1 then 1when 2 then 2when 3 then 3when 4 then 4when 5 then 5when 6 then 5when 7 then 5end as week_to_work,case date_format(date_id,'u')when 1 then 4when 2 then 3when 3 then 2when 4 then 1when 5 then 0when 6 then 0when 7 then 0end as week_to_workfrom t15
其二:
selectdate_id,week_to_work,week_sum_work-week_to_work as week_left_workfrom(selectdate_id,sum(is_work) over(partition by year,week order by date_id) as week_to_work,sum(is_work) over(partition by year,week) as week_sum_workfrom(selectdate_id,is_work,year(date_id) as year,weekofyear(date_id) as weekfrom t15) ta) tb order by date_id;
十六、时间序列—构造日期
问题一:直接使用SQL实现一张日期维度表,包含以下字段:
date string 日期d_week string 年内第几周weeks int 周几w_start string 周开始日w_end string 周结束日d_month int 第几月m_start string 月开始日m_end string 月结束日d_quarter int 第几季q_start string 季开始日q_end string 季结束日d_year int 年份y_start string 年开始日y_end string 年结束日
参考答案:
drop table if exists dim_date;create table if not exists dim_date(`date` string comment '日期',d_week string comment '年内第几周',weeks string comment '周几',w_start string comment '周开始日',w_end string comment '周结束日',d_month string comment '第几月',m_start string comment '月开始日',m_end string comment '月结束日',d_quarter int comment '第几季',q_start string comment '季开始日',q_end string comment '季结束日',d_year int comment '年份',y_start string comment '年开始日',y_end string comment '年结束日');--自然月: 指每月的1号到那个月的月底,它是按照阳历来计算的。就是从每月1号到月底,不管这个月有30天,31天,29天或者28天,都算是一个自然月。insert overwrite table dim_dateselect `date`, d_week --年内第几周, case weekidwhen 0 then '周日'when 1 then '周一'when 2 then '周二'when 3 then '周三'when 4 then '周四'when 5 then '周五'when 6 then '周六'end as weeks -- 周, date_add(next_day(`date`,'MO'),-7) as w_start --周一, date_add(next_day(`date`,'MO'),-1) as w_end -- 周日_end-- 月份日期, concat('第', monthid, '月') as d_month, m_start, m_end-- 季节, quarterid as d_quart, concat(d_year, '-', substr(concat('0', (quarterid - 1) * 3 + 1), -2), '-01') as q_start --季开始日, date_sub(concat(d_year, '-', substr(concat('0', (quarterid) * 3 + 1), -2), '-01'), 1) as q_end --季结束日-- 年, d_year, y_start, y_endfrom (select `date`, pmod(datediff(`date`, '2012-01-01'), 7) as weekid --获取周几, cast(substr(`date`, 6, 2) as int) as monthid --获取月份, casewhen cast(substr(`date`, 6, 2) as int) <= 3 then 1when cast(substr(`date`, 6, 2) as int) <= 6 then 2when cast(substr(`date`, 6, 2) as int) <= 9 then 3when cast(substr(`date`, 6, 2) as int) <= 12 then 4end as quarterid --获取季节 可以直接使用 quarter(`date`), substr(`date`, 1, 4) as d_year -- 获取年份, trunc(`date`, 'YYYY') as y_start --年开始日, date_sub(trunc(add_months(`date`, 12), 'YYYY'), 1) as y_end --年结束日, date_sub(`date`, dayofmonth(`date`) - 1) as m_start --当月第一天, last_day(date_sub(`date`, dayofmonth(`date`) - 1)) m_end --当月最后一天, weekofyear(`date`) as d_week --年内第几周from (-- '2021-04-01'是开始日期, '2022-03-31'是截止日期select date_add('2021-04-01', t0.pos) as `date`from (select posexplode(split(repeat('o', datediff(from_unixtime(unix_timestamp('2022-03-31', 'yyyy-mm-dd'),'yyyy-mm-dd'),'2021-04-01')), 'o'))) t0) t1) t2;
十七、时间序列—构造累积日期
表名:t17
\
表字段及内容:
date_id2017-08-012017-08-022017-08-03
问题一:每一日期,都扩展成月初至当天
输出结果如下所示:
date_id date_to_day2017-08-01 2017-08-012017-08-02 2017-08-012017-08-02 2017-08-022017-08-03 2017-08-012017-08-03 2017-08-022017-08-03 2017-08-03
这种累积相关的表,常做桥接表。
参考答案:
selectdate_id,date_add(date_start_id,pos) as date_to_dayfrom(selectdate_id,date_sub(date_id,dayofmonth(date_id)-1) as date_start_idfrom t17) m lateral viewposexplode(split(space(datediff(from_unixtime(unix_timestamp(date_id,'yyyy-MM-dd')),from_unixtime(unix_timestamp(date_start_id,'yyyy-MM-dd')))), '')) t as pos, val;
十八、时间序列—构造连续日期
表名:t18
\
表字段及内容:
a b c101 2018-01-01 10101 2018-01-03 20101 2018-01-06 40102 2018-01-02 20102 2018-01-04 30102 2018-01-07 60
问题一:构造连续日期
问题描述:将表中数据的b字段扩充至范围[2018-01-01, 2018-01-07],并累积对c求和。
\
b字段的值是较稀疏的。
输出结果如下所示:
a b c d101 2018-01-01 10 10101 2018-01-02 0 10101 2018-01-03 20 30101 2018-01-04 0 30101 2018-01-05 0 30101 2018-01-06 40 70101 2018-01-07 0 70102 2018-01-01 0 0102 2018-01-02 20 20102 2018-01-03 0 20102 2018-01-04 30 50102 2018-01-05 0 50102 2018-01-06 0 50102 2018-01-07 60 110
参考答案:
selecta,b,c,sum(c) over(partition by a order by b) as dfrom(selectt1.a,t1.b,casewhen t18.b is not null then t18.celse 0end as cfrom(selecta,date_add(s,pos) as bfrom(selecta,'2018-01-01' as s,'2018-01-07' as rfrom (select a from t18 group by a) ta) m lateral viewposexplode(split(space(datediff(from_unixtime(unix_timestamp(r,'yyyy-MM-dd')),from_unixtime(unix_timestamp(s,'yyyy-MM-dd')))), '')) t as pos, val) t1left join t18on t1.a = t18.a and t1.b = t18.b) ts;
十九、时间序列—取多个字段最新的值
表名:t19
\
表字段及内容:
date_id a b c2014 AB 12 bc2015 232016 d2017 BC
问题一:如何一并取出最新日期
输出结果如下所示:
date_a a date_b b date_c c2017 BC 2015 23 2016 d
参考答案:
\
此处给出三种解法,其一:
SELECT max(CASE WHEN rn_a = 1 THEN date_id else 0 END) AS date_a,max(CASE WHEN rn_a = 1 THEN a else null END) AS a,max(CASE WHEN rn_b = 1 THEN date_id else 0 END) AS date_b,max(CASE WHEN rn_b = 1 THEN b else NULL END) AS b,max(CASE WHEN rn_c = 1 THEN date_id else 0 END) AS date_c,max(CASE WHEN rn_c = 1 THEN c else null END) AS cFROM (SELECT date_id,a,b,c--对每列上不为null的值 的 日期 进行排序,row_number()OVER( PARTITION BY 1 ORDER BY CASE WHEN a IS NULL THEN 0 ELSE date_id END DESC) AS rn_a,row_number()OVER(PARTITION BY 1 ORDER BY CASE WHEN b IS NULL THEN 0 ELSE date_id END DESC) AS rn_b,row_number()OVER(PARTITION BY 1 ORDER BY CASE WHEN c IS NULL THEN 0 ELSE date_id END DESC) AS rn_cFROM t19) tWHERE t.rn_a = 1OR t.rn_b = 1OR t.rn_c = 1;
其二:
SELECTa.date_id,a.a,b.date_id,b.b,c.date_id,c.cFROM(SELECTt.date_id,t.aFROM(SELECTt.date_id,t.a,t.b,t.cFROM t19 t INNER JOIN t19 t1 ON t.date_id = t1.date_id AND t.a IS NOT NULL) tORDER BY t.date_id DESCLIMIT 1) aLEFT JOIN(SELECTt.date_id,t.bFROM(SELECTt.date_id,t.bFROM t19 t INNER JOIN t19 t1 ON t.date_id = t1.date_id AND t.b IS NOT NULL) tORDER BY t.date_id DESCLIMIT 1) b ON 1 = 1LEFT JOIN(SELECTt.date_id,t.cFROM(SELECTt.date_id,t.cFROM t19 t INNER JOIN t19 t1 ON t.date_id = t1.date_id AND t.c IS NOT NULL) tORDER BY t.date_id DESCLIMIT 1) cON 1 = 1;
其三:
select*from(select t1.date_id as date_a,t1.a from (select t1.date_id,t1.a from t19 t1 where t1.a is not null) t1inner join (select max(t1.date_id) as date_id from t19 t1 where t1.a is not null) t2on t1.date_id=t2.date_id) t1cross join(select t1.date_b,t1.b from (select t1.date_id as date_b,t1.b from t19 t1 where t1.b is not null) t1inner join (select max(t1.date_id) as date_id from t19 t1 where t1.b is not null)t2on t1.date_b=t2.date_id) t2cross join(select t1.date_c,t1.c from (select t1.date_id as date_c,t1.c from t19 t1 where t1.c is not null) t1inner join (select max(t1.date_id) as date_id from t19 t1 where t1.c is not null)t2on t1.date_c=t2.date_id) t3;
二十、时间序列—补全数据
表名:t20
\
表字段及内容:
date_id a b c2014 AB 12 bc2015 232016 d2017 BC
问题一:如何使用最新数据补全表格
输出结果如下所示:
date_id a b c2014 AB 12 bc2015 AB 23 bc2016 AB 23 d2017 BC 23 d
参考答案:
selectdate_id,first_value(a) over(partition by aa order by date_id) as a,first_value(b) over(partition by bb order by date_id) as b,first_value(c) over(partition by cc order by date_id) as cfrom(selectdate_id,a,b,c,count(a) over(order by date_id) as aa,count(b) over(order by date_id) as bb,count(c) over(order by date_id) as ccfrom t20)tmp1;
二十一、时间序列—取最新完成状态的前一个状态
表名:t21
\
表字段及内容:
date_id a b2014 1 A2015 1 B2016 1 A2017 1 B2013 2 A2014 2 B2015 2 A2014 3 A2015 3 A2016 3 B2017 3 A
问题一:取最新完成状态的前一个状态
输出结果如下所示:
date_id a b2016 1 A2013 2 A2015 3 A
参考答案:
\
此处给出两种解法,其一:
selectt21.date_id,t21.a,t21.bfrom(selectmax(date_id) date_id,afromt21whereb = 'B'group bya) t1inner join t21 on t1.date_id -1 = t21.date_idand t1.a = t21.a;
其二:
selectnext_date_id as date_id,a,next_b as bfrom(select*,min(nk) over(partition by a,b) as minbfrom(select*,row_number() over(partition by a order by date_id desc) nk,lead(date_id) over(partition by a order by date_id desc) next_date_id,lead(b) over(partition by a order by date_id desc) next_bfrom(select * from t21) t) t) twhere minb = nk and b = 'B';
问题二:如何将完成状态的过程合并
输出结果如下所示:
a b_merge1 A、B、A、B2 A、B3 A、A、B
参考答案:
selecta,collect_list(b) as bfrom(select*,min(if(b = 'B',nk,null)) over(partition by a) as minbfrom(select*,row_number() over(partition by a order by date_id desc) nkfrom(select * from t21) t) t) twhere nk >= minbgroup by a;
二十二、非等值连接—范围匹配
表f是事实表,表d是匹配表,在hive中如何将匹配表中的值关联到事实表中?
表d相当于拉链过的变化维,但日期范围可能是不全的。
表f:
date_id p_id2017 C2018 B2019 A2013 C
表d:
d_start d_end p_id p_value2016 2018 A 12016 2018 B 22008 2009 C 42010 2015 C 3
问题一:范围匹配
输出结果如下所示:
date_id p_id p_value2017 C null2018 B 22019 A null2013 C 3
**参考答案:
\
此处给出两种解法,其一:
selectf.date_id,f.p_id,A.p_valuefrom fleft join(selectdate_id,p_id,p_valuefrom(selectf.date_id,f.p_id,d.p_valuefrom fleft join d on f.p_id = d.p_idwhere f.date_id >= d.d_start and f.date_id <= d.d_end)A)AON f.date_id = A.date_id;
其二:
selectdate_id,p_id,flag as p_valuefrom (selectf.date_id,f.p_id,d.d_start,d.d_end,d.p_value,if(f.date_id between d.d_start and d.d_end,d.p_value,null) flag,max(d.d_end) over(partition by date_id) max_endfrom fleft join don f.p_id = d.p_id) tmpwhere d_end = max_end;
二十三、非等值连接—最近匹配
表t23_1和表t23_2通过a和b关联时,有相等的取相等的值匹配,不相等时每一个a的值在b中找差值最小的来匹配。
t23_1和t23_2为两个班的成绩单,t23_1班的每个学生成绩在t23_2班中找出成绩最接近的成绩。
表t23_1:a中无重复值
a1245810
表t23_2:b中无重复值
b2371113
问题一:单向最近匹配
输出结果如下所示:
\
注意:b的值可能会被丢弃
a b1 22 24 35 35 78 710 11
参考答案:
select*from(selectttt1.a,ttt1.bfrom(selecttt1.a,t23_2.b,dense_rank() over(partition by tt1.a order by abs(tt1.a-t23_2.b)) as drfrom(selectt23_1.afrom t23_1left join t23_2 on t23_1.a=t23_2.bwhere t23_2.b is null) tt1cross join t23_2) ttt1where ttt1.dr=1union allselectt23_1.a,t23_2.bfrom t23_1inner join t23_2 on t23_1.a=t23_2.b) result_torder by result_t.a;
二十四、N指标—累计去重
假设表A为事件流水表,客户当天有一条记录则视为当天活跃。
表A:
time_id user_id2018-01-01 10:00:00 0012018-01-01 11:03:00 0022018-01-01 13:18:00 0012018-01-02 08:34:00 0042018-01-02 10:08:00 0022018-01-02 10:40:00 0032018-01-02 14:21:00 0022018-01-02 15:39:00 0042018-01-03 08:34:00 0052018-01-03 10:08:00 0032018-01-03 10:40:00 0012018-01-03 14:21:00 005
问题一:累计去重
输出结果如下所示:
日期 当日活跃人数 月累计活跃人数_截至当日date_id user_cnt_act user_cnt_act_month2018-01-01 2 22018-01-02 3 42018-01-03 3 5
参考答案:
SELECT tt1.date_id,tt2.user_cnt_act,tt1.user_cnt_act_monthFROM( -- ④ 按照t.date_id分组求出user_cnt_act_month,得到tt1SELECT t.date_id,COUNT(user_id) AS user_cnt_act_monthFROM( -- ③ 表a和表b进行笛卡尔积,按照a.date_id,b.user_id分组,保证截止到当日的用户唯一,得出表t。SELECT a.date_id,b.user_idFROM( -- ① 按照日期分组,取出date_id字段当主表的维度字段 得出表aSELECT from_unixtime(unix_timestamp(time_id),'yyyy-MM-dd') AS date_idFROM test.temp_tanhaidi_20211213_1GROUP BY from_unixtime(unix_timestamp(time_id),'yyyy-MM-dd')) aINNER JOIN( -- ② 按照date_id、user_id分组,保证每天每个用户只有一条记录,得出表bSELECT from_unixtime(unix_timestamp(time_id),'yyyy-MM-dd') AS date_id,user_idFROM test.temp_tanhaidi_20211213_1GROUP BY from_unixtime(unix_timestamp(time_id),'yyyy-MM-dd'),user_id) bON 1 = 1WHERE a.date_id >= b.date_idGROUP BY a.date_id,b.user_id) tGROUP BY t.date_id) tt1LEFT JOIN( -- ⑥ 按照date_id分组求出user_cnt_act,得到tt2SELECT date_id,COUNT(user_id) AS user_cnt_actFROM( -- ⑤ 按照日期分组,取出date_id字段当主表的维度字段 得出表aSELECT from_unixtime(unix_timestamp(time_id),'yyyy-MM-dd') AS date_id,user_idFROM test.temp_tanhaidi_20211213_1GROUP BY from_unixtime(unix_timestamp(time_id),'yyyy-MM-dd'),user_id) aGROUP BY date_id) tt2ON tt2.date_id = tt1.date_id
