最强最全面的大数据SQL经典面试题

一、行列转换

描述:表中记录了各年份各部门的平均绩效考核成绩。
\
表名:t1
\
表结构:

  1. a -- 年份
  2. b -- 部门
  3. c -- 绩效得分

表内容

  1. a b c
  2. 2014 B 9
  3. 2015 A 8
  4. 2014 A 10
  5. 2015 B 7

问题一:多行转多列

问题描述:将上述表内容转为如下输出结果所示:

  1. a col_A col_B
  2. 2014 10 9
  3. 2015 8 7

参考答案

  1. select
  2. a,
  3. max(case when b="A" then c end) col_A,
  4. max(case when b="B" then c end) col_B
  5. from t1
  6. group by a;

问题二:如何将结果转成源表?(多列转多行)

问题描述:将问题一的结果转成源表,问题一结果表名为t1_2
参考答案

  1. select
  2. a,
  3. b,
  4. c
  5. from (
  6. select a,"A" as b,col_a as c from t1_2
  7. union all
  8. select a,"B" as b,col_b as c from t1_2
  9. )tmp;

问题三:同一部门会有多个绩效,求多行转多列结果

问题描述:2014年公司组织架构调整,导致部门出现多个绩效,业务及人员不同,无法合并算绩效,源表内容如下:

  1. 2014 B 9
  2. 2015 A 8
  3. 2014 A 10
  4. 2015 B 7
  5. 2014 B 6

输出结果如下所示

  1. a col_A col_B
  2. 2014 10 6,9
  3. 2015 8 7

参考答案:

  1. select
  2. a,
  3. max(case when b="A" then c end) col_A,
  4. max(case when b="B" then c end) col_B
  5. from (
  6. select
  7. a,
  8. b,
  9. concat_ws(",",collect_set(cast(c as string))) as c
  10. from t1
  11. group by a,b
  12. )tmp
  13. group by a;

二、排名中取他值

表名t2
\
表字段及内容

  1. a b c
  2. 2014 A 3
  3. 2014 B 1
  4. 2014 C 2
  5. 2015 A 4
  6. 2015 D 3

问题一:按a分组取b字段最小时对应的c字段

输出结果如下所示

  1. a min_c
  2. 2014 3
  3. 2015 4

参考答案:

  1. select
  2. a,
  3. c as min_c
  4. from
  5. (
  6. select
  7. a,
  8. b,
  9. c,
  10. row_number() over(partition by a order by b) as rn
  11. from t2
  12. )a
  13. where rn = 1;

问题二:按a分组取b字段排第二时对应的c字段

输出结果如下所示

  1. a second_c
  2. 2014 1
  3. 2015 3

参考答案

  1. select
  2. a,
  3. c as second_c
  4. from
  5. (
  6. select
  7. a,
  8. b,
  9. c,
  10. row_number() over(partition by a order by b) as rn
  11. from t2
  12. )a
  13. where rn = 2;

问题三:按a分组取b字段最小和最大时对应的c字段

输出结果如下所示

  1. a min_c max_c
  2. 2014 3 2
  3. 2015 4 3

参考答案:

  1. select
  2. a,
  3. min(if(asc_rn = 1, c, null)) as min_c,
  4. max(if(desc_rn = 1, c, null)) as max_c
  5. from
  6. (
  7. select
  8. a,
  9. b,
  10. c,
  11. row_number() over(partition by a order by b) as asc_rn,
  12. row_number() over(partition by a order by b desc) as desc_rn
  13. from t2
  14. )a
  15. where asc_rn = 1 or desc_rn = 1
  16. group by a;

问题四:按a分组取b字段第二小和第二大时对应的c字段

输出结果如下所示

  1. a min_c max_c
  2. 2014 1 1
  3. 2015 3 4

参考答案

  1. select
  2. ret.a
  3. ,max(case when ret.rn_min = 2 then ret.c else null end) as min_c
  4. ,max(case when ret.rn_max = 2 then ret.c else null end) as max_c
  5. from (
  6. select
  7. *
  8. ,row_number() over(partition by t2.a order by t2.b) as rn_min
  9. ,row_number() over(partition by t2.a order by t2.b desc) as rn_max
  10. from t2
  11. ) as ret
  12. where ret.rn_min = 2
  13. or ret.rn_max = 2
  14. group by ret.a;

问题五:按a分组取b字段前两小和前两大时对应的c字段

注意:需保持b字段最小、最大排首位
输出结果如下所示

  1. a min_c max_c
  2. 2014 3,1 2,1
  3. 2015 4,3 3,4

参考答案

  1. select
  2. tmp1.a as a,
  3. min_c,
  4. max_c
  5. from
  6. (
  7. select
  8. a,
  9. concat_ws(',', collect_list(c)) as min_c
  10. from
  11. (
  12. select
  13. a,
  14. b,
  15. c,
  16. row_number() over(partition by a order by b) as asc_rn
  17. from t2
  18. )a
  19. where asc_rn <= 2
  20. group by a
  21. )tmp1
  22. join
  23. (
  24. select
  25. a,
  26. concat_ws(',', collect_list(c)) as max_c
  27. from
  28. (
  29. select
  30. a,
  31. b,
  32. c,
  33. row_number() over(partition by a order by b desc) as desc_rn
  34. from t2
  35. )a
  36. where desc_rn <= 2
  37. group by a
  38. )tmp2
  39. on tmp1.a = tmp2.a;

三、累计求值

表名t3
\
表字段及内容

  1. a b c
  2. 2014 A 3
  3. 2014 B 1
  4. 2014 C 2
  5. 2015 A 4
  6. 2015 D 3

问题一:按a分组按b字段排序,对c累计求和

输出结果如下所示

  1. a b sum_c
  2. 2014 A 3
  3. 2014 B 4
  4. 2014 C 6
  5. 2015 A 4
  6. 2015 D 7

参考答案

  1. select
  2. a,
  3. b,
  4. c,
  5. sum(c) over(partition by a order by b) as sum_c
  6. from t3;

问题二:按a分组按b字段排序,对c取累计平均值

输出结果如下所示

  1. a b avg_c
  2. 2014 A 3
  3. 2014 B 2
  4. 2014 C 2
  5. 2015 A 4
  6. 2015 D 3.5

参考答案

  1. select
  2. a,
  3. b,
  4. c,
  5. avg(c) over(partition by a order by b) as avg_c
  6. from t3;

问题三:按a分组按b字段排序,对b取累计排名比例

输出结果如下所示

  1. a b ratio_c
  2. 2014 A 0.33
  3. 2014 B 0.67
  4. 2014 C 1.00
  5. 2015 A 0.50
  6. 2015 D 1.00

参考答案

  1. select
  2. a,
  3. b,
  4. c,
  5. round(row_number() over(partition by a order by b) / (count(c) over(partition by a)),2) as ratio_c
  6. from t3
  7. order by a,b;

问题四:按a分组按b字段排序,对b取累计求和比例

输出结果如下所示

  1. a b ratio_c
  2. 2014 A 0.50
  3. 2014 B 0.67
  4. 2014 C 1.00
  5. 2015 A 0.57
  6. 2015 D 1.00

参考答案

  1. select
  2. a,
  3. b,
  4. c,
  5. round(sum(c) over(partition by a order by b) / (sum(c) over(partition by a)),2) as ratio_c
  6. from t3
  7. order by a,b;

四、窗口大小控制

表名t4
\
表字段及内容

  1. a b c
  2. 2014 A 3
  3. 2014 B 1
  4. 2014 C 2
  5. 2015 A 4
  6. 2015 D 3

问题一:按a分组按b字段排序,对c取前后各一行的和

输出结果如下所示

  1. a b sum_c
  2. 2014 A 1
  3. 2014 B 5
  4. 2014 C 1
  5. 2015 A 3
  6. 2015 D 4

参考答案

  1. select
  2. a,
  3. b,
  4. lag(c,1,0) over(partition by a order by b)+lead(c,1,0) over(partition by a order by b) as sum_c
  5. from t4;

问题二:按a分组按b字段排序,对c取平均值

问题描述:前一行与当前行的均值!
输出结果如下所示

  1. a b avg_c
  2. 2014 A 3
  3. 2014 B 2
  4. 2014 C 1.5
  5. 2015 A 4
  6. 2015 D 3.5

参考答案

  1. select
  2. a,
  3. b,
  4. case when lag_c is null then c
  5. else (c+lag_c)/2 end as avg_c
  6. from
  7. (
  8. select
  9. a,
  10. b,
  11. c,
  12. lag(c,1) over(partition by a order by b) as lag_c
  13. from t4
  14. )temp;

五、产生连续数值

输出结果如下所示

  1. 1
  2. 2
  3. 3
  4. 4
  5. 5
  6. ...
  7. 100

参考答案
\
不借助其他任何外表,实现产生连续数值
\
此处给出两种解法,其一:

  1. select
  2. id_start+pos as id
  3. from(
  4. select
  5. 1 as id_start,
  6. 1000000 as id_end
  7. ) m lateral view posexplode(split(space(id_end-id_start), '')) t as pos, val

其二:

  1. select
  2. row_number() over() as id
  3. from
  4. (select split(space(99), ' ') as x) t
  5. lateral view
  6. explode(x) ex;

那如何产生1至1000000连续数值?
参考答案

  1. select
  2. row_number() over() as id
  3. from
  4. (select split(space(999999), ' ') as x) t
  5. lateral view
  6. explode(x) ex;

六、数据扩充与收缩

表名t6
\
表字段及内容

  1. a
  2. 3
  3. 2
  4. 4

问题一:数据扩充

输出结果如下所示

  1. a b
  2. 3 321
  3. 2 21
  4. 4 4321

参考答案

  1. select
  2. t.a,
  3. concat_ws('、',collect_set(cast(t.rn as string))) as b
  4. from
  5. (
  6. select
  7. t6.a,
  8. b.rn
  9. from t6
  10. left join
  11. (
  12. select
  13. row_number() over() as rn
  14. from
  15. (select split(space(5), ' ') as x) t -- space(5)可根据t6表的最大值灵活调整
  16. lateral view
  17. explode(x) pe
  18. ) b
  19. on 1 = 1
  20. where t6.a >= b.rn
  21. order by t6.a, b.rn desc
  22. ) t
  23. group by t.a;

问题二:数据扩充,排除偶数

输出结果如下所示

  1. a b
  2. 3 31
  3. 2 1
  4. 4 31

参考答案

  1. select
  2. t.a,
  3. concat_ws('、',collect_set(cast(t.rn as string))) as b
  4. from
  5. (
  6. select
  7. t6.a,
  8. b.rn
  9. from t6
  10. left join
  11. (
  12. select
  13. row_number() over() as rn
  14. from
  15. (select split(space(5), ' ') as x) t
  16. lateral view
  17. explode(x) pe
  18. ) b
  19. on 1 = 1
  20. where t6.a >= b.rn and b.rn % 2 = 1
  21. order by t6.a, b.rn desc
  22. ) t
  23. group by t.a;

问题三:如何处理字符串累计拼接

问题描述:将小于等于a字段的值聚合拼接起来
输出结果如下所示

  1. a b
  2. 3 23
  3. 2 2
  4. 4 234

参考答案

  1. select
  2. t.a,
  3. concat_ws('、',collect_set(cast(t.a1 as string))) as b
  4. from
  5. (
  6. select
  7. t6.a,
  8. b.a1
  9. from t6
  10. left join
  11. (
  12. select a as a1
  13. from t6
  14. ) b
  15. on 1 = 1
  16. where t6.a >= b.a1
  17. order by t6.a, b.a1
  18. ) t
  19. group by t.a;

问题四:如果a字段有重复,如何实现字符串累计拼接

输出结果如下所示

  1. a b
  2. 2 2
  3. 3 23
  4. 3 233
  5. 4 2334

参考答案

  1. select
  2. a,
  3. b
  4. from
  5. (
  6. select
  7. t.a,
  8. t.rn,
  9. concat_ws('、',collect_list(cast(t.a1 as string))) as b
  10. from
  11. (
  12. select
  13. a.a,
  14. a.rn,
  15. b.a1
  16. from
  17. (
  18. select
  19. a,
  20. row_number() over(order by a ) as rn
  21. from t6
  22. ) a
  23. left join
  24. (
  25. select a as a1,
  26. row_number() over(order by a ) as rn
  27. from t6
  28. ) b
  29. on 1 = 1
  30. where a.a >= b.a1 and a.rn >= b.rn
  31. order by a.a, b.a1
  32. ) t
  33. group by t.a,t.rn
  34. order by t.a,t.rn
  35. ) tt;

问题五:数据展开

问题描述:如何将字符串”1-5,16,11-13,9”扩展成”1,2,3,4,5,16,11,12,13,9”?注意顺序不变。
参考答案

  1. select
  2. concat_ws(',',collect_list(cast(rn as string)))
  3. from
  4. (
  5. select
  6. a.rn,
  7. b.num,
  8. b.pos
  9. from
  10. (
  11. select
  12. row_number() over() as rn
  13. from (select split(space(20), ' ') as x) t -- space(20)可灵活调整
  14. lateral view
  15. explode(x) pe
  16. ) a lateral view outer
  17. posexplode(split('1-5,16,11-13,9', ',')) b as pos, num
  18. where a.rn between cast(split(num, '-')[0] as int) and cast(split(num, '-')[1] as int) or a.rn = num
  19. order by pos, rn
  20. ) t;

七、合并与拆分

表名t7
\
表字段及内容

  1. a b
  2. 2014 A
  3. 2014 B
  4. 2015 B
  5. 2015 D

问题一:合并

输出结果如下所示

  1. 2014 AB
  2. 2015 BD

参考答案:

  1. select
  2. a,
  3. concat_ws('、', collect_set(t.b)) b
  4. from t7
  5. group by a;

问题二:拆分

问题描述:将分组合并的结果拆分出来
参考答案

  1. select
  2. t.a,
  3. d
  4. from
  5. (
  6. select
  7. a,
  8. concat_ws('、', collect_set(t7.b)) b
  9. from t7
  10. group by a
  11. )t
  12. lateral view
  13. explode(split(t.b, '、')) table_tmp as d;

八、模拟循环操作

表名t8
\
表字段及内容

  1. a
  2. 1011
  3. 0101

问题一:如何将字符’1’的位置提取出来

输出结果如下所示:

  1. 1,3,4
  2. 2,4

参考答案

  1. select
  2. a,
  3. concat_ws(",",collect_list(cast(index as string))) as res
  4. from (
  5. select
  6. a,
  7. index+1 as index,
  8. chr
  9. from (
  10. select
  11. a,
  12. concat_ws(",",substr(a,1,1),substr(a,2,1),substr(a,3,1),substr(a,-1)) str
  13. from t8
  14. ) tmp1
  15. lateral view posexplode(split(str,",")) t as index,chr
  16. where chr = "1"
  17. ) tmp2
  18. group by a;

九、不使用distinct或group by去重

表名t9
\
表字段及内容

  1. a b c d
  2. 2014 2016 2014 A
  3. 2014 2015 2015 B

问题一:不使用distinct或group by去重

输出结果如下所示

  1. 2014 A
  2. 2016 A
  3. 2014 B
  4. 2015 B

参考答案

  1. select
  2. t2.year
  3. ,t2.num
  4. from
  5. (
  6. select
  7. *
  8. ,row_number() over (partition by t1.year,t1.num) as rank_1
  9. from
  10. (
  11. select
  12. a as year,
  13. d as num
  14. from t9
  15. union all
  16. select
  17. b as year,
  18. d as num
  19. from t9
  20. union all
  21. select
  22. c as year,
  23. d as num
  24. from t9
  25. )t1
  26. )t2
  27. where rank_1=1
  28. order by num;

十、容器—反转内容

表名t10
\
表字段及内容

  1. a
  2. AB,CA,BAD
  3. BD,EA

问题一:反转逗号分隔的数据:改变顺序,内容不变

输出结果如下所示

  1. BAD,CA,AB
  2. EA,BD

参考答案

  1. select
  2. a,
  3. concat_ws(",",collect_list(reverse(str)))
  4. from
  5. (
  6. select
  7. a,
  8. str
  9. from t10
  10. lateral view explode(split(reverse(a),",")) t as str
  11. ) tmp1
  12. group by a;

问题二:反转逗号分隔的数据:改变内容,顺序不变

输出结果如下所示

  1. BA,AC,DAB
  2. DB,AE

参考答案

  1. select
  2. a,
  3. concat_ws(",",collect_list(reverse(str)))
  4. from
  5. (
  6. select
  7. a,
  8. str
  9. from t10
  10. lateral view explode(split(a,",")) t as str
  11. ) tmp1
  12. group by a;

十一、多容器—成对提取数据

表名t11
\
表字段及内容

  1. a b
  2. A/B 1/3
  3. B/C/D 4/5/2

问题一:成对提取数据,字段一一对应

输出结果如下所示

  1. a b
  2. A 1
  3. B 3
  4. B 4
  5. C 5
  6. D 2

参考答案:

  1. select
  2. a_inx,
  3. b_inx
  4. from
  5. (
  6. select
  7. a,
  8. b,
  9. a_id,
  10. a_inx,
  11. b_id,
  12. b_inx
  13. from t11
  14. lateral view posexplode(split(a,'/')) t as a_id,a_inx
  15. lateral view posexplode(split(b,'/')) t as b_id,b_inx
  16. ) tmp
  17. where a_id=b_id;

十二、多容器—转多行

表名t12
\
表字段及内容

  1. a b c
  2. 001 A/B 1/3/5
  3. 002 B/C/D 4/5

问题一:转多行

输出结果如下所示

  1. a d e
  2. 001 type_b A
  3. 001 type_b B
  4. 001 type_c 1
  5. 001 type_c 3
  6. 001 type_c 5
  7. 002 type_b B
  8. 002 type_b C
  9. 002 type_b D
  10. 002 type_c 4
  11. 002 type_c 5

参考答案:

  1. select
  2. a,
  3. d,
  4. e
  5. from
  6. (
  7. select
  8. a,
  9. "type_b" as d,
  10. str as e
  11. from t12
  12. lateral view explode(split(b,"/")) t as str
  13. union all
  14. select
  15. a,
  16. "type_c" as d,
  17. str as e
  18. from t12
  19. lateral view explode(split(c,"/")) t as str
  20. ) tmp
  21. order by a,d;

十三、抽象分组—断点排序

表名t13
\
表字段及内容

  1. a b
  2. 2014 1
  3. 2015 1
  4. 2016 1
  5. 2017 0
  6. 2018 0
  7. 2019 -1
  8. 2020 -1
  9. 2021 -1
  10. 2022 1
  11. 2023 1

问题一:断点排序

输出结果如下所示

  1. a b c
  2. 2014 1 1
  3. 2015 1 2
  4. 2016 1 3
  5. 2017 0 1
  6. 2018 0 2
  7. 2019 -1 1
  8. 2020 -1 2
  9. 2021 -1 3
  10. 2022 1 1
  11. 2023 1 2

参考答案:

  1. select
  2. a,
  3. b,
  4. row_number() over( partition by b,repair_a order by a asc) as c--按照b列和[b的组首]分组,排序
  5. from
  6. (
  7. select
  8. a,
  9. b,
  10. a-b_rn as repair_a--根据b列值出现的次序,修复a列值为b首次出现的a列值,称为b的[组首]
  11. from
  12. (
  13. select
  14. a,
  15. b,
  16. row_number() over( partition by b order by a asc ) as b_rn--按b列分组,按a列排序,得到b列各值出现的次序
  17. from t13
  18. )tmp1
  19. )tmp2--注意,如果不同的b列值,可能出现同样的组首值,但组首值需要和a列值 一并参与分组,故并不影响排序。
  20. order by a asc;

十四、业务逻辑的分类与抽象—时效

日期表d_date
\
表字段及内容

  1. date_id is_work
  2. 2017-04-13 1
  3. 2017-04-14 1
  4. 2017-04-15 0
  5. 2017-04-16 0
  6. 2017-04-17 1

工作日:周一至周五09:30-18:30
客户申请表t14
\
表字段及内容

  1. a b c
  2. 1 申请 2017-04-14 18:03:00
  3. 1 通过 2017-04-17 09:43:00
  4. 2 申请 2017-04-13 17:02:00
  5. 2 通过 2017-04-15 09:42:00

问题一:计算上表中从申请到通过占用的工作时长

输出结果如下所示

  1. a d
  2. 1 0.67h
  3. 2 10.67h

参考答案:

  1. select
  2. a,
  3. round(sum(diff)/3600,2) as d
  4. from (
  5. select
  6. a,
  7. apply_time,
  8. pass_time,
  9. dates,
  10. rn,
  11. ct,
  12. is_work,
  13. case when is_work=1 and rn=1 then unix_timestamp(concat(dates,' 18:30:00'),'yyyy-MM-dd HH:mm:ss')-unix_timestamp(apply_time,'yyyy-MM-dd HH:mm:ss')
  14. when is_work=0 then 0
  15. when is_work=1 and rn=ct then unix_timestamp(pass_time,'yyyy-MM-dd HH:mm:ss')-unix_timestamp(concat(dates,' 09:30:00'),'yyyy-MM-dd HH:mm:ss')
  16. when is_work=1 and rn!=ct then 9*3600
  17. end diff
  18. from (
  19. select
  20. a,
  21. apply_time,
  22. pass_time,
  23. time_diff,
  24. day_diff,
  25. rn,
  26. ct,
  27. date_add(start,rn-1) dates
  28. from (
  29. select
  30. a,
  31. apply_time,
  32. pass_time,
  33. time_diff,
  34. day_diff,
  35. strs,
  36. start,
  37. row_number() over(partition by a) as rn,
  38. count(*) over(partition by a) as ct
  39. from (
  40. select
  41. a,
  42. apply_time,
  43. pass_time,
  44. time_diff,
  45. day_diff,
  46. substr(repeat(concat(substr(apply_time,1,10),','),day_diff+1),1,11*(day_diff+1)-1) strs
  47. from (
  48. select
  49. a,
  50. apply_time,
  51. pass_time,
  52. unix_timestamp(pass_time,'yyyy-MM-dd HH:mm:ss')-unix_timestamp(apply_time,'yyyy-MM-dd HH:mm:ss') time_diff,
  53. datediff(substr(pass_time,1,10),substr(apply_time,1,10)) day_diff
  54. from (
  55. select
  56. a,
  57. max(case when b='申请' then c end) apply_time,
  58. max(case when b='通过' then c end) pass_time
  59. from t14
  60. group by a
  61. ) tmp1
  62. ) tmp2
  63. ) tmp3
  64. lateral view explode(split(strs,",")) t as start
  65. ) tmp4
  66. ) tmp5
  67. join d_date
  68. on tmp5.dates = d_date.date_id
  69. ) tmp6
  70. group by a;

十五、时间序列—进度及剩余

表名t15
\
表字段及内容

  1. date_id is_work
  2. 2017-07-30 0
  3. 2017-07-31 1
  4. 2017-08-01 1
  5. 2017-08-02 1
  6. 2017-08-03 1
  7. 2017-08-04 1
  8. 2017-08-05 0
  9. 2017-08-06 0
  10. 2017-08-07 1

问题一:求每天的累计周工作日,剩余周工作日

输出结果如下所示

  1. date_id week_to_work week_left_work
  2. 2017-07-31 1 4
  3. 2017-08-01 2 3
  4. 2017-08-02 3 2
  5. 2017-08-03 4 1
  6. 2017-08-04 5 0
  7. 2017-08-05 5 0
  8. 2017-08-06 5 0

参考答案:
\
此处给出两种解法,其一:

  1. select
  2. date_id
  3. ,case date_format(date_id,'u')
  4. when 1 then 1
  5. when 2 then 2
  6. when 3 then 3
  7. when 4 then 4
  8. when 5 then 5
  9. when 6 then 5
  10. when 7 then 5
  11. end as week_to_work
  12. ,case date_format(date_id,'u')
  13. when 1 then 4
  14. when 2 then 3
  15. when 3 then 2
  16. when 4 then 1
  17. when 5 then 0
  18. when 6 then 0
  19. when 7 then 0
  20. end as week_to_work
  21. from t15

其二:

  1. select
  2. date_id,
  3. week_to_work,
  4. week_sum_work-week_to_work as week_left_work
  5. from(
  6. select
  7. date_id,
  8. sum(is_work) over(partition by year,week order by date_id) as week_to_work,
  9. sum(is_work) over(partition by year,week) as week_sum_work
  10. from(
  11. select
  12. date_id,
  13. is_work,
  14. year(date_id) as year,
  15. weekofyear(date_id) as week
  16. from t15
  17. ) ta
  18. ) tb order by date_id;

十六、时间序列—构造日期

问题一:直接使用SQL实现一张日期维度表,包含以下字段:

  1. date string 日期
  2. d_week string 年内第几周
  3. weeks int 周几
  4. w_start string 周开始日
  5. w_end string 周结束日
  6. d_month int 第几月
  7. m_start string 月开始日
  8. m_end string 月结束日
  9. d_quarter int 第几季
  10. q_start string 季开始日
  11. q_end string 季结束日
  12. d_year int 年份
  13. y_start string 年开始日
  14. y_end string 年结束日

参考答案

  1. drop table if exists dim_date;
  2. create table if not exists dim_date(
  3. `date` string comment '日期',
  4. d_week string comment '年内第几周',
  5. weeks string comment '周几',
  6. w_start string comment '周开始日',
  7. w_end string comment '周结束日',
  8. d_month string comment '第几月',
  9. m_start string comment '月开始日',
  10. m_end string comment '月结束日',
  11. d_quarter int comment '第几季',
  12. q_start string comment '季开始日',
  13. q_end string comment '季结束日',
  14. d_year int comment '年份',
  15. y_start string comment '年开始日',
  16. y_end string comment '年结束日'
  17. );
  18. --自然月: 指每月的1号到那个月的月底,它是按照阳历来计算的。就是从每月1号到月底,不管这个月有30天,31天,29天或者28天,都算是一个自然月。
  19. insert overwrite table dim_date
  20. select `date`
  21. , d_week --年内第几周
  22. , case weekid
  23. when 0 then '周日'
  24. when 1 then '周一'
  25. when 2 then '周二'
  26. when 3 then '周三'
  27. when 4 then '周四'
  28. when 5 then '周五'
  29. when 6 then '周六'
  30. end as weeks --
  31. , date_add(next_day(`date`,'MO'),-7) as w_start --周一
  32. , date_add(next_day(`date`,'MO'),-1) as w_end -- 周日_end
  33. -- 月份日期
  34. , concat('第', monthid, '月') as d_month
  35. , m_start
  36. , m_end
  37. -- 季节
  38. , quarterid as d_quart
  39. , concat(d_year, '-', substr(concat('0', (quarterid - 1) * 3 + 1), -2), '-01') as q_start --季开始日
  40. , date_sub(concat(d_year, '-', substr(concat('0', (quarterid) * 3 + 1), -2), '-01'), 1) as q_end --季结束日
  41. --
  42. , d_year
  43. , y_start
  44. , y_end
  45. from (
  46. select `date`
  47. , pmod(datediff(`date`, '2012-01-01'), 7) as weekid --获取周几
  48. , cast(substr(`date`, 6, 2) as int) as monthid --获取月份
  49. , case
  50. when cast(substr(`date`, 6, 2) as int) <= 3 then 1
  51. when cast(substr(`date`, 6, 2) as int) <= 6 then 2
  52. when cast(substr(`date`, 6, 2) as int) <= 9 then 3
  53. when cast(substr(`date`, 6, 2) as int) <= 12 then 4
  54. end as quarterid --获取季节 可以直接使用 quarter(`date`)
  55. , substr(`date`, 1, 4) as d_year -- 获取年份
  56. , trunc(`date`, 'YYYY') as y_start --年开始日
  57. , date_sub(trunc(add_months(`date`, 12), 'YYYY'), 1) as y_end --年结束日
  58. , date_sub(`date`, dayofmonth(`date`) - 1) as m_start --当月第一天
  59. , last_day(date_sub(`date`, dayofmonth(`date`) - 1)) m_end --当月最后一天
  60. , weekofyear(`date`) as d_week --年内第几周
  61. from (
  62. -- '2021-04-01'是开始日期, '2022-03-31'是截止日期
  63. select date_add('2021-04-01', t0.pos) as `date`
  64. from (
  65. select posexplode(
  66. split(
  67. repeat('o', datediff(
  68. from_unixtime(unix_timestamp('2022-03-31', 'yyyy-mm-dd'),
  69. 'yyyy-mm-dd'),
  70. '2021-04-01')), 'o'
  71. )
  72. )
  73. ) t0
  74. ) t1
  75. ) t2;

十七、时间序列—构造累积日期

表名t17
\
表字段及内容

  1. date_id
  2. 2017-08-01
  3. 2017-08-02
  4. 2017-08-03

问题一:每一日期,都扩展成月初至当天

输出结果如下所示

  1. date_id date_to_day
  2. 2017-08-01 2017-08-01
  3. 2017-08-02 2017-08-01
  4. 2017-08-02 2017-08-02
  5. 2017-08-03 2017-08-01
  6. 2017-08-03 2017-08-02
  7. 2017-08-03 2017-08-03

这种累积相关的表,常做桥接表。

参考答案:

  1. select
  2. date_id,
  3. date_add(date_start_id,pos) as date_to_day
  4. from
  5. (
  6. select
  7. date_id,
  8. date_sub(date_id,dayofmonth(date_id)-1) as date_start_id
  9. from t17
  10. ) m lateral view
  11. posexplode(split(space(datediff(from_unixtime(unix_timestamp(date_id,'yyyy-MM-dd')),from_unixtime(unix_timestamp(date_start_id,'yyyy-MM-dd')))), '')) t as pos, val;

十八、时间序列—构造连续日期

表名t18
\
表字段及内容

  1. a b c
  2. 101 2018-01-01 10
  3. 101 2018-01-03 20
  4. 101 2018-01-06 40
  5. 102 2018-01-02 20
  6. 102 2018-01-04 30
  7. 102 2018-01-07 60

问题一:构造连续日期

问题描述:将表中数据的b字段扩充至范围[2018-01-01, 2018-01-07],并累积对c求和。
\
b字段的值是较稀疏的。
输出结果如下所示

  1. a b c d
  2. 101 2018-01-01 10 10
  3. 101 2018-01-02 0 10
  4. 101 2018-01-03 20 30
  5. 101 2018-01-04 0 30
  6. 101 2018-01-05 0 30
  7. 101 2018-01-06 40 70
  8. 101 2018-01-07 0 70
  9. 102 2018-01-01 0 0
  10. 102 2018-01-02 20 20
  11. 102 2018-01-03 0 20
  12. 102 2018-01-04 30 50
  13. 102 2018-01-05 0 50
  14. 102 2018-01-06 0 50
  15. 102 2018-01-07 60 110

参考答案:

  1. select
  2. a,
  3. b,
  4. c,
  5. sum(c) over(partition by a order by b) as d
  6. from
  7. (
  8. select
  9. t1.a,
  10. t1.b,
  11. case
  12. when t18.b is not null then t18.c
  13. else 0
  14. end as c
  15. from
  16. (
  17. select
  18. a,
  19. date_add(s,pos) as b
  20. from
  21. (
  22. select
  23. a,
  24. '2018-01-01' as s,
  25. '2018-01-07' as r
  26. from (select a from t18 group by a) ta
  27. ) m lateral view
  28. posexplode(split(space(datediff(from_unixtime(unix_timestamp(r,'yyyy-MM-dd')),from_unixtime(unix_timestamp(s,'yyyy-MM-dd')))), '')) t as pos, val
  29. ) t1
  30. left join t18
  31. on t1.a = t18.a and t1.b = t18.b
  32. ) ts;

十九、时间序列—取多个字段最新的值

表名t19
\
表字段及内容

  1. date_id a b c
  2. 2014 AB 12 bc
  3. 2015 23
  4. 2016 d
  5. 2017 BC

问题一:如何一并取出最新日期

输出结果如下所示

  1. date_a a date_b b date_c c
  2. 2017 BC 2015 23 2016 d

参考答案:
\
此处给出三种解法,其一:

  1. SELECT max(CASE WHEN rn_a = 1 THEN date_id else 0 END) AS date_a
  2. ,max(CASE WHEN rn_a = 1 THEN a else null END) AS a
  3. ,max(CASE WHEN rn_b = 1 THEN date_id else 0 END) AS date_b
  4. ,max(CASE WHEN rn_b = 1 THEN b else NULL END) AS b
  5. ,max(CASE WHEN rn_c = 1 THEN date_id else 0 END) AS date_c
  6. ,max(CASE WHEN rn_c = 1 THEN c else null END) AS c
  7. FROM (
  8. SELECT date_id
  9. ,a
  10. ,b
  11. ,c
  12. --对每列上不为null的值 日期 进行排序
  13. ,row_number()OVER( PARTITION BY 1 ORDER BY CASE WHEN a IS NULL THEN 0 ELSE date_id END DESC) AS rn_a
  14. ,row_number()OVER(PARTITION BY 1 ORDER BY CASE WHEN b IS NULL THEN 0 ELSE date_id END DESC) AS rn_b
  15. ,row_number()OVER(PARTITION BY 1 ORDER BY CASE WHEN c IS NULL THEN 0 ELSE date_id END DESC) AS rn_c
  16. FROM t19
  17. ) t
  18. WHERE t.rn_a = 1
  19. OR t.rn_b = 1
  20. OR t.rn_c = 1;

其二:

  1. SELECT
  2. a.date_id
  3. ,a.a
  4. ,b.date_id
  5. ,b.b
  6. ,c.date_id
  7. ,c.c
  8. FROM
  9. (
  10. SELECT
  11. t.date_id,
  12. t.a
  13. FROM
  14. (
  15. SELECT
  16. t.date_id
  17. ,t.a
  18. ,t.b
  19. ,t.c
  20. FROM t19 t INNER JOIN t19 t1 ON t.date_id = t1.date_id AND t.a IS NOT NULL
  21. ) t
  22. ORDER BY t.date_id DESC
  23. LIMIT 1
  24. ) a
  25. LEFT JOIN
  26. (
  27. SELECT
  28. t.date_id
  29. ,t.b
  30. FROM
  31. (
  32. SELECT
  33. t.date_id
  34. ,t.b
  35. FROM t19 t INNER JOIN t19 t1 ON t.date_id = t1.date_id AND t.b IS NOT NULL
  36. ) t
  37. ORDER BY t.date_id DESC
  38. LIMIT 1
  39. ) b ON 1 = 1
  40. LEFT JOIN
  41. (
  42. SELECT
  43. t.date_id
  44. ,t.c
  45. FROM
  46. (
  47. SELECT
  48. t.date_id
  49. ,t.c
  50. FROM t19 t INNER JOIN t19 t1 ON t.date_id = t1.date_id AND t.c IS NOT NULL
  51. ) t
  52. ORDER BY t.date_id DESC
  53. LIMIT 1
  54. ) c
  55. ON 1 = 1;

其三:

  1. select
  2. *
  3. from
  4. (
  5. select t1.date_id as date_a,t1.a from (select t1.date_id,t1.a from t19 t1 where t1.a is not null) t1
  6. inner join (select max(t1.date_id) as date_id from t19 t1 where t1.a is not null) t2
  7. on t1.date_id=t2.date_id
  8. ) t1
  9. cross join
  10. (
  11. select t1.date_b,t1.b from (select t1.date_id as date_b,t1.b from t19 t1 where t1.b is not null) t1
  12. inner join (select max(t1.date_id) as date_id from t19 t1 where t1.b is not null)t2
  13. on t1.date_b=t2.date_id
  14. ) t2
  15. cross join
  16. (
  17. select t1.date_c,t1.c from (select t1.date_id as date_c,t1.c from t19 t1 where t1.c is not null) t1
  18. inner join (select max(t1.date_id) as date_id from t19 t1 where t1.c is not null)t2
  19. on t1.date_c=t2.date_id
  20. ) t3;

二十、时间序列—补全数据

表名t20
\
表字段及内容

  1. date_id a b c
  2. 2014 AB 12 bc
  3. 2015 23
  4. 2016 d
  5. 2017 BC

问题一:如何使用最新数据补全表格

输出结果如下所示

  1. date_id a b c
  2. 2014 AB 12 bc
  3. 2015 AB 23 bc
  4. 2016 AB 23 d
  5. 2017 BC 23 d

参考答案:

  1. select
  2. date_id,
  3. first_value(a) over(partition by aa order by date_id) as a,
  4. first_value(b) over(partition by bb order by date_id) as b,
  5. first_value(c) over(partition by cc order by date_id) as c
  6. from
  7. (
  8. select
  9. date_id,
  10. a,
  11. b,
  12. c,
  13. count(a) over(order by date_id) as aa,
  14. count(b) over(order by date_id) as bb,
  15. count(c) over(order by date_id) as cc
  16. from t20
  17. )tmp1;

二十一、时间序列—取最新完成状态的前一个状态

表名t21
\
表字段及内容

  1. date_id a b
  2. 2014 1 A
  3. 2015 1 B
  4. 2016 1 A
  5. 2017 1 B
  6. 2013 2 A
  7. 2014 2 B
  8. 2015 2 A
  9. 2014 3 A
  10. 2015 3 A
  11. 2016 3 B
  12. 2017 3 A

上表中B为完成状态

问题一:取最新完成状态的前一个状态

输出结果如下所示

  1. date_id a b
  2. 2016 1 A
  3. 2013 2 A
  4. 2015 3 A

参考答案:
\
此处给出两种解法,其一:

  1. select
  2. t21.date_id,
  3. t21.a,
  4. t21.b
  5. from
  6. (
  7. select
  8. max(date_id) date_id,
  9. a
  10. from
  11. t21
  12. where
  13. b = 'B'
  14. group by
  15. a
  16. ) t1
  17. inner join t21 on t1.date_id -1 = t21.date_id
  18. and t1.a = t21.a;

其二:

  1. select
  2. next_date_id as date_id
  3. ,a
  4. ,next_b as b
  5. from(
  6. select
  7. *,min(nk) over(partition by a,b) as minb
  8. from(
  9. select
  10. *,row_number() over(partition by a order by date_id desc) nk
  11. ,lead(date_id) over(partition by a order by date_id desc) next_date_id
  12. ,lead(b) over(partition by a order by date_id desc) next_b
  13. from(
  14. select * from t21
  15. ) t
  16. ) t
  17. ) t
  18. where minb = nk and b = 'B';

问题二:如何将完成状态的过程合并

输出结果如下所示:

  1. a b_merge
  2. 1 ABAB
  3. 2 AB
  4. 3 AAB

参考答案

  1. select
  2. a
  3. ,collect_list(b) as b
  4. from(
  5. select
  6. *
  7. ,min(if(b = 'B',nk,null)) over(partition by a) as minb
  8. from(
  9. select
  10. *,row_number() over(partition by a order by date_id desc) nk
  11. from(
  12. select * from t21
  13. ) t
  14. ) t
  15. ) t
  16. where nk >= minb
  17. group by a;

二十二、非等值连接—范围匹配

表f是事实表,表d是匹配表,在hive中如何将匹配表中的值关联到事实表中?
表d相当于拉链过的变化维,但日期范围可能是不全的。
表f

  1. date_id p_id
  2. 2017 C
  3. 2018 B
  4. 2019 A
  5. 2013 C

表d

  1. d_start d_end p_id p_value
  2. 2016 2018 A 1
  3. 2016 2018 B 2
  4. 2008 2009 C 4
  5. 2010 2015 C 3

问题一:范围匹配

输出结果如下所示

  1. date_id p_id p_value
  2. 2017 C null
  3. 2018 B 2
  4. 2019 A null
  5. 2013 C 3

**参考答案
\
此处给出两种解法,其一:

  1. select
  2. f.date_id,
  3. f.p_id,
  4. A.p_value
  5. from f
  6. left join
  7. (
  8. select
  9. date_id,
  10. p_id,
  11. p_value
  12. from
  13. (
  14. select
  15. f.date_id,
  16. f.p_id,
  17. d.p_value
  18. from f
  19. left join d on f.p_id = d.p_id
  20. where f.date_id >= d.d_start and f.date_id <= d.d_end
  21. )A
  22. )A
  23. ON f.date_id = A.date_id;

其二:

  1. select
  2. date_id,
  3. p_id,
  4. flag as p_value
  5. from (
  6. select
  7. f.date_id,
  8. f.p_id,
  9. d.d_start,
  10. d.d_end,
  11. d.p_value,
  12. if(f.date_id between d.d_start and d.d_end,d.p_value,null) flag,
  13. max(d.d_end) over(partition by date_id) max_end
  14. from f
  15. left join d
  16. on f.p_id = d.p_id
  17. ) tmp
  18. where d_end = max_end;

二十三、非等值连接—最近匹配

表t23_1和表t23_2通过a和b关联时,有相等的取相等的值匹配,不相等时每一个a的值在b中找差值最小的来匹配。

t23_1和t23_2为两个班的成绩单,t23_1班的每个学生成绩在t23_2班中找出成绩最接近的成绩。
表t23_1:a中无重复值

  1. a
  2. 1
  3. 2
  4. 4
  5. 5
  6. 8
  7. 10

表t23_2:b中无重复值

  1. b
  2. 2
  3. 3
  4. 7
  5. 11
  6. 13

问题一:单向最近匹配

输出结果如下所示
\
注意:b的值可能会被丢弃

  1. a b
  2. 1 2
  3. 2 2
  4. 4 3
  5. 5 3
  6. 5 7
  7. 8 7
  8. 10 11

参考答案

  1. select
  2. *
  3. from
  4. (
  5. select
  6. ttt1.a,
  7. ttt1.b
  8. from
  9. (
  10. select
  11. tt1.a,
  12. t23_2.b,
  13. dense_rank() over(partition by tt1.a order by abs(tt1.a-t23_2.b)) as dr
  14. from
  15. (
  16. select
  17. t23_1.a
  18. from t23_1
  19. left join t23_2 on t23_1.a=t23_2.b
  20. where t23_2.b is null
  21. ) tt1
  22. cross join t23_2
  23. ) ttt1
  24. where ttt1.dr=1
  25. union all
  26. select
  27. t23_1.a,
  28. t23_2.b
  29. from t23_1
  30. inner join t23_2 on t23_1.a=t23_2.b
  31. ) result_t
  32. order by result_t.a;

二十四、N指标—累计去重

假设表A为事件流水表,客户当天有一条记录则视为当天活跃。
表A

  1. time_id user_id
  2. 2018-01-01 10:00:00 001
  3. 2018-01-01 11:03:00 002
  4. 2018-01-01 13:18:00 001
  5. 2018-01-02 08:34:00 004
  6. 2018-01-02 10:08:00 002
  7. 2018-01-02 10:40:00 003
  8. 2018-01-02 14:21:00 002
  9. 2018-01-02 15:39:00 004
  10. 2018-01-03 08:34:00 005
  11. 2018-01-03 10:08:00 003
  12. 2018-01-03 10:40:00 001
  13. 2018-01-03 14:21:00 005

假设客户活跃非常,一天产生的事件记录平均达千条。

问题一:累计去重

输出结果如下所示

  1. 日期 当日活跃人数 月累计活跃人数_截至当日
  2. date_id user_cnt_act user_cnt_act_month
  3. 2018-01-01 2 2
  4. 2018-01-02 3 4
  5. 2018-01-03 3 5

参考答案

  1. SELECT tt1.date_id
  2. ,tt2.user_cnt_act
  3. ,tt1.user_cnt_act_month
  4. FROM
  5. ( -- 按照t.date_id分组求出user_cnt_act_month,得到tt1
  6. SELECT t.date_id
  7. ,COUNT(user_id) AS user_cnt_act_month
  8. FROM
  9. ( -- a和表b进行笛卡尔积,按照a.date_id,b.user_id分组,保证截止到当日的用户唯一,得出表t
  10. SELECT a.date_id
  11. ,b.user_id
  12. FROM
  13. ( -- 按照日期分组,取出date_id字段当主表的维度字段 得出表a
  14. SELECT from_unixtime(unix_timestamp(time_id),'yyyy-MM-dd') AS date_id
  15. FROM test.temp_tanhaidi_20211213_1
  16. GROUP BY from_unixtime(unix_timestamp(time_id),'yyyy-MM-dd')
  17. ) a
  18. INNER JOIN
  19. ( -- 按照date_iduser_id分组,保证每天每个用户只有一条记录,得出表b
  20. SELECT from_unixtime(unix_timestamp(time_id),'yyyy-MM-dd') AS date_id
  21. ,user_id
  22. FROM test.temp_tanhaidi_20211213_1
  23. GROUP BY from_unixtime(unix_timestamp(time_id),'yyyy-MM-dd')
  24. ,user_id
  25. ) b
  26. ON 1 = 1
  27. WHERE a.date_id >= b.date_id
  28. GROUP BY a.date_id
  29. ,b.user_id
  30. ) t
  31. GROUP BY t.date_id
  32. ) tt1
  33. LEFT JOIN
  34. ( -- 按照date_id分组求出user_cnt_act,得到tt2
  35. SELECT date_id
  36. ,COUNT(user_id) AS user_cnt_act
  37. FROM
  38. ( -- 按照日期分组,取出date_id字段当主表的维度字段 得出表a
  39. SELECT from_unixtime(unix_timestamp(time_id),'yyyy-MM-dd') AS date_id
  40. ,user_id
  41. FROM test.temp_tanhaidi_20211213_1
  42. GROUP BY from_unixtime(unix_timestamp(time_id),'yyyy-MM-dd')
  43. ,user_id
  44. ) a
  45. GROUP BY date_id
  46. ) tt2
  47. ON tt2.date_id = tt1.date_id

参考
最强最全面的大数据SQL经典面试题完整PDF版