一、Supplementary Hive Topics

1、WITH ... AS

WITH ... AS is also called subquery factoring. Hive can use WITH queries to improve performance: the named query is evaluated first, and the queries that follow can then reference its result directly, so a reused intermediate result does not have to be recomputed.

  with t as (select *, row_number() over (partition by id order by salary desc) ranking from tmp_learning_mary)
  select * from t where ranking = 1;

WITH ... AS behaves like a view or a temporary table: it stores part of a SQL statement under an alias. The difference is that a WITH clause is one-off and only exists as part of the statement that follows it.

Its biggest benefit is better code readability, and when the WITH clause is referenced several times later it can greatly simplify the SQL. More importantly, it is "analyze once, use many times"; that is where the performance gain comes from, achieving the goal of reading less data.
Usage notes:
1. A WITH clause must be defined before the SELECT statement that references it, and it must be followed by a SELECT query, otherwise an error is raised.
2. No semicolon may follow WITH ... AS. The WITH keyword can appear only once at a given level, but it may introduce multiple clauses separated by commas; the last clause is separated from the following query only by its closing parenthesis, never by a comma.

  create table a as
  with t1 as (select * from firstTable),
  t2 as (select * from secondTable),
  t3 as (select * from thirdTable)
  select * from t1, t2, t3;

3. A query defined in an earlier WITH clause can be used in a later WITH clause, but a WITH clause cannot be nested inside another WITH clause.

  with t1 as (select * from firstTable),
  t2 as (select t1.id from t1) -- the second clause t2 references the first clause t1
  select * from t2;
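Whether a reused CTE is actually materialized (computed once) or simply inlined into each reference depends on the engine; in Hive 2.x+ this is governed by hive.optimize.cte.materialize.threshold. A minimal sketch, assuming that configuration key is available in your Hive version:

  -- default is -1 (never materialize); 2 means: materialize any CTE referenced at least twice
  set hive.optimize.cte.materialize.threshold=2;

  with t as (select *, row_number() over (partition by id order by salary desc) ranking
             from tmp_learning_mary)
  select * from t where ranking = 1
  union all
  select * from t where ranking = 2;  -- t is referenced twice, so it is computed once and reused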

2、Multidimensional Analysis in Hive (MySQL 8.0 also supports this now)

First, look at the data, stored in /root/hivedata/duowei01.txt:

  A|3|男|45
  B|3|女|55
  C|2|女|74
  D|3|男|90
  E|1|男|80
  F|2|女|92
  G|1|女|95
  H|1|男|95
  I|3|女|99
  J|3|男|99
  create table tmp_student
  (
    name string,
    class int,
    sex string,
    score int
  )
  row format delimited fields terminated by '|';

  load data local inpath '/root/hivedata/duowei01.txt' into table tmp_student;

  hive> select * from tmp_student;
  OK
  A 3 男 45
  B 3 女 55
  C 2 女 74
  D 3 男 90
  E 1 男 80
  F 2 女 92
  G 1 女 95
  H 1 男 95
  I 3 女 99
  J 3 男 99
  Time taken: 0.377 seconds, Fetched: 10 row(s)
The table has four columns: name, class, sex, score.
Suppose we want aggregates at three granularities:
1. the average score of each class;
2. the average score of each sex;
3. the average score of each sex within each class.
Using a separate statement per granularity:

  select class, avg(score) from tmp_student group by class;
  select sex, avg(score) from tmp_student group by sex;
  select class, sex, avg(score) from tmp_student group by class, sex;

Rewritten as a single multidimensional query:

  select class, sex, avg(score) from tmp_student group by class, sex grouping sets(class, sex, (class, sex));


  select
    grouping__id, -- built-in column, available whenever grouping sets is used
    class,
    sex,
    avg(score) as avg_score
  from tmp_student
  group by
    class,
    sex
  grouping sets(
    class,
    sex,
    (class, sex)
  );


As shown, grouping sets produces the average scores at all three granularities in a single query.
Notes:
1. grouping sets can only be used after group by.
2. grouping sets may contain multiple granularities, separated by commas.
3. Every column in grouping sets must also appear in group by; the group by columns define the finest granularity.
4. If a selected column is not included in a given grouping set, its value is NULL in the rows for that granularity.
5. The built-in column grouping__id distinguishes the different granularities.

How grouping__id is computed

Each granularity in grouping sets corresponds to a unique grouping__id value, computed from the group by column order and the columns present in that granularity, as follows:

1. Write all group by columns in reverse order.
2. For each column, write 1 if it appears in the current granularity, otherwise 0.
3. Read the digits as a binary number; its decimal value is the grouping__id of that granularity.

For the three granularities over tmp_student above:
1. The group by columns in reverse order are: sex, class.
2. The three grouping sets then map to binary values as follows:

  grouping set   sex  class   grouping__id (decimal)
  class           0     1     1
  sex             1     0     2
  class, sex      1     1     3

This is where the grouping__id values in the grouping sets result above come from.
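As a sanity check, the mapping can be used to pull a single granularity out of the combined result. A sketch (note: the bit assignment of grouping__id was changed in newer Hive versions, around 2.3, to follow the SQL standard, so verify the values on your version):

  select class, avg_score
  from (
    select grouping__id as gid, class, sex, avg(score) as avg_score
    from tmp_student
    group by class, sex
    grouping sets (class, sex, (class, sex))
  ) t
  where gid = 1;  -- 1 = the class-only granularity under the scheme above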

See the official Hive documentation for the formal definition.
One more example:

  [whalex@hadoop102 data]$ vim duowei02.txt
  1001 John male sale beijing
  1002 Tom female sale shanghai
  1003 Bob male sale beijing
  1004 Alex male product shanghai

  create table test_duowei(id string,name string,sex string,dept_name string,addr string) row format delimited fields terminated by '\t';
  load data local inpath '/root/hivedata/duowei02.txt' overwrite into table test_duowei;

Requirements:

1. Group by sex and count the number of people of each sex.
2. Group by department and sex and count the number of people of each sex in each department.
3. Group by region and sex and count the number of people of each sex in each region.

Solution with separate queries:

  select sex,count(*) ct from test_duowei group by sex;
  select dept_name,sex,count(*) ct from test_duowei group by dept_name,sex;
  select addr,sex,count(*) ct from test_duowei group by addr,sex;

The multidimensional version:

  select dept_name,addr,sex,count(*) ct from test_duowei group by dept_name,addr,sex grouping sets((sex),(dept_name,sex),(addr,sex)) order by dept_name,addr,sex;
  +------------+-----------+---------+-----+
  | dept_name  | addr      | sex     | ct  |
  +------------+-----------+---------+-----+
  | NULL       | NULL      | female  | 1   |
  | NULL       | NULL      | male    | 3   |
  | NULL       | beijing   | male    | 2   |
  | NULL       | shanghai  | female  | 1   |
  | NULL       | shanghai  | male    | 1   |
  | product    | NULL      | male    | 1   |
  | sale       | NULL      | female  | 1   |
  | sale       | NULL      | male    | 2   |
  +------------+-----------+---------+-----+

The order by in these HQL statements is only there to make the output easier to read; it has no other significance.

Using WITH CUBE to aggregate over all dimension combinations:

Now we need aggregates for every combination of the region, department and sex dimensions.
We could of course still use grouping sets, with the following HQL:

  select dept_name,addr,sex,count(*) ct from test_duowei group by dept_name,addr,sex grouping sets((addr,dept_name,sex),(addr,dept_name),(addr,sex),(dept_name,sex),(addr),(dept_name),(sex),()) order by dept_name,addr,sex;
  +------------+-----------+---------+-----+
  | dept_name  | addr      | sex     | ct  |
  +------------+-----------+---------+-----+
  | NULL       | NULL      | NULL    | 4   |
  | NULL       | NULL      | female  | 1   |
  | NULL       | NULL      | male    | 3   |
  | NULL       | beijing   | NULL    | 2   |
  | NULL       | beijing   | male    | 2   |
  | NULL       | shanghai  | NULL    | 2   |
  | NULL       | shanghai  | female  | 1   |
  | NULL       | shanghai  | male    | 1   |
  | product    | NULL      | NULL    | 1   |
  | product    | NULL      | male    | 1   |
  | product    | shanghai  | NULL    | 1   |
  | product    | shanghai  | male    | 1   |
  | sale       | NULL      | NULL    | 3   |
  | sale       | NULL      | female  | 1   |
  | sale       | NULL      | male    | 2   |
  | sale       | beijing   | NULL    | 2   |
  | sale       | beijing   | male    | 2   |
  | sale       | shanghai  | NULL    | 1   |
  | sale       | shanghai  | female  | 1   |
  +------------+-----------+---------+-----+

We can also solve it with WITH CUBE:

  GROUP BY a, b, c WITH CUBE is equivalent to
  GROUP BY a, b, c GROUPING SETS ( (a, b, c), (a, b), (b, c), (a, c), (a), (b), (c), ( )).

So the HQL becomes:

  select dept_name,addr,sex,count(*) ct from test_duowei group by dept_name,addr,sex with cube order by dept_name,addr,sex;
  +------------+-----------+---------+-----+
  | dept_name  | addr      | sex     | ct  |
  +------------+-----------+---------+-----+
  | NULL       | NULL      | NULL    | 4   |
  | NULL       | NULL      | female  | 1   |
  | NULL       | NULL      | male    | 3   |
  | NULL       | beijing   | NULL    | 2   |
  | NULL       | beijing   | male    | 2   |
  | NULL       | shanghai  | NULL    | 2   |
  | NULL       | shanghai  | female  | 1   |
  | NULL       | shanghai  | male    | 1   |
  | product    | NULL      | NULL    | 1   |
  | product    | NULL      | male    | 1   |
  | product    | shanghai  | NULL    | 1   |
  | product    | shanghai  | male    | 1   |
  | sale       | NULL      | NULL    | 3   |
  | sale       | NULL      | female  | 1   |
  | sale       | NULL      | male    | 2   |
  | sale       | beijing   | NULL    | 2   |
  | sale       | beijing   | male    | 2   |
  | sale       | shanghai  | NULL    | 1   |
  | sale       | shanghai  | female  | 1   |
  +------------+-----------+---------+-----+

Using WITH ROLLUP for hierarchical dimension combinations:

The data:

  1 2020 12 31 56
  2 2020 12 31 89
  3 2021 01 01 78
  4 2021 01 01 67
  5 2021 01 02 56

  create table test_rollup(order_id int,year string,month string,day string,order_amount int) row format delimited fields terminated by '\t';

  load data local inpath "/root/hivedata/duowei03.txt" overwrite into table test_rollup;

Now we need total sales by time. When aggregating over time, granularities such as month alone, day alone, month+day, or year+day are meaningless; the only meaningful granularities are year-month-day, year-month, year, and the grand total with no dimensions at all.
Dimensions like year/month/day can be called hierarchical dimensions. grouping sets handles this requirement as well:

  select year,month,day,sum(order_amount) from test_rollup group by year,month,day grouping sets((year,month,day),(year,month),(year),()) order by year,month,day;
  +-------+--------+-------+------+
  | year  | month  | day   | _c3  |
  +-------+--------+-------+------+
  | NULL  | NULL   | NULL  | 346  |
  | 2020  | NULL   | NULL  | 145  |
  | 2020  | 12     | NULL  | 145  |
  | 2020  | 12     | 31    | 145  |
  | 2021  | NULL   | NULL  | 201  |
  | 2021  | 01     | NULL  | 201  |
  | 2021  | 01     | 01    | 145  |
  | 2021  | 01     | 02    | 56   |
  +-------+--------+-------+------+

WITH ROLLUP expresses the same thing:

  GROUP BY a, b, c WITH ROLLUP is equivalent to
  GROUP BY a, b, c GROUPING SETS ( (a, b, c), (a, b), (a), ( )).

So the HQL is:

  select year,month,day,sum(order_amount) from test_rollup group by year,month,day with rollup order by year,month,day;
  +-------+--------+-------+------+
  | year  | month  | day   | _c3  |
  +-------+--------+-------+------+
  | NULL  | NULL   | NULL  | 346  |
  | 2020  | NULL   | NULL  | 145  |
  | 2020  | 12     | NULL  | 145  |
  | 2020  | 12     | 31    | 145  |
  | 2021  | NULL   | NULL  | 201  |
  | 2021  | 01     | NULL  | 201  |
  | 2021  | 01     | 01    | 145  |
  | 2021  | 01     | 02    | 56   |
  +-------+--------+-------+------+

Summary:

1. Hive provides grouping sets, with cube, and with rollup for multidimensional analysis; when there are many dimensions, pick whichever matches the set of combinations you need.
2. The result can be written to a table for later querying. For example, if a result set over dimensions a, b, c was computed with with cube and we only need the group by a result, we can run select a, ct from result_table where a is not null and b is null and c is null; this avoids the problem mentioned at the start, where multiple HQL statements required multiple tables to hold the data for the final query.
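A concrete sketch of that pattern using the test_duowei cube from above (result_cube is a hypothetical table name):

  -- persist the full cube once
  create table result_cube as
  select dept_name, addr, sex, count(*) ct
  from test_duowei
  group by dept_name, addr, sex with cube;

  -- later, recover the plain "group by sex" slice without rescanning the source
  -- (assumes the grouped columns themselves contain no NULLs; otherwise keep
  --  grouping__id in the result table to disambiguate)
  select sex, ct
  from result_cube
  where sex is not null and dept_name is null and addr is null;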

3、Hive Function Review

1) parse_url, Hive's function for parsing URLs
Take this URL as the example:

  https://www.baidu.com/hzy?user_id=10000&platform=ios

Extracting each part:

  -- protocol: https, a transfer protocol through which the resource is accessed securely
  select parse_url('https://www.baidu.com/hzy?user_id=10000&platform=ios','PROTOCOL');
  -- https

  -- hostname
  select parse_url('https://www.baidu.com/hzy?user_id=10000&platform=ios','HOST');
  -- www.baidu.com

  -- path: zero or more segments separated by '/', usually a directory or file address on the host
  select parse_url('https://www.baidu.com/hzy?user_id=10000&platform=ios','PATH');
  -- /hzy

  -- query: optional; passes parameters to dynamic pages; multiple parameters are
  -- separated by '&', and each name and value by '='
  select parse_url('https://www.baidu.com/hzy?user_id=10000&platform=ios','QUERY');
  -- user_id=10000&platform=ios

  -- extracting a single value from the QUERY part:
  select parse_url('https://www.baidu.com/hzy?user_id=10000&platform=ios','QUERY','user_id');
  -- 10000
  select parse_url('https://www.baidu.com/hzy?user_id=10000&platform=ios','QUERY','platform');
  -- ios
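Besides PROTOCOL, HOST, PATH and QUERY, parse_url also accepts REF (the fragment after '#'), FILE (path plus query), AUTHORITY and USERINFO; for example:

  select parse_url('https://www.baidu.com/hzy?user_id=10000&platform=ios#top','REF');
  -- top
  select parse_url('https://www.baidu.com/hzy?user_id=10000&platform=ios','FILE');
  -- /hzy?user_id=10000&platform=ios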

2) The get_json_object function

  [{"name":"大郎","sex":"男","age":"25"},{"name":"西门庆","sex":"男","age":"47"}]

Get a single JSON object out of the array:

  select get_json_object('[{"name":"大郎","sex":"男","age":"25"},{"name":"西门庆","sex":"男","age":"47"}]','$[0]');
  Result: {"name":"大郎","sex":"男","age":"25"}

Get the age (the JSON path must point at the age field):

  hive (gmall)>
  SELECT get_json_object('[{"name":"大郎","sex":"男","age":"25"},{"name":"西门庆","sex":"男","age":"47"}]','$[0].age');
  Result: 25

If the data is a single JSON object rather than an array, use $.field_name directly:

  SELECT get_json_object('{"name":"大郎","sex":"男","age":"25"}','$.name');
  Result: 大郎
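When several fields are needed from the same flat JSON object, json_tuple parses the string once instead of calling get_json_object per field. A minimal sketch:

  select j.name, j.sex, j.age
  from (select '{"name":"大郎","sex":"男","age":"25"}' as line) src
  lateral view json_tuple(src.line, 'name', 'sex', 'age') j as name, sex, age;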

4、Dynamic Partitioning

With dynamic partitioning, partitions are generated at insert time from the values of a column.

When inserting into a partitioned Hive table, creating many partitions by hand (for example, one per distinct value of some column) means copy-pasting and editing many SQL statements, which is inefficient. Since Hive is a batch-processing system, it provides dynamic partitioning, which infers partition names from the position of the query columns and creates the partitions automatically.

The main difference between static and dynamic partitions is that static partitions are specified manually, while dynamic partitions are derived from the data. More precisely, static partition column values are fixed at compile time from what the user passes in; dynamic partition values are only determined while the SQL executes.
For example:
例如:

  -- static partitioning: both partition values are spelled out by hand
  hive> insert overwrite table partition_test partition(stat_date='20110728',province='henan')
  select
    member_id,
    name
  from partition_test_input
  where stat_date='20110728' and province='henan';

  -- dynamic partitioning: province is derived from the data at execution time
  insert overwrite table partition_test partition(stat_date='20110728',province)
  select
    member_id,
    name,
    province
  from partition_test_input
  where stat_date='20110728';

Case study:

  [root@zjj101 soft]# cat order_created.txt
  10703007267488 usa 2014-05-01
  10101043505096 usa 2014-05-01
  10103043509747 china 2014-05-02
  10103043501575 china 2014-05-02
  10104043514061 china 2014-05-01

  -- create a regular staging table
  create table order_partition
  (
    order_no string,
    type string,
    order_time string
  )
  row format delimited fields terminated by '\t';
  -- load the data
  load data local inpath '/root/hivedata/order_created.txt' into table order_partition;
  -- check that the data loaded
  select * from order_partition;

Create the partitioned table:

  create table order_dynamic_partition
  (
    order_no string
  )
  PARTITIONED BY (type string,`time` string)
  row format delimited fields terminated by '\t';
Dynamic-partition properties

Settings required for dynamic partitioning:

  set hive.exec.dynamic.partition=true;            -- default false; enables dynamic partitioning
  set hive.exec.dynamic.partition.mode=nonstrict;  -- default strict; allows all partition columns to be dynamic (strict mode requires at least one static partition column)

Related tuning settings:

  set hive.exec.max.dynamic.partitions.pernode=100; -- default 100; max dynamic partitions per mapper/reducer, often raised (e.g. to 1000); exceeding it raises an error
  set hive.exec.max.dynamic.partitions=1000;        -- default; max dynamic partitions one statement may create; exceeding it raises an error
  set hive.exec.max.created.files=10000;            -- default; max files that may be created globally; exceeding it raises an error
  set hive.exec.dynamic.partition=true;
  set hive.exec.dynamic.partition.mode=nonstrict;
  insert overwrite table order_dynamic_partition
  partition (type, `time`)
  select order_no, type, order_time
  from order_partition;

  show partitions order_dynamic_partition;
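Given the five sample rows, three partitions should come back (a sketch of the expected output; ordering may differ on your cluster):

  type=china/time=2014-05-01
  type=china/time=2014-05-02
  type=usa/time=2014-05-01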

二、Building the DWD Layer

First create the database:

  create database dwd_nshop;
  use dwd_nshop;

1、User Topic

Make sure you understand this code table; it is the key to all of the SQL in this section (a quick way to check the codes against real data is sketched after the listing):

  action codes (code, name, label):
    INSTALL("01", "install","安装"),
    LAUNCH("02", "launch","启动"),
    LOGIN("03", "login","登录"),
    REGISTER("04", "register","注册"),
    INTERACTIVE("05", "interactive","交互行为"),
    EXIT("06", "exit","退出"),
    PAGE_ENTER_H5("07", "page_enter_h5","页面进入"),
    PAGE_ENTER_NATIVE("08", "page_enter_native","页面进入")
  eventtype codes (code, name, label):
    VIEW("01", "view","浏览"),
    CLICK("02", "click","点击"),
    INPUT("03", "input","输入"),
    SLIDE("04", "slide","滑动")
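Before writing the DWD loads below, it can help to eyeball how these codes are actually distributed in the ODS log. A quick sanity-check sketch against the ods_nshop_01_useractlog table used throughout this section:

  select action, event_type, count(*) as cnt
  from ods_nshop.ods_nshop_01_useractlog
  where bdp_day = '20220509'
  group by action, event_type;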

1) User launch log table

  create external table if not exists dwd_nshop.dwd_nshop_actlog_launch(
    user_id string comment 'user id',
    device_num string comment 'device number',
    device_type string comment 'device type',
    os string comment 'phone OS',
    os_version string comment 'phone OS version',
    manufacturer string comment 'phone manufacturer',
    carrier string comment 'telecom carrier',
    network_type string comment 'network type',
    area_code string comment 'area code',
    launch_time_segment string comment 'launch time segment', -- four segments (0-6 / 6-12 / 12-18 / 18-24), encoded as 1 2 3 4
    ct bigint comment 'event time'
  ) partitioned by (bdp_day string)
  stored as parquet
  location '/data/nshop/dwd/user/dwd_nshop_actlog_launch/';

The table is created; how do we load the data?

  insert overwrite table dwd_nshop.dwd_nshop_actlog_launch partition(bdp_day='20220509')
  select
    customer_id,
    device_num,
    device_type,
    os,
    os_version,
    manufacturer,
    carrier,
    network_type,
    area_code,
    case -- bucket the event hour into the four launch time segments
      when from_unixtime(cast(ct/1000 as int),'HH') >= 0   -- >= 0 so midnight (hour 00) falls into segment 1
       and from_unixtime(cast(ct/1000 as int),'HH') <= 6 then 1
      when from_unixtime(cast(ct/1000 as int),'HH') > 6
       and from_unixtime(cast(ct/1000 as int),'HH') <= 12 then 2
      when from_unixtime(cast(ct/1000 as int),'HH') > 12
       and from_unixtime(cast(ct/1000 as int),'HH') <= 18 then 3
      else 4
    end,
    ct
  from ods_nshop.ods_nshop_01_useractlog
  where bdp_day='20220509' and action='02';

Summary:

Given the DWD table definition, take the corresponding ODS table from earlier, write the matching SQL, and load the query results into the DWD table.

2) User product view table

  create external table if not exists dwd_nshop.dwd_nshop_actlog_pdtview(
    user_id string comment 'user id',
    device_num string comment 'device number',
    device_type string comment 'device type',
    os string comment 'phone OS',
    os_version string comment 'phone OS version',
    manufacturer string comment 'phone manufacturer',
    carrier string comment 'telecom carrier',
    network_type string comment 'network type',
    area_code string comment 'area code',
    target_id string comment 'product ID',
    duration int comment 'dwell time',
    ct bigint comment 'event time'
  ) partitioned by (bdp_day string)
  stored as parquet
  location '/data/nshop/dwd/user/dwd_nshop_actlog_pdtview/';

The matching insert statement:

  insert overwrite table dwd_nshop.dwd_nshop_actlog_pdtview partition(bdp_day='20220509')
  select
    customer_id,
    device_num,
    device_type,
    os,
    os_version,
    manufacturer,
    carrier,
    network_type,
    area_code,
    get_json_object(extinfo,'$.target_id') as target_id,
    duration,
    ct
  from ods_nshop.ods_nshop_01_useractlog
  where bdp_day='20220509' and action in ('07','08')
    and event_type ='01';

3) User product search table

  create external table if not exists dwd_nshop.dwd_nshop_actlog_pdtsearch(
    user_id string comment 'user id',
    device_num string comment 'device number',
    device_type string comment 'device type',
    os string comment 'phone OS',
    os_version string comment 'phone OS version',
    manufacturer string comment 'phone manufacturer',
    carrier string comment 'telecom carrier',
    network_type string comment 'network type',
    area_code string comment 'area code',
    target_order string comment 'search sort order',
    target_keys string comment 'search keywords',
    target_id string comment 'searched product ID',
    ct bigint comment 'event time'
  ) partitioned by (bdp_day string)
  stored as parquet
  location '/data/nshop/dwd/user/dwd_nshop_actlog_pdtsearch/';

Looking at the raw extinfo data the query must handle:

  {"target_type":"4","target_keys":"","target_order":"20","target_ids":"[\"4320901202801\",\"4320901118101\",\"4320901823501\",\"4320901106001\"]"}

  -- step 1: pull out the target_ids string
  get_json_object(extinfo,'$.target_ids')
  => ["4320901202801","4320901118101","4320901823501","4320901106001"]

  -- step 2: strip the brackets (the full load below also strips the quotes)
  regexp_replace(get_json_object(extinfo,'$.target_ids'),'[\\[\\]]','')
  => "4320901202801","4320901118101","4320901823501","4320901106001"

  -- step 3: split on commas into an array
  split(regexp_replace(get_json_object(extinfo,'$.target_ids'),'[\\[\\]]',''),',')
  => 4320901202801,4320901118101,4320901823501,4320901106001
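The load statement below feeds this array through lateral view explode, which produces one row per element. The mechanism in isolation, sketched with a literal array:

  select t.target_id
  from (select array('4320901202801','4320901118101','4320901823501') as ids) s
  lateral view explode(s.ids) t as target_id;
  -- returns three rows, one id per row

The full load statement: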
  insert overwrite table dwd_nshop.dwd_nshop_actlog_pdtsearch partition(bdp_day='20220509')
  select
    customer_id,
    device_num,
    device_type,
    os,
    os_version,
    manufacturer,
    carrier,
    network_type,
    area_code,
    get_json_object(extinfo,'$.target_order') as target_order,
    get_json_object(extinfo,'$.target_keys') as target_keys,
    target_id,
    ct
  from ods_nshop.ods_nshop_01_useractlog
  lateral view explode(split(regexp_replace(get_json_object(extinfo,'$.target_ids'),'[\\[\\]\"]',''),',')) t as target_id
  where bdp_day='20220509' and action ='05'
    and event_type in ('01','04');

4) User store-following table

  create external table if not exists dwd_nshop.dwd_actlog_product_comment(
    user_id string comment 'user id',
    device_num string comment 'device number',
    device_type string comment 'device type',
    os string comment 'phone OS',
    os_version string comment 'phone OS version',
    manufacturer string comment 'phone manufacturer',
    carrier string comment 'telecom carrier',
    network_type string comment 'network type',
    area_code string comment 'area code',
    target_id string comment 'product ID',
    ct bigint comment 'event time'
  ) partitioned by (bdp_day string)
  stored as parquet
  location '/data/nshop/dwd/user/dwd_actlog_product_comment/';

The SQL:

  with t1 as (
    select
      customer_id,
      device_num,
      device_type,
      os,
      os_version,
      manufacturer,
      carrier,
      network_type,
      area_code,
      get_json_object(extinfo,'$.target_type') as target_type,
      get_json_object(extinfo,'$.target_action') as target_action,
      get_json_object(extinfo,'$.target_id') as target_id,
      ct
    from ods_nshop.ods_nshop_01_useractlog
    where bdp_day='20220509' and action='05' and event_type='02'
  )
  insert overwrite table dwd_nshop.dwd_actlog_product_comment partition(bdp_day='20220509')
  select
    customer_id,
    device_num,
    device_type,
    os,
    os_version,
    manufacturer,
    carrier,
    network_type,
    area_code,
    target_id,
    ct
  from t1 where target_action='01' and target_type='3'; -- get_json_object returns strings, so compare against '3'

2、Transaction Topic

Order detail table (built from ods_nshop.ods_02_orders, ods_nshop.ods_02_order_detail, dim_nshop.dim_pub_product, and ods_nshop.ods_02_orders_pay_records):

  create external table if not exists dwd_nshop.dwd_nshop_orders_details(
    order_id string comment 'order ID',
    order_status int comment 'order status: 5 received (complete) | 6 complaint | 7 returned',
    supplier_code VARCHAR(20) COMMENT 'shop ID',
    product_code VARCHAR(20) COMMENT 'product ID',
    customer_id string comment 'user id',
    consignee_zipcode string comment 'consignee address',
    pay_type string comment 'payment type: online (10 online banking, 11 WeChat, 12 Alipay) | offline, cash on delivery (20)',
    pay_nettype varchar(1) COMMENT 'payment network: 0 wifi | 1 4g | 2 3g | 3 offline',
    pay_count int comment 'number of payments',
    product_price DECIMAL(5,1) COMMENT 'unit price of the product',
    product_cnt INT COMMENT 'quantity purchased',
    weighing_cost DECIMAL(2,1) COMMENT 'weighted product price',
    district_money DECIMAL(4,1) COMMENT 'discount amount',
    shipping_money DECIMAL(8,1) COMMENT 'shipping fee',
    payment_money DECIMAL(10,1) COMMENT 'payment amount',
    is_activity int COMMENT '1: joined a promotion | 0: did not',
    order_ctime bigint comment 'creation time'
  ) partitioned by (bdp_day string)
  stored as parquet
  location '/data/nshop/dwd/order/dwd_nshop_orders_details/';
  with t1 as (
    -- payment count per order for the day
    select
      count(1) as pay_count,
      order_id
    from ods_nshop.ods_02_orders_pay_records
    where from_unixtime(cast(pay_ctime/1000 as int),'yyyyMMdd') = '20191102'
    group by order_id
  ),
  t2 as (
    -- order header fields
    select
      order_id,
      order_status,
      customer_id,
      consignee_zipcode,
      pay_type,
      pay_nettype,
      district_money,
      shipping_money,
      payment_money,
      order_ctime
    from ods_nshop.ods_02_orders
    where from_unixtime(cast(order_ctime/1000 as int),'yyyyMMdd') = '20191102'
  ),
  t3 as (
    -- order lines joined to the product dimension for the shop id
    select
      a.order_id,
      a.product_id,
      a.product_cnt,
      a.product_price,
      a.weighing_cost,
      a.is_activity,
      b.supplier_code
    from ods_nshop.ods_02_order_detail a
    join ods_nshop.dim_pub_product b
      on a.product_id = b.product_code
    where from_unixtime(cast(order_detail_ctime/1000 as int),'yyyyMMdd') = '20191102'
  )
  insert overwrite table dwd_nshop.dwd_nshop_orders_details partition(bdp_day='20220509')
  select
    t2.order_id,
    t2.order_status,
    t3.supplier_code,
    t3.product_id,
    t2.customer_id,
    t2.consignee_zipcode,
    t2.pay_type,
    t2.pay_nettype,
    t1.pay_count,
    t3.product_price,
    t3.product_cnt,
    t3.weighing_cost,
    t2.district_money,
    t2.shipping_money,
    t2.payment_money,
    t3.is_activity,
    t2.order_ctime
  from t1
  join t2 on t1.order_id = t2.order_id
  join t3 on t2.order_id = t3.order_id;

3、Marketing Topic

Ad placement data table:

  create external table if not exists dwd_nshop.dwd_nshop_releasedatas(
    customer_id string comment 'user id',
    device_num string comment 'device number',
    device_type string comment 'device type',
    os string comment 'phone OS',
    os_version string comment 'phone OS version',
    manufacturer string comment 'phone manufacturer',
    area_code string comment 'area code',
    release_sid string comment 'placement request id',
    release_ip string comment 'placement source ip',
    release_session string comment 'placement session id',
    release_sources string comment 'placement channel',
    release_category string comment 'browsed product category for the placement',
    release_product string comment 'browsed product for the placement',
    release_product_page string comment 'browsed product page for the placement',
    ct bigint comment 'creation time'
  ) partitioned by (bdp_day string)
  stored as parquet
  location '/data/nshop/dwd/release/dwd_nshop_releasedatas/';
  with t1 as (
    select
      b.customer_id,
      a.device_num,
      a.device_type,
      a.os,
      a.os_version,
      a.manufacturer,
      a.area_code,
      a.release_sid,
      -- release_params is a bare key=value string; prefixing a dummy URL lets parse_url extract values
      parse_url(concat("http://127.0.0.1:8888/path?",a.release_params),'QUERY','ip') as release_ip,
      a.release_session,
      a.release_sources,
      parse_url(concat("http://127.0.0.1:8888/path?",a.release_params),'QUERY','productPage') as release_product_page,
      a.ct
    from ods_nshop.ods_nshop_01_releasedatas a
    join ods_nshop.ods_02_customer b
      on a.device_num = b.customer_device_num
    where a.bdp_day='20220509'
  )
  insert overwrite table dwd_nshop.dwd_nshop_releasedatas partition(bdp_day='20220509')
  select
    t1.customer_id,
    t1.device_num,
    t1.device_type,
    t1.os,
    t1.os_version,
    t1.manufacturer,
    t1.area_code,
    t1.release_sid,
    t1.release_ip,
    t1.release_session,
    t1.release_sources,
    c.category_code,
    b.page_target,
    t1.release_product_page,
    t1.ct
  from t1
  join ods_nshop.dim_pub_page b
    on t1.release_product_page = b.page_code and b.page_type = '4'
  join ods_nshop.dim_pub_product c
    on b.page_target = c.product_code;
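The concat + parse_url trick in t1 is worth isolating: release_params is assumed to be a bare key=value query string, so prefixing any syntactically valid URL makes it parseable. A sketch with a made-up parameter string:

  select parse_url(concat('http://127.0.0.1:8888/path?','ip=10.0.0.1&productPage=P123'),'QUERY','productPage');
  -- P123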

Because this job is fairly large and may run out of memory, turn local mode off and run it on the cluster:

  set hive.exec.mode.local.auto=false;