一、Hive内容补充
1、with… as
with as 也叫做子查询部分,hive 可以通过with查询来提高查询性能,因为先通过with语法将数据查询到内存,然后后面其它查询可以直接使用。
with t as (select *, row_number()over(partition by id order by salary desc) ranking from tmp_learning_mary)
select * from t where ranking = 1;
with as就类似于一个视图或临时表,可以用来存储一部分的sql语句作为别名,不同的是with as 属于一次性的,而且必须要和其他sql一起使用才可以!
其最大的好处就是适当的提高代码可读性,而且如果with子句在后面要多次使用到,这可以大大的简化SQL;更重要的是:一次分析,多次使用,这也是为什么会提供性能的地方,达到了“少读”的目标。
使用注意事项:
1.with子句必须在引用的select语句之前定义,而且后面必须要跟select查询,否则报错。
2.with as后面不能加分号,with关键字在同级中只能使用一次,允许跟多个子句,用逗号隔开,最后一个子句与后面的查询语句之间只能用右括号分隔,不能用逗号。
create table a as
with t1 as (select * from firstTable),
t2 as (select * from secondTable),
t3 as (select * from thirdTable)
select * from t1,t2,t3;
3.前面的with子句定义的查询在后面的with子句中可以使用。但是一个with子句内部不能嵌套with子句。
with t1 as (select * from firstTable),
t2 as (select t1.id from t1) #第二个子句t2中用了第一个子句的查询t1
select * from t2
2、hive的多维分析(mysql 8.0也开始支持)
先看数据:在/root/hivedata/duowei01.txt
A|3|男|45
B|3|女|55
C|2|女|74
D|3|男|90
E|1|男|80
F|2|女|92
G|1|女|95
H|1|男|95
I|3|女|99
J|3|男|99
create table tmp_student
(
name string,
class int,
sex string,
score int
)
row format delimited fields terminated by '|';
load data local inpath '/root/hivedata/duowei01.txt' into table tmp_student;
hive> select * from tmp_student;
OK
A 3 男 45
B 3 女 55
C 2 女 74
D 3 男 90
E 1 男 80
F 2 女 92
G 1 女 95
H 1 男 95
I 3 女 99
J 3 男 99
Time taken: 0.377 seconds, Fetched: 10 row(s)
数据表有4个字段:姓名、班级、性别、分数。
如果我想按照以下3个粒度汇总:
1、每个班级的平均分数
2、每个性别的平均分数
3、每个班级下不同性别的平均分数
则使用多个sql分别汇总的方案为:
select class, avg(score) from tmp_student group by class;
select sex, avg(score) from tmp_student group by sex;
select class, sex, avg(score) from tmp_student group by class, sex;
修改,添加多维分析
select class, sex, avg(score) from tmp_student group by class, sex grouping sets(class,sex,(class,sex));
select
grouping__id, -- 内置变量,只要使用grouping sets就可以调用
class,
sex,
avg(score) as avg_score
from
tmp_student
group by
class,
sex
grouping sets(
class,
sex,
(class, sex)
)
可以看到,使用 grouping sets 同时汇总出了 3 种不同粒度的平均分数。
注意:
1、grouping sets 只能用于 group by 之后。
2、grouping sets 中可以包含多种粒度,粒度之间用逗号连接。
3、grouping sets 中的所有字段,都必须出现在 group by 中,相当于 group by 后面的字段是最细粒度。
4、如果 select 中的字段,没有包含在某个 grouping set 中,那么这个粒度下的这个字段值为 NULL。
4、不同的粒度,可以使用内置变量 grouping__id 进行区分。
grouping_id计算方法
grouping sets 中的每一种粒度,都对应唯一的 grouping__id 值,其计算公式与 group by 的顺序、当前粒度的字段有关。
具体计算方法如下:
将 group by 的所有字段 倒序 排列。
对于每个字段,如果该字段出现在了当前粒度中,则该字段位置赋值为1,否则为0。
这样就形成了一个二进制数,这个二进制数转为十进制,即为当前粒度对应的 grouping__id。
以上述对 tmp_student 的3种粒度的统计结果为例:
1、group by 的所有字段倒序排列为:sex class
2、对于 3 种 grouping sets,分别对应的二进制数为:
序号 grouping set 给倒序排列的字段(sex class)赋值 对应的十进制(grouping__id 的值)
维度 | sex | class | 对应的十进制 |
---|---|---|---|
class | 0 | 1 | 1 |
sex | 1 | 0 | 2 |
class,sex | 1 | 1 | 3 |
1 class 0 1 => 1
2 sex 1 0 => 2
3 class,sex 1 1 => 3
这就是上面 grouping sets 的结果中 grouping__id 值的由来。
官网介绍:
再看一个案例:
[whalex@hadoop102data]$ vim test.txt
1001 John male sale beijing
1002 Tom female sale shanghai
1003 Bob male sale beijing
1004 Alex male product shanghai
create table test_duowei(id string,name string,sex string,dept_name string,addr string) row format delimited fields terminated by '\t';
load data local inpath '/root/hivedata/duowei02.txt' overwrite into table test_duowei;
需求:
按照性别分组计算每种性别的人数
按照部门和性别分组计算每个部门每种性别的人数
按照地区和性别分组计算每个地区每种性别的人数
解答:
select sex,count(*) ct from test group by sex;
select dept_name,sex,count(*) ct from test group by dept_name,sex;
select addr,sex,count(*) ct from test group by addr,sex;
多维分析:
select dept_name,addr,sex,count(*) ct from test group by dept_name,addr,sex grouping sets((sex),(dept_name,sex),(addr,sex)) order by dept_name,addr,sex;
+------------+-----------+---------+-----+
| dept_name | addr | sex | ct |
+------------+-----------+---------+-----+
| NULL | NULL | female | 1 |
| NULL | NULL | male | 3 |
| NULL | beijing | male | 2 |
| NULL | shanghai | female | 1 |
| NULL | shanghai | male | 1 |
| product | NULL | male | 1 |
| sale | NULL | female | 1 |
| sale | NULL | male | 2 |
+------------+-----------+---------+-----+
HQL中的order by只是为了让输出结果更好看一些,没有其他意义。
With Cube解决计算所有维度组合问题:
此时需要计算地区、部门以及性别三个维度所有组合的聚合数据
我们当然还可以使用Grouping Sets来解决问题,HQL如下:
select dept_name,addr,sex,count(*) ct from test group by dept_name,addr,sex grouping sets((addr,dept_name,sex),(addr,dept_name),(addr,sex),(dept_name,sex),(addr),(dept_name),(sex),()) order by dept_name,addr,sex;
+------------+-----------+---------+-----+
| dept_name | addr | sex | ct |
+------------+-----------+---------+-----+
| NULL | NULL | NULL | 4 |
| NULL | NULL | female | 1 |
| NULL | NULL | male | 3 |
| NULL | beijing | NULL | 2 |
| NULL | beijing | male | 2 |
| NULL | shanghai | NULL | 2 |
| NULL | shanghai | female | 1 |
| NULL | shanghai | male | 1 |
| product | NULL | NULL | 1 |
| product | NULL | male | 1 |
| product | shanghai | NULL | 1 |
| product | shanghai | male | 1 |
| sale | NULL | NULL | 3 |
| sale | NULL | female | 1 |
| sale | NULL | male | 2 |
| sale | beijing | NULL | 2 |
| sale | beijing | male | 2 |
| sale | shanghai | NULL | 1 |
| sale | shanghai | female | 1 |
+------------+-----------+---------+-----+
此时我们也可以使用With Cube来解决问题:
GROUP BY a, b, c WITH CUBE 等同于
GROUP BY a, b, c GROUPING SETS ( (a, b, c), (a, b), (b, c), (a, c), (a), (b), (c), ( )).
所以我们的HQL语句是这样的:
select dept_name,addr,sex,count(*) ct from test_duowei group by dept_name,addr,sex with cube order by dept_name,addr,sex;
+------------+-----------+---------+-----+
| dept_name | addr | sex | ct |
+------------+-----------+---------+-----+
| NULL | NULL | NULL | 4 |
| NULL | NULL | female | 1 |
| NULL | NULL | male | 3 |
| NULL | beijing | NULL | 2 |
| NULL | beijing | male | 2 |
| NULL | shanghai | NULL | 2 |
| NULL | shanghai | female | 1 |
| NULL | shanghai | male | 1 |
| product | NULL | NULL | 1 |
| product | NULL | male | 1 |
| product | shanghai | NULL | 1 |
| product | shanghai | male | 1 |
| sale | NULL | NULL | 3 |
| sale | NULL | female | 1 |
| sale | NULL | male | 2 |
| sale | beijing | NULL | 2 |
| sale | beijing | male | 2 |
| sale | shanghai | NULL | 1 |
| sale | shanghai | female | 1 |
+------------+-----------+---------+-----+
With Rollup解决计算层级维度组合问题:
数据:
1 2020 12 31 56
2 2020 12 31 89
3 2021 01 01 78
4 2021 01 01 67
5 2021 01 02 56
create table test_rollup(order_id int,year string,month string,day string,order_amount int) row format delimited fields terminated by '\t';
load data local inpath "/root/hivedata/duowei03.txt" overwrite into table test_rollup;
现在需要按照时间统计销售总额,注意按照时间聚合数据的时候,单独的月份、单独的天、月份和天以及年和天的维度聚合出来的指标是没有意义的,也就是说按照时间维度聚合指标时只有“年月日”、“年月”、“年”以及不添加维度信息计算总和时才是有意义的。
这种像年月日维度的可以称为层级维度,同样可以使用Grouping Sets来解决该需求,HQL如下:
select year,month,day,sum(order_amount) from test_rollup group by year,month,day grouping sets((year,month,day),(year,month),(year),()) order by year,month,day;
+-------+--------+-------+------+
| year | month | day | _c3 |
+-------+--------+-------+------+
| NULL | NULL | NULL | 346 |
| 2020 | NULL | NULL | 145 |
| 2020 | 12 | NULL | 145 |
| 2020 | 12 | 31 | 145 |
| 2021 | NULL | NULL | 201 |
| 2021 | 01 | NULL | 201 |
| 2021 | 01 | 01 | 145 |
| 2021 | 01 | 02 | 56 |
+-------+--------+-------+------+
此时我们还可以使用With Rollup来解决该需求:
GROUP BY a, b, c, WITH ROLLUP 等同于
GROUP BY a, b, c GROUPING SETS ( (a, b, c), (a, b), (a), ( )).
所以HQL语句为:
select year,month,day,sum(order_amount) from test_rollup group by year,month,day with rollup order by year,month,day;
+-------+--------+-------+------+
| year | month | day | _c3 |
+-------+--------+-------+------+
| NULL | NULL | NULL | 346 |
| 2020 | NULL | NULL | 145 |
| 2020 | 12 | NULL | 145 |
| 2020 | 12 | 31 | 145 |
| 2021 | NULL | NULL | 201 |
| 2021 | 01 | NULL | 201 |
| 2021 | 01 | 01 | 145 |
| 2021 | 01 | 02 | 56 |
+-------+--------+-------+------+
总结:
Hive中提供了grouping sets、with cube以及with rollup来解决多维分析的问题,当维度特别多时,我们根据具体情况可以使用对应的函数来解决问题。
我们可以将结果输出到一张表中方便后续查询,比如我们是a,b,c三个维度使用with cube计算的结果集,当我们只需要group by a的结果时,我们可以使用select a,ct from result_table where a is not null and b is null and c is null;这样的HQL获取需要的结果,避免了我们在开头提到了多条HQL需要使用多张表来保存数据给最后查询带来麻烦的问题。
3、复习hive函数
1)hive中解析url的parse_url函数解析url
举例:
https://www.baidu.com/hzy?user_id=10000&platform=ios
例子:
protocol(协议):https,一种可以通过安全的 HTTPS 访问该资源的传输协议
select parse_url('https://www.baidu.com/hzy?user_id=10000&platform=ios','PROTOCOL') ;
https
解析主机名:hostname(主机名)
select parse_url('https://www.baidu.com/hzy?user_id=10000&platform=ios','HOST');
www.baidu.com
path(路径):由零或多个“/”符号隔开的字符串,一般用来表示主机上的一个目录或文件地址
select parse_url('https://www.baidu.com/hzy?user_id=10000&platform=ios','PATH');
/hzy
query(查询):可选,用于给动态网页传递参数,可有多个参数,用“&”符号隔开,每个参数的名和值用“=”符号隔开
select parse_url('https://www.baidu.com/hzy?user_id=10000&platform=ios','QUERY');
user_id=10000&platform=ios
解析QUERY部分的value值:
select parse_url('https://www.baidu.com/hzy?user_id=10000&platform=ios','QUERY','user_id');
10000
select parse_url('https://www.baidu.com/hzy?user_id=10000&platform=ios','QUERY','platform');
ios
2)get_json_object 函数
[{"name":"大郎","sex":"男","age":"25"},{"name":"西门庆","sex":"男","age":"47"}]
获取一个json对象:
select get_json_object('[{"name":"大郎","sex":"男","age":"25"},{"name":"西门庆","sex":"男","age":"47"}]','$[0]');
结果是:{"name":"大郎","sex":"男","age":"25"}
获取age:
hive (gmall)>
SELECT get_json_object('[{"name":"大郎","sex":"男","age":"25"},{"name":"西门庆","sex":"男","age":"47"}]',"$[0].name");
结果是:25
如果数据直接是一个json对象,获取时使用$.字段名字 即可获取。
SELECT get_json_object('{"name":"大郎","sex":"男","age":"25"}',"$.name");
4、动态分区
动态分区是分区在数据插入的时候,根据某一列的列值动态生成.
往hive分区表中插入数据时,如果需要创建的分区很多,比如以表中某个字段进行分区存储,则需要复制粘贴修改很多sql去执行,效率低。因为hive是批处理系统,所以hive提供了一个动态分区功能,其可以基于查询参数的位置去推断分区的名称,从而建立分区。
静态分区与动态分区的主要区别在于静态分区是手动指定,而动态分区是通过数据来进行判断。详细来说,静态分区的列实在编译时期,通过用户传递来决定的;动态分区只有在SQL执行时才能决定。
例如:
hive> insert overwrite table partition_test partition(stat_date='20110728',province='henan')
select
member_id
,name
from
partition_test_input
where
stat_date='20110728' and province='henan';
insert overwrite table partition_test partition(stat_date='20110728',province)
select
member_id
,name
,province
from
partition_test_input
where
stat_date='20110728';
案例:
[root@zjj101 soft]# cat order_created.txt
10703007267488 usa 2014-05-01
10101043505096 usa 2014-05-01
10103043509747 china 2014-05-02
10103043501575 china 2014-05-02
10104043514061 china 2014-05-01
-- 创建普通表
create table order_partition
(
order_no string,
type string,
order_time string
)
row format delimited fields terminated by '\t';
-- 装载数据
load data local inpath '/root/hivedata/order_created.txt' into table order_partition;
-- 查询数据是否装载成功
select * from order_partition;
创建分区表:
create table order_dynamic_partition
(
order_no string
)
PARTITIONED BY (type string,`time` string)
row format delimited fields terminated by '\t';
动态分区表的属性
使用动态分区表必须配置的参数 :
set hive.exec.dynamic.partition =true(默认false),表示开启动态分区功能
set hive.exec.dynamic.partition.mode = nonstrict(默认strict),表示允许所有分区都是动态的,否则必须有静态分区字段
动态分区相关的调优参数:
set hive.exec.max.dynamic.partitions.pernode=100 (默认100,一般可以设置大一点,比如1000) 表示每个maper或reducer可以允许创建的最大动态分区个数,默认是100,超出则会报错。
set hive.exec.max.dynamic.partitions =1000(默认值) 表示一个动态分区语句可以创建的最大动态分区个数,超出报错
set hive.exec.max.created.files =10000(默认) 全局可以创建的最大文件个数,超出报错。
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nostrick;
insert overwrite table order_dynamic_partition
partition (type, `time`)
select order_no, type, order_time
from order_partition;
show partitions order_dynamic_partition;
二、DWD层数据构建
先创建数据库:
create database dwd_nshop;
use dwd_nshop;
1、用户主题
一定要理解这个码表(这个是理解整个SQL的核心部分):
action行为种类:
INSTALL("01", "install","安装"),
LAUNCH("02", "launch","启动"),
LOGIN("03", "login","登录"),
REGISTER("04", "register","注册"),
INTERACTIVE("05", "interactive","交互行为"),
EXIT("06", "exit","退出"),
PAGE_ENTER_H5("07", "page_enter_h5","页面进入"),
PAGE_ENTER_NATIVE("08", "page_enter_native","页面进入")
eventtype事件类型:
VIEW("01", "view","浏览"),
CLICK("02", "click","点击"),
INPUT("03", "input","输入"),
SLIDE("04", "slide","滑动")
1)用户启动日志表
create external table if not exists dwd_nshop.dwd_nshop_actlog_launch(
user_id string comment '用户id',
device_num string comment '设备号',
device_type string comment '设备类型',
os string comment '手机系统',
os_version string comment '手机系统版本',
manufacturer string comment '手机制造商',
carrier string comment '电信运营商',
network_type string comment '网络类型',
area_code string comment '地区编码',
launch_time_segment string comment '启动时间段',-- 启动时间段分为四个阶段(0-6、6-12、12-18/18-24)每个阶段用 1 2 3 4表示
ct bigint comment '产生时间'
) partitioned by (bdp_day string)
stored as parquet
location '/data/nshop/dwd/user/dwd_nshop_actlog_launch/';
表创建好了,数据如何导入?
insert overwrite table dwd_nshop.dwd_nshop_actlog_launch partition(bdp_day='20220509')
select
customer_id ,
device_num ,
device_type ,
os ,
os_version ,
manufacturer ,
carrier ,
network_type,
area_code ,
case
when from_unixtime(cast(ct/1000 as int),'HH') > 0
and from_unixtime(cast(ct/1000 as int),'HH') <=6 then 1
when from_unixtime(cast(ct/1000 as int),'HH') > 6
and from_unixtime(cast(ct/1000 as int),'HH') <=12 then 2
when from_unixtime(cast(ct/1000 as int),'HH') > 12
and from_unixtime(cast(ct/1000 as int),'HH') <=18 then 3
else 4
end,
ct
from ods_nshop.ods_nshop_01_useractlog
where bdp_day='20220509' and action='02';
总结:
根据我们给定的dwd的表结构,拿着我们之前的ods表结构,然后编写对应的SQL语句,将查询出来的数据,导入到dwd的表中。
2、用户产品浏览表
create external table if not exists dwd_nshop.dwd_nshop_actlog_pdtview(
user_id string comment '用户id',
device_num string comment '设备号',
device_type string comment '设备类型',
os string comment '手机系统',
os_version string comment '手机系统版本',
manufacturer string comment '手机制造商',
carrier string comment '电信运营商',
network_type string comment '网络类型',
area_code string comment '地区编码',
target_id string comment '产品ID',
duration int comment '停留时长',
ct bigint comment '产生时间'
) partitioned by (bdp_day string)
stored as parquet
location '/data/nshop/dwd/user/dwd_nshop_actlog_pdtview/'
编写对应的SQL语句插入表数据
insert overwrite table dwd_nshop.dwd_nshop_actlog_pdtview partition(bdp_day='20220509')
select
customer_id ,
device_num ,
device_type ,
os ,
os_version ,
manufacturer ,
carrier ,
network_type,
area_code ,
get_json_object(extinfo,'$.target_id') as target_id,
duration,
ct
from ods_nshop.ods_nshop_01_useractlog
where bdp_day='20220509' and action in ('07','08')
and event_type ='01';
3、用户产品查询表
create external table if not exists dwd_nshop.dwd_nshop_actlog_pdtsearch(
user_id string comment '用户id',
device_num string comment '设备号',
device_type string comment '设备类型',
os string comment '手机系统',
os_version string comment '手机系统版本',
manufacturer string comment '手机制造商',
carrier string comment '电信运营商',
network_type string comment '网络类型',
area_code string comment '地区编码',
target_order string comment '查询排序方式',
target_keys string comment '查询内容',
target_id string comment '查询商品ID',
ct bigint comment '产生时间'
) partitioned by (bdp_day string)
stored as parquet
location '/data/nshop/dwd/user/dwd_nshop_actlog_pdtsearch/';
根据查询出来的结果数据,我们看到原始数据:
{"target_type":"4","target_keys":"","target_order":"20","target_ids":"[\"4320901202801\",\"4320901118101\",\"4320901823501\",\"4320901106001\"]"}
get_json_object(extinfo,'$.target_ids')
["4320901202801","4320901118101","4320901823501","4320901106001"]
regexp_replace(get_json_object(extinfo,'$.target_ids'),'[\\[\\]]','')
"4320901202801","4320901118101","4320901823501","4320901106001"
split(regexp_replace(get_json_object(extinfo,'$.target_ids'),'[\\[\\]]',''),',')
4320901202801,4320901118101,4320901823501,4320901106001
insert overwrite table dwd_nshop.dwd_nshop_actlog_pdtsearch partition(bdp_day='20220509')
select
customer_id ,
device_num ,
device_type ,
os ,
os_version ,
manufacturer ,
carrier ,
network_type,
area_code ,
get_json_object(extinfo,'$.target_order') as target_order,
get_json_object(extinfo,'$.target_keys') as target_keys,
target_id,
ct
from ods_nshop.ods_nshop_01_useractlog
lateral view explode(split(regexp_replace(get_json_object(extinfo,'$.target_ids'),'[\\[\\]\"]',''),',')) t as target_id
where bdp_day='20220509' and action ='05'
and event_type in ('01','04');
4、用户关注店铺表
create external table if not exists dwd_nshop.dwd_actlog_product_comment(
user_id string comment '用户id',
device_num string comment '设备号',
device_type string comment '设备类型',
os string comment '手机系统',
os_version string comment '手机系统版本',
manufacturer string comment '手机制造商',
carrier string comment '电信运营商',
network_type string comment '网络类型',
area_code string comment '地区编码',
target_id string comment '产品ID',
ct bigint comment '产生时间'
) partitioned by (bdp_day string)
stored as parquet
location '/data/nshop/dwd/user/dwd_actlog_product_comment/'
编写SQL语句:
with t1 as (
select
customer_id,
device_num ,
device_type ,
os ,
os_version ,
manufacturer ,
carrier ,
network_type,
area_code ,
get_json_object(extinfo,'$.target_type') as target_type,
get_json_object(extinfo,'$.target_action') as target_action,
get_json_object(extinfo,'$.target_id') as target_id,
ct
from ods_nshop.ods_nshop_01_useractlog
where bdp_day='20220509' and action='05' and event_type='02'
)
insert overwrite table dwd_nshop.dwd_actlog_product_comment partition(bdp_day='20220509')
select
customer_id,
device_num ,
device_type ,
os ,
os_version ,
manufacturer ,
carrier ,
network_type,
area_code ,
target_id,
ct from t1 where target_action='01' and target_type = 3;
2、交易主题
交易明细表(ods_nshop.ods_02_orders,ods_nshop.ods_02_order_detail,dim_nshop.dim_pub_product,ods_nshop.ods_02_orders_pay_records)
create external table if not exists dwd_nshop.dwd_nshop_orders_details(
order_id string comment '订单ID',
order_status int comment '订单状态:5已收货(完成)|6投诉 7退货',
supplier_code VARCHAR(20) COMMENT '店铺ID',
product_code VARCHAR(20) COMMENT '商品ID',
customer_id string comment '用户id',
consignee_zipcode string comment '收货人地址',
pay_type string comment '支付类型:线上支付 10 网上银行 11 微信 12 支付宝 | 线下支付(货到付款) 20 ',
pay_nettype varchar(1) COMMENT '支付网络方式:0 wifi | 1 4g | 2 3g |3 线下支付',
pay_count int comment '支付次数',
product_price DECIMAL(5,1) COMMENT '购买商品单价',
product_cnt INT COMMENT '购买商品数量',
weighing_cost DECIMAL(2,1) COMMENT '商品加权价格',
district_money DECIMAL(4,1) COMMENT '优惠金额',
shipping_money DECIMAL(8,1) COMMENT '运费金额',
payment_money DECIMAL(10,1) COMMENT '支付金额',
is_activity int COMMENT '1:参加活动|0:没有参加活动',
order_ctime bigint comment '创建时间'
) partitioned by (bdp_day string)
stored as parquet
location '/data/nshop/dwd/order/dwd_nshop_orders_details/'
with t1 as (
select
count(1) as pay_count,
order_id
from ods_nshop.ods_02_orders_pay_records
where from_unixtime(cast(pay_ctime/1000 as int),'yyyyMMdd') = '20191102'
group by order_id
),
t2 as (
select
order_id,
order_status,
customer_id,
consignee_zipcode,
pay_type,
pay_nettype,
district_money,
shipping_money,
payment_money,
order_ctime
from ods_nshop.ods_02_orders
where from_unixtime(cast(order_ctime/1000 as int),'yyyyMMdd') = '20191102'),
t3 as (
select
a.order_id,
a.product_id,
a.product_cnt,
a.product_price,
a.weighing_cost,
a.is_activity,
b.supplier_code
from ods_nshop.ods_02_order_detail a
join ods_nshop.dim_pub_product b
on a.product_id = b.product_code
where from_unixtime(cast(order_detail_ctime/1000 as int),'yyyyMMdd') = '20191102'
)
insert overwrite table dwd_nshop.dwd_nshop_orders_details partition(bdp_day='20220509')
select
t2.order_id,
t2.order_status,
t3.supplier_code,
t3.product_id,
t2.customer_id,
t2.consignee_zipcode,
t2.pay_type,
t2.pay_nettype,
t1.pay_count,
t3.product_price,
t3.product_cnt,
t3.weighing_cost,
t2.district_money,
t2.shipping_money,
t2.payment_money,
t3.is_activity,
t2.order_ctime
from t1 join t2
on t1.order_id = t2.order_id
join t3
on t2.order_id = t3.order_id;
3、营销主题
广告投放数据表:
create external table if not exists dwd_nshop.dwd_nshop_releasedatas(
customer_id string comment '用户id',
device_num string comment '设备号',
device_type string comment '设备类型',
os string comment '手机系统',
os_version string comment '手机系统版本',
manufacturer string comment '手机制造商',
area_code string comment '地区编码',
release_sid string comment '投放请求id',
release_ip string comment '投放方ip',
release_session string comment '投放会话id',
release_sources string comment '投放渠道',
release_category string comment '投放浏览产品分类',
release_product string comment '投放浏览产品',
release_product_page string comment '投放浏览产品页',
ct bigint comment '创建时间'
) partitioned by (bdp_day string)
stored as parquet
location '/data/nshop/dwd/release/dwd_nshop_releasedatas/'
with t1 as (
select
b.customer_id,
a.device_num,
a.device_type ,
a.os ,
a.os_version ,
a.manufacturer ,
a.area_code ,
a.release_sid ,
parse_url(concat("http://127.0.0.1:8888/path?",a.release_params),'QUERY','ip') as release_ip,
a.release_session,
a.release_sources,
parse_url(concat("http://127.0.0.1:8888/path?",a.release_params),'QUERY','productPage') as release_product_page,
a.ct
from ods_nshop.ods_nshop_01_releasedatas a
join ods_nshop.ods_02_customer b
on a.device_num = b.customer_device_num
where a.bdp_day='20220509'
)
insert overwrite table dwd_nshop.dwd_nshop_releasedatas partition(bdp_day='20220509')
select
t1.customer_id,
t1.device_num,
t1.device_type ,
t1.os ,
t1.os_version ,
t1.manufacturer ,
t1.area_code ,
t1.release_sid,
t1.release_ip,
t1.release_session,
t1.release_sources,
c.category_code,
b.page_target,
t1.release_product_page,
t1.ct
from t1
join ods_nshop.dim_pub_page b
on t1.release_product_page = b.page_code and b.page_type = '4'
join ods_nshop.dim_pub_product c
on b.page_target = c.product_code;
由于这个任务比较大,有可能出现内存溢出的情况,可以将本地模式关闭,使用集群模式运行该任务。
set hive.exec.mode.local.auto=false;