拉勾-第三阶段模块二-作业 - 《拉钩→学习笔记→大数据课程》

第一题：
分析：
使用 row_number()函数；
语句：通过 row_number()函数进行排名，然后通过字段year做差值计算，将子查询结果作为一个新表，在根据新表的数据对其进行分组和筛选，选出连续三次获得冠军的条件数据即可。

WITH newTable as (
SELECT team, (year - row_number() over (partition by team order by year)) num
FROM t1
)
SELECT team
FROM newTable
GROUP BY team, num
HAVING COUNT(*) = 3;

第二题：
分析：
使用序列函数，lag(),lead();
语句：子查询使用序列函数查询到对应的上下行数据，将子查询结果作为新表，对其进行case条件判断，当字段price的值同时大于子查询中上下行数据即为波峰，反之为波谷。

select id,time,price,case when price >p1 and price >p2 then ‘波峰’ when price from (
select id,time,price,LAG(price,1,price) over(partition by id order by id) as p1,
LEAD(price,1,price) over(partition by id order by id) as p2
from t2) t3
where (price >p1 and price >p2) or ( price <p1 and price <p2 );

第三题：
3.1：每个id浏览时长、步⻓:
分析：使用row_number()函数，unix_timestamp()函数，获得当前时间戳；nvl()函数，lag()函数；
语句：子查询对所有数据进行排名作为步长，计算两条数据的时间差值作为最大浏览时长。

select id,sum(timeStep) sum_time,max(rank) sum_step
from (select id, dt,browseid,row_number() over (partition by id order by dt) rank,
(unix_timestamp(dt, ‘yyyy/MM/dd HH:mm’) -
unix_timestamp(nvl(lag(dt) over (partition by id order by dt),dt), ‘yyyy/MM/dd HH:mm’))/60 as timeStep from t3) tmp1
group by id;

3.2：如果两次浏览之间的间隔超过30分钟，认为是两个不同的浏览时间；再求每个id浏<br />览时长、步长：<br />        分析：使用row_number()函数，unix_timestamp()函数,nvl()函数，lag()函数;<br />        语句：子查询计算出浏览时长，最终根据浏览时长（30分钟）进行分组展示。

select id,(max(unix_timestamp(dt, ‘yyyy/MM/dd HH:mm’)) - min(unix_timestamp(dt, ‘yyyy/MM/dd HH:mm’)))/60 as resultTime,count(id) step,type
from(select id, dt,browseid,rank,type
from (select id, dt,browseid,
row_number() over (partition by id order by dt) rank,
case when (unix_timestamp(dt, ‘yyyy/MM/dd HH:mm’)
- unix_timestamp(nvl(lag(dt) over (partition by id order by dt),dt), ‘yyyy/MM/dd HH:mm’))/60 >=30 then 1
else 0
end type
from t3
)t4)t5 group by id,type;

Impala作业题
业务背景
现有收集到用户的页面点击行为日志数据，数据格式如下：
用户id, 点击时间

user_id click_time
A,2020-05-15 01:30:00
A,2020-05-15 01:35:00
A,2020-05-15 02:00:00
A,2020-05-15 03:00:10
A,2020-05-15 03:05:00
B,2020-05-15 02:03:00
B,2020-05-15 02:29:40
B,2020-05-15 04:00:00
业务：
会话概念：用户的一次会话含义是指用户进入系统开始到用户离开算作一次会话，离开或者重新开始一次会话的概念是指用户的两次行为事件差值大于30分钟，

比如以A用户为例：

第一次会话：
A,2020-05-15 01:30:00
A,2020-05-15 01:35:00
A,2020-05-15 02:00:00

第二次会话:
A,2020-05-15 03:00:10
A,2020-05-15 03:05:00

判断条件是只要两次时间差值大于30分钟就属于两次会话。

需求：

对用户的日志数据打上会话内序号，如下:

A,2020-05-15 01:30:00,1
A,2020-05-15 01:35:00,2
A,2020-05-15 02:00:00,3
A,2020-05-15 03:00:10,1
A,2020-05-15 03:05:00,2
B,2020-05-15 02:03:00,1
B,2020-05-15 02:29:40,2
B,2020-05-15 04:00:00,1

实现：

在Hive中完成数据加载

—创建表droptable if exists user_clicklog;
createtable user_clicklog (user_id string,click_time string)
row format delimited fields terminated by”,”;
—加载数据
load data local inpath ‘/root/impala_data/clicklog.dat’intotable user_clicklog;

使用Impala sql完成指标统计…

— 1. 添加flag
— flag ：上一条数据时间减当前数据时间大于30分钟的标记为1 反之为0
select user_id, click_time,
if(nvl((unix_timestamp(click_time) - unix_timestamp(lag(click_time) over(partition by user_id order by click_time)))/60,0)>30,1,0) flag
from user_clicklog;
— 2. 添加gid
— gid：将flag逐一相加，成分组id
select user_id, click_time,
sum(flag) over(partition by user_id order by click_time rows between unbounded preceding and current row) gid
from (
select user_id, click_time,
if(nvl((unix_timestamp(click_time) - unix_timestamp(lag(click_time) over(partition by user_id order by click_time)))/60,0)>30,1,0) flag
from user_clicklog
)t1;
— 3. 根据id和gid 进行分组添加排序
select user_id, click_time,
row_number() over(partition by user_id,gid order by click_time) rowNum
from (
select user_id, click_time,
sum(flag) over(partition by user_id order by click_time rows between unbounded preceding and current row) gid
from (
select user_id, click_time,
if(nvl((unix_timestamp(click_time) - unix_timestamp(lag(click_time) over(partition by user_id order by click_time)))/60,0)>30,1,0) flag
from user_clicklog
)t1
)t2;