1、问题
条件如下
- 已知一个流水表中有两个流水类型 【开/关】
- 【开】类型的流水能连续发生
- 【关】类型的流水不能连续发生
- 有【关】类型的流水不能认为一定有开逻辑,必须表里有才算
已知图表如下,使用SQL求发生过打开关闭后又打开过的设备数量,即发生了【开-关-开】的设备数量
flow_id | device_id | action | flow_time |
---|---|---|---|
1 | A | 开 | 20180101 |
2 | A | 开 | 20180102 |
3 | A | 关 | 20180103 |
4 | B | 开 | 20180101 |
5 | B | 关 | 20180102 |
6 | B | 开 | 20180103 |
7 | B | 开 | 20180103 |
8 | C | 关 | 20180102 |
9 | C | 开 | 20180103 |
10 | C | 开 | 20180103 |
A设备发生了 【开-开-关 】不满足要求
B设备发生了 【开-关-开-关 】满足要求
C设备发生了 【关-开-开 】不满足要求
2、解决方案
在解决问题之前先分享一招,我工作中常使用的使用union
和 with
生成测试表
with flow_table as (
select 1 as flow_id,'A' as device_id,'开' as action,20180101 as flow_time
union all
select 2 as flow_id,'A' as device_id,'开' as action,20180102 as flow_time
union all
select 3 as flow_id,'A' as device_id,'关' as action,20180103 as flow_time
union all
select 4 as flow_id,'B' as device_id,'开' as action,20180101 as flow_time
union all
select 5 as flow_id,'B' as device_id,'关' as action,20180102 as flow_time
union all
select 6 as flow_id,'B' as device_id,'开' as action,20180103 as flow_time
union all
select 7 as flow_id,'B' as device_id,'开' as action,20180103 as flow_time
union all
select 8 as flow_id,'C' as device_id,'关' as action,20180102 as flow_time
union all
select 9 as flow_id,'C' as device_id,'开' as action,20180103 as flow_time
union all
select 10 as flow_id,'C' as device_id,'开' as action,20180103 as flow_time
)
select * from flow_table order by flow_id;
思路:
只要一个设备的序列中,存在 【开-关】、【关-开】两种类型就说明该设备是【发生过打开,关闭后又打开过】
将action列按照device_id分组,使用lead函数生成action_next列
select
flow_id,
device_id,
action,
lead(action) over(partition by device_id order by flow_time) as action_next,
flow_time
from
flow_table order by flow_id;
之前分析了只有B满足情况,也只有B有【开-关】、【关-开】两种都有的情况
下一步:
过滤掉action_next IS NULL 和 action = action_next的情况,分组聚合后数据量大于2的就是满足要求的数据
with_action_next as (
select
flow_id,
device_id,
action,
lead(action) over(partition by device_id order by flow_time) as action_next,
flow_time
from
flow_table order by flow_id
)
select device_id,action,action_next from with_action_next
where
action_next is not null
and action <> action_next
group by device_id,action,action_next
;
结果完整SQL:
with flow_table as (
select 1 as flow_id,'A' as device_id,'开' as action,20180101 as flow_time
union all
select 2 as flow_id,'A' as device_id,'开' as action,20180102 as flow_time
union all
select 3 as flow_id,'A' as device_id,'关' as action,20180103 as flow_time
union all
select 4 as flow_id,'B' as device_id,'开' as action,20180101 as flow_time
union all
select 5 as flow_id,'B' as device_id,'关' as action,20180102 as flow_time
union all
select 6 as flow_id,'B' as device_id,'开' as action,20180103 as flow_time
union all
select 7 as flow_id,'B' as device_id,'开' as action,20180103 as flow_time
union all
select 8 as flow_id,'C' as device_id,'关' as action,20180102 as flow_time
union all
select 9 as flow_id,'C' as device_id,'开' as action,20180103 as flow_time
union all
select 10 as flow_id,'C' as device_id,'开' as action,20180103 as flow_time
)
, with_action_next as (
select
flow_id,
device_id,
action,
lead(action) over(partition by device_id order by flow_time) as action_next,
flow_time
from
flow_table order by flow_id
)
, result_table as (
select device_id,action,action_next from with_action_next
where
action_next is not null
and action <> action_next
group by device_id,action,action_next
)
select count(*) from (
select
device_id
from result_table
group by device_id having count(*) > 1
) t
;
结果:
搞定
3、总结
- 使用
union all
方式可以快速生成测试数据来开发 lead
或者lag
函数生成偏移列- 使用
with
临时表,优化SQL代码结构4、参考
[1] 扎心了,老铁.HIVE学习之路.博客园:博客,2018-04-15
[2] 京东集团运维部.PRESTO文档.5.13. 窗口函数