统计最近 7 天内连续活跃 3 天的人
这是数据分析场景中比较常见的需求,连续活跃表示至少3天是连着活跃的。
假设我们的活跃日志 ods.ods_open_log 如下(简化了,实际生产环境字段会很多)
uid string comment '用户标识'
dt string comment '日期'
导入一批测试数据,如下:
uid_1 2021-11-01
uid_1 2021-11-01
uid_1 2021-11-02
uid_1 2021-11-02
uid_1 2021-11-03
uid_1 2021-11-03
uid_1 2021-11-04
uid_2 2021-11-01
uid_2 2021-11-02
uid_3 2021-11-01
uid_3 2021-11-02
uid_3 2021-11-03
思路一
1、因为每天用户活跃次数可能不止一次,所以需要先将用户每天的活跃日期去重。
2、再用row_number() over(partition by uid order by dt)函数将用户 uid 分组,按照活跃时间进行排序。
3、计算活跃日期减去第二步骤得到的结果值,用户连续活跃情况下,每次相减的结果都相同。
4、按照 uid 和日期分组并求和,筛选大于等于 3 的即为连续 3 天活跃登陆的用户。
SQL 演示,按照步骤编写
第一步,查询最近 7 天的数据,根据日期和uid去重。
select
uid
,dt
from ods.ods_open_log
where dt between date_sub('2021-11-05',7) and '20211105'
group by
uid
,dt
;
第二步,使用开窗函数排序
select
uid
,dt
,row_number() over(partition by uid order by dt asc) as rank
from(
select
uid
,dt
from ods.ods_open_log
where dt between date_sub('2021-11-05',7) and '20211105'
group by
uid
,dt
) as t
;
第三步,使用活跃日期减去 rank 的值,得到一个新的日期,用户连续活跃情况下,每次相减的结果都相同。
select
uid
,dt
,date_sub(dt,rank) as new_dt
from(
select
uid
,dt
,row_number() over(partition by uid order by dt asc) as rank
from(
select
uid
,dt
from ods.ods_open_log
where dt between date_sub('2021-11-05',7) and '20211105'
group by
uid
,dt
) as t1
) as t2
;
第四步,按照 uid 和日期分组并求和,筛选大于等于 3 的即为连续 3 天活跃登陆的用户(注意最后结果再去个重)。
select
uid
from(
select
uid
,new_dt
from(
select
uid
,dt
,date_sub(dt,rank) as new_dt
from(
select
uid
,dt
,row_number() over(partition by uid order by dt asc) as rank
from(
select
uid
,dt
from ods.ods_open_log
where dt between date_sub('2021-11-05',7) and '20211105'
group by
uid
,dt
) as t1
) as t2
) as t3
group by uid,new_dt
having count(1) >= 3
) as t4
group by uid
;
-- 得到结果
uid_1
uid_3
思路二
使用lag(向后)或者lead(向前)
这里计算的是连续 3 天,所以步长设置为 2
select
uid
,dt
,lead(dt) over(partition by uid order by dt asc) as next_dt
from(
select
uid
,dt
from ods.ods_open_log
where dt between date_sub('2021-11-05',7) and '20211105'
group by
uid
,dt
) as t
;
得到结果如下:
select
uid
,dt
from(
select
uid
,dt
,lead(dt) over(partition by uid order by dt asc) as next_1_dt
,lead(dt,2) over(partition by uid order by dt asc) as next_2_dt
from(
select
uid
,dt
from ods.ods_open_log
where dt between date_sub('2021-11-05',7) and '20211105'
group by
uid
,dt
) as t1
) as t2
-- 保留日期差值等于 1 的
where datediff(next_dt,dt) = 1
;