统计最近 7 天内连续活跃 3 天的人
这是数据分析场景中比较常见的需求,连续活跃表示至少3天是连着活跃的。
假设我们的活跃日志 ods.ods_open_log 如下(简化了,实际生产环境字段会很多)
uid string comment '用户标识'dt string comment '日期'
导入一批测试数据,如下:
uid_1 2021-11-01uid_1 2021-11-01uid_1 2021-11-02uid_1 2021-11-02uid_1 2021-11-03uid_1 2021-11-03uid_1 2021-11-04uid_2 2021-11-01uid_2 2021-11-02uid_3 2021-11-01uid_3 2021-11-02uid_3 2021-11-03
思路一
1、因为每天用户活跃次数可能不止一次,所以需要先将用户每天的活跃日期去重。
2、再用row_number() over(partition by uid order by dt)函数将用户 uid 分组,按照活跃时间进行排序。
3、计算活跃日期减去第二步骤得到的结果值,用户连续活跃情况下,每次相减的结果都相同。
4、按照 uid 和日期分组并求和,筛选大于等于 3 的即为连续 3 天活跃登陆的用户。
SQL 演示,按照步骤编写
第一步,查询最近 7 天的数据,根据日期和uid去重。
selectuid,dtfrom ods.ods_open_logwhere dt between date_sub('2021-11-05',7) and '20211105'group byuid,dt;
第二步,使用开窗函数排序
selectuid,dt,row_number() over(partition by uid order by dt asc) as rankfrom(selectuid,dtfrom ods.ods_open_logwhere dt between date_sub('2021-11-05',7) and '20211105'group byuid,dt) as t;
第三步,使用活跃日期减去 rank 的值,得到一个新的日期,用户连续活跃情况下,每次相减的结果都相同。
selectuid,dt,date_sub(dt,rank) as new_dtfrom(selectuid,dt,row_number() over(partition by uid order by dt asc) as rankfrom(selectuid,dtfrom ods.ods_open_logwhere dt between date_sub('2021-11-05',7) and '20211105'group byuid,dt) as t1) as t2;
第四步,按照 uid 和日期分组并求和,筛选大于等于 3 的即为连续 3 天活跃登陆的用户(注意最后结果再去个重)。
selectuidfrom(selectuid,new_dtfrom(selectuid,dt,date_sub(dt,rank) as new_dtfrom(selectuid,dt,row_number() over(partition by uid order by dt asc) as rankfrom(selectuid,dtfrom ods.ods_open_logwhere dt between date_sub('2021-11-05',7) and '20211105'group byuid,dt) as t1) as t2) as t3group by uid,new_dthaving count(1) >= 3) as t4group by uid;-- 得到结果uid_1uid_3
思路二
使用lag(向后)或者lead(向前)
这里计算的是连续 3 天,所以步长设置为 2
selectuid,dt,lead(dt) over(partition by uid order by dt asc) as next_dtfrom(selectuid,dtfrom ods.ods_open_logwhere dt between date_sub('2021-11-05',7) and '20211105'group byuid,dt) as t;
得到结果如下:
selectuid,dtfrom(selectuid,dt,lead(dt) over(partition by uid order by dt asc) as next_1_dt,lead(dt,2) over(partition by uid order by dt asc) as next_2_dtfrom(selectuid,dtfrom ods.ods_open_logwhere dt between date_sub('2021-11-05',7) and '20211105'group byuid,dt) as t1) as t2-- 保留日期差值等于 1 的where datediff(next_dt,dt) = 1;
