统计最近 7 天内连续活跃 3 天的人

这是数据分析场景中比较常见的需求,连续活跃表示至少3天是连着活跃的。
假设我们的活跃日志 ods.ods_open_log 如下(简化了,实际生产环境字段会很多)

  1. uid string comment '用户标识'
  2. dt string comment '日期'

导入一批测试数据,如下:

  1. uid_1 2021-11-01
  2. uid_1 2021-11-01
  3. uid_1 2021-11-02
  4. uid_1 2021-11-02
  5. uid_1 2021-11-03
  6. uid_1 2021-11-03
  7. uid_1 2021-11-04
  8. uid_2 2021-11-01
  9. uid_2 2021-11-02
  10. uid_3 2021-11-01
  11. uid_3 2021-11-02
  12. uid_3 2021-11-03

思路一

1、因为每天用户活跃次数可能不止一次,所以需要先将用户每天的活跃日期去重。
2、再用row_number() over(partition by uid order by dt)函数将用户 uid 分组,按照活跃时间进行排序。
3、计算活跃日期减去第二步骤得到的结果值,用户连续活跃情况下,每次相减的结果都相同
4、按照 uid 和日期分组并求和,筛选大于等于 3 的即为连续 3 天活跃登陆的用户。

SQL 演示,按照步骤编写
第一步,查询最近 7 天的数据,根据日期和uid去重。

  1. select
  2. uid
  3. ,dt
  4. from ods.ods_open_log
  5. where dt between date_sub('2021-11-05',7) and '20211105'
  6. group by
  7. uid
  8. ,dt
  9. ;

第二步,使用开窗函数排序

  1. select
  2. uid
  3. ,dt
  4. ,row_number() over(partition by uid order by dt asc) as rank
  5. from(
  6. select
  7. uid
  8. ,dt
  9. from ods.ods_open_log
  10. where dt between date_sub('2021-11-05',7) and '20211105'
  11. group by
  12. uid
  13. ,dt
  14. ) as t
  15. ;

第三步,使用活跃日期减去 rank 的值,得到一个新的日期,用户连续活跃情况下,每次相减的结果都相同。

  1. select
  2. uid
  3. ,dt
  4. ,date_sub(dt,rank) as new_dt
  5. from(
  6. select
  7. uid
  8. ,dt
  9. ,row_number() over(partition by uid order by dt asc) as rank
  10. from(
  11. select
  12. uid
  13. ,dt
  14. from ods.ods_open_log
  15. where dt between date_sub('2021-11-05',7) and '20211105'
  16. group by
  17. uid
  18. ,dt
  19. ) as t1
  20. ) as t2
  21. ;

第四步,按照 uid 和日期分组并求和,筛选大于等于 3 的即为连续 3 天活跃登陆的用户(注意最后结果再去个重)。

  1. select
  2. uid
  3. from(
  4. select
  5. uid
  6. ,new_dt
  7. from(
  8. select
  9. uid
  10. ,dt
  11. ,date_sub(dt,rank) as new_dt
  12. from(
  13. select
  14. uid
  15. ,dt
  16. ,row_number() over(partition by uid order by dt asc) as rank
  17. from(
  18. select
  19. uid
  20. ,dt
  21. from ods.ods_open_log
  22. where dt between date_sub('2021-11-05',7) and '20211105'
  23. group by
  24. uid
  25. ,dt
  26. ) as t1
  27. ) as t2
  28. ) as t3
  29. group by uid,new_dt
  30. having count(1) >= 3
  31. ) as t4
  32. group by uid
  33. ;
  34. -- 得到结果
  35. uid_1
  36. uid_3

思路二

使用lag(向后)或者lead(向前)

这里计算的是连续 3 天,所以步长设置为 2

  1. select
  2. uid
  3. ,dt
  4. ,lead(dt) over(partition by uid order by dt asc) as next_dt
  5. from(
  6. select
  7. uid
  8. ,dt
  9. from ods.ods_open_log
  10. where dt between date_sub('2021-11-05',7) and '20211105'
  11. group by
  12. uid
  13. ,dt
  14. ) as t
  15. ;

得到结果如下:
截屏2021-11-05 上午11.26.17.png

  1. select
  2. uid
  3. ,dt
  4. from(
  5. select
  6. uid
  7. ,dt
  8. ,lead(dt) over(partition by uid order by dt asc) as next_1_dt
  9. ,lead(dt,2) over(partition by uid order by dt asc) as next_2_dt
  10. from(
  11. select
  12. uid
  13. ,dt
  14. from ods.ods_open_log
  15. where dt between date_sub('2021-11-05',7) and '20211105'
  16. group by
  17. uid
  18. ,dt
  19. ) as t1
  20. ) as t2
  21. -- 保留日期差值等于 1
  22. where datediff(next_dt,dt) = 1
  23. ;