Problem 1: Find the videos that have exactly two genres

  • Problem description: there is a long-form video genre dimension table Tencent_video_cid_cate.
    • Fields: category (content class, e.g. 电视剧 / 电影); cid_nm (title name); cate (genres, stored as a semicolon-separated string such as 青春;武侠;悬疑).


  • Requirement: find the videos that have exactly two genres.
    • Data preparation:
```sql
DROP TABLE IF EXISTS `Tencent_video_cid_cate`;
CREATE TABLE IF NOT EXISTS `Tencent_video_cid_cate` (
    mdate DATE,
    category VARCHAR(10),
    cid_nm VARCHAR(20),
    cate VARCHAR(50)
);
INSERT INTO `Tencent_video_cid_cate` VALUES
('20210301', '电视剧', '陈情令', '玄幻;古装'),
('20210302', '电影', '长歌行', '爱情;历史古装;武侠');
SELECT * FROM `Tencent_video_cid_cate`;
```
  • Solution approach:
    • Within a record, the different genres in the cate field are separated by semicolons ;.
    • The relationship is therefore: number of genres = number of semicolons + 1, so a record with exactly two genres contains exactly one semicolon.
    • If every semicolon in cate is replaced with the empty string '', then: number of semicolons = CHAR_LENGTH(cate) - CHAR_LENGTH(REPLACE(cate, ';', '')).
    • So we only need the rows where CHAR_LENGTH(cate) - CHAR_LENGTH(fin_cate) = 1; those rows are the videos with exactly two genres.
  • SQL implementation:
```sql
-- Query the title, the genre string, and the genre string with the semicolons removed
-- (removing a semicolon means replacing it with the empty string).
SELECT cid_nm, cate, REPLACE(cate, ';', '') AS fin_cate FROM tencent_video_cid_cate;

-- Filter out the titles whose genre string contains exactly one semicolon.
SELECT cid_nm FROM (
    SELECT cid_nm, cate, REPLACE(cate, ';', '') AS fin_cate FROM tencent_video_cid_cate
) AS tmp_mov
WHERE CHAR_LENGTH(cate) - CHAR_LENGTH(fin_cate) = 1;

-- Query the full video rows by title.
SELECT * FROM tencent_video_cid_cate WHERE cid_nm IN (
    SELECT cid_nm FROM (
        SELECT cid_nm, cate, REPLACE(cate, ';', '') AS fin_cate FROM tencent_video_cid_cate
    ) AS tmp_mov
    WHERE CHAR_LENGTH(cate) - CHAR_LENGTH(fin_cate) = 1
);
```
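  • Aside (a minimal sketch, not part of the original notes): the three steps above can be collapsed into a single query, since the semicolon count is just the length difference before and after REPLACE.
```sql
-- Compact variant of the solution above: a title has exactly two genres
-- when removing the semicolons shortens the genre string by exactly one character.
SELECT *
FROM tencent_video_cid_cate
WHERE CHAR_LENGTH(cate) - CHAR_LENGTH(REPLACE(cate, ';', '')) = 1;
```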

Problem 2: Distribution of customer order volume

  • Problem description: the database contains a parcel-pickup table with the following fields:
  ![1668241314139.jpg](https://cdn.nlark.com/yuque/0/2022/jpeg/2692415/1668241320976-955b8ba7-d78a-4ed7-bc67-8ccccb8d0833.jpeg#averageHue=%23d6d1cb&clientId=u5548aa8a-bd78-4&from=paste&height=44&id=u2bbd84b6&originHeight=44&originWidth=238&originalType=binary&ratio=1&rotation=0&showTitle=false&size=2919&status=done&style=none&taskId=uba5b01c7-158a-4bd0-a8f6-b68c4289a4a&title=&width=238)
  • Requirement: for orders created between 05-01 and 05-31, compute how many customers fall into each order-volume bucket.
  • The expected output:
  ![1668241419065.jpg](https://cdn.nlark.com/yuque/0/2022/jpeg/2692415/1668241424804-31b81237-ab73-41dd-98d0-911b3d9d8a8d.jpeg#averageHue=%23dfdbd7&clientId=u5548aa8a-bd78-4&from=paste&height=101&id=u980f3767&originHeight=101&originWidth=145&originalType=binary&ratio=1&rotation=0&showTitle=false&size=2862&status=done&style=none&taskId=ua5de2eb6-77c1-45a5-8483-c1e91fa68c4&title=&width=145)
  • Data preparation:
```sql
DROP TABLE IF EXISTS `express_tb`;
CREATE TABLE IF NOT EXISTS `express_tb` (
    `order_id` VARCHAR(255) PRIMARY KEY COMMENT '运单号',
    `client_id` VARCHAR(255) COMMENT '客户ID',
    `create_date` DATE COMMENT '创建日期'
);
INSERT INTO `express_tb` VALUES
('pno0001', 'cc001', '2022-05-01'),
('pno0002', 'cc001', '2022-05-01'),
('pno0003', 'cc002', '2022-05-02'),
('pno0004', 'cc002', '2022-05-03'),
('pno0005', 'cc002', '2022-05-04'),
('pno0006', 'cc002', '2022-05-05'),
('pno0007', 'cc002', '2022-05-06'),
('pno0008', 'cc002', '2022-05-06'),
('pno0009', 'cc003', '2022-05-01'),
('pno0010', 'cc003', '2022-05-01'),
('pno0011', 'cc003', '2022-05-01'),
('pno0012', 'cc003', '2022-05-01'),
('pno0013', 'cc003', '2022-05-01'),
('pno0014', 'cc003', '2022-05-01'),
('pno0015', 'cc003', '2022-05-01'),
('pno0016', 'cc003', '2022-05-01'),
('pno0017', 'cc003', '2022-05-01'),
('pno0018', 'cc004', '2022-05-01'),
('pno0019', 'cc004', '2022-05-01'),
('pno0020', 'cc004', '2022-05-01'),
('pno0021', 'cc004', '2022-05-01'),
('pno0022', 'cc004', '2022-05-01'),
('pno0023', 'cc004', '2022-05-01'),
('pno0024', 'cc005', '2022-05-01'),
('pno0025', 'cc005', '2022-05-01');
SELECT * FROM express_tb;
```
  • Solution approach:
    • First, count each customer's number of orders within the required date range.
    • Then use IF or CASE WHEN to map each customer's order count to a volume bucket.
    • Finally, group by bucket and count the customers in each bucket.
  • SQL implementation:
```sql
-- Count each customer's orders created between 2022-05-01 and 2022-05-31.
SELECT client_id, COUNT(order_id) AS order_num
FROM express_tb
WHERE create_date BETWEEN '2022-05-01' AND '2022-05-31'
GROUP BY client_id;

-- Bucket the customers by order count.
SELECT *,
    CASE
        WHEN order_num > 20 THEN '20以上'
        WHEN order_num > 10 THEN '11-20'
        WHEN order_num > 5 THEN '5-10'
        ELSE '0-5'
    END AS order_level
FROM (
    SELECT client_id, COUNT(order_id) AS order_num
    FROM express_tb
    WHERE create_date BETWEEN '2022-05-01' AND '2022-05-31'
    GROUP BY client_id
) AS tmp;

-- Group by bucket and count the customers in each bucket.
SELECT order_level AS 单量, COUNT(client_id) AS 客户数
FROM (
    SELECT *,
        CASE
            WHEN order_num > 20 THEN '20以上'
            WHEN order_num > 10 THEN '11-20'
            WHEN order_num > 5 THEN '5-10'
            ELSE '0-5'
        END AS order_level
    FROM (
        SELECT client_id, COUNT(order_id) AS order_num
        FROM express_tb
        WHERE create_date BETWEEN '2022-05-01' AND '2022-05-31'
        GROUP BY client_id
    ) AS tmp
) AS tmp
GROUP BY order_level;
```

  • Refining the result:
    • Problem: with the current data, only two buckets show up, because the other two buckets have no matching customers.
  ![image.png](https://cdn.nlark.com/yuque/0/2022/png/2692415/1668243485315-b581b7dc-db63-4ec7-8854-17f2f0e70ac2.png#averageHue=%23333232&clientId=u5548aa8a-bd78-4&from=paste&height=82&id=ud78e86c5&originHeight=82&originWidth=269&originalType=binary&ratio=1&rotation=0&showTitle=false&size=4861&status=done&style=none&taskId=ubcdb6278-6007-4eba-a6b9-5b9dba5aa00&title=&width=269)
    • Requirement: show the other two buckets as well, with a customer count of 0.
    • Implementation:
```sql
-- First, build a bucket (dimension) table.
SELECT '20以上' AS 单量
UNION
SELECT '11-20' AS 单量
UNION
SELECT '5-10' AS 单量
UNION
SELECT '0-5' AS 单量;
-- Then join this bucket table with the previous result.
-- Every row of the bucket table must appear in the output, so an outer join is needed.
-- Finally, use IFNULL to turn the missing customer counts into 0.
SELECT t2.单量, IFNULL(t1.客户数, 0) AS 客户数 FROM (
    SELECT
        order_level AS 单量,
        COUNT(client_id) AS 客户数
    FROM (
        SELECT *,
            CASE
                WHEN order_num > 20 THEN '20以上'
                WHEN order_num > 10 THEN '11-20'
                WHEN order_num > 5 THEN '5-10'
                ELSE '0-5'
            END AS order_level
        FROM (
            SELECT client_id, COUNT(order_id) AS order_num FROM express_tb
            WHERE create_date BETWEEN '2022-05-01' AND '2022-05-31'
            GROUP BY client_id
        ) AS tmp
    ) AS tmp
    GROUP BY order_level
) AS t1 RIGHT OUTER JOIN (
    SELECT '20以上' AS 单量 UNION SELECT '11-20' AS 单量 UNION SELECT '5-10' AS 单量 UNION SELECT '0-5' AS 单量
) AS t2 ON t1.单量 = t2.单量;
```
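  • Aside (a sketch, assuming MySQL 8.0+): the refined query can also be written with CTEs; the bucketing and outer join are unchanged, only the nesting is flattened for readability.
```sql
-- Same bucketing and outer join as above, expressed with CTEs (requires MySQL 8.0+).
WITH per_client AS (
    SELECT client_id, COUNT(order_id) AS order_num
    FROM express_tb
    WHERE create_date BETWEEN '2022-05-01' AND '2022-05-31'
    GROUP BY client_id
),
leveled AS (
    SELECT client_id,
           CASE
               WHEN order_num > 20 THEN '20以上'
               WHEN order_num > 10 THEN '11-20'
               WHEN order_num > 5  THEN '5-10'
               ELSE '0-5'
           END AS 单量
    FROM per_client
),
levels AS (
    SELECT '20以上' AS 单量
    UNION SELECT '11-20' UNION SELECT '5-10' UNION SELECT '0-5'
)
SELECT levels.单量, COUNT(leveled.client_id) AS 客户数
FROM levels
LEFT JOIN leveled ON leveled.单量 = levels.单量
GROUP BY levels.单量;
```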

Problem 3: Users created yesterday, this week, and this month

  • Business scenario: there is a user table user with fields uid (user ID) and create_time (user creation time, stored as a Unix timestamp).


  • Data preparation:
```sql
-- Create the user table
DROP TABLE IF EXISTS user;
CREATE TABLE IF NOT EXISTS user (
    uid INT(11) PRIMARY KEY AUTO_INCREMENT COMMENT '用户ID',
    create_time INT(11) COMMENT '用户创建时间'
);

-- Insert data (the values are time-sensitive, so adjust them by hand when replaying).
INSERT INTO user(create_time) VALUES
-- Today's timestamp: SELECT UNIX_TIMESTAMP(); -- 1668434630
(1668434630), (1668434630), (1668434630), (1668434630), (1668434630), (1668434630),
(1668434630), (1668434630), (1668434630), (1668434630), (1668434630),
-- One day ago: SELECT UNIX_TIMESTAMP(DATE_SUB(NOW(), INTERVAL 1 DAY)); -- 1668348276
(1668348276), (1668348276), (1668348276), (1668348276), (1668348276), (1668348276), (1668348276),
-- One week ago: SELECT UNIX_TIMESTAMP(DATE_SUB(NOW(), INTERVAL 1 WEEK)); -- 1667829891
(1667829891), (1667829891), (1667829891), (1667829891), (1667829891), (1667829891),
(1667829891), (1667829891), (1667829891), (1667829891), (1667829891), (1667829891),
-- Ten days ago: SELECT UNIX_TIMESTAMP(DATE_SUB(NOW(), INTERVAL 10 DAY)); -- 1667570721
-- Note: these notes were written on 2022-11-13, and this batch is meant to stay inside the current month.
-- If you replay this before the 10th of a month, do not use the "ten days ago" timestamp.
(1667570721), (1667570721), (1667570721), (1667570721), (1667570721);

-- Verify the insert
SELECT * FROM user;
```

  • Requirement 1: write SQL that returns the number of users created yesterday, this week, and this month, in the following shape:
  ![1668336533007.jpg](https://cdn.nlark.com/yuque/0/2022/jpeg/2692415/1668336536465-949958c8-b467-4f28-9a93-8974271bac9a.jpeg#averageHue=%23e1dfdd&clientId=ub64b792f-82c9-4&from=paste&height=44&id=MSaoQ&originHeight=44&originWidth=217&originalType=binary&ratio=1&rotation=0&showTitle=false&size=1623&status=done&style=none&taskId=uf7bd0f4e-832d-4df9-98f6-af66067d675&title=&width=217)
```sql
-- For readability, convert the Unix timestamps into regular datetime values.
SELECT *, FROM_UNIXTIME(create_time) AS date_time FROM user;
-- Tag each row with the time ranges it belongs to.
SELECT
    *,
    IF(DATEDIFF(NOW(), date_time) = 1, 1, 0) AS yesterday_tag, -- 1 if the row was created yesterday, else 0
    IF(WEEK(date_time, 1) = WEEK(NOW(), 1), 1, 0) AS week_tag,  -- 1 if the row was created this week
    IF(MONTH(date_time) = MONTH(NOW()), 1, 0) AS month_tag      -- 1 if the row was created this month
FROM (
    SELECT *, FROM_UNIXTIME(create_time) AS date_time FROM user
) AS tmp;
-- Sum the tags of each time range to get the final result.
SELECT
    SUM(IF(DATEDIFF(NOW(), date_time) = 1, 1, 0)) AS '昨天',
    SUM(IF(WEEK(date_time, 1) = WEEK(NOW(), 1), 1, 0)) AS '本周',
    SUM(IF(MONTH(date_time) = MONTH(NOW()), 1, 0)) AS '本月'
FROM (
    SELECT *, FROM_UNIXTIME(create_time) AS date_time FROM user
) AS tmp;
```
  • Requirement 2: write SQL that returns the numbers of users created yesterday, this week, and this month, laid out vertically (one row per period, with columns date and num).


```sql
-- Requirement 1's query is needed several times here, so save it as a view first.
-- (Views are covered in 07. MySQL高级操作.)
CREATE VIEW v_date AS (
    SELECT
        SUM(IF(DATEDIFF(NOW(), date_time) = 1, 1, 0)) AS '昨天',
        SUM(IF(WEEK(date_time, 1) = WEEK(NOW(), 1), 1, 0)) AS '本周',
        SUM(IF(MONTH(date_time) = MONTH(NOW()), 1, 0)) AS '本月'
    FROM (
        SELECT *, FROM_UNIXTIME(create_time) AS date_time FROM user
    ) AS tmp
);
SELECT * FROM v_date;
-- Next, query the yesterday / this-week / this-month values separately,
-- then stitch them together with UNION to get the vertical layout.
SELECT '昨天' AS date, 昨天 AS num FROM v_date
UNION
SELECT '本周' AS date, 本周 AS num FROM v_date
UNION
SELECT '本月' AS date, 本月 AS num FROM v_date;
```
  • Requirement 3: write SQL that returns the number of users created on each of the last 3 days (one row per day, with columns time and num).


```sql
-- "The last three days" means: the day before yesterday, yesterday, and today.
-- "Within three days from now" would instead mean: today, tomorrow, and the day after tomorrow.
-- Convert the Unix timestamps into datetime values and save the result as a view.
CREATE VIEW v_user AS SELECT *, FROM_UNIXTIME(create_time) AS date_time FROM user;
SELECT * FROM v_user;
-- Keep only the rows from the last three days, dropping anything older,
-- and format the date the way the expected output shows it.
-- Saving this step as a view makes the following steps easier.
CREATE VIEW v_three_day AS (
    SELECT
        *,
        DATE_FORMAT(date_time, '%Y%m%d') AS time
    FROM v_user WHERE DATEDIFF(NOW(), date_time) <= 2
);
SELECT * FROM v_three_day;
-- Count the users per day.
SELECT
    time,
    COUNT(*) AS num
FROM v_three_day
GROUP BY time;
-- Finally, tidy up the output so that every one of the three days is shown.
SELECT
    t1.time,
    IFNULL(t2.num, 0) AS num
FROM (
    SELECT DATE_FORMAT(NOW(), '%Y%m%d') AS time
    UNION
    SELECT DATE_FORMAT(DATE(DATE_SUB(NOW(), INTERVAL 1 DAY)), '%Y%m%d') AS time
    UNION
    SELECT DATE_FORMAT(DATE(DATE_SUB(NOW(), INTERVAL 2 DAY)), '%Y%m%d') AS time
) AS t1 LEFT OUTER JOIN (
    SELECT time, COUNT(*) AS num
    FROM v_three_day GROUP BY time
) AS t2 ON t1.time = t2.time;
```
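  • Aside (a view-free sketch of requirement 3, using the same user table and the same 3-day date spine as above): the views are handy for step-by-step work, but the final answer can also be produced in one statement.
```sql
-- Build the 3-day date spine inline and LEFT JOIN it to the per-day counts;
-- the filtering and formatting mirror the view-based steps above.
SELECT
    d.time,
    IFNULL(c.num, 0) AS num
FROM (
    SELECT DATE_FORMAT(CURDATE(), '%Y%m%d') AS time
    UNION
    SELECT DATE_FORMAT(CURDATE() - INTERVAL 1 DAY, '%Y%m%d')
    UNION
    SELECT DATE_FORMAT(CURDATE() - INTERVAL 2 DAY, '%Y%m%d')
) AS d
LEFT JOIN (
    SELECT
        DATE_FORMAT(FROM_UNIXTIME(create_time), '%Y%m%d') AS time,
        COUNT(*) AS num
    FROM user
    WHERE DATEDIFF(NOW(), FROM_UNIXTIME(create_time)) <= 2
    GROUP BY time
) AS c ON c.time = d.time;
```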

Problem 4: Distribution of users by purchase platform

  • Problem description: there is a spending table spending with fields user_id, spend_date, platform, and amount.


  • Table description:
    • The table records users' spending history on an online shopping website.
    • The site offers both a desktop application ('desktop') and a mobile application ('mobile').
    • The primary key of the table is (user_id, spend_date, platform).
    • The platform column is an ENUM of ('desktop', 'mobile').
  • Requirement: write SQL to find, for each day, the number of users who purchased only on mobile, only on desktop, and on both desktop and mobile.
  • Expected result (columns spend_date, platform, total_users):
    • On 2019-07-01, user 1 purchased on both desktop and mobile, user 2 purchased only on mobile, and user 3 purchased only on desktop.
    • On 2019-07-02, user 2 purchased only on mobile, user 3 purchased only on desktop, and no user purchased on both.


  • Data preparation:
```sql
-- Create the spending table
DROP TABLE IF EXISTS spending;
CREATE TABLE IF NOT EXISTS spending (
    user_id INT COMMENT '用户ID',
    spend_date DATE COMMENT '消费时间',
    platform ENUM('desktop', 'mobile') COMMENT '使用平台',
    amount INT COMMENT '支出金额',
    PRIMARY KEY(user_id, spend_date, platform)
);

-- Insert data
INSERT INTO spending VALUES
(1, '2019-07-01', 'mobile', 100),
(1, '2019-07-01', 'desktop', 100),
(2, '2019-07-01', 'mobile', 100),
(2, '2019-07-02', 'mobile', 100),
(3, '2019-07-01', 'desktop', 100),
(3, '2019-07-02', 'desktop', 100);

-- Verify
SELECT * FROM spending;
```

  • Solution approach:
    • Step 1:
      • Overall, the requirement asks for the number of purchasing users per day and per platform class, where the classes are desktop only, mobile only, and both.
      • To get there, first determine each user's purchase platforms per day (group by date and user ID, then aggregate that user's platforms for the day with GROUP_CONCAT()).
    • Step 2:
      • For a user who should end up in the 'both' class, the aggregated string is `desktop,mobile`.
      • In other words, whenever the aggregated string contains a comma `,`, replace the whole string with `both`.
    • Step 3:
      • On some days a platform class may have no users, but the result must still show those rows, just with a count of 0.
      • So a helper table is built to provide the complete set of date/platform combinations.
    • Step 4: join the helper table from step 3 with the result of step 2, then aggregate to compute total_users.
  • SQL implementation:
```sql
-- Aggregate each user's purchase platforms per day.
CREATE VIEW v_spend AS (
    SELECT
        spend_date,
        user_id,
        GROUP_CONCAT(platform) AS platform
    FROM spending
    GROUP BY spend_date, user_id
);
SELECT * FROM v_spend;
-- Replace the two-platform aggregates with 'both'.
SELECT
    spend_date,
    user_id,
    IF(INSTR(platform, ','), 'both', platform) AS platform
FROM v_spend;
-- Build the helper table that lists every date/platform combination.
CREATE VIEW v_platform AS (
    SELECT DISTINCT spend_date, 'both' AS platform FROM spending
    UNION
    SELECT DISTINCT spend_date, 'mobile' AS platform FROM spending
    UNION
    SELECT DISTINCT spend_date, 'desktop' AS platform FROM spending
    ORDER BY spend_date, platform DESC
);
SELECT * FROM v_platform;
-- Join the helper table with the result of step 2,
-- then aggregate to get total_users.
SELECT
    p.spend_date,
    p.platform,
    COUNT(user_id) AS total_users
FROM v_platform AS p
LEFT OUTER JOIN (
    SELECT spend_date, user_id,
        IF(INSTR(platform, ','), 'both', platform) AS platform
    FROM v_spend
) AS t
ON p.spend_date = t.spend_date AND p.platform = t.platform
GROUP BY p.spend_date, p.platform;
```
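  • Aside (a sketch of an alternative that avoids GROUP_CONCAT string matching, still using the v_platform helper view above): counting the distinct platforms per (day, user) gives the same three-way classification.
```sql
-- Classify each (spend_date, user_id) pair by the number of distinct platforms used,
-- then count users per class against the v_platform spine so that empty classes show 0.
SELECT
    p.spend_date,
    p.platform,
    COUNT(u.user_id) AS total_users
FROM v_platform AS p
LEFT OUTER JOIN (
    SELECT
        spend_date,
        user_id,
        IF(COUNT(DISTINCT platform) = 2, 'both', MAX(platform)) AS platform
    FROM spending
    GROUP BY spend_date, user_id
) AS u ON u.spend_date = p.spend_date AND u.platform = p.platform
GROUP BY p.spend_date, p.platform;
```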

Problem 5: User orders

  • Table descriptions (two tables):
    • Order detail table ord, with fields: dt (date), dept (department), pin (user ID), ord_id (order ID), sku_id (product ID), qtty (quantity sold), amount (order amount).
    • Full user-profile table npl, with fields: dt (date), pin (user ID), gender (男/女), age (0-18/19-25/26-30/31-35/36-45/46以上), city_lvl (city tier, L1/L2/L3/L4/L5/L6), profession (occupation).
  • Data preparation:
```sql
-- Create the order detail table ord
DROP TABLE IF EXISTS ord;
CREATE TABLE IF NOT EXISTS ord (
    ord_id INT PRIMARY KEY AUTO_INCREMENT COMMENT '订单ID',
    sku_id INT COMMENT '商品ID',
    pin INT COMMENT '用户ID',
    qtty INT COMMENT '销量',
    amount INT COMMENT '销售额',
    dept VARCHAR(255) COMMENT '部门',
    dt DATE COMMENT '订单日期'
);

-- Create the user-profile table npl
DROP TABLE IF EXISTS npl;
CREATE TABLE IF NOT EXISTS npl (
    pin INT PRIMARY KEY COMMENT '用户id',
    gender VARCHAR(10) COMMENT '性别',
    age VARCHAR(255) COMMENT '年龄0-18/19-25/26-30/31-35/36-45/46以上',
    city_lvl VARCHAR(255) COMMENT '城市级别 L1/L2…/L6',
    profession VARCHAR(255) COMMENT '职业',
    dt DATE COMMENT '创建日期'
);

-- Insert data
INSERT INTO npl VALUES
(10001, '男', '19-25', 'L2', '医务人员', '2021-06-01'),
(10002, '男', '0-18', 'L1', '学生', '2021-06-10'),
(10003, '女', '0-18', 'L2', '学生', '2021-06-05'),
(10004, '女', '19-25', 'L3', '行政人员', '2021-06-09'),
(10005, '男', '31-35', 'L2', '教师', '2021-06-01'),
(10006, '女', '36-45', 'L1', '演员', '2021-06-09'),
(10007, '男', '26-30', 'L4', '医务人员', '2020-06-10'),
(10008, '男', '26-30', 'L4', '医务人员', '2020-07-01'),
(10009, '女', '36-45', 'L6', '演员', '2020-07-01'),
(10010, '女', '26-30', 'L6', '医务人员', '2020-07-05'),
(10011, '女', '36-45', 'L6', '演员', '2020-07-07'),
(10012, '女', '26-30', 'L1', '医务人员', '2021-06-15'),
(10013, '男', '46以上', 'L1', '演员', '2019-06-22'),
(10014, '男', '46以上', 'L1', '医务人员', '2019-06-21'),
(10015, '男', '31-35', 'L5', '演员', '2019-06-11'),
(10016, '女', '46以上', 'L5', '医务人员', '2019-07-03'),
(10017, '男', '19-25', 'L2', '医务人员', '2019-07-09'),
(10018, '女', '46以上', 'L3', '教师', '2019-06-14'),
(10019, '女', '31-35', 'L5', '教师', '2019-06-22'),
(10020, '男', '31-35', 'L4', '医务人员', '2020-06-17');

INSERT INTO ord(sku_id, pin, qtty, amount, dept, dt) VALUES
(10, 10003, 10, 100, '01', '2020-11-15'),
(20, 10003, 10, 300, '02', '2020-12-15'),
(30, 10003, 10, 500, '03', '2021-01-15'),
(30, 10005, 10, 500, '01', '2021-01-15'),
(20, 10008, 10, 300, '03', '2021-05-15'),
(10, 10007, 10, 100, '03', '2021-05-19'),
(10, 10010, 10, 100, '02', '2021-06-15'),
(20, 10015, 10, 300, '02', '2021-08-15'),
(20, 10012, 10, 300, '01', '2021-08-15'),
(20, 10017, 10, 300, '02', '2021-11-15'),
(30, 10010, 10, 500, '02', '2022-01-15'),
(30, 10005, 10, 500, '01', '2022-03-15'),
(30, 10004, 10, 500, '03', '2022-03-19'),
(20, 10008, 10, 300, '03', '2022-05-15'),
(20, 10005, 10, 300, '03', '2022-05-17'),
(10, 10011, 10, 100, '03', '2022-05-19'),
(30, 10015, 10, 500, '02', '2022-06-15'),
(30, 10019, 10, 500, '01', '2022-06-18'),
(10, 10011, 10, 100, '03', '2022-06-16');

-- The rows below must be adjusted to your current date: they are meant to be
-- "last week" data (for this year and for the same week last year).
INSERT INTO ord(sku_id, pin, qtty, amount, dept, dt) VALUES
(20, 10012, 10, 300, '01', '2021-11-14'),
(20, 10017, 10, 300, '02', '2021-11-15'),
(30, 10010, 10, 500, '02', '2021-11-15'),
(30, 10005, 10, 500, '01', '2021-11-14'),
(30, 10004, 10, 500, '03', '2022-11-20'),
(20, 10008, 10, 300, '03', '2022-11-17'),
(20, 10005, 10, 300, '03', '2022-11-14'),
(10, 10011, 10, 100, '03', '2022-11-17'),
(30, 10015, 10, 500, '02', '2022-11-15'),
(30, 10019, 10, 500, '01', '2022-11-14');
```

  • Requirement 1: for last week, extract each department's number of ordering users, quantity sold, sales amount, number of orders, and amount per user, plus the year-over-year change in the number of ordering users.
    • Note: "last week" here refers to 2022-11-14 through 2022-11-20 (the same applies below).
    • "Year over year" means comparing against the same period of last year.
```sql
-- Extract last week's per-department ordering users, quantity, amount, orders, and amount per user.
-- First, filter out last week's rows.
SELECT * FROM ord
WHERE YEAR(dt) = YEAR(NOW())
AND WEEK(dt) + 1 = WEEK(NOW());
-- Group by department and compute the metrics.
SELECT
    dept,
    COUNT(DISTINCT pin) AS 用户数, -- a user may order several times, so deduplicate before counting
    SUM(qtty) AS 销量,
    SUM(amount) AS 金额,
    COUNT(ord_id) AS 订单量,
    ROUND(SUM(amount) / COUNT(DISTINCT pin), 2) AS 人均金额 -- amount per user = total amount / number of users
FROM ord
WHERE YEAR(dt) = YEAR(NOW())
AND WEEK(dt) + 1 = WEEK(NOW())
GROUP BY dept;
-- For the year-over-year figure, count the ordering users in the same week last year.
SELECT
    dept,
    COUNT(DISTINCT pin) AS 用户数 -- a user may order several times, so deduplicate before counting
FROM ord
WHERE YEAR(dt) + 1 = YEAR(NOW())
AND WEEK(CONCAT_WS('-', YEAR(NOW()), SUBSTR(dt, 6))) + 1 = WEEK(NOW()) -- the same calendar date may fall in a different week number last year, so map the date onto this year first
GROUP BY dept;
-- Combine everything into the year-over-year ratio (this year / last year).
SELECT
    dept,
    COUNT(DISTINCT pin) AS 用户数, -- a user may order several times, so deduplicate before counting
    SUM(qtty) AS 销量,
    SUM(amount) AS 金额,
    COUNT(ord_id) AS 订单量,
    ROUND(SUM(amount) / COUNT(DISTINCT pin), 2) AS 人均金额, -- amount per user = total amount / number of users
    IFNULL(
        COUNT(DISTINCT pin) / (SELECT 用户数 FROM (
            SELECT
                dept,
                COUNT(DISTINCT pin) AS 用户数 -- a user may order several times, so deduplicate before counting
            FROM ord
            WHERE YEAR(dt) + 1 = YEAR(NOW())
            AND WEEK(CONCAT_WS('-', YEAR(NOW()), SUBSTR(dt, 6))) + 1 = WEEK(NOW()) -- map last year's dates onto this year's calendar first
            GROUP BY dept
        ) AS tmp WHERE tmp.dept = ord.dept), 0
    ) AS 同比
FROM ord
WHERE YEAR(dt) = YEAR(NOW())
AND WEEK(dt) + 1 = WEEK(NOW())
GROUP BY dept;
```
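  • Aside (a sketch of just the year-over-year part, written as a LEFT JOIN instead of a correlated subquery; it keeps the WEEK() conventions used above):
```sql
-- This year's week vs. the same week last year, joined by department;
-- IFNULL keeps the ratio at 0 when last year has no data for a department.
SELECT
    cur.dept,
    cur.用户数,
    IFNULL(cur.用户数 / prev.用户数, 0) AS 同比
FROM (
    SELECT dept, COUNT(DISTINCT pin) AS 用户数
    FROM ord
    WHERE YEAR(dt) = YEAR(NOW())
      AND WEEK(dt) + 1 = WEEK(NOW())
    GROUP BY dept
) AS cur
LEFT JOIN (
    SELECT dept, COUNT(DISTINCT pin) AS 用户数
    FROM ord
    WHERE YEAR(dt) + 1 = YEAR(NOW())
      AND WEEK(CONCAT_WS('-', YEAR(NOW()), SUBSTR(dt, 6))) + 1 = WEEK(NOW())
    GROUP BY dept
) AS prev ON prev.dept = cur.dept;
```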
  • Requirement 2: extract last week's top 100 products by quantity sold in each department.
```sql
-- First, compute last week's total quantity per department and product.
SELECT
    dept,
    sku_id,
    SUM(qtty) AS total_qtty
FROM ord
WHERE YEAR(dt) = YEAR(NOW())
AND WEEK(dt) + 1 = WEEK(NOW())
GROUP BY dept, sku_id;

-- Then rank the products within each department.
SELECT *,
    ROW_NUMBER() OVER(
        PARTITION BY dept
        ORDER BY total_qtty DESC
    ) AS ranking
FROM (
    SELECT dept, sku_id, SUM(qtty) AS total_qtty
    FROM ord
    WHERE YEAR(dt) = YEAR(NOW()) AND WEEK(dt) + 1 = WEEK(NOW())
    GROUP BY dept, sku_id
) AS tmp;

-- Finally, keep the top 100.
SELECT * FROM (
    SELECT *,
        ROW_NUMBER() OVER(
            PARTITION BY dept
            ORDER BY total_qtty DESC
        ) AS ranking
    FROM (
        SELECT dept, sku_id, SUM(qtty) AS total_qtty
        FROM ord
        WHERE YEAR(dt) = YEAR(NOW()) AND WEEK(dt) + 1 = WEEK(NOW())
        GROUP BY dept, sku_id
    ) AS tmp
) AS tmp
WHERE ranking <= 100;
```

  • Requirement 3: for last week's ordering users, extract the head-count distribution by department across gender, age, city tier, and profession (i.e. the user profile).
```sql
SELECT
    dept,
    gender,
    age,
    city_lvl,
    profession,
    COUNT(*) AS 人数
FROM ord
JOIN npl ON ord.pin = npl.pin
WHERE YEAR(ord.dt) = YEAR(NOW())
AND WEEK(ord.dt) + 1 = WEEK(NOW())
GROUP BY dept, gender, age, city_lvl, profession;
```
  • Requirement 4: count last week's overall numbers of new and old users, where a new user is defined as one who had never placed an order before that period (assume the data starts from 2020-01-01).
```sql
-- First, find the last day of last week.
SELECT DATE_SUB(CURDATE(), INTERVAL DAYOFWEEK(CURDATE()) DAY);

-- Users created before the last day of last week give us all of last week's new and old users.
SELECT npl.*, ord.dt AS ord_dt
FROM npl
LEFT OUTER JOIN ord ON npl.pin = ord.pin
WHERE npl.dt < DATE_SUB(CURDATE(), INTERVAL DAYOFWEEK(CURDATE()) DAY);

-- Old users are those with at least one order after 2020-01-01 (i.e. the order date is not NULL).
SELECT COUNT(DISTINCT pin) AS 老用户数
FROM (
    SELECT npl.*, ord.dt AS ord_dt
    FROM npl
    LEFT OUTER JOIN ord ON npl.pin = ord.pin
    WHERE npl.dt < DATE_SUB(CURDATE(), INTERVAL DAYOFWEEK(CURDATE()) DAY)
) AS tmp
WHERE ord_dt IS NOT NULL AND ord_dt > '2020-01-01';

-- New users are those with no orders at all, or with no orders after 2020-01-01,
-- i.e. the users not contained in the "ordered after 2020-01-01" set.
SELECT COUNT(DISTINCT pin) AS 新用户数
FROM (
    SELECT npl.*, ord.dt AS ord_dt
    FROM npl
    LEFT OUTER JOIN ord ON npl.pin = ord.pin
    WHERE npl.dt < DATE_SUB(CURDATE(), INTERVAL DAYOFWEEK(CURDATE()) DAY)
) AS tmp
WHERE ord_dt IS NULL OR pin NOT IN (
    SELECT DISTINCT pin FROM ord WHERE dt > '2020-01-01'
);
```
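  • Aside (a one-pass sketch that mirrors the same definition of new vs. old users as above, assuming the npl and ord tables from the data-preparation step):
```sql
-- An "old" user has at least one order after 2020-01-01; everyone else counts as "new".
-- The per-user flag is computed once, then both counts come from a single aggregation.
SELECT
    COUNT(DISTINCT IF(has_ord = 1, pin, NULL)) AS 老用户数,
    COUNT(DISTINCT IF(has_ord = 0, pin, NULL)) AS 新用户数
FROM (
    SELECT
        npl.pin,
        MAX(IF(ord.dt > '2020-01-01', 1, 0)) AS has_ord
    FROM npl
    LEFT OUTER JOIN ord ON npl.pin = ord.pin
    WHERE npl.dt < DATE_SUB(CURDATE(), INTERVAL DAYOFWEEK(CURDATE()) DAY)
    GROUP BY npl.pin
) AS tmp;
```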

  • Requirement 5: compute the next-month retention rate of the users who ordered in May of this year (next-month retention rate = the share of this month's ordering users who also purchase in the following month).
```sql
-- First, find the users who ordered in May.
SELECT DISTINCT
    pin
FROM ord
WHERE YEAR(dt) = YEAR(NOW())
AND MONTH(dt) = 5;
-- Then find the users who ordered in June.
SELECT DISTINCT pin FROM ord
WHERE YEAR(dt) = YEAR(NOW()) AND MONTH(dt) = 6;
-- From the June users, keep those who also ordered in May.
SELECT * FROM (
    SELECT DISTINCT pin FROM ord
    WHERE YEAR(dt) = YEAR(NOW()) AND MONTH(dt) = 6
) AS tmp
WHERE pin IN (
    SELECT DISTINCT pin FROM ord
    WHERE YEAR(dt) = YEAR(NOW()) AND MONTH(dt) = 5
);
-- By definition, the June retention rate = (users who ordered in both May and June) / (all May users).
SELECT (
    SELECT COUNT(*) FROM (
        SELECT DISTINCT pin FROM ord
        WHERE YEAR(dt) = YEAR(NOW()) AND MONTH(dt) = 6
    ) AS tmp
    WHERE pin IN (
        SELECT DISTINCT pin FROM ord
        WHERE YEAR(dt) = YEAR(NOW()) AND MONTH(dt) = 5
    )
) / (
    SELECT
        COUNT(DISTINCT pin)
    FROM ord
    WHERE YEAR(dt) = YEAR(NOW())
    AND MONTH(dt) = 5
) AS result;
```
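  • Aside (a single-statement sketch of the same retention rate, assuming the ord table above):
```sql
-- Keep only users who ordered in May of this year, flag whether they also ordered in June,
-- then the retention rate is the average of that flag.
SELECT
    SUM(has_jun) / COUNT(*) AS 次月留存率
FROM (
    SELECT
        pin,
        MAX(IF(MONTH(dt) = 6, 1, 0)) AS has_jun
    FROM ord
    WHERE YEAR(dt) = YEAR(NOW()) AND MONTH(dt) IN (5, 6)
    GROUP BY pin
    HAVING MAX(IF(MONTH(dt) = 5, 1, 0)) = 1 -- the user must have ordered in May
) AS t;
```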

Problem 6: Consecutive-login problem

  • Problem description: from the login log table, find the users who logged in on 7 consecutive days in June.
    • Table layout: user_id, login_datetime (login time, format yyyy-mm-dd hh:mm:ss).
  • Data preparation:
```sql
-- Create the table
DROP TABLE IF EXISTS login_tb;
CREATE TABLE IF NOT EXISTS login_tb (
    user_id INT,
    login_datetime DATETIME
);

-- Insert data
INSERT INTO login_tb VALUES
(1, '2022-06-20 11:23:35'),
(2, '2022-06-20 17:23:35'),
(3, '2022-06-20 21:23:35'),
(4, '2022-06-20 12:23:35'),
(5, '2022-06-20 19:23:35'),
(1, '2022-06-20 19:23:35'),
(1, '2022-06-20 21:23:35'),
(2, '2022-06-21 11:23:35'),
(2, '2022-06-22 11:23:35'),
(2, '2022-06-23 11:23:35'),
(2, '2022-06-24 11:23:35'),
(2, '2022-06-29 11:23:35'),
(2, '2022-06-30 11:23:35'),
(3, '2022-06-23 11:23:35'),
(3, '2022-06-24 11:23:35'),
(3, '2022-06-25 11:23:35'),
(3, '2022-06-27 11:23:35'),
(3, '2022-06-29 11:23:35'),
(3, '2022-06-30 11:23:35'),
(4, '2022-06-21 11:23:35'),
(4, '2022-06-29 11:23:35'),
(5, '2022-06-21 11:23:35'),
(5, '2022-06-22 11:23:35'),
(5, '2022-06-23 11:23:35'),
(5, '2022-06-24 11:23:35'),
(5, '2022-06-25 11:23:35'),
(5, '2022-06-26 11:23:35'),
(5, '2022-06-27 11:23:35'),
(1, '2022-06-25 11:23:35'),
(1, '2022-06-27 11:23:35');
```

  • Solution approach:
    • First preprocess the data and keep only the date part of the login time.
    • A user who logs in several times on the same day now produces duplicate rows, so deduplicate with DISTINCT.
    • Next, the rows of each user have to be grouped together without collapsing them, which calls for a window function.
      • A ranking window function does the job.
      • If the dates are consecutive, then `date - rank` is a constant, for example:
  ![1669807280580.jpg](https://cdn.nlark.com/yuque/0/2022/jpeg/2692415/1669807286336-ba31c842-4331-4ac1-9853-068f20450aaf.jpeg#averageHue=%23e4e1df&clientId=u3ca62dff-8b1a-4&from=paste&height=158&id=u635ad8f2&originHeight=158&originWidth=235&originalType=binary&ratio=1&rotation=0&showTitle=false&size=5362&status=done&style=none&taskId=u419e2597-3985-473e-ba3c-85b7fe96db0&title=&width=235)
    • Then implement the `date - rank` computation.
    • Finally, group by user ID and this difference and count how many times each difference appears. If a difference appears at least 7 times, that user has logged in on 7 consecutive days.
  • SQL implementation:
```sql
-- Normalize the time values and deduplicate.
SELECT DISTINCT
    user_id AS uid,
    DATE(login_datetime) AS login_dt
FROM login_tb;
-- Use a window function to rank each user's login dates.
SELECT
    *,
    ROW_NUMBER() OVER (
        PARTITION BY uid
        ORDER BY login_dt
    ) AS ranking
FROM (
    SELECT DISTINCT user_id AS uid, DATE(login_datetime) AS login_dt FROM login_tb
) AS tmp;
-- Compute the "date - rank" difference.
SELECT
    *,
    DAY(login_dt) - ranking AS diff
FROM (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY uid ORDER BY login_dt) AS ranking
    FROM (SELECT DISTINCT user_id AS uid, DATE(login_datetime) AS login_dt FROM login_tb) AS tmp
) AS tmp;
-- Group the rows and find the users who logged in on 7 consecutive days.
SELECT
    uid,
    diff,
    COUNT(uid) AS continue_login_days
FROM (
    SELECT *, DAY(login_dt) - ranking AS diff
    FROM (
        SELECT *, ROW_NUMBER() OVER (PARTITION BY uid ORDER BY login_dt) AS ranking
        FROM (SELECT DISTINCT user_id AS uid, DATE(login_datetime) AS login_dt FROM login_tb) AS tmp
    ) AS tmp
) AS tmp
GROUP BY uid, diff
HAVING continue_login_days >= 7;
```
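  • Aside (a sketch of a month-boundary-safe variant): `DAY(login_dt) - ranking` works here because all the data sits inside June, but a streak that crosses a month boundary would break the trick. Subtracting the rank as a number of days keeps the "same group means same anchor date" property in the general case.
```sql
-- Same grouping idea as above, but the anchor is a full date instead of a day-of-month difference.
SELECT
    uid,
    MIN(login_dt) AS start_dt,
    COUNT(*) AS continue_login_days
FROM (
    SELECT
        uid,
        login_dt,
        DATE_SUB(login_dt, INTERVAL ranking DAY) AS grp
    FROM (
        SELECT
            uid,
            login_dt,
            ROW_NUMBER() OVER (PARTITION BY uid ORDER BY login_dt) AS ranking
        FROM (
            SELECT DISTINCT user_id AS uid, DATE(login_datetime) AS login_dt
            FROM login_tb
        ) AS d
    ) AS r
) AS g
GROUP BY uid, grp
HAVING continue_login_days >= 7;
```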

Problem 7: Black-market (fraudulent) click behavior

  • Problem description:
    • There is a click log table tencent_video_click_online.
    • Fields: uid (user ID), click_date (click date), click_time (click timestamp), platform (21 = mobile, 22 = PC).
    • Requirement: if a user's two consecutive clicks on the mobile platform are no more than 2 seconds apart, it may indicate black-market behavior. Find all users with possible black-market behavior in July 2022.
  • Data preparation:
```sql
DROP TABLE IF EXISTS tencent_video_click_online;
CREATE TABLE IF NOT EXISTS tencent_video_click_online (
    uid INT COMMENT '用户ID',
    click_date DATE COMMENT '点击日期',
    click_time INT COMMENT '点击时间戳',
    platform INT COMMENT '平台'
);

INSERT INTO tencent_video_click_online VALUES
(1, '2022-07-09', 1657353606, 22),
(1, '2022-07-09', 1657353600, 21),
(1, '2022-07-09', 1657353601, 21),
(1, '2022-07-09', 1657353606, 21),
(2, '2022-07-09', 1657353599, 21),
(2, '2022-07-09', 1657353606, 21),
(3, '2022-07-09', 1657353598, 21),
(3, '2022-07-09', 1657353599, 21),
(4, '2022-07-09', 1657353601, 21),
(4, '2022-07-09', 1657353606, 22),
(5, '2022-07-09', 1657353600, 21),
(5, '2022-07-09', 1657353606, 21),
(7, '2022-07-09', 1657353600, 21),
(7, '2022-07-09', 1657353601, 21),
(7, '2022-07-09', 1657353603, 21);
```

  • SQL implementation:
```sql
-- Both clicks of a pair live in the same click table, so a self-join is used:
-- treat one copy as the first click and the other copy as the second click.
-- Join condition: same platform and same user.
SELECT * FROM tencent_video_click_online AS t1
INNER JOIN tencent_video_click_online AS t2
ON t1.uid = t2.uid
AND t1.platform = t2.platform;
-- Only mobile black-market behavior matters, so keep platform = 21,
-- and the first click must happen before the second one.
SELECT * FROM tencent_video_click_online AS t1
INNER JOIN tencent_video_click_online AS t2
ON t1.uid = t2.uid
AND t1.platform = t2.platform
WHERE t1.platform = 21
AND t1.click_time < t2.click_time;
-- Finally, the black-market criterion is a gap of at most 2 seconds
-- (second click time - first click time <= 2), restricted to July 2022.
SELECT DISTINCT t1.uid
FROM tencent_video_click_online AS t1
INNER JOIN tencent_video_click_online AS t2
ON t1.uid = t2.uid
AND t1.platform = t2.platform
WHERE t1.platform = 21
AND t1.click_time < t2.click_time
AND t2.click_time - t1.click_time <= 2
AND YEAR(t1.click_date) = 2022 AND MONTH(t1.click_date) = 7;
```
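  • Aside (a window-function sketch of the same check, assuming MySQL 8.0+): comparing each mobile click with the immediately following click of the same user avoids the self-join. Note that identical timestamps would also be flagged here, which is slightly stricter than the `<` condition above.
```sql
-- Flag users whose two consecutive mobile clicks in July 2022 are no more than 2 seconds apart.
SELECT DISTINCT uid
FROM (
    SELECT
        uid,
        click_time,
        LEAD(click_time) OVER (PARTITION BY uid ORDER BY click_time) AS next_click_time
    FROM tencent_video_click_online
    WHERE platform = 21
      AND YEAR(click_date) = 2022 AND MONTH(click_date) = 7
) AS t
WHERE next_click_time - click_time <= 2;
```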

Problem 8: Average streaming duration per anchor

  • Problem description:
    • There is an anchor (streamer) event table user_anchor.
    • Structure: id (anchor ID), dt (date, yyyy-mm-dd), event (event type: start streaming or stop streaming), time (when the start/stop event happened, yyyy-mm-dd hh:MM:ss).
    • Compute each anchor's average streaming duration per session in July. (Start and stop events always come in matching pairs; nothing is missing.)
  • Data preparation:
```sql
DROP TABLE IF EXISTS user_anchor;
CREATE TABLE IF NOT EXISTS user_anchor (
    id INT COMMENT '主播ID',
    dt DATE COMMENT '日期',
    event INT COMMENT '开播为1,下播为2',
    time DATETIME COMMENT '开播下播发生的时间'
);

INSERT INTO user_anchor VALUES
(1, '2022-07-01', 1, '2022-07-01 07:00:00'),
(1, '2022-07-01', 2, '2022-07-01 09:00:00'),
(1, '2022-07-01', 1, '2022-07-01 19:00:00'),
(1, '2022-07-01', 2, '2022-07-01 20:30:00'),
(1, '2022-07-02', 1, '2022-07-02 07:00:00'),
(1, '2022-07-02', 2, '2022-07-02 21:00:00'),
(1, '2022-07-03', 1, '2022-07-03 19:00:00'),
(1, '2022-07-04', 2, '2022-07-04 07:00:00'),
(2, '2022-07-01', 1, '2022-07-01 09:00:00'),
(2, '2022-07-01', 2, '2022-07-01 11:00:00'),
(2, '2022-07-03', 1, '2022-07-03 19:00:00'),
(2, '2022-07-03', 2, '2022-07-03 22:00:00');
```

  • SQL implementation:
```sql
-- The average streaming duration per anchor in July requires the total streaming time and the number of sessions.
-- Step 1: get each session's duration, i.e. the difference between a start time and a stop time.
-- (Treat the table as a "start" table and a "stop" table and self-join them.)
SELECT * FROM user_anchor AS online
INNER JOIN user_anchor AS offline
ON online.id = offline.id -- match events of the same anchor
AND online.event = 1 -- the "start" side only keeps start events
AND offline.event = 2 -- the "stop" side only keeps stop events
WHERE online.time < offline.time; -- a start time must be earlier than its stop time
-- After step 1, several start times end up matched with the same stop time.
-- Logically, each stop time belongs to the closest earlier start time,
-- so group by anchor and stop time and take the maximum start time.
SELECT
    online.id,
    MAX(online.time) AS online_time,
    offline.time AS offline_time
FROM user_anchor AS online
INNER JOIN user_anchor AS offline
ON online.id = offline.id
AND online.event = 1
AND offline.event = 2
WHERE online.time < offline.time
GROUP BY id, offline.time;
-- Step 3: on top of step 2, compute each session's duration (duration = stop time - start time).
SELECT
    *,
    TIMESTAMPDIFF(MINUTE, online_time, offline_time) AS minute_diff
FROM (
    SELECT online.id, MAX(online.time) AS online_time, offline.time AS offline_time
    FROM user_anchor AS online
    INNER JOIN user_anchor AS offline ON online.id = offline.id AND online.event = 1 AND offline.event = 2
    WHERE online.time < offline.time
    GROUP BY id, offline.time
) AS tmp;
-- Finally, group by anchor ID and average the durations.
SELECT
    id,
    AVG(minute_diff) AS avg_minute
FROM (
    SELECT *, TIMESTAMPDIFF(MINUTE, online_time, offline_time) AS minute_diff
    FROM (
        SELECT online.id, MAX(online.time) AS online_time, offline.time AS offline_time
        FROM user_anchor AS online
        INNER JOIN user_anchor AS offline ON online.id = offline.id AND online.event = 1 AND offline.event = 2
        WHERE online.time < offline.time
        GROUP BY id, offline.time
    ) AS tmp
) AS tmp
GROUP BY id;
```
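  • Aside (a window-function sketch of the same pairing, assuming MySQL 8.0+ and the problem's guarantee that start and stop events strictly alternate): each start event is paired with the event that immediately follows it, which must be the matching stop event.
```sql
-- Pair every start (event = 1) with the next event of the same anchor (its stop),
-- then average the session durations per anchor.
SELECT
    id,
    AVG(TIMESTAMPDIFF(MINUTE, time, next_time)) AS avg_minute
FROM (
    SELECT
        id,
        event,
        time,
        LEAD(time) OVER (PARTITION BY id ORDER BY time) AS next_time
    FROM user_anchor
) AS t
WHERE event = 1
GROUP BY id;
```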