Mazur 的 SQL 风格指南

您好!我是Matt Mazur ,是一名数据分析师,曾在几家初创公司工作过,帮助公司利用数据发展业务。本指南记录了我对格式化 SQL 的喜好,希望对其他人有一些用处。如果您或您的团队还没有 SQL 风格指南,那么它可以作为一个很好的起点,您可以根据自己的喜好来采用和更新它。

另外,我是一个坚定的信仰者Strong Opinions, Weakly Held。如果你不同意它, 给我私信, 我会喜欢并一起讨论的.

如果你喜欢这个课题,你应该也喜欢我的LookML Style Guide或者我的博客,我写了很多关于分析和数据分析的文章.

例子

这里是一些比较简单的查询,主要是为了展示下这个指南:

  1. with hubspot_interest as (
  2. select
  3. email,
  4. timestamp_millis(property_beacon_interest) as expressed_interest_at
  5. from hubspot.contact
  6. where property_beacon_interest is not null
  7. ),
  8. support_interest as (
  9. select
  10. conversation.email,
  11. conversation.created_at as expressed_interest_at
  12. from helpscout.conversation
  13. inner join helpscout.conversation_tag on conversation.id = conversation_tag.conversation_id
  14. where conversation_tag.tag = 'beacon-interest'
  15. ),
  16. combined_interest as (
  17. select * from hubspot_interest
  18. union all
  19. select * from support_interest
  20. ),
  21. final as (
  22. select
  23. email,
  24. min(expressed_interest_at) as expressed_interest_at
  25. from combined_interest
  26. group by email
  27. )
  28. select * from final

指南

使用小写字母 SQL

它就像大写 SQL 一样易读,而且你不必总是按住 Shift 键。

  1. --
  2. select * from users
  3. -- 不好
  4. SELECT * FROM users
  5. -- 不好
  6. Select * From users

单行查询 vs 多行查询

只需要查询单列,则推荐单行:

  1. --
  2. select * from users
  3. --
  4. select id from users
  5. --
  6. select count(*) from users

一旦你需要查询更多的列或比较复杂,分散在多行可以变得更容易阅读:

  1. --
  2. select
  3. id,
  4. email,
  5. created_at
  6. from users
  7. --
  8. select *
  9. from users
  10. where email = 'example@domain.com'
  11. --
  12. select
  13. user_id,
  14. count(*) as total_charges
  15. from charges
  16. group by user_id
  17. -- 不好
  18. select id, email, created_at
  19. from users
  20. -- 不好
  21. select id,
  22. email
  23. from users

左对齐 SQL 关键字

有些 IDE 能够自动格式化 SQL,以便 SQL 关键字之后的空格垂直对齐。手动做这个格式化非常麻烦(在我看来这样也更难阅读),所以我建议所有的关键字都左对齐。

  1. --
  2. select id, email
  3. from users
  4. where email like '%@gmail.com'
  5. -- 不好
  6. select id, email
  7. from users
  8. where email like '%@gmail.com'

使用单引号

有些 SQL 分支(例如 BigQuery)支持使用双引号,但是对于大多数分支,双引号都使用在列名上,因此最好使用单引号。

  1. --
  2. select *
  3. from users
  4. where email = 'example@domain.com'
  5. -- 不好
  6. select *
  7. from users
  8. where email = "example@domain.com"

使用 != 而不是 <>

很简单,因为 != 看起来像 “不等于”,更接近我们想要表达的意思。

  1. --
  2. select count(*) as paying_users_count
  3. from users
  4. where plan_name != 'free'

逗号应该在行尾

  1. --
  2. select
  3. id,
  4. email
  5. from users
  6. -- 不好
  7. select
  8. id
  9. , email
  10. from users

where 条件的缩进

当只有一个条件时,与 where 保持在同一行:

  1. select email
  2. from users
  3. where id = 1234

当有多个条件时,每一个条件都比 where 缩进一层。将逻辑运算符放在前一个条件的末尾:

  1. select id, email
  2. from users
  3. where
  4. created_at >= '2019-03-01' and
  5. vertical = 'work'

避免括号内的空格

  1. --
  2. select *
  3. from users
  4. where id in (1, 2)
  5. -- 不好
  6. select *
  7. from users
  8. where id in ( 1, 2 )

in 中比较长的列表,应该分在多个不同的缩进行

  1. --
  2. select *
  3. from users
  4. where email in (
  5. 'user-1@example.com',
  6. 'user-2@example.com',
  7. 'user-3@example.com',
  8. 'user-4@example.com'
  9. )

表名应该是名称复数蛇形风格

  1. --
  2. select * from users
  3. select * from visit_logs
  4. -- 不好
  5. select * from user
  6. select * from visitLog

列名应该是蛇形风格

  1. --
  2. select
  3. id,
  4. email,
  5. timestamp_trunc(created_at, month) as signup_month
  6. from users
  7. -- 不好
  8. select
  9. id,
  10. email,
  11. timestamp_trunc(created_at, month) as SignupMonth
  12. from users

列名约定

  • Boolean 类型应该有 is_has_does_ 前缀。例如 is_customerhas_unsubscribed 等。
  • Date-only 类型应该有 _date 后缀。例如 report_date 等。
  • Date+time 类型应该有 _at 后缀。例如 created_atposted_at 等。

列命约定

将主键放到最前面,然后是外键,最后是其他列。如果有任何系统列(如 created_atupdated_atis_deleted 等等,把它们放到最后。

  1. --
  2. select
  3. id,
  4. name,
  5. created_at
  6. from users
  7. -- 不好
  8. select
  9. created_at,
  10. name,
  11. id,
  12. from users

显性 inner,在内部连接

最好是显性写出 inner join,而不是省略 inner

  1. --
  2. select
  3. users.email,
  4. sum(charges.amount) as total_revenue
  5. from users
  6. inner join charges on users.id = charges.user_id
  7. -- 不好
  8. select
  9. users.email,
  10. sum(charges.amount) as total_revenue
  11. from users
  12. join charges on users.id = charges.user_id

对于 join 条件,按引用的表顺序排序到 on 之后

通过这样做,可以更容易确定连接是否导致结果呈扇形分布:

  1. --
  2. select
  3. ...
  4. from users
  5. left join charges on users.id = charges.user_id
  6. -- primary_key = foreign_key --> one-to-many --> fanout
  7. select
  8. ...
  9. from charges
  10. left join users on charges.user_id = users.id
  11. -- foreign_key = primary_key --> many-to-one --> no fanout
  12. -- 不好
  13. select
  14. ...
  15. from users
  16. left join charges on charges.user_id = users.id

单个连接条件应与 join 在同一行上

  1. --
  2. select
  3. users.email,
  4. sum(charges.amount) as total_revenue
  5. from users
  6. inner join charges on users.id = charges.user_id
  7. group by email
  8. -- 不好
  9. select
  10. users.email,
  11. sum(charges.amount) as total_revenue
  12. from users
  13. inner join charges
  14. on users.id = charges.user_id
  15. group by email

当有多个连接条件时,请将每个条件放在它们自己的缩进行中:

  1. --
  2. select
  3. users.email,
  4. sum(charges.amount) as total_revenue
  5. from users
  6. inner join charges on
  7. users.id = charges.user_id and
  8. refunded = false
  9. group by email

大多数情况下尽量避免表名的别名

将表名 users 缩写为 u,将 charges 缩写为 c,这可能很诱人,但这最终会降低 SQL 的可读性

  1. --
  2. select
  3. users.email,
  4. sum(charges.amount) as total_revenue
  5. from users
  6. inner join charges on users.id = charges.user_id
  7. -- 不好
  8. select
  9. u.email,
  10. sum(c.amount) as total_revenue
  11. from users u
  12. inner join charges c on u.id = c.user_id

大多数情况下,最好是使用完整的表名。

有两个例外:

  • 如果需要在同一个查询中多次连接到一个表,并且需要区分这几个之间的不同,那么就需要别名。
  • 另外,如果表名很长或有歧义,可以使用别名(但仍然需要使用有意义的名称)。
  1. -- 好:有意义的表别名
  2. select
  3. companies.com_name,
  4. beacons.created_at
  5. from stg_mysql_helpscout__helpscout_companies companies
  6. inner join stg_mysql_helpscout__helpscout_beacons_v2 beacons on companies.com_id = beacons.com_id
  7. -- 还行:没有表别名
  8. select
  9. stg_mysql_helpscout__helpscout_companies.com_name,
  10. stg_mysql_helpscout__helpscout_beacons_v2.created_at
  11. from stg_mysql_helpscout__helpscout_companies
  12. inner join stg_mysql_helpscout__helpscout_beacons_v2 on stg_mysql_helpscout__helpscout_companies.com_id = stg_mysql_helpscout__helpscout_beacons_v2.com_id
  13. -- 不好:不清晰的表别名
  14. select
  15. c.com_name,
  16. b.created_at
  17. from stg_mysql_helpscout__helpscout_companies c
  18. inner join stg_mysql_helpscout__helpscout_beacons_v2 b on c.com_id = b.com_id

当存在 join 时,显性写出表名,否则省略表名

当没有涉及到 join 时,就不会对列来自哪个表产生歧义,因此可以省略表名:

  1. --
  2. select
  3. id,
  4. name
  5. from companies
  6. -- 不好
  7. select
  8. companies.id,
  9. companies.name
  10. from companies

当涉及到 join 时,最好是显式的,这样就可以清楚地知道列来源:

  1. --
  2. select
  3. users.email,
  4. sum(charges.amount) as total_revenue
  5. from users
  6. inner join charges on users.id = charges.user_id
  7. -- 不好
  8. select
  9. email,
  10. sum(amount) as total_revenue
  11. from users
  12. inner join charges on users.id = charges.user_id

总是重命名聚合和函数包装的参数

  1. --
  2. select count(*) as total_users
  3. from users
  4. -- 不好
  5. select count(*)
  6. from users
  7. --
  8. select timestamp_millis(property_beacon_interest) as expressed_interest_at
  9. from hubspot.contact
  10. where property_beacon_interest is not null
  11. -- 不好
  12. select timestamp_millis(property_beacon_interest)
  13. from hubspot.contact
  14. where property_beacon_interest is not null

明确布尔条件

  1. --
  2. select * from customers where is_cancelled = true
  3. select * from customers where is_cancelled = false
  4. -- 不好
  5. select * from customers where is_cancelled
  6. select * from customers where not is_cancelled

使用 as 作为列名别名

  1. --
  2. select
  3. id,
  4. email,
  5. timestamp_trunc(created_at, month) as signup_month
  6. from users
  7. -- 不好
  8. select
  9. id,
  10. email,
  11. timestamp_trunc(created_at, month) signup_month
  12. from users

使用列名或列号进行分组,但不要同时使用两种

我更喜欢按列名分组,但按数字分组也是极好的

  1. --
  2. select user_id, count(*) as total_charges
  3. from charges
  4. group by user_id
  5. --
  6. select user_id, count(*) as total_charges
  7. from charges
  8. group by 1
  9. -- 不好
  10. select
  11. timestamp_trunc(created_at, month) as signup_month,
  12. vertical,
  13. count(*) as users_count
  14. from users
  15. group by 1, vertical

按名称分组时,使用别名

  1. --
  2. select
  3. timestamp_trunc(com_created_at, year) as signup_year,
  4. count(*) as total_companies
  5. from companies
  6. group by signup_year
  7. -- 不好
  8. select
  9. timestamp_trunc(com_created_at, year) as signup_year,
  10. count(*) as total_companies
  11. from companies
  12. group by timestamp_trunc(com_created_at, year)

首先应该对列分组

  1. --
  2. select
  3. timestamp_trunc(com_created_at, year) as signup_year,
  4. count(*) as total_companies
  5. from companies
  6. group by signup_year
  7. -- 不好
  8. select
  9. count(*) as total_companies,
  10. timestamp_trunc(com_created_at, year) as signup_year
  11. from mysql_helpscout.helpscout_companies
  12. group by signup_year

调整 case/when 语句

每个 when 都应该独自一行( case 不包含别的信息),并且应该缩进比 case 深一层,then 可以和 when 保持在同一行,也可以换行。

  1. --
  2. select
  3. case
  4. when event_name = 'viewed_homepage' then 'Homepage'
  5. when event_name = 'viewed_editor' then 'Editor'
  6. else 'Other'
  7. end as page_name
  8. from events
  9. -- 也不错
  10. select
  11. case
  12. when event_name = 'viewed_homepage'
  13. then 'Homepage'
  14. when event_name = 'viewed_editor'
  15. then 'Editor'
  16. else 'Other'
  17. end as page_name
  18. from events
  19. -- 不好
  20. select
  21. case when event_name = 'viewed_homepage' then 'Homepage'
  22. when event_name = 'viewed_editor' then 'Editor'
  23. else 'Other'
  24. end as page_name
  25. from events

使用 CTE (公用表表达式),而不是子查询

避免使用子查询,CTE 将使查询更容易阅读和理解。

使用 CTE 时,用新行填充查询。

在使用任意的 CTE 时,始终使用 finalselect * from final 。通过这种方式,可以快速检查查询中使用的其他 CTE 输出,以便调试结果。

结尾的 CTE 括号应该使用与 with 和 CTE 名称相同的缩进。

  1. --
  2. with ordered_details as (
  3. select
  4. user_id,
  5. name,
  6. row_number() over (partition by user_id order by date_updated desc) as details_rank
  7. from billingdaddy.billing_stored_details
  8. ),
  9. final as (
  10. select user_id, name
  11. from ordered_details
  12. where details_rank = 1
  13. )
  14. select * from final
  15. -- 不好
  16. select user_id, name
  17. from (
  18. select
  19. user_id,
  20. name,
  21. row_number() over (partition by user_id order by date_updated desc) as details_rank
  22. from billingdaddy.billing_stored_details
  23. ) ranked
  24. where details_rank = 1

使用有意义的 CTE 名称

  1. --
  2. with ordered_details as (
  3. -- 不好
  4. with d1 as (

窗口函数

你可以把它单独放在一行上,或者根据它的长度把它分成多行:

  1. --
  2. select
  3. user_id,
  4. name,
  5. row_number() over (partition by user_id order by date_updated desc) as details_rank
  6. from billingdaddy.billing_stored_details
  7. --
  8. select
  9. user_id,
  10. name,
  11. row_number() over (
  12. partition by user_id
  13. order by date_updated desc
  14. ) as details_rank
  15. from billingdaddy.billing_stored_details

Credits

这个风格指南的灵感部分来自于:

向 Peter Butler、Dan Wyman、Simon Ouderkirk、Alex Cano、Adam Stone、Brian Kim 和 Claire Carroll 致敬,感谢他们对本指南提供的反馈。