基数与总行数的比值再乘以 100% 就是某个列的选择性。

在进行 SQL 优化的时候,单独看列的基数是没有意义的,基数必须对比总行数才有实际意义,正是因为这个原因,我们才引出了选择性这个概念。

下面我们查看 test 表各个列的基数与选择性,为了查看选择性,必须先收集统计信息。关于统计信息,我们在第 2 章会详细介绍。下面的脚本用于收集 test 表的统计信息。

  1. BEGIN
  2. DBMS_STATS.GATHER_TABLE_STATS(ownname => 'SCOTT',
  3. tabname => 'TEST',
  4. estimate_percent => 100,
  5. method_opt => 'for all columns size 1',
  6. no_invalidate => FALSE,
  7. degree => 1,
  8. cascade => TRUE);
  9. END;
  1. PL/SQL procedure successfully completed.

下面的脚本用于查看 test 表中每个列的基数与选择性。

  1. select a.column_name,
  2. b.num_rows,
  3. a.num_distinct Cardinality,
  4. round(a.num_distinct / b.num_rows * 100, 2) selectivity,
  5. a.histogram,
  6. a.num_buckets
  7. from dba_tab_col_statistics a, dba_tables b
  8. where a.owner = b.owner
  9. and a.table_name = b.table_name
  10. and a.owner = 'SCOTT'
  11. and a.table_name = 'TEST';
  1. COLUMN_NAME NUM_ROWS CARDINALITY SELECTIVITY HISTOGRAM NUM_BUCKETS
  2. --------------- ---------- ----------- ----------- --------- -----------
  3. OWNER 72462 29 .04 NONE 1
  4. OBJECT_NAME 72462 44236 61.05 NONE 1
  5. SUBOBJECT_NAME 72462 106 .15 NONE 1
  6. OBJECT_ID 72462 72462 100 NONE 1
  7. DATA_OBJECT_ID 72462 7608 10.5 NONE 1
  8. OBJECT_TYPE 72462 44 .06 NONE 1
  9. CREATED 72462 1366 1.89 NONE 1
  10. LAST_DDL_TIME 72462 1412 1.95 NONE 1
  11. TIMESTAMP 72462 1480 2.04 NONE 1
  12. STATUS 72462 1 0 NONE 1
  13. TEMPORARY 72462 2 0 NONE 1
  14. GENERATED 72462 2 0 NONE 1
  15. SECONDARY 72462 2 0 NONE 1
  16. NAMESPACE 72462 21 .03 NONE 1
  17. EDITION_NAME 72462 0 0 NONE 0
  18. 15 rows selected.

请思考:什么样的列必须建立索引呢?

有人说基数高的列,有人说在 where 条件中的列。这些答案并不完美。基数高究竟是多高?没有和总行数对比,始终不知道有多高。比如某个列的基数有几万行,但是总行数有几十亿行,那么这个列的基数还高吗?这就是要引出选择性的根本原因。

当一个列选择性大于 20%,说明该列的数据分布就比较均衡了。测试表 test 中 object_name、object_id 的选择性均大于 20%,其中 object_name 列的选择性为 61.05%。现在我们查看该列数据分布(为了方便展示,只输出前 10 行数据的分布情况)。

  1. select *
  2. from (select object_name, count(*)
  3. from test
  4. group by object_name
  5. order by 2 desc)
  6. where rownum <= 10;
  1. OBJECT_NAME COUNT(*)
  2. ------------------ ----------
  3. COSTS 30
  4. SALES 30
  5. SALES_CHANNEL_BIX 29
  6. COSTS_TIME_BIX 29
  7. COSTS_PROD_BIX 29
  8. SALES_TIME_BIX 29
  9. SALES_PROMO_BIX 29
  10. SALES_PROD_BIX 29
  11. SALES_CUST_BIX 29
  12. DBMS_REPCAT_AUTH 5
  13. 10 rows selected.

由上面的查询结果我们可知,object_name 列的数据分布非常均衡。我们查询以下 SQL。

  1. select * from test where object_name=:B1;

不管 object_name 传入任何值,最多返回 30 行数据。

什么样的列必须要创建索引呢?当一个列出现在 where 条件中,该列没有创建索引并且选择性大于 20%,那么该列就必须创建索引,从而提升 SQL 查询性能。当然了,如果表只有几百条数据,那我们就不用创建索引了。

下面抛出SQL 优化核心思想第一个观点:只有大表才会产生性能问题。

也许有人会说:「我有个表很小,只有几百条,但是该表经常进行 DML,会产生热点块,也会出性能问题。」对此我们并不想过多地讨论此问题,这属于应用程序设计问题,不属于 SQL 优化的范畴。

下面我们将通过实验为大家分享本书第一个全自动优化脚本。

抓出必须创建索引的列

(请读者对该脚本适当修改,以便用于生产环境)。

首先,该列必须出现在 where 条件中,怎么抓出表的哪个列出现在 where 条件中呢?有两种方法,一种是可以通过 V$SQL_PLAN 抓取,另一种是通过下面的脚本抓取。

先执行下面的存储过程,刷新数据库监控信息。

  1. begin
  2. dbms_stats.flush_database_monitoring_info;
  3. end;

运行完上面的命令之后,再运行下面的查询语句就可以查询出哪个表的哪个列出现在 where 条件中。

  1. select r.name owner
  2. o.name table_name
  3. c.name column_name
  4. equality_preds ---等值过滤
  5. equijoin_preds ---等值 JOIN 比如 where a.id=b.id
  6. nonequijoin_preds ----不等 JOIN
  7. range_preds ----范围过滤次数 > >= < <= between and
  8. like_preds ----LIKE 过滤
  9. null_preds ----NULL 过滤
  10. timestamp
  11. from sys.col_usage$ u sys.obj$ o sys.col$ c sys.user$ r
  12. where o.obj# = u.obj#
  13. and c.obj# = u.obj#
  14. and c.col# = u.intcol#
  15. and r.name = SCOTT
  16. and o.name = TEST』;

下面是实验步骤。

我们首先运行一个查询语句,让 owner 与 object_id 列出现在 where 条件中。

  1. select object_id, owner, object_type
  2. from test
  3. where owner = 'SYS'
  4. and object_id < 100
  5. and rownum <= 10;
  1. OBJECT_ID OWNER OBJECT_TYPE
  2. ---------- -------------------- -----------
  3. 20 SYS TABLE
  4. 46 SYS INDEX
  5. 28 SYS TABLE
  6. 15 SYS TABLE
  7. 29 SYS CLUSTER
  8. 3 SYS INDEX
  9. 25 SYS TABLE
  10. 41 SYS INDEX
  11. 54 SYS INDEX
  12. 40 SYS INDEX
  13. 10 rows selected.

其次刷新数据库监控信息。

  1. begin
  2. dbms_stats.flush_database_monitoring_info;
  3. end;
  1. PL/SQL procedure successfully completed.

然后我们查看 test 表有哪些列出现在 where 条件中。

  1. select r.name owner, o.name table_name, c.name column_name
  2. from sys.col_usage$ u, sys.obj$ o, sys.col$ c, sys.user$ r
  3. where o.obj# = u.obj#
  4. and c.obj# = u.obj#
  5. and c.col# = u.intcol#
  6. and r.name = 'SCOTT'
  7. and o.name = 'TEST';
  1. OWNER TABLE_NAME COLUMN_NAME
  2. ---------- ---------- ------------------------------
  3. SCOTT TEST OWNER
  4. SCOTT TEST OBJECT_ID

接下来我们查询出选择性大于等于 20% 的列。

  1. select a.owner,
  2. a.table_name,
  3. a.column_name,
  4. round(a.num_distinct / b.num_rows * 100, 2) selectivity
  5. from dba_tab_col_statistics a, dba_tables b
  6. where a.owner = b.owner
  7. and a.table_name = b.table_name
  8. and a.owner = 'SCOTT'
  9. and a.table_name = 'TEST'
  10. and a.num_distinct / b.num_rows >= 0.2;
  1. OWNER TABLE_NAME COLUMN_NAME SELECTIVITY
  2. ---------- ---------- ------------- -----------
  3. SCOTT TEST OBJECT_NAME 61.05
  4. SCOTT TEST OBJECT_ID 100

最后,确保这些列没有创建索引。

  1. select table_owner table_name column_name index_name
  2. from dba_ind_columns
  3. where table_owner = 'SCOTT'
  4. and table_name = 'TEST';
  1. 未选定行

把上面的脚本组合起来,我们就可以得到全自动的优化脚本了。

  1. select owner,
  2. column_name,
  3. num_rows,
  4. Cardinality,
  5. selectivity,
  6. 'Need index' as notice
  7. from (select b.owner,
  8. a.column_name,
  9. b.num_rows,
  10. a.num_distinct Cardinality,
  11. round(a.num_distinct / b.num_rows * 100, 2) selectivity
  12. from dba_tab_col_statistics a, dba_tables b
  13. where a.owner = b.owner
  14. and a.table_name = b.table_name
  15. and a.owner = 'SCOTT'
  16. and a.table_name = 'TEST')
  17. where selectivity >= 20
  18. and column_name not in (select column_name
  19. from dba_ind_columns
  20. where table_owner = 'SCOTT'
  21. and table_name = 'TEST')
  22. and column_name in
  23. (select c.name
  24. from sys.col_usage$ u, sys.obj$ o, sys.col$ c, sys.user$ r
  25. where o.obj# = u.obj#
  26. and c.obj# = u.obj#
  27. and c.col# = u.intcol#
  28. and r.name = 'SCOTT'
  29. and o.name = 'TEST');
  1. OWNER COLUMN_NAME NUM_ROWS CARDINALITY SELECTIVITY NOTICE
  2. ---------- ------------- ---------- ----------- ----------- ----------
  3. SCOTT OBJECT_ID 72462 72462 100 Need index

本节SQL

  1. BEGIN
  2. DBMS_STATS.GATHER_TABLE_STATS(ownname => 'SCOTT',
  3. tabname => 'TEST',
  4. estimate_percent => 100,
  5. method_opt => 'for all columns size 1',
  6. no_invalidate => FALSE,
  7. degree => 1,
  8. cascade => TRUE);
  9. END;
  10. SELECT a.column_name,b.num_rows,a.num_distinct CARDINALITY, ROUND(a.num_distinct/b.num_rows*100,2) selectivity,
  11. a.histogram,a.num_buckets FROM dba_tab_col_statistics a, dba_tables b WHERE a.owner=b.owner AND a.TABLE_NAME = b.TABLE_NAME
  12. AND a.OWNER = 'SCOTT' AND a.TABLE_NAME = 'TEST';
  13. select r.name owner
  14. o.name table_name
  15. c.name column_name
  16. equality_preds ---等值过滤
  17. equijoin_preds ---等值 JOIN 比如 where a.id=b.id
  18. nonequijoin_preds ----不等 JOIN
  19. range_preds ----范围过滤次数 > >= < <= between and
  20. like_preds ----LIKE 过滤
  21. null_preds ----NULL 过滤
  22. timestamp
  23. from sys.col_usage$ u sys.obj$ o sys.col$ c sys.user$ r
  24. where o.obj# = u.obj#
  25. and c.obj# = u.obj#
  26. and c.col# = u.intcol#
  27. and r.name = 'SCOTT'
  28. and o.name = 'TEST';
  29. select object_id, owner, object_type
  30. from test
  31. where owner = 'SYS'
  32. and object_id < 100
  33. and rownum <= 10;
  34. begin
  35. dbms_stats.flush_database_monitoring_info;
  36. end;
  37. select r.name owner, o.name table_name, c.name column_name
  38. from sys.col_usage$ u, sys.obj$ o, sys.col$ c, sys.user$ r
  39. where o.obj# = u.obj#
  40. and c.obj# = u.obj#
  41. and c.col# = u.intcol#
  42. and r.name = 'SCOTT'
  43. and o.name = 'TEST';
  44. select a.owner,
  45. a.table_name,
  46. a.column_name,
  47. round(a.num_distinct / b.num_rows * 100, 2) selectivity
  48. from dba_tab_col_statistics a, dba_tables b
  49. where a.owner = b.owner
  50. and a.table_name = b.table_name
  51. and a.owner = 'SCOTT'
  52. and a.table_name = 'TEST'
  53. and a.num_distinct / b.num_rows >= 0.2;
  54. select table_owner table_name column_name index_name
  55. from dba_ind_columns
  56. where table_owner = 'SCOTT'
  57. and table_name = 'TEST';
  58. select owner,
  59. column_name,
  60. num_rows,
  61. Cardinality,
  62. selectivity,
  63. 'Need index' as notice
  64. from (select b.owner,
  65. a.column_name,
  66. b.num_rows,
  67. a.num_distinct Cardinality,
  68. round(a.num_distinct / b.num_rows * 100, 2) selectivity
  69. from dba_tab_col_statistics a, dba_tables b
  70. where a.owner = b.owner
  71. and a.table_name = b.table_name
  72. and a.owner = 'SCOTT'
  73. and a.table_name = 'TEST')
  74. where selectivity >= 20
  75. and column_name not in (select column_name
  76. from dba_ind_columns
  77. where table_owner = 'SCOTT'
  78. and table_name = 'TEST')
  79. and column_name in
  80. (select c.name
  81. from sys.col_usage$ u, sys.obj$ o, sys.col$ c, sys.user$ r
  82. where o.obj# = u.obj#
  83. and c.obj# = u.obj#
  84. and c.col# = u.intcol#
  85. and r.name = 'SCOTT'
  86. and o.name = 'TEST');