基数与总行数的比值再乘以 100% 就是某个列的选择性。
在进行 SQL 优化的时候,单独看列的基数是没有意义的,基数必须对比总行数才有实际意义,正是因为这个原因,我们才引出了选择性这个概念。
下面我们查看 test 表各个列的基数与选择性,为了查看选择性,必须先收集统计信息。关于统计信息,我们在第 2 章会详细介绍。下面的脚本用于收集 test 表的统计信息。
BEGIN
DBMS_STATS.GATHER_TABLE_STATS(ownname => 'SCOTT',
tabname => 'TEST',
estimate_percent => 100,
method_opt => 'for all columns size 1',
no_invalidate => FALSE,
degree => 1,
cascade => TRUE);
END;
PL/SQL procedure successfully completed.
下面的脚本用于查看 test 表中每个列的基数与选择性。
select a.column_name,
b.num_rows,
a.num_distinct Cardinality,
round(a.num_distinct / b.num_rows * 100, 2) selectivity,
a.histogram,
a.num_buckets
from dba_tab_col_statistics a, dba_tables b
where a.owner = b.owner
and a.table_name = b.table_name
and a.owner = 'SCOTT'
and a.table_name = 'TEST';
COLUMN_NAME NUM_ROWS CARDINALITY SELECTIVITY HISTOGRAM NUM_BUCKETS
--------------- ---------- ----------- ----------- --------- -----------
OWNER 72462 29 .04 NONE 1
OBJECT_NAME 72462 44236 61.05 NONE 1
SUBOBJECT_NAME 72462 106 .15 NONE 1
OBJECT_ID 72462 72462 100 NONE 1
DATA_OBJECT_ID 72462 7608 10.5 NONE 1
OBJECT_TYPE 72462 44 .06 NONE 1
CREATED 72462 1366 1.89 NONE 1
LAST_DDL_TIME 72462 1412 1.95 NONE 1
TIMESTAMP 72462 1480 2.04 NONE 1
STATUS 72462 1 0 NONE 1
TEMPORARY 72462 2 0 NONE 1
GENERATED 72462 2 0 NONE 1
SECONDARY 72462 2 0 NONE 1
NAMESPACE 72462 21 .03 NONE 1
EDITION_NAME 72462 0 0 NONE 0
15 rows selected.
请思考:什么样的列必须建立索引呢?
有人说基数高的列,有人说在 where 条件中的列。这些答案并不完美。基数高究竟是多高?没有和总行数对比,始终不知道有多高。比如某个列的基数有几万行,但是总行数有几十亿行,那么这个列的基数还高吗?这就是要引出选择性的根本原因。
当一个列选择性大于 20%,说明该列的数据分布就比较均衡了。测试表 test 中 object_name、object_id 的选择性均大于 20%,其中 object_name 列的选择性为 61.05%。现在我们查看该列数据分布(为了方便展示,只输出前 10 行数据的分布情况)。
select *
from (select object_name, count(*)
from test
group by object_name
order by 2 desc)
where rownum <= 10;
OBJECT_NAME COUNT(*)
------------------ ----------
COSTS 30
SALES 30
SALES_CHANNEL_BIX 29
COSTS_TIME_BIX 29
COSTS_PROD_BIX 29
SALES_TIME_BIX 29
SALES_PROMO_BIX 29
SALES_PROD_BIX 29
SALES_CUST_BIX 29
DBMS_REPCAT_AUTH 5
10 rows selected.
由上面的查询结果我们可知,object_name 列的数据分布非常均衡。我们查询以下 SQL。
select * from test where object_name=:B1;
不管 object_name 传入任何值,最多返回 30 行数据。
什么样的列必须要创建索引呢?当一个列出现在 where 条件中,该列没有创建索引并且选择性大于 20%,那么该列就必须创建索引,从而提升 SQL 查询性能。当然了,如果表只有几百条数据,那我们就不用创建索引了。
下面抛出SQL 优化核心思想第一个观点:只有大表才会产生性能问题。
也许有人会说:「我有个表很小,只有几百条,但是该表经常进行 DML,会产生热点块,也会出性能问题。」对此我们并不想过多地讨论此问题,这属于应用程序设计问题,不属于 SQL 优化的范畴。
下面我们将通过实验为大家分享本书第一个全自动优化脚本。
抓出必须创建索引的列
(请读者对该脚本适当修改,以便用于生产环境)。
首先,该列必须出现在 where 条件中,怎么抓出表的哪个列出现在 where 条件中呢?有两种方法,一种是可以通过 V$SQL_PLAN 抓取,另一种是通过下面的脚本抓取。
先执行下面的存储过程,刷新数据库监控信息。
begin
dbms_stats.flush_database_monitoring_info;
end;
运行完上面的命令之后,再运行下面的查询语句就可以查询出哪个表的哪个列出现在 where 条件中。
select r.name owner,
o.name table_name,
c.name column_name,
equality_preds, ---等值过滤
equijoin_preds, ---等值 JOIN 比如 where a.id=b.id
nonequijoin_preds, ----不等 JOIN
range_preds, ----范围过滤次数 > >= < <= between and
like_preds, ----LIKE 过滤
null_preds, ----NULL 过滤
timestamp
from sys.col_usage$ u, sys.obj$ o, sys.col$ c, sys.user$ r
where o.obj# = u.obj#
and c.obj# = u.obj#
and c.col# = u.intcol#
and r.name = 『SCOTT』
and o.name = 『TEST』;
下面是实验步骤。
我们首先运行一个查询语句,让 owner 与 object_id 列出现在 where 条件中。
select object_id, owner, object_type
from test
where owner = 'SYS'
and object_id < 100
and rownum <= 10;
OBJECT_ID OWNER OBJECT_TYPE
---------- -------------------- -----------
20 SYS TABLE
46 SYS INDEX
28 SYS TABLE
15 SYS TABLE
29 SYS CLUSTER
3 SYS INDEX
25 SYS TABLE
41 SYS INDEX
54 SYS INDEX
40 SYS INDEX
10 rows selected.
其次刷新数据库监控信息。
begin
dbms_stats.flush_database_monitoring_info;
end;
PL/SQL procedure successfully completed.
然后我们查看 test 表有哪些列出现在 where 条件中。
select r.name owner, o.name table_name, c.name column_name
from sys.col_usage$ u, sys.obj$ o, sys.col$ c, sys.user$ r
where o.obj# = u.obj#
and c.obj# = u.obj#
and c.col# = u.intcol#
and r.name = 'SCOTT'
and o.name = 'TEST';
OWNER TABLE_NAME COLUMN_NAME
---------- ---------- ------------------------------
SCOTT TEST OWNER
SCOTT TEST OBJECT_ID
接下来我们查询出选择性大于等于 20% 的列。
select a.owner,
a.table_name,
a.column_name,
round(a.num_distinct / b.num_rows * 100, 2) selectivity
from dba_tab_col_statistics a, dba_tables b
where a.owner = b.owner
and a.table_name = b.table_name
and a.owner = 'SCOTT'
and a.table_name = 'TEST'
and a.num_distinct / b.num_rows >= 0.2;
OWNER TABLE_NAME COLUMN_NAME SELECTIVITY
---------- ---------- ------------- -----------
SCOTT TEST OBJECT_NAME 61.05
SCOTT TEST OBJECT_ID 100
最后,确保这些列没有创建索引。
select table_owner, table_name, column_name, index_name
from dba_ind_columns
where table_owner = 'SCOTT'
and table_name = 'TEST';
未选定行
把上面的脚本组合起来,我们就可以得到全自动的优化脚本了。
select owner,
column_name,
num_rows,
Cardinality,
selectivity,
'Need index' as notice
from (select b.owner,
a.column_name,
b.num_rows,
a.num_distinct Cardinality,
round(a.num_distinct / b.num_rows * 100, 2) selectivity
from dba_tab_col_statistics a, dba_tables b
where a.owner = b.owner
and a.table_name = b.table_name
and a.owner = 'SCOTT'
and a.table_name = 'TEST')
where selectivity >= 20
and column_name not in (select column_name
from dba_ind_columns
where table_owner = 'SCOTT'
and table_name = 'TEST')
and column_name in
(select c.name
from sys.col_usage$ u, sys.obj$ o, sys.col$ c, sys.user$ r
where o.obj# = u.obj#
and c.obj# = u.obj#
and c.col# = u.intcol#
and r.name = 'SCOTT'
and o.name = 'TEST');
OWNER COLUMN_NAME NUM_ROWS CARDINALITY SELECTIVITY NOTICE
---------- ------------- ---------- ----------- ----------- ----------
SCOTT OBJECT_ID 72462 72462 100 Need index
本节SQL
BEGIN
DBMS_STATS.GATHER_TABLE_STATS(ownname => 'SCOTT',
tabname => 'TEST',
estimate_percent => 100,
method_opt => 'for all columns size 1',
no_invalidate => FALSE,
degree => 1,
cascade => TRUE);
END;
SELECT a.column_name,b.num_rows,a.num_distinct CARDINALITY, ROUND(a.num_distinct/b.num_rows*100,2) selectivity,
a.histogram,a.num_buckets FROM dba_tab_col_statistics a, dba_tables b WHERE a.owner=b.owner AND a.TABLE_NAME = b.TABLE_NAME
AND a.OWNER = 'SCOTT' AND a.TABLE_NAME = 'TEST';
select r.name owner,
o.name table_name,
c.name column_name,
equality_preds, ---等值过滤
equijoin_preds, ---等值 JOIN 比如 where a.id=b.id
nonequijoin_preds, ----不等 JOIN
range_preds, ----范围过滤次数 > >= < <= between and
like_preds, ----LIKE 过滤
null_preds, ----NULL 过滤
timestamp
from sys.col_usage$ u, sys.obj$ o, sys.col$ c, sys.user$ r
where o.obj# = u.obj#
and c.obj# = u.obj#
and c.col# = u.intcol#
and r.name = 'SCOTT'
and o.name = 'TEST';
select object_id, owner, object_type
from test
where owner = 'SYS'
and object_id < 100
and rownum <= 10;
begin
dbms_stats.flush_database_monitoring_info;
end;
select r.name owner, o.name table_name, c.name column_name
from sys.col_usage$ u, sys.obj$ o, sys.col$ c, sys.user$ r
where o.obj# = u.obj#
and c.obj# = u.obj#
and c.col# = u.intcol#
and r.name = 'SCOTT'
and o.name = 'TEST';
select a.owner,
a.table_name,
a.column_name,
round(a.num_distinct / b.num_rows * 100, 2) selectivity
from dba_tab_col_statistics a, dba_tables b
where a.owner = b.owner
and a.table_name = b.table_name
and a.owner = 'SCOTT'
and a.table_name = 'TEST'
and a.num_distinct / b.num_rows >= 0.2;
select table_owner, table_name, column_name, index_name
from dba_ind_columns
where table_owner = 'SCOTT'
and table_name = 'TEST';
select owner,
column_name,
num_rows,
Cardinality,
selectivity,
'Need index' as notice
from (select b.owner,
a.column_name,
b.num_rows,
a.num_distinct Cardinality,
round(a.num_distinct / b.num_rows * 100, 2) selectivity
from dba_tab_col_statistics a, dba_tables b
where a.owner = b.owner
and a.table_name = b.table_name
and a.owner = 'SCOTT'
and a.table_name = 'TEST')
where selectivity >= 20
and column_name not in (select column_name
from dba_ind_columns
where table_owner = 'SCOTT'
and table_name = 'TEST')
and column_name in
(select c.name
from sys.col_usage$ u, sys.obj$ o, sys.col$ c, sys.user$ r
where o.obj# = u.obj#
and c.obj# = u.obj#
and c.col# = u.intcol#
and r.name = 'SCOTT'
and o.name = 'TEST');