两表关联只返回主表的数据,而且只返回主表与子表没关联上的数据,这种连接就叫反连接。反连接一般就是指的 not in 和 not exists。
反连接等价改写
not in 与 not exists 一般情况下也可以进行等价改写。
not in 的写法如下。
select * from dept where deptno not in (select deptno from emp);
DEPTNO DNAME LOC
---------- --------------- ---------------------------------------
40 OPERATIONS BOSTON
not exists 的写法如下。
select *
from dept
where not exists (select null from emp where dept.deptno = emp.deptno);
DEPTNO DNAME LOC
---------- --------------- ---------------------------------------
40 OPERATIONS BOSTON
需要注意的是,not in 里面如果有 null,整个查询会返回空,而 in 里面有 null,查询不受 null 影响,例子如下。
select * from dept where deptno not in (10,null);
no rows selected
select * from dept where deptno in (10,null);
DEPTNO DNAME LOC
---------- --------------- -----------------------
10 ACCOUNTING NEW YORK
所以在将 not exists 等价改写为 not in 的时候,要注意 null。一般情况下,如果反连接采用 not in 写法,我们需要在 where 条件中剔除 null。
select *
from dept
where deptno not in (select deptno from emp where deptno is not null);
not in 与 not exists 除了可以相互等价改写以外,还可以等价地改写为外连接,例如,上面查询可以等价改写为如下写法。
select d.*
from dept d
left join emp e on d.deptno = e.deptno
where e.deptno is null;
DEPTNO DNAME LOC
---------- --------------- ------------------
40 OPERATIONS BOSTON
为什么反连接可以改写为「外连接 + 子表连接条件 is null」?我们再来回顾一下反连接定义:两表关联只返回主表的数据,而且只返回主表与子表没有关联上的数据。根据反连接定义,翻译为标准 SQL 写法就是「外连接 + 子表连接条件 is null」。与半连接改写为内连接不同的是,反连接改写为外连接不需要考虑两表之间的关系。
控制反连接执行计划
我们先来查看示例 SQL 的原始执行计划。
select * from dept where deptno not in (select deptno from emp);
Execution Plan
----------------------------------------------------------
Plan hash value: 2230682264
-------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost(%CPU)|Time |
-------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 23 | 6 (17)|00:00:01|
| 1 | MERGE JOIN ANTI NA | | 1 | 23 | 6 (17)|00:00:01|
| 2 | SORT JOIN | | 4 | 80 | 2 (0)|00:00:01|
| 3 | TABLE ACCESS BY INDEX ROWID| DEPT | 4 | 80 | 2 (0)|00:00:01|
| 4 | INDEX FULL SCAN | PK_DEPT | 4 | | 1 (0)|00:00:01|
|* 5 | SORT UNIQUE | | 14 | 42 | 4 (25)|00:00:01|
| 6 | TABLE ACCESS FULL | EMP | 14 | 42 | 3 (0)|00:00:01|
-------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
5 - access("DEPTNO"="DEPTNO")
filter("DEPTNO"="DEPTNO")
原始执行计划中 DEPT 与 EMP 是采用排序合并连接进行关联的。
我们现在让 DEPT 与 EMP 使用嵌套循环进行关联,不指定驱动表。
select /*+ use_nl(dept,emp@a) */ *
from dept
where deptno not in (select /*+ qb_name(a) */
deptno
from emp);
Execution Plan
----------------------------------------------------------
Plan hash value: 1831344308
-------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 23 | 11 (0)| 00:00:01 |
|* 1 | FILTER | | | | | |
| 2 | NESTED LOOPS ANTI SNA| | 1 | 23 | 11 (28)| 00:00:01 |
| 3 | TABLE ACCESS FULL | DEPT | 4 | 80 | 3 (0)| 00:00:01 |
|* 4 | TABLE ACCESS FULL | EMP | 9 | 27 | 1 (0)| 00:00:01 |
|* 5 | TABLE ACCESS FULL | EMP | 1 | 3 | 3 (0)| 00:00:01 |
-------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter( NOT EXISTS (SELECT /*+ QB_NAME ("A") */ 0 FROM "EMP"
"EMP" WHERE "DEPTNO" IS NULL))
4 - filter("DEPTNO"="DEPTNO")
5 - filter("DEPTNO" IS NULL)
执行计划居然变成了 FILTER,我们指定的 HINT 被 CBO 忽略了。这究竟是什么原因呢?注意观察 FILTER 对应的谓词部分我们就能发现原因。因为子表 EMP 的连接列 DEPTNO 没有排除存在 null 的情况,所以 CBO 选择了 FILTER。现在我们给子查询加上语句where deptno is not null
再看一下执行计划。
select /*+ use_nl(dept,emp@a) */ *
from dept
where deptno not in (select /*+ qb_name(a) */
deptno
from emp where deptno is not null);
Execution Plan
----------------------------------------------------------
Plan hash value: 1522491139
---------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 23 | 8 (0)| 00:00:01 |
| 1 | NESTED LOOPS ANTI | | 1 | 23 | 8 (0)| 00:00:01 |
| 2 | TABLE ACCESS FULL| DEPT | 4 | 80 | 3 (0)| 00:00:01 |
|* 3 | TABLE ACCESS FULL| EMP | 9 | 27 | 1 (0)| 00:00:01 |
---------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
3 - filter("DEPTNO" IS NOT NULL AND "DEPTNO"="DEPTNO")
现在我们将 not in 改写为 not exists,加上 HINT,再查看执行计划。
select /*+ use_nl(dept,emp@a) */ *
from dept
where not exists
(select /*+ qb_name(a) */ null from emp where emp.deptno = dept.deptno);
Execution Plan
----------------------------------------------------------
Plan hash value: 1522491139
---------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 23 | 8 (0)| 00:00:01 |
| 1 | NESTED LOOPS ANTI | | 1 | 23 | 8 (0)| 00:00:01 |
| 2 | TABLE ACCESS FULL| DEPT | 4 | 80 | 3 (0)| 00:00:01 |
|* 3 | TABLE ACCESS FULL| EMP | 9 | 27 | 1 (0)| 00:00:01 |
---------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
3 - filter("EMP"."DEPTNO"="DEPT"."DEPTNO")
在执行计划中,DEPT 是嵌套循环的驱动表,EMP 是嵌套循环的被驱动表。现在我们让 DEPT 与 EMP 还进行嵌套循环连接,但是让 EMP 作为驱动表。
select /*+ use_nl(dept,emp@a) leading(emp@a) */ *
from dept
where not exists
(select /*+ qb_name(a) */ null from emp where emp.deptno = dept.deptno);
Execution Plan
----------------------------------------------------------
Plan hash value: 1522491139
---------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 23 | 8 (0)| 00:00:01 |
| 1 | NESTED LOOPS ANTI | | 1 | 23 | 8 (0)| 00:00:01 |
| 2 | TABLE ACCESS FULL| DEPT | 4 | 80 | 3 (0)| 00:00:01 |
|* 3 | TABLE ACCESS FULL| EMP | 9 | 27 | 1 (0)| 00:00:01 |
---------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
3 - filter("EMP"."DEPTNO"="DEPT"."DEPTNO")
注意观察执行计划,虽然我们使用了 leading(emp@a)强制让 EMP 作为驱动表,但是执行计划中驱动表还是 DEPT。这是为什么呢?因为反连接等价于「外连接 + 子表连接条件 is null」,大家是否还记得:当两表关联是外连接,使用嵌套循环进行关联的时候无法更改驱动表,驱动表会被固定为主表。
现在我们让 DEPT 与 EMP 进行 HASH 连接,而且让 EMP 作为驱动表。
select /*+ use_hash(dept,emp@a) leading(emp@a) */ *
from dept
where not exists
(select /*+ qb_name(a) */ null from emp where emp.deptno = dept.deptno);
Execution Plan
----------------------------------------------------------
Plan hash value: 474461924
---------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 23 | 7 (15)| 00:00:01 |
|* 1 | HASH JOIN ANTI | | 1 | 23 | 7 (15)| 00:00:01 |
| 2 | TABLE ACCESS FULL| DEPT | 4 | 80 | 3 (0)| 00:00:01 |
| 3 | TABLE ACCESS FULL| EMP | 14 | 42 | 3 (0)| 00:00:01 |
---------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - access("EMP"."DEPTNO"="DEPT"."DEPTNO")
虽然 DEPT 与 EMP 采用的是 HASH 连接,但是驱动表还是 DEPT。为什么 leading(emp@a)失效了呢?因为两表关联如果是外连接,要改变 HASH 连接的驱动表必须使用 swap_join_inputs。现在我们使用 swap_join_inputs 来更改 HASH 连接的驱动表。
select /*+ use_hash(dept,emp@a) swap_join_inputs(emp@a) */ *
from dept
where not exists
(select /*+ qb_name(a) */ null from emp where emp.deptno = dept.deptno);
Execution Plan
----------------------------------------------------------
Plan hash value: 152508289
-----------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-----------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 23 | 7 (15)| 00:00:01 |
|* 1 | HASH JOIN RIGHT ANTI| | 1 | 23 | 7 (15)| 00:00:01 |
| 2 | TABLE ACCESS FULL | EMP | 14 | 42 | 3 (0)| 00:00:01 |
| 3 | TABLE ACCESS FULL | DEPT | 4 | 80 | 3 (0)| 00:00:01 |
-----------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - access("EMP"."DEPTNO"="DEPT"."DEPTNO")
读者思考
现有如下 SQL。
select * from a where a.id not in (select id from b where id is not null);
假设 a 有 1 000 万条,b 有 1 000 条,请问如何优化该 SQL?
假设 a 有 1 000 条,b 有 1 000 万条,请问如何优化该 SQL?
假设 a 有 100 万条,b 有 1 000 万条,请问如何优化该 SQL?