两表关联只返回主表的数据,而且只返回主表与子表没关联上的数据,这种连接就叫反连接。反连接一般就是指的 not in 和 not exists

反连接等价改写

not in 与 not exists 一般情况下也可以进行等价改写。

not in 的写法如下。

  1. select * from dept where deptno not in (select deptno from emp);
  1. DEPTNO DNAME LOC
  2. ---------- --------------- ---------------------------------------
  3. 40 OPERATIONS BOSTON

not exists 的写法如下。

  1. select *
  2. from dept
  3. where not exists (select null from emp where dept.deptno = emp.deptno);
  1. DEPTNO DNAME LOC
  2. ---------- --------------- ---------------------------------------
  3. 40 OPERATIONS BOSTON

需要注意的是,not in 里面如果有 null,整个查询会返回空,而 in 里面有 null,查询不受 null 影响,例子如下。

  1. select * from dept where deptno not in (10,null);
  1. no rows selected
  1. select * from dept where deptno in (10,null);
  1. DEPTNO DNAME LOC
  2. ---------- --------------- -----------------------
  3. 10 ACCOUNTING NEW YORK

所以在将 not exists 等价改写为 not in 的时候,要注意 null。一般情况下,如果反连接采用 not in 写法,我们需要在 where 条件中剔除 null。

  1. select *
  2. from dept
  3. where deptno not in (select deptno from emp where deptno is not null);

not in 与 not exists 除了可以相互等价改写以外,还可以等价地改写为外连接,例如,上面查询可以等价改写为如下写法。

  1. select d.*
  2. from dept d
  3. left join emp e on d.deptno = e.deptno
  4. where e.deptno is null;
  1. DEPTNO DNAME LOC
  2. ---------- --------------- ------------------
  3. 40 OPERATIONS BOSTON

为什么反连接可以改写为「外连接 + 子表连接条件 is null」?我们再来回顾一下反连接定义:两表关联只返回主表的数据,而且只返回主表与子表没有关联上的数据。根据反连接定义,翻译为标准 SQL 写法就是「外连接 + 子表连接条件 is null」。与半连接改写为内连接不同的是,反连接改写为外连接不需要考虑两表之间的关系。

控制反连接执行计划

我们先来查看示例 SQL 的原始执行计划。

  1. select * from dept where deptno not in (select deptno from emp);
  1. Execution Plan
  2. ----------------------------------------------------------
  3. Plan hash value: 2230682264
  4. -------------------------------------------------------------------------------------
  5. | Id | Operation | Name | Rows | Bytes | Cost(%CPU)|Time |
  6. -------------------------------------------------------------------------------------
  7. | 0 | SELECT STATEMENT | | 1 | 23 | 6 (17)|00:00:01|
  8. | 1 | MERGE JOIN ANTI NA | | 1 | 23 | 6 (17)|00:00:01|
  9. | 2 | SORT JOIN | | 4 | 80 | 2 (0)|00:00:01|
  10. | 3 | TABLE ACCESS BY INDEX ROWID| DEPT | 4 | 80 | 2 (0)|00:00:01|
  11. | 4 | INDEX FULL SCAN | PK_DEPT | 4 | | 1 (0)|00:00:01|
  12. |* 5 | SORT UNIQUE | | 14 | 42 | 4 (25)|00:00:01|
  13. | 6 | TABLE ACCESS FULL | EMP | 14 | 42 | 3 (0)|00:00:01|
  14. -------------------------------------------------------------------------------------
  15. Predicate Information (identified by operation id):
  16. ---------------------------------------------------
  17. 5 - access("DEPTNO"="DEPTNO")
  18. filter("DEPTNO"="DEPTNO")

原始执行计划中 DEPT 与 EMP 是采用排序合并连接进行关联的。

我们现在让 DEPT 与 EMP 使用嵌套循环进行关联,不指定驱动表。

  1. select /*+ use_nl(dept,emp@a) */ *
  2. from dept
  3. where deptno not in (select /*+ qb_name(a) */
  4. deptno
  5. from emp);
  1. Execution Plan
  2. ----------------------------------------------------------
  3. Plan hash value: 1831344308
  4. -------------------------------------------------------------------------------
  5. | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
  6. -------------------------------------------------------------------------------
  7. | 0 | SELECT STATEMENT | | 1 | 23 | 11 (0)| 00:00:01 |
  8. |* 1 | FILTER | | | | | |
  9. | 2 | NESTED LOOPS ANTI SNA| | 1 | 23 | 11 (28)| 00:00:01 |
  10. | 3 | TABLE ACCESS FULL | DEPT | 4 | 80 | 3 (0)| 00:00:01 |
  11. |* 4 | TABLE ACCESS FULL | EMP | 9 | 27 | 1 (0)| 00:00:01 |
  12. |* 5 | TABLE ACCESS FULL | EMP | 1 | 3 | 3 (0)| 00:00:01 |
  13. -------------------------------------------------------------------------------
  14. Predicate Information (identified by operation id):
  15. ---------------------------------------------------
  16. 1 - filter( NOT EXISTS (SELECT /*+ QB_NAME ("A") */ 0 FROM "EMP"
  17. "EMP" WHERE "DEPTNO" IS NULL))
  18. 4 - filter("DEPTNO"="DEPTNO")
  19. 5 - filter("DEPTNO" IS NULL)

执行计划居然变成了 FILTER,我们指定的 HINT 被 CBO 忽略了。这究竟是什么原因呢?注意观察 FILTER 对应的谓词部分我们就能发现原因。因为子表 EMP 的连接列 DEPTNO 没有排除存在 null 的情况,所以 CBO 选择了 FILTER。现在我们给子查询加上语句where deptno is not null再看一下执行计划。

  1. select /*+ use_nl(dept,emp@a) */ *
  2. from dept
  3. where deptno not in (select /*+ qb_name(a) */
  4. deptno
  5. from emp where deptno is not null);
  1. Execution Plan
  2. ----------------------------------------------------------
  3. Plan hash value: 1522491139
  4. ---------------------------------------------------------------------------
  5. | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
  6. ---------------------------------------------------------------------------
  7. | 0 | SELECT STATEMENT | | 1 | 23 | 8 (0)| 00:00:01 |
  8. | 1 | NESTED LOOPS ANTI | | 1 | 23 | 8 (0)| 00:00:01 |
  9. | 2 | TABLE ACCESS FULL| DEPT | 4 | 80 | 3 (0)| 00:00:01 |
  10. |* 3 | TABLE ACCESS FULL| EMP | 9 | 27 | 1 (0)| 00:00:01 |
  11. ---------------------------------------------------------------------------
  12. Predicate Information (identified by operation id):
  13. ---------------------------------------------------
  14. 3 - filter("DEPTNO" IS NOT NULL AND "DEPTNO"="DEPTNO")

现在我们将 not in 改写为 not exists,加上 HINT,再查看执行计划。

  1. select /*+ use_nl(dept,emp@a) */ *
  2. from dept
  3. where not exists
  4. (select /*+ qb_name(a) */ null from emp where emp.deptno = dept.deptno);
  1. Execution Plan
  2. ----------------------------------------------------------
  3. Plan hash value: 1522491139
  4. ---------------------------------------------------------------------------
  5. | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
  6. ---------------------------------------------------------------------------
  7. | 0 | SELECT STATEMENT | | 1 | 23 | 8 (0)| 00:00:01 |
  8. | 1 | NESTED LOOPS ANTI | | 1 | 23 | 8 (0)| 00:00:01 |
  9. | 2 | TABLE ACCESS FULL| DEPT | 4 | 80 | 3 (0)| 00:00:01 |
  10. |* 3 | TABLE ACCESS FULL| EMP | 9 | 27 | 1 (0)| 00:00:01 |
  11. ---------------------------------------------------------------------------
  12. Predicate Information (identified by operation id):
  13. ---------------------------------------------------
  14. 3 - filter("EMP"."DEPTNO"="DEPT"."DEPTNO")

在执行计划中,DEPT 是嵌套循环的驱动表,EMP 是嵌套循环的被驱动表。现在我们让 DEPT 与 EMP 还进行嵌套循环连接,但是让 EMP 作为驱动表。

  1. select /*+ use_nl(dept,emp@a) leading(emp@a) */ *
  2. from dept
  3. where not exists
  4. (select /*+ qb_name(a) */ null from emp where emp.deptno = dept.deptno);
  1. Execution Plan
  2. ----------------------------------------------------------
  3. Plan hash value: 1522491139
  4. ---------------------------------------------------------------------------
  5. | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
  6. ---------------------------------------------------------------------------
  7. | 0 | SELECT STATEMENT | | 1 | 23 | 8 (0)| 00:00:01 |
  8. | 1 | NESTED LOOPS ANTI | | 1 | 23 | 8 (0)| 00:00:01 |
  9. | 2 | TABLE ACCESS FULL| DEPT | 4 | 80 | 3 (0)| 00:00:01 |
  10. |* 3 | TABLE ACCESS FULL| EMP | 9 | 27 | 1 (0)| 00:00:01 |
  11. ---------------------------------------------------------------------------
  12. Predicate Information (identified by operation id):
  13. ---------------------------------------------------
  14. 3 - filter("EMP"."DEPTNO"="DEPT"."DEPTNO")

注意观察执行计划,虽然我们使用了 leading(emp@a)强制让 EMP 作为驱动表,但是执行计划中驱动表还是 DEPT。这是为什么呢?因为反连接等价于「外连接 + 子表连接条件 is null」,大家是否还记得:当两表关联是外连接,使用嵌套循环进行关联的时候无法更改驱动表,驱动表会被固定为主表。

现在我们让 DEPT 与 EMP 进行 HASH 连接,而且让 EMP 作为驱动表。

  1. select /*+ use_hash(dept,emp@a) leading(emp@a) */ *
  2. from dept
  3. where not exists
  4. (select /*+ qb_name(a) */ null from emp where emp.deptno = dept.deptno);
  1. Execution Plan
  2. ----------------------------------------------------------
  3. Plan hash value: 474461924
  4. ---------------------------------------------------------------------------
  5. | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
  6. ---------------------------------------------------------------------------
  7. | 0 | SELECT STATEMENT | | 1 | 23 | 7 (15)| 00:00:01 |
  8. |* 1 | HASH JOIN ANTI | | 1 | 23 | 7 (15)| 00:00:01 |
  9. | 2 | TABLE ACCESS FULL| DEPT | 4 | 80 | 3 (0)| 00:00:01 |
  10. | 3 | TABLE ACCESS FULL| EMP | 14 | 42 | 3 (0)| 00:00:01 |
  11. ---------------------------------------------------------------------------
  12. Predicate Information (identified by operation id):
  13. ---------------------------------------------------
  14. 1 - access("EMP"."DEPTNO"="DEPT"."DEPTNO")

虽然 DEPT 与 EMP 采用的是 HASH 连接,但是驱动表还是 DEPT。为什么 leading(emp@a)失效了呢?因为两表关联如果是外连接,要改变 HASH 连接的驱动表必须使用 swap_join_inputs。现在我们使用 swap_join_inputs 来更改 HASH 连接的驱动表。

  1. select /*+ use_hash(dept,emp@a) swap_join_inputs(emp@a) */ *
  2. from dept
  3. where not exists
  4. (select /*+ qb_name(a) */ null from emp where emp.deptno = dept.deptno);
  1. Execution Plan
  2. ----------------------------------------------------------
  3. Plan hash value: 152508289
  4. -----------------------------------------------------------------------------
  5. | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
  6. -----------------------------------------------------------------------------
  7. | 0 | SELECT STATEMENT | | 1 | 23 | 7 (15)| 00:00:01 |
  8. |* 1 | HASH JOIN RIGHT ANTI| | 1 | 23 | 7 (15)| 00:00:01 |
  9. | 2 | TABLE ACCESS FULL | EMP | 14 | 42 | 3 (0)| 00:00:01 |
  10. | 3 | TABLE ACCESS FULL | DEPT | 4 | 80 | 3 (0)| 00:00:01 |
  11. -----------------------------------------------------------------------------
  12. Predicate Information (identified by operation id):
  13. ---------------------------------------------------
  14. 1 - access("EMP"."DEPTNO"="DEPT"."DEPTNO")

读者思考

现有如下 SQL。

  1. select * from a where a.id not in (select id from b where id is not null);

假设 a 有 1 000 万条,b 有 1 000 条,请问如何优化该 SQL?

假设 a 有 1 000 条,b 有 1 000 万条,请问如何优化该 SQL?

假设 a 有 100 万条,b 有 1 000 万条,请问如何优化该 SQL?