not in/ not exists 有哪些可以替代操作
-
King
需求
例如一张A表(id, userid, info);其中id为自增长主键。存在userid重复记录;并一直有新数据增加,现需要根据userid去重并实时插入表B。
用SQL语法来解释:
-- 获取表B中记录最大id
select max(id) from B
-- 根据上面获取的id;提取最新的A表记录;根据userid去重;并插入中间表TMP_B;
INSERT INTO TMP_B
SELECT id, userid, info
FROM (select id,
userid,
info,
ROW_NUMBER() OVER(PARTITION BY userid ORDER BY id) RN
FROM A
WHERE id > ?)
WHERE RN = 1;
-- 把上面获取的记录;不存在则插入表B
insert into B
select id, userid, info
from TMP_B
where userid not in (select userid from B);
-- 或者用not exists
insert into B
select id, userid, info
from TMP_B tb
where not exists (select 1 from B b where b.userid = tb.userid);
知识兔其中需要改写的语句是
select id, userid, info
from TMP_B
where userid not in (select userid from B);
-- 插入实验数据
知识兔-- 在表B插入百万条记录
insert into B select level, dbms_random.string('x',8), 'lottu'||level from dual connect by level <= 1000000;
-- 在表TMP_B插入十万条记录
insert into TMP_B select level, dbms_random.string('x',8), 'lottu'||level from dual connect by level <= 100000;
通过执行计划;会发现not in/ not exists 效率相差不大。
insert into B
select id, userid, info from TMP_B tb
where not exists (select 1 from B b where b.userid = tb.userid);
100000 rows created.
Elapsed: 00:00:00.52
Execution Plan
----------------------------------------------------------
Plan hash value: 3462170537
------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)|Time |
------------------------------------------------------------------------------------------
| 0 | INSERT STATEMENT | | 115K| 4389K| | 2994 (1)|00:00:36 |
| 1 | LOAD TABLE CONVENTIONAL | B | | | | | |
|* 2 | HASH JOIN ANTI | | 115K| 4389K| 4960K| 2994 (1)|00:00:36 |
| 3 | TABLE ACCESS FULL | TMP_B | 115K| 3601K| | 137 (1)|00:00:02 |
| 4 | TABLE ACCESS FULL | B | 1375K| 9403K| | 1372 (1)|00:00:17 |
-----------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access("B"."USERID"="TB"."USERID")
Note
-----
- dynamic sampling used for this statement (level=2)
Statistics
----------------------------------------------------------
0 recursive calls
3221 db block gets
6320 consistent gets
35 physical reads
3649980 redo size
845 bytes sent via SQL*Net to client
866 bytes received via SQL*Net from client
3 SQL*Net roundtrips to/from client
1 sorts (memory)
0 sorts (disk)
100000 rows processed
知识兔替换1: 通过merge into改写语句
merge into B b
using TMP_B tb
on (tb.userid = b.userid)
WHEN NOT MATCHED THEN
INSERT (b.id, b.userid, b.info) values (tb.id, tb.userid, tb.info);
知识兔通过执行计划;该SQL会占用内存用于排序。效率会有大大的提升。 特别是在B数据量很大的情况。优势更明显。
Elapsed: 00:00:00.18
Execution Plan
----------------------------------------------------------
Plan hash value: 2722554344
--------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time|
--------------------------------------------------------------------------------------
| 0 | MERGE STATEMENT | | 115K| 7203K| | 5298 (1)| 00:01:04 |
| 1 | MERGE | B | | | | ||
| 2 | VIEW | | | | | ||
|* 3 | HASH JOIN OUTER | | 115K| 8553K| 4960K| 5298 (1)| 00:01:04 |
| 4 | TABLE ACCESS FULL| TMP_B | 115K| 3601K| | 137 (1)| 00:00:02 |
| 5 | TABLE ACCESS FULL| B | 1336K| 56M| | 1373 (1)| 00:00:17 |
--------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
3 - access("TB"."USERID"="B"."USERID"(+))
Note
-----
- dynamic sampling used for this statement (level=2)
Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
5496 consistent gets
0 physical reads
0 redo size
844 bytes sent via SQL*Net to client
896 bytes received via SQL*Net from client
3 SQL*Net roundtrips to/from client
1 sorts (memory)
0 sorts (disk)
0 rows processed
知识兔替换2: 通过kettle工具,设置转换;通过shell命令;用cron或者其他调度系统调用。该操作相对用户是透明的;至于效率方面有待验证。
看到此处的朋友,若有更好的方法;欢迎在下评论;一起讨论。
最后该需求若在PostgreSQL中;有更简洁的的写法.简单粗暴。
INSERT INTO B
SELECT *
FROM A on conflict (userid) do nothing;
知识兔