SQL> insert into big_table select * from big_table where rownum<=10;
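(The creation of big_table itself is not shown in this excerpt. Purely as an illustrative stand-in, assuming nothing more than a numeric ID column plus some padding, a roughly 10-million-row test table could be built as below; the PAD column and the row generators are my own invention, not part of the original test:)

-- illustrative stand-in for the ~10M-row test table
create table big_table nologging as
  select rownum id, rpad('x', 100, 'x') pad
    from (select level n from dual connect by level <= 3163) a,
         (select level n from dual connect by level <= 3163) b;  -- 3163^2 ~= 10M rows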
I also created an index on the ID column.

SQL> create index big_table_ix on big_table(id) nologging;

Index created.

Elapsed: 00:01:34.60

The analytic function approach used the following SQL statement to find those rows:
SELECT rid
  FROM (SELECT ROWID rid,
               ROW_NUMBER () OVER (PARTITION BY id ORDER BY ROWID) rn
          FROM big_table)
 WHERE rn <> 1;

The execution plan of the above SQL is as follows:
-----------------------------------------------------------------------------------------
| Id  | Operation           | Name      | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |           |    10M|   238M|       | 90333   (2)| 00:18:04 |
|*  1 |  VIEW               |           |    10M|   238M|       | 90333   (2)| 00:18:04 |
|   2 |   WINDOW SORT       |           |    10M|   171M|   268M| 90333   (2)| 00:18:04 |
|   3 |    TABLE ACCESS FULL| BIG_TABLE |    10M|   171M|       | 32467   (1)| 00:06:30 |
-----------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("RN"<>1)
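To see why the rn <> 1 filter picks out exactly the redundant copies, here is a tiny illustration on a throwaway table (dup_demo is a made-up name for this sketch): within each ID group the rows are numbered by ROWID, so the first physical copy gets rn = 1 and every additional copy gets rn > 1.

create table dup_demo (id number);

insert into dup_demo values (1);
insert into dup_demo values (1);   -- duplicate of id = 1
insert into dup_demo values (2);
commit;

-- the copy with the smallest ROWID in each id group gets rn = 1;
-- rows with rn <> 1 are the extra copies that are safe to delete
select rowid rid, id,
       row_number() over (partition by id order by rowid) rn
  from dup_demo;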
The subquery approach used the following SQL statement:

select rowid
  from big_table a
 where a.rowid > (select min(rowid)
                    from big_table b
                   where b.id = a.id);

And the execution plan looks like:
------------------------------------------------------------------------------------------------
| Id  | Operation               | Name         | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT        |              |   495K|    17M|       | 77095   (3)| 00:15:26 |
|*  1 |  FILTER                 |              |       |       |       |            |          |
|   2 |   HASH GROUP BY         |              |   495K|    17M|   425M| 77095   (3)| 00:15:26 |
|*  3 |    HASH JOIN            |              |    10M|   346M|   286M| 39374   (2)| 00:07:53 |
|   4 |     INDEX FAST FULL SCAN| BIG_TABLE_IX |    10M|   171M|       |  5303   (3)| 00:01:04 |
|   5 |     INDEX FAST FULL SCAN| BIG_TABLE_IX |    10M|   171M|       |  5303   (3)| 00:01:04 |
------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter(ROWID>MIN(ROWID))
   3 - access("B"."ID"="A"."ID")

If there is no index on the ID column, the execution plan of the subquery approach looks like:
------------------------------------------------------------------------------------------
| Id  | Operation            | Name      | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |           |   495K|    17M|       |   131K  (2)| 00:26:18 |
|*  1 |  FILTER              |           |       |       |       |            |          |
|   2 |   HASH GROUP BY      |           |   495K|    17M|   425M|   131K  (2)| 00:26:18 |
|*  3 |    HASH JOIN         |           |    10M|   346M|   286M| 93703   (2)| 00:18:45 |
|   4 |     TABLE ACCESS FULL| BIG_TABLE |    10M|   171M|       | 32467   (1)| 00:06:30 |
|   5 |     TABLE ACCESS FULL| BIG_TABLE |    10M|   171M|       | 32467   (1)| 00:06:30 |
------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter(ROWID>MIN(ROWID))
   3 - access("B"."ID"="A"."ID")
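(The post does not show how these plans were captured; one standard way, sketched here, is EXPLAIN PLAN followed by DBMS_XPLAN:)

explain plan for
select rowid
  from big_table a
 where a.rowid > (select min(rowid)
                    from big_table b
                   where b.id = a.id);

select * from table(dbms_xplan.display);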
I used the AUTOTRACE tool to obtain the consistent gets and set TIMING ON to capture the execution times before running the SQL statements (the session setup is sketched after the analysis below). I measured each approach three times. Below are the results:
approach      consistent gets   execution time
----------------------------------------------
analytic            146613           1m50s
                    146607           1m48s
                    146545           1m36s
----------------------------------------------
subquery             47293           2m25s
                     47228           2m18s
                     47309           2m23s
----------------------------------------------
subquery            293172           2m27s
(no index)          293146           2m30s
                    293172           2m31s
----------------------------------------------

It can be seen that, compared to the subquery approach (with index), the analytic function approach is faster by roughly 40 seconds, although it requires more logical I/O. The subquery approach without the index required double the I/O of the analytic function approach, since it scans big_table twice, as shown in the execution plan. Execution time is not determined by I/O here; these queries are CPU bound, and SQL_TRACE/TKPROF shows that clearly.
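For reference, this is how such numbers can be gathered in SQL*Plus; the exact commands of the original test are not shown, so treat this as a minimal sketch:

set timing on
set autotrace traceonly statistics   -- reports "consistent gets" for each run

-- for the TKPROF figures, trace the session instead:
alter session set tracefile_identifier = 'dup_test';  -- tag the trace file name
alter session set sql_trace = true;
-- ... run the query under test ...
alter session set sql_trace = false;

-- then format the raw trace file on the database server:
-- tkprof <trace_file>.trc dup_report.txt sys=no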
Below is a summary of the data from the TKPROF report:

approach        cpu     elapsed     query
-------------------------------------------
analytic     109.57      111.35    146631
subquery     172.68      173.99     47314
-------------------------------------------

Obviously, CPU time is predominant. That explains the very small difference in execution time between the subquery approaches with and without the index, and it suggests that creating an index on the ID column brings little benefit when looking for the duplicate values in that column.

However, there are cases where we already know the duplicate values in the column and all we want to do is remove the extra rows carrying those values. This situation actually happened recently in our production environment. We loaded 1.5M rows into a production table, having dropped all of the table's indexes before the load. But we made a mistake: we loaded 50K of the rows twice. As a result, when we tried to create the unique index on the PK column, it failed (ORA-01452: cannot CREATE UNIQUE INDEX; duplicate keys found). I used the analytic function approach to delete the extra rows; it took 50 minutes. The delete statement was similar to the following:
delete from big_table
 where rowid in (SELECT rid
                   FROM (SELECT ROWID rid,
                                ROW_NUMBER () OVER (PARTITION BY id ORDER BY ROWID) rn
                           FROM big_table)
                  WHERE rn <> 1);

The above approach is not the most efficient one in this case. Because we already know the 50K duplicate PK values, the better approach is to create a non-unique index on the PK column, construct a driver table (ids) containing the 50K values, and then use the following SQL:
delete from big_table a
 where a.id in (select id from ids)
   and a.rowid > (select min(rowid)
                    from big_table b
                   where b.id = a.id);

This approach took about 20 minutes, including the time to create the index and the driver table.
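(The post does not show how the driver table was built. When the duplicate keys are known from the load file, ids can simply be loaded from it; failing that, one possible way to derive it from the table itself, together with the supporting non-unique index, is sketched below. The names mirror the example statements above; the GROUP BY derivation is my own suggestion, not necessarily the method actually used:)

-- non-unique index so the correlated MIN(rowid) subquery stays cheap
create index big_table_ix on big_table(id) nologging;

-- driver table holding only the key values that occur more than once
create table ids nologging as
  select id
    from big_table
   group by id
  having count(*) > 1;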