Friday, July 26, 2013

Find rows that have duplicate values in a column

This post describes my test of finding rows that have duplicate values in a column. The purpose is to understand which method is more efficient: the analytic function approach or the subquery approach. First of all, a test table called "big_table" was created with 10,000,000 rows and an ID column of NUMBER data type (one possible way to build such a table is sketched below). The values of ID run from 1 to 10,000,000. Then I created 10 duplicate rows with the following insert statement:
SQL> insert into big_table select * from big_table where rownum<=10;
I also created an index on the ID column.
SQL> create index big_table_ix on big_table(id) nologging;

Index created.

Elapsed: 00:01:34.60
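
For reference, a 10M-row table of this shape can be generated with a row generator. This is only a sketch of one possible setup, not necessarily how the original big_table was built; the padding column is a hypothetical filler standing in for the table's other columns:

        create table big_table nologging as
        select rownum as id,
               rpad('x', 20, 'x') as padding   -- hypothetical filler column
          from (select level n from dual connect by level <= 10000),
               (select level n from dual connect by level <= 1000);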

The analytic function approach used the following SQL statement to find those rows. ROW_NUMBER() numbers the rows of each ID value in ROWID order, so the first copy of each ID gets rn = 1 and the rn <> 1 filter returns exactly the extra copies:
        SELECT rid
          FROM (SELECT ROWID rid,
                       ROW_NUMBER() OVER (PARTITION BY id ORDER BY ROWID) rn
                  FROM big_table)
         WHERE rn <> 1;
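
To illustrate with made-up values, for an ID that appears twice the inner query produces rows like the following (the rowids are invented; the id column is shown only for context):

        -- rid                  (id)  rn
        -- AAAWm2AAEAAAAgTAAA     5    1   -- first copy, filtered out
        -- AAAWm2AAEAAAAgTAAB     5    2   -- extra copy, returned by rn <> 1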

The execution plan of the above SQL is as follows:
-----------------------------------------------------------------------------------------
| Id  | Operation           | Name      | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |           |    10M|   238M|       | 90333   (2)| 00:18:04 |
|*  1 |  VIEW               |           |    10M|   238M|       | 90333   (2)| 00:18:04 |
|   2 |   WINDOW SORT       |           |    10M|   171M|   268M| 90333   (2)| 00:18:04 |
|   3 |    TABLE ACCESS FULL| BIG_TABLE |    10M|   171M|       | 32467   (1)| 00:06:30 |
-----------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("RN"<>1)

The subquery approach used the following SQL statement; it returns every row whose ROWID is greater than the minimum ROWID of its ID value, i.e. every copy except the first:
       select rowid
         from big_table a
        where a.rowid >
              (select min(rowid)
                 from big_table b
                where b.id = a.id);

And the execution plan looks like the following. Since only ID and ROWID are referenced, the optimizer can satisfy both table accesses with fast full scans of the index:
------------------------------------------------------------------------------------------------
| Id  | Operation               | Name         | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT        |              |   495K|    17M|       | 77095   (3)| 00:15:26 |
|*  1 |  FILTER                 |              |       |       |       |            |          |
|   2 |   HASH GROUP BY         |              |   495K|    17M|   425M| 77095   (3)| 00:15:26 |
|*  3 |    HASH JOIN            |              |    10M|   346M|   286M| 39374   (2)| 00:07:53 |
|   4 |     INDEX FAST FULL SCAN| BIG_TABLE_IX |    10M|   171M|       |  5303   (3)| 00:01:04 |
|   5 |     INDEX FAST FULL SCAN| BIG_TABLE_IX |    10M|   171M|       |  5303   (3)| 00:01:04 |
------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter(ROWID>MIN(ROWID))
   3 - access("B"."ID"="A"."ID")


If there is no index on the ID column, the execution plan of the subquery approach looks like:
------------------------------------------------------------------------------------------
| Id  | Operation            | Name      | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |           |   495K|    17M|       |   131K  (2)| 00:26:18 |
|*  1 |  FILTER              |           |       |       |       |            |          |
|   2 |   HASH GROUP BY      |           |   495K|    17M|   425M|   131K  (2)| 00:26:18 |
|*  3 |    HASH JOIN         |           |    10M|   346M|   286M| 93703   (2)| 00:18:45 |
|   4 |     TABLE ACCESS FULL| BIG_TABLE |    10M|   171M|       | 32467   (1)| 00:06:30 |
|   5 |     TABLE ACCESS FULL| BIG_TABLE |    10M|   171M|       | 32467   (1)| 00:06:30 |
------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter(ROWID>MIN(ROWID))
   3 - access("B"."ID"="A"."ID")


I used the AUTOTRACE tool to obtain the consistent gets and turned SET TIMING ON before executing the SQLs (the SQL*Plus settings are sketched after the table). I measured each approach three times. Below are the results:
approach     consistent gets   execution time
---------------------------------------------
analytic     146613            1m50s
             146607            1m48s
             146545            1m36s
---------------------------------------------
subquery      47293            2m25s
(index)       47228            2m18s
              47309            2m23s
---------------------------------------------
subquery     293172            2m27s
(no index)   293146            2m30s
             293172            2m31s
---------------------------------------------
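The session setup for these measurements amounts to standard SQL*Plus commands along these lines (a sketch; AUTOTRACE TRACEONLY STATISTICS executes the query and reports the statistics, including consistent gets, without displaying the result rows):

SQL> set timing on
SQL> set autotrace traceonly statistics
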
It can be seen that, compared to the subquery approach with the index, the analytic function approach is faster by about 40 seconds although it performs roughly three times the logical I/O. The subquery approach without the index required double the I/O of the analytic function approach, since it has to scan big_table twice, as shown in the execution plan. The execution time is not determined by I/O: these queries are CPU-bound, and SQL_TRACE/TKPROF shows that clearly. Below is a summary of data from the TKPROF report:
approach     cpu (s)    elapsed (s)   query (consistent gets)
-------------------------------------------------------------
analytic     109.57     111.35        146631
subquery     172.68     173.99         47314
-------------------------------------------------------------
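The TKPROF figures come from tracing the session and formatting the resulting trace file; the usual sequence looks like this (the trace file name is a placeholder, to be found under the user dump directory):

SQL> alter session set tracefile_identifier = 'dup_test';
SQL> alter session set sql_trace = true;
SQL> REM run the query under test, then:
SQL> alter session set sql_trace = false;
$ tkprof <trace_file>.trc dup_test.prf sys=no
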
Obviously, CPU time is predominant. That explains the very small difference in execution time between the subquery approach with and without the index, and it suggests that creating the index on the ID column brings little benefit for finding the duplicate values in the column.

However, there can be a case where we already know the duplicate values in the column, and what we want to do is remove the extra rows carrying those values. This situation actually happened recently in our production environment. We did a 1.5M-row data load into a production table, and before the data load we dropped all the indexes on this table. But we made a mistake: we loaded 50K rows twice. As a result, when we tried to create the unique index on the PK column, it failed. I used the analytic function approach to delete the extra rows; it took 50 minutes. The delete statement is similar to the following:
        delete from big_table
         where rowid in (SELECT rid
                           FROM (SELECT ROWID rid,
                                        ROW_NUMBER() OVER (PARTITION BY id ORDER BY ROWID) rn
                                   FROM big_table)
                          WHERE rn <> 1);

The above approach is not the most efficient one in this case. Because we already know the 50K duplicate PK values, the better approach is to create a non-unique index on the PK column, construct a driver table (ids) containing the 50K values, and then delete with the following SQL:
    delete from big_table a
     where a.id in (select id from ids)
       and a.rowid >
           (select min(rowid)
              from big_table b
             where b.id = a.id);
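
The supporting objects can be created along these lines. In our case the 50K duplicate IDs were known from the load, so the driver table could be populated directly from the load file; the GROUP BY below is just one way to rebuild the list if it had to be rediscovered:

    create index big_table_ix on big_table(id) nologging;

    create table ids nologging as
    select id
      from big_table
     group by id
    having count(*) > 1;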


This approach took about 20 minutes, including the time to create the index and the driver table.
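
After the cleanup, the unique index creation that had originally failed can be retried. The non-unique helper index on the same column has to be dropped first, since Oracle (at least through 11g) will not allow two indexes on an identical column list (object names here are illustrative):

SQL> drop index big_table_ix;
SQL> create unique index big_table_pk on big_table(id);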
