Friday, July 26, 2013

Find rows that have duplicate values in a column

This post describes a test on finding rows that have duplicate values in a column. The purpose is to understand which method is more efficient: the analytic function approach or the subquery approach. First of all, a test table called "big_table" was created with 10,000,000 rows and an ID column of NUMBER data type, with ID values running from 1 to 10,000,000.
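The table's DDL is not shown here; a minimal sketch of a comparable setup, assuming the rows are manufactured from DBA_OBJECTS (the column list matches the one used later in this post), could be:

    -- Assumed setup, not the original DDL: multiply DBA_OBJECTS
    -- up to 10M rows and number them with ROWNUM as the ID.
    create table big_table nologging as
      select rownum id, o.owner, o.object_name, o.object_type,
             o.created, o.status
        from dba_objects o,
             (select level n from dual connect by level <= 2000)
       where rownum <= 10000000;

Then I created 10 duplicate rows with the following insert statement: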
SQL> insert into big_table select * from big_table where rownum<=10;
I also created an index on the ID column.
SQL> create index big_table_ix on big_table(id) nologging;

Index created.

Elapsed: 00:01:34.60

The analytic function approach used the following SQL statement to find those rows:
        SELECT rid
          FROM (SELECT ROWID rid,
                       ROW_NUMBER () OVER (PARTITION BY id ORDER BY ROWID) rn
                  FROM big_table)
         WHERE rn <> 1;

The execution plan of the above SQL is as follows:
-----------------------------------------------------------------------------------------
| Id  | Operation           | Name      | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |           |    10M|   238M|       | 90333   (2)| 00:18:04 |
|*  1 |  VIEW               |           |    10M|   238M|       | 90333   (2)| 00:18:04 |
|   2 |   WINDOW SORT       |           |    10M|   171M|   268M| 90333   (2)| 00:18:04 |
|   3 |    TABLE ACCESS FULL| BIG_TABLE |    10M|   171M|       | 32467   (1)| 00:06:30 |
-----------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("RN"<>1)

The subquery approach used the following SQL statement:
       select rowid
         from big_table a
        where a.rowid > (select min(rowid)
                           from big_table b
                          where b.id = a.id);

And the execution plan looks like:
------------------------------------------------------------------------------------------------
| Id  | Operation               | Name         | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT        |              |   495K|    17M|       | 77095   (3)| 00:15:26 |
|*  1 |  FILTER                 |              |       |       |       |            |          |
|   2 |   HASH GROUP BY         |              |   495K|    17M|   425M| 77095   (3)| 00:15:26 |
|*  3 |    HASH JOIN            |              |    10M|   346M|   286M| 39374   (2)| 00:07:53 |
|   4 |     INDEX FAST FULL SCAN| BIG_TABLE_IX |    10M|   171M|       |  5303   (3)| 00:01:04 |
|   5 |     INDEX FAST FULL SCAN| BIG_TABLE_IX |    10M|   171M|       |  5303   (3)| 00:01:04 |
------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter(ROWID>MIN(ROWID))
   3 - access("B"."ID"="A"."ID")


If there is no index on the ID column, the execution plan of the subquery approach looks like:
------------------------------------------------------------------------------------------
| Id  | Operation            | Name      | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |           |   495K|    17M|       |   131K  (2)| 00:26:18 |
|*  1 |  FILTER              |           |       |       |       |            |          |
|   2 |   HASH GROUP BY      |           |   495K|    17M|   425M|   131K  (2)| 00:26:18 |
|*  3 |    HASH JOIN         |           |    10M|   346M|   286M| 93703   (2)| 00:18:45 |
|   4 |     TABLE ACCESS FULL| BIG_TABLE |    10M|   171M|       | 32467   (1)| 00:06:30 |
|   5 |     TABLE ACCESS FULL| BIG_TABLE |    10M|   171M|       | 32467   (1)| 00:06:30 |
------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter(ROWID>MIN(ROWID))
   3 - access("B"."ID"="A"."ID")


I used the AUTOTRACE tool to obtain the consistent gets and SET TIMING ON before executing the SQL statements; each approach was measured three times.
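The exact SQL*Plus setup is not shown in the post; the standard commands for this kind of measurement are:

    SET TIMING ON
    SET AUTOTRACE TRACEONLY STATISTICS

Below are the results: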
approach       consistent gets   execution time
-----------------------------------------------
analytic       146613            1m50s
               146607            1m48s
               146545            1m36s
-----------------------------------------------
subquery        47293            2m25s
(with index)    47228            2m18s
                47309            2m23s
-----------------------------------------------
subquery       293172            2m27s
(no index)     293146            2m30s
               293172            2m31s
-----------------------------------------------
It can be seen that, compared to the subquery approach (with index), the analytic function approach is faster by about 40 seconds, although it requires more logical I/O. The subquery approach without the index required double the I/O of the analytic function approach, because it scans big_table twice, as shown in its execution plan. The execution time is not determined by I/O; these queries are CPU bound, which SQL_TRACE/TKPROF shows clearly. Below is a summary of data from the TKPROF report (the tracing steps are sketched after the table):
approach     cpu (s)    elapsed (s)   query (consistent gets)
-------------------------------------------------------------
analytic     109.57     111.35        146631
subquery     172.68     173.99         47314
-------------------------------------------------------------
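The trace itself comes from the standard SQL_TRACE/TKPROF workflow; the commands below are the usual ones, with placeholder file names:

    alter session set tracefile_identifier = 'dup_test';
    alter session set sql_trace = true;
    -- run the statement under test, then:
    alter session set sql_trace = false;

Then format the resulting trace file on the server, for example:

    tkprof <instance>_ora_<spid>_dup_test.trc dup_test.txt sys=no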
Obviously, CPU time is predominant. That explains the very small difference in execution time between the subquery approach with and without the index, and it suggests that creating the index on the ID column brings little benefit when finding the duplicate values in the column. However, there could be a case where we already know the duplicate values in the column and what we want to do is remove the extra rows carrying them. This situation actually happened recently in our production environment. We did a 1.5M-row data load into a production table, having dropped all of the table's indexes beforehand. But we made a mistake: we loaded 50K rows twice. As a result, when we tried to create the unique index on the PK column, it failed. I used the analytic function approach to delete the extra rows, and it took 50 minutes. The delete statement is similar to the following:
        delete from big_table
         where rowid in (SELECT rid
                           FROM (SELECT ROWID rid,
                                        ROW_NUMBER () OVER (PARTITION BY id ORDER BY ROWID) rn
                                   FROM big_table)
                          WHERE rn <> 1);

The above approach is not the most efficient one in this case. Because we already know the 50K duplicate PK values, the better approach is to create a non-unique index on the PK column, construct a driver table (ids) containing the 50K values, and then delete through the driver table, as shown below.
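The preparation step is not spelled out in the post; a minimal sketch, assuming the known key values are still available in a staging table from the load (the name load_stage is hypothetical), could be:

    create index big_table_ix on big_table(id) nologging;

    -- ids carries only the 50K key values known to be duplicated;
    -- load_stage stands in for wherever those values came from.
    create table ids as select id from load_stage;

With the index and driver table in place, the delete is: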
    delete from big_table a
     where a.id in (select id from ids)
       and a.rowid > (select min(rowid)
                        from big_table b
                       where b.id = a.id);


This approach took about 20 minutes, including the time to create the index and the driver table.

Monday, July 15, 2013

Redo Reduction by Using a Global Temporary Table

There are about 1M rows in BIG_TABLE. The goal is to load data (transformed in some way in a real situation) from BIG_TABLE to MY_BIG_TABLE.
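The target table's DDL is not given in the post; presumably it is an empty copy of big_table's columns, along these lines:

    -- Assumed target table: an empty copy of big_table's columns.
    create table my_big_table as
      select id, owner, object_name, object_type, created, status
        from big_table
       where 1 = 0;

The following two procedures were compared: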

  • Procedure test_pro



create or replace procedure test_pro IS
  -- Fetch the source rows in batches and array-insert them directly
  -- into the permanent target table (conventional-path inserts,
  -- which generate full redo for the data blocks).
  CURSOR BIG_TABLE_CUR
  IS
    SELECT id, owner, object_name, object_type, created, status
      FROM big_table;

  TYPE BIG_TABLE_TYPE
  IS TABLE OF BIG_TABLE_CUR%ROWTYPE INDEX BY PLS_INTEGER;
  BIG_TABLE_REC BIG_TABLE_TYPE;

BEGIN
  OPEN BIG_TABLE_CUR;
  LOOP
    FETCH BIG_TABLE_CUR BULK COLLECT
      INTO BIG_TABLE_REC LIMIT 2000;
    EXIT WHEN BIG_TABLE_REC.COUNT = 0;
    FORALL i IN 1 .. BIG_TABLE_REC.count
      INSERT INTO my_big_table VALUES BIG_TABLE_REC (i);
    COMMIT;
  END LOOP;
  CLOSE BIG_TABLE_CUR;
END;

  • Procedure test_pro2, where my_big_table_gtt is a global temporary table (its assumed DDL is sketched after the procedure)

create or replace procedure test_pro2 IS
  -- Same batching as test_pro, but the rows are first staged in a
  -- global temporary table (which generates redo only for its undo)
  -- and then copied to the target with a direct-path insert.
  CURSOR BIG_TABLE_CUR
  IS
    SELECT id, owner, object_name, object_type, created, status
      FROM big_table;

  TYPE BIG_TABLE_TYPE
  IS TABLE OF BIG_TABLE_CUR%ROWTYPE INDEX BY PLS_INTEGER;
  BIG_TABLE_REC BIG_TABLE_TYPE;

BEGIN
  OPEN BIG_TABLE_CUR;
  LOOP
    FETCH BIG_TABLE_CUR BULK COLLECT
      INTO BIG_TABLE_REC LIMIT 2000;
    EXIT WHEN BIG_TABLE_REC.COUNT = 0;
    FORALL i IN 1 .. BIG_TABLE_REC.count
      INSERT INTO my_big_table_gtt VALUES BIG_TABLE_REC (i);
    COMMIT;
  END LOOP;
  CLOSE BIG_TABLE_CUR;
  -- Direct-path (APPEND) insert from the GTT into the permanent table.
  execute immediate 'insert /*+ append */ into my_big_table select * from my_big_table_gtt';
  commit;
END;
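
The post does not show the DDL of my_big_table_gtt. Note that it must be created ON COMMIT PRESERVE ROWS, because test_pro2 commits inside the loop and the staged rows have to survive those commits until the final INSERT ... SELECT; an assumed sketch:

    -- Assumed DDL: PRESERVE ROWS keeps the staged rows across the
    -- in-loop commits until the final direct-path copy.
    create global temporary table my_big_table_gtt
      on commit preserve rows
      as select * from my_big_table where 1 = 0;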

I measured the redo with v$mystat during the test; the results are shown below.
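The measurement query is not in the original post; the standard way to read a session's cumulative redo from V$MYSTAT is to snapshot the 'redo size' statistic before and after each run:

    -- Session-level redo generated so far, from V$MYSTAT/V$STATNAME.
    select b.value
      from v$statname a, v$mystat b
     where a.statistic# = b.statistic#
       and a.name = 'redo size';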

test_pro     redo: 71M    run time: 4.57 s
test_pro2    redo:  6M    run time: 4.59 s

The test database is in noarchivelog mode. That matters for the numbers above: in noarchivelog mode the direct-path (APPEND) insert in test_pro2 generates almost no redo for the target table's data blocks, and the inserts into the global temporary table generate redo only for their undo, which accounts for the drop from 71M to 6M.