Monday, June 16, 2008

Resolve an issue of duplicated rows in a Shareplex replication environment

Table SMC is one of the tables that are replicated from the production database to the reporting database. In production, id is the PK column and there is also a unique constraint (UK) on the columns (oid,eid,mid,st) of the table.

The problem of duplicated rows was identified in terms of there are mutiple rows having same (oid, eid, mid,st) in the SMC table in the reporting database.

The reason for the existing of the duplicated rows may be as follows:
(1) The PK and UK constraints have not been enabled
(2) In the reporting database, single post has been changed to multiple post queue.

Since there is no testing environment available, the actual reason for the problem has not been able to be confirmed by repeating the issue. However, it is obviously wrong that the PK and UK are not enabled in the SMC table in reporting database.

To fix this problem, I have to identify and remove the duplicated rows and then enable the PK and UK.

First of all, by comparing the row counts of the table in production and reporting using the following query, it can be known that the duplication only occurred during one day. ( Since the table is huge: > 100 GB, it is not wise to try to find the duplicated rows in all date range)

select count(*) from sf.smc
where
mc_date >= to_date('2008-05-21 00:00:00', 'YYYY-MM-DD HH24:MI:SS')
and
mc_date <= to_date('2008-05-21 23:59:59', 'YYYY-MM-DD HH24:MI:SS')
/
Then, I created a temporary table from the SMC table in the reporting database as follows:
create table temp_a  nologging
as 
select 
id, oid, eid, mid, st 
FROM
sf.smc
where 
mc_date > to_date('2008-05-21 00:00:00', 'YYYY-MM-DD HH24:MI:SS')
and
mc_date > to_date('2008-05-21 23:59:59', 'YYYY-MM-DD HH24:MI:SS')
/

execute dbms_stats.gather_table_stats(user, 'TEMP_A', null,estimate_percent=>10);

The duplicated rows as well as the baseline row ( i.e. the row to which any other rows compared in order to determine if they are duplicated )are displayed through the following sql:
select id,  oid, eid, st, mid from smc
where (oid, eid, st, mid ) in (
select  oid, eid, st, mid
FROM
temp_a A
WHERE
rowid >
(SELECT min(rowid) FROM temp_a B
WHERE
B.oid = A.oid
and
B.eid = A.eid
and
B.mid = A.mid
and B.st = A.st
)
)
order by 2,3,4,5

The following are the duplicate rows in RPT
ID      OID             EID                   ST              MID
---------- -------------------- -------------------- ----------- ----------
1037693787 CICN087494895        489740124            Change           8
1036151673 CICN087494895        489740124            Change           8
1036151856 CICN087494953        430282795            Change           2
1036152154 CICN087494953        430282795            Change           2
1037621342 ICOG087494888        503141443            Install         19
1036151603 ICOG087494888        503141443            Install         19
1036151617 IICN087494988        503141881            Install          8
1036183677 IICN087494988        503141881            Install          8

8 rows selected.

Note the column ID is different, to decide which row should be removed, we need to know which ID is presented in the produciton database: The following are the corresponding row in the production
ID     OID                EID                   ST             MID
---------- -------------------- -------------------- ---------- ----------
1037693787 CICN087494895        489740124            Change           8
1036152154 CICN087494953        430282795            Change           2
1037621342 ICOG087494888        503141443            Install         19
1036183677 IICN087494988        503141881            Install          8





As a result, the rows with the following id will be deleted in the reporting database:

'1036151673'
'1036151856'
'1036151603'
'1036151617'


Note: schema, table and column names have been modified to hide any possible confidential information


updated: Mar 29, 10 - When dealing with duplicated rows issue, it is obvious we should be very clear about what the meaning of  'duplicate' for a particular table, all  columns value are same or just some columns values are same?. If we only consider PK columns, for example (col1, col2) are supposed to be PK, but the constraint are not enabled and we want to delete the duplicated rows in order to create the PK. We can do:

create table tab_tmp
as
select a.rowid, a.col1, a.col2
   from tab a
  where a.rowid >
     ( select max(rowid)
       from tab b
      where a.col1 = b.col1
         and a.col2=b.col2);

We maybe want to create index on (col1 col2) if not there  first , non-unique of course. Then we delete duplicated row by:

delete from tab a
where a.rowid in ( select rowid from tab_temp);


Updated: Jul 29, 2010 -

In this post: http://viralpatel.net/blogs/2010/06/deleting-duplicate-rows-in-oracle.html , several ways of deleting duplicated rows are summarized. One interesting method is using analytic function, i.e.
DELETE FROM tbl_test
 WHERE ROWID IN (
         SELECT rid
           FROM (SELECT ROWID rid,
                 ROW_NUMBER () OVER (PARTITION BY ser_no, fst_nm, deptid, cmnt ORDER BY ROWID) rn
                 FROM tbl_test)
        WHERE rn <>1);

No comments: