Oracle Study Notes by Denis: Database system-wise slow related to adding datafiles to an ASSM tablespace

From about Feburay to March, one of our databases (Oracle 9.2.0.8) experienced system-wise slow from time to time, which can last about 1-2 hours. We typically see more than 100 active sessions during the slow period, whereas normaly active sessios are less than 30 during bussiness hours. Also we can find from v$session_wait that more than 100 sessions from application server user are waiting for such events as "log file sync", "direct path write(lob)" and "db file sequential read". From the statspack report, we confirmed that the "direct path write(lob)" did not appear in the top 5 timed events section before or after the problem period.

From application team, we know that there is a process that consists of three SQLs issued against a large table with a CLOB column:

(1) INSERT INTO TABLE_WITH_CLOB (MID,AID,OID,SEID,OT,MN,MT,MESSAGE,MP,MPT,AS,PBS)
VALUES(:1,:2,:3,:4,:5,:6,sysdate,EMPTY_CLOB(),:7,sysdate,:8,:9)

(2) SELECT MESSAGE FROM TABLE_WITH_CLOB WHERE MID = :1 FOR UPDATE

(3) Update the CLOB column: MESSAGE ( I don't know the exact sql for this)

During the problem period, the process from multiple sessions are extremly slow. What caused the slowness puzzled us for quite a long time. Initially, whenever the problem occurs, we start to analyze this table. But there are no solid evidences that this really helps. We even set up a cron job to analyze this table every day. One time, I observed that the problem was gone before the analyze job even finished, this indicated that the problem was not related to the optimizer stats about this table.

I was asked to establish more metrices to describe the problem. For example, if we can quantify the work load on this table, possibly we can tell the application team that the problem occurs is because the workload increases. So this is not controlled by our DBA team and we can feel less embarassed about unable to find the cause.

By interacting with Oracle support,I finally think that the file/tablespace I/O stats may be used to quantify the workload.

I used the following query to obtain the file I/O stats from statspack report data:
(Assume COL_CLOB_1 is the tablespace for the CLOB data in the questioned table, COL_CLOB_1_n.dbf is the datafile name typically)



column tsname format a25
column filename format a25
column read_tim format  999,999
column write_tim format 999,999,999

select
   snap_time start_snap_tim,
--   old.tsname,
--   old.filename,
   new.phyrds-old.phyrds   phy_rds,
   new.phywrts-old.phywrts phy_wrts,
   new.readtim-old.readtim read_tim,
   new.writetim-old.writetim write_tim
from
   perfstat.stats$filestatxs old,
   perfstat.stats$filestatxs new,
   perfstat.stats$snapshot   sn
where
   old.snap_id = sn.snap_id
and
   old.filename = new.filename
and
   new.snap_id = sn.snap_id+1
-- and (new.phyrds-old.phyrds) < 0
and
   old.tsname = 'COL_CLOB_LOB_1'
   and old.filename like '%_1_4.dbf'
and to_char(snap_time, 'YYYY-MM-DD') = '&date_char'
order by 1, 2
;

Soon I have discovered that the write_tim value is execeptionally high during the problem period, for example, on Mar 17 from
15:10 to 16:00.

>>>> datafile '/db2/u06/oradata/dbprd/COL_CLOB_1_04.dbf' I/O stats



 START_SNAP_TIM         PHY_RDS   PHY_WRTS READ_TIM    WRITE_TIM
 ------------------- ---------- ---------- -------- ------------
 2008-03-17 13:40:02          7       3527       26       23,517
 2008-03-17 13:50:02         20       3481      124       23,237
 2008-03-17 14:00:04          5       3735        1       28,764
 2008-03-17 14:10:01          9       3730       26       26,216
 2008-03-17 14:20:03          7       3869        1       32,174
 2008-03-17 14:30:03          7       3887       19       30,551
 2008-03-17 14:40:02          7       3890       47       29,537
 2008-03-17 14:50:02         11       3775       33       29,188
 2008-03-17 15:00:04          6       3947       38       27,910
 2008-03-17 15:10:03          7       1294    1,000    2,619,551
 2008-03-17 15:20:10          6        935    1,006    3,076,456
 2008-03-17 15:30:30          2        924    1,491    2,486,059
 2008-03-17 15:40:35          2        846      151    2,762,804
 2008-03-17 15:50:17          1        906      308    2,682,676
 2008-03-17 16:00:46          9       2158      402    1,395,569
 2008-03-17 16:10:12         14      10507       12      144,262
 2008-03-17 16:20:05         16      10177       22       56,643
 2008-03-17 16:30:05         19       7908       86       77,863
 2008-03-17 16:40:03         11       3947        9       21,317
 2008-03-17 16:50:01          8       4079        7       23,791
 2008-03-17 17:00:04          8       3699       13       22,332

I've also foud out that every time we added new datafile to this tablespace (I was able to dig out this info from the alert log), we can see the "write_tim" spike. And every database slow incident we know did occur shortly after we added a new datafile to this tablespace. This tablespace grows very fast, about 10 GB per week. We (differnet DBAs whoever on-call) usually add 3-4GB datafile immediately once getting the space alert email. With this observation, I have suggested that we should only add datafile during weekend or weeknight, avoid doing so during the normal bussiness hours. Since then, we have never encountered the same slowness problem again.

In this case, though we are able to avoid the problem, we don't understant the root-cause fully. I suspect there are something to do with the ASSM tablespace feature. It seems there is a high overhead to intitialize a newly added datafile.

Oracle Study Notes by Denis

I am starting to study SQL Server as well

Tuesday, June 24, 2008

Database system-wise slow related to adding datafiles to an ASSM tablespace

No comments:

Like this Facebook page:

About Me

Twitter

Recent Comments

Blog Archive

Label Cloud

My Blog List

Oracle Links

Random Post