Tuesday, June 24, 2008

Database system-wise slow related to adding datafiles to an ASSM tablespace

From about Feburay to March, one of our databases (Oracle 9.2.0.8) experienced system-wise slow from time to time, which can last about 1-2 hours. We typically see more than 100 active sessions during the slow period, whereas normaly active sessios are less than 30 during bussiness hours. Also we can find from v$session_wait that more than 100 sessions from application server user are waiting for such events as "log file sync", "direct path write(lob)" and "db file sequential read". From the statspack report, we confirmed that the "direct path write(lob)" did not appear in the top 5 timed events section before or after the problem period.

From application team, we know that there is a process that consists of three SQLs issued against a large table with a CLOB column:

(1) INSERT INTO TABLE_WITH_CLOB (MID,AID,OID,SEID,OT,MN,MT,MESSAGE,MP,MPT,AS,PBS)
VALUES(:1,:2,:3,:4,:5,:6,sysdate,EMPTY_CLOB(),:7,sysdate,:8,:9)

(2) SELECT MESSAGE FROM TABLE_WITH_CLOB WHERE MID = :1 FOR UPDATE

(3) Update the CLOB column: MESSAGE ( I don't know the exact sql for this)


During the problem period, the process from multiple sessions are extremly slow. What caused the slowness puzzled us for quite a long time. Initially, whenever the problem occurs, we start to analyze this table. But there are no solid evidences that this really helps. We even set up a cron job to analyze this table every day. One time, I observed that the problem was gone before the analyze job even finished, this indicated that the problem was not related to the optimizer stats about this table.

I was asked to establish more metrices to describe the problem. For example, if we can quantify the work load on this table, possibly we can tell the application team that the problem occurs is because the workload increases. So this is not controlled by our DBA team and we can feel less embarassed about unable to find the cause.

By interacting with Oracle support,I finally think that the file/tablespace I/O stats may be used to quantify the workload.

I used the following query to obtain the file I/O stats from statspack report data:
(Assume COL_CLOB_1 is the tablespace for the CLOB data in the questioned table, COL_CLOB_1_n.dbf is the datafile name typically)




column tsname format a25
column filename format a25
column read_tim format 999,999
column write_tim format 999,999,999

select
snap_time start_snap_tim,
-- old.tsname,
-- old.filename,
new.phyrds-old.phyrds phy_rds,
new.phywrts-old.phywrts phy_wrts,
new.readtim-old.readtim read_tim,
new.writetim-old.writetim write_tim
from
perfstat.stats$filestatxs old,
perfstat.stats$filestatxs new,
perfstat.stats$snapshot sn
where
old.snap_id = sn.snap_id
and
old.filename = new.filename
and
new.snap_id = sn.snap_id+1
-- and (new.phyrds-old.phyrds) < 0
and
old.tsname = 'COL_CLOB_LOB_1'
and old.filename like '%_1_4.dbf'
and to_char(snap_time, 'YYYY-MM-DD') = '&date_char'
order by 1, 2
;





Soon I have discovered that the write_tim value is execeptionally high during the problem period, for example, on Mar 17 from
15:10 to 16:00.

>>>> datafile '/db2/u06/oradata/dbprd/COL_CLOB_1_04.dbf' I/O stats




START_SNAP_TIM PHY_RDS PHY_WRTS READ_TIM WRITE_TIM
------------------- ---------- ---------- -------- ------------
2008-03-17 13:40:02 7 3527 26 23,517
2008-03-17 13:50:02 20 3481 124 23,237
2008-03-17 14:00:04 5 3735 1 28,764
2008-03-17 14:10:01 9 3730 26 26,216
2008-03-17 14:20:03 7 3869 1 32,174
2008-03-17 14:30:03 7 3887 19 30,551
2008-03-17 14:40:02 7 3890 47 29,537
2008-03-17 14:50:02 11 3775 33 29,188
2008-03-17 15:00:04 6 3947 38 27,910
2008-03-17 15:10:03 7 1294 1,000 2,619,551
2008-03-17 15:20:10 6 935 1,006 3,076,456
2008-03-17 15:30:30 2 924 1,491 2,486,059
2008-03-17 15:40:35 2 846 151 2,762,804
2008-03-17 15:50:17 1 906 308 2,682,676
2008-03-17 16:00:46 9 2158 402 1,395,569
2008-03-17 16:10:12 14 10507 12 144,262
2008-03-17 16:20:05 16 10177 22 56,643
2008-03-17 16:30:05 19 7908 86 77,863
2008-03-17 16:40:03 11 3947 9 21,317
2008-03-17 16:50:01 8 4079 7 23,791
2008-03-17 17:00:04 8 3699 13 22,332






I've also foud out that every time we added new datafile to this tablespace (I was able to dig out this info from the alert log), we can see the "write_tim" spike. And every database slow incident we know did occur shortly after we added a new datafile to this tablespace. This tablespace grows very fast, about 10 GB per week. We (differnet DBAs whoever on-call) usually add 3-4GB datafile immediately once getting the space alert email. With this observation, I have suggested that we should only add datafile during weekend or weeknight, avoid doing so during the normal bussiness hours. Since then, we have never encountered the same slowness problem again.

In this case, though we are able to avoid the problem, we don't understant the root-cause fully. I suspect there are something to do with the ASSM tablespace feature. It seems there is a high overhead to intitialize a newly added datafile.

No comments: