Thursday, March 28, 2013

Problems of Temporary Tablespace Usage after a Recent Release

After a recent release, we’ve encountered two problems in a production database:  (1) Temporary tablespace usage high; (2) Instance shutdown when temporary tablespace space used up.
Problem 2 turns out to be an Oracle feature:


Root cause of crash appears to be "Bug" 7691270 - Crash the DB in case of write errors (rather than just offline files) [ID 7691270.8]

Datafile Write Errors (Oracle Database 11gR1 to 11gR2 Change):

Starting with the patchset for Oracle Database 11g Release 2, a write error to any data file will cause the database to perform a shutdown abort of the instance.

In prior releases, I/O errors to datafiles not in the system tablespace would offline the respective datafiles when the database is in archivelog mode. This behavior is not always desirable. Some customers would prefer that the instance crash due to a datafile write error. A new hidden parameter called _datafile_write_errors_crash_instance has been introduced to control whether the instance should crash on a write error or if the datafiles should be taken offline on a write error.

If _datafile_write_errors_crash_instance = TRUE (default) then any write to a datafile which fails due to an IO error causes an instance crash.
If _datafile_write_errors_crash_instance = FALSE then the behavior reverts to the previous behavior (before this fix) such that a write error to a datafile offlines the file (provided the DB is in archivelog mode and the file is not in SYSTEM tablespace in which case the instance is aborted)

Best Practices: Monitor temporary tablespace usage with alerts, tune SQL to use less temporary space.

The increasing usage of temporary space after the release was due to the increasing usage of temporary lobs, which reside in the temporary tablespace. For example, there were 30+ queries using to_clob function in the release. Oracle won’t release temp lob space in the Weblogic connection unless we recycle the Weblogic app servers. There are several related MOS notes about this behavior:

  • How to Release the Temp LOB Space and Avoid Hitting ORA-1652 (Doc ID 802897.1)
  • How to Release Temporary LOB Segments without Closing the JDBC Connection (Doc ID 1384829.1)
  • Bug 5723140 - Temp LOB space not released after commit (Doc ID 5723140.8)
  • Temporary LOB space not released after commit: ora-1652 being hit (Doc ID 750209.1)


In the above notes, the fix to the temp lob space releasing problem is to set 60025 event. Below is a simple test to demonstrate this event works:

In Session 1, execute the following statement

select to_clob(text) from dba_source where rownum <100;

In session 2 monitoring session 1 temp space usage with:

SELECT   S.sid || ',' || S.serial# sid_serial,
         SUM (T.blocks) * TBS.block_size / 1024 / 1024 mb_used,
         COUNT(*) sort_ops
FROM     v$sort_usage T, v$session S, dba_tablespaces TBS, v$process P
WHERE    T.session_addr = S.saddr
AND      S.paddr = P.addr
and    s.sid=2927   --- this is the sid of session 1
AND      T.tablespace = TBS.tablespace_name
GROUP BY S.sid, S.serial#, S.username, S.osuser, P.spid, S.module,
         S.machine, S.program, TBS.block_size, T.tablespace,T.segtype 
ORDER BY mb_used 

You will see even after session 1 finish and idle there, in session 2 you will have 1MB temp space allocated to it

Second test:

In session 1, before run that query do:
alter session set events '60025 trace name context forever';

Then in session 2, you will see after session 1 becomes idle, no temp space allocated.

We did not set this event in our production to fix the issue, instead, dev team change the code and avoid use temporary lob. To demonstrate difference of the temp space usage before and after release and after the fix, I queried dba_hist_active_sess_history view.  From 11g there is a field called TEMP_SPACE_ALLOCATED in this view. So it is possible to calculate the average temp space allocated per active session during a certain period of time:


===== Look at all active session between 16:00 ~ 16:59 in past 30 days in dba_hist_active_sess_history
select stime, size_m/c avg_mb_sess from
  select to_char(sample_time, 'YYYY-MM-DD HH24') stime,count(*) c, round(sum(TEMP_SPACE_ALLOCATED)/1024/1024) size_m
  from dba_hist_active_sess_history
  where to_char(sample_time, 'HH24') = 16 and sample_time > sysdate-30
  group by to_char(sample_time, 'YYYY-MM-DD HH24')
order by 1

------------- -----------
2013-02-26 16   .00135459
2013-02-27 16  .002029264
2013-02-28 16  .000963298
2013-03-01 16  .000670166
2013-03-02 16  .001373417
2013-03-03 16   .00049838
2013-03-04 16  .000991616
2013-03-05 16  1.49810976
2013-03-06 16  1.12322684
2013-03-07 16  .000634719
2013-03-08 16  .001408054
2013-03-09 16  .081960706
2013-03-10 16  .000473485
2013-03-11 16  1.21929313
2013-03-12 16  .001113121
2013-03-13 16  .001084893
2013-03-14 16  .000864518
2013-03-15 16  1.70157371
2013-03-16 16  .114989041
2013-03-17 16  231.198656
2013-03-18 16  254.193626
2013-03-19 16  96.2858973
2013-03-20 16   93.731513
2013-03-21 16  135.284032
2013-03-22 16    1.462451
2013-03-23 16  .000423101
2013-03-24 16  .000462321
2013-03-25 16  .000395163
2013-03-26 16  16.1351924
2013-03-27 16  1.84635633

The release was happened on Saturday night of Mar 16 and Dev team fixed code on Mar 21 night completely. It can be seen, from Mar 17 till Mar 21, the average temp space allocated per active sessions were much higher compared to those before release and after the fix. During Mar 17 and Mar 21, what we did was that once the total temp space reached 150 GB, we recycled the app servers  in a controlled manner to release the temp space.