Problem 2 turns out to be an Oracle feature. The root cause of the crash appears to be:
"Bug" 7691270 - Crash the DB in case of write errors (rather than
just offline files) [ID 7691270.8]
Datafile Write Errors (Oracle Database 11gR1 to 11gR2 Change):
Starting with the 11.2.0.2 patchset for Oracle Database 11g Release 2, a write error to any data file will cause the database to perform a shutdown abort of the instance.
In prior releases, I/O errors to datafiles not in the system tablespace would offline the respective datafiles when the database is in archivelog mode. This behavior is not always desirable. Some customers would prefer that the instance crash due to a datafile write error. A new hidden parameter called _datafile_write_errors_crash_instance has been introduced to control whether the instance should crash on a write error or if the datafiles should be taken offline on a write error.
If _datafile_write_errors_crash_instance = TRUE (the default), then any write to a datafile which fails due to an I/O error causes an instance crash.
If _datafile_write_errors_crash_instance = FALSE, then the behavior reverts to the previous behavior (before this fix): a write error to a datafile offlines the file, provided the DB is in archivelog mode and the file is not in the SYSTEM tablespace (a write error to a SYSTEM tablespace file still aborts the instance).
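If the old behavior is preferred, the note's hidden parameter can be set back to FALSE. A minimal sketch, with the usual caveat that underscore parameters should only be changed under Oracle Support's guidance:

-- Revert to offlining datafiles on write errors (pre-11.2.0.2 behavior).
-- Underscore parameters must be double-quoted.
alter system set "_datafile_write_errors_crash_instance" = FALSE scope=spfile;
-- then restart the instance for the spfile change to take effect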
Best Practices: monitor temporary tablespace usage with alerts (a sample query is sketched below), and tune SQL to use less temporary space.
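A minimal alerting sketch against gv$sort_segment (the 80% threshold here is an illustrative assumption, not from the notes):

-- Percent of each temporary tablespace in use; returns rows only above the threshold.
select tablespace_name,
       round(sum(used_blocks) / nullif(sum(total_blocks), 0) * 100, 1) pct_used
  from gv$sort_segment
 group by tablespace_name
having sum(used_blocks) / nullif(sum(total_blocks), 0) * 100 > 80;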
- How to Release the Temp LOB Space and Avoid Hitting ORA-1652 (Doc ID 802897.1)
- How to Release Temporary LOB Segments without Closing the JDBC Connection (Doc ID 1384829.1)
- Bug 5723140 - Temp LOB space not released after commit (Doc ID 5723140.8)
- Temporary LOB space not released after commit: ora-1652 being hit (Doc ID 750209.1)
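For completeness, the application-side way to release a temporary LOB without waiting for the session to end (the subject of the first two notes above) is an explicit DBMS_LOB.FREETEMPORARY call; a minimal PL/SQL sketch:

-- Create, use, and explicitly free a temporary CLOB so its
-- temp segment is released immediately, not at session end.
declare
  c clob;
begin
  dbms_lob.createtemporary(c, true);           -- cache => true
  dbms_lob.writeappend(c, 11, 'hello world');  -- amount, buffer
  -- ... work with c ...
  dbms_lob.freetemporary(c);
end;
/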
In the above notes, the fix for the temp LOB space release problem is to set event 60025. Below is a simple test to demonstrate that this event works:
In session 1, execute the following statement:

select to_clob(text) from dba_source where rownum < 100;

In session 2, monitor session 1's temp space usage with:

SELECT S.sid || ',' || S.serial# sid_serial, S.username, S.osuser, P.spid,
       S.module, S.machine, S.program,
       SUM(T.blocks) * TBS.block_size / 1024 / 1024 mb_used,
       T.tablespace, T.segtype, COUNT(*) sort_ops
  FROM v$sort_usage T, v$session S, dba_tablespaces TBS, v$process P
 WHERE T.session_addr = S.saddr
   AND S.paddr = P.addr
   AND S.sid = 2927    -- this is the sid of session 1
   AND T.tablespace = TBS.tablespace_name
 GROUP BY S.sid, S.serial#, S.username, S.osuser, P.spid, S.module,
          S.machine, S.program, TBS.block_size, T.tablespace, T.segtype
 ORDER BY mb_used
/

Even after session 1 finishes and sits idle, session 2 will still show 1 MB of temp space allocated to it.

Second test: in session 1, before running that query, do:

alter session set events '60025 trace name context forever';

Then session 2 will show no temp space allocated once session 1 becomes idle.
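The ALTER SESSION form above only covers one session. Events can also be set instance-wide; a sketch of the two usual forms (whether 60025 takes effect for already-connected sessions when set dynamically is an assumption worth verifying on your release):

-- init.ora / pfile form, takes effect at the next instance start:
event="60025 trace name context forever"

-- or dynamically, for sessions that connect afterwards:
alter system set events '60025 trace name context forever';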
We did not set this event in our production system to fix the issue; instead, the dev team changed the code to avoid using temporary LOBs. To demonstrate the difference in temp space usage before the release, after the release, and after the fix, I queried the dba_hist_active_sess_history view. Starting with 11g, this view has a column called TEMP_SPACE_ALLOCATED, so it is possible to calculate the average temp space allocated per active session during a certain period of time:
===== Look at all active sessions between 16:00 ~ 16:59 in the past 30 days in dba_hist_active_sess_history
select stime, size_m / c avg_mb_sess
from
(
select to_char(sample_time, 'YYYY-MM-DD HH24') stime, count(*) c,
       round(sum(TEMP_SPACE_ALLOCATED) / 1024 / 1024) size_m
  from dba_hist_active_sess_history
 where to_char(sample_time, 'HH24') = '16' and sample_time > sysdate - 30
 group by to_char(sample_time, 'YYYY-MM-DD HH24')
)
order by 1
/
STIME AVG_MB_SESS
------------- -----------
2013-02-26 16 .00135459
2013-02-27 16 .002029264
2013-02-28 16 .000963298
2013-03-01 16 .000670166
2013-03-02 16 .001373417
2013-03-03 16 .00049838
2013-03-04 16 .000991616
2013-03-05 16 1.49810976
2013-03-06 16 1.12322684
2013-03-07 16 .000634719
2013-03-08 16 .001408054
2013-03-09 16 .081960706
2013-03-10 16 .000473485
2013-03-11 16 1.21929313
2013-03-12 16 .001113121
2013-03-13 16 .001084893
2013-03-14 16 .000864518
2013-03-15 16 1.70157371
2013-03-16 16 .114989041
2013-03-17 16 231.198656
2013-03-18 16 254.193626
2013-03-19 16 96.2858973
2013-03-20 16 93.731513
2013-03-21 16 135.284032
2013-03-22 16 1.462451
2013-03-23 16 .000423101
2013-03-24 16 .000462321
2013-03-25 16 .000395163
2013-03-26 16 16.1351924
2013-03-27 16 1.84635633
The release happened on the Saturday night of Mar 16, and the dev team completed the code fix on the night of Mar 21. As can be seen, from Mar 17 through Mar 21 the average temp space allocated per active session was much higher than before the release and after the fix. Between Mar 17 and Mar 21, whenever the total temp space usage reached 150 GB, we recycled the app servers in a controlled manner to release the temp space.
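For reference, the 150 GB check is a one-liner against v$sort_usage, in the same spirit as the monitoring query in the test above (a sketch; the 150 GB threshold was ours, not a general rule):

-- Total temp space currently allocated across all sessions, in GB.
select round(sum(t.blocks * ts.block_size) / 1024 / 1024 / 1024, 1) gb_used
  from v$sort_usage t, dba_tablespaces ts
 where t.tablespace = ts.tablespace_name;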