Friday, August 15, 2008

Lost of a redo log group due to DBA's mistake

Just few days before I joined the current DBA team, there was a serious incident that lead to a production database hung due to a DBA's mistake. I dig into the alert log of that database, trying to understand the sequence of event occurred for the incident. Here are my findings (actual database name changed to PRODDB):

(1) At this moment, the DBA issued a statement to add a redo log group consisting of one member.

----------------
Sat Aug 2 03:50:24 2008
alter database add logfile group 12 '/opt/app/prddbserv/oracle/PRODDB/arc/PRODDB0000124998_1_589897528.arc' size 3G
----------------

Note: The DBA used the archive log file name format and location for the online redo log file. He obviously confused online redo log files with archived log files. The following query shows that at 03:30, the real archive log file "PRODDB0000124998_1_589897528.arc" corresponding to sequence# 124998 did not exist at the location before he issued the statement.

SQL> select sequence#, completion_time from v$archived_log where completion_time between sysdate-12 and sysdate -10


SEQUENCE# COMPLETION_TIME
---------- --------------------
124984 2008-AUG-02 00:50:32
124985 2008-AUG-02 02:12:26
124986 2008-AUG-02 03:02:46
124987 2008-AUG-02 03:43:32
124992 2008-AUG-02 04:06:38
124988 2008-AUG-02 04:07:19
124991 2008-AUG-02 04:08:19
124989 2008-AUG-02 04:35:42
124990 2008-AUG-02 04:38:58
124993 2008-AUG-02 04:50:42
124994 2008-AUG-02 04:51:13
124995 2008-AUG-02 04:51:41
124996 2008-AUG-02 04:54:50
124997 2008-AUG-02 05:17:21
124999 2008-AUG-02 05:35:23
124998 2008-AUG-02 05:39:57
125001 2008-AUG-02 05:49:09
125002 2008-AUG-02 05:50:44
125003 2008-AUG-02 05:51:48

(2) At this moment, all online logs needed archiving, swiching logfile can not proceed, so database was hung.

------
Sat Aug 2 03:55:49 2008
Thread 1 cannot allocate new log, sequence 124999
All online logs needed archiving
Current log# 6 seq# 124998 mem# 0: /opt/app/prddbserv/oracle/PRODDB/NEW_redo2/redoPRODDB_F1.dbf
Current log# 6 seq# 124998 mem# 1: /opt/app/prddbserv/oracle/PRODDB/NEW_redo4/redoPRODDB_F2.dbf
------
Note: This is why the DBA wanted to add new online redo log groups at the first place.

(3) At this moment, log switch occurred
------
Sat Aug 2 04:06:39 2008
Beginning log switch checkpoint up to RBA [0x1e847.2.10], SCN: 10077378246296
Thread 1 advanced to log sequence 124999
Current log# 9 seq# 124999 mem# 0: /opt/app/prddbserv/oracle/PRODDB/NEW_redo1/redoPRODDB_I1.dbf
Current log# 9 seq# 124999 mem# 1: /opt/app/prddbserv/oracle/PRODDB/NEW_redo3/redoPRODDB_I2.dbf
------
(4) At this moment, adding logfile completed:
------
Sat Aug 2 04:11:13 2008
Completed: alter database add logfile group 12 '/opt/app/prddbserv/oracle/PRODDB/arc/PRODDB0000124998_1_589897528.arc'
size 3G
------

(5) At this moment, log switch occurred and the newly added logfile became the current log
------
Sat Aug 2 04:16:59 2008
Beginning log switch checkpoint up to RBA [0x1e848.2.10], SCN: 10077390512906
Thread 1 advanced to log sequence 125000
Current log# 12 seq# 125000 mem# 0: /opt/app/prddbserv/oracle/PRODDB/arc/PRODDB0000124998_1_589897528.arc
------

Note: From the seq# 125000, we can know when Oracle archives this online redo log with the wrong name, the
generated archived log will be "/opt/app/prddbserv/oracle/PRODDB/arc/PRODDB000012500_1_589897528.arc"

(6) At this moment, log switch again
------
Sat Aug 2 05:10:20 2008
Beginning log switch checkpoint up to RBA [0x1e849.2.10], SCN: 10077430464587
Thread 1 advanced to log sequence 125001
Current log# 7 seq# 125001 mem# 0: /opt/app/prddbserv/oracle/PRODDB/NEW_redo1/redoPRODDB_G1.dbf
Current log# 7 seq# 125001 mem# 1: /opt/app/prddbserv/oracle/PRODDB/NEW_redo3/redoPRODDB_G2.dbf
------

(7) At this moment, after server log switch, Oracle advanced to log sequence 125003.
And more importantly, from the query in (1), we know Oracle completed an archive log with the name
"/opt/app/prddbserv/oracle/PRODDB/arc/PRODDB0000124998_1_589897528.arc" at 05:39:57.
The real archive log overwrote the online redo log with the same name created by the DBA


Note: 124998 2008-AUG-02 05:39:57


(8)ARC1 throw error at this time, notice log 12 sequence# 125000 cannot be archived,

------
Sat Aug 2 05:28:54 2008
ARC1: Log corruption near block 632833 change 10077402450354 time ?
Sat Aug 2 05:28:54 2008
Errors in file /opt/app/prddbserv/oracle/PRODDB/logs/bdump/p6nbertg_arc1_354.trc:
ORA-00354: corrupt redo log block header
ORA-00353: log corruption near block 632833 change 10077402450354 time 08/02/2008 04:16:58
ORA-00312: online log 12 thread 1: '/opt/app/prddbserv/oracle/PRODDB/arc/PRODDB0000124998_1_589897528.arc'
ARC1: All Archive destinations made inactive due to error 354
------

Note: because at this time '/opt/app/prddbserv/oracle/PRODDB/arc/PRODDB0000124998_1_589897528.arc' was no longer online log, instead it was a real archived log.


(9) From this point on, the Database essentially was not funtioning and throw errors into alert log reaptedly.

Below prensented some alert log messages during the the fixing process:

(1) Shutdown the archive processes

-----
Sat Aug 2 10:02:51 2008
Shutting down archive processes
-----

(2) Modify # of archive processes
-----
Sat Aug 2 10:02:51 2008
ALTER SYSTEM SET log_archive_max_processes=1 SCOPE=MEMORY;
-----

Note: from the timestamp (1) and (2) occurs at the same time, should be from one command

(3) Oracle gives WARNING message

-----
Sat Aug 2 10:02:54 2008
WARNING! CLEARING REDO LOG WHICH HAS NOT BEEN ARCHIVED. BACKUPS TAKEN
BEFORE 08/02/2008 05:10:19 (CHANGE 10077430464587) CANNOT BE USED FOR RECOVERY.
Clearing online log 12 of thread 1 sequence number 125000
-----

(4)
Sat Aug 2 10:03:48 2008
ARCH shutting down
ARC1: Archival stopped

Sat Aug 2 10:04:14 2008
Beginning log switch checkpoint up to RBA [0x1e857.2.10], SCN: 10077532552587
Thread 1 advanced to log sequence 125015
Current log# 12 seq# 125015 mem# 0: /opt/app/prddbserv/oracle/PRODDB/arc/PRODDB0000124998_1_589897528.arc

(5)
Sat Aug 2 10:04:14 2008
Completed: alter database clear unarchived logfile group 12

Sat Aug 2 10:05:09 2008
archivelog header validation failure for file /opt/app/prddbserv/oracle/PRODDB/arc/PRODDB0000124998_1_589897528.arc
archivelog header validation failure for file /opt/app/prddbserv/oracle/PRODDB/arc/PRODDB0000124998_1_589897528.arc

(6)
Sat Aug 2 10:05:24 2008
ARCH: STARTING ARCH PROCESSES
ARC1: Archival started
ARC2: Archival started
ARC3: Archival started
ARCH: STARTING ARCH PROCESSES COMPLETE


Sat Aug 2 10:05:25 2008
ALTER SYSTEM SET log_archive_max_processes=6 SCOPE=MEMORY;
ARC3 started with pid=51, OS id=11043

(7)
Sat Aug 2 10:15:48 2008
alter database drop logfile group 12
Sat Aug 2 10:15:48 2008
Completed: alter database drop logfile group 12

No comments: