Tuesday, January 26, 2010

NetApp Snapshot backup and recovery for Oracle databases

In my current working environment, we use NetApp Snapshot technology to backup some of  our Oracle databases. I have recently participated in a Snapshot backup and recovery test for three production databases that were migrated to new servers. In the tests, the Storage Admin team performed Snapshot backup and restore, the Unix Admin team performed diskgroup deport and import operations, and we, DBAs, were responsible for the database  recovery.

The databases have file layout in such a way that all of  data files and control files is in the volumn /db1,  online redo log files are in the volumn /db1redo, and archived logs are saved in volumn  /db1arch. The Snapshot backups are performed on all three volumns mentioned. We tested two recovery scenarios:

1. Restore /db1 and /db1redo
2. Restore /db1 only

I think if you are an experienced DBA, you will immediately question the practise of backup and restore of online redo logs. You are right. I had doubts at the time. Today, I have done simulations of these two scenarios in a test envrinment. I have been more clear that we should not do that at all.

Below I will describe the test cases in more details, which are the simulations of what we did in the Snapshot backup and recoery test in the production servers a few days ago.

Case 1 - Restore data files, control files and redo log files (e.g. /db1 and /db1redo)
 
1. Perform hotbackup
    #> hot_backp.sh
    Note: script see Appendix

2. Create a table with one row and switch logfile
    SQL> drop table system.t;
    SQL> create table system.t as select * from all_tables where rownum <=1;
    SQL> alter system switch logfile;

3. Insert the second row and switch logfile
   SQL> insert into system.t select * from all_tables where rownum <=1;
   SQL> commit;
   SQL> alter system switch logfile;

4. Insert the third row and switch logfile

5. Insert the forth row WITHOUT switch logfile
    At this point, we should see total 4 rows in the table t:
    sys@DBWRKEV1> select count(*) from system.t ;
    COUNT(*)
     ---------
     4

6. Delete all data files to simulate media failure, shutdown instance
     rm /db1/u01/oradata/dbwrkev1/system01.dbf
     rm /db1/u04/oradata/dbwrkev1/undotbs01.dbf
     rm /db1/u02/oradata/dbwrkev1/sysaux01.dbf
     rm /db1/u02/oradata/dbwrkev1/users01.dbf

7. Restore data files, control files and redo log file
   #> restore_data.sh
   #> restore_ctl.sh
   #> restore_redo.sh

8. Recover database
   After mount the database, I did:

    idle> recover database;
    Media recovery complete.
    idle> alter database open;
    Database altered.

    So here we observed the same in the real test a few days ago, we thought we did a complete recovery, but really?

     idle> select count(*) from system.t;
              select count(*) from system.t
                *
               ERROR at line 1:
              ORA-00942: table or view does not exist

   We can see the table t does not exist at all.

Case 2 Restore data file and control files (e.g. /db1)

Note: due to our data files and control files are both in /db1, with Snapshot restore we can not restore only one of them. This is one of the disavantages of NetApp Snapshot as I can see compared to RMAN backup and restore.

1-6. same as in case 1

7. Restore data files and control files
    #> restore_data.sh
    #> restore_ctl.sh

8. Recover database
   After mount the database, I did:

idle> recover database until cancel using backup controlfile;
ORA-00279: change 10467469061882 generated at 01/26/2010 10:45:48 needed for thread 1
ORA-00289: suggestion : /db1/archive/dbwrkev1/dbwrkev1_1_27_708088679.arc
ORA-00280: change 10467469061882 for thread 1 is in sequence #27
 
 
Specify log: {<RET>=suggested | filename | AUTO | CANCEL}
 
ORA-00279: change 10467469062192 generated at 01/26/2010 10:51:40 needed for thread 1
ORA-00289: suggestion : /db1/archive/dbwrkev1/dbwrkev1_1_28_708088679.arc
ORA-00280: change 10467469062192 for thread 1 is in sequence #28
ORA-00278: log file '/db1/archive/dbwrkev1/dbwrkev1_1_27_708088679.arc' no longer needed for this recovery
 
 
Specify log: {<RET>=suggested | filename | AUTO | CANCEL}
 
ORA-00279: change 10467469064362 generated at 01/26/2010 11:43:01 needed for thread 1
ORA-00289: suggestion : /db1/archive/dbwrkev1/dbwrkev1_1_29_708088679.arc
ORA-00280: change 10467469064362 for thread 1 is in sequence #29
ORA-00278: log file '/db1/archive/dbwrkev1/dbwrkev1_1_28_708088679.arc' no longer needed for this recovery
 
 
Specify log: {<RET>=suggested | filename | AUTO | CANCEL}
 
ORA-00279: change 10467469064390 generated at 01/26/2010 11:43:54 needed for thread 1
ORA-00289: suggestion : /db1/archive/dbwrkev1/dbwrkev1_1_30_708088679.arc
ORA-00280: change 10467469064390 for thread 1 is in sequence #30
ORA-00278: log file '/db1/archive/dbwrkev1/dbwrkev1_1_29_708088679.arc' no longer needed for this recovery
 
 
Specify log: {<RET>=suggested | filename | AUTO | CANCEL}
 
ORA-00279: change 10467469064400 generated at 01/26/2010 11:44:17 needed for thread 1
ORA-00289: suggestion : /db1/archive/dbwrkev1/dbwrkev1_1_31_708088679.arc
ORA-00280: change 10467469064400 for thread 1 is in sequence #31
ORA-00278: log file '/db1/archive/dbwrkev1/dbwrkev1_1_30_708088679.arc' no longer needed for this recovery
 
 
Specify log: {<RET>=suggested | filename | AUTO | CANCEL}
 
ORA-00308: cannot open archived log '/db1/archive/dbwrkev1/dbwrkev1_1_31_708088679.arc'
ORA-27037: unable to obtain file status
SVR4 Error: 2: No such file or directory
Additional information: 3
 
 
idle> recover database until cancel using backup controlfile;
ORA-00279: change 10467469064400 generated at 01/26/2010 11:44:17 needed for thread 1
ORA-00289: suggestion : /db1/archive/dbwrkev1/dbwrkev1_1_31_708088679.arc
ORA-00280: change 10467469064400 for thread 1 is in sequence #31
 
 
Specify log: {<RET>=suggested | filename | AUTO | CANCEL}
cancel
Media recovery cancelled.
idle> alter database open resetlogs;
 
Database altered.
 
idle> select count(*) from system.t;
 
  COUNT(*)
----------
        3


So, obviously we lost the changes recorded in the online redo log. This is an imcomplete recvoery, corresponding to lost all control files and (some of) data files, but online redo logs are intact. However, if we know the current online redo log at the time of media failure ( We can find it out in the alert log file), we probably can recover more.

For example, in my test, I knew the current online redo logs are those of group 2. So I did:

#> cp redo02a.log /db1/archive/dbwrkev1/dbwrkev1_1_31_708088679.arc

Then I repeated the steps 6,7,8 as in the test case 2, let's see what we got:

idle> recover database until cancel using backup controlfile;
ORA-00279: change 10467469061882 generated at 01/26/2010 10:45:48 needed for thread 1
ORA-00289: suggestion : /db1/archive/dbwrkev1/dbwrkev1_1_27_708088679.arc
ORA-00280: change 10467469061882 for thread 1 is in sequence #27
 
 
Specify log: {<RET>=suggested | filename | AUTO | CANCEL}
 
ORA-00279: change 10467469062192 generated at 01/26/2010 10:51:40 needed for thread 1
ORA-00289: suggestion : /db1/archive/dbwrkev1/dbwrkev1_1_28_708088679.arc
ORA-00280: change 10467469062192 for thread 1 is in sequence #28
ORA-00278: log file '/db1/archive/dbwrkev1/dbwrkev1_1_27_708088679.arc' no longer needed for this recovery
 
 
Specify log: {<RET>=suggested | filename | AUTO | CANCEL}
 
ORA-00279: change 10467469064362 generated at 01/26/2010 11:43:01 needed for thread 1
ORA-00289: suggestion : /db1/archive/dbwrkev1/dbwrkev1_1_29_708088679.arc
ORA-00280: change 10467469064362 for thread 1 is in sequence #29
ORA-00278: log file '/db1/archive/dbwrkev1/dbwrkev1_1_28_708088679.arc' no longer needed for this recovery
 
 
Specify log: {<RET>=suggested | filename | AUTO | CANCEL}
 
ORA-00279: change 10467469064390 generated at 01/26/2010 11:43:54 needed for thread 1
ORA-00289: suggestion : /db1/archive/dbwrkev1/dbwrkev1_1_30_708088679.arc
ORA-00280: change 10467469064390 for thread 1 is in sequence #30
ORA-00278: log file '/db1/archive/dbwrkev1/dbwrkev1_1_29_708088679.arc' no longer needed for this recovery
 
 
Specify log: {<RET>=suggested | filename | AUTO | CANCEL}
 
ORA-00279: change 10467469064400 generated at 01/26/2010 11:44:17 needed for thread 1
ORA-00289: suggestion : /db1/archive/dbwrkev1/dbwrkev1_1_31_708088679.arc
ORA-00280: change 10467469064400 for thread 1 is in sequence #31
ORA-00278: log file '/db1/archive/dbwrkev1/dbwrkev1_1_30_708088679.arc' no longer needed for this recovery
 
 
Specify log: {<RET>=suggested | filename | AUTO | CANCEL}
 
Log applied.
Media recovery complete.
idle> alter database open resetlogs;
 
Database altered.
 
idle> select count(*) from system.t;
 
  COUNT(*)
----------
         4
 
Bingo! It seemed I did a complete recovery!

[Jan 27,2010 updated: this post shows another way to do a complete revcovery in this situation]

Another thing I want to point out is that if we only lost data files whereas the control files and online redo logs are intact, we can do a complete recovery easily as long as we have archived logs. However, in our envionment when we restore snapshot backup, we also overwrite the current control files. So we must be cautious and it is thus advisable to copy the intact control files to some other location. It is probaly a good idea that we should not place all the control files in the same volumn as data files.

BTW, the information in the following two links gave me some ideas about this topic.

http://freekdhooge.wordpress.com/2007/12/03/disaster-recovery-troubles/
http://download.oracle.com/docs/cd/B10501_01/server.920/a96519/strategy.htm#1004884


Appendix - Scripts used:

#!/bin/ksh
# script:  hot_backup.sh
#
sqlplus "/ as sysdba" <<EOF
alter tablespace UNDOTBS1 begin backup;
alter tablespace USERS    begin backup;
alter tablespace SYSTEM   begin backup;
alter tablespace SYSAUX   begin backup;
EOF
echo " -- tablespace begin backup mode "

cp /db1/u01/oradata/dbwrkev1/system01.dbf    /db1/backup
cp /db1/u04/oradata/dbwrkev1/undotbs01.dbf   /db1/backup
cp /db1/u02/oradata/dbwrkev1/sysaux01.dbf    /db1/backup
cp /db1/u02/oradata/dbwrkev1/users01.dbf     /db1/backup
cp /db1/u05/archive/dbwrkev1/redo03a.log     /db1/backup
cp /db1/u06/archive/dbwrkev1/redo03b.log     /db1/backup
cp /db1/u05/archive/dbwrkev1/redo02a.log     /db1/backup
cp /db1/u06/archive/dbwrkev1/redo02b.log     /db1/backup
cp /db1/u05/archive/dbwrkev1/redo01a.log     /db1/backup
cp /db1/u06/archive/dbwrkev1/redo01b.log     /db1/backup
cp /db1/u01/oradata/dbwrkev1/control01.ctl   /db1/backup 
cp /db1/u02/oradata/dbwrkev1/control02.ctl   /db1/backup 
cp /db1/u03/oradata/dbwrkev1/control03.ctl   /db1/backup


echo " -- data file, control file and redo log backup done "

sqlplus "/ as sysdba" <<EOF
alter tablespace UNDOTBS1 end backup;
alter tablespace USERS    end backup;
alter tablespace SYSTEM   end backup;
alter tablespace SYSAUX   end backup;
EOF

echo " -- tablespace end backup mode "

sqlplus "/ as sysdba" <<EOF
alter system archive log current;
alter database backup controlfile to '/tmp/backupctrl.ctl' reuse;
EOF
echo " -- archive current redo log and backup controlfile done"

#!/bin/ksh
# script: restore_data.sh
#

cp /db1/backup/system01.dbf  /db1/u01/oradata/dbwrkev1 
cp /db1/backup/undotbs01.dbf /db1/u04/oradata/dbwrkev1 
cp /db1/backup/sysaux01.dbf  /db1/u02/oradata/dbwrkev1 
cp /db1/backup/users01.dbf   /db1/u02/oradata/dbwrkev1 

#!/bin/ksh
#   restore_ctl.sh 
#

cp /db1/backup/control01.ctl /db1/u01/oradata/dbwrkev1   
cp /db1/backup/control02.ctl /db1/u02/oradata/dbwrkev1   
cp /db1/backup/control03.ctl /db1/u03/oradata/dbwrkev1   

#!/bin/ksh
# script: restore_redo.sh
#
cp /db1/backup/redo03a.log   /db1/u05/archive/dbwrkev1    
cp /db1/backup/redo03b.log   /db1/u06/archive/dbwrkev1    
cp /db1/backup/redo02a.log   /db1/u05/archive/dbwrkev1    
cp /db1/backup/redo02b.log   /db1/u06/archive/dbwrkev1     
cp /db1/backup/redo01a.log   /db1/u05/archive/dbwrkev1     
cp /db1/backup/redo01b.log   /db1/u06/archive/dbwrkev1  


No comments: