Saturday, August 23, 2008

Relink Oracle Binaries

In my currunt working environment, it has been made a policy that DBA should relink Oracle binaries every time that OS patch is performed. I have already supported this task several times.

The general steps to relink Oracle is as follows:

1. Logon to the server

2. Shutdown all the databases:

3. Shutdown the listener:
lsnrctl stop listener_name

4. Relink Oracle binaries:
cd $OH/bin (or cd $ORACLE_HOME/bin)
relink all>logfile_name

Note: if databases have different ORALCE_HOME, each of them needs to be relinked.

5. Startup the databases

5.1 If see error while connecting sqlplus as /bin/ksh: sqlplus: not found
then run this command and it will fix the issue.

cd $OH/sqlplus/lib
make -f ins_sqlplus.mk install


Note: before issuing this command, I found that I have to move back the sqlplus otherwise the command will fail:
mv -f /usr/local/oracle/10.2.0/bin/sqlplusO /usr/local/oracle/10.2.0/bin/sqlplus

The following shows what has happened when issuing the command last time:



oracle@whpssc1a(162) P0SSCW /usr/local/oracle/10.2.0/sqlplus/lib

$ make -f ins_sqlplus.mk install

cp -f /usr/local/oracle/10.2.0/lib/libsqlplus.sl \
/usr/local/oracle/10.2.0/sqlplus/lib/libsqlplus.sl
chmod 755 /usr/local/oracle/10.2.0/bin/helpins
chmod 755 /usr/local/oracle/10.2.0/bin/pupbld
mv -f /usr/local/oracle/10.2.0/bin/sqlplus /usr/local/oracle/10.2.0/bin/sqlplusO
mv -f /usr/local/oracle/10.2.0/sqlplus/bin/sqlplus /usr/local/oracle/10.2.0/bin/sqlplus
chmod 751 /usr/local/oracle/10.2.0/bin/sqlplus
rm -f /usr/local/oracle/10.2.0/sqlplus/lib/libsqlplus.sl



after this connect again as
sqlplus ‘/ as sysdba’

startup

7. Start the listener:
lsnrctl start listener_name

Tuesday, August 19, 2008

Trouble to startup a database: ORA-27102: out of memory

Last night, we had a task to bounce a database. I shut it down without any problem. Then I enter 'startup' in the SQL* Plus command line. However, the Oracle seemed stucking there forever. Per instruction by the team leader, in another session, I used 'shutdown abort' and cleanup any background processes at OS level as well. Then I tried to startup the database again , but recieving the following error:


SQL>startup

ORA-27102: out of memory

HP-UX Error: 12: Not enough space


We tried couple of thing before contacting primary DBA, including

1) Commenting out the large_pool and db_keep_cache_size ( ~ 100G) and tried to startup and it did not work.

2) We brought down another database on the same server to clear up some memory but this also could not startup the database either.


At last, we contact the primary DBA and with his help we found that there were so many active memory segments with ‘no-entry’. See the following for example:


$ ipcs -am grep oracle grep no-entry
m 524297 0x00000000 --rw-r----- oracle dba oracle dba 1 33554432 8330 8330 1:06:21 no-entry 1:06:21
m 327718 0x00000000 D-rw-r----- oracle dba oracle dba 1 55834574848 8330 8330 1:06:21 no-entry 1:16:57
m 32702521 0x00000000 D-rw-r----- oracle dba oracle dba 1 3506438144 8330 8330 1:06:21 no-entry 1:16:57
m 70254652 0x00000000 D-rw-r----- oracle dba oracle dba 1 6559891456 8330 8330 1:06:21 no-entry 1:16:57


We removed them manually, ie by issuing the command:
$ ipcrm -m 524297


We then started the database and it came up. Again we brought the database down in order to see if we can startup the database with the original parameter file (pfile). The same issue with same error massage came back up. When checked, the no-entry active memory segments were back there again. This time we were not able to even remove them. But in the mean time we were able to identify one background process associated with the database:


$> ps -ef grep P2BL47A
oracle 8330 1 252 01:06:19 ? 21:27 oracleP2BL47A (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))


We killed that process and after that all the no-entry active memory segments were also gone. We reduced the db_keep_cache_size to 55G from 100G and started the database and it worked this time and database came up fine.


The primary DBA will investigate this issue further.

Sunday, August 17, 2008

Archived redo log destination filling up

What can you do while the archived redo log destination gets filling up? One of the clients I worked before has set up a cron job to monitor the space and automatically move the old archived redo logs to the second location if certain threshold reaches. The script also does the cleaup for those acrhived logs older than three days ago. The client adopted NetApp snapshot for backup solution.

The current client I am working for uses different strategy. They also monitor the space, however instead of moving the arhived log, they change the archived redo log destination dynamically when the threshold reaches. In addtion, there is a rman job runing every 10-15 min to backup archived redo logs to tape and cleanup the logs.

When I try to test the syntax of changing archived redo log destination dynamically, I get confused by Oracle online documentation. After searching internet and test a little while, I have learned that the 'location' keyword is necessary and no space inside the quotation mark is allowed. While Oracle documention shows the space there. Examples are shown as follows:


- missing 'location' keyword

sys@DB10G> alter system set log_archive_dest_1='C:\temp';
alter system set log_archive_dest_1='C:\temp'
*
ERROR at line 1:
ORA-32017: failure in updating SPFILE
ORA-16179: incremental changes to "log_archive_dest_1" not allowed with SPFILE


- correct syntax

sys@DB10G> alter system set log_archive_dest_1='location=C:\temp';

System altered.

- space can not appear inside the quotation mark

sys@DB10G> alter system set log_archive_dest_1='location = C:\temp';
alter system set log_archive_dest_1='location = C:\temp'
*
ERROR at line 1:
ORA-32017: failure in updating SPFILE
ORA-16179: incremental changes to "log_archive_dest_1" not allowed with SPFILE


sys@DB10G> alter system set log_archive_dest_1='location = C:\';
alter system set log_archive_dest_1='location = C:\'
*
ERROR at line 1:
ORA-32017: failure in updating SPFILE
ORA-16179: incremental changes to "log_archive_dest_1" not allowed with SPFILE


- This is correct

sys@DB10G> alter system set log_archive_dest_1='location=C:\';

System altered.

Friday, August 15, 2008

Lost of a redo log group due to DBA's mistake

Just few days before I joined the current DBA team, there was a serious incident that lead to a production database hung due to a DBA's mistake. I dig into the alert log of that database, trying to understand the sequence of event occurred for the incident. Here are my findings (actual database name changed to PRODDB):

(1) At this moment, the DBA issued a statement to add a redo log group consisting of one member.

----------------
Sat Aug 2 03:50:24 2008
alter database add logfile group 12 '/opt/app/prddbserv/oracle/PRODDB/arc/PRODDB0000124998_1_589897528.arc' size 3G
----------------

Note: The DBA used the archive log file name format and location for the online redo log file. He obviously confused online redo log files with archived log files. The following query shows that at 03:30, the real archive log file "PRODDB0000124998_1_589897528.arc" corresponding to sequence# 124998 did not exist at the location before he issued the statement.

SQL> select sequence#, completion_time from v$archived_log where completion_time between sysdate-12 and sysdate -10


SEQUENCE# COMPLETION_TIME
---------- --------------------
124984 2008-AUG-02 00:50:32
124985 2008-AUG-02 02:12:26
124986 2008-AUG-02 03:02:46
124987 2008-AUG-02 03:43:32
124992 2008-AUG-02 04:06:38
124988 2008-AUG-02 04:07:19
124991 2008-AUG-02 04:08:19
124989 2008-AUG-02 04:35:42
124990 2008-AUG-02 04:38:58
124993 2008-AUG-02 04:50:42
124994 2008-AUG-02 04:51:13
124995 2008-AUG-02 04:51:41
124996 2008-AUG-02 04:54:50
124997 2008-AUG-02 05:17:21
124999 2008-AUG-02 05:35:23
124998 2008-AUG-02 05:39:57
125001 2008-AUG-02 05:49:09
125002 2008-AUG-02 05:50:44
125003 2008-AUG-02 05:51:48

(2) At this moment, all online logs needed archiving, swiching logfile can not proceed, so database was hung.

------
Sat Aug 2 03:55:49 2008
Thread 1 cannot allocate new log, sequence 124999
All online logs needed archiving
Current log# 6 seq# 124998 mem# 0: /opt/app/prddbserv/oracle/PRODDB/NEW_redo2/redoPRODDB_F1.dbf
Current log# 6 seq# 124998 mem# 1: /opt/app/prddbserv/oracle/PRODDB/NEW_redo4/redoPRODDB_F2.dbf
------
Note: This is why the DBA wanted to add new online redo log groups at the first place.

(3) At this moment, log switch occurred
------
Sat Aug 2 04:06:39 2008
Beginning log switch checkpoint up to RBA [0x1e847.2.10], SCN: 10077378246296
Thread 1 advanced to log sequence 124999
Current log# 9 seq# 124999 mem# 0: /opt/app/prddbserv/oracle/PRODDB/NEW_redo1/redoPRODDB_I1.dbf
Current log# 9 seq# 124999 mem# 1: /opt/app/prddbserv/oracle/PRODDB/NEW_redo3/redoPRODDB_I2.dbf
------
(4) At this moment, adding logfile completed:
------
Sat Aug 2 04:11:13 2008
Completed: alter database add logfile group 12 '/opt/app/prddbserv/oracle/PRODDB/arc/PRODDB0000124998_1_589897528.arc'
size 3G
------

(5) At this moment, log switch occurred and the newly added logfile became the current log
------
Sat Aug 2 04:16:59 2008
Beginning log switch checkpoint up to RBA [0x1e848.2.10], SCN: 10077390512906
Thread 1 advanced to log sequence 125000
Current log# 12 seq# 125000 mem# 0: /opt/app/prddbserv/oracle/PRODDB/arc/PRODDB0000124998_1_589897528.arc
------

Note: From the seq# 125000, we can know when Oracle archives this online redo log with the wrong name, the
generated archived log will be "/opt/app/prddbserv/oracle/PRODDB/arc/PRODDB000012500_1_589897528.arc"

(6) At this moment, log switch again
------
Sat Aug 2 05:10:20 2008
Beginning log switch checkpoint up to RBA [0x1e849.2.10], SCN: 10077430464587
Thread 1 advanced to log sequence 125001
Current log# 7 seq# 125001 mem# 0: /opt/app/prddbserv/oracle/PRODDB/NEW_redo1/redoPRODDB_G1.dbf
Current log# 7 seq# 125001 mem# 1: /opt/app/prddbserv/oracle/PRODDB/NEW_redo3/redoPRODDB_G2.dbf
------

(7) At this moment, after server log switch, Oracle advanced to log sequence 125003.
And more importantly, from the query in (1), we know Oracle completed an archive log with the name
"/opt/app/prddbserv/oracle/PRODDB/arc/PRODDB0000124998_1_589897528.arc" at 05:39:57.
The real archive log overwrote the online redo log with the same name created by the DBA


Note: 124998 2008-AUG-02 05:39:57


(8)ARC1 throw error at this time, notice log 12 sequence# 125000 cannot be archived,

------
Sat Aug 2 05:28:54 2008
ARC1: Log corruption near block 632833 change 10077402450354 time ?
Sat Aug 2 05:28:54 2008
Errors in file /opt/app/prddbserv/oracle/PRODDB/logs/bdump/p6nbertg_arc1_354.trc:
ORA-00354: corrupt redo log block header
ORA-00353: log corruption near block 632833 change 10077402450354 time 08/02/2008 04:16:58
ORA-00312: online log 12 thread 1: '/opt/app/prddbserv/oracle/PRODDB/arc/PRODDB0000124998_1_589897528.arc'
ARC1: All Archive destinations made inactive due to error 354
------

Note: because at this time '/opt/app/prddbserv/oracle/PRODDB/arc/PRODDB0000124998_1_589897528.arc' was no longer online log, instead it was a real archived log.


(9) From this point on, the Database essentially was not funtioning and throw errors into alert log reaptedly.

Below prensented some alert log messages during the the fixing process:

(1) Shutdown the archive processes

-----
Sat Aug 2 10:02:51 2008
Shutting down archive processes
-----

(2) Modify # of archive processes
-----
Sat Aug 2 10:02:51 2008
ALTER SYSTEM SET log_archive_max_processes=1 SCOPE=MEMORY;
-----

Note: from the timestamp (1) and (2) occurs at the same time, should be from one command

(3) Oracle gives WARNING message

-----
Sat Aug 2 10:02:54 2008
WARNING! CLEARING REDO LOG WHICH HAS NOT BEEN ARCHIVED. BACKUPS TAKEN
BEFORE 08/02/2008 05:10:19 (CHANGE 10077430464587) CANNOT BE USED FOR RECOVERY.
Clearing online log 12 of thread 1 sequence number 125000
-----

(4)
Sat Aug 2 10:03:48 2008
ARCH shutting down
ARC1: Archival stopped

Sat Aug 2 10:04:14 2008
Beginning log switch checkpoint up to RBA [0x1e857.2.10], SCN: 10077532552587
Thread 1 advanced to log sequence 125015
Current log# 12 seq# 125015 mem# 0: /opt/app/prddbserv/oracle/PRODDB/arc/PRODDB0000124998_1_589897528.arc

(5)
Sat Aug 2 10:04:14 2008
Completed: alter database clear unarchived logfile group 12

Sat Aug 2 10:05:09 2008
archivelog header validation failure for file /opt/app/prddbserv/oracle/PRODDB/arc/PRODDB0000124998_1_589897528.arc
archivelog header validation failure for file /opt/app/prddbserv/oracle/PRODDB/arc/PRODDB0000124998_1_589897528.arc

(6)
Sat Aug 2 10:05:24 2008
ARCH: STARTING ARCH PROCESSES
ARC1: Archival started
ARC2: Archival started
ARC3: Archival started
ARCH: STARTING ARCH PROCESSES COMPLETE


Sat Aug 2 10:05:25 2008
ALTER SYSTEM SET log_archive_max_processes=6 SCOPE=MEMORY;
ARC3 started with pid=51, OS id=11043

(7)
Sat Aug 2 10:15:48 2008
alter database drop logfile group 12
Sat Aug 2 10:15:48 2008
Completed: alter database drop logfile group 12

Tuesday, August 12, 2008

Alert log error message related to listener down on a remote host

The alert log of an Oracle database is an indenspensible source for diagnosing many problems. Any error messages from it deserve investigation proactively. Last night I had a chance to expose myself of some views that I am not very familar with during the process of understanding alert log error messages.

The error messages from the alert log I recieved in email were as follows:

------
proddb12 on bspdm15a : Mon Aug 11 21:50:19 PDT 2008 Message from red_alert ...

Below are ORA- errors found in alert log (/opt/app/p1fwd1d2/oracle/admin/proddb12/bdump/alert_proddb12.log). See log for more information.

ORA-12012: error on auto execute of job 171
ORA-12541: TNS:no listener
ORA-12012: error on auto execute of job 183
ORA-12541: TNS:no listener
ORA-12012: error on auto execute of job 170

-----

To see what those jobs are, I query the dba_jobs view:

select job, what from dba_jobs
where job in (171,183,170)
/

JOB
----------
WHAT
--------------------------------------------------------------------------------
183
dbms_refresh.refresh('"GOLD_GPRS"."BORDER_PLMN"');

170
dbms_refresh.refresh('"GOLD_GPRS"."DG_CATEGORY"');

171
dbms_refresh.refresh('"GOLD_GPRS"."DG_SPEC_GROUP"');



From the above results, it can be concluded that those jobs are releated to snaphot feature.I subsequently checked the dba_refresh and dba_snapshots views, and was able to find out the refresh group those jobs are responsible for and the dblink that the refresh group used:

1 select refgroup, job from dba_refresh
2* where job in (183,170,171)
SQL> /

REFGROUP JOB
---------- ----------
21 183
11 170
12 171



SQL> ;
1* select name, refresh_group, master_link from dba_snapshots
SQL> /

NAME REFRESH_GROUP MASTER_LINK
-------------------- ------------- ----------------------------------------
...
BORDER_PLMN 21 @PRODDB_GOLD_GPRS_LINK.EDC.COMPANY.NET
DG_CATEGORY 11 @PRODDB_GOLD_GPRS_LINK.EDC.COMPANY.NET
DG_SPEC_GROUP 12 @PRODDB_GOLD_GPRS_LINK.EDC.COMPANY.NET
...


Querying the dba_db_links, got the following:

select db_link, host from dba_db_links
/

DB_LINK HOST
------------------------------ ----------
PROD.EDC.COMPANY.NET proddb
SDB.EDC.COMPANY.NET p3fwdar
PRODDB_GOLD_GPRS_LINK.EDC.COMPANY.NET PRODDB
P0FUSWKS.WDC.COMPANY.NET p0fuswks
PRODDB_CSGNOR_LINK.EDC.COMPANT.NET PRODDB


Finally checked the availability of PRODDB by tnsping:

$ tnsping proddb

TNS Ping Utility for Solaris: Version 10.2.0.4.0 - Production on 11-AUG-2008 22:41:41

Copyright (c) 1997, 2007, Oracle. All rights reserved.

Used parameter files:


Used TNSNAMES adapter to resolve the alias
Attempting to contact (DESCRIPTION = (SDU=32767) (ADDRESS = (PROTOCOL = TCP)(HOST = p1fd1d1.edc.company.net)(PORT = 1547)) (CONNECT_DATA = (SERVICE_NAME = proddb)))
TNS-12541: TNS:no listener

At this point, from the snapshot point of view it was obvious that the listener in the master database server was unavailable at that time, which caused we see 'TNS:no listener' in the snapshot database alert logs. Later on we did receive email from other DBAs saying that maintenance tasks were going on the master database server.

So in this case, the alert log message of a database revealed problems of a remote database.

Friday, August 08, 2008

ORACLE_PATH and SQLPATH - continue from yesterday

Continue on yesterday's post.

I found out that the temp.sql that I actually run is located at the following directory:

oracle@dhpnbe5b(391) PRODDB /home/oracle/dba/tools

And its contents are:


$ cat temp.sql
create or replace procedure sel_all_sql
( sid_in in number )
is
begin
# set verify off
# set feedback off
# column sid heading "OracleSessionId"
# column sqltext heading "Sql text"
select sql_text
from V$SESSION a , V$SQLTEXT b
where a.sql_address = b.address
and a.sql_hash_value = b.hash_value
and a.sid = sid_in
order by piece;
end;
/



It looks no harm.

Did a little google, I learned that there are two envrionment variables that are relevent to this issue:

(1) ORACLE_PATH

Defines the path names to locate files. If ORACLE_PATH is set, the application searches the directories specified by $ORACLE_PATH first, then the current directory. If ORACLE_PATH is not set, then the application uses the current directory. This variable is used by SQLPlus, Forms and Menu.

(2) SQLPATH

Specifies the directory or list of directories that SQLPlus searches for a login.sql file.

I have an impression that the Oracle searches SQLPATH, never known befeore that Oracle acutually searches ORACLE_PATH.

Regarding the environment in which I run the temp.sql, I did not find any temp.sql under SQLPATH, however, the above temp.sql is really under ORACLE_PATH. See the following:

oracle@dhpnbe5b(385) PRODDB /home/oracle
$ echo $ORACLE_PATH
/home/oracle/dba/tools:.:/home/oracle/dba/sql:/home/oracle/dba/tools

Notice that the '/home/oracle/dba/tools' appears twice, one is preceding the current dir '.', this explains what happend yesterday, which Oracle did not pick up the temp.sql to run from my current directory.

I will test out that if ORACLE_PATH is not set whether Oracle will search SQLPATH. But I know it will at least with my 10g SQL Plus/database on Windows XP pc.

Thursday, August 07, 2008

Surprise from runing a SQL* Plus script

Last night was my first day working for a new client. It is a night shift on-call production support job. I was asked to add space to a bunch of databases. For one database, in SQL* Plus command line, I typed:

SQL> ho vi temp.sql

Then in the temp.sql I edited some commands which is nothing but to select from some DBA_* views to find out space usage. So after ediinbt, I did:

SQL> @temp

To my surprise, Oracle gives error message, something like ' Procedure created with compilation error'. I immediately reallized that there must be a script called 'temp.sql' in the SQLPATH. I was scared as this is a production database. Have I modified a procedure that is essential to bussiness? I was picturing the worst-case scenario, bussiness failed etc. Really scared. However later on, I found I connect as SYS user, the procedure was thus created under SYS schema ( what I thought at that time, should be confirmed), which is unlikely used by application user. Anyway mistake has already been made. So I moved on. But I made another mistake - I did not write down in which database I created the procedure. Now I can not recall. Tonight I shall find out and confirm that there should be no impact by creating this procedure under SYS.

I frequently use the name temp.sql to edit SQL* Plus statments in it and run the script as described above. In the furture I should be more careful. It appears that I am not the only one like the name 'temp.sql'. Probably I should give up this habit and use copy/paste instead.

God bless me!

Wednesday, August 06, 2008

I Become Smart To Solve Problems

To help remember the basic functions of a DBA, I created this sentence:

"I Become Smart To Solve Problems"

Anyway, a good DBA should work smart instead of always working hard, right?


IBSTSP
  • I - Installation/configuration/patch/migration
  • B - Backup/Recovery
  • S - Security : "AAA" - authentication/authorization/auditing
  • T - Troubleshooting
  • S - Storage/Capacity planning
  • P - Performance monitoring and tuning