Called to help troubleshooting a production issue: an insert statemnt blocked many others. However, when I checked the database on 7:30 PM, the symptoms have gone.
In retrospective, the problem is most likely due to a configuration issue in context of DMT. This database was upgraded to 10g a year ago from 8i. A few tablespaces is still dictionary-managed.
Here is the AWR "Top 5 Timed Events" during the problem period:
Top 5 Timed Events (18:30 -19:00)
Event Waits Time(s) Avg Wait(ms) % Total Call Time Wait Class
enq: TX - contention 44,494 130,783 2,939 59.7 Other
enq: ST - contention 27,122 79,719 2,939 36.4 Configuration
CPU time 3,700 1.7
read by other session 839,265 2,843 3 1.3 User I/O
db file scattered read 405,142 2,121 5 1.0 User I/O
From: http://www.saptechies.com/faq-oracle-enqueues/, we can read some info about Oracle enqueue.
System enqueues:
ST (space transaction enqueue): This enqueue is held in dictionary-managed tablespaces within extent allocations and releases.
TYPE = ST
Very large numbers of extents are allocated or deallocated.
You can solve the problem permanently by using LMTS instead of DMTS (refer to Notes 214995 and 659946). If you use DMTS, you must ensure that the extents of temporary segments, tables and indexes are sufficiently big so that you can avoid large quantities of extents being allocated and deallocated.
I wan able to find the blocker from the DBA_HIST_ACTIVE_SESS_HISTORY view. Drunig 18:30 - 19:00 Session 651 is blocking many other sessions, which either wait for enq: TX or enq:ST for examples:
SAMPLE_TIME SESSION_ID SQL_ID EVENT BLOCKING_SESSION
---------------------------------------- ---------- ------------- ------------------------------ ----------------
08-FEB-11 06.59.56.243 PM 242 6wyrqa50whrjg enq: TX - contention 651
08-FEB-11 06.59.56.243 PM 345 6wyrqa50whrjg enq: TX - contention 651
08-FEB-11 06.59.56.243 PM 414 6wyrqa50whrjg enq: TX - contention 651
08-FEB-11 06.59.56.243 PM 469 6wyrqa50whrjg enq: TX - contention 651
08-FEB-11 06.59.56.243 PM 591 6wyrqa50whrjg enq: TX - contention 651
08-FEB-11 06.59.56.243 PM 593 6wyrqa50whrjg enq: TX - contention 651
08-FEB-11 06.59.56.243 PM 594 6wyrqa50whrjg enq: TX - contention 651
08-FEB-11 06.59.56.243 PM 595 6wyrqa50whrjg enq: TX - contention 651
08-FEB-11 06.59.56.243 PM 596 6wyrqa50whrjg enq: TX - contention 651
...
SAMPLE_TIME SESSION_ID SQL_ID EVENT BLOCKING_SESSION
---------------------------------------- ---------- ------------- ------------------------------ ----------------
08-FEB-11 06.59.56.243 PM 1976 dpc1k1djg5gh0 enq: ST - contention 651
08-FEB-11 06.59.56.243 PM 49 dpc1k1djg5gh0 enq: ST - contention 651
08-FEB-11 06.59.56.243 PM 1841 dpc1k1djg5gh0 enq: ST - contention 651
08-FEB-11 06.59.56.243 PM 1763 dpc1k1djg5gh0 enq: ST - contention 651
08-FEB-11 06.59.56.243 PM 1693 dpc1k1djg5gh0 enq: ST - contention 651
08-FEB-11 06.59.56.243 PM 1660 dpc1k1djg5gh0 enq: ST - contention 651
....
Session 651 that was doing the same insert lasted about 1hour, from about 6:17 to 7:15
SAMPLE_TIME SESSION_ID SQL_ID EVENT BLOCKING_SESSION
---------------------------------------- ---------- ------------- ------------------------------ ----------------
08-FEB-11 06.17.23.778 PM 651 6wyrqa50whrjg
08-FEB-11 06.17.33.998 PM 651 6wyrqa50whrjg
08-FEB-11 06.17.44.198 PM 651 6wyrqa50whrjg
08-FEB-11 06.17.54.508 PM 651 6wyrqa50whrjg
...
08-FEB-11 07.14.45.919 PM 651 6wyrqa50whrjg
08-FEB-11 07.14.56.049 PM 651 6wyrqa50whrjg
08-FEB-11 07.15.06.159 PM 651 6wyrqa50whrjg
-- end --
After this sql ( 6wyrqa50whrjg is an simple insert into a table) was done, things become normal. I am still not sure why this insert took so long. I was unable to know what wait event it was experiencing. Apparantly, Oracle was allocating a new extent to the table segment, but why took about an hour? The table referred in the SQL has two extents now, first one has 64000 blocks, the second has 16000 blocks.
So the scenario appeared to be that a session wants to insert a row into a table that resides in a DMT tablespace, but the table segment is run out of space, Oracle firstly has to allocate a new extent to the table segment; the session is waiting for this allocation to complete; it blocks all other sessions doing the same insert into the same table with the latter waiting for enq:TX. It also blocks other sessions with enq:ST events, which are doing DML on different tables that resides in the same tablespace.
Anyway we really need to migrate DMT to LMT asap.
Showing posts with label troubleshooting. Show all posts
Showing posts with label troubleshooting. Show all posts
Wednesday, February 09, 2011
Monday, July 13, 2009
Having troubles in creating a database link
Today, I have spent quite some time to try to get a db link from a 10g db (DBWRKEV1) to a 9i db (DBWRK920) work.
At DBWRKEV1, we can see the global_name has a funny suffix.
z9xxx94@DBWRKEV1> select * from global_name;
GLOBAL_NAME
---------------------------------------------------------
DBWRKEV1.REGRESS.RDBMS.DEV.US.ORACLE.COM
Based on this post, I tried to remove it by this way:
SQL> show user;
USER is "SYS"
SQL> update GLOBAL_NAME SET GLOBAL_NAME='DBWRKEV1';
1 row updated.
SQL> commit;
Commit complete.
z9xxx94@DBWRKEV1> select * from global_name;
GLOBAL_NAME
---------------------------------------------
DBWRKEV1
Then, I tried to create the db link again:
z9xxx94@DBWRKEV1> create database link dbwrk920 connect to z9xxx94 identified by vxxx using 'dbwrk920';
Database link created.
z9xxx94@DBWRKEV1> select * from global_name@dbwrk920;
select * from global_name@dbwrk920
*
ERROR at line 1:
ORA-02085: database link DBWRK920.REGRESS.RDBMS.DEV.US.ORACLE.COM connects to DBWRK920
Still, I have not succeeded. It should be noted that the initialization parameter global_names is TRUE in the DBWRKEV1(10g)
Finally, I was able to use the db link through following approach:
At DBWRK920
z9xxx94@DBWRK920> select * from global_name;
GLOBAL_NAME
---------------------------------------------------------------------------
DBWRK920
z9xxx94@DBWRK920> alter database rename global_name to dbwrk920.world;
Database altered.
z9xxx94@DBWRK920> select * from global_name;
GLOBAL_NAME
---------------------------------------------------------------------------
DBWRK920.WORLD
At DBWRKEV1
z9xxx94@DBWRKEV1> create database link dbwrk920.world connect to z9xxx94 identified by vxxx using 'dbwrk920';
Database link created.
z9xxx94@DBWRKEV1>
z9xxx94@DBWRKEV1> select * from global_name@dbwrk920.world;
GLOBAL_NAME
--------------------------------------------------------------
DBWRK920.WORLD
At DBWRKEV1, we can see the global_name has a funny suffix.
z9xxx94@DBWRKEV1> select * from global_name;
GLOBAL_NAME
---------------------------------------------------------
DBWRKEV1.REGRESS.RDBMS.DEV.US.ORACLE.COM
Based on this post, I tried to remove it by this way:
SQL> show user;
USER is "SYS"
SQL> update GLOBAL_NAME SET GLOBAL_NAME='DBWRKEV1';
1 row updated.
SQL> commit;
Commit complete.
z9xxx94@DBWRKEV1> select * from global_name;
GLOBAL_NAME
---------------------------------------------
DBWRKEV1
Then, I tried to create the db link again:
z9xxx94@DBWRKEV1> create database link dbwrk920 connect to z9xxx94 identified by vxxx using 'dbwrk920';
Database link created.
z9xxx94@DBWRKEV1> select * from global_name@dbwrk920;
select * from global_name@dbwrk920
*
ERROR at line 1:
ORA-02085: database link DBWRK920.REGRESS.RDBMS.DEV.US.ORACLE.COM connects to DBWRK920
Still, I have not succeeded. It should be noted that the initialization parameter global_names is TRUE in the DBWRKEV1(10g)
Finally, I was able to use the db link through following approach:
At DBWRK920
z9xxx94@DBWRK920> select * from global_name;
GLOBAL_NAME
---------------------------------------------------------------------------
DBWRK920
z9xxx94@DBWRK920> alter database rename global_name to dbwrk920.world;
Database altered.
z9xxx94@DBWRK920> select * from global_name;
GLOBAL_NAME
---------------------------------------------------------------------------
DBWRK920.WORLD
At DBWRKEV1
z9xxx94@DBWRKEV1> create database link dbwrk920.world connect to z9xxx94 identified by vxxx using 'dbwrk920';
Database link created.
z9xxx94@DBWRKEV1>
z9xxx94@DBWRKEV1> select * from global_name@dbwrk920.world;
GLOBAL_NAME
--------------------------------------------------------------
DBWRK920.WORLD
Saturday, November 15, 2008
Troubleshooting - ORA-02049: timeout: distributed transaction waiting for lock
The following steps/observations are involved in solving this issue:
1. Application team asked DBA for help to solve the following error:
We are connecting to NCAMASTERCON/xxxxx@PRDDB1
2. I found out that au1011 on PRDDB1 is a synonym for an object on a remote database PRDDB2
The synonym definition is as follows:
CREATE OR REPLACE SYNONYM "NCAMASTERCON"."AU1011" FOR "NCAUSAGECON"."AU1011"@"NCAUSAGECON.PRDDB2";
3. Then, I found out on PRDDB2 the "AU1011" is also a synonym, it is defined as:
CREATE OR REPLACE SYNONYM "NCAUSAGECON"."AU1011" FOR "NCAUSAGEOWN"."ACCUMULATED_USAGE_10_B";
4. Checked if there are any locks on the table: ACCUMULATED_USAGE_10_B
5. The SQL running by session 2278 is:
6. Further check revealed that session 2278 on PRDDB2 was started 97784 seconds ago
and the last call it made was 97773 seconds ago
7. The application DBA observed the following:
The session (sid=7265) shows as KILLED on PRDDB1, but the corresponding remote session on PRDDB2(sid=2278) is still active.
8. Confirmed by application team, we killed the session 2278 on PRDDB2
1. Application team asked DBA for help to solve the following error:
We are connecting to NCAMASTERCON/xxxxx@PRDDB1
SQL> update au1011
2 set rerate_request_cd = '5',dl_service_code = 'FXLSA', sys_update_date = sysdate
3 where ban in (select ban from TEMP_LSAFF_US1011)
4 and record_type = '1';
commit;
update au1011
*
ERROR at line 1:
ORA-02049: timeout: distributed transaction waiting for lock
ORA-02063: preceding line from NCAUSAGECON.PRDDB2
2. I found out that au1011 on PRDDB1 is a synonym for an object on a remote database PRDDB2
The synonym definition is as follows:
CREATE OR REPLACE SYNONYM "NCAMASTERCON"."AU1011" FOR "NCAUSAGECON"."AU1011"@"NCAUSAGECON.PRDDB2";
3. Then, I found out on PRDDB2 the "AU1011" is also a synonym, it is defined as:
CREATE OR REPLACE SYNONYM "NCAUSAGECON"."AU1011" FOR "NCAUSAGEOWN"."ACCUMULATED_USAGE_10_B";
4. Checked if there are any locks on the table: ACCUMULATED_USAGE_10_B
system@PRDDB2> @l2/locked_object_by_name.sql
Enter value for object_name: ACCUMULATED_USAGE_10_B
Lock Lock Time Object
SID Type Held Owner Object Name Held (min) Status
------ ---- ------ ---------------- ------------------------------ ---------- ------------
2278 DML Row-X NCAUSAGEOWN ACCUMULATED_USAGE_10_B 1622 VALID
5. The SQL running by session 2278 is:
system@PRDDB2> @sql_cur_sid
Enter value for sid: 2278
SQL_TEXT SQL_ID SID USER_NAME
---------------------------------------- ------------- ------ ------------
UPDATE /*+ PARALLEL ("A1",4) */ "NCAUSAG 93vbp0jv2c7w4 2278 NCAUSAGECON
ECON"."AU1011" "A1" SET "RERATE_REQUEST_
CD" = '5',"DL_SERVICE_CODE" = 'FXLSA',"S
YS_UPDATE_DATE" = SYSDATE@! WHERE ("A1".
"BAN","A1"."SUBSCRIBER_NO")=ANY (SELECT
DISTINCT "A2"."BAN","A1"."SUBSCRIBER_NO"
FROM "TEMP_LSAFF_US1011"@! "A2") AND "A
1"."RECORD_TYPE"='1'
6. Further check revealed that session 2278 on PRDDB2 was started 97784 seconds ago
and the last call it made was 97773 seconds ago
7. The application DBA observed the following:
The session (sid=7265) shows as KILLED on PRDDB1, but the corresponding remote session on PRDDB2(sid=2278) is still active.
8. Confirmed by application team, we killed the session 2278 on PRDDB2
Monday, October 27, 2008
Identify and kill a blocking session - enq: TX - row lock contention
This post documented steps I have taken to identify and kill a blocking session:
1. Recived alert:
DBNAME:critical-Higher wait events:enq: TX - row lock contention-63(threshold 50)
2. Checking the wait events, find that the count for 'enq: TX - row lock contention' is high
3. Checking locks:
Looks everyone is waiting for sid=7752
4. Check the session sid=7752
5. The following query shows that session sid=7752 is 19367 sec old, and the last call made by it is 19366 sec ago.
6. After killing the session sid=7752, wait event showed as follows:
####---------------------------------------------------------------------------
Appendix - scripts used:
1. Recived alert:
DBNAME:critical-Higher wait events:enq: TX - row lock contention-63(threshold 50)
2. Checking the wait events, find that the count for 'enq: TX - row lock contention' is high
COUNT(*) EVENT MAX(SECONDS_IN_WAIT)
---------- --------------------------------------------- --------------------
5753 SQL*Net message from client 6048278
188 SQL*Net break/reset to client 33425
149 enq: TX - row lock contention 12274
20 rdbms ipc message 41691
1 Streams AQ: qmn slave idle wait 556815
1 Streams AQ: waiting for messages in the queue 5
1 Streams AQ: waiting for time management or cl 33106
eanup tasks
1 db file parallel write 0
1 direct path write 2
1 Streams AQ: qmn coordinator idle wait 6184833
1 jobq slave wait 18
1 os thread startup 0
1 pmon timer 18795
1 smon timer 5504
1 SQL*Net message to client 0
3. Checking locks:
Looks everyone is waiting for sid=7752
system@DBNAME> @qlocks
Lock Time Held
SID Lock Type Requested Lock Held (minutes) Block
------ --------- --------- --------- --------- -----
7752 Trans None Exclusive 325 Yes
3994 Trans Share None 324 No
8305 Trans Share None 323 No
4922 Trans Share None 292 No
3977 Trans Share None 285 No
7668 Trans Share None 282 No
4379 Trans Share None 124 No
5423 Trans Share None 1 No
7611 Trans Share None 1 No
6080 Trans Share None 1 No
3843 Trans Share None 0 No
4731 Trans Share None 0 No
( truncated ... total 116 rows )
system@DBNAME> @blocker
ADDR KADDR SID TY ID1 ID2 LMODE REQUEST CTIME Block
---------------- ---------------- ------ -- ---------- ---------- ---------- ---------- ---------- ----------
0000000516758030 00000005167581B8 7752 TX 3080222 128785 6 0 19544 ##########
4. Check the session sid=7752
system@DBNAME> @sess_by_sid
Enter value for sid: 7752
Shadow Remote Remote
SID-SERIAL OS User Username Module Process Process Machine Logon Time Status
------------ -------- ---------------- -------------- ------------ ------------ -------------------- --------------- --------
'7752,55822' doammw27 OLAMLOG 8786 1234 dspcef07 10/26 13:42:04 INACTIVE
5. The following query shows that session sid=7752 is 19367 sec old, and the last call made by it is 19366 sec ago.
system@DBNAME> @trans_long_running.sql
ADDR Status UBAREC START_SCN SYSDATE START_DATE SECS RSSIZE SID LAST_CALL_ET Status Username OS User SHARABLE_MEM
---------------- -------- ---------- ------------------- ------------------- ------------------- ---------- ---------- ----- ------------ -------- ---------------- -------- ------------
0000000516758030 ACTIVE 13 10172725882034 2008-10-26 19:05:54 2008-10-26 13:43:07 19367 1171456 7752 19366 INACTIVE OLAMLOG doammw27
6. After killing the session sid=7752, wait event showed as follows:
COUNT(*) EVENT MAX(SECONDS_IN_WAIT)
---------- --------------------------------------------- --------------------
5943 SQL*Net message from client 6049401
188 SQL*Net break/reset to client 34548
22 rdbms ipc message 42814
1 smon timer 276
1 Streams AQ: waiting for time management or cl 34229
eanup tasks
1 jobq slave wait 47
1 pmon timer 283
1 Streams AQ: qmn slave idle wait 557938
1 Streams AQ: qmn coordinator idle wait 6185956
1 SQL*Net message to client 0
1 Streams AQ: waiting for messages in the queue 5
####---------------------------------------------------------------------------
Appendix - scripts used:
$ cat event.sql rem file: event.sql rem rem Lists the Current Waits in the database. rem First Script to run in the database to debug performance issues. rem This query lists the Count of Sessions group by Wait Event, and rem sorted in Ascending Order. rem Look for Top 5 Wait events from this query. rem Also listed is the maximum time ( seconds) sessions rem have waited against a particular wait event. Count Information is rem more valuable than the Max time information. rem column event format a45 select count(*), event, max(seconds_in_wait) from v$session_wait group by event order by 1 desc / ####--------------------------------------------------------------------------- $ cat qlocks.sql clear columns clear computes clear breaks set linesize 240 set pagesize 40 column sid format 99999 heading "SID" column locktype format A9 heading "Lock Type" column lockreq format A9 heading "Lock Requested" column lockheld format A9 heading "Lock Held" column timeheld format 9999999 heading "Time Held(minutes)" column block format a5 heading "Block" select /* RULE */ l.sid sid, decode(l.type,'TM','DML','TX','Trans','UL','User',l.type) locktype, decode(l.request,0,'None',1,'Null',2,'Row-S',3,'Row-X',4,'Share',5,'S/Row-X',6,'Exclusive') lockreq, decode(l.lmode,0,'None',1,'Null',2,'Row-S',3,'Row-X',4,'Share',5,'S/Row-X',6,'Exclusive') lockheld, l.ctime/60 timeheld, decode(l.block,0,'No',1,'Yes') block from v$lock l where l.request != 0 or l.block != 0 order by l.id1, l.lmode desc, l.ctime desc / ####--------------------------------------------------------------------------- $ cat blocker.sql select * from v$lock where block > 0 / ####--------------------------------------------------------------------------- $ cat sess_by_sid.sql set linesize 250 set pagesize 1000 set verify off column SidSerial format a12 heading "SID-SERIAL" column sid format 9999 heading "SID" column serial format 99999 heading "Serial " column osuser format a8 heading "OS User" column username format a16 heading "Username" column module format a14 heading "Module" column spid format 999999 heading "ShadowProcess" column process format 999999 heading "RemoteProcess" column machine format a20 heading "RemoteMachine" column time format a15 heading "Logon Time" column status format a8 heading "Status" select ''''||s.sid||','||s.serial# ||'''' SidSerial, --select s.sid sid , --s.serial# serial, s.osuser osuser, s.username username, s.module module, p.spid spid, s.process process, s.machine machine, to_char(s.logon_time, 'mm/dd hh24:mi:ss') time, s.status status from v$session s, v$process p where s.paddr = p.addr (+) and s.sid = &sid / ####--------------------------------------------------------------------------- $ cat trans_long_running.sql column start_scn format 999999999999999999 select t.addr, t.status, t.UBAREC, t.start_scn, SYSDATE, t.start_date, ( (SYSDATE - t.start_date) * 86400 ) secs, r.rssize, s.sid, s.last_call_et, s.status, s.username, s.osuser, sa.sharable_mem from v$transaction t join v$rollstat r on t.xidusn = r.usn join v$session s on s.saddr = t.ses_addr left outer join v$sqlarea sa on s.sql_address = sa.address where s.sid = 7752 /
Tuesday, September 23, 2008
Troubleshooting ORA-16191 : Primary log shipping client not logged on standby
I recieved an error ORA-16191 from email alert tonight when I was on-call. I searched the internet immediately, found that:
---
ORA-16191: Primary log shipping client not logged on standby
Cause: An attempt to ship redo to standby without logging on to standby or with invalid user credentials.
Action: Check that primary and standby are using password files and that both primary and standby have the same SYS password. Restart primary and/or standby after ensuring that password file is accessible and REMOTE_LOGIN_PASSWORDFILE initialization parameter is set to SHARED or EXCLUSIVE
---
I checked the alert log file of the primary database, found that the alert starting from 1:18
I also checked the alert log file of the standby database, find that the someone probably was doing shutdown/start the standby db
The timestamp of the pmon process also indicated it was just started tonight.
Another team member explained to me that we change the SYS password every 2 or 3 months, however,
when we change for the primary database, we don't change for the standby database. So after standby db down and up, Oracle trying to connect the standby using the new password, but the password file in the standby db site still contains old password. This is the reason that I saw the error.
Based on what he said and the problem observed, it looks to me that somehow there is a "connection" from primary db to the standby db using the password in the password files and the two password files have to be in sync to enable such a connection. When we changed the sys password in the primary only, the password files are out of sync. So when standby is bounced, this "connection" need to be re-established but it can not because the out of sync situation.
I thus used the "orapwd" command to create a new password file on the standby site. Problem resolved. No such alerts recieved again.
---
ORA-16191: Primary log shipping client not logged on standby
Cause: An attempt to ship redo to standby without logging on to standby or with invalid user credentials.
Action: Check that primary and standby are using password files and that both primary and standby have the same SYS password. Restart primary and/or standby after ensuring that password file is accessible and REMOTE_LOGIN_PASSWORDFILE initialization parameter is set to SHARED or EXCLUSIVE
---
I checked the alert log file of the primary database, found that the alert starting from 1:18
Tue Sep 23 01:18:15 2008
Error 1017 received logging on to the standby
------------------------------------------------------------
RA-16191: Primary log shipping client not logged on standby
PING[ARC3]: Heartbeat failed to connect to standby 'PS4008A.world'. Error is 16191.
Tue Sep 23 01:23:16 2008
Error 1031 received logging on to the standby
Tue Sep 23 01:23:16 2008
Errors in file /logs/ORACLE/MYDBNAME/bdump/pphi08a_arc3_2846.trc:
ORA-01031: insufficient privileges
PING[ARC3]: Heartbeat failed to connect to standby 'PS4008A.world'. Error is 1031.
Check that the primary and standby are using a password file
and remote_login_passwordfile is set to SHARED or EXCLUSIVE,
and that the SYS password is same in the password files.
returning error ORA-16191
------------------------------------------------------------
I also checked the alert log file of the standby database, find that the someone probably was doing shutdown/start the standby db
-----
Tue Sep 23 02:00:33 2008
Physical Standby Database mounted.
Completed: alter database mount standby database
Tue Sep 23 02:00:33 2008
Physical Standby Database mounted.
Completed: alter database mount standby database
Tue Sep 23 02:01:26 2008
alter database recover managed standby database parallel 16 disconnect
Tue Sep 23 02:01:26 2008
alter database recover managed standby database parallel 16 disconnect
---
The timestamp of the pmon process also indicated it was just started tonight.
$ ps -ef grep pmon
oracle 23059 1 0 01:59:26 ? 0:00 ora_pmon_PS4008A
Another team member explained to me that we change the SYS password every 2 or 3 months, however,
when we change for the primary database, we don't change for the standby database. So after standby db down and up, Oracle trying to connect the standby using the new password, but the password file in the standby db site still contains old password. This is the reason that I saw the error.
Based on what he said and the problem observed, it looks to me that somehow there is a "connection" from primary db to the standby db using the password in the password files and the two password files have to be in sync to enable such a connection. When we changed the sys password in the primary only, the password files are out of sync. So when standby is bounced, this "connection" need to be re-established but it can not because the out of sync situation.
I thus used the "orapwd" command to create a new password file on the standby site. Problem resolved. No such alerts recieved again.
Subscribe to:
Posts (Atom)