Showing posts with label troubleshooting. Show all posts
Showing posts with label troubleshooting. Show all posts

Wednesday, February 09, 2011

Quick notes about troubleshooting a blocker issue involving an insert

Called to help troubleshooting a production issue: an insert statemnt blocked many others. However, when I checked the database on 7:30 PM, the symptoms have gone.

In retrospective, the problem is most  likely due to a configuration issue in context of DMT. This database was upgraded to 10g a year ago from 8i. A few tablespaces is still dictionary-managed.


Here is the AWR "Top 5 Timed Events" during the problem period:

Top 5 Timed Events (18:30 -19:00)

Event Waits Time(s) Avg Wait(ms) % Total Call Time Wait Class

enq: TX - contention 44,494 130,783 2,939 59.7 Other
enq: ST - contention 27,122 79,719 2,939 36.4 Configuration
CPU time 3,700 1.7
read by other session 839,265 2,843 3 1.3 User I/O
db file scattered read 405,142 2,121 5 1.0 User I/O


From: http://www.saptechies.com/faq-oracle-enqueues/,  we can read some info about Oracle enqueue.


System enqueues:

ST (space transaction enqueue): This enqueue is held in dictionary-managed tablespaces within extent allocations and releases.

TYPE = ST

Very large numbers of extents are allocated or deallocated.


You can solve the problem permanently by using LMTS instead of DMTS (refer to Notes 214995 and 659946). If you use DMTS, you must ensure that the extents of temporary segments, tables and indexes are sufficiently big so that you can avoid large quantities of extents being allocated and deallocated.




I wan able to find the blocker from the DBA_HIST_ACTIVE_SESS_HISTORY view. Drunig 18:30 - 19:00 Session 651 is blocking many other sessions, which either wait for enq: TX or enq:ST  for examples:

SAMPLE_TIME SESSION_ID SQL_ID EVENT BLOCKING_SESSION

---------------------------------------- ---------- ------------- ------------------------------ ----------------
08-FEB-11 06.59.56.243 PM 242 6wyrqa50whrjg enq: TX - contention 651
08-FEB-11 06.59.56.243 PM 345 6wyrqa50whrjg enq: TX - contention 651
08-FEB-11 06.59.56.243 PM 414 6wyrqa50whrjg enq: TX - contention 651
08-FEB-11 06.59.56.243 PM 469 6wyrqa50whrjg enq: TX - contention 651
08-FEB-11 06.59.56.243 PM 591 6wyrqa50whrjg enq: TX - contention 651
08-FEB-11 06.59.56.243 PM 593 6wyrqa50whrjg enq: TX - contention 651
08-FEB-11 06.59.56.243 PM 594 6wyrqa50whrjg enq: TX - contention 651
08-FEB-11 06.59.56.243 PM 595 6wyrqa50whrjg enq: TX - contention 651
08-FEB-11 06.59.56.243 PM 596 6wyrqa50whrjg enq: TX - contention 651

...


SAMPLE_TIME SESSION_ID SQL_ID EVENT BLOCKING_SESSION
---------------------------------------- ---------- ------------- ------------------------------ ----------------
08-FEB-11 06.59.56.243 PM 1976 dpc1k1djg5gh0 enq: ST - contention 651
08-FEB-11 06.59.56.243 PM 49 dpc1k1djg5gh0 enq: ST - contention 651
08-FEB-11 06.59.56.243 PM 1841 dpc1k1djg5gh0 enq: ST - contention 651
08-FEB-11 06.59.56.243 PM 1763 dpc1k1djg5gh0 enq: ST - contention 651
08-FEB-11 06.59.56.243 PM 1693 dpc1k1djg5gh0 enq: ST - contention 651
08-FEB-11 06.59.56.243 PM 1660 dpc1k1djg5gh0 enq: ST - contention 651

....

Session 651 that was doing the same insert lasted about 1hour, from about 6:17 to 7:15

SAMPLE_TIME SESSION_ID SQL_ID EVENT BLOCKING_SESSION
---------------------------------------- ---------- ------------- ------------------------------ ----------------
08-FEB-11 06.17.23.778 PM 651 6wyrqa50whrjg
08-FEB-11 06.17.33.998 PM 651 6wyrqa50whrjg
08-FEB-11 06.17.44.198 PM 651 6wyrqa50whrjg
08-FEB-11 06.17.54.508 PM 651 6wyrqa50whrjg
...

08-FEB-11 07.14.45.919 PM 651 6wyrqa50whrjg
08-FEB-11 07.14.56.049 PM 651 6wyrqa50whrjg
08-FEB-11 07.15.06.159 PM 651 6wyrqa50whrjg
-- end --



After this sql ( 6wyrqa50whrjg  is an  simple insert into a table) was done, things become normal.  I am still not sure why this insert took so long. I was unable to know what wait event it was experiencing. Apparantly, Oracle was allocating a new extent to the table segment, but why took about an hour? The table referred in the SQL has two extents now, first one has 64000 blocks, the second has 16000 blocks.

So the scenario appeared to be that a session wants to insert a row into a table that resides in a DMT tablespace, but  the table segment is run out of space, Oracle firstly has to allocate a new extent to the table segment; the session is waiting for this allocation to complete; it blocks all other sessions doing the same insert into the same table with the latter waiting for enq:TX. It also blocks other sessions with enq:ST events,  which  are doing DML on  different tables  that resides in the same tablespace.

Anyway we really need to migrate DMT to LMT asap.

Monday, July 13, 2009

Having troubles in creating a database link

Today, I have spent quite some time to try to get a db link from a 10g db (DBWRKEV1) to a 9i db (DBWRK920) work.

At DBWRKEV1, we can see the global_name has a funny suffix.

z9xxx94@DBWRKEV1> select * from global_name;

GLOBAL_NAME
---------------------------------------------------------
DBWRKEV1.REGRESS.RDBMS.DEV.US.ORACLE.COM


Based on this post, I tried to remove it by this way:


SQL> show user;
USER is "SYS"
SQL> update GLOBAL_NAME SET GLOBAL_NAME='DBWRKEV1';

1 row updated.

SQL> commit;

Commit complete.

z9xxx94@DBWRKEV1> select * from global_name;

GLOBAL_NAME
---------------------------------------------
DBWRKEV1

Then, I tried to create the db link again:

z9xxx94@DBWRKEV1> create database link dbwrk920 connect to z9xxx94 identified by vxxx using 'dbwrk920';

Database link created.

z9xxx94@DBWRKEV1> select * from global_name@dbwrk920;
select * from global_name@dbwrk920
*
ERROR at line 1:
ORA-02085: database link DBWRK920.REGRESS.RDBMS.DEV.US.ORACLE.COM connects to DBWRK920


Still, I have not succeeded. It should be noted that the initialization parameter global_names is TRUE in the DBWRKEV1(10g)

Finally, I was able to use the db link through following approach:


At DBWRK920


z9xxx94@DBWRK920> select * from global_name;

GLOBAL_NAME
---------------------------------------------------------------------------
DBWRK920

z9xxx94@DBWRK920> alter database rename global_name to dbwrk920.world;

Database altered.

z9xxx94@DBWRK920> select * from global_name;

GLOBAL_NAME
---------------------------------------------------------------------------
DBWRK920.WORLD


At DBWRKEV1


z9xxx94@DBWRKEV1> create database link dbwrk920.world connect to z9xxx94 identified by vxxx using 'dbwrk920';

Database link created.


z9xxx94@DBWRKEV1>
z9xxx94@DBWRKEV1> select * from global_name@dbwrk920.world;

GLOBAL_NAME
--------------------------------------------------------------
DBWRK920.WORLD

Saturday, November 15, 2008

Troubleshooting - ORA-02049: timeout: distributed transaction waiting for lock

The following steps/observations are involved in solving this issue:

1. Application team asked DBA for help to solve the following error:

We are connecting to NCAMASTERCON/xxxxx@PRDDB1



SQL> update au1011
2 set rerate_request_cd = '5',dl_service_code = 'FXLSA', sys_update_date = sysdate
3 where ban in (select ban from TEMP_LSAFF_US1011)
4 and record_type = '1';
commit;
update au1011
*
ERROR at line 1:
ORA-02049: timeout: distributed transaction waiting for lock
ORA-02063: preceding line from NCAUSAGECON.PRDDB2





2. I found out that au1011 on PRDDB1 is a synonym for an object on a remote database PRDDB2

The synonym definition is as follows:
CREATE OR REPLACE SYNONYM "NCAMASTERCON"."AU1011" FOR "NCAUSAGECON"."AU1011"@"NCAUSAGECON.PRDDB2";


3. Then, I found out on PRDDB2 the "AU1011" is also a synonym, it is defined as:
CREATE OR REPLACE SYNONYM "NCAUSAGECON"."AU1011" FOR "NCAUSAGEOWN"."ACCUMULATED_USAGE_10_B";

4. Checked if there are any locks on the table: ACCUMULATED_USAGE_10_B



system@PRDDB2> @l2/locked_object_by_name.sql
Enter value for object_name: ACCUMULATED_USAGE_10_B
Lock Lock Time Object
SID Type Held Owner Object Name Held (min) Status
------ ---- ------ ---------------- ------------------------------ ---------- ------------
2278 DML Row-X NCAUSAGEOWN ACCUMULATED_USAGE_10_B 1622 VALID




5. The SQL running by session 2278 is:



system@PRDDB2> @sql_cur_sid
Enter value for sid: 2278

SQL_TEXT SQL_ID SID USER_NAME

---------------------------------------- ------------- ------ ------------
UPDATE /*+ PARALLEL ("A1",4) */ "NCAUSAG 93vbp0jv2c7w4 2278 NCAUSAGECON
ECON"."AU1011" "A1" SET "RERATE_REQUEST_
CD" = '5',"DL_SERVICE_CODE" = 'FXLSA',"S
YS_UPDATE_DATE" = SYSDATE@! WHERE ("A1".
"BAN","A1"."SUBSCRIBER_NO")=ANY (SELECT
DISTINCT "A2"."BAN","A1"."SUBSCRIBER_NO"
FROM "TEMP_LSAFF_US1011"@! "A2") AND "A
1"."RECORD_TYPE"='1'



6. Further check revealed that session 2278 on PRDDB2 was started 97784 seconds ago
and the last call it made was 97773 seconds ago



7. The application DBA observed the following:
The session (sid=7265) shows as KILLED on PRDDB1, but the corresponding remote session on PRDDB2(sid=2278) is still active.

8. Confirmed by application team, we killed the session 2278 on PRDDB2

Monday, October 27, 2008

Identify and kill a blocking session - enq: TX - row lock contention

This post documented steps I have taken to identify and kill a blocking session:

1. Recived alert:

DBNAME:critical-Higher wait events:enq: TX - row lock contention-63(threshold 50)

2. Checking the wait events, find that the count for 'enq: TX - row lock contention' is high

  COUNT(*) EVENT                                         MAX(SECONDS_IN_WAIT)
---------- --------------------------------------------- --------------------
5753 SQL*Net message from client                                6048278
188 SQL*Net break/reset to client                                33425
149 enq: TX - row lock contention                                12274
20 rdbms ipc message                                            41691
1 Streams AQ: qmn slave idle wait                             556815
1 Streams AQ: waiting for messages in the queue                    5
1 Streams AQ: waiting for time management or cl                33106
eanup tasks
1 db file parallel write                                           0
1 direct path write                                                2
1 Streams AQ: qmn coordinator idle wait                      6184833
1 jobq slave wait                                                 18
1 os thread startup                                                0
1 pmon timer                                                   18795
1 smon timer                                                    5504
1 SQL*Net message to client                                        0


3. Checking locks:

Looks everyone is waiting for sid=7752

system@DBNAME> @qlocks

Lock                Time Held
SID Lock Type Requested Lock Held (minutes) Block
------ --------- --------- --------- --------- -----
7752 Trans     None      Exclusive       325 Yes
3994 Trans     Share     None            324 No
8305 Trans     Share     None            323 No
4922 Trans     Share     None            292 No
3977 Trans     Share     None            285 No
7668 Trans     Share     None            282 No
4379 Trans     Share     None            124 No
5423 Trans     Share     None              1 No
7611 Trans     Share     None              1 No
6080 Trans     Share     None              1 No
3843 Trans     Share     None              0 No
4731 Trans     Share     None              0 No

( truncated ...  total 116 rows )



system@DBNAME> @blocker

ADDR             KADDR               SID TY        ID1        ID2      LMODE    REQUEST      CTIME   Block
---------------- ---------------- ------ -- ---------- ---------- ---------- ---------- ---------- ----------
0000000516758030 00000005167581B8   7752 TX    3080222     128785          6          0      19544 ##########





4. Check the session sid=7752

system@DBNAME> @sess_by_sid
Enter value for sid: 7752

Shadow       Remote       Remote
SID-SERIAL   OS User  Username         Module         Process      Process      Machine              Logon Time      Status
------------ -------- ---------------- -------------- ------------ ------------ -------------------- --------------- --------
'7752,55822' doammw27 OLAMLOG                         8786         1234         dspcef07             10/26 13:42:04  INACTIVE





5. The following query shows that session sid=7752 is 19367 sec old, and the last call made by it is 19366 sec ago.

system@DBNAME> @trans_long_running.sql

ADDR             Status       UBAREC           START_SCN SYSDATE             START_DATE                SECS     RSSIZE   SID LAST_CALL_ET Status   Username         OS User  SHARABLE_MEM
---------------- -------- ---------- ------------------- ------------------- ------------------- ---------- ---------- ----- ------------ -------- ---------------- -------- ------------
0000000516758030 ACTIVE           13      10172725882034 2008-10-26 19:05:54 2008-10-26 13:43:07      19367    1171456  7752        19366 INACTIVE OLAMLOG          doammw27




6. After killing the session sid=7752, wait event showed as follows:

  COUNT(*) EVENT                                         MAX(SECONDS_IN_WAIT)
---------- --------------------------------------------- --------------------
5943 SQL*Net message from client                                6049401
188 SQL*Net break/reset to client                                34548
22 rdbms ipc message                                            42814
1 smon timer                                                     276
1 Streams AQ: waiting for time management or cl                34229
eanup tasks

1 jobq slave wait                                                 47
1 pmon timer                                                     283
1 Streams AQ: qmn slave idle wait                             557938
1 Streams AQ: qmn coordinator idle wait                      6185956
1 SQL*Net message to client                                        0
1 Streams AQ: waiting for messages in the queue                    5




####---------------------------------------------------------------------------
Appendix - scripts used:

$ cat event.sql
rem file: event.sql
rem
rem  Lists the Current Waits in the database.
rem  First Script to run in the database to debug performance issues.
rem  This query lists the Count of Sessions group by Wait Event, and
rem  sorted in Ascending Order.
rem  Look for Top 5 Wait events from this query.
rem  Also listed is the maximum time ( seconds) sessions
rem  have waited against a particular wait event. Count Information is
rem  more valuable than the Max time information.
rem
column event format a45
select count(*), event, max(seconds_in_wait) from v$session_wait
group by event
order by 1 desc
/

####---------------------------------------------------------------------------

$ cat qlocks.sql
clear columns
clear computes
clear breaks

set linesize 240
set pagesize 40

column sid      format 99999    heading "SID"
column locktype format A9       heading "Lock Type"
column lockreq  format A9       heading "Lock     Requested"
column lockheld format A9       heading "Lock Held"
column timeheld format 9999999  heading "Time Held(minutes)"
column block    format a5       heading "Block"


select /* RULE */ l.sid sid,
decode(l.type,'TM','DML','TX','Trans','UL','User',l.type) locktype,
decode(l.request,0,'None',1,'Null',2,'Row-S',3,'Row-X',4,'Share',5,'S/Row-X',6,'Exclusive') lockreq,
decode(l.lmode,0,'None',1,'Null',2,'Row-S',3,'Row-X',4,'Share',5,'S/Row-X',6,'Exclusive') lockheld,
l.ctime/60 timeheld,
decode(l.block,0,'No',1,'Yes') block
from v$lock l
where l.request != 0 or l.block != 0
order by l.id1, l.lmode desc, l.ctime desc
/

####---------------------------------------------------------------------------
$ cat blocker.sql
select * from v$lock where block > 0
/

####---------------------------------------------------------------------------
$ cat sess_by_sid.sql
set linesize 250
set pagesize 1000
set verify off

column SidSerial        format a12      heading "SID-SERIAL"
column sid      format 9999     heading "SID"
column serial   format 99999    heading "Serial "
column osuser   format a8       heading "OS User"
column username format a16      heading "Username"
column module   format a14      heading "Module"
column spid     format 999999   heading "ShadowProcess"
column process  format 999999   heading "RemoteProcess"
column machine  format a20      heading "RemoteMachine"
column time     format a15      heading "Logon Time"
column status   format a8       heading "Status"


select ''''||s.sid||','||s.serial# ||'''' SidSerial,
--select s.sid sid ,
--s.serial# serial,
s.osuser                 osuser,
s.username               username,
s.module                 module,
p.spid                   spid,
s.process                process,
s.machine                machine,
to_char(s.logon_time, 'mm/dd hh24:mi:ss') time,
s.status                 status
from v$session s, v$process p
where s.paddr = p.addr (+)
and s.sid = &sid
/

####---------------------------------------------------------------------------

$ cat trans_long_running.sql
column start_scn format 999999999999999999
select
t.addr,
t.status,
t.UBAREC,
t.start_scn,
SYSDATE,
t.start_date,
(
(SYSDATE - t.start_date) * 86400
) secs,
r.rssize,
s.sid,
s.last_call_et,
s.status,
s.username,
s.osuser,
sa.sharable_mem
from
v$transaction t
join v$rollstat r on t.xidusn = r.usn
join v$session s on s.saddr = t.ses_addr
left outer join v$sqlarea sa on s.sql_address = sa.address
where s.sid = 7752
/


Tuesday, September 23, 2008

Troubleshooting ORA-16191 : Primary log shipping client not logged on standby

I recieved an error ORA-16191 from email alert tonight when I was on-call. I searched the internet immediately, found that:

---
ORA-16191: Primary log shipping client not logged on standby
Cause: An attempt to ship redo to standby without logging on to standby or with invalid user credentials.
Action: Check that primary and standby are using password files and that both primary and standby have the same SYS password. Restart primary and/or standby after ensuring that password file is accessible and REMOTE_LOGIN_PASSWORDFILE initialization parameter is set to SHARED or EXCLUSIVE
---

I checked the alert log file of the primary database, found that the alert starting from 1:18




Tue Sep 23 01:18:15 2008
Error 1017 received logging on to the standby
------------------------------------------------------------
RA-16191: Primary log shipping client not logged on standby
PING[ARC3]: Heartbeat failed to connect to standby 'PS4008A.world'. Error is 16191.
Tue Sep 23 01:23:16 2008
Error 1031 received logging on to the standby
Tue Sep 23 01:23:16 2008
Errors in file /logs/ORACLE/MYDBNAME/bdump/pphi08a_arc3_2846.trc:
ORA-01031: insufficient privileges
PING[ARC3]: Heartbeat failed to connect to standby 'PS4008A.world'. Error is 1031.

Check that the primary and standby are using a password file
and remote_login_passwordfile is set to SHARED or EXCLUSIVE,
and that the SYS password is same in the password files.
returning error ORA-16191
------------------------------------------------------------



I also checked the alert log file of the standby database, find that the someone probably was doing shutdown/start the standby db



-----
Tue Sep 23 02:00:33 2008
Physical Standby Database mounted.
Completed: alter database mount standby database
Tue Sep 23 02:00:33 2008
Physical Standby Database mounted.
Completed: alter database mount standby database
Tue Sep 23 02:01:26 2008
alter database recover managed standby database parallel 16 disconnect
Tue Sep 23 02:01:26 2008
alter database recover managed standby database parallel 16 disconnect
---



The timestamp of the pmon process also indicated it was just started tonight.



$ ps -ef grep pmon
oracle 23059 1 0 01:59:26 ? 0:00 ora_pmon_PS4008A



Another team member explained to me that we change the SYS password every 2 or 3 months, however,
when we change for the primary database, we don't change for the standby database. So after standby db down and up, Oracle trying to connect the standby using the new password, but the password file in the standby db site still contains old password. This is the reason that I saw the error.

Based on what he said and the problem observed, it looks to me that somehow there is a "connection" from primary db to the standby db using the password in the password files and the two password files have to be in sync to enable such a connection. When we changed the sys password in the primary only, the password files are out of sync. So when standby is bounced, this "connection" need to be re-established but it can not because the out of sync situation.

I thus used the "orapwd" command to create a new password file on the standby site. Problem resolved. No such alerts recieved again.