Friday, March 06, 2015

Killing DBSNMP sessions not easy

We had a performane issue for a prodcution database today. The CPU utilization went up. When I logged into the server and tried to log into the database as sysdba, I got ORA-00020 error:
$sqlplus "/ as sysdba"

SQL*Plus: Release 11.2.0.2.0 Production on Fri Mar 6 10:55:10 2015

Copyright (c) 1982, 2010, Oracle.  All rights reserved.

ERROR:
ORA-00020: maximum number of processes (700) exceeded

Another DBA reported that the session count for DBSNMP user went up high. This user is from OEM. I stopped the OEM agent and he killed all DBSNMP sessions at database level. However, almost 1 hour later, we still saw high DBSNMP session count:
Username           COUNT(*)
---------------- ----------
DBSNMP                  352
XXXX100                   1
XXXXSELECT                1
XXXXUSER                232
IVIEW2XXXX               25
ORACLE                    5
XX09195                   1
XXX6621                   6
XXX3294                   1
XXDBMON                   1
                         36
note: some usernames are marked to protect ther innocence.

Of course, we can see all DBSNMP sessions marked as "KILLED"
SQL> select username, status from v$session where username='DBSNMP';

Username         Status
---------------- --------
DBSNMP           KILLED
DBSNMP           KILLED
DBSNMP           KILLED
DBSNMP           KILLED
DBSNMP           KILLED
DBSNMP           KILLED
DBSNMP           KILLED
DBSNMP           KILLED
DBSNMP           KILLED
DBSNMP           KILLED
DBSNMP           KILLED
DBSNMP           KILLED
DBSNMP           KILLED
DBSNMP           KILLED
DBSNMP           KILLED
DBSNMP           KILLED
...

So It appeared to be a good idea to kill the processes at OS level. I fired the following script I usually use to get the KILL command:
set pages 0
spool /tmp/kill_os

select 'kill -9 '|| p.spid
from v$process p , v$session s
where p.background is null
and s.paddr = p.addr (+)
and p.spid is not null
and s.username='DBSNMP'
order by p.spid
/
Guess what, to my surprise, it returned nothing !. I got the count for those "LOCAL=NO" processes as below:
$ps -ef |grep "LOCAL=NO" | wc -l

659

Now the problem was how could I get the PIDs out of those 659 processes that belong to DBSNMP ? What I did finally was:

1. load the PIDs of all those "LOCAL=NO" processes into a table db_admin.pid with one number column called ID:
ps -ef | grep LOCAL=NO | awk '{print "insert into db_admin.pid values(" $2 "); }'

2. Run the following query to generate the KILL commands

select 'kill -9 ' ||  id from db_admin.pid x 
where not exists (
select 1 
from v$session s, v$process p
where s.paddr = p.addr (+)
  and p.spid = x.id
)
order by id
/

This trick worked. To prevent future troubles, we've also set up a profile limit for the number of sessions the user DBSNMP can have. We still don't know what caused the session count high though.

1 comment:

yds said...

As to possible reason for high DBSNMP session count: see: High Number of SYS or DBSNMP Sessions Selecting from gv$buffered_subscribers Affecting Database Performance or Contributing to Hung Database (Doc ID 1372977.1)