Tuesday, April 14, 2015

Monitoring Oracle GoldenGate Latency

Today I've implemented an approach to monitor OGG latency. Here I will describe what I've done.  

1. Create a table gg_latency in source and target databases:
create table gg_latency
(
  extr varchar2(10),
  pump varchar2(10),
  repl varchar2(10),
  update_time date
);



alter table gg_latency add constraint gg_latency_pk primary key(extr, pump, repl) using index;

2. Create a procedure that is used to update the latency table: 

create or replace procedure proc_update_gg_latency
is
begin
  -- Touch every (extract, pump, replicat) row so that one update per
  -- replication path flows from source to target; the target's update_time
  -- then shows how far behind each path is.
  for rec in ( select * from gg_latency )
  loop
     update gg_latency
        set update_time = sysdate
      where extr = rec.extr
        and pump = rec.pump
        and repl = rec.repl;
     commit;
  end loop;
end;
/
3. Populate the table with a row for each combination of Extract, Pump, and Replicat group names that makes up a replication path:

For example, in my replication environment I have three Extract groups and three Pump groups at the source, and 15 Replicat groups at the target:
insert into gg_latency(extr, pump, repl) values('ECRUDR1', 'PCRUDR1', 'CRURDR1');
insert into gg_latency(extr, pump, repl) values('ECRUDR1', 'PCRUDR1', 'CRURDR1A');
insert into gg_latency(extr, pump, repl) values('ECRUDR1', 'PCRUDR1', 'CRURDR1B');
insert into gg_latency(extr, pump, repl) values('ECRUDR1', 'PCRUDR1', 'CRURDR1C');
insert into gg_latency(extr, pump, repl) values('ECRUDR1', 'PCRUDR1', 'CRURDR1D');
  
insert into gg_latency(extr, pump, repl) values('ECRUDR2', 'PCRUDR2', 'CRURDR2A');
insert into gg_latency(extr, pump, repl) values('ECRUDR2', 'PCRUDR2', 'CRURDR2B');
insert into gg_latency(extr, pump, repl) values('ECRUDR2', 'PCRUDR2', 'CRURDR2C');
insert into gg_latency(extr, pump, repl) values('ECRUDR2', 'PCRUDR2', 'CRURDR2D');
insert into gg_latency(extr, pump, repl) values('ECRUDR2', 'PCRUDR2', 'CRURDR2F');
insert into gg_latency(extr, pump, repl) values('ECRUDR2', 'PCRUDR2', 'CRURDR2G');
insert into gg_latency(extr, pump, repl) values('ECRUDR2', 'PCRUDR2', 'CRURDR2H');
insert into gg_latency(extr, pump, repl) values('ECRUDR2', 'PCRUDR2', 'CRURDR2M');
insert into gg_latency(extr, pump, repl) values('ECRUDR2', 'PCRUDR2', 'CRURDR2N');

insert into gg_latency(extr, pump, repl) values('ECRUDR3', 'PCRUDR3', 'CRURDR3');
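
With these rows in place (and committed), a quick manual run of the procedure confirms that every path's update_time gets refreshed. This is just a sanity check, assuming the objects live in the DB_ADMIN schema as steps 4 and 8 suggest:

-- one manual refresh; the scheduler job in step 8 will do this every minute
exec db_admin.proc_update_gg_latency

-- every row's update_time should now show the current time at the source
select extr, pump, repl, update_time
  from db_admin.gg_latency
 order by extr, pump, repl;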

4. In each EXTRACT group parameter file at the source, add a TABLE clause with the WHERE option for the GG_LATENCY table, e.g.

    TABLE DB_ADMIN.GG_LATENCY, WHERE (EXTR="ECRUDR1");

Note: do this for all the EXTRACT groups.


5. In each PUMP group parameter file at the source, add a TABLE clause with the WHERE option for the GG_LATENCY table, e.g.
    TABLE DB_ADMIN.GG_LATENCY, WHERE (PUMP="PCRUDR1");
Note: add the line before the PASSTHRU parameter if it exists; do this for all the PUMP groups.


6. In each REPLICAT group parameter file at the target, add a MAP clause with the WHERE option for the GG_LATENCY table, e.g.

MAP DB_ADMIN.GG_LATENCY, TARGET DB_ADMIN.GG_LATENCY, WHERE (REPL='CRURDR1');
 
Note: do this for all the REPLICAT groups. In OGG 12c, single quotation marks should be used for literal strings.

7. Bounce all processes whose parameter files were modified.


8. Create a scheduler job to update the latency table every minute:
begin
  DBMS_SCHEDULER.create_job (
    job_name        => 'UPDATE_GG_LATENCY_TABLE',
    job_type        => 'PLSQL_BLOCK',
    job_action      => 'BEGIN db_admin.proc_update_gg_latency; END;',
    start_date      => trunc(sysdate, 'HH24'),
    repeat_interval => 'freq=minutely',
    end_date        => NULL,
    enabled         => TRUE
  );
end;
/
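
To confirm the job exists and is enabled with the one-minute interval, the standard scheduler dictionary view can be checked:

select job_name, enabled, state, repeat_interval, last_start_date
  from dba_scheduler_jobs
 where job_name = 'UPDATE_GG_LATENCY_TABLE';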

9. Check latency with the following query at the target:
 
SQL> select extr, pump, repl, update_time, round((sysdate - update_time) *24*60) latency_mins from gg_latency;

EXTR       PUMP       REPL       UPDATE_TIME          LATENCY_MINS
---------- ---------- ---------- -------------------- ------------
ECRUDR1    PCRUDR1    CRURDR1D   14-Apr-2015 12:46:00            1
ECRUDR1    PCRUDR1    CRURDR1B   14-Apr-2015 12:46:00            1
ECRUDR1    PCRUDR1    CRURDR1A   14-Apr-2015 12:46:00            1
ECRUDR2    PCRUDR2    CRURDR2D   14-Apr-2015 12:46:00            1
ECRUDR1    PCRUDR1    CRURDR1C   14-Apr-2015 12:46:00            1
ECRUDR1    PCRUDR1    CRURDR1    14-Apr-2015 12:46:00            1
ECRUDR2    PCRUDR2    CRURDR2H   14-Apr-2015 12:46:00            1
ECRUDR2    PCRUDR2    CRURDR2C   14-Apr-2015 12:46:00            1
ECRUDR2    PCRUDR2    CRURDR2N   14-Apr-2015 12:46:00            1
ECRUDR2    PCRUDR2    CRURDR2B   14-Apr-2015 12:46:00            1
ECRUDR2    PCRUDR2    CRURDR2A   14-Apr-2015 12:46:00            1
ECRUDR2    PCRUDR2    CRURDR2M   14-Apr-2015 12:46:00            1
ECRUDR2    PCRUDR2    CRURDR2G   14-Apr-2015 12:46:00            1
ECRUDR2    PCRUDR2    CRURDR2F   14-Apr-2015 12:46:00            1
ECRUDR3    PCRUDR3    CRURDR3    14-Apr-2015 12:46:00            1

15 rows selected.

Note: since the table is updated only once a minute, latency is reported at one-minute granularity.
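
Taking this one step further, the same table can drive an alert. A query along these lines (the 10-minute threshold is only an example) lists just the lagging paths; an empty result means everything is within bounds:

select extr, pump, repl,
       round((sysdate - update_time) * 24 * 60) latency_mins
  from gg_latency
 where update_time < sysdate - 10/1440   -- threshold: 10 minutes, adjust as needed
 order by latency_mins desc;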

Friday, March 13, 2015

Just another example - Writing efficient SQL with analytic function

I have a monitoring job set up to alert me when the buffer gets per execution of a SQL statement rise above a certain threshold. Today I received an alert as below:
=================================================
start :  13-MAR-15 02.00.23.885 PM
end   :  13-MAR-15 03.00.48.168 PM
snapid from     107851  to     107853
=================================================

********************************************************
list sql with buffer gets per execution > 100000
********************************************************


!!!! ##########  Expensive SQL found ####### !!!!                               
instance   : 1                                                                  
sql_id     : drdbm833ack3c                                                      
Buffer get : 260358                                                             
Execs      : 1                                                                  
BG/exec    : 260358                                                             
Gets/row   : 700                                                                
SQL TEXT   :                                                                    
SELECT p.rec_id, p.CREATE_DATE,                                                 
p.ordered_zzzyyy_type,p.qualified_zzzyyy_type,p.req_type,p.c_transid,p.status,p.
ERR_CODE,p.ERR_DESC,p.SVC_STATE,p.IS_DOWNGRADED,p.ETF_WAIVE_FLAG  FROM          
BL_XYZ_CHNG_PKG p WHERE p.req_type='R'  AND p.status || ''='IN'  AND            
p.jrs_indicator IN ('1','2','3')  and p.c_transid IN (SELECT MAX(C_TRANSID) FROM
BL_XYZ_CHNG_PKG  GROUP BY rec_id)                                               
 
...
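
The alert itself is driven by AWR snapshot deltas. A query along these lines - a sketch of the idea, not the actual monitoring job - would surface SQL with high buffer gets per execution between the two snapshots:

select s.instance_number,
       s.sql_id,
       sum(s.buffer_gets_delta) buffer_gets,
       sum(s.executions_delta) execs,
       round(sum(s.buffer_gets_delta) / nullif(sum(s.executions_delta), 0)) bg_per_exec
  from dba_hist_sqlstat s
 where s.snap_id >  107851
   and s.snap_id <= 107853
 group by s.instance_number, s.sql_id
having sum(s.buffer_gets_delta) / nullif(sum(s.executions_delta), 0) > 100000
 order by bg_per_exec desc;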
After checking the reported SQL statement, I realized it may be a classic example where an analytic function can save resources.

Original query -- full table scan of the same table twice, ~259K consistent gets:
SQL> SELECT p.rec_id,
  2         p.CREATE_DATE,
  3         p.ordered_zzzyyy_type,
  4         p.qualified_zzzyyy_type,
  5         p.req_type ,
  6         p.c_transid,p.status,
  7         p.ERR_CODE,p.ERR_DESC,p.SVC_STATE,p.IS_DOWNGRADED,p.ETF_WAIVE_FLAG
  8    FROM
  9         xyzu.BL_XYZ_CHNG_PKG p
 10  WHERE p.req_type='R'
 11     AND p.status || ''='IN'
 12     AND p.jrs_indicator IN ('1','2','3')
 13     and p.c_transid IN
 14          (SELECT MAX(C_TRANSID) FROM BL_XYZ_CHNG_PKG  GROUP BY rec_id)
 15  ;

242 rows selected.


Execution Plan
----------------------------------------------------------
Plan hash value: 3808252951

------------------------------------------------------------------------------------------------
| Id  | Operation            | Name            | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |                 |  1613 |   127K|       | 83112   (3)| 00:16:38 |
|*  1 |  HASH JOIN           |                 |  1613 |   127K|       | 83112   (3)| 00:16:38 |
|*  2 |   TABLE ACCESS FULL  | BL_XYZ_CHNG_PKG |  1613 |   107K|       | 25463   (3)| 00:05:06 |
|   3 |   VIEW               | VW_NSO_1        |  6443K|    79M|       | 57578   (2)| 00:11:31 |
|   4 |    HASH GROUP BY     |                 |  6443K|    86M|   148M| 57578   (2)| 00:11:31 |
|   5 |     TABLE ACCESS FULL| BL_XYZ_CHNG_PKG |  6459K|    86M|       | 25154   (2)| 00:05:02 |
------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - access("P"."C_TRANSID"="$nso_col_1")
   2 - filter("P"."REQ_TYPE"='R' AND "P"."STATUS"||''='IN' AND ("P"."JRS_INDICATOR"='1'
              OR "P"."JRS_INDICATOR"='2' OR "P"."JRS_INDICATOR"='3'))


Statistics
----------------------------------------------------------
         95  recursive calls
          0  db block gets
     259618  consistent gets
     267154  physical reads
          0  redo size
      10348  bytes sent via SQL*Net to client
        261  bytes received via SQL*Net from client
          4  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
        242  rows processed

Rewritten with the rank() over (partition by ...) construct -- one full table scan, ~129K consistent gets:

SQL> SELECT rec_id,
  2         CREATE_DATE,
  3         ordered_zzzyyy_type,
  4         qualified_zzzyyy_type,
  5         req_type ,
  6         c_transid,status,
  7         ERR_CODE,ERR_DESC,SVC_STATE,IS_DOWNGRADED,ETF_WAIVE_FLAG
  8  from
  9  (
 10     SELECT p.rec_id,
 11            p.CREATE_DATE,
 12            p.ordered_zzzyyy_type,
 13            p.qualified_zzzyyy_type,
 14            p.req_type ,
 15            p.c_transid,p.status,
 16            p.ERR_CODE,p.ERR_DESC,p.SVC_STATE,p.IS_DOWNGRADED,p.ETF_WAIVE_FLAG,
 17            rank() over ( partition by rec_id order by c_transid desc ) rank
 18       FROM
 19            xyzu.BL_XYZ_CHNG_PKG p
 20     WHERE p.req_type='R'
 21        AND p.status || ''='IN'
 22        AND p.jrs_indicator IN ('1','2','3')
 23  )  A
 24  where rank=1;

242 rows selected.


Execution Plan
----------------------------------------------------------
Plan hash value: 3870177004

--------------------------------------------------------------------------------------------
| Id  | Operation                | Name            | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT         |                 |  1613 |   546K| 25464   (3)| 00:05:06 |
|*  1 |  VIEW                    |                 |  1613 |   546K| 25464   (3)| 00:05:06 |
|*  2 |   WINDOW SORT PUSHED RANK|                 |  1613 |   107K| 25464   (3)| 00:05:06 |
|*  3 |    TABLE ACCESS FULL     | BL_XYZ_CHNG_PKG |  1613 |   107K| 25463   (3)| 00:05:06 |
--------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("RANK"=1)
   2 - filter(RANK() OVER ( PARTITION BY "REC_ID" ORDER BY
              INTERNAL_FUNCTION("C_TRANSID") DESC )<=1)
   3 - filter("P"."REQ_TYPE"='R' AND "P"."STATUS"||''='IN' AND
              ("P"."JRS_INDICATOR"='1' OR "P"."JRS_INDICATOR"='2' OR "P"."JRS_INDICATOR"='3'))


Statistics
----------------------------------------------------------
          1  recursive calls
          0  db block gets
     129827  consistent gets
     127794  physical reads
          0  redo size
      10184  bytes sent via SQL*Net to client
        261  bytes received via SQL*Net from client
          4  SQL*Net roundtrips to/from client
          1  sorts (memory)
          0  sorts (disk)
        242  rows processed
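
Before letting the rewrite replace the original, it's worth confirming the two queries return the same rows. A quick sanity check on the key columns (run the reverse direction as well; zero rows are expected in both):

select rec_id, c_transid
  from xyzu.bl_xyz_chng_pkg p
 where p.req_type = 'R'
   and p.status || '' = 'IN'
   and p.jrs_indicator in ('1','2','3')
   and p.c_transid in (select max(c_transid) from xyzu.bl_xyz_chng_pkg group by rec_id)
minus
select rec_id, c_transid
  from ( select p.rec_id, p.c_transid,
                rank() over (partition by p.rec_id order by p.c_transid desc) rnk
           from xyzu.bl_xyz_chng_pkg p
          where p.req_type = 'R'
            and p.status || '' = 'IN'
            and p.jrs_indicator in ('1','2','3') )
 where rnk = 1;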

Friday, March 06, 2015

Killing DBSNMP sessions not easy

We had a performance issue on a production database today. The CPU utilization went up. When I logged into the server and tried to log into the database as sysdba, I got an ORA-00020 error:
$sqlplus "/ as sysdba"

SQL*Plus: Release 11.2.0.2.0 Production on Fri Mar 6 10:55:10 2015

Copyright (c) 1982, 2010, Oracle.  All rights reserved.

ERROR:
ORA-00020: maximum number of processes (700) exceeded

Another DBA reported that the session count for the DBSNMP user had gone up sharply. This user is used by OEM. I stopped the OEM agent and he killed all DBSNMP sessions at the database level. However, almost an hour later we still saw a high DBSNMP session count:
Username           COUNT(*)
---------------- ----------
DBSNMP                  352
XXXX100                   1
XXXXSELECT                1
XXXXUSER                232
IVIEW2XXXX               25
ORACLE                    5
XX09195                   1
XXX6621                   6
XXX3294                   1
XXDBMON                   1
                         36
Note: some usernames are masked to protect the innocent.
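
The per-user counts above come from a simple group-by on V$SESSION, something like:

select username, count(*)
  from v$session
 group by username
 order by count(*) desc;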

Of course, we could see all the DBSNMP sessions marked as "KILLED":
SQL> select username, status from v$session where username='DBSNMP';

Username         Status
---------------- --------
DBSNMP           KILLED
DBSNMP           KILLED
DBSNMP           KILLED
DBSNMP           KILLED
DBSNMP           KILLED
DBSNMP           KILLED
DBSNMP           KILLED
DBSNMP           KILLED
DBSNMP           KILLED
DBSNMP           KILLED
DBSNMP           KILLED
DBSNMP           KILLED
DBSNMP           KILLED
DBSNMP           KILLED
DBSNMP           KILLED
DBSNMP           KILLED
...

So it appeared to be a good idea to kill the processes at the OS level. I fired off the following script, which I usually use to generate the kill commands:
set pages 0
spool /tmp/kill_os

select 'kill -9 '|| p.spid
from v$process p , v$session s
where p.background is null
and s.paddr = p.addr (+)
and p.spid is not null
and s.username='DBSNMP'
order by p.spid
/
Guess what: to my surprise, it returned nothing! I then got the count of the "LOCAL=NO" processes as below:
$ps -ef |grep "LOCAL=NO" | wc -l

659

Now the problem was how to identify, among those 659 processes, the PIDs belonging to the killed DBSNMP sessions. What I finally did was:

1. Load the PIDs of all those "LOCAL=NO" processes into a table db_admin.pid that has a single NUMBER column called ID (a minimal sketch of this step appears after step 2):
ps -ef | grep "LOCAL=NO" | grep -v grep | awk '{print "insert into db_admin.pid values(" $2 ");"}'

2. Run the following query to generate the kill commands for the processes that are no longer attached to any database session:

select 'kill -9 ' ||  id from db_admin.pid x 
where not exists (
select 1 
from v$session s, v$process p
where s.paddr = p.addr (+)
  and p.spid = x.id
)
order by id
/
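
For completeness, here is a minimal sketch of step 1: the staging table, plus the generated inserts spooled to a file and run in SQL*Plus (the file name and path are only illustrative):

-- one-column staging table for the OS PIDs
create table db_admin.pid (id number);

-- generate the inserts into a file, then run the file and commit:
--   ps -ef | grep "LOCAL=NO" | grep -v grep | awk '{print "insert into db_admin.pid values(" $2 ");"}' > /tmp/load_pid.sql
@/tmp/load_pid.sql
commit;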

This trick worked. To prevent future trouble, we've also set up a profile limit on the number of sessions the DBSNMP user can have. We still don't know what caused the high session count, though.
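
For reference, the profile limit can be put in place with something along these lines (the cap of 50 is only an example; resource limits in profiles are enforced only when RESOURCE_LIMIT is TRUE):

create profile dbsnmp_profile limit sessions_per_user 50;
alter user dbsnmp profile dbsnmp_profile;

-- profile resource limits take effect only with this set
alter system set resource_limit = true scope=both;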

Just Learned: About listener.ora

Recently I have been planning to upgrade a 10.2.0.4 Oracle database to 12.1.0.2. For that, I created a test database called minidb to play with the upgrade steps. At the end, I needed to bring up a listener for it. This is what I've just learned: did you know we can start a listener without a listener.ora? The listener.ora file stores listener configuration information, but because the configuration parameters have default values, it is possible to start and use a listener with no configuration at all. The output below shows that I have no listener.ora file in the usual location ($ORACLE_HOME/network/admin) and no listener up and running:
exxxxxspd02 /apps/opt/oracle/product/12.1.0/network/admin [minidb] > ls -l
total 7
drwxr-xr-x   2 oracle   dba            6 Mar  6 08:41 samples
-rw-r--r--   1 oracle   dba          373 Mar 16  2014 shrept.lst
-rw-r--r--   1 oracle   dba          127 Mar  6 08:18 tnsnames.ora
exxxxxspd02 /apps/opt/oracle/product/12.1.0/network/admin [minidb] > lsnrctl status

LSNRCTL for Solaris: Version 12.1.0.2.0 - Production on 06-MAR-2015 08:42:20

Copyright (c) 1991, 2014, Oracle.  All rights reserved.

Connecting to (ADDRESS=(PROTOCOL=tcp)(HOST=)(PORT=1521))
TNS-12541: TNS:no listener
 TNS-12560: TNS:protocol adapter error
  TNS-00511: No listener
   Solaris Error: 146: Connection refused

Now I started up a listener:
exxxxxspd02 /apps/opt/oracle/product/12.1.0/network/admin [minidb] > lsnrctl start

LSNRCTL for Solaris: Version 12.1.0.2.0 - Production on 06-MAR-2015 08:44:59

Copyright (c) 1991, 2014, Oracle.  All rights reserved.

Starting /apps/opt/oracle/product/12.1.0/bin/tnslsnr: please wait...

TNSLSNR for Solaris: Version 12.1.0.2.0 - Production
Log messages written to /apps/opt/oracle/diag/tnslsnr/exxxxxspd02/listener/alert/log.xml
Listening on: (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=exxxxxspd02)(PORT=1521)))

Connecting to (ADDRESS=(PROTOCOL=tcp)(HOST=)(PORT=1521))
STATUS of the LISTENER
------------------------
Alias                     LISTENER
Version                   TNSLSNR for Solaris: Version 12.1.0.2.0 - Production
Start Date                06-MAR-2015 08:44:59
Uptime                    0 days 0 hr. 0 min. 0 sec
Trace Level               off
Security                  ON: Local OS Authentication
SNMP                      OFF
Listener Log File         /apps/opt/oracle/diag/tnslsnr/exxxxxspd02/listener/alert/log.xml
Listening Endpoints Summary...
  (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=exxxxxspd02)(PORT=1521)))
The listener supports no services
The command completed successfully
Notice that by default this listener listens on port 1521 and supports no services. I can register my minidb with this local listener through a mechanism called "Dynamic Service Registration". For that to happen, I need to configure an initialization parameter called LOCAL_LISTENER:
exxxxxspd02 /apps/opt/oracle/product/12.1.0/network/admin [minidb] > sqlplus / as sysdba

SQL*Plus: Release 12.1.0.2.0 Production on Fri Mar 6 08:50:13 2015

Copyright (c) 1982, 2014, Oracle.  All rights reserved.


Connected to:
Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options

SQL> show parameter local_list

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
local_listener                       string
SQL> alter system set local_listener='(ADDRESS=(PROTOCOL=tcp)(HOST=)(PORT=1521))' scope=both;

System altered.

SQL> ho lsnrctl status

LSNRCTL for Solaris: Version 12.1.0.2.0 - Production on 06-MAR-2015 08:51:47

Copyright (c) 1991, 2014, Oracle.  All rights reserved.

Connecting to (ADDRESS=(PROTOCOL=tcp)(HOST=)(PORT=1521))
STATUS of the LISTENER
------------------------
Alias                     LISTENER
Version                   TNSLSNR for Solaris: Version 12.1.0.2.0 - Production
Start Date                06-MAR-2015 08:44:59
Uptime                    0 days 0 hr. 6 min. 48 sec
Trace Level               off
Security                  ON: Local OS Authentication
SNMP                      OFF
Listener Log File         /apps/opt/oracle/diag/tnslsnr/exxxxxspd02/listener/alert/log.xml
Listening Endpoints Summary...
  (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=exxxxxspd02)(PORT=1521)))
Services Summary...
Service "minidb" has 1 instance(s).
  Instance "minidb", status READY, has 1 handler(s) for this service...
The command completed successfully

What happens in the background is that, by default, the LREG background process registers service information with its local listener on the default local address, TCP/IP port 1521.
exxxxxspd02 /apps/opt/oracle/product/12.1.0/network/admin [minidb] > ps -ef |grep lreg |grep -v grep
  oracle 28186     1   0 08:24:15 ?           0:00 ora_lreg_minidb
To ensure service registration works properly, the initialization parameter file should contain the following parameters:

  • SERVICE_NAMES for the database service name
  • INSTANCE_NAME for the instance name
  • LOCAL_LISTENER for the local listener

Note that SERVICE_NAMES defaults to the global database name, a name comprising the DB_NAME and DB_DOMAIN parameters in the initialization parameter file, and that INSTANCE_NAME defaults to the Oracle system identifier (SID).
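
One last note: if a newly configured service does not show up in lsnrctl status right away, registration can be forced from SQL*Plus instead of waiting for the next periodic update from LREG:

alter system register;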