Monday, June 02, 2008

Trouble shooting a CPU usage spike issue

Several months ago I have trouble-shooted an CPU usage spike issue. It turns out that the CBO for some reason choose an index with very poor selectivity to access a large table. By not using this index and using other index the sql get executed much much faster. The environment is 9.2.0.8.

The detailed trouble-shooting process is described as follows:

1. Symptom
In the past few days, mointoring script sends out aterts about CPU
usage reaching 100% during short period of time each day.

2. Statspack analysis

Indetified the problem period, generate statspack report for the period.
For example, 2/13/08 13:00 ~ 13:30, identified the top SQL by
buffer get or physical read is:


Buffer Gets Executions Gets per Exec %Total Time (s) Time (s) Hash Value
--------------- ------------ -------------- ------ -------- --------- ----------
20,273,427 70 289,620.4 38.0 6193.42 26897.15 3530230280
Module: JDBC Thin Client
select m.order_id, m.dsl_tn, m.queue_entered_time, m.service_re
ady_date, m.abc_user, m.hold_date, m.fallout_code, DECODE(m.isp
_lob,1,'Consumer', 2,'Business') isp_lob, m.state, s.ilec_mon, s
.pots_order_number, s.ilec_pon, l.ISP_NAME FROM schemaA.tableA
m , schemaA.tableS s , tableL



 


3. Otain the full text of the sql
Based on the hash value, obtain the full sql from v$sqltext


4. Format the sql
Using sqlReviwer tool to get formatted sql as follows:


SELECT
m.order_id,
m.dsl_tn,
m.queue_entered_time,
m.service_ready_date,
m.abc_user,
m.hold_date,
m.fallout_code,
DECODE(m.isp_lob,1,'Consumer', 2,'Business') isp_lob,
m.state,
s.ilec_mon,
s.pots_order_number,
s.ilec_pon,
l.ISP_NAME
FROM schemaA.tableA m ,
schemaA.tableS s ,
tableL l
WHERE m.map_id = :1
AND m.map_version = 1
AND m.curr_status_id = :2
AND m.abc_user = 'n/a'
AND m.hold_date IS null
AND s.order_id = m.order_id
AND s.element _id = m.element_id
AND nvl(m.ISP_ID,500) = l.ISP_ID
AND m.isp_lob != :3
AND
(
(
m.service_ready_date > sysdate
)
OR
(
m.service_ready_date <= sysdate ) )
AND ( m .ABC_LOCATION_INDICATOR IN ('E','FE')
OR m.ABC_LOCATION_INDICATOR IN ('W','FW') )
AND m.product_category IN ( 0,1,2,3,5,6,7,8,9,10,11,13,14,15,16,17,18,19,20,21,
22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,40,42,43,44 )
AND nvl(m.isp_id,500 ) = 500



 

5. SQL Trace and TKPROF

Fill in the values for bind variable :1, :2, :3
tkprof report:


call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 1 0.00 0.00 0 0 0 0
Execute 1 0.01 0.00 0 0 0 0
Fetch 17 140.66 140.80 307342 308814 0 234
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 19 140.67 140.81 307342 308814 0 234

Misses in library cache during parse: 0
Optimizer goal: CHOOSE
Parsing user id: 5

Rows Row Source Operation
------- ---------------------------------------------------
234 TABLE ACCESS BY INDEX ROWID ABC_SERVICE_DETAILS
469 NESTED LOOPS
234 HASH JOIN
234 TABLE ACCESS BY INDEX ROWID ABC_MAIN_CURRENT
267 BITMAP CONVERSION TO ROWIDS
1 BITMAP AND
1 BITMAP CONVERSION FROM ROWIDS
267 INDEX RANGE SCAN SMC_MAP_CURRSTAT_NUK (object id 142322)
31 BITMAP CONVERSION FROM ROWIDS
140900562 INDEX RANGE SCAN SMC_ABC_USER_NUK (object id 115734)
386 TABLE ACCESS FULL ABC_ISP_LOOKUP
234 INDEX RANGE SCAN SSD_ORDER_ID_ELEMENT_ID_NUK (object id 115773)


Elapsed times include waiting on following events:
Event waited on Times Max. Wait Total Waited
---------------------------------------- Waited ---------- ------------
SQL*Net message to client 17 0.00 0.00
db file sequential read 307340 0.26 26.40
latch free 1 0.01 0.01
db file scattered read 1 0.00 0.00
SQL*Net message from client 17 5.83 8.84
********************************************************************************




Notes:
- condition: AND m.abc_user = 'n/a', caused the problem
- Oracle used the index range scan
140900562 INDEX RANGE SCAN SMC_ABC_USER_NUK (object id 115734)

The selectivity of the index scan is extremly bad.



6. Rewrite the SQL using materialize hint as follows:


WITH
temp_tab AS
(
SELECT /*+ materialize */
m.order_id,
m.dsl_tn,
m.queue_entered_time,
m.service_ready_date,
m.abc_user,
m.hold_date,
m.fallout_code,
DECODE(m.isp_lob,1,'Consumer', 2,'Business') isp_lob,
m.state,
s.ilec_mon,
s.pots_order_number,
s.ilec_pon,
l.ISP_NAME
FROM schemaA.tableA m,
schemaA.tableS s,
schemaA.tableL l
WHERE m.map_id = 8 AND m.map_version = 1 AND m.curr_status_id = 3
-- AND m.abc_user = 'n/a'
AND m.hold_date IS null
AND s.order_id = m.order_id
AND s.element_id = m.element_id
AND nvl(m.ISP_ID,500) = l.ISP_ID

AND m.isp_lob != 2
AND
(
(
m.service_ready_date > sysdate
)
OR
(
m.service_ready_date <= sysdate ) )
AND ( m.ABC_LOCATION_INDICATOR IN ('E','FE')
OR m.ABC_LOCATION_INDICATOR IN ('W','FW') )
AND m.product_category IN ( 0,1,2,3,5,6,7,8,9,10,11,13,14,15,
16,17,18,19,20,21 ,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,40,42,43,44 )
AND nvl(m.isp_id,500 ) = 500 )
select * from temp_tab
where abc_user='n/a'
/

tkprof:
=======
call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 1 0.06 0.08 14 33 2 0
Execute 1 0.09 0.18 35 1268 2 0
Fetch 17 0.01 0.00 3 23 1 237
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 19 0.16 0.27 52 1324 5 237


Rows Row Source Operation
------- ---------------------------------------------------
237
237 VIEW
237 TABLE ACCESS FULL SYS_TEMP_4254950916


NOtes: in the tkprof report, it shows a global temp table is created and rows are
inserted into this table:

INSERT /*+ APPEND BYPASS_RECURSIVE_CHECK */ INTO
"SYS"."SYS_TEMP_0FD9D6604_6B0E3DDA" SELECT /*+ */ "M"."ORDER_ID",
"M"."DSL_TN","M"."QUEUE_ENTERED_TIME","M"."SERVICE_READY_DATE",
"M"."ABC_USER","M"."HOLD_DATE","M"."FALLOUT_CODE",DECODE("M"."ISP_LOB",
:"SYS_B_00",:"SYS_B_01",:"SYS_B_02",:"SYS_B_03"),"M"."STATE","S"."ILEC_MON",
"S"."POTS_ORDER_NUMBER","S"."ILEC_PON","L"."ISP_NAME" FROM
"SCHEMAA"."TABLEA" "M","SCHEMAA"."TABLES" "S",
"SCHEMAA"."TABLEL" "L" WHERE "M"."MAP_ID"=:"SYS_B_04" AND
"M"."MAP_VERSION"=:"SYS_B_05" AND "M"."CURR_STATUS_ID"=:"SYS_B_06" AND
"M"."HOLD_DATE" IS NULL AND "S"."ORDER_ID"="M"."ORDER_ID" AND
"S"."ELEMENT_ID"="M"."ELEMENT_ID" AND "L"."ISP_ID"=NVL("M"."ISP_ID",
:"SYS_B_07") AND "M"."ISP_LOB"<>:"SYS_B_08" AND
("M"."SERVICE_READY_DATE">SYSDATE@! OR "M"."SERVICE_READY_DATE"<=SYSDATE@!)
AND ("M"."ABC_LOCATION_INDICATOR"=:"SYS_B_09" OR
"M"."ABC_LOCATION_INDICATOR"=:"SYS_B_10" OR "M"."SSP_LOCATION_INDICATOR"=
:"SYS_B_11" OR "M"."ABC_LOCATION_INDICATOR"=:"SYS_B_12") AND
("M"."PRODUCT_CATEGORY"=:"SYS_B_13" OR "M"."PRODUCT_CATEGORY"=:"SYS_B_14"
OR "M"."PRODUCT_CATEGORY"=:"SYS_B_15" OR "M"."PRODUCT_CATEGORY"=:"SYS_B_16"
OR "M"."PRODUCT_CATEGORY"=:"SYS_B_17" OR "M"."PRODUCT_CATEGORY"=:"SYS_B_18"
OR "M"."PRODUCT_CATEGORY"=:"SYS_B_19" OR "M"."PRODUCT_CATEGORY"=:"SYS_B_20"
OR "M"."PRODUCT_CATEGORY"=:"SYS_B_21" OR "M"."PRODUCT_CATEGORY"=:"SYS_B_22"
OR "M"."PRODUCT_CATEGORY"=:"SYS_B_23" OR "M"."PRODUCT_CATEGORY"=:"SYS_B_24"
OR "M"."PRODUCT_CATEGORY"=:"SYS_B_25" OR "M"."PRODUCT_CATEGORY"=:"SYS_B_26"
OR "M"."PRODUCT_CATEGORY"=:"SYS_B_27" OR "M"."PRODUCT_CATEGORY"=:"SYS_B_28"
OR "M"."PRODUCT_CATEGORY"=:"SYS_B_29" OR "M"."PRODUCT_CATEGORY"=:"SYS_B_30"
OR "M"."PRODUCT_CATEGORY"=:"SYS_B_31" OR "M"."PRODUCT_CATEGORY"=:"SYS_B_32"
OR "M"."PRODUCT_CATEGORY"=:"SYS_B_33" OR "M"."PRODUCT_CATEGORY"=:"SYS_B_34"
OR "M"."PRODUCT_CATEGORY"=:"SYS_B_35" OR "M"."PRODUCT_CATEGORY"=:"SYS_B_36"
OR "M"."PRODUCT_CATEGORY"=:"SYS_B_37" OR "M"."PRODUCT_CATEGORY"=:"SYS_B_38"
OR "M"."PRODUCT_CATEGORY"=:"SYS_B_39" OR "M"."PRODUCT_CATEGORY"=:"SYS_B_40"
OR "M"."PRODUCT_CATEGORY"=:"SYS_B_41" OR "M"."PRODUCT_CATEGORY"=:"SYS_B_42"
OR "M"."PRODUCT_CATEGORY"=:"SYS_B_43" OR "M"."PRODUCT_CATEGORY"=:"SYS_B_44"
OR "M"."PRODUCT_CATEGORY"=:"SYS_B_45" OR "M"."PRODUCT_CATEGORY"=:"SYS_B_46"
OR "M"."PRODUCT_CATEGORY"=:"SYS_B_47" OR "M"."PRODUCT_CATEGORY"=:"SYS_B_48"
OR "M"."PRODUCT_CATEGORY"=:"SYS_B_49" OR "M"."PRODUCT_CATEGORY"=:"SYS_B_50"
OR "M"."PRODUCT_CATEGORY"=:"SYS_B_51") AND NVL("M"."ISP_ID",:"SYS_B_52")=
:"SYS_B_53"


call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 1 0.01 0.00 0 0 0 0
Execute 1 0.08 0.17 35 1268 3 237
Fetch 0 0.00 0.00 0 0 0 0
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 2 0.09 0.17 35 1268 3 237

Misses in library cache during parse: 1
Optimizer goal: CHOOSE
Parsing user id: 5 (recursive depth: 1)

Rows Row Source Operation
------- ---------------------------------------------------
1 LOAD AS SELECT
237 TABLE ACCESS BY INDEX ROWID TABLES
475 NESTED LOOPS
237 HASH JOIN
386 TABLE ACCESS FULL TABLEL
237 TABLE ACCESS BY INDEX ROWID TABLEA
270 INDEX RANGE SCAN SMC_MAP_CURRSTAT_NUK (object id 142322)
237 INDEX RANGE SCAN SSD_ORDER_ID_ELEMENT_ID_NUK (object id 115773)




 


7. App team comes up with another idea:
Instead of using:
AND m.abc_user = 'n/a'
they change it to:
AND sub_string(m.abc_user, 1,1) not in ('V', 'Z')


tkprof:

call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 1 0.01 0.00 0 0 0 0
Execute 1 0.00 0.00 0 0 0 0
Fetch 22 0.24 0.31 757 1665 0 302
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 24 0.25 0.32 757 1665 0 302

Misses in library cache during parse: 1
Optimizer goal: CHOOSE
Parsing user id: 99

Rows Row Source Operation
------- ---------------------------------------------------
302 TABLE ACCESS BY INDEX ROWID TABLES
605 NESTED LOOPS
302 HASH JOIN
302 TABLE ACCESS BY INDEX ROWID TABLEA
351 INDEX RANGE SCAN SMC_MAP_CURRSTAT_NUK (object id 142322)
386 TABLE ACCESS FULL TABLEL
302 INDEX RANGE SCAN SSD_ORDER_ID_ELEMENT_ID_NUK (object id 115773)


No comments: