Wednesday, September 02, 2009

Hit Bug 5888835 on a new Sun T5240 Server and somthing about dreadful latch

I copied a production database (9.2.0.8) to a new server at another data center. The database looked OK after open resetlogs; however, whenever I flushed the shared pool or shutdown the database, I would encounter the ORA-600 [504] error, i.e.

ORA-00600: internal error code, arguments: [504], [0x38006AC18], [160], [7], [shared pool], [2], [0], [0x38006AB28]


Per Oracle support, I hit a bug (Bug 5888835/5508574 ML428226.1 ), which applies to 9.2.0.8 to 10.2.0.3.

Based on ML 428226.1, there could be two causes of this bug:

(1) The latch directory size exceeds 255 when _kgl_latch_count > 31.
(2) Even when the _kgl_latch_count is equal to 0, if the cpu_count is >=32 the bug still applies.

This is due to the default value of _kgl_latch_count is calculated as next prime number after the value returned by CPU_COUNT. So, this bug could still apply if the cpu_count=32 as the _kgl_latch_count would be calculated to the next prime number that would be 37.


So this is about the dreadful latch, which remind me of an unpleasant interview experience, in which I was blind when was asked about this topic.

I did some searching and reading to understand further, below is a Q&A section about some basics:

Q: What is _kgl_latch_count?
A:
On this website ( http://www.ixora.com.au/q+a/library.htm), I found:

It sets the number of child library cache latches. The default is the least prime number greater than or equal to cpu_count. The maximum is 67. It can safely be increased to combat library cache latch contention, as long as you stick to prime numbers. However it is only effective if the activity across the existing child library cache latches is evenly distributed as shown in V$LATCH_CHILDREN


Q: What is parent latch and child latch?
A: A latch can be defined as a latch set. A latch set has a parent latch and several child latches. Parent latches are allocated statistically. Child latches are allocated dynamically. Both parent and child latches share the same latch name. The latches in a set are "chained" together to facilitate statistics gathering. Apparently, library cache latch is defined as a latch set.

Q: What is library cache ?
A:
The primary function of the library cache is to store shared cursors together with their parse trees and execution plans among other objects. The literary cache is structured as a hash table that is accessible through an array of hash buckets. Each hash bucket gives access to a linked list of library object handles. Library cache objects are made of several independent memory heaps and are accessed through the object handles.

Q: What is library cache latch?
A: When a library object is accessed, first, the lock is required on the handle, then the necessary object heaps are pinned. A library cache latch is acquired before the lock request and is released after it. Multiple library cache latch exist, and each one protects a range of hash buckets and the latch covers all associated heaps.

Some 'good' news about latch are also reached to me as the result of this searching and reading, the latch in Oracle is or will be gone! See:

Library cache latches gone in Oracle 11g

Library cache latch has gone?


OK, enough for latch for now. Hopefully, I could say something about latch if I would be interviewed again about this topic. Let's return to the bug, it can be fixed by:

1. set CPU_COUNT <=60.
I tested in this case, no ORA-600 error when shutdown or flush shared pool. Obviously we have 128 CPUs or Oracle thinks so, this setting will not take advantage of it.

2. Upgrade to at least 10.2.0.4

3. Apply the patch.
This was what we finally did. I tested that the patch worked as advertised.


As side notes, the following query can be used to check the value of _kgl_latch_count:

select a.ksppinm aa, b.ksppstvl bb
from x$ksppi a, x$ksppsv b
where a.indx=b.indx
and a.ksppinm like '%latch%count%';


If asked why we have 128 CPUs, I will refer to http://www.sun.com/servers/coolthreads/t5240/. It says this server can up to 128 compute threads. Oracle sees it as 128 CPUs.

No comments: