Wednesday, May 26, 2010

Tuning by cardinality feedback and understanding of join cardinality and transitive closure

What came out of my recent experience tuning a 13-table join query is a deeper understanding of "Tuning by Cardinality Feedback" (comparing the optimizer's estimated cardinalities against the actual row counts and correcting the plan where they diverge), join cardinality and transitive closure. The problem with that 13-table join SQL is that Oracle always predicts the join cardinality to be 1, so it chooses nested loop joins, whereas hash joins bring the consistent gets down from 400K to 6K.

First of all, let's see where the cardinality estimation goes wrong. Below are the estimated cardinalities from each operation in the first 4-table join: PACKAGE_MARKETS -> MARKETS -> PACKAGE_CATEGORY -> PACKAGE_PRODUCTS

10    9       NESTED LOOPS (Cost=10 Card=1 Bytes=522)
  11   10         NESTED LOOPS (Cost=9 Card=1 Bytes=325)
  12   11           NESTED LOOPS (Cost=4 Card=1 Bytes=229)
  13   12             NESTED LOOPS (Cost=3 Card=1 Bytes=224)
  14   13               NESTED LOOPS (Cost=2 Card=1 Bytes=104)
  15   14                 INDEX (UNIQUE SCAN) OF 'PK_PACKAGE_MARKETS' (UNIQUE) (Cost=1 Card=1 Bytes=12)
  16   14                 TABLE ACCESS (BY INDEX ROWID) OF 'MARKETS' (Cost=1 Card=1 Bytes=92)
  17   16                   INDEX (UNIQUE SCAN) OF 'PK_MARKETS' (UNIQUE)
  18   13               TABLE ACCESS (BY INDEX ROWID) OF 'PACKAGES' (Cost=1 Card=1 Bytes=120)
  19   18                 INDEX (UNIQUE SCAN) OF 'PK_PACKAGES' (UNIQUE)
  20   12             TABLE ACCESS (BY INDEX ROWID) OF 'PACKAGE_CATEGORY' (Cost=1 Card=1 Bytes=5)
  21   20               INDEX (UNIQUE SCAN) OF 'PK_PACKAGE_CATEGORY' (UNIQUE)
  22   11           TABLE ACCESS (BY INDEX ROWID) OF 'PACKAGE_PRODUCTS' (Cost=5 Card=1 Bytes=96)
  23   22             INDEX (RANGE SCAN) OF 'PK_PACKAGE_PRODUCTS' (UNIQUE) (Cost=2 Card=6)
  24   10         TABLE ACCESS (BY INDEX ROWID) OF 'PRODUCTS' (Cost=1 Card=1 Bytes=197)


Below are the real cardinalities obtained from SQL_TRACE/TKPROF. We can see that the real cardinality coming out of the PACKAGE_PRODUCTS table is 2725, while the estimate shown above is 1. This is where the whole execution plan goes "wrong".

2725         NESTED LOOPS
      1          NESTED LOOPS
      1           NESTED LOOPS
      1            NESTED LOOPS
      1             INDEX UNIQUE SCAN PK_PACKAGE_MARKETS (object id 423961)
      1             TABLE ACCESS BY INDEX ROWID MARKETS
      1              INDEX UNIQUE SCAN PK_MARKETS (object id 423924)
      1            TABLE ACCESS BY INDEX ROWID PACKAGES
      1             INDEX UNIQUE SCAN PK_PACKAGES (object id 423954)
      1           TABLE ACCESS BY INDEX ROWID PACKAGE_CATEGORY
      1            INDEX UNIQUE SCAN PK_PACKAGE_CATEGORY (object id 423956)
   2725          TABLE ACCESS BY INDEX ROWID PACKAGE_PRODUCTS
   2811           INDEX RANGE SCAN PK_PACKAGE_PRODUCTS (object id 423963)
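
For reference, here is a minimal sketch of how such row-source counts can be captured in 9i (the 10046 event level shown is one common choice, not necessarily what was used here). The actual row counts come from the STAT lines in the raw trace, which tkprof formats into the "Rows" column:

alter session set timed_statistics = true;
alter session set events '10046 trace name context forever, level 8';
-- ... run the problem query ...
alter session set events '10046 trace name context off';
-- then on the server: tkprof <udump_trace_file> report.txt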

What I did was add hints to force Oracle to choose hash joins instead of nested loop joins for the remaining 9 tables, roughly as sketched below.
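
The aliases here are hypothetical (the original FROM list and predicates stay unchanged); the point is simply to pin the join method with USE_HASH:

select /*+ use_hash(t5 t6 t7 t8 t9 t10 t11 t12 t13) */ *
  from ...   -- the original 13-table join, with t5..t13 standing for the remaining tables
 where ... ;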

This exercise also aroused my interest in understanding how join cardinality is calculated, so I re-read Chapter 10 of Jonathan Lewis's Cost-Based Oracle Fundamentals. I found what he says is true (as always :-)). Below is a test case output for a two-table join SQL that involves "transitive closure" - there is a filter on the join column:
SQL>select * from v$version;

BANNER
----------------------------------------------------------------
Oracle9i Enterprise Edition Release 9.2.0.8.0 - 64bit Production
PL/SQL Release 9.2.0.8.0 - Production
CORE 9.2.0.8.0 Production
TNS for Solaris: Version 9.2.0.8.0 - Production
NLSRTL Version 9.2.0.8.0 - Production

SQL>
SQL>-- from the counts we know the actual join cardinality should be 86
SQL>select count(*) from packages where package_id='Y9995';

  COUNT(*)
----------
         1

SQL>select count(*) from package_products where package_id='Y9995';

  COUNT(*)
----------
        86

SQL>
SQL>-- JL mentioned in his book "Cost-Based Oracle Fundamentals" that
SQL>-- "query_rewrite_enabled" has an impact on join predicate removal prior to 10g.
SQL>-- I confirmed it is true in our 9i db. I also tested with rewrite/norewrite hints
SQL>-- instead of the session-level "query_rewrite_enabled" parameter. No impact
SQL>-- from the hints.
SQL>
SQL>
SQL>show parameter query_rewrite_enabled

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
query_rewrite_enabled                string      TRUE
SQL>alter session set query_rewrite_enabled=true;

Session altered.

SQL>
SQL>
SQL>-- query_rewrite_enabled=true
SQL>explain plan for
  2  select  *
  3    from  packages a,   -- NROWS 2656
  4        package_products b   -- NROWS 284247
  5   where a.package_id   -- NDV  2656 PK
  6       = b.package_id   -- NDV  2610 NOT NULL
  7   and a.package_id = 'Y9995'
  8  ;

Explained.

SQL>select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
-------------------------------------------------------------------------------------

-------------------------------------------------------------------------------------
| Id  | Operation                    |  Name                | Rows  | Bytes | Cost  |
-------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |                      |     1 |   240 |     5 |
|   1 |  NESTED LOOPS                |                      |     1 |   240 |     5 |
|   2 |   TABLE ACCESS BY INDEX ROWID| PACKAGES             |     1 |   144 |     2 |
|*  3 |    INDEX UNIQUE SCAN         | PK_PACKAGES          |     1 |       |     1 |
|   4 |   TABLE ACCESS BY INDEX ROWID| PACKAGE_PRODUCTS     |     1 |    96 |     3 |
|*  5 |    INDEX RANGE SCAN          | PK_PACKAGE_PRODUCTS  |     1 |       |     2 |
-------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - access("A"."PACKAGE_ID"='Y9995')
   5 - access("B"."PACKAGE_ID"='Y9995')
       filter("A"."PACKAGE_ID"="B"."PACKAGE_ID")

Note: cpu costing is off

20 rows selected.

SQL>
SQL>alter session set query_rewrite_enabled=false;

Session altered.

SQL>
SQL>-- query_rewrite_enabled=false
SQL>explain plan for
  2  select  *
  3    from  packages a,   -- NROWS 2656
  4        package_products b   -- NROWS 284247
  5   where a.package_id   -- NDV  2656 PK
  6       = b.package_id   -- NDV  2610 NOT NULL
  7   and a.package_id = 'Y9995'
  8  ;

Explained.

SQL>
SQL>select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------------

-------------------------------------------------------------------------------------
| Id  | Operation                    |  Name                | Rows  | Bytes | Cost  |
-------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |                      |   109 | 26160 |    52 |
|   1 |  NESTED LOOPS                |                      |   109 | 26160 |    52 |
|   2 |   TABLE ACCESS BY INDEX ROWID| PACKAGES             |     1 |   144 |     2 |
|*  3 |    INDEX UNIQUE SCAN         | PK_PACKAGES          |     1 |       |     1 |
|   4 |   TABLE ACCESS BY INDEX ROWID| PACKAGE_PRODUCTS     |   109 | 10464 |    50 |
|*  5 |    INDEX RANGE SCAN          | PK_PACKAGE_PRODUCTS  |   109 |       |     2 |
-------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - access("A"."PACKAGE_ID"='Y9995')
   5 - access("B"."PACKAGE_ID"='Y9995')

Note: cpu costing is off

19 rows selected.


The test case shows two scenarios with the same SQL that apparently obtain the same execution plan, but:
1. with query_rewrite_enabled=true, the estimated cardinality is 1
2. with query_rewrite_enabled=false, the estimated cardinality is 109
3. the "Predicate Information" sections of the execution plans differ: with query_rewrite_enabled=false, there is no join predicate

So how are those cardinalities calculated by Oracle? Suppose we have a two-table join in the following form:

select *
  from t1, t2
 where t1.c1 = t2.c2
   and filter_predicates_of_t1
   and filter_predicates_of_t2;

The basic join cardinality formula is as follows (from "Cost-Based Oracle Fundamentals" p266):

Join Selectivity =
  ((num_rows(t1) - num_nulls(t1.c1)) / num_rows(t1)) *
  ((num_rows(t2) - num_nulls(t2.c2)) / num_rows(t2)) /
  greater(num_distinct(t1.c1), num_distinct(t2.c2))

Join Cardinality =
   join selectivity *
   filtered cardinality(t1) * filtered cardinality(t2)
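
The inputs to this formula can be pulled straight from the dictionary; a minimal sketch for the two tables in this test case:

select t.table_name, t.num_rows, c.column_name, c.num_distinct, c.num_nulls
  from user_tables t, user_tab_col_statistics c
 where c.table_name = t.table_name
   and t.table_name in ('PACKAGES', 'PACKAGE_PRODUCTS')
   and c.column_name = 'PACKAGE_ID';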


In our scenario 1:

Join Selectivity =
  ((2656 - 0) / 2656) *
  ((284247 - 0) / 284247) /
  greater(2656, 2610)
  = 1/2656

Join Cardinality =
   (1/2656) *
   (2656/2656) * (284247/2610)
   = 0.04 (rounded up to 1)

In our scenario 2, notice that there is actually no join predicate, so Oracle drops the join selectivity part:

Join Cardinality =
   (2656/2656) * (284247/2610)
   = 109


My production 9i database has query_rewrite_enabled=true. So if I write the SQL without the join condition, as follows:

select  *
 from  packages a,   -- NROWS 2656
       package_products b   -- NROWS 284247
where
      b.package_id = 'Y9995'
  and a.package_id = 'Y9995'
;


I can get the correct cardinality, as in scenario 2.

BTW, "transitive closure" refers to inference of the form: if a = b and a = 5, then b = 5.

Friday, May 07, 2010

Play with my toy 10g RAC VII - Voting Disk

The voting disk is a shared raw disk partition or a file on a clustered file system that is accessible to all nodes in the cluster. Its primary purpose is to help in situations where the private network communication fails. When that happens, the cluster cannot keep all nodes available, because they are no longer able to synchronize I/O to the shared disks; therefore, some of the nodes must go offline. The voting disk is then used to communicate the node state information that determines which nodes go offline.


If we have 3 nodes, presumably each node will write messages to the voting disk as follows:


Node 1 writes : I can see Node 2 & 3
Node 2 writes : I can see Node 1 & 3
Node 3 writes : I can see Node 1 & 2


If, for example, Node 3's private network has a problem, the messages may become:


Node 1 writes : I can see Node 2 only
Node 2 writes : I can see Node 1 only
Node 3 writes : I cannot see Node 1 or Node 2 (or it does not write anything)


In this situation, clearly Node 3 should be evicted from the cluster.




To avoid a single point of failure, we can multiplex the voting disk. By design, the cluster is fine as long as strictly more than half of the voting disks are up and contain consistent information. That is to say, with 5 voting disks we can tolerate at most 2 voting disk failures.
So the number_of_voting_disks = number_of_tolerable_disk_failures * 2 + 1.
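
Multiplexing is configured with crsctl; a hedged sketch for 10g (the device path is an example, and in my understanding the -force option is needed when the CRS stack is down cluster-wide):

# as root, with CRS stopped on all nodes
crsctl add css votedisk /dev/raw/raw9 -force
crsctl query css votedisk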


This post is to document my test with the following task regarding voting disk administration:


Task - Recover from the loss of voting disks




1. Check the current voting disk configuration


[oracle@rac1 backup]$ crsctl query css votedisk
 0.     0    /dev/raw/raw6
 1.     0    /dev/raw/raw7
 2.     0    /dev/raw/raw8

located 3 votedisk(s).




2. Backup voting disk




[oracle@rac1 backup]$ dd if=/dev/raw/raw6 of=/home/oracle/backup/votingdisk_050710
80325+0 records in
80325+0 records out
[oracle@rac1 backup]$
[oracle@rac1 backup]$ ls -lhtr
total 40M
-rw-r--r--  1 oracle oinstall 40M May  7 16:23 votingdisk_050710
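
(With dd's default 512-byte block size, 80325 records * 512 bytes is about 40 MB, which matches the ls output.)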



3. Wipe out the first voting disk


dd if=/dev/zero of=/dev/raw/raw6


Note: I have three voting disk files, so in my understanding the cluster should survive one voting disk failure; however, rac1 and rac2 rebooted right after I issued this command. I don't know why.


--------------- RAC 1 alert log  --------------------
 
Fri May  7 16:25:21 2010
Trace dumping is performing id=[cdmp_20100507162519]
Fri May  7 16:25:23 2010
Error: KGXGN aborts the instance (6)
Fri May  7 16:25:24 2010
Errors in file /u01/app/oracle/admin/devdb/bdump/devdb1_lmon_10476.trc:
ORA-29702: error occurred in Cluster Group Service operation
LMON: terminating instance due to error 29702


--------------- RAC 2 alert log --------------------
Fri May  7 16:25:19 2010
Error: KGXGN aborts the instance (6)
Fri May  7 16:25:19 2010
Error: unexpected error (6) from the Cluster Service (LCK0)
Fri May  7 16:25:19 2010
Errors in file /u01/app/oracle/admin/devdb/bdump/devdb2_lmon_3150.trc:
ORA-29702: error occurred in Cluster Group Service operation
Fri May  7 16:25:19 2010
Errors in file /u01/app/oracle/admin/devdb/bdump/devdb2_lck0_3236.trc:
ORA-29702: error occurred in Cluster Group Service operation
Fri May  7 16:25:19 2010
LMON: terminating instance due to error 29702
Fri May  7 16:25:21 2010
System state dump is made for local instance
System State dumped to trace file /u01/app/oracle/admin/devdb/bdump/devdb2_diag_3146.trc
Fri May  7 16:31:01 2010





4. Restart CRS stack


[oracle@rac1 ~]$ sudo $ORA_CRS_HOME/bin/crsctl stop crs
Password:
Stopping resources.

Successfully stopped CRS resources
Stopping CSSD.
Shutting down CSS daemon.
Shutdown request successfully issued.
[oracle@rac1 ~]$
[oracle@rac1 ~]$ ssh rac2 sudo $ORA_CRS_HOME/bin/crsctl stop crs
Password:vz123ys

Stopping resources.
Successfully stopped CRS resources
Stopping CSSD.
Shutting down CSS daemon.
Shutdown request successfully issued.
[oracle@rac1 ~]$
[oracle@rac1 ~]$
[oracle@rac1 ~]$
[oracle@rac1 ~]$ ps -ef | grep d.bin
oracle   14672 30539  0 16:56 pts/1    00:00:00 grep d.bin
[oracle@rac1 ~]$ ssh rac2 ps -ef | grep d.bin
[oracle@rac1 ~]$ ./crsstat.sh
HA Resource                                   Target     State
-----------                                   ------     -----
error connecting to CRSD at [(ADDRESS=(PROTOCOL=ipc)(KEY=ora_crsqs))] clsccon 184

[oracle@rac1 ~]$ sudo $ORA_CRS_HOME/bin/crsctl start crs
Attempting to start CRS stack
The CRS stack will be started shortly
[oracle@rac1 ~]$ ssh rac2 sudo $ORA_CRS_HOME/bin/crsctl start crs
Attempting to start CRS stack
The CRS stack will be started shortly


[oracle@rac1 ~]$ ps -ef | grep d.bin
root     14242     1  0 16:54 ?        00:00:00 /u01/app/oracle/product/10.2.0/crs_1/bin/crsd.bin reboot
oracle   15219 14240  2 16:58 ?        00:00:00 /u01/app/oracle/product/10.2.0/crs_1/bin/evmd.bin
oracle   15383 15357  2 16:58 ?        00:00:00 /u01/app/oracle/product/10.2.0/crs_1/bin/ocssd.bin
oracle   15602 30539  0 16:58 pts/1    00:00:00 grep d.bin
[oracle@rac1 ~]$ ssh rac2 ps -ef | grep d.bin
root     23610     1  0 16:56 ?        00:00:00 /u01/app/oracle/product/10.2.0/crs_1/bin/crsd.bin reboot
oracle   24394 23609  2 16:58 ?        00:00:00 /u01/app/oracle/product/10.2.0/crs_1/bin/evmd.bin
oracle   24575 24549  2 16:58 ?        00:00:00 /u01/app/oracle/product/10.2.0/crs_1/bin/ocssd.bin
[oracle@rac1 ~]$ ./crsstat.sh
HA Resource                                   Target     State
-----------                                   ------     -----
ora.devdb.SLBA.cs                             OFFLINE    OFFLINE
ora.devdb.SLBA.devdb1.srv                     OFFLINE    OFFLINE
ora.devdb.SLBA.devdb2.srv                     OFFLINE    OFFLINE
ora.devdb.SNOLBA.cs                           OFFLINE    OFFLINE
ora.devdb.SNOLBA.devdb1.srv                   OFFLINE    OFFLINE
ora.devdb.SNOLBA.devdb2.srv                   OFFLINE    OFFLINE
ora.devdb.db                                  ONLINE     ONLINE on rac2
ora.devdb.devdb1.inst                         ONLINE     ONLINE on rac1
ora.devdb.devdb2.inst                         ONLINE     ONLINE on rac2
ora.rac1.ASM1.asm                             ONLINE     ONLINE on rac1
ora.rac1.LISTENER_RAC1.lsnr                   ONLINE     ONLINE on rac1
ora.rac1.gsd                                  ONLINE     ONLINE on rac1
ora.rac1.ons                                  ONLINE     ONLINE on rac1
ora.rac1.vip                                  ONLINE     ONLINE on rac1
ora.rac2.ASM2.asm                             ONLINE     ONLINE on rac2
ora.rac2.LISTENER_RAC2.lsnr                   ONLINE     ONLINE on rac2
ora.rac2.gsd                                  ONLINE     ONLINE on rac2
ora.rac2.ons                                  ONLINE     ONLINE on rac2
ora.rac2.vip                                  ONLINE     ONLINE on rac2






5. Check log


[oracle@rac1 rac1]$ tail alertrac1.log
2010-05-07 16:58:50.358
[cssd(15383)]CRS-1601:CSSD Reconfiguration complete. Active nodes are rac1 rac2 .
2010-05-07 16:58:54.722
[crsd(14242)]CRS-1201:CRSD started on node rac1.
2010-05-07 16:59:45.626
[cssd(15383)]CRS-1604:CSSD voting file is offline: /dev/raw/raw6. Details in /u01/app/oracle/product/10.2.0/crs_1 /log/rac1/cssd/ocssd.log.
2010-05-07 17:00:47.657
[cssd(15383)]CRS-1604:CSSD voting file is offline: /dev/raw/raw6. Details in /u01/app/oracle/product/10.2.0/crs_1 /log/rac1/cssd/ocssd.log.
2010-05-07 17:01:49.730
[cssd(15383)]CRS-1604:CSSD voting file is offline: /dev/raw/raw6. Details in /u01/app/oracle/product/10.2.0/crs_1 /log/rac1/cssd/ocssd.log.
[oracle@rac1 rac1]$ tail /u01/app/oracle/product/10.2.0/crs_1/log/rac1/cssd/ocssd.log
[    CSSD]2010-05-07 17:00:16.337 [132250528] >TRACE:   clssgmClientConnectMsg: Connect from con(0x8358600) proc( 0x8389ac8) pid() proto(10:2:1:1)
[    CSSD]2010-05-07 17:00:22.183 [132250528] >TRACE:   clssgmClientConnectMsg: Connect from con(0x8358600) proc( 0x838eb90) pid() proto(10:2:1:1)
[    CSSD]2010-05-07 17:00:30.776 [132250528] >TRACE:   clssgmClientConnectMsg: Connect from con(0x8358600) proc( 0x838ee90) pid() proto(10:2:1:1)
[    CSSD]2010-05-07 17:00:47.657 [62401440] >TRACE:   clssnmDiskStateChange: state from 3 to 3 disk (0//dev/raw/ raw6)
[    CSSD]2010-05-07 17:01:07.263 [132250528] >TRACE:   clssgmClientConnectMsg: Connect from con(0x8358600) proc( 0x837e6b8) pid() proto(10:2:1:1)
[    CSSD]2010-05-07 17:01:09.009 [132250528] >TRACE:   clssgmClientConnectMsg: Connect from con(0x8357dd8) proc( 0x8379340) pid() proto(10:2:1:1)
[    CSSD]2010-05-07 17:01:49.730 [62401440] >TRACE:   clssnmDiskStateChange: state from 3 to 3 disk (0//dev/raw/ raw6)
[    CSSD]2010-05-07 17:02:09.984 [132250528] >TRACE:   clssgmClientConnectMsg: Connect from con(0x835a580) proc( 0x8365a50) pid() proto(10:2:1:1)
[    CSSD]2010-05-07 17:02:51.784 [62401440] >TRACE:   clssnmDiskStateChange: state from 3 to 3 disk (0//dev/raw/ raw6)
[    CSSD]2010-05-07 17:03:12.292 [132250528] >TRACE:   clssgmClientConnectMsg: Connect from con(0x835a580) proc( 0x8365a50) pid() proto(10:2:1:1)



Note: similar messages appear in rac2's alertrac2.log and ocssd.log. It can be seen that with three voting disk files, if one of them is unavailable, the RAC is still functioning.




6. Wipe out the second voting disk


dd if=/dev/zero of=/dev/raw/raw7


Both nodes rebooted right after I issued the above command. After the reboot, I see only evmd running:


[oracle@rac1 ~]$ ps -ef | grep d.bin
oracle    8139  6985  3 17:12 ?        00:00:14 /u01/app/oracle/product/10.2.0/crs_1/bin/evmd.bin
oracle   11075 10255  0 17:20 pts/1    00:00:00 grep d.bin




7. Check log:


--- alertrac1.log   show two voting disk files are offline

2010-05-07 17:12:14.926
[cssd(8303)]CRS-1604:CSSD voting file is offline: /dev/raw/raw6. Details in /u01/app/oracle/product/10.2.0/crs_1/log/rac1/cssd/ocssd.log.
2010-05-07 17:12:15.099
[cssd(8303)]CRS-1604:CSSD voting file is offline: /dev/raw/raw7. Details in /u01/app/oracle/product/10.2.0/crs_1/log/rac1/cssd/ocssd.log.
2010-05-07 17:12:15.147
[cssd(8303)]CRS-1605:CSSD voting file is online: /dev/raw/raw8. Details in /u01/app/oracle/product/10.2.0/crs_1/log/rac1/cssd/ocssd.log.
[oracle@rac1 rac1]$

[oracle@rac1 crsd]$ tail crsd.log

2010-05-07 17:14:04.532: [ COMMCRS][36494240]clsc_connect: (0x8655528) no listener at (ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_rac1_crs))

2010-05-07 17:14:04.532: [ CSSCLNT][3086931648]clsssInitNative: connect failed, rc 9

2010-05-07 17:14:04.533: [  CRSRTI][3086931648]0CSS is not ready. Received status 3 from CSS. Waiting for good status ..

2010-05-07 17:14:05.536: [ CRSMAIN][3086931648][PANIC]0CRSD exiting: Could not init the CSS context

2010-05-07 17:14:05.540: [ default][3086931648]Terminating clsd session







8. Restore Voting Disk


[oracle@rac1 ~]$ dd if=/home/oracle/backup/votingdisk_050710 of=/dev/raw/raw6
80325+0 records in
80325+0 records out
[oracle@rac1 ~]$ dd if=/home/oracle/backup/votingdisk_050710 of=/dev/raw/raw7
dd: writing to `/dev/raw/raw7': No space left on device
80263+0 records in
80262+0 records out

Note: /dev/raw/raw7 is apparently slightly smaller than /dev/raw/raw6, so dd stops when it reaches the end of the device. This appears to be harmless here, as the subsequent CRS start brings raw7 back online.





9. Restart CRS


[oracle@rac1 ~]$ sudo $ORA_CRS_HOME/bin/crsctl start crs
Password:
Attempting to start CRS stack
The CRS stack will be started shortly
[oracle@rac1 ~]$ ssh rac2 sudo $ORA_CRS_HOME/bin/crsctl start crs
Password:vz123ys

Attempting to start CRS stack
The CRS stack will be started shortly


---- in alertrac1.log --------------------------

[oracle@rac1 rac1]$ pwd
/u01/app/oracle/product/10.2.0/crs_1/log/rac1
[oracle@rac1 rac1]$ tail -15 alertrac1.log
[cssd(8303)]CRS-1605:CSSD voting file is online: /dev/raw/raw8. Details in /u01/app/oracle/product/10.2.0/crs_1/log/rac1/cssd/ocssd.log.
2010-05-07 17:29:31.679
[cssd(13301)]CRS-1605:CSSD voting file is online: /dev/raw/raw6. Details in /u01/app/oracle/product/10.2.0/crs_1/log/rac1/cssd/ocssd.log.
2010-05-07 17:29:31.714
[cssd(13301)]CRS-1605:CSSD voting file is online: /dev/raw/raw7. Details in /u01/app/oracle/product/10.2.0/crs_1/log/rac1/cssd/ocssd.log.
2010-05-07 17:29:31.729
[cssd(13301)]CRS-1605:CSSD voting file is online: /dev/raw/raw8. Details in /u01/app/oracle/product/10.2.0/crs_1/log/rac1/cssd/ocssd.log.
2010-05-07 17:29:35.433
[cssd(13301)]CRS-1601:CSSD Reconfiguration complete. Active nodes are rac1 rac2 .
2010-05-07 17:29:38.247
[crsd(8910)]CRS-1012:The OCR service started on node rac1.
2010-05-07 17:29:38.287
[evmd(13364)]CRS-1401:EVMD started on node rac1.
2010-05-07 17:31:02.432
[crsd(8910)]CRS-1201:CRSD started on node rac1.

---- CRS resources are online

[oracle@rac1 ~]$ ~/crsstat.sh
HA Resource                                   Target     State
-----------                                   ------     -----
ora.devdb.SLBA.cs                             OFFLINE    OFFLINE
ora.devdb.SLBA.devdb1.srv                     OFFLINE    OFFLINE
ora.devdb.SLBA.devdb2.srv                     OFFLINE    OFFLINE
ora.devdb.SNOLBA.cs                           OFFLINE    OFFLINE
ora.devdb.SNOLBA.devdb1.srv                   OFFLINE    OFFLINE
ora.devdb.SNOLBA.devdb2.srv                   OFFLINE    OFFLINE
ora.devdb.db                                  ONLINE     ONLINE on rac2
ora.devdb.devdb1.inst                         ONLINE     ONLINE on rac1
ora.devdb.devdb2.inst                         ONLINE     ONLINE on rac2
ora.rac1.ASM1.asm                             ONLINE     ONLINE on rac1
ora.rac1.LISTENER_RAC1.lsnr                   ONLINE     ONLINE on rac1
ora.rac1.gsd                                  ONLINE     ONLINE on rac1
ora.rac1.ons                                  ONLINE     ONLINE on rac1
ora.rac1.vip                                  ONLINE     ONLINE on rac1
ora.rac2.ASM2.asm                             ONLINE     ONLINE on rac2
ora.rac2.LISTENER_RAC2.lsnr                   ONLINE     ONLINE on rac2
ora.rac2.gsd                                  ONLINE     ONLINE on rac2
ora.rac2.ons                                  ONLINE     ONLINE on rac2
ora.rac2.vip                                  ONLINE     ONLINE on rac2
[oracle@rac1 ~]$ date
Fri May  7 17:37:35 EDT 2010
 




In conclusion, with 3 voting disks my test did show that the RAC stays operational if one of them is offline; if two of them are unavailable, the CRS daemon cannot start at all. However, when I zeroed out one of the voting disks, the servers rebooted; this is not desirable, and I am not sure whether it is due to my particular environment.

Tuesday, May 04, 2010

Play with my toy 10g RAC VI - Failed to Start CRS Processes due to OCR Out-of-Sync

Today is the first time I have played with my toy RAC since I moved the OCR and voting disk files to raw devices, as described here. To my surprise, I could not start the CRS processes (i.e. those seen from ps -ef | grep d.bin)! First of all, they should have started automatically after the server rebooted, but they did not. Secondly, they did not start after I issued "crsctl start crs" or "/etc/init.d/init.crs start" either. It took me quite a while to find the right log that indicated the problem:

[oracle@rac1 client]$ pwd
/u01/app/oracle/product/10.2.0/crs_1/log/rac1/client
[oracle@rac1 client]$
[oracle@rac1 client]$ tail -10 css.log
2010-04-05 10:15:01.231: [ CSSCLNT][3086931648]clsssInitNative: connect failed, rc 9

2010-04-05 10:15:02.249: [ CSSCLNT][3086931648]clsssInitNative: connect failed, rc 9

2010-04-21 13:44:32.807: [ default][3068479168]prlsndmain: olsnodes successful!!
2010-04-28 11:13:01.999: [  OCRRAW][3086931648]propriogid:1: INVALID FORMAT
2010-04-28 12:02:36.960: [  OCRRAW][3086931648]propriogid:1: INVALID FORMAT
2010-05-04 10:58:31.850: [  OCRRAW][3068479168]propriogid:1: INVALID FORMAT
2010-05-04 10:58:31.932: [ default][3068479168]prlsndmain: olsnodes successful!!
2010-05-04 11:06:06.374: [  OCRRAW][3086931648]propriogid:1: INVALID FORMAT


It showed that the OCR had problems. I then noticed "Device/File needs to be synchronized with the other device" in the ocrcheck output:

root@rac1:/u01/app/oracle/product/10.2.0/crs_1/log/rac1/client [devdb1]# ocrcheck
Status of Oracle Cluster Registry is as follows :
         Version                  :          2
         Total space (kbytes)     :     505928
         Used space (kbytes)      :       5372
         Available space (kbytes) :     500556
         ID                       :  645781380
         Device/File Name         : /dev/raw/raw4
                                    Device/File needs to be synchronized with the other device
         Device/File Name         : /dev/raw/raw5
                                    Device/File integrity check succeeded

         Cluster registry integrity check succeeded


I executed the following command:

# dd if=/dev/raw/raw5 of=/dev/raw/raw4

Then I found ocrcheck was good again:

[oracle@rac1 ~]$ ocrcheck
Status of Oracle Cluster Registry is as follows :
         Version                  :          2
         Total space (kbytes)     :     505928
         Used space (kbytes)      :       5348
         Available space (kbytes) :     500580
         ID                       :  219149972
         Device/File Name         : /dev/raw/raw4
                                    Device/File integrity check succeeded
         Device/File Name         : /dev/raw/raw5
                                    Device/File integrity check succeeded

         Cluster registry integrity check succeeded

I am not sure this is the right or best way to fix the problem, but it worked. BTW, I tried "ocrconfig -replace ocr /dev/raw/raw5", which did not work. After the dd copy, I was able to bring everything back to normal. Actually, the problem was already there when I created the ocr and ocrmirror on the raw devices, but I did not notice it at the time.
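
For the record, 10g also keeps automatic OCR backups that can be restored with ocrconfig; a hedged sketch (run as root with CRS stopped, and the backup location and file name vary by installation):

# list the automatic backups, then restore from one of them
ocrconfig -showbackup
ocrconfig -restore /u01/app/oracle/product/10.2.0/crs_1/cdata/crs/backup00.ocr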