Friday, November 28, 2008

Oracle 10g/11g RAC Clusterware Installation Tips

Part II: Oracle RAC High Availability Series

Since I have been building Oracle 10g and 11g RAC clusters for many
new clients, I have decided to present a series on Oracle RAC tips and tricks. The previous installment in the series covered design and implementation
guidelines for Oracle RAC best practices. This week, we will cover post-installation tips
using the Oracle 10g/11g Cluster Verification Utility, cluvfy. This wonderful tool provides insight into whether your Oracle RAC environment has been installed correctly.

The Oracle Clusterware verification utility for Oracle RAC comes in two forms: the cluvfy utility, which is available after the clusterware software has been installed, and the runcluvfy.sh shell script on the installation media, which can be used before installation. In the following example, we will use the post-install verification feature of cluvfy to verify our Oracle 10g RAC clusterware installation.
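As a quick sketch of the pre-installation form (the staging directory and node names are placeholders for your own environment), the script is run from the unzipped clusterware media before any Oracle software exists on the nodes:

```shell
# Run the pre-installation stage check from the staged clusterware media
# (runcluvfy.sh ships alongside runInstaller). The path and node names
# below are hypothetical placeholders.
cd /stage/clusterware
./runcluvfy.sh stage -pre crsinst -n racnode1,racnode2 -verbose
```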

Verify Oracle RAC Clusterware Installation using the Clusterware Verification Utility (CLUVFY)

Cluvfy uses stage checks to verify the status of the installation before and after each
phase of the Oracle RAC installation process.

Below is the basic syntax for cluvfy:

oracle@racnode1 $ cluvfy

cluvfy [ -help ]
cluvfy stage { -list | -help }
cluvfy stage {-pre|-post} <stage_name> <stage_specific_options> [-verbose]
cluvfy comp { -list | -help }
cluvfy comp <component_name> <component_specific_options> [-verbose]

cluvfy stage -post crsinst -n <node_list> [-verbose]

<node_list> is the comma-separated list of non-domain-qualified node names on which the test should be conducted. If "all" is specified, then all the nodes in the cluster will be used for verification.

The post crsinst stage performs the appropriate checks on all the nodes in the node list after Cluster Ready Services (CRS) has been set up.

Now we run the post-CRS verification with cluvfy to ensure that our Oracle RAC clusterware has been correctly installed. This is useful since the Oracle 10g/11g RAC Clusterware (CRS) is the most challenging aspect of a RAC implementation.

oracle@racnode1> cluvfy stage -post crsinst -n all -verbose

Performing post-checks for cluster services setup

Checking node reachability...

Check: Node reachability from node "racnode1"
Destination Node Reachable?
------------------------------------ ------------------------
racnode1 yes
racnode2 yes

Result: Node reachability check passed from node "racnode1".

Checking user equivalence...

Check: User equivalence for user "oracle"
Node Name Comment
------------------------------------ ------------------------
racnode1 passed
racnode2 passed
Result: User equivalence check passed for user "oracle".

Checking Cluster manager integrity...

Checking CSS daemon...
Node Name Status
------------------------------------ ------------------------
racnode2 running
racnode1 running
Result: Daemon status check passed for "CSS daemon".

Cluster manager integrity check passed.

Checking cluster integrity...

Cluster integrity check passed.

Checking OCR integrity...

Checking the absence of a non-clustered configuration...
All nodes free of non-clustered, local-only configurations.

Uniqueness check for OCR device passed.

Checking the version of OCR...
OCR of correct Version "2" exists.

Checking data integrity of OCR...
Data integrity check for OCR passed.

OCR integrity check passed.

Checking CRS integrity...

Checking daemon liveness...

Check: Liveness for "CRS daemon"
Node Name Running
------------------------------------ ------------------------
racnode2 yes
racnode1 yes
Result: Liveness check passed for "CRS daemon".

Checking daemon liveness...

Check: Liveness for "CSS daemon"
Node Name Running
------------------------------------ ------------------------
racnode2 yes
racnode1 yes
Result: Liveness check passed for "CSS daemon".

Checking daemon liveness...

Check: Liveness for "EVM daemon"
Node Name Running
------------------------------------ ------------------------
racnode2 yes
racnode1 yes
Result: Liveness check passed for "EVM daemon".

Liveness of all the daemons
Node Name CRS daemon CSS daemon EVM daemon
------------ ------------------------ ------------------------ ----------
racnode2 yes yes yes
racnode1 yes yes yes

Checking CRS health...

Check: Health of CRS
Node Name CRS OK?
------------------------------------ ------------------------
racnode2 yes
racnode1 yes
Result: CRS health check passed.

CRS integrity check passed.

Checking node application existence...

Checking existence of VIP node application
Node Name Required Status Comment
------------ ------------------------ ------------------------ ----------
racnode2 yes exists passed
racnode1 yes exists passed
Result: Check passed.

Checking existence of ONS node application
Node Name Required Status Comment
------------ ------------------------ ------------------------ ----------
racnode2 no exists passed
racnode1 no exists passed
Result: Check passed.

Checking existence of GSD node application
Node Name Required Status Comment
------------ ------------------------ ------------------------ ----------
racnode2 no exists passed
racnode1 no exists passed
Result: Check passed.

Post-check for cluster services setup was successful.

The beauty of the Oracle RAC cluster verification tool is that it can pinpoint trouble spots before, during and after the installation process. It is an essential tool in the Oracle RAC expert's workbench.
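Beyond the stage checks shown above, cluvfy can also verify individual components on demand. A quick sketch:

```shell
# List all available stage and component checks, then drill into a
# single component -- here, node connectivity across all cluster nodes.
cluvfy stage -list
cluvfy comp -list
cluvfy comp nodecon -n all -verbose
```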

Some additional details on the use of the cluvfy utility are available in the following Oracle Support (Metalink) note:

ML # 339939.1 - Running Cluster Verification Utility to Diagnose Install Problems


Sunday, November 23, 2008

Design Considerations for Oracle RAC High Availability

I have recently been working with clients to implement and review Oracle 10g RAC environments. One recurring question
is how to design a truly redundant and highly available (HA) infrastructure.

The goal is to avoid any single point of failure (SPOF) while maximizing performance and scalability.

Here are my notes:

Hardware Considerations
Implement three to four nodes in the initial phase of the RAC design. With at least three nodes, the cluster
keeps a voting majority if a single node fails, which helps protect you against the dreaded "split brain" condition, and you still have two to three surviving nodes
for failover and for servicing current mission-critical applications.

Network Considerations for RAC
1. Implement redundant switches for the interconnect to avoid loss of communication between the nodes in the RAC cluster.
The danger of using only a single switch is that if it fails, the entire RAC cluster will crash, and the result will be
downtime. I see a lot of clients skimp on this item. By using two switches at both the interconnect (private network) and the storage level (i.e., the SAN fabric or iSCSI layer), you protect yourself against a network failure.

2. Use a fat pipe for the interconnect. Go with at least Gigabit Ethernet for best throughput.
Even better, InfiniBand offers robust performance for heavy-duty applications.

3. Implement multiple, dual-homed NICs, bonded or teamed, so that the loss of a single network adapter does not take the server off the network. Operating systems
also provide this protection at the OS level, for example Sun's IPMP (IP Multipathing).
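On Linux, for example, this redundancy can be provided with NIC bonding. A minimal sketch of a RHEL-style configuration (device names, addresses, and bonding mode are assumptions for illustration):

```
# /etc/sysconfig/network-scripts/ifcfg-bond0 -- the logical bonded interface
DEVICE=bond0
IPADDR=192.168.1.1
NETMASK=255.255.255.0
BONDING_OPTS="mode=active-backup miimon=100"
ONBOOT=yes

# /etc/sysconfig/network-scripts/ifcfg-eth1 -- one of the two slave NICs
DEVICE=eth1
MASTER=bond0
SLAVE=yes
ONBOOT=yes
```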

Storage Considerations for Oracle 10g/11g RAC
Invest in the best SAN possible, with Fibre Channel (FC/FC-AL) for the best performance and support.

Fibre Channel is the best overall performance and HA solution for enterprise storage. Another suitable option is iSCSI, which provides many similar benefits.

Other tips for Oracle 10g/11g RAC Design

1. Mirror and protect the Oracle 10g/11g Clusterware files: keep several copies of the OCR (Oracle Cluster Registry) and the voting disks. These files have a small footprint, but guess what happens to your entire RAC cluster if you lose your only copy of them? The whole cluster fails because the nodes can no longer coordinate. I have seen clients with a single copy of the OCR and voting disk, putting their RAC environments at great risk. Even if the Unix or storage administrator says the files are mirrored at the storage level, you cannot be too cautious: keep multiple copies.
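As a sketch using the Oracle 10gR2 command-line tools (the file system paths below are hypothetical placeholders), extra OCR and voting disk copies can be added as follows:

```shell
# Add a mirror location for the OCR (run as root; path is a placeholder):
ocrconfig -replace ocrmirror /u02/oradata/ocr_mirror

# Add additional voting disks. In 10gR2 this requires the clusterware to
# be down and the -force flag; 11g can add voting disks online.
crsctl add css votedisk /u02/oradata/votedisk2 -force
crsctl add css votedisk /u03/oradata/votedisk3 -force

# Verify the resulting OCR and voting disk configuration:
ocrcheck
crsctl query css votedisk
```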

2. Use multiple ASM Disk Groups.

At least 4-5 ASM disk groups are recommended to split up the various Oracle 10g/11g database files for performance and availability reasons. For example, we can have the following sample ASM configuration:

+FLASHDG for flash recovery area within ASM to store backups and archivelogs
+DATADG for Oracle 10g/11g data files
+INDEXDG for Oracle 10g/11g indexes
+DATADG2 for additional Oracle 10g/11g application database files
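As a sketch, disk groups like these are created from the ASM instance. The disk device paths and redundancy choices below are assumptions for illustration; with storage-level RAID protection you might choose EXTERNAL REDUNDANCY throughout:

```shell
# Connect to the first ASM instance and create two of the sample disk
# groups. Disk paths are hypothetical placeholders.
export ORACLE_SID=+ASM1
sqlplus / as sysdba <<'EOF'
CREATE DISKGROUP DATADG NORMAL REDUNDANCY
  FAILGROUP fg1 DISK '/dev/raw/raw1'
  FAILGROUP fg2 DISK '/dev/raw/raw2';
CREATE DISKGROUP FLASHDG EXTERNAL REDUNDANCY
  DISK '/dev/raw/raw3';
EOF
```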

Implement Oracle Data Guard for RAC
While Oracle RAC provides performance, scalability, and protection against the failure of a single node in the cluster, it does not protect against data loss if the RAC database suffers a media failure. This is because all of the RAC cluster nodes share the same database. Many folks incorrectly assume that RAC is a total HA solution; it is not. I recommend that a physical standby database be implemented with RAC environments to protect against data loss, the single point of failure (SPOF) that is the Achilles' heel of RAC. By using Oracle Data Guard with RAC, you gain failover and switchover features to protect against data loss. Downtime is bad enough for an already stressed DBA to worry about, but data loss will get a DBA fired and can potentially put a company out of business. Thus, Data Guard is the perfect complement to RAC for a comprehensive HA solution as part of the Oracle Maximum Availability Architecture (MAA).
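As a minimal sketch of the primary-side configuration (the standby service name and DB_UNIQUE_NAME "racstby" are hypothetical), redo shipping to the standby is enabled through a second archive destination:

```shell
sqlplus / as sysdba <<'EOF'
-- Ship redo from the RAC primary to the physical standby.
-- Service name and DB_UNIQUE_NAME below are placeholders.
ALTER SYSTEM SET log_archive_dest_2 =
  'SERVICE=racstby ASYNC VALID_FOR=(ONLINE_LOGFILES,PRIMARY_ROLE) DB_UNIQUE_NAME=racstby'
  SCOPE=BOTH SID='*';
ALTER SYSTEM SET log_archive_dest_state_2 = ENABLE SCOPE=BOTH SID='*';
EOF
```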

Implement and Test RMAN Backup and Recovery with Oracle 10g/11g RAC
Oracle provides the ultimate backup and recovery tool, Recovery Manager (RMAN), free out of the box to handle essential backup and recovery for complex RAC environments. User-managed hot backups were fine years ago, before the RMAN age, but sorry folks, they really do not cut the mustard in modern times. RMAN provides a ton of features, such as block-level media recovery and point-in-time recovery, that are not available with the old user-managed backups. In addition, RMAN can be used to clone RAC databases, to create standby databases for Data Guard, and to back up and recover ASM disk groups with Oracle 10g and 11g for RAC.
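A minimal RMAN sketch for a RAC database, assuming a flash recovery area (for example, on the +FLASHDG disk group above) has already been configured as the backup destination:

```shell
rman target / <<'EOF'
# Keep an automatic control file backup and a one-week recovery window.
CONFIGURE CONTROLFILE AUTOBACKUP ON;
CONFIGURE RETENTION POLICY TO RECOVERY WINDOW OF 7 DAYS;
# Back up the database and archived logs, pruning logs once backed up.
BACKUP DATABASE PLUS ARCHIVELOG DELETE INPUT;
EOF
```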

Hope these tips and tricks help you with building a reliable and stable RAC environment!