Oracle RAC Cluster in VMware vSphere: vMotion Caveat

 

Last week I faced a very critical issue with our Oracle RAC cluster in vSphere 5. For ESXi maintenance we performed a vMotion on the Oracle nodes, and then the FUN began!

OUR ENVIRONMENT

We have a 2-node Oracle RAC cluster running on vSphere 5.

The Oracle RAC cluster uses the VMFS5 file system, that is, VMDKs only (no RDMs).

The Oracle nodes run Windows Server 2008 R2 SP1 and Oracle 11.2.0.3.

The clusters run on an HP BladeSystem c7000 with ProLiant BL680c G7 blades.

The Oracle private and public networks use a vDS (VMware distributed switch).

For storage, we use an HP 3PAR FC SAN for vSphere 5.

ISSUE

We performed a vMotion of the Oracle RAC cluster nodes one by one. After that, both nodes hit a BSOD and the VMs crashed completely.


I spent some time researching the stop code shown in the BSOD: STOP 0x0000FFFE
It appears that this is not a normal blue screen, as the stop code above does not come from Microsoft. It is actually a blue screen forced by Oracle. Unfortunately, I was unable to find specific information about it on Oracle’s website, but I did find the following blog article (this information may not be completely reliable, as it is just a blog article):

“OraFence has a built-in mechanism to check it was scheduled in time. If it is not scheduled within 5 seconds it will also reboot the node. In this way, OraFence is designed to fence and reboot a node if it perceives that a given node is ‘hung’ once its own timeout has been reached. Note that the default timeout for the OraFence driver is a (very low) 0x05 (5 seconds). What this means is that if the OraFence driver detects what it perceives to be a hang for example at the operating system level and that hang persists beyond 5 seconds, it’s possible that the OraFence driver – of its own accord – will fence and evict the node.”

Source: http://dbmentors.blogspot.com/2012_09_01_archive.html

 

RCA for the ISSUE:
The article above suggests that a simple timeout can cause an Oracle driver to force a BSOD with the stop code 0x0000FFFE. This can happen during a vMotion because the guest operating system is quiesced while the VM is moved. Depending on the VM's RAM size and the vMotion configuration, the vMotion takes around 16 seconds to complete in our environment (we have configured Multi-NIC vMotion on the HP blades).

Oracle uses a fencing mechanism to reboot RAC cluster nodes, much like SCSI fencing in Red Hat Cluster. The default timeout for the OraFence driver is 5 seconds. The vMotion quiescence took longer than that, and this caused the BSOD.
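The timeout behaviour described above can be illustrated with a small Python sketch. This is only a model of the idea, not Oracle's actual driver logic: the function name and the sample gap values are hypothetical, and it assumes the worst case where the guest is paused for the full 16 seconds of the vMotion.

```python
def would_fence(schedule_gaps_s, timeout_s=5.0):
    """Return True if any gap between watchdog runs exceeds the timeout.

    Models a watchdog that fences the node when it notices it was not
    scheduled within `timeout_s` seconds. Illustrative only.
    """
    return any(gap > timeout_s for gap in schedule_gaps_s)


# Hypothetical gaps (in seconds) between consecutive watchdog runs.
normal_gaps = [1.0, 1.0, 1.1, 0.9]
gaps_with_pause = [1.0, 1.0, 16.0, 1.0]  # assumed 16 s pause during vMotion

print(would_fence(normal_gaps))                      # False
print(would_fence(gaps_with_pause))                  # True
print(would_fence(gaps_with_pause, timeout_s=30.0))  # False
```

With the default 5 second timeout the single long gap is enough to trigger the fence, while a larger timeout tolerates the same pause.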

The blog article above recommends increasing the timeout, because 5 seconds is very aggressive. I would also recommend contacting Oracle Support with the details, as they will probably have a more complete picture of how their product forces a BSOD and how to avoid it in the future.

RESOLUTION :

Increase the default timeout for the OraFence driver so that the BSOD does not recur when the operating system is quiesced (during a vMotion, or in a DRS/HA environment).
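How large the new timeout should be depends on the longest pause you expect. The following Python sketch shows one possible way to size it. The safety margin of 2x and the helper name are my own assumptions, not an Oracle recommendation, so please confirm the final value with Oracle Support.

```python
import math


def pick_timeout(observed_pauses_s, margin=2.0, floor_s=5):
    """Suggest a timeout (whole seconds) above the longest observed pause.

    `margin` is an assumed safety factor and `floor_s` is the 5 second
    default mentioned in the quoted article. Both are illustrative.
    """
    longest = max(observed_pauses_s, default=0.0)
    return max(floor_s, math.ceil(longest * margin))


# Assumed worst-case pause of 16 s, taken from our vMotion duration.
timeout = pick_timeout([16.0])
print(timeout)       # 32
print(hex(timeout))  # 0x20
```

The hexadecimal form is printed only because the quoted article gives the default as 0x05.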

DESIGN NOTES

Taking this into consideration when designing an Oracle RAC cluster on vSphere will help avoid such crashes. It should also be considered if you run the RAC nodes in a VMware HA/DRS environment.


About GK_RAJ

An enthusiastic IT person, with an intense passion for Datacenter technologies. I am a VMware vExpert title holder, working as a Technical Consultant in Qatar. I work with VMware vSphere, Storage, Bladecenters, Datacenter operations, Symantec Backup and Deduplication technologies, and carry rich and diversified experience in these domains. I specialize in Designing & Consulting on VMware vSphere, and in the integration of Storage and Network Stacks with vSphere. With my experience, I help Organizations/Enterprises to achieve CAPEX & OPEX savings, develop DR and BCP strategies, consolidate services with Virtualization using vSphere, and prepare them to move to Cloud. In the meantime, I would like to share my knowledge and make a good contribution to the community. I am an Indian citizen, and have an Engineering degree in Electronics and Communication. I am certified in VCAP5-DCD, VCP-Cloud, VCP 4 & 5, MCITP, MCSE.

Posted on November 13, 2012, in HP Blades, VMware.
