Oracle RAC Cluster in VMware vSphere : VMotion Caveat
Last week I faced a very critical issue with our Oracle RAC cluster in vSphere 5, for the ESXi maintenance we have performed a VMotion on the Oracle nodes – then the FUN began !!
We have 2 node oracle RAC cluster running on Vsphere5
For the Oracle RAC cluster we are using VMFS5 file system, I mean VMDK only (no RDM)
Oracle nodes are using windows 2008 R2 SP1 and oracle 220.127.116.11
The clusters are running on HP Blade center C7000 with ProLiant BL680c G7 blades
The ORACLE private and public network are using vDS (VMware distributed switch)
Storage, we are using HP 3PAR FC SAN for the vsphere5
We performed a VMotion on Oracle RAC cluster nodes one by one, after that we got a BSOD for both nodes and the VM’s crashed completely.
I spent some time in researching the stop code that was in the BSOD: STOP 0x0000FFFE
It appears that it is not a normal blue screen, as the above stop code is not given from Microsoft. It is actually a forced blue screen from Oracle. Unfortunately, I am unable to find specific information regarding it from Oracle’s website, but I did find the following blog article (this information may not be completely reliable as it is just a blog article):
“OraFence has a built-in mechanism to check it was scheduled in time. If it is not scheduled within 5 seconds it will also reboot the note. In this way, OraFence is designed to fence and reboot a node if it perceives that a given node is ‘hung’ once its own timeout has been reached. Note that the default timeout for the OraFence driver is a (very low) 0x05 (5 seconds). What this means is that if the OraFence driver detects what it perceives to be a hang for example at the operating system level and that hang persists beyond 5 seconds, it’s possible that the OraFence driver – of its own accord – will fence and evict the node.” —
RCA for the ISSUE :
The above article suggests that a simple timeout could cause an Oracle driver to BSOD the machine with the stop code 0x0000FFFE. This would be something that could occur during a vMotion simply because, during the VMotion the operating system is quiesced, and depending upon the VM RAM and VMotion configuration it will take around 16 seconds to complete the VMotion (we have configured MultiNic VMotion in the HP Blades)
Oracle uses fencing mechanism for the RAC cluster, just like the SCSI fencing in the REDHAT CLUSTER to reboot the nodes. Because of the VMotion quiescence process and with default timeout for the OraFence driver 5 seconds, because the VMotion quiescence took more than 5 seconds and this caused the BSOD.
The above blog article recommends increasing the timeout from 5 seconds because that is very aggressive.I would recommend contacting Oracle Support with the details, as they would probably have a more complete picture of how their product forces a BSOD and how to avoid it in the future.
Increase the default timeout for the OraFence driver, so the BSOD should not occur again due to a quiesced operating system (during VMotion or in DRS/HA environment)
While designing the Oracle RAC cluster in vSphere, taking this consideration will avoid such crashes. Also if you are running the RAC nodes in the VMware HA/DRS environment, this should be considered.