Deciding which add-on technologies to use when delving into the world of enterprise virtualization products can be tough. One such product that helps you get better availability from applications running on your virtual servers is VMware High Availability (HA) for VMware ESX. This tip introduces basics on configuring VMware HA and offers details on tuning your virtual machine availability to your desired levels and expectations.
VMware HA is an availability enforcement tool that monitors virtual machines and ESX hosts within your virtual infrastructure environment to ensure that they are running. Availability is enforced by restarting halted virtual machines or by ensuring that configured virtual machines are running on an available ESX host in the event that the original ESX host fails. (FYI: VMware HA is bundled with VMware Infrastructure 3 (VI3) Enterprise, but if you have VI3 Standard or Starter edition, you'll have to purchase VMware HA separately.)
Defining VMware HA parameters
Implementing VMware HA requires some upfront planning. You need to have realistic expectations. The hardest adjustment for new virtualization administrators is that most of the VMware HA commands are not driven by normal check boxes and option screens. Consider the following example:
- Enable virtual machine failover monitoring
- Establish a 90-second polling interval
- Allow a three-minute virtual machine initialization grace period
- Assign a high priority for the restart operation
For this configuration, I have one virtual machine, VMWIN2K3-0001, running Windows Server 2003 on ESX host ESX35DEV0001 running VMware ESX 3.5. The cluster for this example is named C-ESX35. I am also running VirtualCenter 2.5 to configure the HA rules, so your configuration may appear different if you are on different versions.
The configuration elements for VMware HA are determined per cluster. Therefore, if you have a large number of ESX hosts, you may decide to make different clusters based on the availability requirements you have in your environment. For my environment, I have two data centers and three ESX clusters. The three clusters have four, three and two ESX hosts. This effectively translates to large, medium and development environments for the clusters as we are using them.
To configure the above HA parameters, there are two sections we will need to visit. The first is to set the restart priority for the cluster. Using VirtualCenter, select the cluster and right-click to select Edit Settings. The base screen is shown below with the restart priority value of high:
From here, click the Advanced Options button to set the HA rules that are not exposed elsewhere in the interface. In this section, I have entered the following three options and values to meet the above criteria:
- das.vmFailoverEnabled, true
- das.MinUptime, 180
- das.FailureInterval, 90
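To show how the requirements listed earlier translate into these values, here is a minimal Python sketch. The build_ha_options helper is hypothetical and exists only for illustration; the option names are copied as they are spelled in this tip, and the only real work is converting the three-minute grace period into seconds.

```python
# Hypothetical helper: turns the requirements from this example into the
# key/value strings typed into the Advanced Options dialog.
# Option names are copied from this tip as written.
def build_ha_options(polling_seconds, grace_minutes, failover_enabled=True):
    return {
        "das.vmFailoverEnabled": "true" if failover_enabled else "false",
        "das.MinUptime": str(grace_minutes * 60),      # grace period in seconds
        "das.FailureInterval": str(polling_seconds),   # polling interval in seconds
    }


options = build_ha_options(polling_seconds=90, grace_minutes=3)
for key, value in options.items():
    print(f"{key} = {value}")
```

The values are kept as strings because they are typed into the dialog as text.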
These values must be typed in, and the option names should match the case listed in the ESX documentation for HA to make sure the settings are applied as expected. Click OK to save these settings. Once configured, they should look like the following configuration I have in C-ESX35:
VirtualCenter validates parameters as you go
Once the values are applied, they are immediately in effect on the hosts in the cluster involved. I was initially concerned about accuracy when entering values in this fashion, but VirtualCenter will validate the settings as you implement them. For example, if you attempt to enter an invalid advanced configuration parameter, VirtualCenter will not commit it to the cluster. To validate this, I attempted to enter an advanced option of VMFailoverEnabled with a value of True within the C-ESX35 cluster. This was immediately rejected and shown in the VirtualCenter scrolling log as a "Bad HA advanced option key" message. This is also recorded in the VPX_Task table of the VirtualCenter database with the following characteristics:
NAME: vim.ComputeResource.reconfigureEx ERROR_DATA:
LocalizedMethodFault"> Bad HA advanced option keys: VMFailoverEnabled
The database entry shows that VirtualCenter rejected the option key exactly as it was entered.
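If you would rather catch this kind of mistake before typing anything into VirtualCenter, a small pre-check can help. The following Python sketch is only an illustration of case-sensitive key matching and is not how VirtualCenter validates input. The KNOWN_KEYS set contains just the option names mentioned in this tip, spelled as they appear here, so treat it as an assumption rather than an authoritative list; both function names are hypothetical.

```python
# Option names exactly as spelled in this tip; NOT an authoritative list.
KNOWN_KEYS = {
    "das.vmFailoverEnabled",
    "das.MinUptime",
    "das.FailureInterval",
    "das.maxFailures",
    "das.maxFailureWindow",
}


def find_bad_keys(options):
    """Return keys that do not match a known key exactly (case-sensitive)."""
    return [key for key in options if key not in KNOWN_KEYS]


def suggest_key(bad_key):
    """Suggest a known key that differs only by case or a missing 'das.' prefix."""
    wanted = bad_key.lower()
    for known in sorted(KNOWN_KEYS):
        if known.lower() in (wanted, "das." + wanted):
            return known
    return None


# The rejected entry from the test described above.
attempt = {"VMFailoverEnabled": "True"}
for key in find_bad_keys(attempt):
    print(f"Bad key: {key!r}; did you mean {suggest_key(key)!r}?")
```

Here the entry is flagged because it lacks the das. prefix and uses different capitalization, which mirrors the rejection I saw in the scrolling log.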
Put the HA rules to work
In the example above, we would permit ESX to restart the virtual machine after the 90-second timeout. To execute a system failure test, I have set a configuration that will force a simulated Windows blue screen of death (BSOD) on the virtual machine. Once the BSOD has occurred on the guest virtual machine, the virtual machine loses its IP address, and the VMware Infrastructure Client loses communication with VMware Tools.
The VMware HA agent monitors use and access to the virtual resources, so it knows if a virtual machine is running or if it's in a failed state. Monitoring occurs regardless of whether the guest operating system is running VMware Tools or has an IP address assigned. Thus, server build processes or disabling a network interface will not induce an HA event. I simulated the BSOD on the VMWIN2K3-0001 system as shown below:
Once the configured thresholds are met, the ESX host will reset the virtual machine. This is the equivalent of pushing the power button on a physical system without a graceful shutdown. In the example above, however, there are not many other available options. VMware HA doesn't correct the underlying issue, but it can be used as a tool to work around hard errors like the one in the previous example.
Accountability for HA events
The VMware HA automated events will not be shown in the scrolling log of the VMware Infrastructure Client, and there isn't centralized logging for this class of event. While most ESX events appear in clear text in either the VPX_EVENT or VPX_TASK table, some may be held in the ntext fields within the VirtualCenter database.
For the example above, the VMware HA event was logged on the local ESX 3.5 development system in the /var/log/vmware/hostd-2.log file. The hostd-2.log file is not centralized to VirtualCenter and is very cumbersome to traverse. The entry for the HA event in this example encompasses 25 lines in the file. I've included the significant log file entries from the HA event below.
First three events in log sequence:
[2008-01-17 00:20:50.439 'TaskManager' 35957680 info] Task Created : haTask-64-vim.VirtualMachine.reset-8728
[2008-01-17 00:20:50.439 'ha-eventmgr' 35957680 info] Event 81 : VMWIN2K3-0001 on ESX35DEV0001.AMCS.TLD in ha-datacenter is reset
[2008-01-17 00:20:50.440 'vm:/vmfs/volumes/478bd6c8-3f8f2109-7d9e-00188b36fd47/VMWIN2K3-0001/VMWIN2K3-0001.vmx' 35957680 info] State Transition (VM_STATE_ON -> VM_STATE_RESETTING)
Two middle events in log sequence:
[2008-01-17 00:20:52.763 'ha-eventmgr' 128564144 info] Event 83 : VMWIN2K3-0001 on esx35dev0001.amcs.tld in ha-datacenter is powered on
[2008-01-17 00:20:52.763 'vm:/vmfs/volumes/478bd6c8-3f8f2109-7d9e-00188b36fd47/VMWIN2K3-0001/VMWIN2K3-0001.vmx' 128564144 info] State Transition (VM_STATE_RESETTING -> VM_STATE_ON)
In these events, you can follow the sequence: ESX started the reset procedure and took the virtual machine to the powered-on state. These events may be intermixed with other ESX messages, so be sure to use your text viewer's find function to look for these events.
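Rather than relying on a text viewer's find function, you could pull these events out with a short script. The Python sketch below assumes one log entry per line and assumes every event line follows the same layout as the 'ha-eventmgr' entries quoted above; the regular expression and the find_vm_events function are my own illustration, not a VMware-supplied parser, and the sample text is copied from this example.

```python
import re

# Pattern modeled on the 'ha-eventmgr' lines quoted in this tip (an assumption
# about the general layout, based only on those sample lines).
EVENT_RE = re.compile(
    r"\[(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) 'ha-eventmgr' \d+ \w+\] "
    r"Event (?P<id>\d+) : (?P<vm>\S+) on (?P<host>\S+) in (?P<dc>\S+) "
    r"is (?P<state>.+?)\s*$"
)

SAMPLE = """\
[2008-01-17 00:20:50.439 'TaskManager' 35957680 info] Task Created : haTask-64-vim.VirtualMachine.reset-8728
[2008-01-17 00:20:50.439 'ha-eventmgr' 35957680 info] Event 81 : VMWIN2K3-0001 on ESX35DEV0001.AMCS.TLD in ha-datacenter is reset
[2008-01-17 00:20:52.763 'ha-eventmgr' 128564144 info] Event 83 : VMWIN2K3-0001 on esx35dev0001.amcs.tld in ha-datacenter is powered on
"""


def find_vm_events(lines, states=("reset", "powered on")):
    """Return one dict per event line whose final state is in `states`."""
    events = []
    for line in lines:
        match = EVENT_RE.search(line)
        if match and match.group("state") in states:
            events.append(match.groupdict())
    return events


events = find_vm_events(SAMPLE.splitlines())
for event in events:
    print(event["time"], event["vm"], event["state"])
```

To run it against a real file, pass an open file object instead of SAMPLE.splitlines(), since the function only needs an iterable of lines.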
Note defaults and define your usage requirements
It is important to be aware of the defaults for VMware HA in your ESX environment. The base functionality may meet your expectations, or you can configure HA to meet your operational needs. One important default is the maximum number of failures permitted. The default configuration for das.maxFailures and das.maxFailureWindow permits any single virtual machine to fail and be reset by VMware HA only three times. The triple-reset default makes sense because simply resetting a constantly failing system is not a true solution to the failure. The Virtual Machine Failure Monitoring technical note is a good place to start collecting information on defaults and other advanced configuration items for VMware HA.
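To make the interaction between these settings easier to reason about, here is a simplified Python model of the reset decision. It is a sketch of my reading of the behavior described in this tip, not VMware's implementation: the ResetPolicy class is hypothetical, it uses the 180 and 90 second values from my example plus the three-reset default, and it deliberately ignores das.maxFailureWindow.

```python
from dataclasses import dataclass


@dataclass
class ResetPolicy:
    """Simplified, hypothetical model of the reset rules described in this tip."""
    min_uptime: int = 180        # das.MinUptime from this example (seconds)
    failure_interval: int = 90   # das.FailureInterval from this example (seconds)
    max_failures: int = 3        # default reset limit described above
    resets: int = 0              # resets performed so far

    def should_reset(self, uptime, unresponsive_for):
        """Decide whether a reset is allowed for one virtual machine."""
        if uptime < self.min_uptime:
            return False         # still inside the initialization grace period
        if unresponsive_for < self.failure_interval:
            return False         # not unresponsive long enough yet
        if self.resets >= self.max_failures:
            return False         # limit reached; an administrator must step in
        self.resets += 1
        return True


policy = ResetPolicy()
decisions = [policy.should_reset(uptime=600, unresponsive_for=95) for _ in range(4)]
print(decisions)
```

In this model the fourth request is refused, which is the point of the default: after three resets, the failure needs a person rather than another power cycle.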
More on availability in virtual environments
If your organization requires an audit trail for change control on the VMware HA automated reset events for virtual machines, a script that traverses the log for a pattern match and archives the results may be appropriate. Further, a script that copies the log files into a central repository (renaming them at the destination) may be a good supplement, as the local log does not keep much history.
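As a starting point for the copy-and-rename idea, here is a minimal Python sketch. It assumes Python 3 on a system that can read the log directory, and the archive_logs function, the hostd-*.log file pattern and the naming scheme (host name, then timestamp, then original file name) are all my own assumptions for illustration.

```python
import shutil
import time
from pathlib import Path


def archive_logs(source_dir, archive_dir, host, pattern="hostd-*.log"):
    """Copy matching logs into archive_dir, prefixing the host and a timestamp."""
    archive = Path(archive_dir)
    archive.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    copied = []
    for log in sorted(Path(source_dir).glob(pattern)):
        # Rename at the destination so copies from different hosts and
        # different runs never overwrite each other.
        target = archive / f"{host}_{stamp}_{log.name}"
        shutil.copy2(log, target)   # copy2 also preserves file timestamps
        copied.append(target)
    return copied
```

A call such as archive_logs("/var/log/vmware", "/mnt/ha-archive", "esx35dev0001") would then collect the hostd logs from one host; the /mnt/ha-archive destination is a placeholder for whatever central repository you use.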
ABOUT THE AUTHOR: Rick Vanover is an MCSA-certified system administrator for Belron US in Columbus, Ohio. Rick has been working with information technology for over 10 years and with virtualization technologies for over seven years.