VMware High Availability (HA) is a useful component of a VMware environment, but like other parts of a virtual environment it requires configuration and a certain amount of planning. Because HA can fail, especially as environments grow, smart planning involves determining which workloads you want to protect with HA and running frequent tests. In this tip, we'll discuss possible HA errors, how many hosts make sense per cluster, suggestions for what to do when HA doesn't work if a host or guest virtual machine fails, and situations where using HA may not be necessary.
Why consider how much VMware HA I need to use?
VMware HA is one of the better case-making elements for VI3 environments, but a lot of planning goes into how it is configured. This includes resource planning with regard to reserved capacity. When we determine how far we want to take HA in our environment, we are brought back to the fundamental concept of HA -- host failure. In the event of a host failure, VMware's HA functionality can take over and re-establish affected workloads on other hosts. We won't go into much of the base operations of HA here; for a precursor to this planning material that gives a clear idea of what to expect from HA, check out my earlier tip on configuring VMware HA.
I would not have covered VMware HA adequately unless I acknowledged that there are frequently issues with getting it to work as you would expect. Every administrator has surely encountered a situation in which HA either did not function as expected, gave useless error messages or icon color changes in the VI Client, or had you trying to determine why the HA Agent experienced an error. Given that it is a valuable feature for VI3 environments but can give administrators some headaches, it is worth giving some thought to which workloads you want to protect with HA.
Ideally, HA is used rarely and tested frequently, so it is of paramount importance to test the HA configuration to verify that it works as advertised. As virtual environments grow, the overhead of the reserved capacity needs to be validated. If it is not, an HA event for a host failure may fail because there is not enough capacity left in the cluster to accommodate the current workload.
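As a rough way to reason about that capacity question, the sketch below checks whether a workload's configured RAM would still fit if the largest hosts in a cluster failed. It is a simplification and not VMware's admission control logic: it looks only at RAM, ignores CPU, reservations and virtualization overhead, and the function name and the sample host and VM sizes are made up for illustration.

```python
def survives_host_failures(host_ram_gb, vm_ram_gb, failures=1):
    """Return True if the VMs' total RAM still fits after losing the largest hosts.

    Simplified model (an assumption, not VMware's algorithm): RAM only,
    no overhead, no reservations, no CPU.
    """
    if failures >= len(host_ram_gb):
        return False  # no hosts would be left to run anything
    # Worst case: the hosts with the most RAM are the ones that fail,
    # so keep only the smallest (len - failures) hosts as survivors.
    survivors = sorted(host_ram_gb)[:len(host_ram_gb) - failures]
    return sum(vm_ram_gb) <= sum(survivors)


hosts = [64, 64, 64]    # hypothetical cluster: three 64 GB hosts
vms = [16] * 7 + [8]    # hypothetical workload: 120 GB of configured RAM
print(survives_host_failures(hosts, vms))              # one host failure
print(survives_host_failures(hosts, vms, failures=2))  # two host failures
```

Running a check like this against real inventory numbers as the environment grows is one way to notice that the reserved capacity no longer covers the workload before an actual failure does.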
The pivotal question, however, in determining how much HA to use is the number of hosts in a cluster. Fellow VMware administrators frequently ask me, "What is the ideal number of hosts in a cluster?" Unfortunately, there is no one-size-fits-all answer. For most environments, between five and eight hosts per cluster is optimum. As clusters get larger, there is a valid argument for accommodating more than one host failure. The figure below shows the important question in VMware HA configuration that reserves capacity for host failures:
This is an important point in designing or reconfiguring a VI3 environment, because if there is not enough capacity to accommodate the configured HA rules, bigger issues will arise. One specific issue is the admission control calculations. These calculations are a set of rules that determine, for the most part, how much RAM can be allocated to running workloads while still allowing the HA configuration to be met. When the workload outgrows what the HA configuration allows, the result is error messages such as "insufficient resources to satisfy HA failover" or the generic and less useful "HA agent has an error." A VMware deployment guide has a few examples that illustrate admission control, but no definitive formula. For larger clusters where a second host failure is permitted, there is a corresponding increase in the reserved capacity of the cluster.
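To get a feel for how reserved capacity scales with cluster size, here is a back-of-the-envelope sketch. It assumes every host in the cluster is identical, so tolerating a given number of host failures means holding back that many hosts' worth of capacity. This is an illustrative simplification, not the admission control formula VMware uses, and the function name is hypothetical.

```python
def reserved_fraction(hosts, failures=1):
    """Fraction of cluster capacity held back to tolerate `failures` host failures.

    Assumes identical hosts (an illustrative simplification).
    """
    if hosts < 1 or not 0 <= failures < hosts:
        raise ValueError("need at least one host and fewer failures than hosts")
    return failures / hosts


# Compare the five-to-eight host range discussed above.
for hosts in range(5, 9):
    one = reserved_fraction(hosts, 1)
    two = reserved_fraction(hosts, 2)
    print(f"{hosts} hosts: {one:.0%} reserved for one failure, {two:.0%} for two")
```

Under this assumption, a larger cluster gives up a smaller share of its capacity for a single host failure, which is part of why allowing for a second failure becomes easier to justify as the host count rises.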
When VMware HA goes bad
When VMware HA goes bad, it can be a difficult issue to resolve. There are many situations in which the agents on the ESX hosts do not communicate correctly with VirtualCenter, and in those cases a variety of steps are required to correct the communication. While HA issues are outstanding, do not expect HA to work correctly if a host or guest fails. Here are some possible courses of action:
- Reduce workload: Turn off unnecessary VMs, including development or test systems.
- Reconfigure HA: If one host is the issue, right-click and select 'Reconfigure for VMware HA'.
- Enter maintenance mode: Get the host quiet, exit maintenance mode and then present a workload back to the system.
- Re-enter a cluster: Get a host in maintenance mode, exit the cluster and re-enter. This reconfigures the HA (and DRS agents, if used) on the host.
- Look around: There are plenty of resources to get an HA issue resolved, including SearchVMware.com blogs, the VMware Communities sites and VMware Support, if you have that option.
- Simplify the configuration: Unfortunately, HA "seems to always work" in the simpler configurations.
HA can go bad, and it is critically important to know how to correct it if this is functionality you are counting upon. While there are big plans for a next generation of HA and fault tolerance for VMware products, the current VI3-based HA is what we have now and we need to know how to correct issues that may occur.
Situations where HA is not necessary
For organizations that have embraced virtualization and been successful in their implementations, a tiered effect frequently develops. This effect is a separation between workloads that require the highest availability, with add-ons such as VMware HA, and a second class of workload that does not require this functionality. A specific example may be virtual machines that are designed with disaster recovery and fault tolerance, such as a pool of Web servers. In this example, VMware HA doesn't offer much because in most HA events, a restart of the VM is required. The VM will come back online after the HA event, but the Web sessions will be affected. If the pool of Web servers is behind a load distributing switch with a virtual IP address, sessions can get redirected to an available Web server. In this situation, there is not a strong case for protecting these workloads with HA.
The other category where HA is not needed is development or test systems. Every organization has these in varied size and scope, but the one common thing is that they are not needed to make the company money or otherwise meet its top-level goals. In this situation, is it worth the extra cost of VMware Infrastructure Standard and Enterprise editions (which also include VMotion and DRS features)? Chances are that if the development environment is fully separated, a case can be made for a lower tier of virtualization for this workload.
Planning and vigilance with VMware HA
With the good and bad of VMware HA summarized here for you, one last piece of advice is to make it a priority to stay current on the VirtualCenter version used. Specifically, VirtualCenter 2.5 Update 3 has nine resolved issues related to HA and Update 2 had four resolved HA issues. Upgrading VirtualCenter is relatively painless, and easier to accommodate than upgrading an ESX host. While that is a lot to absorb, maintaining VMware HA can make the life of an administrator much easier than having to explain why it did not work or why we are paying extra money for HA on dev systems.
If you have any questions, write to me via email@example.com. Your questions will be forwarded to me by one of the site editors and I'll respond as soon as I can.
ABOUT THE AUTHOR: Rick Vanover (MCSA) is a systems administrator for Safelite AutoGlass in Columbus, Ohio. Rick has over 12 years of IT experience and focuses on virtualization, Windows-based server administration and system hardware.