VMware HA redesigned
VMware High Availability (HA) is a core infrastructure feature that restarts virtual machines (VMs) within a cluster if the ESXi host they were running on fails. Restarting the VMs means that they are “crash consistent,” so it looks like they lost power and were restarted. VMware HA also works in conjunction with VMware Distributed Resource Scheduler (DRS) to redistribute VMs and their resources in a cluster.
Prior to vSphere 5, VMware HA was based on Legato's Automated Availability Manager (AAM), which was fairly old software. It was effective, but the underlying architecture was complex to understand and troubleshoot. Up to five ESXi servers acted as primary nodes and all other servers in the cluster were secondary nodes. However, you couldn’t see which server was which type of node unless there was a problem or you ran a PowerCLI script. Furthermore, as this blog points out, VMware High Availability would no longer function if it was running on a blade chassis and all five of the primary nodes were lost.
In vSphere 5, the underlying architecture of VMware HA has been redesigned and modernized. It no longer uses the Legato software, which VMware replaced with Fault Domain Manager (FDM).
Unlike vSphere 4.1, high availability in vSphere 5 relies on just one master server, and all other servers in the HA cluster are available to take over should it fail. Because any other server can assume the role, the master is never a single point of failure: if it does fail, a re-election occurs very quickly.
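To make the single-master model concrete, here is a minimal Python sketch of an election followed by a re-election after the master is lost. The host names, the `Host` class and the ranking rule (most connected data stores wins, with the name breaking ties) are assumptions made for illustration; the article does not describe how FDM actually ranks candidates.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Host:
    name: str                  # hypothetical host name
    connected_datastores: int  # assumed ranking input, not taken from the article
    alive: bool = True


def elect_master(hosts: List[Host]) -> Optional[Host]:
    """Pick one master among the live hosts; every other live host stays a candidate."""
    candidates = [h for h in hosts if h.alive]
    if not candidates:
        return None
    # Assumed ranking rule: most connected data stores first, name as tie-breaker.
    return max(candidates, key=lambda h: (h.connected_datastores, h.name))


cluster = [Host("esx01", 4), Host("esx02", 4), Host("esx03", 2)]

master = elect_master(cluster)
print(master.name)

# Simulate losing the master: the surviving hosts simply run the election again.
survivors = [
    Host(h.name, h.connected_datastores, alive=(h.name != master.name))
    for h in cluster
]
print(elect_master(survivors).name)
```

The point of the sketch is that no host is special: whichever host holds the master role can be replaced by rerunning the same election over the hosts that remain.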
New features in vSphere 5 HA
As part of this VMware HA redesign, there are a number of other features you should know about.
Heartbeat monitoring. One of VMware HA’s shortcomings in vSphere 4.1 was that it was overly reliant on the network and the Domain Name System (DNS). It used only the network for heartbeat testing to see which servers were alive, so if there was a network failure, VMware High Availability could erroneously kick in, restart a VM and, potentially, cause unnecessary downtime. With VMware HA in vSphere 5, both the network and shared data stores are used to see which hosts are available, which helps avoid unnecessary VM restarts.
Also, vSphere 5 HA no longer uses DNS, which earlier versions had relied on to resolve host names when performing administrative tasks. This removes another point of failure.
Host isolation detection has also improved. Thanks to the data stores used for heartbeat monitoring, VMware HA can determine whether a host is isolated from the network (because it can still communicate via data stores) or has crashed completely (because it is no longer communicating via the network or data stores).
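The decision described above can be sketched as a small Python function that combines the two heartbeat signals. The `HostState` names and the function signature are hypothetical, chosen only to mirror the article's wording; this is not FDM's actual state model, which may distinguish more cases.

```python
from enum import Enum


class HostState(Enum):
    HEALTHY = "healthy"
    ISOLATED = "isolated from the network"
    FAILED = "failed"


def classify_host(network_heartbeat: bool, datastore_heartbeat: bool) -> HostState:
    """Classify a host from its two heartbeat channels, as the article describes them."""
    if network_heartbeat:
        # The host still answers over the network, so nothing needs restarting.
        return HostState.HEALTHY
    if datastore_heartbeat:
        # Silent on the network but still writing to a shared data store.
        return HostState.ISOLATED
    # No sign of life on either channel: treat the host as crashed.
    return HostState.FAILED


print(classify_host(False, True).value)
print(classify_host(False, False).value)
```

With network heartbeats alone, the last two cases would be indistinguishable, which is why a network outage in vSphere 4.1 could trigger restarts that were not needed.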
VMware HA also supports IPv6 in vSphere 5.
Simplified log and configuration files. In vSphere 5, you can find VMware HA and FDM log files at /var/log/fdm.log and the configuration file at /etc/opt/vmware/fdm/fdm.cfg. If you want to learn how to use these files, check out these two new VMware Knowledge Base articles: Changing the verbosity of the VMware High Availability Management Agent (FDM) logs and Troubleshooting Fault Domain Manager (FDM) issues.
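As a starting point for working with the log, the following Python sketch prints the most recent lines of /var/log/fdm.log that contain a keyword. The keyword "error" and the helper name `tail_matching` are assumptions; the sketch does plain substring matching and assumes nothing about the log's format.

```python
from collections import deque
from pathlib import Path
from typing import List

FDM_LOG = Path("/var/log/fdm.log")  # log path given in the article


def tail_matching(path: Path, keyword: str, limit: int = 20) -> List[str]:
    """Return the last `limit` lines of `path` containing `keyword` (case-insensitive)."""
    needle = keyword.lower()
    matches: deque = deque(maxlen=limit)  # keeps only the newest `limit` matches
    with path.open("r", errors="replace") as log:
        for line in log:
            if needle in line.lower():
                matches.append(line.rstrip("\n"))
    return list(matches)


if __name__ == "__main__":
    if FDM_LOG.exists():
        for line in tail_matching(FDM_LOG, "error"):  # "error" is an assumed keyword
            print(line)
    else:
        print(f"{FDM_LOG} not found; run this on a host that has the FDM log")
```

Reading the file line by line with a bounded deque keeps memory use small even when the log is large.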
Faster initial install. If you have ever enabled VMware High Availability in an older vSphere cluster with 10 or more hosts, you know that it can take a very long time. With vSphere 5, you’ll find that it takes less than a minute to get VMware HA enabled, whether it’s on two hosts or 10. This is because FDM is more efficient and better integrated with ESXi.
Enhancements to the user interface. From the perspective of the vSphere 5 client, administering VMware HA looks almost identical to the previous version. One of the enhancements you may notice is that the Cluster Status window is different. Here is what each of its three tabs looks like:
Figure 1. This is an example of a Cluster Status screen in the Hosts tab.
Figure 2. This is an example of a Cluster Status screen in the VMs tab.
Figure 3. This is an example of a Cluster Status screen in the Heartbeat tab.
Notice that on the last tab, the heartbeat data stores show two data stores available. That brings me to my last point…
The most common VMware HA configuration mistake
Configuration of VMware HA in vSphere 5 isn’t that different from the process in vSphere 4.1. The biggest difference is that HA now expects at least two shared data stores between all hosts in the HA cluster for heartbeating; with fewer, the cluster reports a configuration error.
Large shops may already have five or 10 shared data stores, but for small shops and smaller HA clusters that are used to having just one, this represents a change.
The best way to confirm that VMware High Availability is up and running correctly is to check the Cluster Status window. As you can see in this Cluster Status window, I’ve gotten an error saying I haven’t met the minimum requirement of at least two shared heartbeat data stores between the hosts in my VMware HA cluster:
Figure 4. This shows an error with the number of heartbeat data stores.
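The two-data-store rule is easy to check ahead of time. The Python sketch below takes a hypothetical inventory mapping each host to the data stores it can see and tests whether at least two are shared by every host. The host and data store names are made up, and the inventory would have to be gathered from your own environment.

```python
from typing import Dict, Set

MIN_HEARTBEAT_DATASTORES = 2  # minimum stated in the article


def shared_datastores(host_datastores: Dict[str, Set[str]]) -> Set[str]:
    """Return the data stores that every host in the cluster can see."""
    if not host_datastores:
        return set()
    return set.intersection(*host_datastores.values())


def meets_heartbeat_minimum(host_datastores: Dict[str, Set[str]]) -> bool:
    """True when the hosts share enough data stores for HA heartbeating."""
    return len(shared_datastores(host_datastores)) >= MIN_HEARTBEAT_DATASTORES


# Hypothetical small shop: two hosts, one shared LUN plus local disks.
small_cluster = {
    "esx01": {"shared-lun-01", "local-esx01"},
    "esx02": {"shared-lun-01", "local-esx02"},
}

print(sorted(shared_datastores(small_cluster)))  # ['shared-lun-01']
print(meets_heartbeat_minimum(small_cluster))    # False
```

In this example, each host sees two data stores, but only one is common to both, so the cluster would show the same kind of error as in Figure 4 until a second shared data store is added.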
Along with vMotion and DRS, VMware HA is likely one of the top three advanced features VMware admins use today. In vSphere 5, the high availability feature has greater scalability, takes less time to implement and reconfigure, is more resilient and is based on 100% VMware code. Make sure you know how VMware HA has changed as you run your real-world implementation and when you take the VMware Certified Professional 5 exam.
This was first published in October 2011