Avoiding downtime with VMware Fault Tolerance and High Availability
VMware's vSphere High Availability rapidly restarts virtual machines when an ESXi server fails. But is powering the virtual machine back on enough? If the restarted virtual machine is starved of resources, it may not deliver the required performance. Virtualization administrators need to configure High Availability correctly so that the performance of virtualized applications is protected in line with their business value.
The vSphere High Availability (HA) feature, first introduced in ESX version 3, restarts virtual machines (VMs) after a hardware or administrator failure takes down their host. It's one of the features that enabled vSphere to become the standard virtualization platform in the enterprise.
VSphere HA addresses one of the key concerns with virtualization: the vulnerability when 10, 20 or a hundred VMs are running on a single ESXi server. Before virtualization, a single physical server failure only stopped one application. Today, there may be 30 applications hosted in VMs on a single physical server. Having HA rapidly restart these VMs if the ESXi server fails helps reduce the impact when a server issue arises.
But vSphere HA only protects the resources reserved for each VM, so setting appropriate reservations is an important part of configuring a highly available vSphere environment. Without reservations, performance may suffer after a hardware failure.
This is a common misunderstanding: HA does not protect the resources the VM consumes, only the reservation that is set. If the reservation is insufficient, then performance is not protected.
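To see why the reservation matters, consider a deliberately simplified model of how a host divides CPU among VMs that all want more than they can get: each VM first receives its reservation, and the remaining capacity is split in proportion to shares. This is a sketch under stated assumptions, not ESXi's actual scheduler, which also honors limits and hands out capacity that some VMs don't use. The VM names, MHz figures and share values are hypothetical.

```python
from __future__ import annotations


def contended_entitlements(
    capacity_mhz: float, vms: dict[str, tuple[float, float]]
) -> dict[str, float]:
    """Simplified model: every VM demands more CPU than the host can supply.

    vms maps a (hypothetical) VM name to (reservation_mhz, shares).
    Each VM gets its reservation first; leftover capacity is split by shares.
    Limits and redistribution of unused capacity are deliberately ignored.
    """
    total_reserved = sum(res for res, _ in vms.values())
    if total_reserved > capacity_mhz:
        raise ValueError("reservations exceed host capacity")
    spare = capacity_mhz - total_reserved
    total_shares = sum(shares for _, shares in vms.values())
    return {
        name: res + spare * shares / total_shares
        for name, (res, shares) in vms.items()
    }


# Hypothetical surviving host with 10,000 MHz: one reserved VM, two unreserved.
before = contended_entitlements(
    10_000, {"erp": (4_000, 1_000), "web1": (0, 1_000), "web2": (0, 1_000)}
)
# HA restarts three more unreserved VMs from a failed host onto the same host.
after = contended_entitlements(
    10_000, {"erp": (4_000, 1_000), **{f"web{i}": (0, 1_000) for i in range(1, 6)}}
)
print(before)  # {'erp': 6000.0, 'web1': 2000.0, 'web2': 2000.0}
print(after)   # erp falls to 5000.0, never below its 4000 reservation; each web VM gets 1000.0
```

In this model the unreserved VMs absorb the whole squeeze when HA piles restarted VMs onto a surviving host, while the reserved VM keeps at least its reservation.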
How to ensure a proper restart
When a virtualization administrator configures an HA cluster, the primary settings determine whether to protect against failures and how much capacity to set aside to absorb them. In a production environment, you always want to protect against failures by enabling the Admission Control setting. Use the Admission Control Policy setting to set aside enough capacity for the number of host failures you need to tolerate. With both of these set, you can be sure that the VMs that were running will be restarted if one of the hosts fails.
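The idea behind a host-failure admission control policy can be sketched as a capacity check: after removing the largest hosts, do the reservations of all powered-on VMs still fit? This is a simplified illustration, not vSphere's actual algorithm, which also counts per-VM memory overhead and offers slot-based and percentage-based policies. The Host and VM classes, names and capacity figures are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Host:
    name: str
    cpu_mhz: int  # usable CPU capacity
    mem_mb: int   # usable memory capacity


@dataclass
class VM:
    name: str
    cpu_res_mhz: int = 0  # CPU reservation; ESXi's default is zero
    mem_res_mb: int = 0   # memory reservation; ESXi's default is zero


def survives_failures(hosts: list[Host], vms: list[VM], failures: int) -> bool:
    """True if every VM reservation still fits after `failures` hosts are lost.

    Pessimistically removes the largest hosts, judged separately for CPU and
    memory. Ignores per-VM memory overhead, which vSphere also counts.
    """
    if failures >= len(hosts):
        return False
    keep = len(hosts) - failures
    cpu_left = sum(sorted(h.cpu_mhz for h in hosts)[:keep])
    mem_left = sum(sorted(h.mem_mb for h in hosts)[:keep])
    cpu_needed = sum(vm.cpu_res_mhz for vm in vms)
    mem_needed = sum(vm.mem_res_mb for vm in vms)
    return cpu_needed <= cpu_left and mem_needed <= mem_left


hosts = [Host(f"esx{i}", 20_000, 131_072) for i in range(1, 4)]
print(survives_failures(hosts, [VM(f"app{i}", 1_000, 8_192) for i in range(30)], 1))  # True
print(survives_failures(hosts, [VM(f"app{i}", 1_000, 8_192) for i in range(40)], 1))  # False
print(survives_failures(hosts, [VM(f"app{i}") for i in range(500)], 1))              # True
```

Note the last call: VMs with no reservations always pass the check, which is why admission control on its own says nothing about how VMs will perform after a failure.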
What you cannot be sure of is how the VMs will perform once they restart, or how every other VM in the cluster will perform, since they all share the same pool of resources. This is because, by default, ESXi does not guarantee any CPU or RAM to a VM; every VM has a reservation of zero until you set one.
In a previous article, we talked about the importance of reservations. Reservations make sure your VM gets the minimum amount of resources it needs to meet its service level. When you power on a VM using vCenter, the HA cluster checks that enough unreserved resources will remain after the configured number of host failures. With the default reservation of zero, HA will let you power on a lot of VMs, but at some point the resources available to each VM will shrink and VM performance will slow to the point where you decide not to power on any more VMs.
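For the slot-based policy ("Host failures cluster tolerates"), vSphere sizes a slot from the largest CPU reservation and the largest memory reservation (plus overhead) among powered-on VMs, counts how many slots fit on each host, and sets aside the slots of the hosts it expects to lose. The sketch below models that under assumptions: a 32 MHz minimum CPU slot and a flat, hypothetical 100 MB memory overhead per VM. Real overhead varies with VM configuration, and the defaults depend on the vSphere version.

```python
from __future__ import annotations


def slot_size(
    vms: list[tuple[int, int]], min_cpu_mhz: int = 32, overhead_mb: int = 100
) -> tuple[int, int]:
    """Slot = largest CPU reservation (at least min_cpu_mhz) and largest memory
    reservation plus an assumed flat overhead. vms holds (cpu_res_mhz, mem_res_mb)."""
    cpu_slot = max([min_cpu_mhz] + [cpu for cpu, _ in vms])
    mem_slot = max((mem for _, mem in vms), default=0) + overhead_mb
    return cpu_slot, mem_slot


def slots_per_host(host_cpu_mhz: int, host_mem_mb: int, slot: tuple[int, int]) -> int:
    cpu_slot, mem_slot = slot
    # A host offers as many slots as its scarcer resource allows.
    return min(host_cpu_mhz // cpu_slot, host_mem_mb // mem_slot)


def failover_slots(
    hosts: list[tuple[int, int]], vms: list[tuple[int, int]], failures: int, **slot_kwargs
) -> int:
    """Slots left after removing the `failures` hosts that hold the most slots."""
    slot = slot_size(vms, **slot_kwargs)
    per_host = sorted(slots_per_host(cpu, mem, slot) for cpu, mem in hosts)
    if failures >= len(per_host):
        return 0
    return sum(per_host[: len(per_host) - failures])


hosts = [(20_000, 131_072)] * 3  # hypothetical: 20 GHz and 128 GB per host
print(failover_slots(hosts, [(0, 0)] * 50, failures=1))          # 1250
print(failover_slots(hosts, [(1_000, 8_192)] * 50, failures=1))  # 30
```

With zero reservations, this model lets admission control accept up to 1,250 VMs on three hosts; once each VM reserves 1,000 MHz and 8 GB, the same cluster admits only 30. That lower ceiling is the price of guaranteed post-failure performance. Because the slot is sized by the largest reservation, a single VM with a big reservation also enlarges the slot for every other VM.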
Set up your VMs to deliver the required performance after a host failure by limiting the number of VMs that can run before performance issues arise. The best way to do this is to set a reservation on each VM.
With reservations set on your VMs, you may find that HA won't let you power on as many VMs in the cluster, and normal utilization on your ESXi servers may not be as high. However, if a host is lost, the VMs will still perform at the required level. This is particularly true where you have critical VMs with high reservations; their importance to the business justifies not working the ESXi servers so hard.
In a cluster with many low-priority VMs, you may still see high normal utilization because their reservations are smaller; these VMs are allowed to lose more resources when an ESXi server fails.