Prevent vSphere High Availability woes through a proper configuration

Until vSphere High Availability settings have been tuned to your environment, virtual machines may not restart after a failure.

When configuring a vSphere cluster, one of the most valuable features is vSphere High Availability, which automatically restarts virtual machines on an available host if a server -- or the operating system -- fails.

But that virtual safety net won't be effective until the system administrator applies the right vSphere High Availability (HA) settings for the environment. Once configured correctly, vSphere HA will improve the availability of all virtual machines within a vSphere cluster. Here's a closer look at several items that need to be set properly to avoid downtime.

Properly configure the Admission Control Policy

After you enable HA, you can configure the Admission Control Policy (ACP). This feature sets the amount of resources reserved for a failover. If there aren't enough cluster resources to honor that reservation, ACP prevents new virtual machines (VMs) from powering on. Before making changes, note the configuration of the hosts in the cluster and the number of host failures that should be tolerated.

There are three options in the Admission Control Policy.

The first Admission Control Policy option sets the number of host failures that can be tolerated. When calculating whether a VM can restart, HA assumes the largest hosts will be the ones to fail.
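
As a minimal sketch, the arithmetic behind this policy can be pictured with slot counts per host: when checking whether one more VM fits, HA assumes the hosts with the most capacity are the ones lost. All names and numbers here are hypothetical; real HA derives slot sizes from VM CPU and memory reservations.

```python
# Hypothetical sketch of the "host failures to tolerate" admission check.
# host_slot_counts: how many VM "slots" each host can run.
def failover_slots(host_slot_counts, failures_to_tolerate):
    """Slots remaining after losing the largest hosts (HA's worst-case assumption)."""
    survivors = sorted(host_slot_counts)
    if failures_to_tolerate:
        survivors = survivors[:-failures_to_tolerate]
    return sum(survivors)

def admits_vm(host_slot_counts, powered_on_vms, failures_to_tolerate=1):
    """Admit one more VM only if it still fits after the worst-case host loss."""
    return powered_on_vms + 1 <= failover_slots(host_slot_counts, failures_to_tolerate)
```

With three hosts of 10, 10 and 20 slots and one tolerated failure, only 20 slots are guaranteed, because the 20-slot host is assumed to be the one that fails.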

The second Admission Control Policy option determines the percentage of CPU and memory resources within the cluster to reserve. This control over how much capacity is reserved can be useful in a heterogeneous host environment.
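
The percentage-based check can be sketched as follows. The 25% defaults and the resource figures are hypothetical examples; real HA computes demand from per-VM reservations rather than raw usage.

```python
# Hypothetical sketch of the percentage-based admission check: a VM is
# admitted only if the reserved share of cluster CPU and memory stays
# free for failover after it powers on.
def admits_vm_pct(total_cpu_mhz, total_mem_mb, used_cpu_mhz, used_mem_mb,
                  vm_cpu_mhz, vm_mem_mb, cpu_reserve_pct=25, mem_reserve_pct=25):
    cpu_ok = used_cpu_mhz + vm_cpu_mhz <= total_cpu_mhz * (1 - cpu_reserve_pct / 100)
    mem_ok = used_mem_mb + vm_mem_mb <= total_mem_mb * (1 - mem_reserve_pct / 100)
    return cpu_ok and mem_ok
```

Because CPU and memory are checked independently, this style of policy adapts better than slot counting when hosts in the cluster have very different sizes.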

The third Admission Control Policy option identifies specific hosts to reserve for failover. To guarantee these resources will be available should a host fail, HA prevents the identified host(s) from powering on any VMs.
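
The effect of this policy can be sketched as a placement rule: during normal operation, designated failover hosts are simply excluded as power-on targets. The host names are hypothetical.

```python
# Hypothetical sketch: dedicated failover hosts never power on VMs
# during normal operation, keeping their capacity free for a failover.
def pick_power_on_host(hosts, failover_hosts):
    candidates = [h for h in hosts if h not in failover_hosts]
    return candidates[0] if candidates else None
```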

Properly configure the host isolation response

For HA to restart a VM, the new host must be able to lock the VM's virtual disk files. If the original host is still running, those files remain locked, which prevents the new host from powering on the VM. To handle this situation, HA has a setting that defines the response a host takes when it determines it is isolated from the other cluster nodes. The response can be to shut down the VMs, power them off or leave them powered on. A cluster-wide default must be set, but each VM can be configured separately.
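
The way a per-VM override falls back to the cluster default can be sketched in a few lines. The VM names are hypothetical; the three response values mirror the choices described above.

```python
# Hypothetical sketch of isolation-response resolution: a VM-specific
# setting wins, otherwise the cluster default applies.
# Valid responses: "shutdown", "poweroff", "leave_powered_on".
def isolation_response(vm_overrides, vm, cluster_default="shutdown"):
    return vm_overrides.get(vm, cluster_default)
```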

How to avoid host isolation

There are a few settings that can prevent a host from declaring itself as isolated.

The first option is to identify an isolation address. HA attempts to ping this address to determine whether the host is disconnected from the network. By default, the isolation address is the default gateway configured on the host. If HA traffic runs on a subnet other than the default gateway's, use the advanced setting das.isolationaddress to define additional isolation addresses.
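
The benefit of extra addresses can be sketched as follows: a host should declare isolation only when every configured address is unreachable. The `ping` argument here is an injected stand-in for a real ICMP check, and the addresses are hypothetical.

```python
# Hypothetical sketch: with multiple das.isolationaddress entries, one
# unreachable gateway is not enough to trigger the isolation response.
def host_is_isolated(isolation_addresses, ping):
    """ping is a callable (address -> bool) standing in for a real ICMP check."""
    return not any(ping(addr) for addr in isolation_addresses)
```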

The second option is to properly configure the network used for HA. All non-vMotion VMkernel ports are used for HA communication among the hosts in a cluster, so proper network interface card (NIC) and switch redundancy -- whether on a single VMkernel port or across multiple VMkernel ports -- is key.

Disable host monitoring during maintenance. This prevents unexpected restarts of virtual machines caused by a temporary condition that affects HA ports or other HA components. VMware also recommends placing the host into maintenance mode during any network changes, which forces HA to recognize the networking changes when the host exits maintenance mode.

Make sure HA host monitoring is enabled. It is easy to leave it disabled, either by forgetting to check the checkbox in the cluster properties when building the cluster or by neglecting to re-enable it after a maintenance period. Some administrators also distrust automation in their virtual environment and never enable it.
