Protect hosts from hardware degradation with vSphere High Availability

New to vSphere 6.5, Proactive High Availability helps protect hosts from hardware degradation by partnering with hardware vendors to detect problems and evacuate VMs.

Along with updates to the Distributed Resource Scheduler and Fault Tolerance, vSphere 6.5 includes new Proactive High Availability features to improve users' overall resource management experience. This new version of vSphere High Availability works in conjunction with the Distributed Resource Scheduler and agents from your hardware vendor to evacuate VMs from a host before a problem occurs.

Think about a scenario in which the hardware sensors trigger an alert because one of the two power supplies in your server has failed, or one in which the CPU fan stops working. In either case, you could leave everything as is, but the risk of a server crash is high, so the safe option is to move the VMs off the host and run the workloads on the healthy nodes that remain in the cluster. This gives the administrator a chance to address the hardware problem and bring the host back online and, in the meantime, end users won't experience any downtime.

Enabling the new vSphere High Availability

To enable Proactive High Availability (HA), as shown in Figure A, the cluster must have Distributed Resource Scheduler (DRS) enabled because DRS uses vMotion to move VMs to other hosts while they are still running. A hardware vendor-provided agent that triggers hardware-related alerts, such as the one in the Dell Customized Image of VMware ESXi 6.5, is also required. This image -- which will also become available from other vendors -- provides the necessary hardware health checks.

Figure A. Enabling Proactive High Availability.

Then, under vSphere Availability, you configure how the cluster responds when hardware degradation occurs. As you can see in Figure B below, two of the available remediation options use quarantine mode: One applies it to all failures, and the other applies it only to moderate failures, depending on the severity level of the hardware error. A quarantined host can still be used, but only when needed to satisfy DRS affinity rules. If no such affinity rules exist and all VMs can run on other hosts, the VMs will be evacuated to other hosts. The other setting puts the host in maintenance mode, which means the VMs will always be moved to other hosts when an alert is triggered.

Figure B. Configuring behavior for hardware degradation in vSphere Availability.
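
The remediation behavior described above can be sketched as a small decision function. This is a toy model for illustration only, not VMware's implementation and not a vSphere API: The names `Policy`, `Severity`, `host_mode` and `vm_is_evacuated` are hypothetical, and the three policies are an assumption based on the severity-dependent behavior described in the text.

```python
from enum import Enum


class Severity(Enum):
    MODERATE = "moderate"
    SEVERE = "severe"


class Policy(Enum):
    # Assumed remediation options; the labels are illustrative.
    QUARANTINE = "quarantine for all failures"
    MIXED = "quarantine for moderate, maintenance for severe"
    MAINTENANCE = "maintenance for all failures"


def host_mode(policy: Policy, severity: Severity) -> str:
    """Return the mode a degraded host is placed in."""
    if policy is Policy.MAINTENANCE:
        return "maintenance"
    if policy is Policy.MIXED and severity is Severity.SEVERE:
        return "maintenance"
    return "quarantine"


def vm_is_evacuated(mode: str, needed_for_affinity_rule: bool) -> bool:
    """Maintenance mode always evacuates a VM; quarantine mode keeps a VM
    on the host only when a DRS affinity rule requires it."""
    if mode == "maintenance":
        return True
    return not needed_for_affinity_rule
```

For example, `host_mode(Policy.MIXED, Severity.MODERATE)` yields quarantine, so a VM tied to the host by an affinity rule stays put, while the same policy with a severe failure evacuates every VM.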

More proactive work done by your cluster

vSphere High Availability isn't the only tool with proactive features; DRS also does some proactive work in your cluster. When used in conjunction with vRealize Operations (vROps), DRS can predict when a usage peak for VMs is about to occur based on previous measurements, and it can migrate VMs to other hosts ahead of time so that the peak doesn't cause resource contention.

Without this prediction, a usage peak would require corrective action after the fact, which is DRS's traditional approach. The way DRS with vROps works is simple: vROps collects the metrics for your VMs, stores them and uses them to calculate dynamic thresholds. vROps already used this tactic to find anomalies in the system, but engineers at VMware came up with a way to use this VM resource usage footprint to predict when a recurring spike in resource usage will take place. Of course, this method works best in data centers where VM load follows a predictable pattern, such as in an office where end users start work around the same time each day and go for lunch around the same time.
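
The idea of predicting a recurring spike from stored metrics can be illustrated with a deliberately simple sketch. It averages historical usage per hour of the day and flags hours whose average crosses a threshold. This is not how vROps calculates its dynamic thresholds; the functions `hourly_profile` and `predicted_peaks`, the sample data format and the fixed threshold are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean


def hourly_profile(samples):
    """Average usage per hour of day.

    samples: iterable of (hour_of_day, cpu_percent) pairs collected
    over several days (hypothetical input format).
    """
    buckets = defaultdict(list)
    for hour, usage in samples:
        buckets[hour].append(usage)
    return {hour: mean(values) for hour, values in buckets.items()}


def predicted_peaks(profile, threshold):
    """Hours of the day whose average usage meets or exceeds the threshold."""
    return sorted(hour for hour, avg in profile.items() if avg >= threshold)


# Two days of measurements: busy at 9:00, quiet at 12:00 (lunch).
history = [(9, 90), (9, 80), (12, 20), (12, 30), (14, 70)]
profile = hourly_profile(history)
peaks = predicted_peaks(profile, threshold=75)
```

With this history, only the 9:00 hour is flagged, so a scheduler that trusted the profile could move VMs before the workday starts instead of reacting once contention appears.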

To enable this feature you need the latest version of vROps, which is currently 6.4. In Figure C, you can see the connection to the vCenter Server that manages the cluster on which you want to use this feature.

Figure C. Configuring vRealize Operations to send data to vCenter.

Other new vSphere DRS features

Once vROps is configured to send data to vCenter, you can enable Predictive DRS on your cluster. With that done, all you have to do is sit back and watch these systems go to work. Like the proactive version of vSphere High Availability, this feature is new, so whether or not it helps improve the availability of resources in clusters is yet to be determined. It's worth mentioning that the feature is only licensed for use in clusters with no more than 4,000 VMs.

Figure D. New vSphere DRS features.

As you can see in Figure D, there are three other features new to vSphere DRS: VM Distribution, Memory Metric for Load Balancing and CPU Over-Commitment.

VM Distribution lets you balance VMs on your cluster nodes based on the number of VMs rather than resource usage. You may encounter a scenario where a large group of VMs is running on only a few hosts while other hosts have very few or no VMs running. This scenario can occur after a server failure when the cluster has so many spare resources that, once the failed host comes back online, there's no need for DRS to migrate VMs to that host, so it stays empty.

This only happens when there isn't enough resource contention to justify migrations. With VMs spread evenly across cluster nodes, a server failure affects fewer VMs than it would if a large number of them ran on the failed host. This feature is secondary to load balancing, so machines are only spread evenly when resource balancing isn't in jeopardy.
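
The benefit of spreading VMs by count can be shown with simple arithmetic. The sketch below is an illustration only, not the DRS algorithm: The host names and the functions `worst_case_affected` and `spread_evenly` are hypothetical, and the model ignores resource usage entirely, which real DRS does not.

```python
def worst_case_affected(vms_per_host):
    """Number of VMs that go down if the fullest host fails."""
    return max(vms_per_host.values())


def spread_evenly(vms_per_host):
    """Redistribute the same total number of VMs as evenly as possible."""
    hosts = sorted(vms_per_host)
    base, extra = divmod(sum(vms_per_host.values()), len(hosts))
    # The first `extra` hosts take one additional VM each.
    return {host: base + (1 if i < extra else 0) for i, host in enumerate(hosts)}


# esx03 came back after a failure and stayed empty.
before = {"esx01": 20, "esx02": 20, "esx03": 0}
after = spread_evenly(before)
```

In this example, a single host failure can take down 20 VMs before redistribution and at most 14 afterward.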

The last two settings control resource load balancing in combination with overcommitment. Memory Metric for Load Balancing allows you to balance on consumed memory instead of active memory. If you look at each of your VMs, you may see that they report their entire RAM as consumed. With this setting, VMs are balanced based on their memory assignments rather than the memory each one actively uses.
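
To see why the choice of metric matters, consider a toy comparison in which the same cluster is ranked by active memory and by consumed memory. The figures, host names, dictionary keys and the functions `host_memory_load` and `busiest_host` are invented for illustration and do not reflect how DRS computes load.

```python
def host_memory_load(vms, metric):
    """Sum one memory metric (in MB) over all VMs on a host."""
    return sum(vm[metric] for vm in vms)


def busiest_host(hosts, metric):
    """Name of the host with the highest load for the given metric."""
    return max(hosts, key=lambda name: host_memory_load(hosts[name], metric))


hosts = {
    "esx01": [
        {"active_mb": 2048, "consumed_mb": 8192},
        {"active_mb": 1024, "consumed_mb": 8192},
    ],
    "esx02": [
        {"active_mb": 4096, "consumed_mb": 6144},
    ],
}
```

Ranked by active memory, esx02 looks like the busier host; ranked by consumed memory, esx01 does, so a balancer would move VMs in the opposite direction.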

CPU Over-Commitment allows you to configure the maximum ratio of virtual CPUs (vCPUs) to physical CPUs (pCPUs). For example, if you configure this setting to 200%, you can start two vCPUs for each pCPU. The maximum value you can set is 500%. This setting prevents excessive overcommitment of CPU resources in your cluster.
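
The ratio works out as straightforward arithmetic, sketched below. The functions `max_vcpus` and `can_power_on` are hypothetical helpers, not vSphere calls, and the sketch leaves aside the question of whether the limit is evaluated per host or across the cluster.

```python
def max_vcpus(pcpus: int, ratio_percent: int) -> int:
    """Upper bound on running vCPUs for a given overcommitment ratio."""
    return pcpus * ratio_percent // 100


def can_power_on(pcpus: int, ratio_percent: int,
                 running_vcpus: int, new_vm_vcpus: int) -> bool:
    """True if starting the new VM keeps the vCPU count within the limit."""
    return running_vcpus + new_vm_vcpus <= max_vcpus(pcpus, ratio_percent)
```

With 16 pCPUs and a 200% setting, the limit is 32 vCPUs: A 2-vCPU VM can still start when 30 vCPUs are running, but a 4-vCPU VM cannot. At the 500% maximum, the same hardware would allow 80 vCPUs.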
