Avoiding downtime with VMware Fault Tolerance and High Availability
VMware released vCenter Server Heartbeat in 2009 to provide high availability for vCenter Server services by clustering these services across two copies of Windows. But a recent licensing change has some administrators wondering what options there are for vCenter high availability.
VMware announced the end of availability for vCenter Server Heartbeat (vCSHB) in June. Administrators can no longer buy a new vCSHB license, and support for existing owners ends in September 2018.
Administrators could also use vCSHB to protect a Microsoft SQL Server instance used for vCenter. The product's role is to deliver higher availability than you get from simply running vCenter in a VM on a High Availability (HA) cluster.
A properly configured HA cluster is our baseline level of high availability in a vSphere environment. If one of the physical servers fails, the VMs it was running will start on other hosts in the cluster. It is important to recognize that the operating system and applications in the VM are essentially reset. The operating system must start up as if the VM had crashed, and then the applications must start up as well. This can take a few minutes. For vCenter, it is not unusual to wait 15 to 30 minutes before all its services are running. The good news is that vCenter does not need to be running for HA to do its job. The HA cluster will start up the vCenter VM like any other VM, and vCenter will return to service automatically.
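As a rough illustration of the restart behavior described above, here is a toy Python model of a cluster as a mapping from hosts to VMs. It is only a sketch, not VMware's placement logic: the fail_host function, the host and VM names, and the "fewest VMs wins" placement rule are all assumptions made for this example.

```python
def fail_host(cluster, failed):
    """Toy model of an HA failover (NOT VMware's real placement algorithm).

    cluster maps a host name to the list of VM names it runs. Every VM from
    the failed host is restarted on the surviving host that currently runs
    the fewest VMs. The vCenter VM is treated like any other VM.
    """
    survivors = {h: list(vms) for h, vms in cluster.items() if h != failed}
    if not survivors:
        raise RuntimeError("no surviving hosts to restart VMs on")
    for vm in cluster.get(failed, []):
        # Hypothetical placement rule: pick the host with the fewest VMs.
        target = min(survivors, key=lambda h: len(survivors[h]))
        survivors[target].append(vm)
    return survivors


# Hypothetical three-host cluster; "vcenter" lives on the host that fails.
cluster = {
    "esx1": ["vcenter", "web1"],
    "esx2": ["db1"],
    "esx3": ["app1", "app2"],
}
after = fail_host(cluster, "esx1")
print(after)
```

Nothing in the model consults vCenter to decide where VMs restart, which mirrors the point that HA keeps working while vCenter itself is down.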
When vCenter Server is down
So, what doesn't work when vCenter Server is not running? Usually the inability to see what is happening is the biggest issue. End users are seldom affected.
First, vCenter is the control point for IT to manage the vSphere environment. While vCenter is down, administrators can't create new VMs or migrate existing ones.
Second, we can't view and monitor our vSphere estate. This will be very annoying as we try to monitor the HA failover. Not being able to handle VMs while the failover happens is probably a good thing. HA is good at its job and is best left alone for a while.
Also, while vCenter is down, the other cluster features -- Distributed Resource Scheduler (DRS) and Storage DRS -- will not run. DRS moves VMs from one physical server to another to spread the load evenly across the servers; Storage DRS does the same thing to balance data stores. While vCenter is down, this balancing will not happen, so changes in VM load could saturate one host and VM performance could suffer. This is a minor risk because the cluster was balanced before vCenter went down; we care more about starting the VMs up. Once vCenter is back, it will reassess the cluster and rebalance the load.
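To make the balancing idea concrete, here is a minimal Python sketch of a greedy rebalancer. It is not the DRS algorithm, which weighs far more than a single load number; the rebalance function, the load figures and the host names are hypothetical and exist only to show the general idea of moving a VM from the busiest host to the least busy one.

```python
def rebalance(hosts, max_moves=10):
    """Greedy toy rebalance (an illustration, NOT the real DRS algorithm).

    hosts maps a host name to a dict of {vm_name: load}. Each pass moves one
    VM from the busiest host to the least busy host, choosing the VM that
    leaves the two hosts closest to equal. Returns the list of moves made.
    """
    moves = []
    for _ in range(max_moves):
        load = {h: sum(vms.values()) for h, vms in hosts.items()}
        busiest = max(load, key=load.get)
        idlest = min(load, key=load.get)
        gap = load[busiest] - load[idlest]
        # Only a VM smaller than the gap narrows the imbalance when moved.
        candidates = {vm: l for vm, l in hosts[busiest].items() if 0 < l < gap}
        if not candidates:
            break
        vm = min(candidates, key=lambda v: abs(gap - 2 * candidates[v]))
        hosts[idlest][vm] = hosts[busiest].pop(vm)
        moves.append((vm, busiest, idlest))
    return moves


# Hypothetical loads in arbitrary units.
hosts = {
    "esx1": {"a": 40, "b": 30, "c": 20},
    "esx2": {"d": 10},
}
moves = rebalance(hosts)
print(moves)
```

While vCenter is down, nothing runs a loop like this, so the placement that existed at the moment of the outage simply stays in effect until vCenter returns.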
There are a few places where vCenter outages are visible to users. There are products that drive vCenter from another platform; vCloud Director (vCD) and Horizon View are the usual two. For these, end users may need VMware vCenter to create or modify VMs for them. If vCenter is down, vCD users cannot provision new VMs or manage the virtual hardware of existing VMs. Consumers of cloud services tend not to tolerate management outages, so we need a fast recovery.
With Horizon View, users may not be able to access their desktops. A 30-minute outage may be a big problem if it happens at 9 a.m. when the whole office is waiting to work.
A separate vCenter just for View or vCD is a great idea, even if you don't need it for scalability. Separating it for management can help speed recovery times.
What might VMware do to provide HA for vCenter?
The first choice is nothing: Leave vCenter protected by vSphere HA, like all your other VMs. If vSphere HA is good enough for production Oracle servers, then why not for vCenter? Of course, the HA recovery does involve a service outage while the VMs are booted.
The next option is to leave extra availability to third-party developers; vCSHB is licensed from and developed by NeverFail. I imagine you will be able to buy NeverFail for vCenter once the OEM deal with VMware expires. There is still a service outage, but only for the service startup, because the VM is already booted.
A third option is multiprocessor vSphere Fault Tolerance, which we have been hearing about for a while. If this feature ships in a vSphere update, it would allow vCenter to be resilient to HA events. There would be no vCenter outage if the ESXi server fails.
A scale-out vCenter version?
The option I hope for is that VMware will make vCenter a scale-out appliance made of several clustered Linux VMs. This would be the Web app method -- a load-balanced set of one or more VMs that provide a service. The service is available so long as there is a minimum number of VMs running. Individual VMs could fail without the service failing. This was described in a "projects" area of the VMware stand at VMworld 2012, and I've been waiting for it ever since. This scale-out fault-tolerant architecture is widely believed to be the future of cloud-enabled application development.
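The availability rule behind this scale-out model is simple enough to sketch in a few lines of Python. This is purely illustrative: the service_available function, the node names and the minimum of two nodes are assumptions of mine, not anything VMware has described.

```python
def service_available(node_up, minimum):
    """Return True while at least `minimum` nodes are still running.

    node_up maps a node name to True (running) or False (failed).
    """
    return sum(node_up.values()) >= minimum


# Hypothetical three-node vCenter appliance that needs two nodes to serve.
nodes = {"vc-node1": True, "vc-node2": True, "vc-node3": True}

nodes["vc-node2"] = False          # one VM fails
one_down = service_available(nodes, minimum=2)

nodes["vc-node3"] = False          # a second VM fails
two_down = service_available(nodes, minimum=2)

print(one_down, two_down)
```

With three nodes and a minimum of two, the loss of any single VM goes unnoticed by users; only a second simultaneous failure takes the service down.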
As always, there are many ways to solve a problem. The solution you adopt will depend on your requirements, so a range of possible solutions is important. Most vSphere customers find that an HA cluster is a great level of recovery; just a few need something more.