As you may be aware, I occasionally get to pass on some hard-learned lessons gained from troubleshooting scenarios...
in the data center. This particular story is about the loss -- and recovery -- of vSphere vCenter after a patch procedure went awry.
It started off innocently enough: vSphere vCenter wasn't behaving. It wasn't stopping or starting correctly, so to address the issue, I restarted it. After the reboot came the biggest problem: vCenter didn't come back.
Finding the problem
First, I had to determine which host server vCenter was on by manually logging on to each server via Integrated Lights-Out (iLO), disabling lockdown mode, and then logging in with the vSphere client. With only a few nodes, this is a simple task; with more than 20 nodes, it becomes a time-consuming exercise, with the added pressure of having no way to manage the infrastructure until vCenter is found.
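If SSH is reachable on the hosts, one way to shorten the hunt is to script it: `vim-cmd vmsvc/getallvms` on each ESXi host lists the VMs registered to it. The sketch below is a dry run that only prints the command it would issue per host; the host names and the VM name are assumptions for illustration.

```shell
#!/bin/sh
# Dry-run sketch: find which ESXi host holds the vCenter VM.
# Host list and VM name are assumptions -- adjust for your environment.
HOSTS="esx01 esx02 esx03"
VM_NAME="vcenter"

# Build the command each host would run; vim-cmd ships with ESXi.
remote_cmd() {
    printf 'vim-cmd vmsvc/getallvms | grep -i %s' "$VM_NAME"
}

for h in $HOSTS; do
    # Live version: ssh root@"$h" "$(remote_cmd)"
    echo "would run on $h: $(remote_cmd)"
done
```

Note the caveat: lockdown mode blocks direct SSH and shell logins too, so this only helps on hosts where the SSH service is enabled and reachable.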
Prevent vCenter from wandering
To avoid extended downtime and make your life easier, either create a DRS rule to restrict which hosts vCenter can run on, or override the VM's DRS automation level so it won't migrate unless its host fails. I personally prefer a DRS rule restricting movement to a few select hosts.
Once I found the VM and opened its console, the issue appeared to stem from an update that would not complete. A third party had patched the server and left it in a bad state. Despite several reboots, the server wouldn't come back. I tried the "last known good configuration" boot option, but that also proved futile.
Issues from using a paravirtual controller
As a sanity check, I tried to boot from the OEM DVD to see whether the drives were visible and the data was intact. At this point, the second problem hit: the vCenter VM had its disks attached to a paravirtual SCSI (PVSCSI) controller. Paravirtual disks are typically only usable after VMware Tools, which supplies the driver, has been installed. As a precaution, I suggest slipstreaming the paravirtual drivers into a custom bootable install DVD; that way you can at least see the drives if a similar situation arises.
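If your vCenter runs on Windows, one way to build such media is to inject the PVSCSI driver into the install image's boot.wim using Microsoft's DISM tool. The sketch below only prints the command sequence for review (a dry run); the image index and all paths are assumptions, and the driver files themselves can be copied from a VM that already has VMware Tools installed.

```shell
#!/bin/sh
# Dry-run sketch: slipstream the VMware PVSCSI driver into Windows
# install media with DISM (commands are run from a Windows admin prompt).
# All paths are assumptions -- adjust for your environment.
driver_slipstream_cmds() {
    cat <<'EOF'
Dism /Mount-Image /ImageFile:D:\sources\boot.wim /Index:2 /MountDir:C:\mount
Dism /Image:C:\mount /Add-Driver /Driver:C:\drivers\pvscsi /Recurse
Dism /Unmount-Image /MountDir:C:\mount /Commit
EOF
}

# Print the sequence rather than executing it.
driver_slipstream_cmds
```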
At this point, we decided a guest rebuild was the best option. When you lose vSphere vCenter, you lose the ability to deploy from template, so in the end I had to rebuild from an ISO image. Due to the complexity of the build, it took several hours before we could restore the data.
Steps for a successful data restore
Prior to restoring, rebuild from the first available version of the install media so that the restore overwrites all the files correctly. Also, make sure you set the disk controller type to "paravirtual," assuming your old vCenter used paravirtual disk controllers.
Make an OVA backup
Another helpful tip is to periodically export the vCenter VM to an OVA file and keep it in an easily accessible location. If you lose vCenter, you can then rename the old VM -- do not delete it until services have been properly restored -- and redeploy a recent version from your OVA backup.
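One way to automate that export is VMware's ovftool. The sketch below merely assembles and prints the command (a dry run) so it can be reviewed before wiring it into a scheduled job; the vCenter hostname, inventory path, and backup directory are assumptions for illustration.

```shell
#!/bin/sh
# Dry-run sketch: build a dated ovftool export command for the vCenter VM.
# Hostname, inventory path, and destination are assumptions -- adjust them.
VC_HOST="vcenter.example.com"
VM_PATH="Datacenter/vm/vCenter"
DEST_DIR="/backups/ova"

build_export_cmd() {
    stamp=$(date +%Y%m%d)
    printf 'ovftool "vi://administrator@%s/%s" "%s/vcenter-%s.ova"' \
        "$VC_HOST" "$VM_PATH" "$DEST_DIR" "$stamp"
}

# Print the command; in production you would run it from a cron job.
build_export_cmd
echo
```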
There is one huge caveat: if you are using the free version of Microsoft SQL Server and store the database on the same server, this option won't work. The vCenter database will be out of sync with the previous state of vCenter and could be missing recently added VMs, although those VMs will still exist on the disks and can be re-added.
As every virtualization administrator knows, a snapshot of vCenter before the patching process would have avoided a lot of unnecessary effort and stress. I hope these steps serve as a reminder for IT to be proactive and make backups so their shops don't end up in the same situation.
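From the ESXi host's shell, that pre-patch snapshot takes two commands. The sketch below prints them rather than running them (a dry run); the VM name is an assumption, and the trailing `0 0` requests a snapshot without memory state or quiescing.

```shell
#!/bin/sh
# Dry-run sketch: snapshot the vCenter VM from its ESXi host's shell
# before patching. The VM name is an assumption; vim-cmd ships with ESXi.
VM_NAME="vcenter"

pre_patch_snapshot() {
    cat <<EOF
vmid=\$(vim-cmd vmsvc/getallvms | awk '/$VM_NAME/ {print \$1; exit}')
vim-cmd vmsvc/snapshot.create "\$vmid" pre-patch "before patching" 0 0
EOF
}

# Show the commands to run on the host.
pre_patch_snapshot
```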