Recently, I encountered a type of problem that an administrator working on a vSphere platform doesn't see often in a well-designed network: the ability to manage a host from vCenter Server was lost.
Trying to administer a host without vCenter or a vSphere client requires some work. It is in these scenarios that remote consoles such as Dell DRAC and HP iLO can save the day. Managing the network via the Direct Console User Interface (DCUI) -- or the ESXi console in simple terms -- can be tricky, but it is workable. Here are some of the lessons I learned from managing the host without the client -- or even secure shell (SSH).
In a well-designed network, the virtual machine (VM) network should be separate from the management network, with multiple correctly configured physical NICs uplinked to different switches. Traffic from one should not contaminate the other. In my case, this separation meant the VMs were still up and working. If the VMs are reachable using either the Remote Desktop Protocol or SSH, an administrator should try to log in to each guest and do a graceful shutdown.
Checking the network infrastructure proved the issue wasn't hardware or switch related. Further investigation proved a software driver was the root cause of the failure.
Moving the VMs to another host
The first order of business was to get the affected guests restarted on a host that was working correctly so the VMs could be managed while the faulty host was remediated. Bearing in mind this server was several thousand miles away, I had to use the out-of-band management system.
Using the DCUI or any remote console functionality is less than ideal, but it works in an emergency. From a command prompt, you can shut down VMs that can't be reached any other way. On vSphere versions prior to 6.0, press Alt + F1 (the ESXi Shell must be enabled under Troubleshooting Options in the DCUI); after you enter the root credentials, this gives a limited-functionality console prompt. As noted before, the VMs should really be powered off cleanly using other means. The details below are for last-ditch efforts where, for example, the network had a common point of failure between the VM networks and the management networks.
In vSphere 6, when you select shutdown from the DCUI, there is now an option to force a power off of any remaining guests. Just tab to the box and select it if needed. Use this feature with caution, as careless use could easily cause an unintended outage.
For those on vSphere 5, use the following commands to locate and power off the VMs in question.
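To list all the VMs, use this command. (Note the World ID of each VM to shut down):
esxcli vm process list
To stop a VM, use this command. Start with the soft type and escalate to hard, then force, only if the VM does not stop:
esxcli vm process kill --type=soft --world-id=<World ID>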
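The list-then-kill sequence can be scripted when several guests need to be stopped. The following is a minimal sketch, not a tested procedure: it assumes the listing prints one "World ID:" line per running VM, and it relies on esxcli vm process kill accepting a --type of soft, hard or force together with a --world-id. The function names and the KILL_WAIT variable are hypothetical names chosen for illustration.

```shell
#!/bin/sh
# Sketch only: helper names and KILL_WAIT are illustrative, not part of esxcli.

# Print the World ID of every VM found in `esxcli vm process list` output.
# Reads the listing on stdin so it can be tried against saved output.
list_world_ids() {
    awk -F': *' '/^[[:space:]]*World ID:/ { print $2 }'
}

# Succeed if the given World ID still appears in the live listing.
vm_is_running() {
    esxcli vm process list | list_world_ids | grep -qx "$1"
}

# Try a graceful stop first, then escalate: soft -> hard -> force.
# Returns 0 once the VM is gone, 1 if it survives all three attempts.
kill_vm() {
    world_id="$1"
    for kill_type in soft hard force; do
        vm_is_running "$world_id" || return 0
        esxcli vm process kill --type="$kill_type" --world-id="$world_id"
        # Give the VM time to exit before checking again (assumed 10 seconds).
        sleep "${KILL_WAIT:-10}"
    done
    vm_is_running "$world_id" && return 1
    return 0
}
```

With the functions loaded, kill_vm followed by a World ID from the listing stops a single guest. Escalating only when the previous attempt fails keeps the forced power-off as a last resort.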
Once this is done, the host can be placed into maintenance mode by using this command:
esxcli system maintenanceMode set --enable true
Although this isn't essential, it is good practice and ensures all guests are powered off and the host is ready to be powered down.
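Before switching modes, it can help to confirm that no guests are still running and then read the state back. The sketch below assumes the same "World ID:" listing format as above and uses esxcli system maintenanceMode get to report the current state; the function names are hypothetical.

```shell
#!/bin/sh
# Sketch only: function names are illustrative.

# Count running VMs from `esxcli vm process list` output read on stdin.
running_vm_count() {
    grep -c 'World ID:'
}

# Enter maintenance mode only when no VM processes remain, then print the state.
enter_maintenance_if_idle() {
    remaining=$(esxcli vm process list | running_vm_count)
    if [ "$remaining" -gt 0 ]; then
        echo "$remaining VM(s) still running; not entering maintenance mode" >&2
        return 1
    fi
    esxcli system maintenanceMode set --enable true &&
        esxcli system maintenanceMode get
}
```

Refusing to proceed while guests are running makes the check explicit instead of leaving the maintenance mode request to wait or fail on its own.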
Take the host out of maintenance mode
By this point the VMs should have been released and be available to restart on other hosts by powering them on manually. In my situation, for reasons I can't detail, I had to restart the VMs on the same host. It wasn't pretty, but it is doable. To do the same, start by taking the server out of maintenance mode. Use this command:
esxcli system maintenanceMode set --enable false
After that, manually power up the guests. All the VM files are located in their own folders under /vmfs/volumes on the host. Use the cd command to navigate; tab auto-completion can be very helpful. Each folder under /vmfs/volumes represents a datastore, so look in the appropriate datastore folder to find the VM.
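The article does not say which command was used to power the guests back on. One common option on the ESXi shell is vim-cmd: vim-cmd vmsvc/getallvms lists the registered VMs with their Vmid, and vim-cmd vmsvc/power.on takes that Vmid. The sketch below assumes VM names without spaces and single-line annotations, since it reads the name from the second column of the listing; the function names are hypothetical.

```shell
#!/bin/sh
# Sketch only: function names are illustrative, and the column parsing
# assumes VM names contain no spaces.

# Print the Vmid for a VM name, given `vim-cmd vmsvc/getallvms` output on stdin.
vmid_for_name() {
    awk -v name="$1" 'NR > 1 && $2 == name { print $1 }'
}

# Look up a VM by name and power it on.
power_on_by_name() {
    vmid=$(vim-cmd vmsvc/getallvms | vmid_for_name "$1")
    if [ -z "$vmid" ]; then
        echo "no registered VM named $1" >&2
        return 1
    fi
    vim-cmd vmsvc/power.on "$vmid"
}
```

Once a guest has been started, vim-cmd vmsvc/power.getstate with the same Vmid reports whether it is powered on.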
One snag was that certain VMs refused to boot and eventually timed out with a very generic error. In this case, check the errors in the vmware.log file that belongs to the problematic VM; it is located in that VM's folder.
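A quick way to narrow down a long vmware.log is to filter it for likely failure keywords and look at the most recent matches. This is a rough sketch: the keyword list is an assumption rather than an official set of error markers, and the function name is hypothetical.

```shell
#!/bin/sh
# Sketch only: the keyword list is a guess at useful search terms.

# Show the last N (default 20) lines of a log that mention a likely failure.
recent_log_errors() {
    log_file="$1"
    max_lines="${2:-20}"
    grep -iE 'error|fail|panic' "$log_file" | tail -n "$max_lines"
}
```

Run it from inside the VM's folder, for example recent_log_errors vmware.log 30, and read the matches in context in the full log before acting on them.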
Covering all the errors and complexities an administrator may encounter is beyond the scope of this article, but the log should provide a good place to start troubleshooting.
This type of issue shouldn't happen too often. If the VMs are correctly set up on separate network port groups and uplinks and are responding to remote clients, it is more than likely a management network issue that can be fixed in a controlled manner in off-hours to reduce impact. With any luck, the cause is a simple network connectivity issue that can be resolved without downtime.