In the course of your duties as a vSphere administrator, you've probably had a discussion about a zombie VM that is lingering in your environment.
"What is this VM for?" you may have asked.
"I think John created it for some project before he left," comes the reply.
"Is anyone using it? What's it for?"
What we have established is that a VM was needed at some point, but its purpose has been lost in the shifting sands of the IT department. Now the virtual machine sits in limbo. There is a shroud of fear surrounding the VM: it might be important. If you have only one or two zombie VMs, they probably aren't a big issue. But a swarm of them consuming a lot of server and SAN resources can be expensive dead weight.
Since VMs are easy to create and appear to have no incremental cost, we create them freely. Many environments have excess capacity in vSphere clusters, so VMs are a great place to test ideas. Virtualization also allows us to respond rapidly to business needs, creating VMs quickly for whatever comes our way. VMs that have served their purpose and are no longer required should be removed.
Narrow the list of suspects
If your vSphere cluster is overloaded with VMs, you may need to kill off the useless zombies to free up capacity for useful virtual machines. The hard part is telling the useful VMs apart from the useless ones they are mixed in with. In an ideal world, there would be a record of each VM's creation, purpose and owner. In many places this record simply does not exist, leading to lots of zombies.
So how do you identify the zombies in your VM fleet? Start with a list of every VM, because zombies could be hiding anywhere. Then eliminate the useful VMs. Keep in mind that zombie VMs never have support tickets raised against them by business units or users, so a VM that has had a ticket in the last six months isn't a zombie. Zombies also typically have very low resource usage: CPU, network and disk utilization should all be low. None of these will be zero all the time, but high resource usage is usually a sign of work being done. These two checks should eliminate most of the living VMs from your search. Then you will need to look at each VM remaining on your list.
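The two filters above can be sketched in a few lines of Python. This is only an illustration: the VmRecord fields and the utilization thresholds are assumptions made up for the example, and you would fill the records from whatever ticketing and performance data you actually have.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class VmRecord:
    """Hypothetical per-VM summary; field names are assumptions for this sketch."""
    name: str
    last_ticket: Optional[date]  # most recent support ticket, None if there never was one
    avg_cpu_pct: float           # average CPU utilization over the review period
    avg_net_kbps: float          # average network throughput
    avg_disk_kbps: float         # average disk throughput


# Six months comes from the rule of thumb above; the usage limits are
# assumed values and should be tuned for your own environment.
TICKET_WINDOW = timedelta(days=183)
CPU_LIMIT_PCT = 5.0
NET_LIMIT_KBPS = 50.0
DISK_LIMIT_KBPS = 50.0


def zombie_suspects(vms, today):
    """Return VMs with no recent ticket and low CPU, network and disk usage."""
    suspects = []
    for vm in vms:
        has_recent_ticket = (
            vm.last_ticket is not None
            and today - vm.last_ticket <= TICKET_WINDOW
        )
        is_quiet = (
            vm.avg_cpu_pct < CPU_LIMIT_PCT
            and vm.avg_net_kbps < NET_LIMIT_KBPS
            and vm.avg_disk_kbps < DISK_LIMIT_KBPS
        )
        if not has_recent_ticket and is_quiet:
            suspects.append(vm)
    return suspects
```

Anything the function returns is only a suspect, not a confirmed zombie. Each one still needs the manual follow-up described in the rest of this article.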
Find the creator
The next step is to identify who created the VM. On the VM's Tasks and Events tab, sort by date with the oldest first. If the VM creation event is still in the vCenter database, you will see a Create Virtual Machine event at the start of the list. This event will have a user ID in the Initiated By field. This person created the VM and should know the reason behind it.
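If you have exported a VM's events rather than browsing them in the client, the same lookup can be expressed as a small Python sketch. The dictionary keys and the list of creation keywords below are assumptions for illustration; the exact event wording depends on your vCenter version and on whether the VM was created, cloned or deployed from a template.

```python
from datetime import datetime

# Assumed phrases that mark a creation event; adjust to match what your
# vCenter actually shows.
CREATION_KEYWORDS = ("create virtual machine", "clone virtual machine", "deploy")


def find_creator(events):
    """Return the user behind the oldest event if it is a creation event.

    `events` is a list of dicts with the hypothetical keys 'time' (datetime),
    'description' (str) and 'initiated_by' (str). Returns None when the list
    is empty or the oldest surviving event is not a creation event, which
    suggests the original record has aged out of the database.
    """
    if not events:
        return None
    oldest = min(events, key=lambda event: event["time"])
    description = oldest["description"].lower()
    if any(keyword in description for keyword in CREATION_KEYWORDS):
        return oldest["initiated_by"]
    return None


if __name__ == "__main__":
    sample = [
        {"time": datetime(2021, 3, 2, 9, 0),
         "description": "Reconfigure virtual machine",
         "initiated_by": "DOMAIN\\alice"},
        {"time": datetime(2020, 1, 15, 14, 30),
         "description": "Create virtual machine",
         "initiated_by": "DOMAIN\\john"},
    ]
    print(find_creator(sample))
```

Returning None rather than the user on the oldest event is deliberate: once the creation event has been purged, the earliest remaining user may simply be someone who edited the VM later.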
If the VM creator isn't available or doesn't know anything about the VM, then you have two paths. The nice path is to send a list of zombie VMs to every IT staff member and every business unit to ask who owns what. This may get a few more VMs identified. Often there will be a few VMs that are still unaccounted for, maybe a lot of VMs. Then you head down the second path.
Taking more drastic action
This path I like to call the scream test. Shut the VM down and see who screams. It takes a brave or desperate manager to authorize scream tests but often it is the only option to separate the useful VMs from the zombies.
Before you shut the VM down, you should capture a little information from it. The basics like IP address and computer name will help identify which VM is the cause of the scream. Once the VM is shut down, it should be left in place. The CPU and RAM the zombie was using will be released immediately, but its disk footprint will need to remain for a while longer.
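One low-tech way to keep that information is a simple log with one line per VM shut down. The sketch below is an assumption about how you might structure it, not a vSphere feature: the field names, the JSON-lines layout and the sample values are all hypothetical.

```python
import json


def make_record(vm_name, ip_address, computer_name, shutdown_date):
    """Bundle the basics worth capturing before a VM is shut down."""
    return {
        "vm_name": vm_name,
        "ip_address": ip_address,
        "computer_name": computer_name,
        "shutdown_date": shutdown_date,  # ISO date string, e.g. "2024-03-01"
    }


def to_json_line(record):
    """Serialize one record as a single line, ready to append to a log file."""
    return json.dumps(record, sort_keys=True)


def find_screamer(json_lines, reported):
    """Match what a user reports (an IP address or computer name) to a VM.

    The comparison ignores case and surrounding whitespace, since screams
    rarely arrive neatly formatted.
    """
    wanted = reported.strip().lower()
    matches = []
    for line in json_lines:
        record = json.loads(line)
        known = (record["ip_address"].lower(), record["computer_name"].lower())
        if wanted in known:
            matches.append(record)
    return matches


if __name__ == "__main__":
    log = [
        to_json_line(make_record("vm-0042", "10.0.5.17", "FINRPT01", "2024-03-01")),
        to_json_line(make_record("vm-0043", "10.0.5.18", "TESTBOX", "2024-03-01")),
    ]
    for hit in find_screamer(log, "finrpt01"):
        print(hit["vm_name"], "was shut down on", hit["shutdown_date"])
```

When someone calls to say that a server has stopped answering, a lookup like this turns the name or address they give you into the VM that needs to be powered back on.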
Now wait to see who screams. Some screams will happen the moment the VM is shut down. Some screams will come a week later, or at the end of the month. Usually, if the shutdown VM makes it to the end of six weeks without anyone making a fuss, you are ready to take the next step.
Making the move
After six weeks of no complaints, there is a high probability that you have found a zombie VM. Migrate the VM from the fast storage where it normally resides to somewhere cheaper. This will free up the high performance storage. Sometimes VMs will only be needed every quarter or just once per year, so the scream might be quite a while after the VM was powered off.
If this VM is needed a few months after the scream test, you had better be able to recover it from the cheaper media. Once several more months have passed, you can archive the VM to your cheapest storage, maybe tape or some other offline media.
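The whole timeline can be summarized as a small decision function. The six-week quiet period comes from the process above. The archive threshold is an assumption, since "several more months" is a judgment call; the sketch uses roughly six months on cheaper storage, and the action strings are just labels for the example.

```python
from datetime import date, timedelta

QUIET_PERIOD = timedelta(weeks=6)                  # from the scream test above
ARCHIVE_AFTER = QUIET_PERIOD + timedelta(days=180)  # assumed; pick your own value


def next_step(shutdown_date, today, screamed=False):
    """Suggest what to do with a VM that was shut down for a scream test."""
    if screamed:
        # Someone needs the VM: bring it back and write down who owns it.
        return "restore the VM and record its owner"
    elapsed = today - shutdown_date
    if elapsed < QUIET_PERIOD:
        return "wait"
    if elapsed < ARCHIVE_AFTER:
        return "migrate to cheaper storage"
    return "archive to offline media"


if __name__ == "__main__":
    shut_down = date(2024, 1, 1)
    for check in (date(2024, 1, 15), date(2024, 3, 1), date(2024, 12, 1)):
        print(check, "->", next_step(shut_down, check))
```

If your environment has VMs that only run for year-end processing, it would be reasonable to stretch the archive threshold past a full year so the annual scream has a chance to happen first.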