Virtual machine sprawl has always been an issue in medium and large virtual compute environments. Sprawl is generally caused by a lack of control combined with process and documentation failures, issues that exist in both public and private cloud environments. It typically occurs when a user creates a VM and then changes jobs, forgetting the machine. Over time, the exact details of the VM are lost and tidying it up becomes a problem.
In an on-premises environment, VM sprawl wastes resources that could be put to use elsewhere. This becomes an even bigger issue in a paid-for cloud environment, where wasted resources translate directly into unnecessary charges on the monthly cloud bill. Even a well-managed environment can have a significant number of unused machines, costing you money. The best way to limit that financial risk is to create a solid VM sprawl management strategy.
Creating a plan for virtual machine sprawl detection
Now that we know what sprawl is, how do we eliminate it? The first step is detection. On a small estate, you can potentially identify sprawl machines by asking a few questions: How are VMs being utilized over time? Who has logged in, and how often? What service or request is the machine in question servicing? Again, this method is best applied to small estates. In an environment with over 15,000 VMs, assessing the situation manually would not be cost-effective and would carry a significant margin of error.
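The three questions above can be turned into a simple filter over an inventory export. The following is only a minimal sketch: the `VmRecord` fields, the 5% CPU threshold and the 90-day login window are assumptions chosen for illustration, not values taken from any particular tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class VmRecord:
    # Hypothetical inventory record; field names are illustrative only.
    name: str
    avg_cpu_percent: float          # average CPU utilization over the review window
    last_login: Optional[datetime]  # None if no login was ever recorded
    service: Optional[str]          # service or request the VM supports, if known


def sprawl_candidates(vms, now, max_cpu=5.0, max_idle_days=90):
    """Return names of VMs that look unused on all three questions."""
    cutoff = now - timedelta(days=max_idle_days)
    flagged = []
    for vm in vms:
        idle = vm.avg_cpu_percent < max_cpu
        no_recent_login = vm.last_login is None or vm.last_login < cutoff
        no_known_service = not vm.service
        if idle and no_recent_login and no_known_service:
            flagged.append(vm.name)
    return flagged


now = datetime(2024, 1, 1)
inventory = [
    VmRecord("web-01", 42.0, datetime(2023, 12, 30), "storefront"),
    VmRecord("test-old", 0.8, datetime(2023, 3, 1), None),
    VmRecord("build-07", 1.2, None, ""),
]
print(sprawl_candidates(inventory, now))  # ['test-old', 'build-07']
```

A VM flagged this way is only a candidate for review; someone still has to confirm with the owner before anything is powered off.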
One Embotics customer took a very controversial approach to dealing with sprawl, choosing to power down his entire development-only estate on a predetermined Friday evening. A VM was powered back on only after someone complained that they were still using it.
The findings that resulted from this method were interesting: 65% of VMs initially lay unclaimed after the power-down. After three months, that number had fallen to 33%. One way to look at it is that 33% of the powered-on estate had served no real purpose other than burning licensing, power and disk. Backing those unused VMs were expensive high-end servers and 24/7 hardware support contracts.
Preventing VM sprawl up front
It may sound glib, but the best way to beat sprawl is to prevent it from happening in the first place. You can accomplish this by creating policies for VM creation and usage. The best place to start is to ensure that every machine built is approved by a senior administrator. Next, you should tag your VMs with pertinent information, such as the owner and purpose; most hypervisor platforms allow VM tagging.
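An approval-and-tagging policy is easier to enforce when the provisioning workflow checks it automatically. As a rough sketch, assuming a hypothetical set of required tag keys (`owner`, `approved_by`, `purpose`, `review_date`) that your own policy would replace:

```python
# Hypothetical policy: the tag keys every new VM request must carry.
REQUIRED_TAGS = ("owner", "approved_by", "purpose", "review_date")


def missing_tags(tags):
    """Return the required tag keys that are absent or blank."""
    return [key for key in REQUIRED_TAGS if not str(tags.get(key, "")).strip()]


def can_provision(tags):
    """A request passes only when every required tag is filled in."""
    return not missing_tags(tags)


request = {"owner": "jsmith", "purpose": "load test", "approved_by": ""}
print(missing_tags(request))   # ['approved_by', 'review_date']
print(can_provision(request))  # False
```

The same check can be run periodically against existing VMs to find machines that slipped through without an owner or approver.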
It's extremely important that everyone involved in the creation and management of VMs understand the cost. There are two ways to do this: showback and chargeback. Showback is usually used to make people aware of the costs involved in provisioning and running a VM. This is useful for environments where there is no formal charging structure. It shows just how expensive running a few VMs can be. Chargeback shows people what they're being charged for the resources they use via a monthly bill, which must be allocated to someone's balance sheet.
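To make the showback idea concrete, the sketch below totals an estimated monthly cost per owner. The rates and the record fields are invented for illustration; a real report would use your own cost model or your cloud provider's billing data.

```python
from collections import defaultdict

# Assumed, illustrative rates; substitute figures from your own cost model.
RATE_PER_VCPU_HOUR = 0.02
RATE_PER_GB_RAM_HOUR = 0.005
RATE_PER_GB_DISK_MONTH = 0.10


def monthly_cost(vcpus, ram_gb, disk_gb, hours_on):
    """Estimate one VM's monthly cost from its size and powered-on hours."""
    compute = hours_on * (vcpus * RATE_PER_VCPU_HOUR + ram_gb * RATE_PER_GB_RAM_HOUR)
    storage = disk_gb * RATE_PER_GB_DISK_MONTH  # disk is billed even when powered off
    return round(compute + storage, 2)


def showback_report(vms):
    """Sum estimated monthly cost per owner."""
    totals = defaultdict(float)
    for vm in vms:
        totals[vm["owner"]] += monthly_cost(
            vm["vcpus"], vm["ram_gb"], vm["disk_gb"], vm["hours_on"]
        )
    return {owner: round(total, 2) for owner, total in totals.items()}


estate = [
    {"owner": "dev-team", "vcpus": 4, "ram_gb": 16, "disk_gb": 200, "hours_on": 720},
    {"owner": "dev-team", "vcpus": 2, "ram_gb": 4, "disk_gb": 50, "hours_on": 720},
    {"owner": "qa-team", "vcpus": 2, "ram_gb": 8, "disk_gb": 100, "hours_on": 0},
]
for owner, cost in showback_report(estate).items():
    print(f"{owner}: ${cost:.2f} per month")
```

For showback, a report like this is simply circulated to the owners. For chargeback, the same totals would feed an internal bill.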
VM sprawl management in legacy environments
While these methods are useful for new environments, legacy environments are in a different position. Fortunately, several vendors have provided VM sprawl management tools out of the box to alleviate this problem. Some are admittedly easier to use and less expensive than others, but they perform as promised.
One issue that's caused quite a bit of grief is how an administrator can manage IT estates that were set up on an ad hoc basis without IT support. A typical example is an AWS estate created to get around a local IT department perceived as slow and expensive. That looks attractive until the reality of managing those servers hits. Setting up a cloud on a company credit card is not the way forward, despite what salespeople may tell you.
At this point, you need to look at some of the more advanced VM sprawl management tools. Workload analysis shows admins where they can optimize, identifying both sprawl servers and the efficiency savings available from right-sizing. Workload analysis tools can identify thousands of dollars' worth of hardware resources to reclaim within just a few hours of investigation. You can also use tools such as VMware vRealize Ops (vROps) and Embotics vCommander to manage sprawl. Cost aside, both vROps and vCommander provide extensive reports on sprawl and on compute wasted in oversized VMs. Rarely do software vendors' sizing recommendations ring true with the resources actually used.
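The arithmetic behind a right-sizing recommendation can be illustrated in a few lines. This is a simplified sketch, not how vROps or vCommander calculate their reports: it assumes you already have a peak CPU percentage per VM and that 25% headroom above that peak is acceptable.

```python
import math


def rightsize_vcpus(allocated_vcpus, peak_cpu_percent, headroom=0.25):
    """Suggest a vCPU count that covers the observed peak plus headroom.

    Never suggests more than is already allocated, and never fewer than one.
    """
    used = allocated_vcpus * peak_cpu_percent / 100.0
    suggested = math.ceil(used * (1 + headroom))
    return max(1, min(allocated_vcpus, suggested))


print(rightsize_vcpus(8, 15))  # 2: an 8-vCPU VM peaking at 15% is oversized
print(rightsize_vcpus(4, 90))  # 4: already fully used, so no reduction
print(rightsize_vcpus(2, 3))   # 1: nearly idle, and a possible sprawl candidate
```

The gap between allocated and suggested vCPUs, summed across the estate, gives a rough figure for reclaimable capacity.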
Should these methods for preventing and eradicating VM sprawl prove unsuccessful, there's always the guerrilla approach of distributing lists of the top 10 VM resource consumers. While this has worked for some smaller IT shops, I still strongly recommend looking at virtual machine sprawl management products to see what they can do for you. The longer you allow sprawl to go unchecked, the more money you burn on unused compute resources.
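A top consumers list of this kind is easy to produce from any usage export. The sketch below assumes a hypothetical list of records with a `cpu_ghz` field; swap in whichever metric your monitoring data actually provides.

```python
import heapq


def top_consumers(vms, metric="cpu_ghz", n=10):
    """Return the n VMs with the highest value for the chosen metric."""
    return heapq.nlargest(n, vms, key=lambda vm: vm[metric])


# Hypothetical usage export; names and figures are made up.
usage = [
    {"name": "db-02", "owner": "finance", "cpu_ghz": 9.5},
    {"name": "dev-14", "owner": "apps", "cpu_ghz": 1.1},
    {"name": "etl-01", "owner": "data", "cpu_ghz": 6.3},
]
for rank, vm in enumerate(top_consumers(usage, n=2), start=1):
    print(f"{rank}. {vm['name']} ({vm['owner']}): {vm['cpu_ghz']} GHz")
```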