It was a fairly normal work day when a friend of mine called with a serious problem. This friend runs a successful small business that receives large design files electronically from all over the world, then produces a physical product that is shipped back out. He mentioned he was having a bad day with his vSphere environment: all the virtual machines at the manufacturing site had gone offline. Since his business relies on the rapid flow of designs into and through the manufacturing process, this posed a major problem.
Like many small business owners, my friend looks after some of his own IT and relies on a service provider for the more specialized technical aspects. His environment had been in place for a while, and when I'd previously looked over the vSphere side, it appeared to be in good health. The environment in question has a couple of ESXi servers and a small iSCSI storage array. All of the VMs on the iSCSI array had been thin provisioned and a couple had large raw devices attached to handle the large design files. In this instance, VM thin provisioning was the source of the problem that stopped the flow of designs.
When to use VM thin provisioning
Thin provisioning comes with some risks. If you have enough disk capacity to store all your VMs fully provisioned, there is no need to use thin provisioning. However, if you do use VM thin provisioning, you have probably overcommitted disk space -- that is, you have created VMs whose configured disk space adds up to more than the physical space you have. Without thin provisioning, you would not be able to store these VMs, so your storage capacity is overextended. If you are overextended on any resource, there is a risk that the resource will be exhausted. When CPU, network or RAM capacity is exhausted, your VMs slow down. When disk capacity is exhausted, your thin provisioned VMs come to a halt; this is exactly what happened to my friend's servers.
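To make the idea of overextension concrete, here is a minimal Python sketch that compares the sum of configured disk sizes with the physical capacity behind them. The function name and the disk sizes are hypothetical and chosen only for illustration; they are not taken from my friend's environment.

```python
def overcommit_ratio(provisioned_gb, physical_gb):
    """Return total configured disk space divided by physical capacity.

    A result above 1.0 means more space has been promised to VMs
    than physically exists.
    """
    if physical_gb <= 0:
        raise ValueError("physical_gb must be positive")
    return sum(provisioned_gb) / physical_gb


# Hypothetical example: four thin provisioned VM disks on 1,000 GB of storage.
vm_disks_gb = [500, 500, 300, 200]
ratio = overcommit_ratio(vm_disks_gb, 1000)
print(f"Overcommit ratio: {ratio:.2f}")  # prints "Overcommit ratio: 1.50"
```

A ratio of 1.0 or less means every VM could grow to its full configured size without running out of space; anything higher is a risk that has to be managed.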
In this case, the fault was not within the VMs, as every drive letter in every VM still reported free space. Nor was the problem with the vSphere data stores; each data store reported free space for the thin provisioned VM disk files to grow. The problem was one layer further down: the storage array was also thin provisioned and overextended, and its underlying hard disks were full. The sum of the sizes of the disks offered to the ESXi servers was greater than the array could hold, meaning the array had nowhere to store additional data. When a thin provisioned VM needed to write new data, its disk file had to grow into data store space that the array could no longer back with physical capacity. The array could not accept the write, so the VM was stopped to protect the integrity of the data that had already been written. Since every VM was thin provisioned, it didn't take long before each VM was stopped and, with them, the flow of work in the factory.
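The layering described here can be sketched in a few lines of Python. The `Layer` class, the layer names and every number below are assumptions made up for illustration; in practice the figures would come from the guest operating systems, vSphere and the array's own management tools.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    name: str
    capacity_gb: float  # space this layer claims to offer
    used_gb: float      # space actually consumed at this layer

    @property
    def free_fraction(self) -> float:
        return (self.capacity_gb - self.used_gb) / self.capacity_gb


def tightest_layer(layers):
    """Return the layer with the smallest fraction of free space."""
    return min(layers, key=lambda layer: layer.free_fraction)


# Hypothetical figures: each upper layer promises more than the one below holds.
stack = [
    Layer("guest file systems", 4000, 2000),
    Layer("vSphere data stores", 3000, 2000),
    Layer("array physical disks", 2000, 2000),
]

worst = tightest_layer(stack)
print(f"{worst.name}: {worst.free_fraction:.0%} free")
```

With these made-up numbers the guests and the data stores both still show free space, yet the physical disks have none, which is the situation that stopped the VMs.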
Since we are viewing this from a distance, we can focus for now on how this happened. Fundamentally, the overextended array capacity was not being monitored adequately. If he had properly monitored his remaining capacity, there would have been warnings and, hopefully, time to prevent the problem. This comes back to managing the risk that is inherent in thin provisioning. My friend's VMs and data stores had monitoring and notification, but his array did not have adequate monitoring and alerting configured. If you are unable to monitor your usage, you should not overcommit resources; it really is that simple. For disk capacity, it is much easier to prevent a problem than to resolve one. Once the VMs have been stopped due to lack of space, it is much harder to return everything to service.
Monitoring array capacity
Returning to the moment of the failure, my friend naturally got his service provider's experts involved. They confirmed that the array was out of capacity and that there was no additional capacity in the array that could be assigned. The only way to get more disk capacity was to buy hardware. This proved more difficult than simply writing a check. The array required that new disks be the same size and speed as the existing disks, which dated from when the array was purchased three years earlier. To further complicate things, the existing disk shelves had no free disk slots.
To get more space, the array needed a new disk shelf and new hard disks, and the disks had to match the old model. With one or two months of advance notice, this would have been manageable, but with all servers down there was no possibility of waiting for new hardware. To restore service, his only option was to delete the stopped VMs and restore from backup.
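One way to get that kind of notice is to compare the remaining runway with the time it takes to obtain hardware. The following is a rough sketch that assumes linear growth; the function name, the 60-day lead time and the capacity figures are all hypothetical.

```python
def days_until_full(capacity_gb, used_gb, growth_gb_per_day):
    """Estimate days until capacity is exhausted, assuming linear growth.

    Returns None when usage is flat or shrinking.
    """
    if growth_gb_per_day <= 0:
        return None
    return max(capacity_gb - used_gb, 0) / growth_gb_per_day


# Assumption: matching disks and a new shelf take about two months to arrive.
LEAD_TIME_DAYS = 60

runway = days_until_full(capacity_gb=2000, used_gb=1700, growth_gb_per_day=6)
if runway is not None and runway < LEAD_TIME_DAYS:
    print(f"Order hardware now: about {runway:.0f} days of array capacity left")
```

Real growth is rarely linear, so an estimate like this is only a prompt to look more closely, not a guarantee of how much time remains.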
It is crucial to learn from the mistakes of others. If you have thin provisioning anywhere in your storage environment, you must monitor your available array capacity closely and act before space is exhausted. The more layers you thin provision, the greater the risk of exhausting free space and losing service. It's essential to have a plan for resolving any out-of-space condition, as one will inevitably occur sooner or later.