Most discussions of automation in a data center environment revolve around workloads. Creating, manipulating and destroying workloads is the focus of a seemingly endless number of startups, books, seminars and conferences. Despite this focus on workloads, physical infrastructure automation matters, and its integration with the better-marketed layers of the infrastructure stack is important.
The physical infrastructure of a data center contains a lot of components, many of which have not traditionally been network-attached. Servers, switches and Uninterruptible Power Supplies (UPSes) have been network-attached for some time; however, HVAC systems, sensors, Power Distribution Units (PDUs) and physical security devices are only just now making the transition in mainstream facilities.
This is reflected in the applications we use to manage our infrastructure. VMware's vSphere management software, for example, has a limited capability to manage servers via built-in lights-out management. With some effort it is possible to integrate UPSes into VMware's management plane, though this isn't exactly straightforward.
Missing from VMware's management tools are capabilities to manage other infrastructure hardware. This may prove to be a costly oversight for VMware as the growth in use of these devices within the data center environment is a direct result of their importance.
Coping with failure
The reason that sensors, HVAC systems, PDUs and security systems are increasingly network-attached is that automating these devices has provided real value to those who have done it. Large-scale data center operators, such as Google, pioneered the use cases and much of the terminology. Their research efforts have trickled down to much smaller deployments.
Perhaps the most famous example of extended infrastructure automation is the ability to cope with thermal excursions. The short version of this is that if temperatures in one area of a data center get too high, workloads will automatically be moved away from the hot spot and the relevant systems shut down. While that seems pretty basic, understand that with enough sensors and a little bit of computing power it is possible to model the conditions of the data center environment and determine beforehand whether it is a good idea to place a given class of workload in a given physical location in a data center.
This is where analytics can feed automation and orchestration in order to prevent problems before they even happen. The same techniques are often applied to power supplies.
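The placement decision described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Rack fields, the linear "degrees per kW" heat model and the 27 C limit are hypothetical stand-ins for whatever model a real facility would build from its own sensor data.

```python
from dataclasses import dataclass


@dataclass
class Rack:
    name: str
    inlet_temp_c: float    # latest inlet temperature reading (assumed sensor feed)
    heat_per_kw_c: float   # assumed temperature rise per kW of added load


def projected_temp(rack: Rack, load_kw: float) -> float:
    """Estimate inlet temperature after placing load_kw on the rack (toy linear model)."""
    return rack.inlet_temp_c + rack.heat_per_kw_c * load_kw


def choose_rack(racks, load_kw, limit_c=27.0):
    """Return the rack with the lowest projected temperature that stays
    within limit_c, or None if no rack can take the load safely."""
    candidates = [r for r in racks if projected_temp(r, load_kw) <= limit_c]
    if not candidates:
        return None
    return min(candidates, key=lambda r: projected_temp(r, load_kw))


racks = [Rack("A", 25.0, 0.5), Rack("B", 22.0, 0.5), Rack("C", 26.5, 0.5)]
best = choose_rack(racks, load_kw=4.0)
print(best.name if best else "no safe placement")
```

The point of the sketch is that the check happens before the workload is placed, so the scheduler can refuse a location rather than react to a hot spot afterwards.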
Most people don't think about power all that often, but the power feeding into data centers isn't perfect. It has spikes and troughs; voltages vary and lines are periodically cut. Generators don't always work when they need to and UPSes can and do fail.
With the right sensors it is possible not only to react to power events, but also to detect instabilities in the power supply. That makes it possible to know when workloads should be moved elsewhere, and when things are probably fine as they stand but should not be put under any more stress than they already are.
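As a rough illustration of that three-way decision, the following Python sketch classifies a window of recent voltage readings. The function name, the 230 V nominal value and both percentage thresholds are assumptions chosen for the example, not figures from any vendor or standard.

```python
def power_action(voltages, nominal=230.0, hold_pct=2.0, migrate_pct=5.0):
    """Classify recent voltage samples into one of three actions.

    "normal"  -> supply looks stable, place workloads as usual
    "hold"    -> leave existing workloads alone, but add no new load
    "migrate" -> deviation is large enough to move workloads away
    Thresholds are illustrative assumptions.
    """
    if not voltages:
        raise ValueError("need at least one voltage sample")
    # Worst deviation from nominal in the window, as a percentage.
    worst = max(abs(v - nominal) for v in voltages) / nominal * 100
    if worst >= migrate_pct:
        return "migrate"
    if worst >= hold_pct:
        return "hold"
    return "normal"


print(power_action([230.0, 231.0, 229.0]))
print(power_action([230.0, 236.0, 228.0]))
print(power_action([230.0, 215.0, 229.0]))
```

The middle state is the interesting one: it captures the case where nothing needs to move, but the orchestrator should stop adding load to the affected feed.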
Ahead of the storm
The data center operator's worst nightmare is a cascade failure. A failure on one side of the data center environment causes workloads to be restarted on another side. This overloads the available resources, which causes another failure. Now even more workloads end up being redistributed and failures cascade until the entire data center fails. Simple reactive systems won't catch this. Automation and orchestration need to be aware of the possibility of failure cascades and how to deal with them.
In the case of workload resource constraints, this can be as simple as not lighting up new workloads. In the case of complex electrical or environmental issues, such as heat, humidity and so on, this can mean not only declining to return failed workloads to service, but also shutting down systems ahead of the cascade in order to give it a stopping place while allowing some workloads to stay online.
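One way to picture cascade-aware failover is a headroom check: surviving zones accept displaced load only up to a utilization ceiling, and whatever does not fit stays offline instead of overloading the next zone. The Python sketch below assumes a hypothetical plan_failover helper, load measured in kW and an 80 percent ceiling; all three are illustrative choices.

```python
def plan_failover(failed_load_kw, zones, max_utilization=0.8):
    """Spread load from a failed zone across surviving zones.

    zones maps a zone name to (current_load_kw, capacity_kw).
    No zone is pushed past max_utilization of its capacity.
    Returns (plan, unplaced_kw), where plan maps zone name to the
    load it accepts and unplaced_kw is what must stay offline.
    """
    plan = {}
    remaining = failed_load_kw
    for name, (load, capacity) in zones.items():
        spare = max(0.0, capacity * max_utilization - load)
        take = min(spare, remaining)
        if take > 0:
            plan[name] = take
            remaining -= take
    return plan, remaining


zones = {"east": (60.0, 100.0), "west": (75.0, 100.0)}
plan, unplaced = plan_failover(30.0, zones)
print(plan, unplaced)
```

A purely reactive system would restart all 30 kW somewhere; here the leftover load is deliberately left down, which is what gives the cascade a stopping place.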
Workload priority management during disaster scenarios is important, too. If the resources can't be maintained to supply all workloads, then decisions have to be made about how to proceed. Often these must be codified by policy beforehand and acted upon faster than a team of human operations specialists could manage.
In an ideal situation, workloads could be marked according to priority: critical infrastructure (such as domain controllers, firewalls, file servers or databases), bare-minimum operational workloads and so on. If workloads can be prioritized in this manner, then alerts can be sent out to application owners or customers ahead of shutdown events.
Essentially, prioritization of workloads allows automated data centers to sacrifice the least important workloads first when responding to physical infrastructure problems.
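A minimal Python sketch of priority-based shedding follows. The Workload fields, the convention that a lower number means higher priority and the use of kW as the constrained resource are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    priority: int      # assumed convention: 1 is most important
    demand_kw: float   # assumed resource demand


def shed(workloads, available_kw):
    """Keep workloads in priority order while they fit in available_kw.

    Returns (keep, drop). Anything that does not fit is dropped, and the
    drop list is what owners would be alerted about before shutdown.
    """
    keep, drop = [], []
    used = 0.0
    for w in sorted(workloads, key=lambda w: w.priority):
        if used + w.demand_kw <= available_kw:
            keep.append(w)
            used += w.demand_kw
        else:
            drop.append(w)
    return keep, drop


workloads = [
    Workload("batch-reports", 3, 4.0),
    Workload("domain-controller", 1, 2.0),
    Workload("web-frontend", 2, 5.0),
]
keep, drop = shed(workloads, available_kw=8.0)
print([w.name for w in keep], [w.name for w in drop])
```

This version is greedy: a small low-priority workload may still be kept if it fits after a larger one was dropped. A stricter policy could stop at the first workload that does not fit; which behavior is right is a policy decision to settle beforehand.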
Yes, it actually does matter
For most companies, the idea of environmentally reactive physical infrastructure was an interesting academic discussion in the 1990s. It became a strategic enabler for large-scale data center operators and the largest of the large enterprises in the 2000s. Today it is increasingly important even for small businesses. The simple reason for the importance of data center automation is that organizations of all sizes are utterly dependent on IT in order to do business.
Many pundits and analysts will tell business owners to simply use public cloud computing. If you know what you're doing, public cloud computing can indeed be very resilient and the physical data center operations are no longer the concern of the tenants. This advice doesn't help VMware, nor does it help those organizations that need to keep workloads on premises, either for data sovereignty reasons or because integration of those workloads with on-site equipment is important to the business.
Evolution is required. It is no longer enough for VMware to simply provide a virtual infrastructure upon which workloads can be bottled and shuffled to and fro. VMware needs to integrate tightly with physical infrastructure and make dealing with the realities of the unexpected in data center environments easier if it wants to continue to attract companies now and in the future.