In almost every virtualization planning session, one of the expected payoffs is an improvement in disaster recovery (DR). Whether the secondary site is a remote site owned by the organization or a colocation facility, building a DR plan in a bare-metal (nonvirtualized) data center demands not only a significant investment in planning before execution, but also continuous attention to ensure that the secondary site stays up to date with the primary site. A single change to a server, network or storage array at the primary site could mean the DR site takes weeks to catch up with the primary data center, and in some cases never does.
Server virtualization at its very core makes significant improvements to DR readiness. It also brings DR to the masses. First, by abstracting the OS from the hardware layer, virtualization eliminates the pressure to have identical hardware at the DR site. This becomes an invaluable capability as the initial setup of the DR site ages: hardware no longer needs to be bought in pairs, and older servers can be redeployed to the DR site instead of buying new ones for both sites. Second, virtualization eliminates the need for a one-to-one relationship between servers and hardware. This brings two advantages: you no longer have to buy double the servers, and, less obviously, you only have to replicate the absolutely critical servers at the DR site.
The hardware reduction at the DR site comes not just from server consolidation at the primary site but also from further optimization at the DR site. Multiple virtual hosts are used at the primary site not only to spread out compute resource consumption but also to increase availability of the environment, so that multiple hosts can accept guest OSes migrated to them. At the DR site, at least initially, there is no need to account for additional VMotion guests. Lower application performance is also tolerated at the DR site during a disaster, so in many cases you can temporarily put extra workloads on those DR-site virtualization hosts. You are saved by the old rule: slow access is better than no access.
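The savings from tolerating degraded performance can be put in rough numbers. The sketch below is a back-of-the-envelope sizing calculation with illustrative figures and a made-up function name, not output from any vendor tool: because the DR site can pack VMs more densely during a disaster, it needs fewer hosts than the primary site.

```python
import math

def dr_hosts_needed(protected_vms, vms_per_host_normal, overcommit_factor):
    """Hosts required at the DR site, rounding partially filled hosts up.

    protected_vms: number of VMs that must run at the DR site
    vms_per_host_normal: VM density per host at the primary site
    overcommit_factor: how much denser hosts may be packed during a
                       disaster (2.0 means twice the normal density)
    """
    density = vms_per_host_normal * overcommit_factor
    return math.ceil(protected_vms / density)

# 40 protected VMs at a normal density of 10 VMs per host would need 4
# hosts; accepting 2x overcommit during a disaster cuts that to 2.
```

Replicating only the critical servers shrinks `protected_vms` further, compounding the reduction.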
One of the big deliverables in server virtualization is how it broadens the application of the DR site. Since the servers are virtual, the cost to replicate all the server images to the DR site and have them ready to stand up in the event of a disaster drops dramatically. How you replicate that data is up to you, but most companies will use some sort of storage controller-based replication to get the job done.
DR automation via VMware Site Recovery Manager
To further enhance DR, virtualization suppliers are beginning to bring automation to the process. For example, VMware has launched Site Recovery Manager (SRM). SRM is a workflow engine designed to automate paper runbooks, triggering the movement and restoration of virtual environments between different ESX clusters. If you have the right networks, storage and ESX physical hosts preconfigured at your primary and backup sites, VMware SRM is all you need to ensure that a disaster at one site is recovered at the other.
Again, recovery does require some physical pre-work, unless you're combining SRM with a lower-layer infrastructure virtualization tool like those from Scalent, Unisys or Egenera. That said, VMware arguably made a wise move in leveraging replication technology already widely available from various storage and software suppliers instead of building its own replication module from scratch. By requiring that a vehicle be in place to replicate the storage to an alternate array, remote site or both, VMware cleverly avoided the debate over which replication technology is best. All that is needed for storage support is "connector" software linking the storage manufacturer's replication module to the SRM module. Many major storage manufacturers have created those connectors, and adjunct physical lower-layer software exists, as we'll discuss later in the article, so all's well.
Best uses for VMware Site Recovery Manager
SRM is perhaps best used to handle total ESX host farm failures. Component failures, such as an HBA, NIC or local disk, might be better handled with VMware HA. Why? SRM creates the concept of a Protection Group of virtual machines that fail over together in the event of an interruption. The replication process that SRM triggers on attached storage creates a datastore group of disks at the DR site that stores the boot images and data sets of the protected VMs. Then a shadow VM (a stub, or placeholder) is created in the secondary Virtual Center inventory for each protected VM. SRM also handles much of the inventory mapping needed to connect virtual machines to the correct resource pools, VM folders and networks. Thus, entire VM systems are moved as a unit.
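The relationship between a protection group, its shadow placeholders and the inventory mappings can be pictured with a small data-model sketch. This is a hypothetical illustration, not SRM's actual schema or API; all names (`build_shadow_inventory`, the dictionary keys, the VM and pool names) are invented for the example.

```python
def build_shadow_inventory(protection_group, inventory_map):
    """Create one placeholder (shadow VM) entry per protected VM,
    remapping its primary-site resource pool and network to the
    DR-site equivalents given in the inventory mappings."""
    shadows = []
    for vm in protection_group["vms"]:
        shadows.append({
            "name": vm["name"] + "-shadow",  # stub in the secondary inventory
            "resource_pool": inventory_map["resource_pools"][vm["resource_pool"]],
            "network": inventory_map["networks"][vm["network"]],
        })
    return shadows

# Illustrative inputs: one protected VM and its DR-site mappings.
group = {"vms": [{"name": "erp01", "resource_pool": "prod", "network": "vlan10"}]}
mapping = {"resource_pools": {"prod": "dr-prod"}, "networks": {"vlan10": "dr-vlan10"}}
```

The point of the model is that the mapping is defined once per site pair, so every VM in the group lands in the right pool and network without per-VM hand-holding.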
SRM is a great automation alternative to paper runbooks, and should be familiar to anyone still burdened with those heavy three-ring binders. Within SRM you can define the processes that you want triggered during a failover, and SRM will send alerts, initiate storage replication, start any third-party physical-layer software you need to run and monitor the results.
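The runbook-as-code idea behind that automation can be captured in a few lines. The following is a minimal sketch in the spirit of what SRM does; the step names and the `run_failover` function are illustrative assumptions, not the SRM API.

```python
def run_failover(steps):
    """Execute ordered recovery steps, stopping at the first failure,
    and return a log of (step, status) outcomes."""
    log = []
    for name, action in steps:
        try:
            action()
            log.append((name, "ok"))
        except Exception as exc:
            log.append((name, "failed: %s" % exc))
            break  # like a paper runbook, do not continue past a failed step
    return log

# Illustrative recovery plan; each lambda stands in for a real hook.
steps = [
    ("send alert", lambda: None),            # e.g. page the on-call team
    ("promote replica LUNs", lambda: None),  # storage-vendor replication hook
    ("power on placeholder VMs", lambda: None),
]
```

Unlike the three-ring binder, the "runbook" here produces a machine-readable log of exactly which steps ran and where a recovery stalled.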
But what about the previously mentioned pre-conditions for SRM to be successful?
It is important to be aware of the physical aspects of virtual environments, and not to take these issues lightly. ESX runs on physical machines, connected to real networks and central storage. For SRM to do its job, those physical server machines need ESX running, with the appropriate network connectivity and storage access preconfigured, at your primary and backup sites.
If this setup were performed manually, far in advance of any potential disaster, it would be in direct conflict with the desire to keep the resources at the DR site light to save on costs and power.
Additionally, you may want real-time failover of physical machines into virtual machines. For example, many customers would like certain physical servers to run as virtual machines at the DR site to keep costs down.
So how can you address the need for real-time physical-layer automation, and for P2V and V2P conversions that quickly spin up additional ESX host machines with the associated network and storage connectivity, on demand? VMware provides an answer. In addition to relying on storage vendor replication, VMware's documentation points to software from infrastructure management and automation providers, a category known as infrastructure virtualization.
Infrastructure virtualization products
Infrastructure virtualization products such as Scalent V/OE, Unisys uAdapt and Egenera PAN Manager address the pre-conditions for SRM. These real-time physical layer management and automation products allow customers to create full ESX instances on physical machines with correct network and storage connectivity – or move a server instance in real time to another bare-metal server or into (or out of) a virtual machine (P2V, V2V and V2P).
Infrastructure virtualization thus solves the balancing problem of keeping sufficient physical servers at the DR site. Additional ESX servers can be created and powered on as needed, effectively producing an ESX-on-demand infrastructure (think "local ESX cloud"). For example, using storage replication and SRM, all virtual machines could initially be positioned for backup on a single ESX server, ensuring disaster recoverability while keeping infrastructure costs down. When a disaster is impending or occurs, you can use infrastructure virtualization to remotely create additional physical ESX farms as needed, which then absorb the load of VMs directly off storage or off the single ESX server via VMotion.
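The ESX-on-demand idea above can be sketched as a simple placement loop. This is an illustration of the concept only, with invented names and a fixed slot count per host; no vendor API works exactly this way. It provisions a new host only when the previous one is full, then fills it, much as VMotion rebalancing would spread recovered VMs across newly created hosts.

```python
def provision_and_place(vm_names, slots_per_host):
    """Return a list of hosts, each a list of the VMs placed on it,
    provisioning a new host only when the current one is full."""
    hosts = []
    for i, vm in enumerate(vm_names):
        if i % slots_per_host == 0:
            hosts.append([])  # image another bare-metal box with ESX
        hosts[-1].append(vm)
    return hosts
```

With five recovered VMs and two slots per host, three hosts get powered on; until a disaster actually strikes, none of them needs to exist.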
The advantage of adding the infrastructure virtualization products is that they also cover non-virtual-machine HA/DR needs. Standalone, nonvirtualized bare-metal servers can be replicated to bare-metal machines just as SRM handles virtual machines, and/or moved into virtual machines for a quick return to operations. In the event of a prolonged outage, servers can be re-provisioned to additional bare-metal hardware that can be powered on and pointed to the server image, giving the application its own dedicated resources. Look for solutions that can handle this movement in real time, effectively blurring the lines between physical and virtual environments.
The result is an ideal combination of flexibility and speed. Actual DR situations, and even most tests, always bring a few surprises that require quick thinking, network reconfigurations or additional server deployments. By using SRM and/or an infrastructure virtualization tool, customers get assistance with the workflow of the DR process while retaining the flexibility to reconnect, repurpose and redeploy via a remote connection.
ABOUT THE AUTHOR: George Crump is president and founder of Storage Switzerland, an IT analyst firm focused on the storage and virtualization segments. With 25 years of experience designing storage solutions for data centers across the US, he has seen the birth of such technologies as RAID, NAS and SAN. Prior to founding Storage Switzerland, he was CTO at one of the nation's largest storage integrators, where he was in charge of technology testing, integration and product selection.