
Using fault-tolerant systems for more resilient data centers

Live migration and fault-tolerant systems can both strengthen your business continuity efforts. They're similar but serve entirely different purposes.

Creating a true high-availability architecture with redundant networks and storage pools can do wonders for your data center, but live migration and fault-tolerant systems can bring even more business continuity benefits.

Live migration
Live migrations and true fault tolerance require a shared storage architecture. Both allow virtual machines (VMs) to be moved from one host server to another on the fly. Although there are a lot of similarities between these two features, they are used for entirely different purposes.

Live migrations are made possible by VMware's vMotion feature, and a similar feature, Live Migration, is available in Hyper-V on Windows Server 2008 R2. This feature treats the host servers as a pool of resources that can be allocated to virtual servers. You can move a running virtual server from one host to another with almost no interruption. Live migration is useful if a host server becomes overloaded and you need to offload some of its virtual servers, or if you need to take a host server down for maintenance. One thing to remember is that vMotion does not create fault-tolerant systems: it is a planned move and does nothing to protect a VM when its host fails unexpectedly.
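To make the "pool of resources" idea concrete, here is a minimal Python sketch of evacuating a host before maintenance. It is a toy model, not the vMotion or Hyper-V API: the `Host` class, the `evacuate` function, the abstract load units and the "least-loaded host wins" placement rule are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    name: str
    capacity: int                             # abstract load units (hypothetical)
    vms: dict = field(default_factory=dict)   # VM name -> load units

    @property
    def load(self) -> int:
        return sum(self.vms.values())


def evacuate(source: Host, pool: list[Host]) -> list[tuple[str, str]]:
    """Move every VM off `source`; return the (vm, target host) moves made."""
    moves = []
    # Place the largest VMs first so they are not crowded out by small ones.
    for vm, load in sorted(source.vms.items(), key=lambda kv: -kv[1]):
        candidates = [
            h for h in pool
            if h is not source and h.load + load <= h.capacity
        ]
        if not candidates:
            raise RuntimeError(f"no host in the pool has room for {vm}")
        target = min(candidates, key=lambda h: h.load)  # least-loaded host
        target.vms[vm] = source.vms.pop(vm)             # the "migration"
        moves.append((vm, target.name))
    return moves


host_a = Host("host-a", capacity=10, vms={"web": 4, "db": 5})
host_b = Host("host-b", capacity=10, vms={"mail": 3})
host_c = Host("host-c", capacity=10)

moves = evacuate(host_a, [host_a, host_b, host_c])
```

After the call, `host_a` carries no VMs and can be taken down. Note that nothing in this model reacts to an unexpected host failure, which is exactly the gap fault tolerance is meant to fill.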

Fault-tolerant systems
VMware does, however, include a fault-tolerance feature called VMware Fault Tolerance (FT) in vSphere 4. Unlike vMotion, VMware FT is designed to rapidly detect and respond to hardware failure so that a virtual server's workload can instantly continue on an alternate host. This is made possible by vLockstep technology.

The basic premise of vLockstep is that a primary VM and a secondary VM are kept in perfect sync. That way, if the primary VM fails, the secondary VM is ready to take over in an instant.

vLockstep technology creates fault-tolerant systems by ensuring that both the primary and the secondary VMs execute the same instructions in the same sequence. The primary virtual server's instruction stream is passed to the secondary VM over a dedicated server backbone network. The backbone network is also used to transmit heartbeats between the primary and secondary VMs so that failures can be quickly detected.
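The lockstep and heartbeat ideas can be illustrated with a small Python sketch. This is only a toy model under stated assumptions, not how VMware implements vLockstep: `ToyVM` is a hypothetical deterministic state machine standing in for a VM, a `deque` stands in for the backbone network, and `HeartbeatMonitor` with its one-second timeout is invented for the example.

```python
from collections import deque


class ToyVM:
    """A deterministic state machine standing in for a VM (hypothetical)."""

    def __init__(self) -> None:
        self.state = 0

    def execute(self, event: int) -> None:
        # Same events in the same order always produce the same state.
        self.state = self.state * 31 + event


class HeartbeatMonitor:
    """Declares the peer failed if no heartbeat arrives within `timeout`."""

    def __init__(self, timeout: float) -> None:
        self.timeout = timeout
        self.last_seen = 0.0

    def beat(self, now: float) -> None:
        self.last_seen = now

    def peer_failed(self, now: float) -> bool:
        return now - self.last_seen > self.timeout


primary, secondary = ToyVM(), ToyVM()
backbone = deque()  # stands in for the dedicated backbone network

# The primary executes each event and forwards it over the backbone.
for event in [3, 1, 4]:
    primary.execute(event)
    backbone.append(event)

# The secondary replays the forwarded events in the same order.
while backbone:
    secondary.execute(backbone.popleft())

in_sync = primary.state == secondary.state

# Heartbeats travel over the same link; times are in seconds.
monitor = HeartbeatMonitor(timeout=1.0)
monitor.beat(now=0.0)
alive_at_half_second = not monitor.peer_failed(now=0.5)
failed_at_two_seconds = monitor.peer_failed(now=2.0)
```

Because both machines are deterministic and see the same events in the same order, they end in the same state, which is what lets the secondary take over without losing work.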

The interesting thing about vLockstep technology is that, because the primary and secondary virtual servers execute the same instructions, both VMs initiate the same disk writes. Because both VMs are connected to the same storage pool, however, VMware FT suppresses write operations on the secondary VM. This ensures that only one VM is making changes to the data on the virtual hard drive.
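A short Python sketch shows why suppression matters when two VMs share one disk. The `SharedDisk` and `FtVM` classes are hypothetical stand-ins, and the real suppression mechanism inside VMware FT is not described in this article, so treat this purely as an illustration of the principle.

```python
class SharedDisk:
    """Stands in for a virtual hard drive on the shared storage pool."""

    def __init__(self) -> None:
        self.blocks = {}
        self.write_count = 0

    def write(self, block: int, data: str) -> None:
        self.blocks[block] = data
        self.write_count += 1


class FtVM:
    """A VM in a fault-tolerant pair (hypothetical model)."""

    def __init__(self, disk: SharedDisk, is_primary: bool) -> None:
        self.disk = disk
        self.is_primary = is_primary
        self.suppressed = 0

    def write(self, block: int, data: str) -> None:
        if self.is_primary:
            self.disk.write(block, data)  # only the primary touches the disk
        else:
            self.suppressed += 1          # the secondary's write is dropped


disk = SharedDisk()
primary = FtVM(disk, is_primary=True)
secondary = FtVM(disk, is_primary=False)

# Both VMs run the same instructions, so both issue the same writes.
for block, data in [(0, "boot"), (1, "log entry")]:
    primary.write(block, data)
    secondary.write(block, data)
```

Each logical write reaches the disk once, not twice. On failover, the promoted VM would have `is_primary` set so that its writes start going through.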

VMware FT can be used within a VMware High Availability cluster. This allows multiple failovers to occur. If the primary VM fails, then failover occurs, and the secondary VM becomes the primary. VMware HA will automatically create a new secondary VM on another cluster node. This allows the VM to remain fault tolerant in spite of the failure that has occurred on the original host server.
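The failover sequence described above can be sketched as a small function. This is an assumption-laden model, not VMware HA's placement logic: `FtPair`, `handle_host_failure` and the host names are hypothetical, and the new secondary is simply placed on the first spare host in the list.

```python
from dataclasses import dataclass


@dataclass
class FtPair:
    primary: str    # host running the primary VM
    secondary: str  # host running the secondary VM


def handle_host_failure(pair: FtPair, failed: str,
                        healthy_hosts: list[str]) -> FtPair:
    """Return the FT pair after `failed` goes down."""
    if failed == pair.primary:
        survivor = pair.secondary   # the secondary is promoted to primary
    elif failed == pair.secondary:
        survivor = pair.primary     # the primary keeps running
    else:
        return pair                 # the failure did not affect this VM

    # Re-create the secondary on another cluster node to restore protection.
    spares = [h for h in healthy_hosts if h not in (survivor, failed)]
    if not spares:
        raise RuntimeError("no spare host: the VM would run unprotected")
    return FtPair(primary=survivor, secondary=spares[0])


pair = FtPair(primary="esx1", secondary="esx2")

# First failure: esx1 dies, esx2 takes over, esx3 hosts the new secondary.
after_first = handle_host_failure(pair, "esx1", ["esx2", "esx3", "esx4"])

# Second failure: the new primary dies too, and the VM still survives.
after_second = handle_host_failure(after_first, "esx2", ["esx3", "esx4"])
```

The second call is the point of combining the two features: because a fresh secondary is created after each failover, the VM can ride out more than one host failure as long as spare hosts remain.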

Although creating a resilient data center does not necessarily require you to create traditional server clusters, using redundant hardware is still a must. To make VM migrations and fault tolerance possible, your data center must provide centralized storage that is accessible to all host servers but without creating a single point of failure.

Brien M. Posey, MCSE, has received Microsoft's Most Valuable Professional Award seven times for his work with Windows Server, IIS and Exchange Server. He has served as the CIO for a nationwide chain of hospitals and healthcare facilities, and was once a network administrator for Fort Knox. You can visit his personal website at
