Virtualization brings many benefits by consolidating underutilized servers, reducing the number of physical servers required to run the same workload. However, because each physical server now runs multiple virtual server workloads, that server becomes a single point of failure for all of them. This escalates the issue of availability for that physical server. In other words, if you put all your virtual eggs in one basket, then you want to take very good care of that basket.
As part of planning which virtual servers should reside together on which physical server, the question of availability must be considered. How important is it to keep each virtual server up and running? What level of down time can you tolerate? There is a continuum of availability that ranges from "annoying when it crashes, but no critical impact and thus not worth spending money on" to "life support applications that must stay operational with complete redundancies and fault tolerance, worth whatever it costs."
In dealing with virtual and physical servers there are a variety of ways to accomplish varying levels of availability, and they fall into different places along the availability continuum.
The first stop on the continuum after "annoying but not worth spending money on" is adding hardware component redundancy. Redundancy is common for certain components, such as dual-ported HBAs and NICs, allowing automatic failover to the other port in the event of a port failure. Multi-pathing is another example, allowing for a failure in the switching network. Certain hardware platforms, such as blades, also have redundancy in various other components. Blade systems include redundant, hot-swappable power supplies, management modules, and switches as part of the chassis, all designed to increase availability. Disk drives can also be hot-swappable, and of course disks can be configured in varying RAID levels up through full mirroring.
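To put rough numbers on what component redundancy buys, here is a minimal sketch in Python. It is not from any vendor's documentation: it assumes the redundant components fail independently (real hardware only approximates this), and the 99% figure for a single port is a made-up illustration, as are the function names.

```python
HOURS_PER_YEAR = 365 * 24


def parallel_availability(component_availability: float, copies: int) -> float:
    """Availability of a set of redundant components when any one suffices.

    Assumes independent failures: the set is down only when every copy
    is down at the same time.
    """
    if not 0.0 <= component_availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if copies < 1:
        raise ValueError("need at least one component")
    return 1.0 - (1.0 - component_availability) ** copies


def downtime_hours_per_year(availability: float) -> float:
    """Expected hours of downtime in a 365-day year."""
    return (1.0 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    single_port = 0.99  # hypothetical availability of one NIC or HBA port
    dual_port = parallel_availability(single_port, 2)
    print(f"single port: {downtime_hours_per_year(single_port):.1f} h/year down")
    print(f"dual port:   {downtime_hours_per_year(dual_port):.2f} h/year down")
```

Under these assumptions, a second port takes the path from roughly 88 hours of expected downtime per year to under one hour, which is why redundancy is usually the first stop on the continuum.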
High availability (HA) clustering is another way to increase availability. HA clustering has been part of Microsoft Windows since before NT 4, when it was in beta as Wolfpack. Then in NT 4.0 it was released as Microsoft Cluster Service (MSCS). In Windows 2000 Server and Windows Server 2003, it became Server Clustering, and now in Windows Server 2008 (Longhorn) it is Windows Server Failover Clustering, or WSFC.
In the Windows releases prior to Windows Server 2008, clustering was designed for enterprise environments only, requiring shared disks (SANs) and a high level of IT expertise. With Windows Server Failover Clustering, Microsoft is attempting to make high availability clustering easier to adopt. The requirement for a SAN is eliminated, and a higher-level user interface makes it easier for less technical staff to administer. As the need for HA grows in smaller organizations, these changes will help those organizations benefit from HA clustering. A variety of features in Longhorn, including a new Validate tool (which runs tests on the set of servers to be clustered to identify any configuration problems) and new simplified, task-based management capabilities, make clustering easier to handle.
Windows clustering with virtualization can be done in a number of ways and for varying reasons:
Clustering the Guest OS – One way to increase availability at the guest OS level is to use Microsoft Cluster Service to provide failover for the guest operating systems. The cluster can consist of two guest virtual machines on the same physical machine or two guest virtual machines on different physical machines.
- HA clustering within the same physical machine would be done primarily for development and testing, as it would not provide high availability in the event of a hardware failure.
- HA clustering across two physical machines protects against a physical failure on either machine, and provides true HA of the guest OS and its applications. This type of clustering would require access to a shared SAN so both servers have access to the data.
Clustering Virtual Servers – Microsoft Cluster Service can also be used to cluster the Virtual Server host software itself, across physical servers. MSCS then monitors the physical server, the Windows Server software, and the Virtual Server software. The clustering is done at the virtualization level, rather than at the guest OS level. Guest OS virtual machines may be monitored individually and moved between Virtual Server hosts for maintenance or load balancing.
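The point about guest clusters on the same physical machine can be turned into a simple planning check. The Python sketch below is only an illustration: the placement map, the guest and host names, and the function are all hypothetical and do not correspond to any Microsoft tool or API. It flags guest cluster pairs whose two nodes sit on the same physical host and therefore would not survive a hardware failure.

```python
from typing import Dict, List, Tuple


def colocated_cluster_pairs(
    placement: Dict[str, str],
    clusters: List[Tuple[str, str]],
) -> List[Tuple[str, str]]:
    """Return the guest cluster pairs whose two nodes share a physical host.

    placement maps guest VM name -> physical host name (hypothetical names).
    clusters lists the two guest VMs that make up each guest-level cluster.
    """
    return [(a, b) for a, b in clusters if placement[a] == placement[b]]


if __name__ == "__main__":
    placement = {
        "sql-node1": "host-a",
        "sql-node2": "host-b",  # production pair split across hosts
        "dev-node1": "host-a",
        "dev-node2": "host-a",  # test pair on one host
    }
    clusters = [("sql-node1", "sql-node2"), ("dev-node1", "dev-node2")]
    for pair in colocated_cluster_pairs(placement, clusters):
        print("no hardware-level HA for cluster pair:", pair)
```

A pair reported here may be perfectly acceptable for development and testing; the check simply makes the trade-off visible when deciding which virtual servers should reside together.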
VMware High Availability (HA)
VMware HA is an add-on to VMware ESX Server that provides clustering at the ESX hypervisor level. This is similar to what clustering Virtual Servers does in the Microsoft Virtual Server environment just described, but for ESX Server. VMware HA is also based on heartbeat monitoring/detection, and in the event of a failure, provides automated restart of all affected virtual machines within the specified resource pool. VMware HA's restart capability is based on the VMFS (VM File System) clustered file system (requiring shared storage, either SAN or NAS), which allows shared read/write access to the same VM files from multiple ESX servers concurrently. HA also requires VMware VirtualCenter for its management console. When used with VMware Distributed Resource Scheduler (DRS), HA will automate the restart based on optimal placement of virtual machines within the resource pool, according to DRS logic.
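The general pattern described here, heartbeat detection followed by automated restart on surviving hosts, can be sketched in a few lines of Python. This is an illustration of the idea only and not VMware's implementation: the `Host` class, the slot-based capacity model, the timeout value, and the greedy "most free slots" placement are all assumptions standing in for heartbeat and DRS logic that the article does not detail.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Host:
    name: str
    capacity: int          # hypothetical: number of VM slots on this host
    last_heartbeat: float  # time (seconds) the last heartbeat was received
    vms: List[str] = field(default_factory=list)


def failed_hosts(hosts: List[Host], now: float, timeout: float) -> List[Host]:
    """Hosts whose most recent heartbeat is older than the timeout."""
    return [h for h in hosts if now - h.last_heartbeat > timeout]


def restart_plan(
    hosts: List[Host], now: float, timeout: float
) -> Dict[str, Optional[str]]:
    """Map each VM on a failed host to the surviving host with the most free slots.

    A VM maps to None when no surviving host has a free slot.
    """
    dead = failed_hosts(hosts, now, timeout)
    dead_names = {h.name for h in dead}
    free = {
        h.name: h.capacity - len(h.vms) for h in hosts if h.name not in dead_names
    }
    plan: Dict[str, Optional[str]] = {}
    for host in dead:
        for vm in host.vms:
            candidates = [name for name, slots in free.items() if slots > 0]
            if not candidates:
                plan[vm] = None
                continue
            target = max(candidates, key=lambda name: free[name])
            plan[vm] = target
            free[target] -= 1
    return plan


if __name__ == "__main__":
    hosts = [
        Host("esx1", capacity=4, last_heartbeat=100.0, vms=["a", "b"]),
        Host("esx2", capacity=4, last_heartbeat=85.0, vms=["c", "d"]),
        Host("esx3", capacity=4, last_heartbeat=99.0, vms=["e"]),
    ]
    print(restart_plan(hosts, now=100.0, timeout=10.0))
```

The sketch assumes every surviving host can read the failed host's VM files, which is the role shared storage plays in the restart: without it, the plan could name a target host that has no access to the virtual machine's disks.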
Although high availability clustering offers improved availability, in the event of a failure of a node, you must still restart all virtual machines and applications, and deal with the associated (though hopefully short) outage. For applications requiring a higher level of availability, the upper end of the continuum involves running virtualization software on fault tolerant hardware. As an example, Microsoft Virtual Server running on the NEC Express5800 fault tolerant series of servers offers complete fault tolerance at the hardware level, the Virtual Server level, and the guest OS virtual machine level. The NEC architecture is a fully redundant, lockstep architecture that allows continuous processing in the event of a hardware failure, rather than requiring a restart.
Full fault tolerance has specific requirements at both the hardware and operating system software levels, so if you require fault tolerance, it is important to verify both the hardware and the software running on the bare metal.
Where do you fall on the virtualization availability continuum? Do you have questions or experiences related to addressing high availability and virtualization that you'd like to share? Email me via firstname.lastname@example.org.
About the author: Barb Goldworm is president and chief analyst of Focus Consulting, a research, analyst and consulting firm focused on systems, software, and storage. Barb has spent thirty years in the computer industry, in various technical, marketing, sales, senior management, and industry analyst positions with IBM, Novell, StorageTek, Enterprise Management Associates, and multiple successful startups. She has been one of the top three ranked analyst/knowledge expert speakers at Storage Networking World.