Unexpected downtime in a virtual environment is expensive, especially when you factor in the cost of the disaster recovery necessary to bring things back online. It's therefore in an administrator's best interest to plan for worst-case scenarios and to maintain high levels of availability to keep things running smoothly. VMware achieves and maintains this high level of availability through the Virtual SAN Stretched Cluster feature.
The stretched cluster feature first appeared in VMware Virtual SAN (vSAN) 6.1. Built on the foundation of fault domains, but made to provide "data center awareness" rather than "rack awareness," this feature allows the administrator to configure a vSAN object across two separate geographical sites. The stretched cluster synchronously replicates between these two sites; if one of these sites should fail, a copy of the data will still be available. A vSAN stretched cluster also includes a third witness site that provides cluster quorum services in the event of a failure.
The problem with this original design is that if one site fails, you're left with a single copy of the data, so any additional failure could lead to data loss. Also, if a single disk or host fails at either site, vSAN must retransmit the data that was on the failed device from the other site in order to rebuild the RAID 1 protection.
VSAN 6.6 adds flexibility for local and site protection to stretched clusters. Administrators can now configure protection against entire site outages, as well as device outages within a site. Although site protection is still only RAID 1, this new feature allows admins to create mirrored protections for hybrid clusters within a local site. If you have an all-flash cluster, this new local protection allows you to create either mirrored (RAID 1) or erasure coding (RAID 5 or 6) protection within a local site. For example, if you have an all-flash vSAN configuration, you can simultaneously configure local clusters at each site to tolerate two device failures (RAID 6) and configure the stretched cluster to tolerate a single site failure (RAID 1). This means that you no longer have to fetch data from the other site in the event of a local device failure.
VSAN 6.6 stretched clusters implement a "proxy owner" per site. Rather than write to all replicas in the second site, a single write goes to the proxy owner, which then writes to all replicas on that local site. This decreases the amount of traffic replicated between sites.
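To make the traffic saving concrete, here is a minimal sketch that counts inter-site transmissions for a single guest write, with and without a proxy owner. It is a simplified model rather than vSAN's actual I/O path: the function name is hypothetical, and it assumes each remote replica would otherwise cost exactly one cross-site transmission.

```python
def cross_site_writes(remote_replicas: int, proxy_owner: bool) -> int:
    """Count inter-site transmissions for one guest write (simplified model).

    remote_replicas: number of replicas held at the second site (assumption:
                     one transmission per replica when no proxy owner exists).
    proxy_owner:     True if the second site has a proxy owner that fans the
                     write out locally.
    """
    if remote_replicas < 0:
        raise ValueError("remote_replicas must be non-negative")
    if remote_replicas == 0:
        return 0  # nothing is stored at the second site
    # With a proxy owner, a single write crosses the inter-site link and the
    # proxy owner writes to every replica on its own site.
    return 1 if proxy_owner else remote_replicas


# Example: two mirrored replicas at the second site.
without_proxy = cross_site_writes(2, proxy_owner=False)
with_proxy = cross_site_writes(2, proxy_owner=True)
```

In this model, the more local replicas a site keeps, the larger the saving on the inter-site link.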
Protect clusters with FTT
Stretched clusters now come with two protection policies: Primary Failures to Tolerate and Secondary Failures to Tolerate.
Primary Failures to Tolerate (PFTT) defines cross-site protection. VSAN 6.6 implements it as RAID 1. You can set PFTT to 0 or 1 in a stretched cluster -- 0 means the VM is not stretched, and 1 means the VM is stretched. This setting also enables the new site affinity feature.
Secondary Failures to Tolerate (SFTT) defines how to protect an object within a site and how many local disk or host failures it can tolerate. This option works in conjunction with the Failure Tolerance Method (FTM) to determine which protection scheme to use, depending on whether you use the hybrid configuration (RAID 1 only) or the all-flash vSAN 6.6 configuration (RAID 1, RAID 5 or RAID 6).
To enable PFTT and SFTT, create a storage policy that includes the PFTT and SFTT attributes. First make sure you enable stretched clusters and that you've configured them with the correct number of hosts at each site for the local protection you intend to use. For reference, RAID 1 with an SFTT of 1 requires a minimum of three hosts per site, RAID 5 requires a minimum of four, and RAID 6 requires a minimum of six. You must also ensure that a witness host is available to protect against the loss of a data site.
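The host requirements can be sketched as a small sizing helper. This is an illustration, not a VMware tool: the function names and the FTM strings are hypothetical. It assumes the usual vSAN sizing rules, namely that RAID 1 needs 2n+1 hosts to tolerate n failures, RAID 5 uses a 3+1 layout (four hosts), and RAID 6 uses a 4+2 layout (six hosts).

```python
def min_hosts_per_site(ftm: str, sftt: int) -> int:
    """Return the minimum number of hosts one data site needs."""
    if ftm == "RAID-1":
        if sftt not in (0, 1, 2, 3):
            raise ValueError("RAID 1 supports SFTT values 0 through 3")
        return 2 * sftt + 1  # assumption: standard 2n+1 mirroring rule
    if ftm == "erasure-coding":
        if sftt == 1:
            return 4  # RAID 5: 3 data + 1 parity components
        if sftt == 2:
            return 6  # RAID 6: 4 data + 2 parity components
        raise ValueError("erasure coding supports SFTT values 1 and 2")
    raise ValueError(f"unknown failure tolerance method: {ftm!r}")


def min_hosts_total(ftm: str, sftt: int) -> int:
    """Two data sites plus one witness host."""
    return 2 * min_hosts_per_site(ftm, sftt) + 1


print(min_hosts_total("RAID-1", 1))          # 3 per site -> 7
print(min_hosts_total("erasure-coding", 2))  # 6 per site -> 13
```

Running a check like this before creating the policy helps avoid a policy that vSAN cannot satisfy with the hosts available at each site.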
On the vSphere Web Client homepage, select the VM Storage Policies tab and click the Create VM Storage Policy button.
Under "Rule-set 1," select vSAN as the storage type, then open the drop-down menu labeled "<Add rule>." Add the "Primary level of failures to tolerate" rule and set its value to "1," then add another rule and choose "Secondary level of failures to tolerate." Here, your values depend on which FTM option you choose.
If you choose RAID 1 as your FTM, you can set your SFTT to "0," "1," "2" or "3," depending on the number of disk or host failures a storage object can tolerate per site. If you choose erasure coding as your FTM, you can set your SFTT to either "1" or "2," which represent either RAID 5 or RAID 6 protection per site.
Finally, add a rule for the FTM. Choose the requisite FTM for local protection, either RAID 1 or erasure coding.
Once you've created a new VM storage policy, vSAN stretched clustering will protect any VMs assigned to this storage policy.
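The rule combinations described in the steps above can be checked with a short validation sketch. It models the policy as plain values rather than calling any vSphere API, so the function name, the FTM strings and the `all_flash` flag are all hypothetical; only the permitted value ranges come from the steps themselves.

```python
from typing import List

# SFTT values permitted for each failure tolerance method.
VALID_SFTT = {
    "RAID-1": {0, 1, 2, 3},
    "erasure-coding": {1, 2},  # 1 -> RAID 5, 2 -> RAID 6
}


def validate_policy(pftt: int, sftt: int, ftm: str, all_flash: bool) -> List[str]:
    """Return a list of problems; an empty list means the rules are consistent."""
    errors = []
    if pftt not in (0, 1):
        errors.append("PFTT must be 0 (not stretched) or 1 (stretched)")
    if ftm not in VALID_SFTT:
        errors.append(f"unknown FTM: {ftm!r}")
        return errors  # cannot check SFTT against an unknown method
    if ftm == "erasure-coding" and not all_flash:
        errors.append("erasure coding requires an all-flash configuration")
    if sftt not in VALID_SFTT[ftm]:
        allowed = sorted(VALID_SFTT[ftm])
        errors.append(f"SFTT {sftt} is not valid for {ftm}; allowed: {allowed}")
    return errors


# Example: RAID 6 locally, RAID 1 across sites, on all-flash hardware.
problems = validate_policy(pftt=1, sftt=2, ftm="erasure-coding", all_flash=True)
```

Returning a list of problems instead of raising on the first one lets an administrator see every conflicting rule at once.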
Set site affinity in a stretched cluster
Choosing a PFTT allows you to determine site affinity for VMs in a stretched cluster. With site affinity, you can give production workloads cross-site protection while keeping test and development workloads at a single site. This also benefits workloads that have built-in availability, such as Exchange database availability groups and SQL Server AlwaysOn Availability Groups, and therefore don't require copies of their VMs at each site.
You can enable site affinity within VM Storage Policies by changing the PFTT to 0, which tells vSAN that the policy should not use stretched clustering, and adding another rule to the rule-set for "Affinity," where you choose whether the VM should reside on the Preferred Fault Domain or the Secondary Fault Domain.
Be mindful to align Distributed Resource Scheduler and High Availability rules with where the VM's data resides, and to set up affinity rules on the cluster to ensure that the VM's compute resources are pinned to where the virtual machine disk file resides within the vSAN data store. At present, this isn't an automated process.
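Because this alignment is manual, a periodic check can help catch drift. The sketch below flags VMs whose compute site differs from the site that holds their data. It works on a hand-built inventory rather than querying vCenter, and the `VmPlacement` class, its field names and the site labels are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class VmPlacement:
    name: str
    pftt: int                    # 0 = pinned to one site, 1 = stretched
    storage_site: Optional[str]  # "preferred" or "secondary"; None if stretched
    compute_site: str            # site of the host currently running the VM


def misaligned_vms(vms: List[VmPlacement]) -> List[str]:
    """Return names of site-pinned VMs running away from their data."""
    return [
        vm.name
        for vm in vms
        # Stretched VMs (PFTT = 1) have data at both sites, so only
        # site-pinned VMs (PFTT = 0) can be misaligned.
        if vm.pftt == 0 and vm.storage_site != vm.compute_site
    ]


inventory = [
    VmPlacement("prod-db", pftt=1, storage_site=None, compute_site="preferred"),
    VmPlacement("test-web", pftt=0, storage_site="secondary", compute_site="secondary"),
    VmPlacement("dev-app", pftt=0, storage_site="preferred", compute_site="secondary"),
]
print(misaligned_vms(inventory))  # ['dev-app']
```

Any VM the check reports is a candidate for a new or corrected affinity rule on the cluster.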