This is part two of a two-part series on VMware Site Recovery Manager (SRM). If you're just joining us, please take a minute to read part one, which goes over the basics of VMware Site Recovery Manager and what resignaturing is.
In normal day-to-day operations, an ESX host in the Protected Site should not see both the original LUN and its replicated LUN/snapshot at the same time. If it did, ESX would suppress the second LUN/volume. An ESX host that was allowed to see both LUNs/volumes at once would be very confused and not at all happy, because it wouldn't know which LUN/volume to send its reads and writes to. In ESX 3.5, the host would print a hard console error message suggesting that you may need to resignature the VMFS volume.
In ESX 4.0 this hard console message has been deprecated, and an ESX 4.0 host no longer prints it, which I think is a bit of a shame.
In previous versions, if the LUN/volume was a replica or snapshot, you resolved this by changing an ESX advanced setting to enable resignaturing and then rescanning the host bus adapter (HBA).
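The ESX 3.x behavior described above can be modeled as a small decision function. This is a sketch, not VMware code: it assumes the two LVM advanced settings documented for ESX 3.x, LVM.EnableResignature (default 0) and LVM.DisallowSnapshotLun (default 1), and the function and enum names are hypothetical. On a real host you would change the setting (for example in the VI Client's Advanced Settings dialog or with esxcfg-advcfg) and then rescan the HBA.

```python
from enum import Enum


class SnapshotOutcome(Enum):
    """What an ESX 3.x host does with a LUN it detects as a snapshot (hypothetical model)."""
    HIDDEN = "not presented to the host"
    MOUNTED_WITH_ORIGINAL_SIGNATURE = "mounted with its original VMFS signature"
    RESIGNATURED = "resignatured and mounted with a new UUID"


def esx3_snapshot_outcome(enable_resignature: int = 0, disallow_snapshot_lun: int = 1) -> SnapshotOutcome:
    """Model the result of an HBA rescan for a detected snapshot LUN.

    Assumed settings, with ESX 3.x defaults as the parameter defaults:
    LVM.EnableResignature = 0, LVM.DisallowSnapshotLun = 1.
    """
    if enable_resignature == 1:
        # EnableResignature wins: DisallowSnapshotLun is ignored and the volume is resignatured.
        return SnapshotOutcome.RESIGNATURED
    if disallow_snapshot_lun == 1:
        # Out-of-the-box state: the snapshot LUN is suppressed.
        return SnapshotOutcome.HIDDEN
    # Both settings 0: the snapshot is mounted as-is, keeping the original UUID.
    return SnapshotOutcome.MOUNTED_WITH_ORIGINAL_SIGNATURE


if __name__ == "__main__":
    # Default settings first, then the "enable a resignature and rescan" approach.
    print(esx3_snapshot_outcome().value)
    print(esx3_snapshot_outcome(enable_resignature=1).value)
```

Keeping the default as the "hidden" case mirrors the suppression behavior described earlier: nothing is resignatured unless an administrator opts in.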
Issuing a resignature in vSphere 4
In vSphere 4 there are two ways of resignaturing a snapshot volume to make it visible to the ESX host. You can now issue a resignature from the graphical user interface (GUI), in addition to using a command-line interface (CLI) tool called esxcfg-volume (which I'll go over later). If you use your storage vendor's management tools to present a snapshotted or replicated volume to an ESX host, the Add Storage wizard will show it with an existing Virtual Machine File System (VMFS) volume label.
To demonstrate this manual approach to presenting storage (which you would use if you were testing your disaster recovery (DR) plan without SRM), I temporarily gave one of my ESX hosts access to a replicated volume and then ran the Add Storage wizard on that host.
The volume is not blank: it has a valid VMFS volume label. When I select it, the Add Storage wizard recognizes it as a replicated volume and offers me the chance to carry out a manual resignature.
Alternatively, if you're handy with the command line, the new esxcfg-volume command supports a -l switch to list all volumes that have been detected as snapshots, and a -r switch to resignature a volume. For example, this command:
esxcfg-volume -l
lists the snapshot/replicated volumes the ESX host has discovered.
In my case, the output shows that the VMFS volume cannot be mounted because the original volume is still online, but that it is available to the resignature process. So if I then followed that command with:
esxcfg-volume -r lefthand-networks-virtualmachines
Note: lefthand-networks-virtualmachines is the VMFS volume label.
Then the ESX host would resignature the volume and mount it. When this happens, the volume is given a new Universally Unique Identifier (UUID) and a new volume name beginning with "snap-".
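If several replicated volumes are presented at once, the listing and resignature steps can be scripted. The sketch below parses esxcfg-volume -l output and builds the matching esxcfg-volume -r commands without running them. The output layout it expects (a "VMFS3 UUID/label:" line followed by "Can mount:" and "Can resignature:" lines for each volume) is an assumption based on ESX 4.0 output, so check it against your own host; the function names and the sample UUID are hypothetical.

```python
import shlex


def resignature_candidates(listing: str) -> list[str]:
    """Return labels of snapshot volumes that report 'Can resignature: Yes'.

    Assumes each volume block starts with 'VMFS3 UUID/label: <uuid>/<label>'.
    """
    candidates = []
    label = None
    for raw_line in listing.splitlines():
        line = raw_line.strip()
        if line.startswith("VMFS3 UUID/label:"):
            value = line.split(":", 1)[1].strip()
            # Keep the label after the first '/', falling back to the whole value.
            label = value.split("/", 1)[1] if "/" in value else value
        elif line.startswith("Can resignature:") and label is not None:
            answer = line.split(":", 1)[1].strip()
            if answer.lower().startswith("yes"):
                candidates.append(label)
            label = None  # this volume's block has been handled
    return candidates


def resignature_commands(listing: str) -> list[str]:
    """Build (but do not run) one esxcfg-volume -r command per candidate label."""
    return [f"esxcfg-volume -r {shlex.quote(label)}" for label in resignature_candidates(listing)]


if __name__ == "__main__":
    # Hypothetical sample in the assumed layout; the UUID is made up.
    sample = """\
VMFS3 UUID/label: 4a1b2c3d-5e6f7a8b-9c0d-001122334455/lefthand-networks-virtualmachines
Can mount: No (the original volume is still online)
Can resignature: Yes
"""
    for command in resignature_commands(sample):
        print(command)
```

Printing the commands rather than executing them keeps a human in the loop, which matters because resignaturing a VMFS volume is irreversible.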
Of course, these command-line actions are also reflected in the vSphere Client.
This behavior is the same for all storage vendors; I'm just using Hewlett-Packard LeftHand as an example. I first noticed this new way of managing resignaturing while working with EMC and its Replication Manager software.
What would happen without SRM?
This behavior could have some very undesirable consequences if you were carrying out manual DR without SRM. The volume/datastore name would change and a new UUID would be generated. If virtual machines were registered on that VMFS volume, there would be a problem, because all of the VMX files for those virtual machines would still be "pointing" at the old UUID rather than the new one. Each VM would need to be removed from the vCenter inventory and reregistered to pick up the new UUID.
In short, when you carry out manual DR, you must resignature the volume first and then register the VMs against the new volume name and UUID. The same sequence of events happens when you test VMware SRM recovery plans.
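The bookkeeping that manual DR requires after a resignature can be sketched as a path rewrite: anything that referred to /vmfs/volumes/<old UUID>/ must now refer to /vmfs/volumes/<new UUID>/. The helpers below are hypothetical and only transform strings: one rewrites the VMX paths you would re-register (for example via Add to Inventory in the vSphere Client), the other rewrites absolute references inside a VMX file's text. They assume the VMs are recorded by UUID-based paths, the UUIDs and VM names in the example are made up, and this is not SRM's own implementation.

```python
def remap_path(path: str, old_uuid: str, new_uuid: str) -> str:
    """Rewrite a /vmfs/volumes/<old_uuid>/... path to use new_uuid; leave other paths alone."""
    old_prefix = f"/vmfs/volumes/{old_uuid}/"
    if path.startswith(old_prefix):
        return f"/vmfs/volumes/{new_uuid}/" + path[len(old_prefix):]
    return path


def remap_vmx_text(vmx_text: str, old_uuid: str, new_uuid: str) -> str:
    """Replace every absolute reference to the old volume inside a VMX file's contents."""
    return vmx_text.replace(f"/vmfs/volumes/{old_uuid}/", f"/vmfs/volumes/{new_uuid}/")


if __name__ == "__main__":
    OLD = "4a1b2c3d-5e6f7a8b-9c0d-001122334455"  # made-up UUID of the original volume
    NEW = "4b9e8d7c-6a5f4e3d-2c1b-001122334455"  # made-up UUID after resignaturing
    registered = [
        f"/vmfs/volumes/{OLD}/web01/web01.vmx",
        "/vmfs/volumes/other-ds/db01/db01.vmx",  # on another datastore: unchanged
    ]
    # Step 1 (resignature) has already happened; step 2 is re-registering at the new paths.
    for vmx in registered:
        print(remap_path(vmx, OLD, NEW))
    # A made-up VMX line holding an absolute disk path on the old volume.
    print(remap_vmx_text(f'scsi0:0.fileName = "/vmfs/volumes/{OLD}/web01/web01.vmdk"', OLD, NEW))
```

The ordering matters: until the volume has been resignatured there is no new UUID to remap to, which is why the resignature has to come first.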
How VMware Site Recovery Manager helps
Are you with me so far? Now, the good news is that SRM automatically resignatures volumes for you (but only in the Recovery Site) and auto-magically fixes any issues with the VMX files. Because the ESX hosts in the Recovery Site could be presented with different snapshots taken at different times, SRM defaults to resignaturing automatically. It then corrects the VMX files of the recovered virtual machines to ensure they power on without error.
Enabling automatic renaming of VMFS volumes
In the early beta release of SRM 1.0, SRM automatically renamed the VMFS volume back to its original name. In the general availability releases of SRM 1.0 and 4.0, however, this renaming process was dropped. If you do want SRM to give the VMFS snapshots their original volume names, you can enable this by editing the vmware-dr.xml file, or by changing the setting in the Advanced Settings dialog box, which you open by right-clicking the Site Recovery node in the vSphere Client.
Some may regard this mandatory resignature as overly cautious, but it reduces errors by lowering the chance of an ESX host being presented with the same UUID more than once. If this automatic resignaturing did not occur and an ESX host were presented with two LUNs/volumes carrying the same VMFS volume label, datastore name and UUID, the administrator would receive a hard error, and it would be up to the SRM administrator to resolve the problem.
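The conflict that the mandatory resignature avoids is easy to state: one host seeing the same VMFS UUID on two different LUNs. A hypothetical check for it is sketched below. It assumes single-extent volumes (a volume spanning several extents legitimately shows one UUID on each of its extents), and the device names and UUIDs are placeholders.

```python
from collections import defaultdict
from typing import Iterable


def duplicate_vmfs_uuids(volumes: Iterable[tuple[str, str]]) -> dict[str, list[str]]:
    """Map each VMFS UUID seen on more than one device to the devices presenting it.

    volumes: (device_name, vmfs_uuid) pairs, assuming one extent per volume.
    """
    devices_by_uuid: dict[str, list[str]] = defaultdict(list)
    for device, uuid in volumes:
        devices_by_uuid[uuid].append(device)
    return {uuid: devices for uuid, devices in devices_by_uuid.items() if len(devices) > 1}


if __name__ == "__main__":
    seen_by_host = [
        ("lun-original", "4a1b2c3d-5e6f7a8b-9c0d-001122334455"),
        ("lun-replica", "4a1b2c3d-5e6f7a8b-9c0d-001122334455"),  # unresignatured copy
    ]
    print(duplicate_vmfs_uuids(seen_by_host))  # conflict: both LUNs carry the same UUID
    # After resignaturing, the replica carries a new (made-up) UUID and the conflict disappears.
    seen_by_host[1] = ("lun-replica", "4b9e8d7c-6a5f4e3d-2c1b-001122334455")
    print(duplicate_vmfs_uuids(seen_by_host))
```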
Some people might take the position that these sorts of replication problems are best avoided altogether, rather than taking needless risks with data or adding a layer of manual configuration.
It's perhaps worth mentioning that there are indeed products in the storage arena where an ESX host might see both the original LUN and its snapshot at the same time. These technologies, such as HP's CrossLink/Continuous Access and EMC TimeFinder, are designed to protect you against the loss of a SAN. With them, the ESX host has connectivity to two arrays that constantly replicate to each other, so that if an entire storage array fails, the host can still access the LUN on the other array. It's probably for this reason that VMware SRM defaults to resignaturing LUNs to prevent potential corruption.
As you can see, ESX 3.x handled replicated VMFS volumes slightly differently from vSphere 4. With ESX 3 you had to use advanced settings to trigger the resignature process. In ESX 4, you can carry out this task using the GUI or the esxcfg-volume command.
The great thing about VMware Site Recovery Manager is that it handles all this complexity for you with the click of a mouse.
Mike Laverick (VCP) has been involved with the VMware community since 2003. Laverick is a VMware forum moderator and member of the London VMware User Group Steering Committee. Laverick is the owner and author of the virtualization website and blog RTFM Education, where he publishes free guides and utilities aimed at VMware ESX/VirtualCenter users, and has recently joined SearchVMware.com as an Editor at Large. In 2009, Laverick received the VMware vExpert award and helped found the Irish and Scottish VMware user groups. Laverick has had books published on VMware Virtual Infrastructure 3, VMware vSphere4 and VMware Site Recovery Manager.