Failback with SRM and vSphere Replication

SRM 5.0 introduces support for vSphere Replication. What you might not know is that although this works perfectly for recoveries where the VM moves from one site to another, there is no automated method of failback.

While vSphere Replication works for recoveries where the VM moves from one site (say SiteA: New York) to another (say SiteB: New Jersey), no automated failback exists to move the VM back from the Recovery Site (SiteB: New Jersey) to the Protected Site (SiteA: New York).

Automated failback in SRM 5.0 is a combination of running a reprotect process to invert the normal path of replication, and then running a Recovery Plan to move the VMs back. Sadly, that reprotect and failback process didn't make it into the first release of VR. That means a more manual process has to be undertaken to prepare for the failback and return SRM and VR to the state they were in before the failover took place. For an experienced SRM administrator this isn't too much of a chore, but if you are new to SRM and intend to use VR, it could be a bit of a challenge.

Say a company called Corp.com has two sites, one in New York and one in New Jersey. New York is the production or Protected Site during normal business operations. New Jersey is an ancillary location to Corp.com and is used for many purposes -- in this case as the Recovery Site in SRM.


Before I begin the process of failback, let's look at a VR setup. If you have already done this and got VR working, you might want to skip this overview section. I'm doing it just so the rest makes sense.

VR has a management server/service (VRMS) of which you need one per vCenter instance. Remember, SRM has a requirement of one vCenter per site, although it is possible using the Shared Site configuration to have one Recovery Site service the DR needs of many Protected Sites, a configuration that would be popular if a service provider was selling SRM as a service. The other requirement is at least one vSphere Replication Server (VRS) at the Recovery Site. This appliance receives the deltas that make up everyday replication. So after the first full sync of replication, only the changes are replicated. Of course, it is possible to use sneakernet: download the .VMDK files of a VM from the Protected Site, pop them on removable storage, and then use a secured courier service such as UPS or FedEx (imagine blacked-out limos and motorcyclists armed with Uzis) to ship the data to the Recovery Site. In this case, only the deltas accrued during the time it takes to move the .VMDK files need to be copied.
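To put rough numbers on the sneakernet trade-off, here is a back-of-the-envelope sketch in Python. The VM size, link speed, change rate and courier time are made-up figures for illustration, not measurements from my lab, and the function names are my own. It also ignores protocol overhead and any changes that keep arriving during the catch-up itself.

```python
def wire_copy_hours(size_gb: float, link_mbps: float) -> float:
    """Hours needed to push size_gb over a link of link_mbps (ideal, no overhead)."""
    size_megabits = size_gb * 1000 * 8  # GB -> megabits, decimal units
    return size_megabits / link_mbps / 3600


def delta_backlog_gb(change_gb_per_hour: float, transit_hours: float) -> float:
    """Changes that pile up while the disks are in the courier's van."""
    return change_gb_per_hour * transit_hours


# Hypothetical figures: a 500 GB VM, a 20 Mbps inter-site link,
# 0.5 GB of changed blocks per hour, and a 24-hour courier run.
full_copy = wire_copy_hours(500, 20)
backlog = delta_backlog_gb(0.5, 24)
catch_up = wire_copy_hours(backlog, 20)

print(f"Initial sync over the wire: {full_copy:.1f} hours")
print(f"Backlog after shipping: {backlog:.1f} GB, {catch_up:.1f} hours to catch up")
```

With these invented figures the courier wins by a couple of days. On a fat pipe or with a small VM the answer could easily go the other way.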

Unlike the VRMS, it is possible to have multiple VRS per vCenter for scale-out purposes. You would do this if you found yourself rubbing up against the scalability limits. Interestingly, the scalability limits outlined in the release notes currently contradict what's in the official admin guide -- a situation that will be fixed very shortly. It's the release notes that are correct.

Of course a failback process invariably means inverting the replication that would usually send updates from the Protected Site (SiteA: New York) to the Recovery Site (SiteB: New Jersey), so that they go in the opposite direction, from SiteB: New Jersey to SiteA: New York. For this reason I recommend that folks always set up VR at EVERY site they intend to use with SRM, so they will have the appliance in place for failback. This is also useful if your Recovery Site is not a dedicated location for DR but also a production location. This is often referred to as a bi-directional configuration in which SiteA: New York is the DR location for SiteB: New Jersey, and SiteB: New Jersey is the DR location for SiteA: New York.

Such a configuration can be seen below. In the background, I have two VRMS and two VRS – one for SiteA: New York and one for SiteB: New Jersey.

In this example, only one VM called mail04 has been protected. Although this VM currently lives in the Protected Site of New York, it's the VR in New Jersey that receives the changes as the VM is powered on. For this reason, the management of the replication (Move to, Resume, Pause, Remove, Synchronize) is carried out at the destination for the replication. This is a pretty common practice in array-based replication -- the replication job is managed at the location where the recovery would occur in the event of disaster. After all, there's little point in your management being located where a smoking crater might now exist.

vSphere Replication is turned on by right-clicking a VM, although it's perfectly possible to select a collection of VMs in a folder and configure them as a group using the standard [shift+click] and [ctrl+click] selection methods:

Once VR is set up and enabled for SRM, you can begin to treat a VR-protected machine the same as if it were protected by any storage vendor replication. Creating a Protection Group is just a matter of selecting the VR radio button and then the VMs to include.

The protection group(s) once created can be selected when creating the SRM Recovery Plan, exactly as you would do if you had used array-based replication.

When you run a test or a genuine recovery process (either planned or unplanned), Recovery Plans work in the same way as they would for array-based replication. Where the differences occur is in the management of the failback process. I’m going to look at this from two perspectives – a Planned Migration (when you know that disaster is potentially on its way) and an unplanned Disaster Recovery (when the disaster happens without much, if any, prior notification).

Planned migration

A Planned Migration would be used when SiteA: New York and SiteB: New Jersey are both available – and you have received enough notification to avoid disaster. That could be triggered by a known power outage, caused by the supplier in your location needing to do some major upgrade work, or by someone planning to demolish a building near you – and their demolition experts can’t 100% guarantee that the soon-to-be-deceased building will not fall in your general direction.

In this case you would select your Recovery Plan(s), select the “Planned Migration” radio button, and enable the option that states “I understand that if I carry out this major task without senior management approval and it all goes pear-shaped I might be on welfare for the rest of my days and have to eke out an existence with shopping trolleys and trash cans as my only friends.”

Doing this will do two major things. First, it will cause a synch to occur from the VM at the Protected Site to the VR. Second, it will power off the VMs affected by the Recovery Plan at the Protected Site, followed by a Power On event at the Recovery Site. Of course the time to complete this “delta synch” will depend on a number of factors – the volume of changes since the last replication, the amount of bandwidth available at each site, latency, lost packets, and not least the number of VMs to be migrated over to the Recovery Site.
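Those factors can be folded into a rough estimate. The sketch below is only my own simplification, not how SRM or VR calculates anything: the slower of the two sites' links is treated as the bottleneck, and a single fudge factor stands in for latency and lost packets. The second VM name and all the figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PendingVM:
    name: str
    changed_gb: float  # changes since the last replication cycle


def delta_sync_minutes(vms, site_a_mbps, site_b_mbps, efficiency=0.7):
    """Rough time to flush outstanding changes before the power-off step.

    The slower of the two sites' links is the bottleneck; `efficiency`
    is a fudge factor standing in for latency and lost packets.
    """
    usable_mbps = min(site_a_mbps, site_b_mbps) * efficiency
    total_megabits = sum(vm.changed_gb for vm in vms) * 1000 * 8
    return total_megabits / usable_mbps / 60


# Hypothetical plan: mail04 plus an invented second VM.
plan = [PendingVM("mail04", 2.0), PendingVM("db01", 5.0)]
print(f"{delta_sync_minutes(plan, site_a_mbps=100, site_b_mbps=50):.0f} minutes")
```

The point of the exercise is simply that the more VMs and the more outstanding change in a plan, the longer the VMs at the Protected Site stay up before the power-off step can begin.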

So what effect does this have on the existing replication relationship at the Protected Site? Answer – it gets switched off, and put into a “Not Active” state.

On the Protection Group at the Protected Site – the one that contained the “old” VM, you should see that vSphere Replication has turned off the affected VMs. The “Replication Warning” is a standard alarm indicating that replication is not occurring from SiteA: New York to SiteB: New Jersey for this VM.

Note: It would be nice if the message here was more noticeable – and if the warning was meaningful. Like “Replication Broken off by Planned Migration” or something helpful like that…

One thing that appears to me to be confusing and possibly a bug – is the fact the VR backed plans still report the “Reprotect” message after you have used “Planned Migration” or the “Disaster Recovery” methods – when they shouldn’t.

This strikes me as “wrong” because reprotect/automated failback is currently meant to be an array-based only feature. Indeed, if you do try to run the reprotect option you get this warning dialog box.

As the dialog box clearly states, you have to edit the plan and remove the old protection groups. This triggers a new warning on the plan that it does not have any protection groups. (Incidentally, once you have removed the protection groups the reprotect message goes away. So how anyone is supposed to carry out this task before running reprotect is anyone’s guess.)

At this point you might as well delete the old Protection Group at SiteA: New York. It is of no use to you. There is some small value in keeping the Recovery Plan that now has no Protection Group backing it: keeping the original Recovery Plan object keeps the “history” of all the tests and clean-ups you have done. Apart from that it’s pretty useless. In fact some folks think deleting the old Protection Groups and Recovery Plans is actually “cleaner” and “simpler”. I think there may be some merit in that – but if you remember, my strategy was to try to keep as much user-defined stuff as possible. Technically, the Recovery Plan doesn’t need to be deleted.

At some stage you will be recreating the Protection Group as part of the reset after the failback – but when you add a new Protection Group to the existing Recovery Plan, you will find your VMs will be dumped into the default priority group – and ALL the fancy customization you had in the Recovery Plan will be lost (command steps attached to a VM, Message Steps attached to a VM, Re-IP data, VM Dependencies). In fact the process is very similar to what it used to be like in SRM 1.0/4.0 – and the impact of deleting protection groups without thinking about the consequences is one I still warn folks about in SRM 5.0.
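Because that customization has to be re-entered by hand, it is worth writing it down before you delete anything. Here is a minimal sketch of such a record in Python. It talks to no SRM API at all; it is just structured note-keeping, and the field names, file name and sample values are my own invention.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class VMPlanSettings:
    """Hand-entered notes on one VM's Recovery Plan customization."""
    name: str
    priority_group: int
    depends_on: list = field(default_factory=list)
    command_steps: list = field(default_factory=list)
    message_steps: list = field(default_factory=list)
    re_ip: dict = field(default_factory=dict)


def save_notes(settings, path):
    """Write the notes to a JSON file so they survive the clean-up."""
    with open(path, "w") as fh:
        json.dump([asdict(s) for s in settings], fh, indent=2)


def load_notes(path):
    """Read the notes back when rebuilding the Recovery Plan by hand."""
    with open(path) as fh:
        return [VMPlanSettings(**item) for item in json.load(fh)]


# Hypothetical values for illustration only.
notes = [
    VMPlanSettings(
        name="mail04",
        priority_group=2,
        depends_on=["dc01"],
        message_steps=["Confirm mail flow before continuing"],
        re_ip={"ip": "192.0.2.40", "gateway": "192.0.2.1"},
    )
]
save_notes(notes, "vr_failover_plan_notes.json")
```

A spreadsheet would do the same job. What matters is that the priority groups, dependencies, steps and Re-IP data are captured somewhere outside SRM before the Protection Group goes.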

Generally, deleting Protection Groups without engaging the brain is not to be recommended…. But I guess the adage could be applied to many decisions in one’s life.

Failback with planned migration

So what steps need to take place in order for us to make the return journey, taking the MAIL04 VM back to SiteA: New York in my case? The main goal here is to delete or remove as little as possible from the configuration at the Protected Site, whilst avoiding conflicts.

The first thing to remove from the Protected Site is the “old” VM. Carrying out a Planned Migration does not remove the original VM from the inventory at the production location. As you might know, we need a Protection Group containing the new VM in SiteB in order to run a Recovery Plan that takes it back to the original location, and that necessitates creating a “Placeholder VM” in the inventory – in the very same location as the original VM. As you might also know, vCenter reacts badly to the creation of the same object, with the same name, in the same location. This “orphaned” object, if you want to call it that, simply needs to be removed from the inventory, not deleted. That’s because the files that make up this VM might be useful in speeding up the replication process – after all, only the “deltas”, the changes that have accrued in the Recovery Site, need to be sent back to the Protected Site.

NOTE: Being the kind of guy I am, I tried my best not to delete or remove anything to see what I could get away with. For example I tried keeping the Protected Site configuration in place (Replication Job, Protection Group and so on). For the most part I would say not to bother. The process cannot be completed without deleting some components from SRM. That inevitably means at some stage the administrator has to put them all back in again. Without any method to export/import parts of the configuration to be restored later – and no PowerCLI cmdlets for SRM to date – that means it’s almost impossible at the moment to automate the cleanup and reset parts.

Next, I need to enable VR on the VM that has been “moved” by SRM during the “Planned Migration”. When the wizard appears I can select the normal settings for RPO and quiescing method, and critically select the destination volume. If I’m clever I can use the SAME location as the original VM, and when I do this VR will think I am using the “sneakernet” method of getting my .VMDK file around.

Note: Originally the MAIL04 VM was located on ESX1_LOCAL, and it was being replicated to the ESX3_LOCAL volume, now the process has been inverted and replication is going to take place from ESX3_LOCAL back to ESX1_LOCAL. I used local storage just because I can, and I wanted to prove how VR offers enterprise like storage features even to local storage. Of course local storage is a waste of time if you are using VMware Clustering like I do.

Note: Because MAIL04 was previously stored on ESX1_LOCAL, VR can use the files already there to accelerate the replication process. That’s pretty neat and similar to some array vendors’ ability to retain snapshot deltas for the same purpose…
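One way to picture how a stale copy speeds things up is to compare per-block checksums and send only the blocks that differ. The Python sketch below illustrates that general technique. It is not VR's actual algorithm or block size, which I have not seen documented, and the tiny byte strings simply stand in for .VMDK files.

```python
import hashlib

BLOCK_SIZE = 4  # tiny on purpose so the example is easy to follow


def block_digests(data: bytes, block_size: int = BLOCK_SIZE):
    """Return one SHA-256 digest per fixed-size block."""
    return [
        hashlib.sha256(data[i:i + block_size]).digest()
        for i in range(0, len(data), block_size)
    ]


def blocks_to_send(source: bytes, stale_copy: bytes, block_size: int = BLOCK_SIZE):
    """Indexes of source blocks that differ from, or are missing in, the stale copy."""
    src = block_digests(source, block_size)
    dst = block_digests(stale_copy, block_size)
    return [i for i, digest in enumerate(src) if i >= len(dst) or digest != dst[i]]


old_disk = b"AAAABBBBCCCCDDDD"      # what was left behind at the original site
new_disk = b"AAAAXXXXCCCCDDDDEEEE"  # the running VM after some changes

print(blocks_to_send(new_disk, old_disk))  # prints [1, 4]
```

Only the changed second block and the newly grown fifth block would need to cross the wire; the other three are already sitting at the destination.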

This essentially triggers replication for the first time of that VM from the Recovery Site (SiteB: New Jersey) to the Protected Site (SiteA: New York). One thing you will find is that the old replication job that was used in the “Planned Migration” is deleted, and replaced by the new replication job put together to prepare for the failback process. So there is little point in “cleaning out” the old replication job, because VR handles that process seamlessly under the conditions of a planned failover.

So the new replication job is listed under the destination for the updates – in my case the New York Site:

And this new replication job clears out the old replication job that used to be held at New Jersey.

Effectively we have now inverted the path of replication so New Jersey, which now owns the MAIL04, is replicating its changes to New York. If you like, a “personality change” has taken place. The Recovery Site is now the Protected Site, and the Protected Site is now the Recovery Site. The arrow doesn’t point > this way anymore it points this way < instead.
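The “personality change” can be modelled in a few lines of Python. This is purely a mental model with class and field names of my own; VR exposes nothing like it, and in practice the new job is created by hand through the wizard as described above.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ReplicationJob:
    vm: str
    source_site: str       # where the VM runs (the Protected Site)
    destination_site: str  # where the changes land (the Recovery Site)

    @property
    def managed_at(self) -> str:
        """Replication is managed at the destination of the updates."""
        return self.destination_site


def invert(job: ReplicationJob) -> ReplicationJob:
    """Describe the job that has to be set up by hand for failback."""
    return replace(job, source_site=job.destination_site,
                   destination_site=job.source_site)


failover = ReplicationJob("mail04", "New York", "New Jersey")
failback = invert(failover)
print(failback.source_site, "->", failback.destination_site)
print("Managed at:", failback.managed_at)
```

Inverting twice gets you back to the original job, which is exactly the state the cleanup after failback is trying to reach.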

The next step would be to create a Protection Group at the New Jersey site to enroll it into the management of SRM. Simply replicating a VM doesn’t necessarily make it a part of the SRM process. All replication does is copy stuff. It’s the process of creating a Protection Group and mapping that protection group to a Recovery Plan – that makes SRM work. So in the following two screen grabs – you can see me selecting VR as the replication type, and the MAIL04 VM that is being protected by the VR Server.

Note: Here I’m selecting the Protection Group created in the New Jersey site that I created earlier.

For simplicity, I called my plans “VR Failover” and “VR Failback” so I can easily ID them. But of course, SRM supports many plans with many different options and features.

As with all failback processes, I always test them just as I would test any recovery plan – and I was pleased to see that it worked fine, and I’d suffered no data loss in the process… Just the loss of all the useful information in my recovery plan, which I will have to put back manually. :-(

Cleanup after failback with planned migration

PHEW! OK, so we are kind of there. Our VMs that were once in SiteB: New Jersey are now back home at SiteA: New York. Sadly, though, we cannot sleep easy. Although the VMs are back home, they are not being protected anymore. In fact, the environment is in much the same state as after a failover.

There is no replication from SiteA: New York > SiteB: New Jersey. There isn’t a valid Protection Group or Recovery Plan to be triggered in the event of a Planned Migration or unplanned Disaster Recovery. Getting into a state that would allow me to start all over again, and move the VMs from SiteA: New York to SiteB: New Jersey, is essentially a re-run of what I did earlier.

1. At SiteB/New Jersey – remove from the Inventory the Orphaned, unwanted, abandoned VMs

2. At SiteB/New Jersey – remove the Protection Group that backs the Recovery Plan “stuck” with the “Reprotect” message

3. At SiteB/New Jersey – Delete the Protection Group that backed the Recovery Plan that is marked “Recovered” but has a “Replication Error” on all the VMs within it.

4. At SiteA/New York – enable VR on the VMs that need protection

Note: I’ve seen the status report that it's doing a “Full Sync”, which you might find a bit confusing. This is nothing to worry about – it isn’t copying the whole VM again – that would be an “Initial Sync”. The “Full Sync” is actually running a type of checksum, looking for the blocks that are similar/different to the sneaker-netted source.

5. SiteA/New York -  create a Protection Group for VR

6. SiteA/New York – add the Protection Group to the Recovery Plan at SiteA/New York

7. Reconfigure your Recovery Plan, and put back all the lost metadata that was once there…
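The seven steps above follow one pattern: clean up at the site the VMs have just left, then re-protect at the site where they now run. As a sketch, here is that pattern written as a small Python function that prints a checklist. It automates nothing in SRM; the function and parameter names are mine, and the step wording is paraphrased from the list.

```python
def reset_steps(vacated_site: str, home_site: str) -> list:
    """Checklist to restore protection once the VMs are running at home_site.

    vacated_site is the site the VMs have just left.
    """
    return [
        f"At {vacated_site}: remove the orphaned VMs from the inventory",
        f"At {vacated_site}: remove the Protection Group from the Recovery Plan stuck on 'Reprotect'",
        f"At {vacated_site}: delete the old Protection Group",
        f"At {home_site}: enable VR on the VMs that need protection",
        f"At {home_site}: create a Protection Group for VR",
        f"At {home_site}: add the Protection Group to the Recovery Plan",
        "Reconfigure the Recovery Plan and re-enter the lost settings",
    ]


for number, step in enumerate(reset_steps("SiteB/New Jersey", "SiteA/New York"), 1):
    print(f"{number}. {step}")
```

Swapping the two arguments gives the checklist for the opposite direction, when the VMs are running in New Jersey and New York is the site being cleaned up.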

Disaster recovery

My DR tests always begin with a power cord that I unceremoniously yank to simulate a total outage. The other thing I have is all comms from the Protected Site (SiteA: New York) to the Recovery Site (SiteB: New Jersey) going through a “software router” – a VM that acts as a router. If that gets powered off all communications between the two sites are lost – including replication.

If you happen to have vCenter open when the disaster strikes (and you're not at the Protected Site) – then the first thing you would probably see is the vCenter at the Protected Site would go down and become unavailable.

By default if you switch to SRM at the Recovery Site, it will try to log you into the Protected Site. This will obviously fail, because it is now down. So if you click Cancel you will see that the status of the Protected Site (SiteA: New York) is “Unknown” because you are unable to log into it – and the status of the Recovery Site (SiteB: New Jersey) is set as “Not Connected”. This means the Recovery Site cannot connect to the Protected Site – it doesn’t mean you couldn’t connect to the SRM server at the Recovery Site – otherwise you wouldn’t have gotten this far.

There’s nothing special/unusual about this scenario – it would be exactly the same if you were using array-based replication. As with all DR scenarios, the Recovery Plan used to bring up the lost VMs would be run just like any other. Incidentally, you can still test the Recovery Plan first (if you feel you have the luxury of time) – but you're more likely to head for the big red “Recovery” button in SRM 5.0. When you run the recovery, you will notice the option to carry out a “Planned Migration” is dimmed, because the Protected Site is unavailable.

When you run the plan, the SRM server will try to do the same steps it tried in the Planned Migration, such as synching with the array in the Protected Site (that’s under a truckload of rubble incidentally), and shutting down the VMs in the Protected Site (which were already powered off when a 6ft chunk of masonry smashed its way through the blade enclosure). I’ve often wondered why VMware runs this part of the plan, given that in a disaster perhaps the first thing you lose is communication to the site that’s affected. It seems a little counter-intuitive. But there are scenarios where you might want to try anyway. So with all this said – expect to see errors as SRM tries (and fails) to carry out an almost impossible task.

Failback with disaster recovery

So the Recovery Plan was successful, and your VMs are up and running in the Recovery Site. After a lot of work, your Protected Site is now available. I say a “lot of work” because who the heck knows what state your production locale was left in after the disaster? If it was a terrorist attack, it might be a crime scene – and criminal forensics won’t let you near it. If it was a hurricane – perhaps you were one of the lucky ones that didn’t get flattened – but the local power distribution has been so flaky you have not had reliable power there for days, weeks, months… So SRM automates a lot of things, but it doesn’t have a big button that says: “Raise Purchase Order, Contact Reseller, Ship replacement Servers/Storage/Network to SiteA, Contract Consultancy services to assist in rack-up and configuration, Validate Implementation of new hardware, Re-instate Communications to allow replication”.

If your disaster was serious and your backup lousy, you might even be talking about a total reinstallation of vCenter/SRM and going through the pairing process all over again – together with the post-configuration typical in an SRM installation (such as adding resource mappings, folder mappings, network mappings, placeholder datastores, and so on).

Of course, when disaster strikes there is no orchestration. The Protected Site just falls off the map. If the original configuration is retrievable and does come back online, the “old” replication job will still be there, and it will be trying to replicate to the destination, which is now the source and most likely a powered-on virtual machine. So you might see this in the VR management pages. The “RPO Violation” is caused by the replication being abruptly stopped by the disaster.

The failback process in this situation doesn’t differ that much from if there had been a Planned Migration. It involves some clean-up and setup to “move” the VM back to its original home – the Protected Site…

1. At SiteA/New York – remove from the Inventory the Orphaned, unwanted, abandoned VMs

2. At SiteA/New York – remove the Protection Group that backs the Recovery Plan stuck in the “Reprotect” message

3. At SiteA/New York – Delete the Protection Group that backed the Recovery Plan that is marked “Recovered” but has a “Replication Error” on all the VMs within it.

4. At SiteB/New Jersey – enable VR on the VMs that need protection

5. SiteB/New Jersey - create a Protection Group for VR

6. SiteB/New Jersey – add the Protection Group to the Recovery Plan at SiteA/New York

7. Reconfigure your Recovery Plan, and put back all the lost metadata that was once there…

Cleanup after failback with disaster recovery

Again, once both the Protected and Recovery Sites are available, you follow the same steps I outlined for the Planned Migration.

1. At SiteB/New Jersey – remove from the Inventory the Orphaned, unwanted, abandoned VMs

2. At SiteB/New Jersey – remove the Protection Group that backs the Recovery Plan stuck in the “Reprotect” message

3. At SiteB/New Jersey – Delete the Protection Group that backed the Recovery Plan that is marked “Recovered” but has a “Replication Error” on all the VMs within it.

4. At SiteA/New York – enable VR on the VMs that need protection

5. SiteA/New York -  create a Protection Group for VR

6. SiteA/New York – add the Protection Group to the Recovery Plan at SiteA/New York

7. Reconfigure your Recovery Plan, and put back all the lost metadata that was once there…

There are a couple of points to be made here I think. First, it's perfectly feasible to go through the manual steps required for failover/failback. But it would clearly benefit from more automation – just like SRM 1.0/4.0 would have benefited from the kind of reprotect/automated failback we currently enjoy with SRM 5.0. As they say, these things will come – all in good time.

Clearly, there is room for improvement when it comes to some of the messages – and I would hope to see the confusing “reprotect” dialog boxes modified in future updates.

I think it would be lovely if the replication job from SiteA to SiteB could be just “inverted” – and reused. That’s the case with many array vendors’ replication. Currently, I don’t see how the structure of VR would allow for this setup – and I think it would require a redesign. It would be great if the old Protection Groups and Recovery Plans could be kept, so that the hard work I put into developing a sophisticated Recovery Plan wouldn’t be lost. Ideally, I would love to see VR gain the same flexibility that array-based replication benefits from. If you haven’t played with reprotect you really should. It’s a joy to use – and makes “moving” VMs almost feel as casual an event as VMotion is now. It’s not really for me to speculate about when VR might get the “reprotect” feature (after all I don’t work for VMware). For the moment the failback process for SRM 5.0/vSphere Replication is more akin to the way SRM worked in version 1.0/4.0.

I guess the importance of this depends on how rigorous you think your Recovery Plan testing needs to be. After all, plenty of customers “saw value” in SRM 1.0/4.0, which lacked an automated failback process for array-based replication. My analogy for this centers around the tests people do of the fire alarms and building evacuation procedures.

You could see the “test” button in SRM as being like testing the fire alarm in the building at 10am each day. Four quick rings check the alarm still works. On the other hand the “recovery” button in SRM is like when, twice a year, that alarm bell doesn’t stop ringing after four rings. Folks start looking at each other, and very slowly people start making their way for the exits. The downside of the current capabilities of SRM and VR together is that a bi-annual “test” would involve a lot of work to get everyone back in the building and working again. Some of it would require maintenance windows to complete, whereas a softer “test” would not. What we are all looking for is the most rigorous of tests, with the least impact on our infrastructure – I believe array-based replication and SRM achieve that lofty goal, but VR has some work to do to get to the same level…. By definition the more realistic and rigorous the test – like our fire alarm example – the bigger the operational impact is on day-to-day activities.

But if it only happens twice a year is it such a big deal?

Remember, Rome wasn’t built in a day.
