SRM: Lessons Learned in the First 90 Days

Briefly touches on lessons that were learned in the first 90 days of SRM.

Lee Dilworth and Dave Burgess are two SRM guys from the UK. Right now it sounds like EMEA is stronger on SRM than their counterparts in the US! This was my best session of VMworld, mainly because it relates to a topic that is very dear to my heart, and it was a technical presentation that really needed the audience to know VMware SRM pretty well. Most of the SRM/BC/DR sessions I went to were quite “general” and conceptual and didn’t deal with the nitty-gritty day-to-day issues, which is what I have to deal with.

So I got plenty of juicy gotchas, tips and FAQs, some of which I had seen but many of which I hadn’t. I decided to mention the ones I hadn’t seen or understood before…

Firstly, something that I got this week was the realities of replication. In fairness, I got this from Adam Carter's (of LeftHand Networks) presentation on the realities of replication types and their limitations. Although synchronous replication offers the highest level of data integrity, the reality is that the distance limits on this technology are, for many people, too short for the second location to count as a recovery site. The maximum distance over fibre is some 450km, and that assumes a very good link with latencies well within the 5ms range. The reality is that most people's pipes don’t offer this super-low latency – in practice synchronous replication is limited to 50-60km from the primary site. In other words, not far enough to protect you from a true disaster.
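To see why the 450km and 5ms figures go together, here is a rough back-of-the-envelope sketch. It assumes light travels through fibre at about 200,000 km/s (roughly two thirds of its speed in a vacuum) and ignores every switch, array and protocol overhead, so it only gives a best case. The function names are mine, purely for illustration. A synchronous write is not acknowledged until the remote array has it, so every write pays at least one round trip.

```python
# Rough propagation-delay arithmetic for synchronous replication over fibre.
# Assumption: light covers about 200 km per millisecond in fibre (~2/3 of c).
# Switch, array and protocol overheads are deliberately ignored.

FIBRE_KM_PER_MS = 200.0  # assumed propagation speed in fibre


def round_trip_ms(distance_km: float) -> float:
    """Return the best-case round-trip propagation delay in milliseconds."""
    return 2 * distance_km / FIBRE_KM_PER_MS


def max_distance_km(rtt_budget_ms: float) -> float:
    """Return the furthest site distance that fits inside a round-trip budget."""
    return rtt_budget_ms * FIBRE_KM_PER_MS / 2


if __name__ == "__main__":
    for km in (50, 60, 450):
        print(f"{km:>4} km -> {round_trip_ms(km):.2f} ms round trip")
```

On these assumptions 450km costs 4.5ms of pure propagation delay per round trip, which leaves almost nothing of a 5ms budget for real-world equipment. That is consistent with the much shorter practical distances quoted above.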

Lee & Dave then went on to outline the top gotchas they have seen in the first 90 days since the VMware SRM GA. I will list the ones I didn’t know.

Gotcha 1: Expired Eval Licenses
Surprisingly, many people get caught out by licensing the SRM system with a .lic file which then expires. Although there is an alarm within SRM to warn you about expiring/expired licenses, if you don’t configure it you don’t get the standard “insufficient licenses to…” message you would normally get from VirtualCenter.

Gotcha 2: Rename Sites
When you install SRM you set a “site name” to identify the site. Unfortunately, you currently cannot rename a site once you have set it, even though the name is exposed in some of the .XML files. You have no alternative but to re-install SRM. So the moral of the story, until this is fixed: pick a name you never want to change!

Gotcha 3: Storage Replication Adapters are not the same
Most of the SRAs I’ve used so far just use IP and proprietary TCP ports to speak to the storage array. I was surprised to hear that some SRAs can/need to be configured with an RDM to speak to a “management” LUN on the array; apparently EMC Symmetrix is a case in point. Believe it or not, I’ve yet to read the long PDFs for each and every vendor and array type. It’s something I intend to do this week. Some guys at EMC and NetApp have very kindly offered to provide me with storage hardware to do further development on, and reading those docs will certainly be a prerequisite before that happens. Chatting to the SRM guys this week, it became clear that these PDF guides from the vendors are very much pitched towards full-time storage admins, rather than part-time storage guys like you and me. They assume a level of knowledge that sometimes leaves VMware PoC guys like me scratching their heads.

FAQ: Low Resources Message with SRM
I’ve seen this lots in my lab environment. It’s caused by the settings in the vmware-dr.xml file. The vmware-dr.xml file has some tolerances for CPU and memory; if these are breached, the “alarm” is triggered. The alarm won’t stop a virtual machine from being recovered.

FAQ: Why do you need two VCs?
There’s no workaround to this requirement in SRM. Two sites require two VirtualCenters. Contrary to popular belief, this design decision has nothing to do with VMware trying to make more money by selling more VirtualCenter licenses. By having separate VirtualCenter databases rather than a clustered VirtualCenter, you are less vulnerable to database errors such as arbitrary permissions changes or service packs. The split model guarantees that there are no interdependency issues between the Protected and Recovery Site. It means your Recovery Plans can reside on a VirtualCenter that has survived the disaster. Additionally, the two-VirtualCenter model allows for subtle differences in the configuration of the VirtualCenter at the Recovery Site, which allows for flexibility in your folder and resource pool structures.

FAQ: Why does the install & pairing process say that port 80 will be used to communicate with VirtualCenter?
Even though SRM uses SSL when it communicates with VC, it does not use port 443. SRM establishes a TCP connection to port 80, then uses HTTP CONNECT to establish a tunnel to the VC server, then does an SSL handshake with VC over that tunnelled connection.
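For anyone who hasn't met HTTP CONNECT before, the general pattern looks something like the sketch below. This is not SRM's code, just a generic illustration of "plain TCP to port 80, ask for a tunnel, then do SSL inside it". The function names, and in particular the target port named in the CONNECT line, are my assumptions; the session didn't go into what SRM actually sends.

```python
import socket
import ssl


def build_connect_request(host: str, port: int) -> bytes:
    """Build a plain HTTP CONNECT request asking for a tunnel to host:port."""
    return (
        f"CONNECT {host}:{port} HTTP/1.1\r\n"
        f"Host: {host}:{port}\r\n"
        "\r\n"
    ).encode("ascii")


def open_tunnelled_tls(vc_host: str, tunnel_port: int = 443) -> ssl.SSLSocket:
    """Connect to port 80, request a tunnel, then run the handshake inside it.

    tunnel_port is an assumption made for illustration only.
    """
    raw = socket.create_connection((vc_host, 80), timeout=10)
    raw.sendall(build_connect_request(vc_host, tunnel_port))
    # Sketch only: assumes the whole reply header arrives in a single read.
    reply = raw.recv(4096).decode("ascii", errors="replace")
    status_line = reply.split("\r\n", 1)[0]
    parts = status_line.split()
    if len(parts) < 2 or parts[1] != "200":
        raw.close()
        raise ConnectionError(f"tunnel refused: {status_line!r}")
    # From here on, the bytes on the port-80 connection are encrypted.
    context = ssl.create_default_context()
    return context.wrap_socket(raw, server_hostname=vc_host)
```

The point for firewall rules is that the traffic is still encrypted, but the only port that has to be open is 80. Note that a default SSL context will reject a self-signed certificate, so a lab setup would need its own trust configuration.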

FAQ: Why do I see “Recompute Datastore” frequently in the Taskbar?
I’ve seen this a lot – and more or less ignored it because it seemed to me very much a benign message. So it was nice to know why it happens. Put very simply, practically any change (right-click Edit Settings) to a protected virtual machine causes this. SRM must check the virtual machine each time to ensure you have not fundamentally changed its configuration in a way that could affect how SRM will protect it. So it must check for things like whether you have added/removed a virtual disk or patched the VM to a different portgroup.
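Conceptually, the check amounts to comparing the protection-relevant parts of the VM's configuration before and after the edit. The sketch below is my own illustration of that idea, not how SRM implements it; the VmConfig class and its fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VmConfig:
    """Hypothetical snapshot of the settings that matter for protection."""
    disks: frozenset       # virtual disk paths attached to the VM
    portgroups: frozenset  # portgroups the VM's NICs are patched to


def protection_relevant_changes(before: VmConfig, after: VmConfig) -> list:
    """List the differences that would call for re-checking protection."""
    changes = []
    for disk in sorted(after.disks - before.disks):
        changes.append(f"disk added: {disk}")
    for disk in sorted(before.disks - after.disks):
        changes.append(f"disk removed: {disk}")
    if before.portgroups != after.portgroups:
        changes.append("portgroup changed")
    return changes
```

An empty list means the edit was harmless from a protection point of view, which is why the task is usually benign.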

TIP: Log files are good!
You wouldn’t expect an SE not to recommend using log files to pull more information out of the system. Lee & Dave gave a good example of a virtual machine not being protected properly due to an invalid/incomplete “Inventory Mapping”. This resulted in an “unset” entry in the SRM log file.

TIP: The LUN is King
The LUN is a critical aspect of SRM. Bad LUN structures – like one big LUN that contains everything – really limit the flexibility you have in recovering the teams of virtual machines that make up a particular application. For me this is a bit of a Goldilocks issue. One big LUN with everything is too blunt an instrument and wouldn’t be optimized for performance. It would also force you to replicate virtual machines unnecessarily. The opposite – one virtual machine per LUN – gives you ultimate flexibility, but a lot of replication work to do for each and every virtual machine that needs protection, and a lot of work in SRM, such as having to create a protection group for every individual VM you have. Personally, I’m a fan of groups of LUNs (boot, log, data) being gathered together and put in the same replication group. It's kind of like 3xLUNs for Citrix, 3xLUNs for Domain Controllers, 3xLUNs for SQL. In this respect you are using the LUNs to distribute the I/O whilst at the same time creating groups of application LUNs that can be included in the same cycle of replication, so they share the same integrity and the same protection groups in SRM.
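To make the trade-off concrete, here is a small sketch that counts how many groups you end up managing under each of the three layouts. The VM names and the inventory are made up for illustration, and I'm treating "one replication group equals one protection group" as a simplifying assumption.

```python
from collections import defaultdict

# Hypothetical inventory: VM name -> application it belongs to.
VMS = {
    "ctx01": "Citrix", "ctx02": "Citrix",
    "dc01": "Domain Controllers", "dc02": "Domain Controllers",
    "sql01": "SQL", "sql02": "SQL", "sql03": "SQL",
}
LUN_ROLES = ("boot", "log", "data")


def one_big_lun(vms):
    """Everything on a single LUN: one group, no per-application recovery."""
    return {"everything": sorted(vms)}


def one_lun_per_vm(vms):
    """One LUN per VM: maximum flexibility, one group per VM to manage."""
    return {vm: [vm] for vm in sorted(vms)}


def luns_per_application(vms):
    """One set of boot/log/data LUNs per application, replicated together."""
    groups = defaultdict(list)
    for vm, app in sorted(vms.items()):
        groups[app].append(vm)
    return dict(groups)


def lun_names(app):
    """Name the three LUNs that make up an application's replication group."""
    return [f"{app}-{role}" for role in LUN_ROLES]


if __name__ == "__main__":
    layouts = (one_big_lun, one_lun_per_vm, luns_per_application)
    for layout in layouts:
        print(f"{layout.__name__}: {len(layout(VMS))} group(s)")
```

With this made-up inventory the per-application layout sits in the middle: more groups than the single big LUN, far fewer than one per VM, and each group maps onto something you would actually want to fail over as a unit.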

Recommendation: SRM on a separate Windows instance
Apparently, when you execute a recovery plan SRM can be CPU intensive – hence the SRM guys’ recommendation to put it on a separate box rather than alongside VirtualCenter. Admittedly, a lot depends on how big/busy your existing VirtualCenter environment is. I’ve become increasingly uneasy with the growing number of management roles the VirtualCenter box is being asked to handle – just a few too many “eggs in one basket” for me…
