Avoiding an Enhanced VMotion Compatibility gotcha

You can only VMotion between hosts with similar CPUs, which used to be a huge pain point if you had different servers in a cluster. The Enhanced VMotion Compatibility (EVC) feature introduced in VMware ESX 3.5 Update 2 improved VMotion by 'dumbing down' the more advanced CPUs of today's servers to match the capabilities of older CPUs – but for it to work, you need to enable it sooner rather than later.

VMotion is, in my opinion, one of the best features VMware virtualization has to offer. The gotcha with VMotion, however, is that you can only VMotion between two identical (or very similar) CPUs.

This can create a problem when you buy extra hosts to add to an existing cluster. Say you have a cluster of five ESX hosts, all with identical CPUs. One year later, you want to add two extra ESX hosts to accommodate growth. When you buy those new hosts, chances are that you won't be able to buy them with an identical CPU.

Unfortunately, adding these new hosts to your cluster creates an administrative burden because the two types of CPUs are dissimilar.

Here's what happens at the guest OS level: A guest OS that starts on a CPU with a certain instruction set expects those instructions to remain available. If the extra instructions suddenly become unavailable, which is what would happen if you VMotioned the VM from one type of CPU to another type of CPU, the guest OS will crash with a kernel panic or a Blue Screen of Death (depending on whether you're running Linux or Windows).

How VMware prevents VMotion crashes

VMware has two techniques to prevent these crashes. The first is simple: After an administrator calls for a VMotion, vCenter Server checks both the source and the destination CPUs to make sure the VMotion would be successful. If they are not compatible, the migration does not occur. The second technique is known as CPU masking. CPU masking hides certain CPU features so that they are not announced to the virtual machine (VM). The guest OS does not know about them, and therefore does not use the extra features that would otherwise "disappear" if a VMotion were to occur. VMware Enhanced VMotion Compatibility (EVC) masks CPU features at the host level, not just at the guest OS level.
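To make the two techniques concrete, here is a minimal sketch that models CPU features as sets. It is only an illustration of the idea, not vCenter's actual logic: the feature names and the functions `can_vmotion` and `apply_mask` are hypothetical.

```python
# Illustrative model only: real CPUs expose features through CPUID bits,
# and vCenter's actual compatibility checks are more involved.

def can_vmotion(visible_to_guest, destination_features):
    """Pre-check: every feature the guest can see must exist on the destination."""
    return set(visible_to_guest) <= set(destination_features)


def apply_mask(host_features, baseline):
    """CPU masking: advertise only the features that are part of the baseline."""
    return set(host_features) & set(baseline)


old_host = {"sse2", "sse3", "ssse3"}
new_host = {"sse2", "sse3", "ssse3", "sse4.1"}

# Without masking, a VM started on the new host sees sse4.1 and cannot move
# to the old host.
unmasked_ok = can_vmotion(new_host, old_host)

# With the new host masked down to the old host's feature set, the move passes.
masked_view = apply_mask(new_host, baseline=old_host)
masked_ok = can_vmotion(masked_view, old_host)

print(unmasked_ok, masked_ok)  # False True
```

The pre-check is a subset test and the mask is an intersection, which is why masking the newer host down to the older feature set makes the check pass in both directions.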

Is masking CPU instructions a failsafe way to ensure successful VMotion migrations? No -- the instruction set is only masked, not disabled, which means applications can still use the masked instructions if they want to. Let me explain: The proper way for an application to use certain CPU instructions is to query the CPU first and then act on the answer. However, if an application isn't written according to these guidelines and simply tries the instructions without asking, the attempt is not blocked. So a poorly written application could potentially crash when moved from a CPU with an extended instruction set to a CPU without it, even if EVC is enabled.

VMware's published explanation of this is as follows:

EVC utilizes hardware support to modify the semantics of the CPUID instruction only. It does not disable the feature itself. For example, if an attempt to disable SSE4.1 is made by applying the appropriate masks to a CPU that has these features, this feature bit indicates SSE4.1 is not available to the guest or the application, but the feature and the SSE4.1 instructions themselves (such as PTEST and PMULLD) are still available for use. This implies applications that do not use the CPUID instruction to determine the list of supported features, but use try-catch undefined instructions (#UD) instead, can still detect the existence of this feature.

Therefore, for EVC to be useful, application developers must adhere to recommended guidelines on feature detection. CPU vendors recommend that software programmers query CPUID prior to using special instructions and features available on their CPUs. If this guideline is followed by programmers, EVC is a reliable mechanism for live migration of x86 virtual machines across varied hardware. Thus, you can use EVC to enable an entire cluster to use the same set of basic features, allowing migration with VMotion across any two nodes in the cluster. VirtualCenter can also set up new hardware add‐ons to the cluster and apply these masks.
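The difference between the two detection styles can be shown with a toy simulation. Nothing here executes real CPU instructions: `SimulatedCpu`, `UndefinedInstruction` and both `detect_*` functions are hypothetical names that only model the behavior described in the quote.

```python
# Toy simulation of CPUID masking; no real CPU instructions are executed.

class UndefinedInstruction(Exception):
    """Stands in for the #UD fault raised by an unsupported instruction."""


class SimulatedCpu:
    def __init__(self, real_features, masked=()):
        self.real_features = set(real_features)
        self.masked = set(masked)

    def cpuid(self):
        # EVC changes only what CPUID reports to the guest.
        return self.real_features - self.masked

    def execute(self, feature):
        # The instruction still runs whenever the silicon supports it.
        if feature not in self.real_features:
            raise UndefinedInstruction(feature)


def detect_with_cpuid(cpu, feature):
    """Recommended style: ask CPUID and respect the answer."""
    return feature in cpu.cpuid()


def detect_with_try_catch(cpu, feature):
    """Discouraged style: try the instruction and catch the fault."""
    try:
        cpu.execute(feature)
        return True
    except UndefinedInstruction:
        return False


new_cpu = SimulatedCpu({"sse3", "sse4.1"}, masked={"sse4.1"})
old_cpu = SimulatedCpu({"sse3"})

print(detect_with_cpuid(new_cpu, "sse4.1"))      # False
print(detect_with_try_catch(new_cpu, "sse4.1"))  # True
```

An application that probed with try-catch on `new_cpu` would conclude SSE4.1 is usable, then hit an undefined-instruction fault after a migration to `old_cpu`. The CPUID-based check never sees the feature in the first place.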

Does this mean that VMware EVC will give you a false sense of security? Not at all, since the number of applications that use the try-catch method is relatively low, and the extended instruction sets are not often used in server-based applications, so it is safe to use EVC in a production environment.

You might wonder if masking CPU features impacts performance. Short answer: Not noticeably. The longer answer is that it makes a difference, but only barely. Only the special instruction sets are masked, which doesn't affect the available CPU cycles the way that disabling hyper-threading would. There is a small performance difference, since these extra instructions ease the load on the CPU for applications that use them, but the difference will be minimal.

VMware Enhanced vMotion Compatibility guidelines

The following is a list of the EVC baselines into which CPU types fall.

EVC Baseline | Description
Intel Xeon Core 2 | Applies baseline feature set of Intel Xeon Core 2 (Merom) processors to all hosts in the cluster.
Intel Xeon 45nm Core 2 | Applies baseline feature set of Intel Xeon Core 2 (Penryn) processors to all hosts in the cluster. Compared to the Intel Xeon Core 2 EVC mode, this EVC mode exposes additional CPU features including SSE4.1.
Intel Xeon Core i7 | Applies baseline feature set of Intel Xeon Core i7 (Nehalem) processors to all hosts in the cluster. Compared to the Intel Xeon 45nm Core 2 EVC mode, this EVC mode exposes additional CPU features including SSE4.2 and POPCOUNT.
AMD Opteron Generation 1 | Applies baseline feature set of AMD Opteron Generation 1 (Rev. E) processors to all hosts in the cluster.
AMD Opteron Generation 2 | Applies baseline feature set of AMD Opteron Generation 2 (Rev. F) processors to all hosts in the cluster. Compared to the AMD Opteron Generation 1 EVC mode, this EVC mode exposes additional CPU features including CMPXCHG16B and RDTSCP.
AMD Opteron Generation 3 | Applies baseline feature set of AMD Opteron Generation 3 processors to all hosts in the cluster. Compared to the AMD Opteron Generation 2 EVC mode, this EVC mode exposes additional CPU features including SSE4A, MisAlignSSE, POPCOUNT and ABM (LZCNT).

Now, to make it easier to understand, let's translate that all into a real-world example.

Your current cluster has three ESX 4.0 hosts. All three hosts are equipped with Intel Xeon 3100 Series CPUs, which belong to the Intel Xeon 45nm Core 2 family. As the number of virtual machines increases, you decide to buy an additional ESX host. This host is equipped with an Intel Xeon 7400 Series CPU. Since this CPU is also part of the Intel Xeon 45nm Core 2 family, you can add this host to the cluster without using EVC – hooray! But when you later buy another server that has an Intel Xeon 3500 Series CPU, you have to enable EVC to mask the extra features of the Xeon 3500.
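The same reasoning can be written down as a small lookup. This is a sketch that covers only the three CPU series named in the example. The mapping and the function names `needs_evc` and `highest_common_baseline` are assumptions for illustration, not a complete compatibility list or a VMware API.

```python
# Intel baselines from the list above, ordered from oldest to newest.
INTEL_BASELINES = [
    "Intel Xeon Core 2",
    "Intel Xeon 45nm Core 2",
    "Intel Xeon Core i7",
]

# Only the CPU series named in this example; not a complete compatibility list.
SERIES_TO_BASELINE = {
    "Xeon 3100": "Intel Xeon 45nm Core 2",
    "Xeon 7400": "Intel Xeon 45nm Core 2",
    "Xeon 3500": "Intel Xeon Core i7",
}


def needs_evc(cpu_series):
    """EVC is needed only when the hosts span more than one baseline."""
    return len({SERIES_TO_BASELINE[s] for s in cpu_series}) > 1


def highest_common_baseline(cpu_series):
    """Return the newest baseline that every host in the cluster supports."""
    ranks = [INTEL_BASELINES.index(SERIES_TO_BASELINE[s]) for s in cpu_series]
    return INTEL_BASELINES[min(ranks)]


cluster = ["Xeon 3100"] * 3 + ["Xeon 7400"]
print(needs_evc(cluster))  # False

cluster.append("Xeon 3500")
print(needs_evc(cluster))                # True
print(highest_common_baseline(cluster))  # Intel Xeon 45nm Core 2
```

The cluster's EVC mode is bounded by its oldest CPU generation, which is why the sketch takes the minimum rank.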

Think about EVC while creating your designs

Since CPU features cannot be masked from a guest that is already running on the host, you have to think carefully, and early, about whether or not you need to enable Enhanced vMotion Compatibility.

A customer once bought six ESX hosts with brand-new Intel Xeon Core i7 CPUs. He installed vSphere on the new hosts, migrated the VMs from the older ESX 3.5 hosts to them, and then reinstalled the older hosts with vSphere. The older hosts had Intel Xeon 45nm Core 2 CPUs, which are fast enough to join the i7 cluster, but the customer ran into an issue: You can't lower the EVC level of a host while VMs are running on it. In simpler terms, he couldn't turn on Enhanced vMotion Compatibility, thereby instructing the i7 hosts to match the CPU capabilities of the older hosts, while VMs were running on those hosts. So all VMs in that cluster had to be powered off again, which was no pleasant exercise. Had the customer thought of this beforehand, he could have set the EVC level of the i7 cluster to match the Intel Xeon 45nm Core 2 CPUs before powering on any VMs on the i7 hosts -- then the old hosts could have been added without issues.
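The ordering problem in this story can be expressed as a simple pre-flight check. The sketch below is a simplified model of the rule, not a vSphere API: the `Host` class, the `blocking_hosts` function and the host names are hypothetical.

```python
from dataclasses import dataclass

# Intel baselines ranked from oldest (0) to newest (2).
BASELINE_RANK = {
    "Intel Xeon Core 2": 0,
    "Intel Xeon 45nm Core 2": 1,
    "Intel Xeon Core i7": 2,
}


@dataclass
class Host:
    name: str
    cpu_baseline: str
    powered_on_vms: int = 0


def blocking_hosts(hosts, target_baseline):
    """Hosts that must have their VMs powered off (or moved away) before
    the cluster's EVC mode can be set to target_baseline."""
    target = BASELINE_RANK[target_baseline]
    return [
        h.name
        for h in hosts
        if BASELINE_RANK[h.cpu_baseline] > target and h.powered_on_vms > 0
    ]


# What the customer did: VMs were already running on the i7 hosts.
busy = [Host(f"i7-{n}", "Intel Xeon Core i7", powered_on_vms=10) for n in range(1, 7)]
print(blocking_hosts(busy, "Intel Xeon 45nm Core 2"))

# What he could have done: set the EVC mode while the new hosts were still empty.
empty = [Host(f"i7-{n}", "Intel Xeon Core i7") for n in range(1, 7)]
print(blocking_hosts(empty, "Intel Xeon 45nm Core 2"))  # []
```

An empty result means the EVC mode can be applied right away. Any host in the list has running VMs that already see features above the target baseline.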

Gabrie van Zanten (VCP) has been in the IT industry for 12 years. Currently he is a virtualization architect for a worldwide consultancy company and has designed and maintained virtual infrastructures for a number of customers. He has written articles for magazines and frequently publishes in-depth articles at his weblog, GabesVirtualWorld.

