This article is excerpted from the Prentice Hall book ‘VMware ESX and ESXi in the Enterprise’ by Edward L. Haletky....
In this section, Haletky discusses performance-gathering techniques for virtual machines. Traditional hardware-performance agents don’t provide accurate measurements in virtual infrastructures, and they can bog down performance. But there are other ways to get a true sense of your VMware infrastructure’s efficiency.
Performance and other types of monitoring are important from an operational point of view. Many customers monitor the health of their hardware and servers by using hardware and performance agents.
Although hardware agents monitor the health of the ESX host, they should not monitor the health of a VM, because the virtual hardware is truly dependent on the physical hardware. In addition, most agents are talking to specific chips, and these do not exist inside a VM. So using hardware agents will often slow down your VM.
Best Practice for Hardware Agents
Do not install hardware agents into a VM; they will cause noticeable performance issues.
Performance measurement is a very important tool for the virtual environment; it tells you when to invest in a new ESX host and how to balance the load among the ESX hosts. Although there are automated ways to balance the load among ESX hosts (they are covered in Chapter 11, “Dynamic Resource Load Balancing”), most if not all balancing of VM load across hosts is performed by hand, because there are more than just a few markers to review when moving VMs from host to host.
There is an argument that the Distributed Resource Scheduler (DRS) will balance VMs across all hosts, but DRS does balancing only when CPU contention exists. If you never have contention, you may still want to balance your loads by hand, regardless of DRS settings.
The first item to understand is that the addition of a VM to a host will impact the performance of the ESX host, sometimes in small ways and sometimes in ways that are more noticeable. The second item to understand is how performance tools that run within a VM (within Windows, for example) calculate utilization. The guest operating system increments a tick counter in its idle loop and then subtracts that amount of idle time from the system clock time interval.
Because the VM gets put to sleep when idle, the idle time counter is skewed, so the reported utilization is higher than it really is. Because there are often more VMs than CPUs or cores, a VM will share a CPU with others, and as more VMs are added the slice of time the VM gets to run on a CPU is reduced even further.
Therefore, a greater time lag exists between each usage of the CPU and thus a longer CPU cycle. Because performance tools use the CPU cycle to measure performance and to keep time, the data received is relatively inaccurate. When the system is loaded to the desired level, a set of baseline data should be collected using VMware vCenter or other performance management tools.
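The skew described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not how any particular guest operating system is implemented: it assumes the guest counts idle ticks only while it is actually scheduled on a physical CPU, yet divides by the full wall-clock interval. The function and parameter names are hypothetical.

```python
def in_guest_utilization(busy_ticks, scheduled_ticks, wall_ticks):
    """Utilization as a tool inside the VM would report it in this toy model.

    Assumption: the idle loop only runs (and only counts) while the VM is
    scheduled, so time spent descheduled is never recorded as idle.
    """
    if not 0 <= busy_ticks <= scheduled_ticks <= wall_ticks or wall_ticks == 0:
        raise ValueError("expected 0 <= busy <= scheduled <= wall, wall > 0")
    idle_ticks = scheduled_ticks - busy_ticks
    return 1.0 - idle_ticks / wall_ticks


def host_side_utilization(busy_ticks, wall_ticks):
    """Share of the physical CPU the VM actually consumed over the interval."""
    if wall_ticks <= 0:
        raise ValueError("wall_ticks must be positive")
    return busy_ticks / wall_ticks


# A VM that was scheduled for 400 of 1000 ticks and busy for 200 of them.
guest_view = in_guest_utilization(200, 400, 1000)
host_view = host_side_utilization(200, 1000)
print(f"in-guest: {guest_view:.0%}, host-side: {host_view:.0%}")
```

In this model the in-guest figure rises as the VM's share of scheduled time shrinks, even when the work done stays the same, which is the effect the text attributes to adding more VMs per CPU.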
After a set of baseline data is available, performance tools internal to the VM can determine whether a change in performance has occurred, but they cannot give you raw numbers, just a ratio of change from the baseline. For example, if the baseline for CPU utilization is roughly 20% measured from within the VM and it suddenly shows 40%, we know that there was a 2x change from the original value. The original value is not really 20%, but some other number.
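The baseline comparison in this example is simple arithmetic, sketched here with a hypothetical helper. It treats the in-guest readings purely as relative values, as the text recommends, and makes no claim about the true utilization.

```python
def change_ratio(baseline_pct, current_pct):
    """Ratio of the current in-guest reading to the baseline reading.

    Only the ratio is meaningful; neither input is a trustworthy
    absolute utilization figure when measured inside the VM.
    """
    if baseline_pct <= 0:
        raise ValueError("baseline must be positive")
    return current_pct / baseline_pct


# The example from the text: a 20% baseline that suddenly reads 40%.
print(f"{change_ratio(20.0, 40.0):.1f}x change from baseline")
```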
However, even though this shows 2x more CPU utilization for the VM, it does not imply a 2x change to the actual server utilization. Therefore, to gain performance data for a VM, other tools need to be used that do not run from within the VM. VMware vCenter, a third-party tool such as Vizioncore vFoglight, or the use of esxtop from the command line or resxtop from the remote CLI are the tools to use because these all measure the VM and ESX host performance from outside the VM. In addition, they all give a clearer picture of the entire ESX host.
The key item to realize is that when CPU utilization for an ESX host stays above 80% for a sustained period, as measured by vCenter or one of the other tools, a new ESX host is warranted and the load on the existing ESX host needs to be rebalanced. This same mechanism can be used to determine whether more network and storage bandwidth is warranted.
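One way to turn the 80% rule into a check is sketched below. The text does not define how long "sustained" is, so the number of consecutive samples is an assumption to tune for your own sampling interval, and the function name is hypothetical. The samples are expected to come from a tool that measures outside the VM.

```python
def sustained_over(samples, threshold=80.0, min_consecutive=3):
    """Return True if `samples` holds a run of at least `min_consecutive`
    consecutive readings strictly above `threshold`.

    `min_consecutive` is an assumed definition of "sustained".
    """
    run = 0
    for value in samples:
        run = run + 1 if value > threshold else 0
        if run >= min_consecutive:
            return True
    return False


# Host CPU utilization readings (percent), oldest first.
host_cpu = [72.0, 85.5, 91.0, 88.2, 79.0]
if sustained_over(host_cpu):
    print("Sustained high CPU: consider rebalancing or adding a host")
```

Requiring a run of readings rather than a single spike keeps a brief burst, such as a VM deployment, from triggering the rule.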
Balancing ESX hosts can happen daily or even periodically during the day by using the vMotion technology to migrate running VMs from host to host with zero downtime. Although this can be dynamic (see Chapter 11), using vMotion and Storage vMotion by hand can give a better view of the system and the capability to rebalance as necessary.
For example, if an ESX host’s CPU utilization goes to 95%, the VM that is the culprit needs to be found using one of the tools; once found, the VM can be moved to an unused or lightly used ESX host using vMotion. If this movement becomes a normal behavior, it might be best to place the VM on a lesser-used machine permanently. This is often the major reason an N+1 host configuration is recommended.
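The manual decision described here can be sketched as a small helper. This is an illustration only: the VM and host names are made up, the utilization figures are assumed to come from vCenter, esxtop, or a similar outside-the-VM tool, and a real decision would weigh more markers than CPU alone.

```python
def pick_migration(vm_cpu_pct, host_cpu_pct, source_host):
    """Pick the busiest VM on the source host and the least-loaded other host.

    vm_cpu_pct:   {vm_name: CPU % used on the source host}
    host_cpu_pct: {host_name: overall CPU % of each host}
    """
    if not vm_cpu_pct:
        raise ValueError("no VMs on the source host")
    candidates = {h: u for h, u in host_cpu_pct.items() if h != source_host}
    if not candidates:
        raise ValueError("no other host available as a vMotion target")
    culprit = max(vm_cpu_pct, key=vm_cpu_pct.get)
    target = min(candidates, key=candidates.get)
    return culprit, target


# Hypothetical data for a host running at 95% CPU.
vms_on_esx01 = {"web01": 12.0, "db01": 61.0, "mail01": 22.0}
hosts = {"esx01": 95.0, "esx02": 48.0, "esx03": 15.0}
vm, host = pick_migration(vms_on_esx01, hosts, source_host="esx01")
print(f"Candidate move: {vm} -> {host}")
```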
Deployment of VMs can increase CPU utilization. Deployment is discussed in detail in a later chapter, but the recommendation is to create a deployment server that can see all LUNs. This server would be responsible for deploying any new VM, which allows the VM to be tested on the deployment server until it is ready to be migrated to a true production server using vMotion.
For example, a customer wanted to measure the performance of all VMs to determine how loaded the ESX host could become with the current networking configuration. To do so, we explained the CPU cycle issues and developed a plan of action.
We employed two tools in this example: VMware vCenter, and esxtop running in batch mode (esxtop -b) from the service console or from the vMA. For performance-problem resolution, esxtop is the best tool to use, but it produces reams of data that must be graphed later. vCenter averages historical data over 5-minute or larger increments, but its real-time stats are collected every 20 seconds. esxtop uses real, not averaged, data gathered as often as every 2 seconds, with a default of 5 seconds.
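Batch-mode output from esxtop is CSV text, typically redirected to a file for later analysis. The sketch below shows one way to average a single counter column with Python's standard csv module. The sample data and its column names are simplified placeholders, not real esxtop headers; read the actual header row of your own capture to find the counter you want.

```python
import csv
import io


def column_average(csv_text, name_fragment):
    """Average the one column whose header contains `name_fragment`."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    matches = [i for i, name in enumerate(header) if name_fragment in name]
    if len(matches) != 1:
        raise ValueError(f"expected exactly one match, found {len(matches)}")
    index = matches[0]
    values = [float(row[index]) for row in reader if row]
    if not values:
        raise ValueError("no data rows")
    return sum(values) / len(values)


# Placeholder capture: header names here are illustrative only.
SAMPLE = (
    '"Timestamp","Physical Cpu(_Total) % Util Time","Memory Free MBytes"\n'
    '"10:00:00","40.0","2048"\n'
    '"10:00:05","50.0","2040"\n'
    '"10:00:10","60.0","2031"\n'
)

print(column_average(SAMPLE, "% Util Time"))
```

Matching on a fragment of the header rather than a column position is a deliberate choice here, because the number and order of columns in a capture depend on the host and the VMs running on it.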
The plan was to measure performance using each tool as each VM was running its application. Performance of ESX truly depends on the application within each VM. It is extremely important to realize this, and when discussing performance issues to not localize to just a single VM, but to look at the host as a whole.
This is why VMware generally does not allow performance numbers to be published, as the numbers are workload dependent. It is best to do your own analysis using your applications, because one company’s virtualized application suite has nothing to do with another company’s; therefore, there can be dramatic variations in workload even with the same application set.
If you do want to measure performance of your ESX hosts for purposes of comparison to others, VMware has developed VMmark, which provides a common workload for comparison across multiple servers and hypervisors. Unfortunately, VMmark is not a standard yet. There is also SPECvirt_sc2010 from the Standard Performance Evaluation Corporation, located at www.spec.org/virt_sc2010/.