Performance management in vSphere is all about sizing and allocating pools of resources. For CPU resources, the pools are the virtual CPUs (vCPUs) allocated to a virtual machine (VM), the physical CPUs (pCPUs) in an ESXi server and the sum of all pCPUs in all hosts in a cluster. A VM's vCPU can be in one of three basic states: run, wait and ready. Understanding these states is crucial to understanding CPU performance in VMs.
vCPU run state
A vCPU in the run state is doing work in the VM. The VM wants CPU time and the VMkernel has allocated a pCPU core to do the work. High vCPU run time means the VM has a lot of work to do and the VMkernel is allowing that work to happen. A high run time is typically a good thing since the VM is doing work, provided the run time is not 100%, which is saturation. If the VM is saturating one vCPU, it may have a faulty application or it may need another vCPU allocated.
vCPU wait state
A vCPU in the wait state is not needed by the VM right now; instead, the VM is waiting for something. There are two basic wait types: idle wait and I/O wait. Idle wait means that the VM has nothing to do until an event occurs. The event is usually a network packet arriving or a timer expiring. Idle wait is the same as the guest OS idle time and is not a problem. I/O wait, on the other hand, can be a problem. The VM is waiting for a read or write to storage and cannot do anything else until the I/O completes. The VM may have a heavy workload but is unable to do any of it in the meantime. High I/O wait time on a VM is a performance issue, but the cause is storage that is overloaded or just plain slow, not a shortage of CPU.
On a moderately loaded ESXi server, each VM's run and wait times will sum to 100%. In Figure 1, the performance chart shows the VM starts out a little busy. Run is around 30% and wait takes the remaining 70%. This is a VM doing some work and performing well. In the middle third, the VM's workload greatly increases. The run time climbs to nearly 80% and so the wait time drops to around 20%. The VM is doing a lot of work but isn't quite saturated. In the final third of the graph, the VM saturates its one vCPU; there is no wait time because the VM wants to use CPU all the time.
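The three phases described for Figure 1 can be written as a small classification. The Python sketch below is illustrative only: the function name, the 95% and 50% cutoffs and the 5-point tolerance are assumptions chosen for this example, not values defined by vSphere.

```python
def describe_run_wait(run_pct: float, wait_pct: float, tolerance: float = 5.0) -> str:
    """Label a vCPU sample from its run and wait percentages.

    Assumes a moderately loaded host, where run + wait is close to 100%.
    All cutoffs here are illustrative assumptions.
    """
    # If run and wait do not account for the whole interval,
    # the remainder is probably ready time and needs a separate look.
    if abs(run_pct + wait_pct - 100.0) > tolerance:
        return "unaccounted time: check ready"
    if run_pct >= 95.0:
        return "saturated"
    if run_pct >= 50.0:
        return "busy"
    return "light"


# Approximate values for the three phases described for Figure 1
for run, wait in [(30, 70), (80, 20), (100, 0)]:
    print(f"run={run}% wait={wait}% -> {describe_run_wait(run, wait)}")
```

The explicit sum check is there because, once the host is contended, run and wait stop adding up to 100% and the gap is the interesting part.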
vCPU ready state
The final vCPU state for a VM is ready. This is when the VM wants to do work but the VMkernel has not yet allocated a pCPU to do it. There are a number of reasons why vCPU ready could be high on a VM, and they all mean the VM's performance is being degraded. The amount of degradation depends on the amount of ready time. Less than 10% ready is a small issue, often due to the time-sharing nature of having multiple vCPUs on one pCPU. More than 20% ready time means the VM is not getting all the CPU time it wants, which results in degraded performance.
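vCenter performance charts typically report CPU ready as a summation in milliseconds per sampling interval rather than as a percentage, so the thresholds above need a conversion first. The sketch below assumes the 20-second interval used by real-time charts and a VM-level value summed across vCPUs. The function names and the "elevated" label for the band between 10% and 20% are illustrative assumptions, not part of any vSphere API.

```python
def ready_percent(ready_ms: float, interval_s: int = 20, num_vcpus: int = 1) -> float:
    """Convert a CPU ready summation (milliseconds) to a per-vCPU percentage.

    interval_s=20 assumes the real-time chart interval; historical
    rollups use longer intervals and must be passed in explicitly.
    """
    return ready_ms * 100 / (interval_s * 1000 * num_vcpus)


def classify_ready(pct: float) -> str:
    """Apply the article's thresholds: under 10% is minor, over 20% is degraded."""
    if pct < 10.0:
        return "minor"
    if pct > 20.0:
        return "degraded"
    return "elevated"  # assumed label for the 10% to 20% band


for ms in (400, 3000, 6000):
    pct = ready_percent(ms)
    print(f"{ms} ms ready in 20 s = {pct:.1f}% -> {classify_ready(pct)}")
```

Dividing by the vCPU count keeps a four-vCPU VM from looking four times worse than a single-vCPU VM with the same per-vCPU contention.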
In Figure 2 below, the performance chart shows the VM starts out well; ready time is around 2%. Run time is high as the VM is working quite hard, but at the start the VM is getting its CPU time. In the middle there is another busy VM competing for the pCPU. The pCPU is saturated, so this VM's ready time climbs to around 30%. This means that the run time drops by about 30%, resulting in less work getting completed by the VM and users being left waiting. At the end, the pCPU is hugely overloaded because three busy VMs are now competing for the one host CPU. Our VM is now getting about a quarter of the pCPU. Users will be complaining a lot because the application will be very slow. Notice that the wait time did not disappear, as there is I/O wait. The application was accessing disk and could not use CPU until the I/O completed.
Causes of vCPU ready time
The most obvious cause of vCPU ready time is ESXi server pCPU saturation. The total CPU demand for all the VMs on the host exceeds the installed CPU. This is usually easy to spot because the ESXi server triggers the vCenter Host CPU Usage alarm, but it is also the hardest to resolve. To resolve it, either decrease the demand for CPU by shutting down VMs or increase the supply by adding physical CPUs.
The second cause of high ready time is that the VM or the resource pool containing the VM has a CPU limit. A resource limit is a ceiling on the amount of a resource that will be delivered. Limits apply even when spare resource is available, so a VM with a CPU limit may have ready time even though the ESXi server CPU is not saturated. Removing the limit relieves the situation: CPU usage will increase and so will application performance.
The third cause is quite obscure: Setting CPU affinity on a VM will prevent the VMkernel from using any other pCPU for the VM. To create these performance graphs, I restricted my test VM to run only on CPU 3 of my lab host. I then forced more VMs to share that one CPU core until it was hugely overloaded. As you will see in Figure 3, the ESXi server performance chart shows the other three cores became less and less loaded as I moved VMs onto CPU 3. There are almost no situations where CPU affinity is a good idea, and it's better to just avoid it. One wrinkle is that the vSphere client hides the CPU affinity setting when the VM is in a DRS cluster that is set to fully automated. You must set the cluster to partially automated or manual to remove CPU affinities.
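The three causes above lend themselves to a simple triage order when a VM shows high ready time. The Python sketch below is a hypothetical helper, not a vSphere API: the `VmCpuSnapshot` fields, the 10% ready threshold and the 90% host saturation threshold are all assumptions for illustration, and the values would have to be gathered from the performance charts and VM settings by hand or by other tooling.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class VmCpuSnapshot:
    """Hypothetical summary of one VM's CPU situation."""
    ready_pct: float           # vCPU ready time as a percentage
    host_cpu_usage_pct: float  # overall CPU usage of the ESXi server
    has_cpu_limit: bool        # limit on the VM or its resource pool
    has_cpu_affinity: bool     # VM pinned to specific pCPUs


def likely_ready_causes(s: VmCpuSnapshot,
                        ready_threshold: float = 10.0,
                        host_saturation: float = 90.0) -> List[str]:
    """Return candidate causes of ready time, in the article's order."""
    if s.ready_pct < ready_threshold:
        return []  # ready time is low; nothing to explain
    causes = []
    if s.host_cpu_usage_pct >= host_saturation:
        causes.append("host pCPU saturation")
    if s.has_cpu_limit:
        causes.append("CPU limit on VM or resource pool")
    if s.has_cpu_affinity:
        causes.append("CPU affinity")
    return causes


# Roughly the lab scenario: one overloaded core, the rest of the host idle
lab_vm = VmCpuSnapshot(ready_pct=30.0, host_cpu_usage_pct=40.0,
                       has_cpu_limit=False, has_cpu_affinity=True)
print(likely_ready_causes(lab_vm))
```

An empty list for a VM with high ready time means none of the three causes matched and the thresholds or inputs deserve a second look.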
As with other areas of performance management, you need to look at the pools of resources and avoid saturation. Understanding the CPU states is key to recognizing CPU-constrained VMs. For CPU performance, the first performance counter to watch is CPU ready. If ready time is low, the VMkernel is giving the VM all the CPU time it wants.