While the vSphere client provides performance data, the esxtop and resxtop performance utilities offer more advanced information to ease virtualization troubleshooting efforts. In this tip, we focus on using esxtop and resxtop, but the same performance statistics can also be viewed from the vSphere client.
Esxtop and resxtop run in a shell session, and working at a command line can be intimidating. But don't let the format discourage you. Once you get used to the controls and how to interpret the data, these tools become invaluable for reading how CPUs handle the workloads of hosts and virtual machines (VMs).
Esxtop vs. resxtop
While esxtop runs only inside an ESX service console, either directly at the console or remotely over a Secure Shell (SSH) session, resxtop is a remote version of esxtop. Resxtop is included in the Linux version of the vSphere command line interface (CLI) and is part of the vSphere Management Assistant (vMA). Esxtop and resxtop function the same way and provide the same information, but resxtop supports only the interactive and batch modes and cannot be run in replay mode.
Various esxtop commands and modes
Esxtop is basically a VMware version of top, the Linux command-line utility that displays real-time CPU information. Esxtop displays information specific to virtual hosts and virtual machines (VMs) and, unlike top, can display information on all resources (CPU/memory/disk/network). It can be run in three modes: interactive, batch and replay.
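As a rough illustration of batch mode, the Python sketch below assembles an esxtop command line without running it. The -b (batch), -d (delay in seconds) and -n (number of iterations) switches are standard esxtop options; the helper name build_esxtop_batch_command and the values passed to it are assumptions made up for this example.

```python
def build_esxtop_batch_command(delay_seconds=5, iterations=12):
    """Return the argument list for an esxtop batch-mode capture.

    Hypothetical helper for illustration: it only builds the command,
    it does not execute anything.
    """
    if delay_seconds <= 0 or iterations <= 0:
        raise ValueError("delay_seconds and iterations must be positive")
    # -b = batch mode, -d = delay between samples, -n = number of samples
    return ["esxtop", "-b", "-d", str(delay_seconds), "-n", str(iterations)]


command = build_esxtop_batch_command(delay_seconds=10, iterations=6)
print(" ".join(command))  # prints: esxtop -b -d 10 -n 6
```

In practice the batch output is usually redirected to a CSV file for later analysis. With resxtop, a --server option naming the remote host would be added to the same switches.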
Information is displayed in a spreadsheet-like format with columns and rows. Esxtop displays many columns. Resize your console window to see all columns because they initially scroll off the right-hand side of the screen.
All esxtop commands are single-keystroke commands. You can press ? or h to get the list of commands. You can also type "man esxtop" to display esxtop's built-in documentation. You can add or remove fields (or columns) and change the display order by pressing f (add/remove fields) or o (change field order). In most cases, esxtop commands are case-sensitive and an uppercase letter accomplishes a different task from a lowercase letter.
You can switch between displaying different resource information using the following keys:
| c | Displays CPU information |
| m | Displays memory information |
| n | Displays network information |
| d | Displays disk storage adapter information |
| u | Displays disk device information |
| v | Displays individual VM disk information |
| i | Displays device interrupt utilization information |
The CPU information screen
The CPU information screen is typically the most widely used because it provides detailed statistics on how the physical/logical CPUs in the host are used and also indicates CPU scheduling problems. Figure 1 breaks down the information displayed on the CPU screen.
The top line displays the current time, host uptime, the number of running "worlds" (i.e., scheduled entities or processes) and CPU load averages over the prior minute, five minutes and 15 minutes.
A load average of 0.50 means that all the CPUs on a host are only half-utilized. A value of 1.00 means that the CPUs are fully utilized, and a value greater than 1.00 (e.g., 1.50) means the host needs more CPUs than are available. The two lines that follow (PCPU USED% and PCPU UTIL%) show workload percentages for each individual CPU core in the host.
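The load-average rule of thumb above can be sketched in a few lines of Python. The function name describe_load_average and the wording of its messages are assumptions for illustration; only the 1.00 = fully utilized interpretation comes from the text.

```python
def describe_load_average(load):
    """Interpret an esxtop CPU load average, where 1.00 = fully utilized."""
    if load < 0:
        raise ValueError("load average cannot be negative")
    percent = load * 100.0
    if load < 1.0:
        return f"{percent:.0f}% of CPU capacity in use"
    if load == 1.0:
        return "CPUs fully utilized"
    # Above 1.00 the host is asked for more CPU than it has.
    return f"demand exceeds capacity by {percent - 100.0:.0f}%"


for value in (0.50, 1.00, 1.50):
    print(value, "->", describe_load_average(value))
```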
In Figure 1, the host has dual six-core CPUs, so there are 12 numbers displayed. The average shown at the end is the combined average for all CPUs in the host. The PCPU USED% is the percentage of CPU usage per PCPU, and the PCPU UTIL% is the percentage of "unhalted" CPU cycles per PCPU.
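To make the relationship between the per-CPU numbers and the trailing average concrete, here is a minimal Python sketch that splits such a line into its parts. The sample line is invented for this example (four CPUs rather than 12), and the exact spacing and label text that esxtop prints may differ, so treat the format as an assumption.

```python
def parse_pcpu_line(line):
    """Split a PCPU-style line into (label, per-CPU values, reported average).

    Assumes the layout '<label>: v1 v2 ... AVG: v', which is an
    illustrative approximation of the esxtop header line.
    """
    label, _, rest = line.partition(":")
    per_cpu_text, _, avg_text = rest.partition("AVG:")
    per_cpu = [float(token) for token in per_cpu_text.split()]
    reported_avg = float(avg_text) if avg_text.strip() else None
    return label.strip(), per_cpu, reported_avg


# Made-up sample line, not captured from a real host.
sample = "PCPU USED(%):  10  20  30  40 AVG:  25"
label, values, reported = parse_pcpu_line(sample)
computed = sum(values) / len(values)
print(label, values, reported, computed)
```

On a dual six-core host the list would hold 12 values, and the computed mean should match the average shown at the end of the line, apart from rounding.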
Different CPU cycles
When a CPU operates in full-power mode, it's referred to as an "unhalted" CPU cycle. CPUs can be sent a halt (HALT) command, which places the CPU in a lower-power mode that is akin to sleep mode. This sleep state persists until a disruption triggers the CPU back into full power mode. The HALT command changes the C-state of the CPU from C0 (full power mode) to C1 or C1E (halt or lower-power mode) and conserves power on a system.
In most cases, the PCPU USED% and PCPU UTIL% values should be fairly close to each other, but two scenarios can cause them to drift apart. First, if a host has CPUs with hyperthreading enabled, a single physical core can act as two logical CPUs. Unlike cores, which are distinct hardware blocks in the CPU, the two threads of a core share hardware components such as cache and execution units. As a result, with hyperthreading, the PCPU USED% and PCPU UTIL% can differ because of how the CPU scheduler records usage. Even if only one thread is busy, it records usage for both threads.
The second scenario involves a CPU power-saving technology such as Dynamic Voltage and Frequency Scaling (DVFS). When DVFS is enabled, CPU voltages and frequencies are dynamically adjusted by changing P-states based on VMs' workload demands. As a result, the CPU does less-effective work when the CPU frequency is lower, which in turn lowers the PCPU USED%.
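The effect of frequency scaling can be approximated with simple arithmetic. The sketch below assumes a simplified model in which PCPU USED% is PCPU UTIL% scaled by the ratio of the current clock frequency to the nominal one; it ignores hyperthreading and any turbo behavior, and the function name and frequencies are hypothetical.

```python
def estimate_used_percent(util_percent, current_mhz, nominal_mhz):
    """Estimate PCPU USED% from PCPU UTIL% under frequency scaling.

    Simplified model (an assumption, not the scheduler's exact accounting):
    work done is proportional to the clock frequency while unhalted.
    """
    if nominal_mhz <= 0:
        raise ValueError("nominal_mhz must be positive")
    return util_percent * current_mhz / nominal_mhz


# A core that is unhalted 80% of the time but clocked at half speed
# does roughly half the work that 80% at full speed would represent.
print(estimate_used_percent(80.0, 1200.0, 2400.0))  # prints: 40.0
```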
CPU C-states: Saving power
In addition to changing P-states to save power, vSphere 4.1 introduced support for C-states, which temporarily put a core to sleep. If either PCPU USED% or PCPU UTIL% consistently shows high readings (90% to 100%), it may indicate that your CPUs are overcommitted and cannot keep up with VM demands.
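One way to turn "consistently high" into something checkable is to look at a series of PCPU readings rather than a single refresh. The following Python sketch flags a CPU when most of its samples are at or above 90%. The 90% threshold comes from the text; the min_fraction parameter, the function name and the sample data are assumptions chosen for illustration.

```python
def is_consistently_high(samples, threshold=90.0, min_fraction=0.9):
    """Return True if at least min_fraction of samples are >= threshold."""
    if not samples:
        return False
    high_count = sum(1 for value in samples if value >= threshold)
    return high_count / len(samples) >= min_fraction


# Invented readings for two CPUs over several refresh intervals.
busy_cpu = [95, 97, 92, 99, 91, 96, 94, 98, 93, 90]
spiky_cpu = [95, 40, 97, 30]
print(is_consistently_high(busy_cpu))   # prints: True
print(is_consistently_high(spiky_cpu))  # prints: False
```

A short spike to 100% is normal; it is the sustained pattern that suggests overcommitment.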
If hyperthreading is enabled, the CORE UTIL% field will also appear, which displays only the utilization percentage of each core and not the individual threads. So if a host has eight cores and 16 threads, it displays only the eight-core values, and if only one thread of a core is at 100% utilization, the core will show as 100% utilized. This gives you a view of core utilization as a whole regardless of thread utilization.
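The per-core view can be reasoned about from the two thread values. The sketch below assumes that a core counts as utilized whenever at least one of its threads is busy, which is consistent with the one-thread-at-100% example above but is still an assumption about how the field is computed. Under that assumption, per-thread percentages alone only bound the core figure, because they do not say how much the two threads' busy periods overlap.

```python
def core_util_bounds(thread_a_percent, thread_b_percent):
    """Return (lower, upper) bounds on core utilization for two threads.

    Lower bound: busy periods overlap completely.
    Upper bound: busy periods never overlap, capped at 100%.
    """
    lower = max(thread_a_percent, thread_b_percent)
    upper = min(100.0, thread_a_percent + thread_b_percent)
    return lower, upper


print(core_util_bounds(100.0, 0.0))  # prints: (100.0, 100.0)
print(core_util_bounds(30.0, 50.0))  # prints: (50.0, 80.0)
```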
The CCPU% section contains the percentages of the total CPU time as reported by the ESX service console; this section will not appear when connected to ESXi hosts. Four percentage values display here: us for user time, sy for system time, id for idle time and wa for wait time.
User time refers to the amount of time the CPU spends performing some action for a program and system time refers to the amount of time the CPU spends performing system calls for the kernel on the program's behalf. Idle time refers to when a CPU is idle and wait time refers to when the CPU has nothing to do because it is waiting for I/O.
Low idle percentages indicate that the CPUs are very busy. High wait percentages can indicate a resource bottleneck, such as the CPU having to wait for disk I/O. The cs/sec field is for the context switches per second recorded by the ESX service console. A context switch occurs when the kernel switches the processor from one thread to another. A high context-switch rate often indicates that too many threads are competing for the processors on the system.
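The CCPU% fields described above can be pulled out of a line of text with a small parser. The Python sketch below assumes a layout of the form "N us, N sy, N id, N wa ; cs/sec: N"; the sample line is invented and the real spacing on an ESX host may differ, so the format and the function name should be read as assumptions.

```python
import re


def parse_ccpu_line(line):
    """Extract us/sy/id/wa percentages and cs/sec from a CCPU-style line."""
    # Each percentage is a number followed by its two-letter field name.
    fields = {
        name: float(value)
        for value, name in re.findall(r"(\d+(?:\.\d+)?)\s+(us|sy|id|wa)\b", line)
    }
    match = re.search(r"cs/sec:\s*(\d+(?:\.\d+)?)", line)
    if match:
        fields["cs/sec"] = float(match.group(1))
    return fields


# Made-up sample line, not captured from a real service console.
sample = "CCPU(%):   3 us,   2 sy,  94 id,   1 wa ;  cs/sec:   118"
stats = parse_ccpu_line(sample)
print(stats)
```

With the values in a dictionary, it becomes straightforward to watch for a falling id value or a rising wa value across successive samples.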
While this article should help get you started with using esxtop, there is plenty more to learn. So far, we have covered the basics of using esxtop as well as details about the fields and information displayed on the CPU resource screen. But we have yet to explore what the many statistics columns mean and how to interpret them to determine whether a host is experiencing performance problems. Stay tuned.
Eric Siebert is a 25-year IT veteran with experience in programming, networking, telecom and systems administration. He is a guru-status moderator on the VMware community VMTN forums and maintains VMware-land.com, a VI3 information site.
This was first published in August 2010