Identifying and resolving vSphere storage performance issues follows a simple workflow: find the overloaded data store, then find the busy VMs on it. Busy VMs should be spread across data stores, and high-I/O VMs should be placed on data stores that suit their performance requirements. Here, we'll look at how to identify and resolve slow vSphere storage.
Administrators should look for overloaded resource pools first. When a single consumer overloads a pool, you should allocate more resources to that pool. When a group of consumers overloads the pool, you must either allocate more resources to it or split the consumers across several pools.
The resource pool for storage is a data store, and the VMs that use the data store share its performance. (The data store can be NFS or VMFS.) The data store ultimately resides on physical storage, either disks or flash. These devices, the way the array combines them and the array's caching define the performance of the storage. To make things more complicated, many storage arrays use the same set of physical disks for several NFS shares or LUNs, so multiple data stores may share the same underlying pool of disks. That pool of disks may also be used by non-virtual workloads, which reduces the visibility of the load on the disks.
No easy way to pinpoint storage problems
Storage loading is a complex area because performance is affected by the size and rate of transactions, the read/write ratios and random versus sequential behavior. A data store does not deliver the same performance for all application types, because different applications generate different I/O patterns. Top storage performance is often expensive, and because storage has a long service life, opportunities to purchase upgrades that deliver more performance are infrequent. Usually, managing storage performance involves balancing the different workloads across multiple data stores.
Uncover the amount of latency
The key measure of overload on storage is its latency. Storage that responds quickly is not overloaded. If the same storage takes longer to respond than is typical, then it is likely overloaded. How long a piece of storage should take to respond depends on the underlying storage type. Flash-based storage (SSD) should respond in under a millisecond; a couple of milliseconds is very slow for flash. For SAS or Fibre Channel-based storage, five milliseconds is good and 50 milliseconds is bad. For SATA storage, 10 milliseconds is normal and more than 100 milliseconds is bad.
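As a rough illustration, the latency figures above can be turned into a small lookup. This is only a sketch: the tier names, the function name and the "elevated" band between the good and bad figures are assumptions made for the example, not vSphere terminology, and the boundaries are approximate.

```python
# Latency bands in milliseconds, taken from the figures quoted above.
# Each entry is (upper bound of "ok", lower bound of "overloaded").
# The tier names and the middle "elevated" band are assumptions.
THRESHOLDS_MS = {
    "flash": (1.0, 2.0),     # under 1 ms expected; a couple of ms is very slow
    "sas_fc": (5.0, 50.0),   # 5 ms is good; 50 ms is bad
    "sata": (10.0, 100.0),   # 10 ms is normal; around 100 ms or more is bad
}


def classify_latency(storage_type: str, latency_ms: float) -> str:
    """Return 'ok', 'elevated' or 'overloaded' for an observed latency."""
    good_max, bad_min = THRESHOLDS_MS[storage_type]
    if latency_ms <= good_max:
        return "ok"
    if latency_ms >= bad_min:
        return "overloaded"
    return "elevated"


if __name__ == "__main__":
    for tier, latency in [("flash", 0.5), ("sas_fc", 20.0), ("sata", 180.0)]:
        print(tier, latency, classify_latency(tier, latency))
```

The bands are a guide rather than hard limits. What counts as typical for a given array still depends on its configuration and workload.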
Performance charts provide troubleshooting map
The primary tools for examining performance are the vSphere performance charts, just as they are for networking. Keep in mind that the same data stores will usually be set up on every ESXi server in a cluster. The VMs sharing a data store will often not be on the same ESXi server, so use the "Datastores and Datastore Clusters" view in the inventory rather than the ESXi server view. Examine the performance graphs for the data store you suspect.
Move the culprit to another data store
Once you have identified which data store is overloaded, you must then look at what is causing the overload. When a single VM saturates the data store, the usual fix is to place that VM on a higher-performance data store, ideally one where the underlying storage is tuned for the VM's I/O pattern. If you cannot move the one high-I/O VM, then you may want to move the other, lower-I/O VMs to another data store. If multiple VMs together are overloading one data store, then you may be able to move some of them to other data stores that are less heavily loaded.
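The decision described above can be sketched as a small helper that looks at per-VM IOPS on an overloaded data store. Everything here is hypothetical: the function name, the VM names in the usage lines and the 60% "dominant share" cutoff are assumptions chosen for illustration, and in practice the per-VM figures would come from the performance charts.

```python
from typing import Dict, Optional, Tuple


def diagnose_overload(
    vm_iops: Dict[str, float],
    dominant_share: float = 0.6,  # assumed cutoff, not a vSphere setting
) -> Tuple[str, Optional[str]]:
    """Classify the cause of an overload on one data store.

    Returns ("single", vm_name) when one VM produces at least
    `dominant_share` of the total IOPS, ("group", None) when the load is
    spread across several VMs, and ("idle", None) when there is no load.
    """
    total = sum(vm_iops.values())
    if total <= 0:
        return ("idle", None)
    top_vm, top_iops = max(vm_iops.items(), key=lambda item: item[1])
    if top_iops / total >= dominant_share:
        return ("single", top_vm)
    return ("group", None)


if __name__ == "__main__":
    # Illustrative numbers only.
    print(diagnose_overload({"vm-a": 150, "vm-b": 5, "vm-c": 5}))
    print(diagnose_overload({"vm-a": 85, "vm-b": 85, "vm-c": 5}))
```

A "single" result points toward moving that VM to a higher-performance data store, or moving its neighbors away if it cannot be moved. A "group" result points toward spreading some of the VMs across less heavily loaded data stores.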
In the example below, the NFS01 data store is being used by four lab VMs. Lab-01 and Lab-02 both have a constant, very light I/O load throughout the hour. Initially, Lab-03 and Lab-04 are both doing much more I/O, around 85 IOPS each, leading to high latencies of around 180 milliseconds on the data store. This is a result of two VMs competing for the limited performance of the data store and overloading the underlying low performance storage.
At 2 p.m., the I/O load from Lab-03 dropped to the same light load as Lab-01 and Lab-02. Now that Lab-04 is the only high I/O load, its IOPS climb to around 150, as there is no competition for the data store's performance. At the same time, latency dropped from 180 milliseconds to 75 milliseconds because there was much less waiting for access to the data store. The data store is still overloaded, now by a single VM.
At 2:20 p.m., the high I/O load in Lab-04 ended and the data store stopped being saturated. Because IOPS dropped to almost nothing, the latency also dropped. Notice that Lab-01 and Lab-02 also experienced higher latencies while the data store was saturated. Even though they were doing very little I/O, their I/O was slowed because they shared the data store with busy VMs.
This is a basic overview of resolving vSphere storage performance issues. I have not looked at important considerations like I/O size, read/write ratios or random versus sequential behaviors. I also haven't looked at RAID setup, caching, queuing or storage fabric configuration. Storage design and implementation is a complex specialist area, and using an experienced storage designer is an important part of avoiding performance issues. The other crucial part is knowing the workload that will be placed on the storage so the design can suit that workload.
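The arithmetic behind the first two phases of this example can be checked with a few lines. This sketch uses only the approximate figures quoted above. The light load from Lab-01 and Lab-02 is left out because no numbers are given for it, and the phase labels and variable names are assumptions.

```python
# Approximate figures from the example above. The light I/O from Lab-01
# and Lab-02 is omitted because the text gives no numbers for it.
phases = [
    {"name": "before 2 p.m.", "heavy_vm_iops": [85, 85], "latency_ms": 180},
    {"name": "2 p.m. to 2:20 p.m.", "heavy_vm_iops": [150], "latency_ms": 75},
]


def summarize(phase):
    """Add up the heavy VMs' IOPS for one phase."""
    return {
        "name": phase["name"],
        "total_iops": sum(phase["heavy_vm_iops"]),
        "latency_ms": phase["latency_ms"],
    }


summaries = [summarize(p) for p in phases]

# Ratios of the second phase to the first.
iops_change = summaries[1]["total_iops"] / summaries[0]["total_iops"]
latency_change = summaries[1]["latency_ms"] / summaries[0]["latency_ms"]

if __name__ == "__main__":
    for s in summaries:
        print(f'{s["name"]}: {s["total_iops"]} IOPS at {s["latency_ms"]} ms')
    print(f"IOPS ratio: {iops_change:.2f}, latency ratio: {latency_change:.2f}")
```

On these rounded numbers, the heavy load on the data store falls only from about 170 to about 150 IOPS once Lab-03 goes quiet, roughly 12% lower, while latency falls by nearly 60%. That fits the point made above: the data store delivers a similar amount of work in both phases, but with two VMs competing, each request waits much longer.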