VMware vSAN troubleshooting is an essential skill for IT administrators who want to get the most value out of this complex technology.
VMware vSAN is a relatively new part of VMware's product suite and offers a hyper-converged technology that creates a shared, distributed data store from direct-attached storage devices across a vSphere cluster. VMware vSAN, with proper implementation and management, can provide significant value for its cost.
VMware stresses that rigorous verification, testing and maintenance are essential. Checking the hardware against VMware's list of verified and approved components is a fundamental first step.
Successful maintenance requires some foundational VMware vSAN troubleshooting techniques. The simplest way to get a picture of overall vSAN health is to evaluate the status of the vSAN cluster.
In the web client, select the cluster in question, navigate to the monitoring tab and select the vSAN tab. This will break down any vSAN-related errors or warnings.
By default, vSAN will create an alarm for any network misconfiguration it detects. It's vital that admins configure all the hosts identically. Admins should pay particular attention to aspects such as connectivity and standardization.
Maximum transmission unit size, for instance, should be the same across the hosts and the switching infrastructure in the cluster. Admins should also configure the network and any firewalls so that hosts can reach all the required ports as laid out in the design documentation from VMware.
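The MTU check described above can be sketched in a few lines of Python. This is a minimal illustration, not a VMware tool: the function name find_mtu_mismatches, the device labels and the inventory values are all hypothetical, and the sketch assumes the admin has already collected the configured MTU for each VMkernel port and switch port by some other means.

```python
from collections import Counter


def find_mtu_mismatches(mtu_by_device):
    """Return the devices whose MTU differs from the most common value.

    mtu_by_device maps a hypothetical device label (a host VMkernel port
    or a switch port) to its configured MTU in bytes.
    """
    if not mtu_by_device:
        return {}
    # Treat the most common MTU as the intended standard for the cluster.
    expected, _ = Counter(mtu_by_device.values()).most_common(1)[0]
    return {dev: mtu for dev, mtu in mtu_by_device.items() if mtu != expected}


# Hypothetical inventory, gathered by hand or by an existing script.
inventory = {
    "esx01:vmk2": 9000,
    "esx02:vmk2": 9000,
    "esx03:vmk2": 1500,
    "switch-a:port12": 9000,
}

print(find_mtu_mismatches(inventory))  # {'esx03:vmk2': 1500}
```

Using the most common value as the reference is a design shortcut. If the intended MTU is known from the design documentation, comparing against that value directly is the safer choice.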
Displayed warnings might be intermittent. Use the retest function to test all the components again. The warnings in Figure A are the result of a nested configuration, which means the hardware isn't certified. Beyond that, some operations, such as resyncing, take a long time to complete and require patience.
VMware vSAN troubleshooting is essential because even a slight misconfiguration can have an amplified effect across the cluster and reduce performance. This applies all the way down to the firmware version.
VMware vSAN troubleshooting during host failure
In a three-node cluster, vSAN can tolerate one complete host failure. As the cluster grows, the number of failures it can tolerate should grow too, because the more hosts a cluster has, the more likely it is that two nodes will be unavailable at the same time.
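The relationship between cluster size and tolerable failures can be expressed as simple arithmetic. The sketch below assumes a mirroring-based storage policy, where tolerating N host failures takes 2N + 1 hosts; that rule is not stated in this article and other protection schemes follow different rules, so treat the function as an illustration. The name min_hosts_for_ftt is hypothetical.

```python
def min_hosts_for_ftt(failures_to_tolerate):
    """Hosts needed to tolerate N host failures with mirrored copies.

    Assumes a mirroring policy, where the rule of thumb is 2N + 1 hosts.
    """
    if failures_to_tolerate < 0:
        raise ValueError("failures_to_tolerate must be non-negative")
    return 2 * failures_to_tolerate + 1


for ftt in (1, 2):
    print(ftt, min_hosts_for_ftt(ftt))
# 1 3
# 2 5
```

Under this assumption, the three-node cluster matches the one-failure case, and tolerating two simultaneous host failures takes at least five hosts.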
Note that there's a difference between a degraded and an absent device. Absent refers to an item that vSAN believes will return shortly, such as a rebooted host. Degraded refers to an item that won't return as easily, such as a disk failure.
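The practical consequence of the two states is when vSAN starts rebuilding data elsewhere. The sketch below models that decision under assumptions not stated in this article: a degraded component is rebuilt immediately, while an absent one is rebuilt only after a repair delay, taken here to be 60 minutes by default. The function name rebuild_should_start is hypothetical, and the delay should be checked against the cluster's actual setting.

```python
# Assumed default repair delay in minutes; verify against the cluster's setting.
DEFAULT_REPAIR_DELAY_MIN = 60


def rebuild_should_start(state, minutes_since_failure,
                         repair_delay_min=DEFAULT_REPAIR_DELAY_MIN):
    """Return True if a rebuild would be expected to have started.

    state is "absent" or "degraded" (case-insensitive).
    """
    state = state.lower()
    if state == "degraded":
        # The device is not expected to return, so there is no waiting period.
        return True
    if state == "absent":
        # The device may return (for example, a rebooting host), so wait first.
        return minutes_since_failure >= repair_delay_min
    raise ValueError(f"unknown component state: {state!r}")


print(rebuild_should_start("absent", 10))    # False
print(rebuild_should_start("absent", 75))    # True
print(rebuild_should_start("degraded", 0))   # True
```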
There are two different ways to view and modify vSAN information. The web GUI makes a lot of the functionality available, but there are additional troubleshooting facilities available from the esxcli command line.
To get an overview of current vSAN performance, select the vSAN cluster and click Performance. The vSAN back-end and vSAN consumption views provide statistics for specific performance metrics, so admins can easily find a noisy VM that consumes excessive I/O.
If the status shown isn't green, there's an issue. To check basic configuration aspects, admins can use Secure Shell to log in to the host and run the esxcli vsan cluster get command.
This will display a snapshot of vSAN health as that host sees it. Issuing esxcli vsan on its own will display a list of the available vSAN subcommands.
The limitation of the esxcli vsan cluster get command is that it only shows the status as seen by the host on which it runs. Other hosts might see the issue differently, so admins should repeat the command on every node and compare the results. VMware also offers the Ruby-based VMware Virtual SAN Observer management tool, which collects and displays monitoring data for the cluster as a whole.
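Comparing the output from every node by eye is error-prone, so a short script can help. The sketch below assumes the admin has already captured the text output of esxcli vsan cluster get from each host, for example over Secure Shell, and that the output consists of "Key: value" lines. The sample text, the host names and the field names Sub-Cluster UUID and Sub-Cluster Member Count are illustrative and should be verified against the real output; the function names are hypothetical.

```python
def parse_cluster_get(output):
    """Parse 'Key: value' lines from the captured command output into a dict."""
    fields = {}
    for line in output.splitlines():
        # Split on the first colon only, so timestamps in values stay intact.
        key, sep, value = line.partition(":")
        if sep and value.strip():
            fields[key.strip()] = value.strip()
    return fields


def find_inconsistencies(outputs_by_host,
                         keys=("Sub-Cluster UUID", "Sub-Cluster Member Count")):
    """Return {key: {host: value}} for each key on which the hosts disagree."""
    parsed = {host: parse_cluster_get(text)
              for host, text in outputs_by_host.items()}
    problems = {}
    for key in keys:
        values = {host: fields.get(key) for host, fields in parsed.items()}
        if len(set(values.values())) > 1:
            problems[key] = values
    return problems


# Illustrative output only; real output contains more fields.
TEMPLATE = """Cluster Information
   Enabled: true
   Local Node State: {state}
   Sub-Cluster UUID: 11111111-aaaa
   Sub-Cluster Member Count: {count}
"""

outputs = {
    "esx01": TEMPLATE.format(state="MASTER", count=3),
    "esx02": TEMPLATE.format(state="AGENT", count=3),
    "esx03": TEMPLATE.format(state="MASTER", count=1),  # isolated host
}

print(find_inconsistencies(outputs))
```

In this made-up example, the third host reports a member count of 1 while the others report 3, which is the kind of disagreement that suggests a network partition and is easy to miss when only one host is checked.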
Connectivity is the source of many issues. When implementing vSAN, admins should assign multiple network interface cards to the vSAN VMkernel port and connect them to separate physical switches. With this configuration, a single network switch failure won't interrupt connectivity.