Build a more effective approach to vSphere troubleshooting

While there are several ways a vSphere administrator can save time and trouble before disaster visits the data center, there are a few measures to take while correcting an issue to keep a problem from growing into a full-blown catastrophe.

Having advanced vSphere troubleshooting skills is key in disaster situations, but there are a few other critical pieces to the diagnostic sequence that should not be overlooked. Here are three actions an administrator can take to reduce further damage during a disaster and also stave off looming threats before they blossom into full-blown catastrophes.

Develop a troubleshooting methodology

An official document offering some high-level steps to take during a crisis can be very beneficial to administrators. Often a sysadmin will focus efforts on the first problem they see when a much larger issue could be the cause of the problem.

A vSphere troubleshooting methodology can help to mitigate the issues and save valuable time and money for the business. Writing this document doesn't have to be the daunting task that some make it out to be; it can be as detailed or as high-level as we want. It should answer some of the following questions:

  • Where do we start to look for problems?
  • Do we know what to look for to identify the issue?
  • How do we drill down to find the root cause of the issue?
  • What changes are required to fix the root cause of the issue?
  • What resources are available if we cannot find or fix the root cause?
  • How do we know when we have completed troubleshooting the issue?

VMware's Performance Troubleshooting for vSphere 4.1 technical paper has a great outline on how to create a troubleshooting methodology.  Although it was developed for an older version of vSphere, many of the fundamentals and guidelines are still relevant.

Document your environment

Whether it serves as a reminder for yourself or a detailed explanation for others, documenting and inventorying your environment is always a good thing to do. Take a moment to think: If a LUN vanished from your SAN, would you know which VMs would be affected? Would you know if there were any unregistered VMs on it? Would you be able to identify issues with this LUN in the vSphere logs, knowing that it is referenced only by its Network Address Authority (NAA) identifier? Some of these questions may seem drastic and irrelevant, but we all run into one-off issues where a little bit of documentation could save a lot of troubleshooting time.
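One way to make the LUN question answerable ahead of time is to record which NAA device backs each datastore. The sketch below is one possible approach rather than a vetted tool. It assumes the input comes from `esxcli --formatter=csv storage vmfs extent list` run on a host, and that the CSV header contains `DeviceName` and `VolumeName` columns (worth checking on your own build). The function name `extent_map` is hypothetical.

```shell
#!/bin/sh
# Sketch: turn the CSV extent listing into "datastore<TAB>naa-device" lines.
# Assumption: the header row names its columns DeviceName and VolumeName.
# Limitation: a plain comma split breaks on datastore names containing commas.

extent_map() {
    awk -F',' '
        NR == 1 {
            # Index columns by header name so column order does not matter.
            for (i = 1; i <= NF; i++) col[$i] = i
            if (!("DeviceName" in col) || !("VolumeName" in col)) {
                print "unexpected header: " $0 > "/dev/stderr"
                exit 1
            }
            next
        }
        # Skip blank lines; print datastore name, then backing device.
        NF { printf "%s\t%s\n", $(col["VolumeName"]), $(col["DeviceName"]) }
    '
}

# Possible usage on a host (writes a small lookup file to keep off-host):
#   esxcli --formatter=csv storage vmfs extent list | extent_map > datastore-to-naa.tsv
```

Keeping a copy of this lookup file somewhere other than the affected storage means an NAA identifier seen in the logs can be matched to a datastore name even while the LUN is unavailable.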

So what's the best way to document and inventory our environments? This cannot be a manual process due to the dynamic nature of data centers today. We have vMotion moving VMs from host to host and, in some cases, Storage vMotion moving VMs between LUNs. How can we possibly keep track of all this?

Install vCheck

There is a PowerCLI script named vCheck, which was developed by Alan Renouf to provide as much information as you want about your vCenter environment. Through the use of plugins, vCheck can report on many different items, such as orphaned snapshots, VMs that have been restarted by HA, vSwitches that are at risk of running out of ports, performance issues and all DRS actions over the last day. There's plenty more where that came from -- vCheck has more than 100 plugins that can be enabled and disabled depending on what you want to see.

vCheck report
An administrator can configure vCheck to email a report similar to this one at specific times of the day.

This vCheck report shows basic information, but it can grow to be quite large depending on the number of plugins you enable. The information that vCheck gathers can be very valuable if a disaster occurs and can improve your troubleshooting or repair efforts.

Start syslog

Whether you manage vSphere or any other environment, configuring syslog is always a good idea. A syslog server is a centralized repository for log files, storing them away from the systems that generate them.

Why is this important? Say we have an ESXi host, which by nature logs either to local drives or to RAM. If that host has problems and access to it is lost, then we lose access to its log files. But if we use syslog, we can parse the logs from the syslog server to assist with troubleshooting. Some popular syslog products, such as Splunk or vCenter Log Insight, have slick search features as well as a way to graphically present log data. That said, vSphere "out-of-the-box" provides us with a few syslog options.

Classic syslog: vSphere allows us to forward logs to a centralized syslog server -- such as Splunk or Log Insight -- by executing the commands below within the CLI.  There are many other parameters we can set, such as log size and rotation, but here are the basics to get it going:

esxcli system syslog config set --loghost=syslog_server_ip_address

esxcli system syslog reload

esxcli network firewall ruleset set -r syslog -e true

esxcli network firewall refresh
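The four commands above can be wrapped in a small function so they run in order and stop at the first failure. This is a minimal sketch, assuming a POSIX shell such as the ESXi shell. The names `run`, `configure_syslog` and `DRY_RUN` are hypothetical helpers and not part of esxcli, and 192.0.2.10 is a placeholder address.

```shell
#!/bin/sh
# Sketch: forward ESXi logs to a remote syslog server.
# DRY_RUN defaults to 1, so nothing is changed unless you opt in.

# Print the command when DRY_RUN=1; execute it otherwise.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# $1: syslog target, for example udp://192.0.2.10:514
configure_syslog() {
    loghost="$1"
    if [ -z "$loghost" ]; then
        echo "usage: configure_syslog <loghost>" >&2
        return 1
    fi
    # Each step runs only if the previous one succeeded.
    run esxcli system syslog config set --loghost="$loghost" &&
    run esxcli system syslog reload &&
    run esxcli network firewall ruleset set --ruleset-id=syslog --enabled=true &&
    run esxcli network firewall refresh
}

# Preview the commands without touching the host:
#   configure_syslog udp://192.0.2.10:514
# Apply them on the host:
#   DRY_RUN=0 configure_syslog udp://192.0.2.10:514
```

After applying the change, `esxcli system syslog config get` shows the current settings, which is a quick way to confirm the remote host was accepted.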

Log to data store: If a true syslog server isn't possible, there is still a way to get the logs off the ESXi hosts and put them in another location. ESXi features advanced settings that can put logs in a unique directory located on a shared data store. This isn't as robust as syslog -- we need access to that data store during an outage -- but it may be a workable option. To redirect logs to a shared data store, go to the advanced settings at Configuration->Software->Advanced Settings and set the following two options. Syslog.global.logDir specifies the data store and location in a format matching "[DatastoreName] DirectoryName/FileName". Syslog.global.logDirUnique is a true/false setting that controls whether a host-specific folder is created within the logDir location to house the logs. When storing the log files of more than one host on the same data store, it's a good idea to set this one to true.
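The same redirection can also be set from the command line. The sketch below assumes that a datastore is reachable under /vmfs/volumes/ by its name, and that the `--logdir` and `--logdir-unique` options of `esxcli system syslog config set` are available on your ESXi version. The function names `datastore_to_path` and `logdir_commands` are hypothetical. The second function only prints the commands for review and does not run them.

```shell
#!/bin/sh
# Sketch: convert "[DatastoreName] Directory" into a /vmfs/volumes path
# and print the esxcli commands that would point the logs there.

datastore_to_path() {
    spec="$1"
    case "$spec" in
        "["*"]"*) ;;
        *) echo "expected \"[DatastoreName] Directory\", got: $spec" >&2
           return 1 ;;
    esac
    ds=${spec#"["}                  # drop the leading bracket
    ds=${ds%%"]"*}                  # keep text before the closing bracket
    dir=${spec#*"]"}                # text after the closing bracket
    dir=${dir#"${dir%%[! ]*}"}      # trim leading spaces
    if [ -z "$ds" ]; then
        echo "empty datastore name in: $spec" >&2
        return 1
    fi
    printf '/vmfs/volumes/%s/%s\n' "$ds" "$dir"
}

logdir_commands() {
    logdir_path=$(datastore_to_path "$1") || return 1
    # --logdir-unique=true asks for a per-host subdirectory.
    printf 'esxcli system syslog config set --logdir="%s" --logdir-unique=true\n' "$logdir_path"
    printf 'esxcli system syslog reload\n'
}

# Example (datastore and directory names are placeholders):
#   logdir_commands '[shared-ds01] esxi-logs'
```

Printing the commands first keeps the step reviewable, which suits a setting that is usually applied to many hosts at once.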

Procedures help the staff, business

Although developing a troubleshooting methodology, documenting our environment with tools such as vCheck and having access to our log files in an outage do not replace key troubleshooting skills, they will help to speed up the recovery process while reducing the risks of further damage. Taking the time to implement these simple components can help make the life of the administrator easier and save the business precious time and money.

Join the conversation


How do you troubleshoot your vSphere environment?
We do not have a troubleshooting guide because every failure behaves differently. What we normally do is document every step that we follow to resolve the problem.

For example, if a client reports a VM disconnection, we open a new Word document where we record every step and action, even if the action does not resolve the problem. We record everything (tasks in vCenter, client contact, VMware support contact, physical and virtual actions, etc.).

So, the next time you receive a similar report, you can go to the document and follow the actions.