Having advanced vSphere troubleshooting skills is key in disaster situations, but there are a few other critical pieces to the diagnostic sequence that should not be overlooked. Here are three actions an administrator can take to reduce further damage during a disaster and also stave off looming threats before they blossom into full-blown catastrophes.
Develop a troubleshooting methodology
An official document offering some high-level steps to take during a crisis can be very beneficial to administrators. Often a sysadmin will focus their efforts on the first problem they see, when a much larger issue could be the underlying cause.
A vSphere troubleshooting methodology can help mitigate that tunnel vision and save the business valuable time and money. This document doesn't have to be the daunting task some make it out to be; it can be as detailed or as high level as we want. It should answer some of the following questions:
- Where do we start to look for problems?
- Do we know what to look for to identify the issue?
- How do we drill down to find the root cause of the issue?
- What changes are required to fix the root cause of the issue?
- What resources are available if we cannot find or fix the root cause?
- How do we know when we have completed troubleshooting the issue?
VMware's Performance Troubleshooting for vSphere 4.1 technical paper has a great outline on how to create a troubleshooting methodology. Although it was developed for an older version of vSphere, many of the fundamentals and guidelines are still relevant.
Document your environment
Whether it serves as a reminder for yourself or a detailed explanation for others, documenting and inventorying your environment is always a good thing to do. Take a moment to think: If a LUN vanished from your SAN, would you know which VMs would be affected? Would you know if there were any unregistered VMs on it? Would you be able to identify issues with this LUN in the vSphere logs, knowing that it was only referenced by its Network Address Authority (NAA) identifier? Some of these questions may seem drastic and irrelevant, but one-off issues always crop up, and a little bit of documentation can save a lot of troubleshooting time.
So what's the best way to document and inventory our environments? This cannot be a manual process due to the dynamic nature of data centers today. We have vMotion moving VMs from host to host and, in some cases, Storage vMotion moving VMs between LUNs. How can we possibly keep track of all this?
There is a PowerCLI script named vCheck, developed by Alan Renouf, that provides as much information as you want about your vCenter environment. Through the use of plugins, vCheck can report on many different issues, such as orphaned snapshots, VMs that have been restarted by HA, vSwitches at risk of running out of ports, performance problems and all DRS actions over the last day. There's plenty more where that came from -- vCheck has more than 100 plugins that can be enabled and disabled depending on what you want to see.
A basic vCheck report is compact, but it can grow quite large depending on the number of plugins you enable. The information vCheck gathers can be very valuable if a disaster occurs and can improve your troubleshooting or repair efforts.
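Getting vCheck running takes only a few minutes. Here is a minimal sketch; the repository location shown is the commonly referenced home of Alan Renouf's project, so verify it before relying on it:

```shell
# Grab the vCheck-vSphere scripts (repository location assumed; confirm on GitHub)
git clone https://github.com/alanrenouf/vCheck-vSphere.git
cd vCheck-vSphere

# Then, from a PowerCLI prompt on the same machine:
#   .\vCheck.ps1 -config    # interactive first-run setup: vCenter server, email, plugins
#   .\vCheck.ps1            # subsequent runs generate the HTML report
```

The `-config` run walks through enabling or disabling individual plugins, which is how you keep the report focused on what matters in your environment.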
Configure syslog
Whether you manage vSphere or any other environment, configuring syslog is always a good idea. Syslog is a centralized repository for log files, storing them away from the systems that generate them.
Why is this important? Say we have an ESXi host, which by nature logs either to local drives or to RAM. If that host has problems and access to it is lost, then we lose access to its log files. But if we use syslog, we can parse the logs from the syslog server to assist with troubleshooting. Popular syslog products such as Splunk or vCenter Log Insight offer slick search features as well as ways to graphically present log data. That said, vSphere "out-of-the-box" provides us with a few syslog options.
Classic syslog: vSphere allows us to forward logs to a centralized syslog server -- such as Splunk or Log Insight -- by executing the commands below within the CLI. There are many other parameters we can set, such as log size and rotation, but here are the basics to get it going:
esxcli system syslog config set --loghost=syslog_server_ip_address
esxcli system syslog reload
esxcli network firewall ruleset set -r syslog -e true
esxcli network firewall refresh
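After the commands run, it's worth confirming the host actually picked up the new settings and that log messages are reaching the remote server. A quick sketch, using options from the same esxcli syslog namespace:

```shell
# Show the active syslog configuration (log host, log size, rotation count)
esxcli system syslog config get

# Send a marker message through the logging pipeline to confirm end-to-end delivery
esxcli system syslog mark --message="syslog connectivity test"
```

If the marker message shows up on the syslog server, forwarding is working; if not, recheck the firewall ruleset on the host.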
Log to data store: If a true syslog server isn't possible, there is still a way to get the logs off the ESXi hosts and into another location. ESXi features advanced settings that can put logs in a unique directory located on a shared data store. This isn't as robust as syslog -- we need access to that data store during an outage -- but it may be a workable option. To redirect logs to a shared data store, go to the advanced settings at Configuration->Software->Advanced Settings.
Syslog.global.logDir: Specifies the data store and location in a format matching "[DatastoreName] DirectoryName/FileName"
Syslog.global.logDirUnique: This is a true/false setting which controls whether or not a host-specific folder is created to house the logs from within the logDir setting. When storing log files of more than one host on the same data store, it's a good idea to have this one set to true.
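Rather than clicking through the advanced settings dialog, the same two values can also be set from the command line. A sketch, assuming a shared data store named datastore1 -- the data store name and directory are placeholders for your own:

```shell
# Point host logs at a directory on a shared data store
esxcli system syslog config set --logdir="/vmfs/volumes/datastore1/esxi-logs"

# Create a per-host subdirectory so multiple hosts can safely share the data store
esxcli system syslog config set --logdir-unique=true

# Reload the syslog daemon so the changes take effect
esxcli system syslog reload
```

This is handy when rolling the setting out to many hosts, since the commands can be scripted rather than repeated by hand in the client.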
Procedures help the staff, business
Although developing a troubleshooting methodology, documenting our environment with tools such as vCheck and having access to our log files in an outage do not replace key troubleshooting skills, they will help to speed up the recovery process while reducing the risks of further damage. Taking the time to implement these simple components can help make the life of the administrator easier and save the business precious time and money.