Anatomy of an Error: How to Troubleshoot

This week I sat in on a TTT (Train-The-Trainer) session for a new VMware course called “Troubleshooting”. It’s part of my job as an instructor that I must attend these events to be able to deliver a particular course. Just so you know, this new course has some hands-on labs where you do some advanced configuring/reporting using various CLI tools, and it also contains PowerCLI scripts that the instructor runs to break the vSphere4 environment, leaving students to run about trying to fix the problems.

Well, last week I had a real problem. Not a very serious one, so I left it until this week to resolve. What I'm about to tell you is an account of how I fixed (or didn’t!) the problem. The idea is to show how you troubleshoot a real problem – and the only way to do that realistically is to have one you haven’t seen before. I hope to learn as much as I can from this process. So join me for the ride. The real thing I want to get across is not the problem itself, but how I handle it. I do the best I can with the limited skills I have. But perhaps you can learn from my mistakes as well as my successes.

Let me start with an overview of what the problem is, what I was doing when it happened – and what I think caused it.

 

This is the problem.

I have a VM which is powered off, but vCenter still thinks it is powered on. I’ve seen this in the past in ESX2/vCenter1 and ESX3/vCenter2 – but I must admit that since I started running vSphere4, back in the beta program, this problem had gone away. So I have seen it before, but not for some time. I also have an unpleasant DRS error message which offends my eye.

The offending VM is rtfm-xppc. I tried powering it off the conventional way with the vSphere4 client, and when that failed, I opened up a console to the VM, logged into it, and then shut it down using the Windows Security dialog box. I actually watched the VM shut down, but vCenter still thinks it is running. Also you can see that esx4.corp.com has an error to do with DRS. This wasn’t there before I had this problem, so they could be related. The operative words are COULD BE – in 17 years of doing IT, I have often taken a walk down a blind alley, thinking something was the source/cause/symptom of a problem when it was actually unrelated. As time goes by you learn to treat error messages with caution.

You can see that the phrase “Another task is already in progress…” is there, and I saw this a number of times when I tried the hot-clone. It’s perhaps at this point that I should show you the collection of error messages I had gathered so far.

The multiple errors you see on the “Clone virtual machine” task are actually me trying, and then trying again – then trying again. I’m nothing if not persistent. But I do know when to give up and admit defeat. You can see that even the “Initiated guest OS shutdown” was tried multiple times, each failing with the same error. Admitting defeat is an important part of troubleshooting. Realizing when you're out of your depth, and when you need to call upon external help/resources, is important. I’ve often been like a dog with a bone with problems – wasting many hours on a problem without getting anywhere.

I want to lay my cards on the table here. This esx4.corp.com box has always been a bit flaky. I’m convinced there’s a problem with vmnic0 which doesn’t show up as a conventional red X next to the network card. The local RAID controller card seems to have an issue – which I worked around by booting ESX4i from a USB stick. But worst of all, none of my HP DL G1 Proliants are actually on the supported list of servers or on the HCL. So you could argue that if you are not following the supported route to the letter then it's your own problem (assuming it is a hardware problem, which incidentally I don’t think it is). These servers were once on the HCL for ESX 3.x, but they fell out of warranty and off the HCL sometime last year if I remember rightly, and I just can’t afford to buy new servers every time HP or VMware no longer support them…

What was I doing when this problem happened?

Well, I was doing a live hot-clone of my VMs from one piece of storage to another. A bit more specifically, I have a bunch of VMs which I use to manage my remote access to a colocation facility about a one-hour drive from where I live. I have a domain controller (dc1), a Citrix MetaFrame/Presentation Server, SQL server, vCenter Server, a VMware View server & a virtual desktop… I wanted copies of these on my EMC Clariion and NetApp storage. I won’t bore you with the reason why – I just did. As I need these VMs on most of the time, I decided to use the hot-clone feature. Most of the hot-clones were successful, and went through without a hitch. But I received cryptic error messages on two of the VMs, and in the end, I thought I would shut those VMs down and just cold migrate them. You can’t go wrong with a cold migrate, can you? Well, you can if a previous process is hung, and it won’t allow you to power off the VM.

Why is this a big deal?

Well, it isn’t really. Look, this is a lab environment, and the quickest and simplest way to possibly fix this problem is to reboot esx4.corp.com and then crank up the VMs again. But I want to avoid the Microsoft approach to all problems. It really ticks me off when I find students rebooting ESX hosts in class – without telling me. Like the reboot is the fix for all problems – it is if you're running Windows 98SE, my friend, but not if you're running a hypervisor with tens of VMs on it. I usually say to students who do this – look, if you have a problem you can’t resolve, come to me, and I will try my hardest to fix it, track & trace the process and report back to you. If I still can’t get it fixed within N minutes/hours and it's stopping your progress in the course, then I might just drop the box. If you want to teach, in my book you’re best off teaching by example – and the more who do as I do, the better for my credibility as an instructor, I think…

But anyway, why is it a problem? Well, if vCenter thinks a VM is powered on when it isn’t – you can’t power it on. It’s unavailable. In a weird, paradoxical/catch-22 way – you can’t power on what is already powered on, even if it actually isn’t! Think about what Donald Rumsfeld said about known unknowns, and you're in similar territory.

What’s the cause?

Well, when I’ve had this problem in the past – the way I’ve understood it is that a kind of “disconnect” between the management system (vCenter) and the hypervisor (ESXi) has occurred. In the past I’ve often found that the process (the running VM) isn’t there on the ESX host, but vCenter thinks it's still running. It’s fair to say I’ve also had the problem the other way round: ESX is still running the process (the VM), but vCenter thinks the VM has stopped. And yes, you're right – you can’t power on something that vCenter thinks is already powered on! What causes this orphaned state? Well, perhaps vCenter sends the instruction to shut down the VM, and that arrives at the ESX host – but some kind of communications error occurs which results in vCenter never being told that the VM has been stopped. In this case I think it's because some other process (the hot-clone) has become stuck/hung, which is stopping other tasks from completing.
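
The two kinds of mismatch described above can be pictured as two views of the same inventory that disagree. Here is a minimal Python sketch of that idea. It is not anything vCenter or ESX actually runs: the function name and the dictionaries are hypothetical, and every VM name apart from rtfm-xppc and dc1 is invented for illustration.

```python
def find_state_mismatches(vcenter_view, host_view):
    """Compare the power state vCenter reports with what the host reports.

    Both arguments are plain dicts of {vm_name: True if powered on}.
    They are hypothetical stand-ins for data gathered by hand, for
    example from the vSphere Client and from the host itself.
    """
    mismatches = {}
    for name, vc_on in vcenter_view.items():
        # A VM with no entry on the host is treated as having no process.
        host_on = host_view.get(name, False)
        if vc_on and not host_on:
            mismatches[name] = "vCenter says on, host has no running VM"
        elif host_on and not vc_on:
            mismatches[name] = "host is running the VM, vCenter says off"
    return mismatches


# Illustrative data only, not captured from a real environment.
vcenter_view = {"rtfm-xppc": True, "dc1": True, "test-vm": False}
host_view = {"dc1": True, "test-vm": True}

for vm, problem in sorted(find_state_mismatches(vcenter_view, host_view).items()):
    print(vm, "->", problem)
```

Either kind of mismatch leaves the VM stuck, because the power operations on offer depend on the state the management layer believes, not on what the host is really doing.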

What’s the fix?

Well, I don’t know – I’ve got a number of tricks up my sleeve. I could dive in and try these steps first – but I’m not going to. I’m going to do it the way I think I should do it. Logically, systematically and like a doctor would. That’s right, I’m going to google-wack the error, and see if someone else has had the problem before, and fixed it.

Google is your friend

Here’s a screen grab of the error message I’m going to google-wack. There are a couple of good teaching points here. Notice how a right-click allows you to copy the exact text of the error to the clipboard. Why don’t vendors just put a hyperlink in their products that says “Let me google this for you…”? Also notice how the detailed description more or less repeats what is in the body of the error message. Mmm, that pretty much sums up error messages.

Ever had this? “Error 88389 has occurred”, your system says – so you go to your event system (whatever that might be) and in the event log it says – “Error: Error 88389 – Please consult your administrator”. It’s at this stage you exclaim – BUT I AM THE ADMINISTRATOR. You get the picture. This won’t be new to you, I imagine. It’s one of the occupational hazards of IT. You get used to it. But after 17 years of IT, I do find it a bit wearisome that error messages are still somewhat cryptic and unhelpful. With that said, the GOOD THING about Error 88389 is that as a string it's a very discrete piece of text you can search on. The worst, as you might know, reads: “Error: An Error has occurred”

Joking apart, here’s why google is your friend. Frequently, I have students tell me in a very long-winded way about a problem they have had. Often this is a problem they have been living with for some time. Sometimes I’ll have seen it and other times I haven’t. If I have seen the problem – I tell them the whys and wherefores. But if I haven’t, for fun, I sometimes crank up Google on the classroom projector, and type in a word-for-word description of their problem. You know what? Sometimes I hit pay-dirt and there’s the solution at the top of the page. I’m not trying to make the student look dumb, but when it works – I ask the question – what am I doing that someone in the company they work for couldn’t have done in the same time? Below is the result of my google-wack – and bear in mind I have no idea (yet) what the results will be:

Up until this point you probably thought I was being a little facetious. But I think my point is admirably made. If you are interested in reading this forum thread, you can follow this link:

http://communities.vmware.com/thread/172336

In truth the results of a google-wack, especially if it gives you 588,000 responses, can be inconclusive to say the least. There was only one result that was directly relevant (I don’t think my problem has anything to do with an increasing rate of decarbonization), and even that forum thread took some reading – it wasn’t a 100% clear case of “this is the problem, and here’s how to fix it”. It was very much a muddle of different reasons, experiences and solutions. I think that’s a skill in its own right: reading many forum posts, and separating the wheat from the chaff. Who really looks like he’s fixed the issue, and whose problem is closest to yours? It’s worth saying that although the post is from late 2009, some of the respondents are actually talking about ESX3…

To summarize the forum post this is what my pals on the forum thought. It's a pretty mixed bag as you might expect:

  • It’s a problem with resource pools and pools.xml
  • It’s a problem with HA: disable and re-enable HA and your problem will go away
  • It’s a problem with the resource pools, and a large snapshot. Get rid of the snapshot. Disable and re-enable DRS and your problem will go away
  • It's a problem with VMTools being mounted on a VM: disable/unmount and your problem will go away
  • Some lengthy commands at the Service Console involving rpm/grep/tail, also associated with the pools.xml file
  • Maintenance mode, disconnect and remove from cluster, connect VI client direct to host, remove resource pools, reconnect to cluster
  • Later on one guy gives a link to a KB article. However, it applies to ESX3 not ESX4…

As you can see, that’s a pretty long list. So what's worth investigating and what’s not? Well, given that the error message specifically mentions DRS, I think HA is a red herring. I look at the rest of the list and think – out of all these possible things I could do, which is the easiest, least intrusive and least likely to generate an error on top of another error? The other thing that worries me is that so many of them relate to ESX3. Also, some of the respondents talk about using SSH to get into the ESX hosts. Well, as the affected box runs ESX4i, that isn’t directly an option – some of the commands they say you should run aren’t even available on ESXi.

I decided to hold back on the forum post for the time being – but keep it in reserve in case all my other attempts failed. I knew I was going to have my work cut out – because much of the edge-troubleshooting I had in mind would need a service console – and ESXi only has its “tech support” mode. But this was my rough plan of action, studiously avoiding reboots and anything considered over-intrusive. My main goal is to get the VM working again; my second goal is to clear the red alarm next to the ESX host.

  1. Check that no VM has a stalled VMTools install…
  2. Point the vSphere Client directly at the ESX host – and see if I can power the VM off from there
  3. Use TechSupport mode to see if the VM really is running and see if ESX4i has vm-support – I remember it had the ability to kill a VM
  4. Restart vCenter Services
  5. Restart ESX Management Services
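
The "least intrusive first" thinking behind that plan can be made explicit by scoring each candidate fix and sorting. This is only a sketch in Python: the `Candidate` type, the `plan` function and, above all, the disruption scores are invented for illustration, so treat the numbers as placeholders rather than measurements.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    action: str
    disruption: int  # 0 = read-only check, 5 = takes every VM on the host down


def plan(candidates):
    """Return the candidates ordered from least to most disruptive."""
    return sorted(candidates, key=lambda c: c.disruption)


# Scores are hypothetical; adjust them to suit your own environment.
candidates = [
    Candidate("Reboot the ESX host", 5),
    Candidate("Restart ESX management services", 4),
    Candidate("Check for a stalled VMware Tools install", 0),
    Candidate("Restart vCenter services", 3),
    Candidate("Power off the VM with the vSphere Client on the host", 1),
    Candidate("Inspect running VMs from Tech Support Mode", 2),
]

for step, candidate in enumerate(plan(candidates), start=1):
    print(step, candidate.action)
```

The reboot sits at the bottom of the sorted list on purpose: it stays available, but only after everything cheaper has been tried.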

Step 1: VMware Tools install has stalled?
 

Nothing. All VMs were good on the VMTools install. I checked by selecting each VM in turn to see whether any of them was in the middle of a Tools install. That was quite easy to do with a small number of VMs, but I would have had my work cut out to validate this if I had many VMs to check. The kind of thing I was looking for was this on the right-click of a VM:

But there was no VM in this state. I tried this on one of my VMs that had been recently created – and I’m pleased to say the upgrade did work fine. So at least other VMs on the SAME ESX host were unaffected. However, I couldn't see any bulk method in the vSphere Client of IDing all the VMs that may be in this “install” state. Even the VMware Tools Status column doesn’t show this. The VM called vmnic3-vm still says it is “out of date” – even though the right-click says “End VMware Tools install”

I have a feeling that the only way to figure this out – would be with PowerCLI – looking for VMs with connected CD-ROMs… something like this:

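# Assumes you are already connected with Connect-VIServer
$vms = Get-VM
Write-Output "VMs with CD-ROM 'Connected':"
foreach ($vm in ($vms | Where-Object { $_ | Get-CDDrive | Where-Object { $_.ConnectionState.Connected } })) {
    # Print each VM that has at least one connected CD-ROM drive
    Write-Output $vm.Name
}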

Of course I didn’t write this script. I’m far too dumb for that. My friend google-wack did. Or more specifically, I found a very long script, of which I stole a small part, on Anders Mikkelsen's blog – he has a handy script which enumerates all the CD/Floppy/Parallel/Serial devices - http://www.amikkelsen.com/?page_id=91 It’s quite a popular thing to want to do, considering that these devices have a high chance of breaking VMotion and, as a consequence, DRS, DPM, VUM and so on.

Unfortunately, the VMware Tools idea was a dead end.

Step 2: Try to Power off VM from ESX host directly

When all else fails in vCenter – you can always try cranking up the vSphere Client against an ESX host and see what you can fix there. I was able to log in to the ESX host – and the UI showed the VM was still powered on. However, when I reached for the power button – all the options were dimmed and unavailable.

Step 3: Tech Support Mode & VM-Support -X
 

Apparently, Tech Support Mode in ESXi should only be used in conjunction with VMware Support. I don’t have any support. So I got into Tech Support Mode, and started to dig about. You get into Tech Support Mode in ESXi by pressing Alt+F1 on the keyboard – and then typing the word “unsupported” [ENTER]. Remember to don safety goggles when doing this – if you are working without VMware Support.

The first thing I did was check to see if vm-support was there. I’ve run it on an ESX Classic host, but never on ESXi. It was there and so I was happy. Sometimes there is a bit missing in ESXi which you would expect to find on an ESX Classic host. I used the command

vm-support -x

The vm-support -x command prints a list of World IDs for the VMs running on an ESX host. In the world of the vmkernel on ESX, a World ID is very much like a PID or Process ID in Linux. It’s just a number allocated to a process. To the vmkernel, the VM (or its .vmx process) is just another process amongst others – albeit a very important one.

I would need the output of this to use vm-support -X wid to kill the VM process. In case you don’t know, vm-support -X wid gives you the option to abort a VM to collect debugging information. Years ago, I would have used ps -ef | grep vmx to find the PID value, and then used kill -9 against it – but I found that as ESX 3 matured you couldn’t carry out these sorts of tasks anymore – vm-support -X became my alternative.

The interesting thing was that vm-support -x listed all the VMs on the ESX host – but the VM that looked powered on in both clients wasn’t there. For good measure I ran esxtop on the other ESX host in the cluster (esx3.corp.com) – and found the VM wasn’t there either. Great. I have no process, but both vCenter and ESX think I do.
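
With more than a handful of VMs, eyeballing that listing gets tedious, so here is a small Python sketch of the same check. Big caveat: it ASSUMES the listing contains one line per running VM in the form `vmid=<number> <vm name>`, so confirm that against the real output on your own host first. The sample text and its World IDs are made up, not captured from my ESXi box, and `parse_worlds` is a hypothetical helper.

```python
import re

# ASSUMPTION: one "vmid=<number> <vm name>" line per running VM.
# Verify against the real listing on your own host before relying on this.
WORLD_LINE = re.compile(r"^\s*vmid=(\d+)\s+(.+?)\s*$")


def parse_worlds(listing):
    """Return {vm_name: world_id} for every line that matches WORLD_LINE."""
    worlds = {}
    for line in listing.splitlines():
        match = WORLD_LINE.match(line)
        if match:
            worlds[match.group(2)] = int(match.group(1))
    return worlds


# Made-up sample text; header and blank lines are simply ignored.
sample = """some header text

vmid=5121   dc1
vmid=5388   vmnic3-vm
"""

worlds = parse_worlds(sample)
for vm in ("dc1", "rtfm-xppc"):
    if vm in worlds:
        print(vm, "is running as world", worlds[vm])
    else:
        print(vm, "has no world on this host")
```

A VM that the clients show as powered on but that has no world in the listing is exactly the "no process, but everyone thinks there is one" situation described above.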

Step 4/5: Restart Management Services

In a way the above wasn’t such a bad thing. It felt more like both ESX & vCenter were a little bit confused. The process (the VM) had been shut down, they just thought it hadn’t. I almost considered doing 4/5 first, but I thought it would be more interesting to do 1/2/3 first. So the plan was to restart the vCenter service and the management services on the ESX host in the hope that the green arrow next to the VM would clear – and I would be able to power the VM back on. I was running a vm-support -X on a VM (just to confirm it would kill a VM properly on ESXi), so I decided to restart the vCenter (vpxd.exe) service first, wait for the vm-support to finish, and then restart the management services on the ESX host if that didn’t help. I was crossing my fingers at this point – because all that was left after that was an unwritten Step 6: rebooting the esx4.corp.com host. Anyway, I restarted the VMware vCenter Service – and logged in. Unfortunately, it didn’t make a blind bit of difference. So I then decided to restart the management services on the ESX host. It’s an ESXi host, remember, so I had to use the DCUI (Direct Console User Interface).

I had another idea. What if I could VMotion all the VMs off esx4.corp.com (which includes a bogus process that shouldn’t even be there)? If I could empty the ESX host of VMs, I could then put it into maintenance mode and do a reboot. Yeah, I know that breaks the rule of not using reboots to fix every problem – but heck, if my VMs remain up – what do I care? To be honest I didn’t have an awful lot of confidence this would work. But I figured it was better than some of the recommendations from the forum post earlier… Even if I had to kill the box to clear this bogus VM, so long as the VMs I care about were just running somewhere else – who would worry?

Fortunately, restarting the management services did help. It unlocked my VM and the DRS error message went away.

Lessons Learned:

1. If you have a problem with an ESX host – cut to the chase and always restart management services first

2. Google is your friend

3. List the things you can check/do – that you think will fix the problem – that require the least changes, the least admin rights – and are least likely to make a bad problem even worse

4. You think you're being clever when you pull out vm-support and esxtop – but sometimes the simplest solution is just to restart something (BUT DON’T REBOOT). Restart the client, restart vCenter, restart the management services on the ESX host

This was first published in March 2010
