TSX-EMEA: Top Support Issues Part II

A presentation from TSX by Darren Brunett. He covers some of the top support issues he deals with in his role as a Senior Technical Support Engineer at VMware.

Presenter: Darren Brunett

Firstly, you might ask where Part I was, as Part II came first on the list of options in the schedule. Joking apart, this was a good session, not least because my website was mentioned in the PowerPoint slides! I met Darren briefly at the London User Group, where he gave presentations on VMFS and other troubleshooting issues. He's an easy-going guy with a great self-deprecating presentation style. Darren ran through some of the top support issues he deals with in his role as a Senior Technical Support Engineer at VMware. These sessions always interest me because students frequently bring the very same problems to me, either during courses or afterwards informally via email. Darren covered a lot of ground, but a couple of the points he made had me reaching for my pen and pad to scribble them down.

Firstly, he covered recovering lost VMFS partitions caused by people having a "Homer Simpson" moment. Generally, if someone removes a VMFS volume and then re-partitions that LUN for another purpose, your chances of recovery are slim. If, however, someone removes a VMFS and then leaves the LUN untouched, there is a good chance of recovering the VMFS. Very simply, it's possible to put the partition table information back in place using fdisk. You do need to use esxcfg-vmhbadevs to find out the Linux /dev/sdN value. After that, it is a case of putting the primary partition back on the disk. fdisk's expert mode is then used to set the partition's starting offset so that it matches the disk alignment the VI Client applies automatically. Anyway, I was very much taken by the process, so I plan to be Homer Simpson soon and give Darren's steps a road test.
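To make the alignment point concrete, here is a minimal Python sketch of the arithmetic behind the expert-mode step. It rests on two assumptions that come from my own notes rather than Darren's slides: that the VI Client aligns VMFS partitions on a 64 KB boundary (start sector 128 with 512-byte sectors), and that fdisk's traditional default start is sector 63. The function names are hypothetical.

```python
SECTOR_BYTES = 512          # classic sector size
ALIGN_BYTES = 64 * 1024     # assumed VI Client alignment boundary (64 KB)


def start_offset_bytes(start_sector, sector_bytes=SECTOR_BYTES):
    """Byte offset on the LUN at which a partition begins."""
    return start_sector * sector_bytes


def is_aligned(start_sector, align_bytes=ALIGN_BYTES, sector_bytes=SECTOR_BYTES):
    """True if the partition start falls exactly on the alignment boundary."""
    return start_offset_bytes(start_sector, sector_bytes) % align_bytes == 0


if __name__ == "__main__":
    # 63 is the assumed fdisk default, 128 the assumed VI Client start sector.
    for sector in (63, 128):
        print(sector, start_offset_bytes(sector), is_aligned(sector))
```

If those assumptions hold, a partition recreated at fdisk's default start would sit at the wrong offset, and the VMFS would not be found where it originally lived. That is why the starting sector has to be moved by hand.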

Darren went on to mention some troubleshooting on the Service Console networking side of things (familiar territory for me), which was when RTFM was given a name check. Later, Darren went on to outline some issues with snapshots. Firstly, he explained how the ability to extend virtual disk sizes with vmkfstools -X is incompatible with the snapshot feature, and currently corrupts the VM's snapshots. He showed how you can find out the original size of the vmdk by viewing the metadata .vmdk (the small text descriptor file) of one of the snapshot delta files. This information, together with vmkfstools, can be used to reduce the vmdk back to its original size. After that, the snapshot can be safely committed to the vmdk.
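As a rough illustration of reading the original size out of a delta's descriptor, here is a small Python sketch. It assumes the descriptor contains the usual extent line of the form access, sector count, extent type, quoted file name, and that the sector count on the delta's extent line reflects the disk size when the snapshot was taken. The sample descriptor text and file name are made up for the example, and the helper name is hypothetical.

```python
import re

SECTOR_BYTES = 512  # extent sizes in a descriptor are counted in 512-byte sectors

# Matches extent lines such as: RW 8388608 VMFSSPARSE "vm-000001-delta.vmdk"
EXTENT_RE = re.compile(
    r'^(RW|RDONLY|NOACCESS)\s+(\d+)\s+(\S+)\s+"([^"]+)"', re.MULTILINE
)


def extent_sizes(descriptor_text):
    """Return (file name, sectors, bytes) for every extent line found."""
    results = []
    for match in EXTENT_RE.finditer(descriptor_text):
        sectors = int(match.group(2))
        results.append((match.group(4), sectors, sectors * SECTOR_BYTES))
    return results


# Hypothetical descriptor contents, for illustration only.
SAMPLE = '''# Disk DescriptorFile
version=1

# Extent description
RW 8388608 VMFSSPARSE "vm-000001-delta.vmdk"
'''

if __name__ == "__main__":
    for name, sectors, size in extent_sizes(SAMPLE):
        print(name, sectors, "sectors =", size // (1024 ** 3), "GiB")
```

The sector count is the number to hold on to: it tells you what size the base disk needs to be returned to before the snapshot is committed.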

He also mentioned how the snapshot management file (.vmsd) gets destroyed when a snapshot is allowed to fill a LUN. Darren pointed out a method of renaming the damaged vmsd and then deleting the last snapshot delta file, which frees up enough space to add another snapshot. Taking that new snapshot rebuilds the vmsd file in a usable state. You can then edit the vmx file to tell the virtual machine to use the last good snapshot. Clearly, some data loss is inevitable (because of the deletion of the last snapshot in the parent/child series), but it does return the VM to a usable state.

Lastly, Darren outlined some interesting networking problems, such as when two NICs in a NIC team are plugged into different VLANs. He showed how you can use esxcfg-info | grep -i -B N hint (where N is the number of preceding lines of context to show) to display useful IP data that can tell you if NICs are on the same or different subnets. Additionally, he pointed out how some Spanning Tree Protocol systems cause unwanted "split-brain" situations in HA. Some STP data takes 50 seconds or more to be propagated around the network. Because HA checks network connectivity more frequently than that, an ESX host can wrongly conclude it has had a network failure. The only workaround appears to be modifying the STP settings so that its data propagates at a faster rate.
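The 50-second figure lines up with the standard 802.1D default timers: 20 seconds of max age, plus 15 seconds each in the listening and learning states. The Python sketch below just does that sum and compares it with an HA detection window. The 15-second HA value is my assumption for illustration, not a number from Darren's slides, and the function names are hypothetical.

```python
# Default IEEE 802.1D timers, in seconds.
MAX_AGE = 20
FORWARD_DELAY = 15  # spent once in listening and once in learning

# Assumed HA failure-detection window in seconds (illustrative only).
ASSUMED_HA_TIMEOUT = 15


def stp_convergence_seconds(max_age=MAX_AGE, forward_delay=FORWARD_DELAY):
    """Worst-case time before a port forwards traffic again after a topology change."""
    return max_age + 2 * forward_delay


def false_isolation_possible(ha_timeout, stp_seconds):
    """True if HA would give up waiting before STP has finished converging."""
    return stp_seconds > ha_timeout


if __name__ == "__main__":
    stp = stp_convergence_seconds()
    print("STP convergence:", stp, "seconds")
    print("False isolation possible:", false_isolation_possible(ASSUMED_HA_TIMEOUT, stp))
```

With the default timers the network can be silent for far longer than HA is prepared to wait, which is why shortening the STP timers is the workaround on offer.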

This was last published in April 2007

