Recovering VMFS partitions with VMware ESX troubleshooting

Reinstalling servers is a common task for system administrators, but when a typical process goes awry, it can wreak havoc. In this two-part series on troubleshooting VMware ESX, a VMware expert mistakenly deletes the extra partitions that appeared after a VMware Consolidated Backup proxy host reinstall, nearly destroying his Virtual Machine File System and RDM data.

In the course of reinstalling a VMware Consolidated Backup (VCB) proxy host, our expert encountered two problems that can be major headaches for accessing files and data in the Virtual Machine File System (VMFS). First, he deleted partitions from his ESX logical unit numbers (LUNs), which could have meant losing access to all data in the VMFS. The following article features his workaround to that problem. He then discovered he couldn't restore virtual raw device mapping (RDM) in the same way, which also meant potentially losing a massive amount of valuable historical data. Part two of this series features his workaround to that issue.

As a VMware expert, I am supposed to know that I should disconnect my storage area network (SAN) from a server before reinstalling the server. The server in question was a backup host instead of an ESX host, so I mistakenly thought that Windows would disconnect the SAN.

Deleting mystery partitions is a bad idea

After the reinstall, my server had more partitions than before the reinstall. Without too much thought, I deleted them -- bad idea. The partitions I deleted happened to be those used by my VMware ESX LUNs. So if I were to reboot my ESX servers, they would permanently lose access to their data.

At this point, I knew that, as of the prior day, at least part of the irreplaceable data had been backed up and taken off-site. But I still needed to restore the Virtual Machine File System (VMFS) and raw device mapping (RDM).

VMware ESX troubleshooting 101: Check the VMware ESX hosts

The first step in troubleshooting a VMware ESX problem is checking the ESX hosts. I discovered that the virtual machines (VMs) were still running. Apparently, ESX is robust enough to continue operation without its physical partitions. Since ESX was still running, and I wanted to assess the damage, I went to the service console and ran the command fdisk -l. The output listed my LUNs with empty partition tables.

I expected instead to see the usual partition listing for each LUN. The difference between the two was that the first three VMFS partitions were missing.
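
The original output files have not survived, but as a rough illustration (a reconstruction based on the cylinder and sector counts shown later in the fdisk session, not the author's actual output), a healthy VMFS LUN would list a partition along these lines, while on the damaged LUNs the Device lines were simply absent:

```text
Disk /dev/sda: 322.1 GB, 322118415360 bytes
255 heads, 63 sectors/track, 39162 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1               1       39162   314568733+  fb  Unknown
```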

Restoring the VMFS

After some research and some help from the tech folks who follow me on Twitter, I discovered a link to a presentation on restoring a VMFS when a partition table has been destroyed. The PDF proved invaluable.

To summarize the information in that document: to restore the VMFS, use fdisk to rebuild the partition table, move the start block to the proper alignment, and then refresh and rescan to regain access to the partitions.

Using fdisk, however, can be dangerous and requires caution. The steps are:

- Add a new primary partition number 1
- Take default first and last cylinders
- Change the partition's system id to fb, the VMFS partition id
- Move the beginning of the data in the partition to sector 128, the offset VMFS uses
- Write the new partition table to the disk and exit
- Repeat for all lost VMFS partitions.
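
The 128-sector offset in the fourth step is not arbitrary: with 512-byte sectors, it places the start of the VMFS data on a 64 KiB boundary. A quick sanity check:

```python
SECTOR_BYTES = 512          # sector size assumed by ESX-era fdisk
VMFS_START_SECTOR = 128     # the offset entered in fdisk's expert 'b' command

# Byte offset where the VMFS data begins on the LUN
offset_bytes = VMFS_START_SECTOR * SECTOR_BYTES
print(offset_bytes)         # 65536, i.e. a 64 KiB alignment boundary
```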

This is what those instructions translated into for /dev/sda and /dev/sdc within my ESX host's service console:

# fdisk /dev/sda
Device contains neither a valid DOS partition table, nor Sun, SGI or OSF disklabel
Building a new DOS disklabel.

Changes will remain in memory only, until you decide to write them. After that, of course, the previous content won't be recoverable.
The number of cylinders for this disk is set to 39162.
There is nothing wrong with that, but this is larger than 1024,
and could in certain setups cause problems with:
1) software that runs at boot time (e.g., old versions of LILO)
2) booting and partitioning software from other OSs
(e.g., DOS FDISK, OS/2 FDISK)
Warning: invalid flag 0x0000 of partition table 4 will be corrected by w(rite)
Command (m for help): n
Command action
e extended
p primary partition (1-4)
Partition number (1-4): 1
First cylinder (1-39162, default 1):
Using default value 1
Last cylinder or +size or +sizeM or +sizeK (1-39162, default 39162):
Using default value 39162
Command (m for help): t
Selected partition 1
Hex code (type L to list codes): fb
Changed system type of partition 1 to fb (Unknown)
Command (m for help): x
Expert command (m for help): b
Partition number (1-4): 1
New beginning of data (63-629137529, default 63): 128
Expert command (m for help): w
The partition table has been altered!

I tested my handiwork by rebooting the single ESX host that wasn't running any VMs and rescanned its LUNs. Thankfully, the VMFS had been successfully restored.
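
For readers curious about what fdisk's write actually did on disk, the following sketch builds the same kind of partition entry by hand against a scratch disk-image file (the image path and size are hypothetical; the byte layout is the standard MBR format, with type 0xfb for VMFS and a start at sector 128):

```python
import struct

SECTOR = 512
PART_TABLE_OFFSET = 446      # first of the four 16-byte MBR partition entries
SIGNATURE_OFFSET = 510       # 0x55AA boot signature written by fdisk's 'w'
VMFS_TYPE_ID = 0xFB          # the "fb" system id set in the fdisk session
START_LBA = 128              # 64 KiB-aligned start set with fdisk's expert 'b'

def vmfs_partition_entry(start_lba, num_sectors):
    """Pack one 16-byte MBR entry; CHS fields zeroed, LBA fields used."""
    return struct.pack("<B3sB3sII",
                       0x00,             # status: not bootable
                       b"\x00\x00\x00",  # CHS of first sector (unused here)
                       VMFS_TYPE_ID,     # partition type id
                       b"\x00\x00\x00",  # CHS of last sector (unused here)
                       start_lba,        # LBA of first sector
                       num_sectors)      # number of sectors in the partition

def write_vmfs_mbr(image_path, total_sectors):
    """Write a single VMFS-typed partition entry plus the boot signature."""
    entry = vmfs_partition_entry(START_LBA, total_sectors - START_LBA)
    with open(image_path, "r+b") as f:
        f.seek(PART_TABLE_OFFSET)
        f.write(entry)
        f.seek(SIGNATURE_OFFSET)
        f.write(b"\x55\xaa")

if __name__ == "__main__":
    # Hypothetical 1 MiB scratch image standing in for /dev/sda.
    with open("scratch.img", "wb") as f:
        f.truncate(1024 * 1024)
    write_vmfs_mbr("scratch.img", (1024 * 1024) // SECTOR)
```

This only recreates the table entry, of course; as the presentation stresses, the VMFS data itself was never touched, which is why rebuilding the table restores access.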

But my problem was only half solved. I could not restore the virtual RDMs in the same way, because the RDMs held a Linux LVM2 volume.

VMware ESX troubleshooting 202: Backing up virtual machines

Since the ESX VMs were still running, I performed a system backup first. After disconnecting my VMware vCenter (formerly VirtualCenter) Server and the VMware Consolidated Backup (VCB) proxy server from the Fibre Channel network, I rebooted and reinstalled the proxy with Microsoft Windows 2003. Next, I reinstalled Microsoft SQL Server, VMware vCenter Server and my backup software, Vizioncore vRanger Pro.

Once Vizioncore vRanger Pro was installed, I proceeded to make a backup of every ESX VM, which took almost 12 hours. One VM used a virtual RDM that was 299 GB. With no VCB integration (I had yet to hook up the Fibre cables), backing up that particular file took quite a while. This particular virtual RDM holds my file server and all related important files. It also happens to be a Linux LVM2 disk, which is not abnormal for Linux users, but does make it difficult to restore.

Lessons learned

It ended up taking four days to resolve the problem caused by deleting the seemingly extraneous partitions. The lesson here is that when a routine process seems abnormal, think before you act. Even though I was eventually able to restore the VMFS and the RDMs, acting brashly and deleting those partitions cost me four days of time.

If you find yourself in a similar quandary, remember that running VMs are still accessible. You should immediately back them up. Virtual RDMs, however, are read from the raw disk rather than from the kernel's in-memory data structures, so a raw backup captures the deleted partition table, not the data still in use. Back up RDMs using some form of file copy instead.
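
As a sketch of that file-copy approach (the function name and paths are hypothetical, and tar is just one option; rsync or a backup agent works equally well), the idea is to copy data through the mounted filesystem rather than imaging the raw device:

```shell
backup_rdm() {
    # $1: mount point of the RDM's filesystem; $2: destination archive.
    # Archive what the running kernel sees (the live filesystem), not the
    # raw disk, whose partition table no longer matches the data in use.
    tar -czf "$2" -C "$1" .
}

# Hypothetical usage on the mounted 299 GB file-server RDM:
# backup_rdm /mnt/rdm-fileserver /backup/rdm-fileserver.tar.gz
```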

ABOUT THE AUTHOR: Edward L. Haletky is the author of VMware ESX Server in the Enterprise: Planning and Securing Virtualization Servers. He recently left Hewlett-Packard Co., where he worked on the virtualization, Linux and high-performance computing teams. Haletky owns AstroArch Consulting Inc. and is a champion and moderator for the VMware Communities Forums.

This was last published in February 2009
