In a large, highly structured and well-designed environment, infrastructure and cabling layout is very important, especially when it comes to meeting standards. Once all the little design issues are ironed out, the standard becomes the gold standard, because it works for ages and everything runs smoothly.
But what happens when a new server gets its Ethernet cable -- as per the defined standard -- and yet, from the vSphere client side, it appears to be cabled completely wrong?
When you have 16 or more uplinks, it can become difficult to keep track. Ethernet ports on the VMware management side, as well as on the server console, can appear connected incorrectly or in the wrong spot.
To keep track of everything, it's very important to keep the physical Ethernet card layout in order. If the cards are in the wrong PCI slots, there will be very similar issues with incorrect network interface card (NIC) layout. A visual check is always a good idea to ensure the layout is physically correct before assuming the worst.
Diagnosing the PCI bus enumeration error
This strange phenomenon happened to yours truly a few months ago. At first, it appeared someone had plugged the cables in wrong. Once the cables were checked and verified as correct, it was concluded that it had to be the server itself. The culprit turned out to be the PCI bus enumeration on the server, which had changed mid-model.
Essentially, PCI bus enumeration, as the name implies, interrogates the PCI bus and assigns the NICs -- or other devices -- a unique ID based on the PCI slot number and the order in which the devices are encountered. Anything that can be plugged into a PCI slot will be enumerated when the PCI bus is powered on and allocated a unique ID. This can include things such as Fibre Channel cards and all kinds of other cards.
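To make the effect concrete, here is a deliberately simplified Python model of that ordering. It is a sketch, not how ESXi or the server firmware actually enumerates devices: it just sorts ports by PCI address and hands out vmnic labels in that order. The card names and addresses are invented for illustration, and treating the address fields as decimal numbers is an assumption.

```python
def pci_sort_key(address):
    """Split an address such as '000:005:00.2' into a sortable tuple."""
    segment, bus, device_function = address.split(":")
    device, function = device_function.split(".")
    # Assumption: every field is written with decimal digits.
    return (int(segment), int(bus), int(device), int(function))


def label_ports(ports):
    """Label ports vmnic0, vmnic1, ... in PCI address order.

    ports maps a human-readable port name to its PCI address.
    """
    ordered = sorted(ports, key=lambda name: pci_sort_key(ports[name]))
    return {name: f"vmnic{index}" for index, name in enumerate(ordered)}


# Hypothetical layout on a server built to the standard.
reference = {"card A port 0": "000:005:00.0", "card B port 0": "000:009:00.0"}
# Same cards, but card A now sits behind a higher bus number.
moved = {"card A port 0": "000:011:00.0", "card B port 0": "000:009:00.0"}

print(label_ports(reference))
# → {'card A port 0': 'vmnic0', 'card B port 0': 'vmnic1'}
print(label_ports(moved))
# → {'card B port 0': 'vmnic0', 'card A port 0': 'vmnic1'}
```

Nothing about the cabling changed between the two runs, yet the labels swap, which is the same symptom described here.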
When Ethernet cards are not inserted in the same slots as on the other, "correct" servers, the enumeration will occur in a different order. Live network cards then receive different IDs than they would have had in the correct slots, so they appear wrong.
Sometimes, however, vendors change motherboard manufacturers or designs halfway through a generation. The redesign has the same effect as if the NICs were enumerated in a different sequence. This is what had happened with certain later-generation HP boards. One day, the NIC layout is fine. And the next, it is completely wrong.
Fixing the PCI bus enumeration error
So, how does an administrator fix this issue and correctly reassign the NICs to conform to standards? The answer isn't pretty: It requires using secure shell (SSH) to edit files on the host, plus multiple reboots. The file "/etc/vmware/esx.conf" contains all the PCI enumeration data and other bits related to the host itself. Before editing this file, make a backup, because if the file is misconfigured, the easiest way to recover is to reinstall ESXi, which is less than ideal.
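One way to take that backup is a small Python helper that makes a timestamped copy. This is only a sketch: the function name and the backup directory are hypothetical, and a plain copy over SSH does the same job.

```python
import shutil
import time
from pathlib import Path


def backup_file(source, backup_dir):
    """Copy source into backup_dir under a timestamped name and return the new path."""
    source = Path(source)
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = backup_dir / f"{source.name}.{stamp}.bak"
    shutil.copy2(source, target)  # copy2 also preserves timestamps and permissions
    return target


# Example call (the destination directory is a placeholder, adjust to your host):
# backup_file("/etc/vmware/esx.conf", "/path/to/backups")
```

Keeping the copy somewhere that survives a reboot matters, because the whole procedure involves restarting the host several times.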
When dealing with the NICs, there are two portions that need to be edited. The first contains just the generic numbering assigned against the NIC, as shown in Figure 1.
Depending on how many NICs there are in the server, you may have over a dozen different entries. Each entry, in effect, places a label for the device against its PCI bus ID.
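With a dozen or more entries, it can help to list the labels programmatically before changing anything. The Python sketch below assumes each entry has the form `/device/<PCI ID>/vmkname = "vmnicN"`, as in the example lines quoted later in this article; the function names are hypothetical, and real files may contain variations this pattern does not cover.

```python
import re
from collections import defaultdict

# Assumed entry shape: /device/<PCI ID>/vmkname = "<label>"
VMKNAME_LINE = re.compile(r'^/device/(?P<device>[^/]+)/vmkname = "(?P<label>[^"]*)"$')


def read_vmknames(conf_text):
    """Return a dict of PCI ID -> label for every vmkname entry in the text."""
    labels = {}
    for line in conf_text.splitlines():
        match = VMKNAME_LINE.match(line.strip())
        if match:
            labels[match.group("device")] = match.group("label")
    return labels


def duplicate_labels(labels):
    """Return label -> list of PCI IDs for labels used by more than one device."""
    devices_by_label = defaultdict(list)
    for device, label in labels.items():
        devices_by_label[label].append(device)
    return {label: ids for label, ids in devices_by_label.items() if len(ids) > 1}
```

Running `duplicate_labels(read_vmknames(text))` on a copy of the file after each edit is a cheap sanity check that no two devices ended up with the same label.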
The second part is:
/net/pnic/child/name = "vmnic6"
/net/pnic/child/mac = "xx:xx:xx:xx:xx:xx"
/net/pnic/child/virtualMac = "xx:xx:xx:xx:xx:xx"
This second bit is where the VMNIC (VMware terminology for Ethernet cards) is assigned its MAC addresses. I have substituted "x" in the example. The fix essentially consists of changing the NIC numbers to replicate the original layout or how the PCI bus originally had them.
Although it's labor intensive, I suggest doing one at a time. If you make a mess at this stage, it can result in phantom network ports and all different kinds of issues. Also note that this process has to be repeated on each affected host. Don't copy and paste this file between hosts, because it won't work and will break your installation.
The best way to approach fixing this, based on experience, is to start off by ensuring your out-of-band management works fine. That way, if you should somehow lose access, you can still get remote console access.
Next, remove all the other cables. One thing to note is that onboard network ports will always be enumerated first, and therefore, they are not susceptible to this issue. In an ideal world, at least one of your management NICs should be located on the onboard NIC. This way, you can fire up the vSphere client and look at the networks, as well as the broken network switches -- if there are any.
Once that is done, individually plug in one NIC at a time. Note down what it appears as on the vSphere client network adapter page. Assuming you use VLANs in your environment, the process of identifying the correct cables to NIC is quite simple. You should be able to identify which NICs have been switched based on what VLANs they are providing.
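If you write down the VLANs seen on each NIC as you go, the comparison against the standard can be sketched in a few lines of Python. The function name, NIC labels and VLAN IDs below are made up for illustration. Matching on VLANs alone is also ambiguous when two NICs carry the same VLANs, which is why the function returns a list of candidates rather than a single answer.

```python
def find_swapped_nics(expected, observed):
    """Map each mislabeled NIC to the label(s) it should carry.

    expected: label -> set of VLAN IDs according to the cabling standard
    observed: label -> set of VLAN IDs actually seen in the vSphere client
    """
    corrections = {}
    for label, vlans in observed.items():
        if expected.get(label) == vlans:
            continue  # this label already matches the standard
        candidates = sorted(name for name, wanted in expected.items() if wanted == vlans)
        corrections[label] = candidates
    return corrections


# Hypothetical notes from plugging in one cable at a time.
expected = {"vmnic2": {10, 20}, "vmnic6": {30}}
observed = {"vmnic2": {30}, "vmnic6": {10, 20}}
print(find_swapped_nics(expected, observed))
# → {'vmnic2': ['vmnic6'], 'vmnic6': ['vmnic2']}
```

An empty candidate list means the VLANs seen on that NIC match nothing in the standard, which points to a cabling or switch-port problem rather than an enumeration one.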
To correct the NICs, use the PuTTY SSH client to edit the file (/etc/vmware/esx.conf) and locate the NIC that is misplaced. Then, figure out where it should be. For example, if your VMNIC 6 shows up where VMNIC 2 should be, go through the esx.conf file and find the device ID as detailed earlier:
/device/000:005:00.2/vmkname = "vmnic6"
Once you find that, you want to replace it with:
/device/000:005:00.2/vmkname = "vmnic2"
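The same substitution can be expressed as a small Python function that works on the text of the file. Treat it as a sketch under assumptions: the function name is hypothetical, it only touches the vmkname line, and it expects the line to look exactly like the example above with no trailing whitespace. It refuses to proceed unless it finds exactly one matching line.

```python
import re


def set_vmkname(conf_text, device_id, new_name):
    """Return conf_text with the vmkname label of device_id replaced by new_name."""
    pattern = re.compile(
        r'^(/device/' + re.escape(device_id) + r'/vmkname = ")[^"]*(")$',
        re.MULTILINE,
    )
    # A function replacement avoids new_name being read as a regex backreference.
    updated, count = pattern.subn(
        lambda match: match.group(1) + new_name + match.group(2), conf_text
    )
    if count != 1:
        raise ValueError(
            f"expected exactly one vmkname line for {device_id}, found {count}"
        )
    return updated


before = '/device/000:005:00.2/vmkname = "vmnic6"\n'
print(set_vmkname(before, "000:005:00.2", "vmnic2"), end="")
# → /device/000:005:00.2/vmkname = "vmnic2"
```

Working on a copy of the text and comparing before and after makes it easier to confirm that only the intended line changed before anything is written back to the host.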
To ensure each change works correctly and to keep track, I find it best to reboot after each configuration change. Otherwise, you could make your job that much harder. A reboot is required to force the server to reread the esx.conf file. Once that NIC shows up correctly, repeat the process for the next one.
Also, if you have the same configuration on several servers, you could speed up this process by noting the Ethernet mismatches against what they should be. That way, once you have done and confirmed a few, you can skip the reboot after each VMNIC change.