Most of the guides that I have written to date have covered what users should do with VMware in order to look after it properly, with a touch of tender loving care. Now it's time to look at what users should not do in a VMware estate. Some VMware errors can be fixed, while others have massive, potentially job-impacting repercussions.
This tip is by no means exhaustive; it merely offers suggestions that a sysadmin may want to look at and perhaps apply to their infrastructure.
Perhaps one of the simplest tips I can offer is this: when shutting down a host, do so through the client (web or fat client). Don't make the same mistake that I once did and reboot via an SSH console. Yes, it can be done, and assuming the host is in maintenance mode there shouldn't be any issues. The only problem was that I rebooted the wrong host. Luckily, the affected host was also in maintenance mode, as per best practice. I learned my lesson. Although going through the client is more time consuming, it is safer and acts as a useful sanity check.
The cluster admission policy is an often overlooked area of VMware that people do not use properly. Understanding how it works is critical. If an admin wishes to turn off the cluster admission policy, they must ensure the system has enough capacity at all times to cope with the load from a failure of the largest host. Often I have seen the old adage "rack 'em and stack 'em" come crashing down when admins have failed to look at what is needed to support such an infrastructure. It's a really bad idea to put too many virtual machines (VMs) on too few hosts.
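As a rough illustration of the "largest host" rule, here is a minimal Python sketch. It assumes memory is the only constraint and that you have already collected per-host figures by some other means; the `Host` class, the host names and the numbers are all hypothetical and are not taken from any VMware API.

```python
from dataclasses import dataclass


@dataclass
class Host:
    name: str
    memory_gb: float  # memory the host can offer to VMs
    used_gb: float    # memory currently consumed by VMs on this host


def survives_largest_host_failure(hosts):
    """Return True if the cluster load still fits after losing the largest host.

    Simplification: only memory is considered; CPU, reservations and
    per-VM overhead are ignored.
    """
    if len(hosts) < 2:
        return False  # nothing left to fail over to
    largest = max(hosts, key=lambda h: h.memory_gb)
    remaining_capacity = sum(h.memory_gb for h in hosts) - largest.memory_gb
    total_load = sum(h.used_gb for h in hosts)
    return total_load <= remaining_capacity


# Hypothetical three-host cluster.
cluster = [
    Host("esx01", 512, 300),
    Host("esx02", 512, 280),
    Host("esx03", 256, 150),
]
print(survives_largest_host_failure(cluster))  # True: 730 GB load fits in 768 GB
```

A check like this is no substitute for the admission policy itself, but it is the kind of arithmetic worth doing by hand before switching the policy off.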
Companies often use high-end servers and pack them with more than 100 VMs per host. This is all well and good until you need to put the host into maintenance mode for whatever reason, or the host crashes. Restarting 100 VMs on other cluster members will place a huge strain on the infrastructure and potentially create I/O storms. There is also a hard limit to the number of VMs that can be restarted at one time (eight). This means that some VMs will effectively have to be queued before they can be restarted, which leads to extended downtime while they wait to restart on a new host.
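To put a number on that queueing effect, the sketch below takes the limit of eight concurrent restarts at face value and counts how many restart "waves" a failed host would need. The 90 seconds per wave is purely an assumed figure for illustration, and the model is deliberately naive: it treats waves as strictly sequential and ignores storage contention.

```python
import math

# Limit quoted in the text above; treat it as an input, not a constant of nature.
CONCURRENT_RESTART_LIMIT = 8


def restart_waves(vm_count, limit=CONCURRENT_RESTART_LIMIT):
    """Number of restart batches needed when only `limit` VMs start at once."""
    if vm_count < 0 or limit < 1:
        raise ValueError("vm_count must be >= 0 and limit must be >= 1")
    return math.ceil(vm_count / limit)


def estimated_restart_seconds(vm_count, seconds_per_wave,
                              limit=CONCURRENT_RESTART_LIMIT):
    """Naive estimate: every wave takes the same (assumed) amount of time."""
    return restart_waves(vm_count, limit) * seconds_per_wave


waves = restart_waves(100)
total = estimated_restart_seconds(100, seconds_per_wave=90)  # 90 s is hypothetical
print(f"{waves} waves, roughly {total / 60:.1f} minutes")
# prints "13 waves, roughly 19.5 minutes"
```

Even with generous assumptions, the last VMs in the queue wait a long time, which is the practical argument for spreading VMs across more hosts.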
In a similar vein, it is extremely poor practice to use storage that is local to a single host. Doing this means that the VM is effectively tied to that individual host. In the event of host failure, the VM cannot restart on another host because its storage is no longer available.
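One way to spot this problem is to compare which hosts can see each datastore against where each VM keeps its disks. The sketch below works on plain dictionaries that you would populate from your own inventory export; the function name, datastore names and VM names are all made up for the example.

```python
def vms_pinned_to_one_host(datastore_hosts, vm_datastores):
    """Return {vm: [datastores]} for VMs using storage seen by fewer than two hosts.

    datastore_hosts: maps datastore name -> set of hosts that mount it
    vm_datastores:   maps VM name -> list of datastores holding its files
    """
    pinned = {}
    for vm, datastores in vm_datastores.items():
        local_only = sorted(
            ds for ds in datastores if len(datastore_hosts.get(ds, ())) < 2
        )
        if local_only:
            pinned[vm] = local_only
    return pinned


# Hypothetical inventory: one shared LUN and one host-local disk.
datastore_hosts = {
    "san-lun-01": {"esx01", "esx02", "esx03"},
    "esx01-local": {"esx01"},
}
vm_datastores = {
    "web01": ["san-lun-01"],
    "db01": ["san-lun-01", "esx01-local"],
    "test01": ["esx01-local"],
}
print(vms_pinned_to_one_host(datastore_hosts, vm_datastores))
# {'db01': ['esx01-local'], 'test01': ['esx01-local']}
```

Note that a VM with even one disk on local storage is flagged, since that single disk is enough to stop it restarting elsewhere.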
Some people also put "faux" clusters into a VMware environment. These usually require a shared SCSI bus and therefore all the virtual nodes have to reside on the same physical host. At the risk of stating the obvious, this breaks every cluster HA design rule in the book.
Losing that single host means losing the entire cluster to one failure. This may be suitable for a dev environment, but using it in a production setting is risky. Similarly, VMware Fault Tolerance (FT) is not a magic answer to clustering issues. The single-CPU limitation is still a major constraint on the adoption of FT.
Moving on to the more complex side of VMware errors, major version updates can sometimes cause issues. A failure during an upgrade, especially if an external database host is used, will not necessarily stop a guest from working. It will, however, make life harder without centralized management.
Even a snapshot will not save you. When you upgrade, your database schema is usually upgraded too. Rolling back after this point puts your database at risk and, more likely than not, your vCenter database will be garbage. At this point, restoring vCenter and the database tables from backup is the only way forward, assuming you are able to roll back at all. There is a reason why VMware suggests that in-place upgrades not be done. On a side note, upgrading the vCenter appliance is much easier and more straightforward.
If the site in question uses thin provisioning, it should only be set up on either the storage array or the VMware side, not both. Doing both means you are thin provisioning twice over, and it can end in tears for the admin if you are not careful. You should also use the same storage settings cluster-wide.
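The danger of doubling up is easier to see with numbers. The following sketch assumes a single thin LUN backing a single datastore, with all sizes invented for the example. It shows how the overcommit at each layer multiplies rather than adds.

```python
def overcommit_ratios(vm_provisioned_gb, datastore_size_gb, physical_pool_gb):
    """Return (hypervisor, array, combined) overcommit ratios.

    vm_provisioned_gb: total size of thin virtual disks promised to VMs
    datastore_size_gb: size of the LUN as presented to the hosts
    physical_pool_gb:  real capacity behind that LUN on the array
    Assumes one LUN per datastore and one datastore per pool.
    """
    hypervisor = vm_provisioned_gb / datastore_size_gb
    array = datastore_size_gb / physical_pool_gb
    return hypervisor, array, hypervisor * array


# Hypothetical figures: 20 TB promised to VMs on a 10 TB LUN backed by 5 TB of disk.
hv, arr, combined = overcommit_ratios(20_000, 10_000, 5_000)
print(f"hypervisor {hv:.1f}x, array {arr:.1f}x, combined {combined:.1f}x")
# prints "hypervisor 2.0x, array 2.0x, combined 4.0x"
```

Two layers that each look like a modest 2x overcommit leave the VMs promised four times the disk that physically exists, and neither layer's monitoring shows the full picture on its own.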
Lastly, there is an item that many rookie admins overlook: the hardware compatibility list (HCL), which details the supported hardware configurations for VMware. To be fair, most hardware works without an issue, but if you are using hardware that is not on the HCL, your support from VMware will be best effort only. This is not what you want to hear when you have a host-down situation or worse. Save yourself the heartache and make sure the hardware you purchase is listed on the HCL.
There are many things not to do, and I have only scratched the surface. Common sense is the administrator's best tool, closely followed by a modicum of caution during implementation. Above and beyond this, it is a learning experience, and sometimes VMware errors are unavoidable.