SearchVMware.com

purple screen of death (PSOD)

By Kinza Yasar

What is a PSOD?

A purple screen of death (PSOD) is a diagnostic screen with white type on a purple background that's displayed when the VMkernel of a VMware ESXi host experiences a critical error, becomes inoperative and terminates any virtual machines (VMs) that are running.

Typically, a PSOD details the memory state at the time of the crash and includes other information, such as the ESX/ESXi version and build, exception type, register dump, what was running on each central processing unit (CPU) at the time of the crash, backtrace, server uptime, error messages and core dump information.

The core dump, or memory dump, is a file that contains further diagnostic information from a PSOD. It can be given to VMware support to perform a root cause analysis of the failure.
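
As a rough illustration, dump files that have been copied off a host can be located by name. The sketch below is a minimal example, assuming the dumps sit in a local directory and that their file names begin with `vmkernel-zdump`, the name this article uses for the core dump. The function name `find_core_dumps` and the directory layout are hypothetical.

```python
from pathlib import Path


def find_core_dumps(directory):
    """Return files whose names start with 'vmkernel-zdump', newest first.

    The name prefix is an assumption taken from this article; adjust it
    if the dumps in your environment are named differently.
    """
    root = Path(directory)
    dumps = [
        p for p in root.iterdir()
        if p.is_file() and p.name.startswith("vmkernel-zdump")
    ]
    # Sort by modification time so the most recent crash comes first.
    return sorted(dumps, key=lambda p: p.stat().st_mtime, reverse=True)
```

Sorting by modification time puts the dump from the latest crash first, which is usually the one to hand to support.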

Why does a PSOD happen?

The purple diagnostic screen isn't as prevalent as the notorious blue screen of death -- the informal name for a Windows stop error -- but it can be equally disruptive. Issues with the VMware hypervisor itself, outdated drivers, unstable graphics processing units (GPUs), external hardware and misconfigured settings on a device can all generate a PSOD.

The most common causes of a PSOD include the following:

What are the consequences of a PSOD?

A PSOD is a kernel panic on the host -- once it initiates, the host crashes, and all services and VMs running on the host are terminated. The VMs don't get a chance to shut down gracefully, but are instead powered off abruptly. If, however, the host is part of a high availability cluster, the VMs are automatically restarted on other hosts in the cluster.

A PSOD not only causes an outage while the VMs are unavailable, but the abrupt shutdown can also affect critical applications, such as database servers, backup jobs, message queues and other services. For example, if the host is part of a virtual storage area network (vSAN) cluster, a PSOD will affect the vSAN as well.

How to deal with a PSOD

The diagnostic message displayed by the purple screen of death provides clues to the problems the host faces, which can be very helpful during troubleshooting.

The following steps should be taken when trying to deal with a PSOD:

  1. Take a screenshot. The diagnostic message displayed inside a PSOD contains helpful information regarding the crash and can be used for troubleshooting. The ESXi servers are mostly accessed through remote tools -- such as Dell's Integrated Dell Remote Access Controller, Hewlett-Packard's Integrated Lights-Out and Cisco's Integrated Management Controller -- which make taking a screenshot easy, but if there's no remote access available, physically going to the machine and taking a picture is also an option.
  2. Restart the host. Sometimes, the easiest way to recover from a PSOD is to reboot the server. Performing this step might prevent complicated troubleshooting later, especially if the underlying issue is simple.
  3. Contact VMware support. To perform a root cause analysis and to expedite the troubleshooting process, contact VMware support, especially if the organization has a support contract.
  4. Collect the core dump. Once the server is rebooted, collect the core dump. The core dump, or vmkernel-zdump, is a compressed file that contains logs and offers more detailed information than what is shown on the PSOD to help with further troubleshooting. Even if the cause of the PSOD seems obvious, it's best to confirm by analyzing the core dump. The core dump is especially important for hosts that are configured to automatically reset after a PSOD occurs, in which case no message stays on the screen.
  5. Decode the error message. The error message a PSOD produces provides insight into the actual problem. A PSOD can display many different error messages, such as "COS Error: Oops," "Lost Heartbeat," "Spin count exceeded (iplLock) - possible deadlock" or "Machine Check Exception: Unable to continue." The VMware website lists known VMkernel messages along with their descriptions.
  6. Check the logs. If the root cause of the PSOD isn't obvious after taking the aforementioned steps, then look for clues inside the host log files, especially for the time interval directly preceding the PSOD. The logs can also show errors related to add-in cards and other components, which can indicate, for example, that a card needs to be reseated in its Peripheral Component Interconnect Express slot. For enterprise environments, specialized log management tools, including VMware vRealize Log Insight or SolarWinds Security Event Manager, can be used to review the logs.
  7. Check overclock settings and clean the heat sink. Occasionally, a PSOD is caused by overclocking of a PC, which can change its hardware clock rate, voltage or multiplier, generating more heat and causing the CPU to become unstable. If a PSOD has occurred for this reason, then it's best to use a dedicated device or software to cool the PC. This can include using a cooling pad or specialized cooling software to disperse the heat faster. GPU malfunctions due to excessive heat can also cause a PSOD, so it's best to clean the device's heat sink regularly.
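
Steps 5 and 6 above can be partly scripted. The following is a minimal sketch, not a VMware tool: it checks transcribed PSOD text for the example messages quoted in step 5 and keeps only the log lines stamped shortly before the crash. The function names, the 10-minute default window and the assumption that each log line begins with a timestamp such as `2022-07-19T10:15:30Z` (optionally with fractional seconds) are illustrative, so verify the format against the actual log files before relying on it.

```python
from datetime import datetime, timedelta

# Example messages quoted in step 5; this is not a complete list.
KNOWN_MESSAGES = [
    "COS Error: Oops",
    "Lost Heartbeat",
    "Spin count exceeded",
    "Machine Check Exception",
]

# Assumed timestamp layouts at the start of a log line.
_STAMP_FORMATS = ("%Y-%m-%dT%H:%M:%S.%fZ", "%Y-%m-%dT%H:%M:%SZ")


def match_known_messages(psod_text):
    """Return the known message fragments found in transcribed PSOD text."""
    lowered = psod_text.lower()
    return [m for m in KNOWN_MESSAGES if m.lower() in lowered]


def _parse_stamp(line):
    """Parse the leading timestamp of a line, or return None if absent."""
    stamp = line.split(" ", 1)[0]
    for fmt in _STAMP_FORMATS:
        try:
            return datetime.strptime(stamp, fmt)
        except ValueError:
            continue
    return None


def lines_before_crash(log_lines, crash_time, minutes=10):
    """Keep lines stamped within `minutes` before `crash_time` (inclusive)."""
    start = crash_time - timedelta(minutes=minutes)
    selected = []
    for line in log_lines:
        when = _parse_stamp(line)
        # Lines without the assumed timestamp prefix are skipped.
        if when is not None and start <= when <= crash_time:
            selected.append(line)
    return selected


if __name__ == "__main__":
    # Made-up lines for illustration only; these are not real ESXi log entries.
    sample = [
        "2022-07-19T09:00:00Z example entry long before the crash",
        "2022-07-19T10:12:00Z example entry shortly before the crash",
    ]
    crash = datetime(2022, 7, 19, 10, 15, 0)
    for entry in lines_before_crash(sample, crash):
        print(entry)
    print(match_known_messages("Lost Heartbeat"))
```

Narrowing the logs to a short window keeps the review focused on events that could plausibly have triggered the crash. The window can be widened if nothing relevant turns up.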

How to prevent a PSOD

At times, diagnosing the root cause of a PSOD can be challenging and frustrating. Therefore, the best defense against a PSOD is to prevent it from happening by taking a few precautionary measures.

The following items can help minimize or mitigate the occurrence of a PSOD:


19 Jul 2022

All Rights Reserved, Copyright 2007 - 2024, TechTarget