For those working in the semiconductor and systems space, many BSODs aren’t caused by a software logic error, but by a breakdown in the communication between the CPU and the hardware. This is the domain of WHEA (Windows Hardware Error Architecture).
When a PCIe device fails to respond, or a DMA (Direct Memory Access) transfer hits an address the platform rejects, the system can raise Bug Check 0x124: WHEA_UNCORRECTABLE_ERROR. This is the OS telling you: “The hardware just reported a fatal problem, and I can’t ignore it.”
1. Understanding WHEA and PCIe Errors
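Modern systems use AER (Advanced Error Reporting) at the PCIe layer. When a bit flips on the bus or a packet is malformed, the hardware logs it. If the error is “Uncorrectable,” the platform escalates it to the OS, for example through a Machine Check Exception (MCE) or a non-maskable interrupt, and WHEA builds an error record from the logged registers.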
Common triggers include:
- TLP (Transaction Layer Packet) Malformation: The device sent a packet whose format the receiver (usually the root complex) could not parse.
- Completion Timeout: The CPU asked the device for data, and the device never answered.
- DMA to an Invalid Address: The driver told the hardware to write data to a memory address that the IOMMU (Input-Output Memory Management Unit) flagged as protected or invalid.
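The PCIe-level triggers above correspond to individual bits in the AER Uncorrectable Error Status register. As a rough illustration, the sketch below decodes a raw 32-bit value into error names. The bit positions follow the PCIe Base Specification; the function name `decode_aer_uncorrectable` and the sample value are made up for this example, and this is not how WinDbg itself decodes the record.

```python
from __future__ import annotations

# Bit positions in the AER Uncorrectable Error Status register
# (PCIe Base Specification). Reserved bits are omitted.
AER_UNCORRECTABLE_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down Error",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request Error",
    21: "ACS Violation",
}


def decode_aer_uncorrectable(status: int) -> list[str]:
    """Return the names of all error bits set in `status`, lowest bit first."""
    return [
        name
        for bit, name in sorted(AER_UNCORRECTABLE_BITS.items())
        if status & (1 << bit)
    ]


# Hypothetical register value with bits 14 and 18 set.
sample_status = (1 << 14) | (1 << 18)
for error_name in decode_aer_uncorrectable(sample_status):
    print(error_name)
```

A value with several bits set usually means one root cause cascaded into others, so the lowest-level error (for example a Data Link Protocol Error) is often the best place to start.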
2. Real Use Case: The Vanishing PCIe Device
Scenario: During a stress test of a new NVMe controller or FPGA card, the system suddenly freezes and reboots with a 0x124.
Step 1: The WHEA Error Record
The standard !analyze -v is just the starting point. For hardware errors, you must look at the WHEA error record, whose address !analyze -v reports as the second bug check argument (Arg2):
kd> !errrec <address_from_analyze>
This command parses the raw hardware hex data into a readable format, identifying the specific PCIe Segment, Bus, Device, and Function (BDF) that failed.
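On the wire, PCIe identifies a function by a 16-bit requester ID that packs the bus, device, and function numbers together. If you only have that raw ID (for instance from a logged TLP header), a minimal sketch like the following splits it into BDF fields. The names `Bdf`, `decode_requester_id`, and `format_bdf` are hypothetical helpers for this article, and the sketch assumes the classic 8/5/3-bit layout without ARI.

```python
from __future__ import annotations

from typing import NamedTuple


class Bdf(NamedTuple):
    bus: int
    device: int
    function: int


def decode_requester_id(requester_id: int) -> Bdf:
    """Split a 16-bit requester ID into bus (8 bits), device (5), function (3)."""
    if not 0 <= requester_id <= 0xFFFF:
        raise ValueError("requester ID must fit in 16 bits")
    return Bdf(
        bus=(requester_id >> 8) & 0xFF,
        device=(requester_id >> 3) & 0x1F,
        function=requester_id & 0x7,
    )


def format_bdf(segment: int, bdf: Bdf) -> str:
    """Render segment and BDF in the common ssss:bb:dd.f notation."""
    return f"{segment:04x}:{bdf.bus:02x}:{bdf.device:02x}.{bdf.function}"


# Hypothetical requester ID taken from an error log.
print(format_bdf(0, decode_requester_id(0x0310)))
```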
Step 2: Decoding the PCIe Status
Look for the Primary Status Register or Secondary Status Register in the output.
- Signaled Target Abort: The device intentionally stopped the transaction because of an internal error.
- Received Master Abort: A request was terminated because no target claimed the address. For example, the device tried to reach an address that the root complex did not recognize.
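To make the status bits concrete, here is a small sketch that decodes the error bits of a 16-bit PCI Status or Secondary Status value. The bit positions come from the PCI configuration space layout; the function name `decode_pci_status` and the sample value are assumptions made for illustration.

```python
from __future__ import annotations

# Error-related bits shared by the Status and Secondary Status registers.
_COMMON_STATUS_BITS = {
    8: "Master Data Parity Error",
    11: "Signaled Target Abort",
    12: "Received Target Abort",
    13: "Received Master Abort",
    15: "Detected Parity Error",
}


def decode_pci_status(status: int, secondary: bool = False) -> list[str]:
    """Return names of the error bits set in a 16-bit status value.

    Bit 14 means "Signaled System Error" in the primary Status register
    and "Received System Error" in a bridge's Secondary Status register.
    """
    names = dict(_COMMON_STATUS_BITS)
    names[14] = "Received System Error" if secondary else "Signaled System Error"
    return [names[bit] for bit in sorted(names) if status & (1 << bit)]


# Hypothetical value with bits 11 and 13 set.
for flag in decode_pci_status(0x2800):
    print(flag)
```

These bits are sticky until software clears them, so a set bit may reflect an earlier event rather than the crash itself.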
3. Debugging DMA Overwrites
DMA is a “double-edged sword.” It allows hardware to write directly to RAM without involving the CPU, but if the driver gives the hardware the wrong physical address, the hardware will happily overwrite kernel memory or even the page tables.
The Solution: DMA Verifier (IOMMU)
By enabling DMA Verification in Driver Verifier, Windows monitors how your driver uses the DMA APIs and, on platforms that support DMA remapping, can use the IOMMU to create a “sandbox” for your device.
- If your driver tells the hardware to write to Address X, but the driver only officially mapped Address Y, the IOMMU will block the transaction.
- This turns a silent, hard-to-trace memory corruption into an immediate 0xE6: DRIVER_VERIFIER_DMA_VIOLATION.
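The sandbox idea can be illustrated with a toy model: keep a list of ranges the driver has mapped, and reject any device write that does not fall entirely inside one of them. This is only a conceptual sketch, not how Windows or real IOMMU hardware implements remapping; `ToyIommu`, `map_range`, `check_write`, and `DmaViolation` are hypothetical names.

```python
from __future__ import annotations


class DmaViolation(Exception):
    """Raised when a simulated device write falls outside every mapped range."""


class ToyIommu:
    """Toy model of an IOMMU sandbox: only mapped ranges are writable."""

    def __init__(self) -> None:
        self._ranges: list[tuple[int, int]] = []  # (base, length) pairs

    def map_range(self, base: int, length: int) -> None:
        """Record a range the driver has officially mapped for the device."""
        if length <= 0:
            raise ValueError("length must be positive")
        self._ranges.append((base, length))

    def check_write(self, address: int, length: int) -> None:
        """Allow the write only if it lies entirely inside one mapped range."""
        end = address + length
        for base, size in self._ranges:
            if base <= address and end <= base + size:
                return
        raise DmaViolation(
            f"write of {length:#x} bytes at {address:#x} is outside all mappings"
        )


iommu = ToyIommu()
iommu.map_range(0x1000, 0x1000)      # the driver mapped "Address Y"
iommu.check_write(0x1800, 0x100)     # inside the mapping: allowed
try:
    iommu.check_write(0x9000, 0x100)  # "Address X": never mapped
except DmaViolation as exc:
    print("blocked:", exc)
```

The point of the model is the failure mode: the bad write is stopped at the moment it happens, instead of silently corrupting memory that crashes something unrelated later.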
4. Pro-Tips for System Architects
- Check the Link State: Use !pcitree in WinDbg to see if the device is still present on the bus after the crash. If it’s gone, the device likely lost power or its firmware crashed.
- Map Registers Correctly: Always use MmMapIoSpace to access hardware registers. Ensure you are using the correct caching attributes (usually MmNonCached).
- Check for EMI: In hardware labs, 0x124 errors that only happen under high load or near power supplies are often electrical interference (EMI) causing bit-flips on the high-speed differential pairs.
Summary Table: Hardware-Related Bug Checks
| Code | Name | Description |
| 0x124 | WHEA_UNCORRECTABLE_ERROR | A fatal hardware error reported by the CPU or PCIe bus. |
| 0xE6 | DRIVER_VERIFIER_DMA_VIOLATION | The driver attempted an illegal DMA operation (caught by IOMMU). |
| 0x101 | CLOCK_WATCHDOG_TIMEOUT | A processor is hung, often due to a deadlock in hardware/firmware handshakes. |
| 0x116 | VIDEO_TDR_FAILURE | The GPU took too long to respond, and the “Timeout Detection and Recovery” failed. |
This concludes our “Deep Dive” series on Windows Debugging. From high-level software logic to low-level hardware signals, you now have a roadmap for diagnosing almost any system failure.
