Academy of System Design


The Executive Handshake — System Service Exceptions (0x3B)

by Shameer Mohammed March 29, 2026

While many BSODs happen purely in the “dark” of the kernel, Bug Check 0x3B: SYSTEM_SERVICE_EXCEPTION occurs at the boundary where a user-mode application makes a request to the kernel (a System Call). This is often the result of a driver failing to properly validate a buffer passed from an application.

1. The User-to-Kernel Transition

When an app calls ReadFile() or a custom DeviceIoControl(), the CPU switches from Ring 3 to Ring 0. The kernel must treat everything coming from the app as “untrusted.”

2. Real Use Case: The Improper Buffer Mapping

Scenario: A monitoring tool for a data center hangs briefly and then crashes the host with 0x3B whenever it tries to pull telemetry from a custom PCIe sensor.

Step 1: The Exception Context

A 0x3B bug check is convenient to debug because its parameters include the address of a Context Record (Parameter 3). Run:

.cxr <address_from_analyze>

This switches the debugger’s register context to the exact state of the thread at the moment the exception was raised inside the system service.

Step 2: Finding the Faulting Address

Check the instruction:

kd> u @rip

If it’s a mov or memcpy operation involving a user-supplied pointer, look at the memory protections:

!pte <pointer>

The Discovery: The driver tried to write to a buffer that the user-mode app had already freed or marked as Read-Only. Because the driver didn’t call ProbeForWrite and wrap the access in a __try/__except block, the exception went unhandled, leading to the crash.
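The check the driver skipped can be modeled in portable C. The sketch below is a user-mode illustration of the kind of range validation ProbeForWrite performs before a driver touches a user buffer; it is not the kernel routine. The names probe_user_range and USER_SPACE_LIMIT, and the limit value itself, are assumptions made for this example. The real routine raises an exception instead of returning a value, which is why it belongs inside a __try/__except block.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical upper bound of the user-mode address range (assumption). */
#define USER_SPACE_LIMIT UINT64_C(0x00007FFFFFFF0000)

/* Returns true only if [addr, addr + len) is aligned, does not wrap
 * around, and lies entirely below the user-space limit. */
bool probe_user_range(uint64_t addr, uint64_t len, uint64_t alignment)
{
    if (len == 0)
        return true;                  /* nothing will be touched */
    if (alignment == 0 || (addr % alignment) != 0)
        return false;                 /* misaligned pointer */
    if (addr + len < addr)
        return false;                 /* address arithmetic wrapped */
    return addr + len <= USER_SPACE_LIMIT;
}
```

A kernel-range address or a wrapping length is rejected before any write happens. A range check alone does not protect against the app freeing the buffer a moment later, which is why the exception handler is still needed around the actual access.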




The Hardware Handshake — Debugging PCIe and DMA Issues (0x124)

by dnaadmin March 29, 2026

For those working in the semiconductor and systems space, many BSODs aren’t caused by a software logic error, but by a breakdown in the communication between the CPU and the hardware. This is the domain of WHEA (Windows Hardware Error Architecture).

When a PCIe device fails to respond, or a DMA (Direct Memory Access) transfer goes out of bounds, the system triggers Bug Check 0x124: WHEA_UNCORRECTABLE_ERROR. This is the OS telling you: “The hardware just reported a fatal problem, and I can’t ignore it.”


1. Understanding WHEA and PCIe Errors

Modern systems use AER (Advanced Error Reporting) at the PCIe layer. When a bit flips on the link or a packet is malformed, the hardware logs it. If the error is “Uncorrectable,” it is escalated to the platform (depending on firmware settings, as a Machine Check Exception or an NMI), and WHEA turns it into a bug check.

Common triggers include:

  • TLP (Transaction Layer Packet) Malformation: The device sent data in a format the CPU doesn’t understand.
  • Completion Timeout: The CPU asked the device for data, and the device never answered.
  • DMA Remapping Fault: The driver told the hardware to read or write a memory address that the IOMMU (Input-Output Memory Management Unit) flagged as protected or invalid.

2. Real Use Case: The Vanishing PCIe Device

Scenario: During a stress test of a new NVMe controller or FPGA card, the system suddenly freezes and reboots with a 0x124.

Step 1: The WHEA Error Record

The standard !analyze -v is just the starting point. For hardware errors, you must look at the WHEA Record:

kd> !errrec <address_from_analyze>

This command parses the raw hardware hex data into a readable format, identifying the specific PCIe Segment, Bus, Device, and Function (BDF) that failed.
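For readers who meet the BDF in its packed 16-bit form (for example as a requester ID in a raw log), the split is 8 bits of bus, 5 bits of device, and 3 bits of function. A minimal sketch follows; the struct and function names are made up for this example.

```c
#include <stdint.h>

/* Bus/Device/Function triple packed into a 16-bit PCIe routing ID. */
struct pci_bdf {
    uint8_t bus;       /* bits 15:8 */
    uint8_t device;    /* bits 7:3  */
    uint8_t function;  /* bits 2:0  */
};

/* Splits a packed routing ID into its three fields. */
struct pci_bdf decode_bdf(uint16_t routing_id)
{
    struct pci_bdf bdf;
    bdf.bus      = (uint8_t)(routing_id >> 8);
    bdf.device   = (uint8_t)((routing_id >> 3) & 0x1F);
    bdf.function = (uint8_t)(routing_id & 0x07);
    return bdf;
}
```

For instance, decode_bdf(0x0308) yields bus 3, device 1, function 0.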

Step 2: Decoding the PCIe Status

Look for the Primary Status Register or Secondary Status Register in the output.

  • Signaled Target Abort: The device intentionally stopped the transaction because of an internal error.
  • Received Master Abort: A request issued by this device or bridge was terminated because no target claimed the address (in PCIe terms, an Unsupported Request completion).
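If only the raw register value is available, the two abort flags can be tested by hand. The sketch below uses the bit positions of the standard 16-bit PCI Status register; the macro and function names are chosen for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Abort-related bits of the 16-bit PCI Status register. */
#define PCI_STATUS_SIGNALED_TARGET_ABORT  (1u << 11)
#define PCI_STATUS_RECEIVED_TARGET_ABORT  (1u << 12)
#define PCI_STATUS_RECEIVED_MASTER_ABORT  (1u << 13)

/* True if the device itself aborted a transaction it was serving. */
bool signaled_target_abort(uint16_t status)
{
    return (status & PCI_STATUS_SIGNALED_TARGET_ABORT) != 0;
}

/* True if a request from this device or bridge found no target. */
bool received_master_abort(uint16_t status)
{
    return (status & PCI_STATUS_RECEIVED_MASTER_ABORT) != 0;
}
```

A status value of 0x2810, for example, has both flags set.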

3. Debugging DMA Overwrites

DMA is a “double-edged sword.” It allows hardware to write directly to RAM without bothering the CPU, but if the driver gives the hardware the wrong physical address, the hardware will happily overwrite the Kernel or even the Page Tables.

The Solution: DMA Verification and the IOMMU

Enabling DMA Verification in Driver Verifier makes Windows check every use of the DMA API by your driver. On platforms with DMA remapping, the IOMMU additionally acts as a “sandbox” for your device.

  • If your driver tells the hardware to write to Address X, but the driver only officially mapped Address Y, the transaction is blocked or flagged.
  • This turns a silent, hard-to-trace memory corruption into an immediate 0xE6: DRIVER_VERIFIER_DMA_VIOLATION.

4. Pro-Tips for System Architects

  • Check the Link State: Use !pcitree in WinDbg to see if the device is still present on the bus after the crash. If it’s gone, the device likely lost power or its firmware crashed.
  • Map Registers Correctly: Always use MmMapIoSpace to access hardware registers. Ensure you are using the correct caching attributes (usually MmNonCached).
  • Check for EMI: In hardware labs, 0x124 errors that only happen under high load or near power supplies are often electrical interference (EMI) causing bit-flips on the high-speed differential pairs.

Summary Table: Hardware-Related Bug Checks

Code | Name | Description
0x124 | WHEA_UNCORRECTABLE_ERROR | A fatal hardware error reported by the CPU or PCIe bus.
0xE6 | DRIVER_VERIFIER_DMA_VIOLATION | The driver attempted an illegal DMA operation (caught by IOMMU).
0x101 | CLOCK_WATCHDOG_TIMEOUT | A processor is hung, often due to a deadlock in hardware/firmware handshakes.
0x116 | VIDEO_TDR_FAILURE | The GPU took too long to respond, and the “Timeout Detection and Recovery” failed.

This concludes our “Deep Dive” series on Windows Debugging. From high-level software logic to low-level hardware signals, you now have a roadmap for diagnosing almost any system failure.


The Silent Killer — Stack Overflows and Corruption (0x7F & 0x1E)

by dnaadmin March 29, 2026

While we’ve spent a lot of time discussing the Kernel Pool (the heap), there is another critical memory area that is much smaller and far more dangerous: The Kernel Stack.

In User Mode, a stack can grow significantly. In Kernel Mode, the stack is fixed and remarkably small (typically 12 KB on x86 and 24 KB on x64 systems). If a driver exceeds this limit, it triggers a Bug Check 0x7F: UNEXPECTED_KERNEL_MODE_TRAP or, as a secondary symptom, 0x1E: KMODE_EXCEPTION_NOT_HANDLED.


1. Why the Kernel Stack is Small

Every thread in the system has its own kernel stack. Because there can be thousands of threads, Windows keeps the stacks small to conserve physical memory.

The Risk: Unlike a pool overrun, which corrupts neighboring allocations and may go unnoticed for a while, a stack overflow runs into the guard page below the stack. The CPU then cannot even push the exception frame, which triggers a “Double Fault”: the CPU’s last-ditch response when it hits an exception while trying to deliver a previous one.


2. Real Use Case: The Deeply Nested Recursion

Scenario: A file-system filter driver works perfectly in the lab but crashes on a customer’s machine when they run a specific disk-heavy database application.

Step 1: Identifying a Double Fault

Run !analyze -v. Look for Parameter 1 of the 0x7F bug check:

  • Arg1 = 0x08: This is a Double Fault. It almost always means the kernel stack has been exhausted.

Step 2: Examining the Stack Depth

In WinDbg, use the k command to look at the stack. In a stack overflow, you will see hundreds of lines, often repeating the same functions:

Plaintext

00 nt!KiDoubleFaultAbort
01 MyFilter!ProcessFileUpdate+0x120
02 MyFilter!ProcessFileUpdate+0x120
03 MyFilter!ProcessFileUpdate+0x120
... [hundreds of entries] ...
150 MyFilter!OnPreCreate+0x45

The Discovery: The driver is calling itself recursively. Every call consumes stack space for the return address, saved registers, and local variables, so the fixed kernel stack limit is hit quickly.

Step 3: Checking Stack Usage

To see exactly how much stack a thread is using, use:

kd> !thread

Look for the Current and Limit stack values in the output. If the difference is near zero, you are out of space.


3. The “Big Local Variable” Trap

Another common cause isn’t recursion, but large local arrays.

C

VOID MyDriverFunction() {
    UCHAR Buffer[4096]; // DANGER: Uses 4KB of a 12KB stack!
    // ...
}

If three or four functions in a call chain do this, the system will crash instantly.


4. How to Fix It

  • Allocate from Pool: If you need a buffer larger than a few hundred bytes, use ExAllocatePoolWithTag. Never allocate large structures on the stack.
  • Avoid Deep Recursion: The kernel is not the place for recursive algorithms. Use iterative loops instead.
  • Use Worker Threads: If you are part of a long call chain (like a storage or network stack), offload your work to a System Worker Thread to start with a fresh, empty stack.
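The first two points can be combined: a recursive walk becomes a loop whose worklist lives on the heap instead of the call stack. The sketch below is portable user-mode C, so it uses malloc where a driver would use a pool allocation; the node layout and the function name are hypothetical.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical tree node, e.g. one entry in a directory hierarchy. */
struct node {
    struct node *first_child;
    struct node *next_sibling;
};

/* Counts root and all of its descendants without recursion.
 * Returns -1 if the worklist cannot be allocated. */
long count_nodes_iterative(struct node *root)
{
    size_t capacity = 16, depth = 0;
    struct node **work = malloc(capacity * sizeof *work);
    long count = 0;

    if (work == NULL)
        return -1;
    if (root != NULL)
        work[depth++] = root;

    while (depth > 0) {
        struct node *n = work[--depth];
        count++;
        /* Push every child; the heap worklist grows, the call stack does not. */
        for (struct node *c = n->first_child; c != NULL; c = c->next_sibling) {
            if (depth == capacity) {
                struct node **bigger = realloc(work, 2 * capacity * sizeof *work);
                if (bigger == NULL) {
                    free(work);
                    return -1;
                }
                work = bigger;
                capacity *= 2;
            }
            work[depth++] = c;
        }
    }
    free(work);
    return count;
}
```

Stack usage stays constant no matter how deep the tree is, and an allocation failure becomes an error return instead of a Double Fault.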

Summary Table: Stack and Trap Bug Checks

Code | Name | Typical Cause
0x7F | UNEXPECTED_KERNEL_MODE_TRAP | Often a Double Fault (0x8) caused by stack overflow.
0x1E | KMODE_EXCEPTION_NOT_HANDLED | A kernel exception that wasn’t caught; can be a secondary result of stack corruption.
0x2B | PANIC_STACK_SWITCH | The kernel detected the stack was so corrupt it had to switch to a “panic” stack.
0x139 | KERNEL_SECURITY_CHECK_FAILURE | Modern Windows detection of a “Stack Cookie” mismatch (Buffer Overflow protection).

With this, we conclude our deep dive into memory and execution errors. You now have a comprehensive guide covering IRQLs, Pool Corruption, Timeouts, Deadlocks, Access Violations, and Stack Overflows.



The Phantom Pointer — Decoding Invalid Memory Access (0x50 & 0x7E)

by dnaadmin March 29, 2026

In our previous articles, we covered timing and resource deadlocks. Now, we return to the most frequent “bread and butter” of Windows debugging: The Access Violation.

When a driver tries to read or write to a memory address that doesn’t exist, is protected, or has already been freed, the CPU throws an exception. If the kernel doesn’t have a specific handler for that exception, it triggers Bug Check 0x50: PAGE_FAULT_IN_NONPAGED_AREA or 0x7E: SYSTEM_THREAD_EXCEPTION_NOT_HANDLED.


1. The Architecture of a Memory Fault

Windows uses a virtual memory system. Every address a driver sees is “Virtual.” The Memory Management Unit (MMU) translates this into a “Physical” address using Page Tables.

A crash happens when:

  1. NULL Pointer: The driver tries to access address 0x00000000 (the most common bug).
  2. Use-After-Free: The driver accesses memory it already released back to the Pool.
  3. Instruction Corruption: The CPU tries to execute data as if it were code.

2. Real Use Case: The Race Condition Pointer

Scenario: A high-speed imaging driver crashes randomly during device initialization. The BSOD shows 0x50.

Step 1: Analyze the Faulting Address

Run !analyze -v.

  • Arg1: ffffa00012345678 (The memory address referenced)
  • Arg2: 0x00 (Read operation) or a non-zero value for a write (0x01 on older builds, 0x02 on current x64 builds)

Step 2: Context is King

With a 0x7E or 0x50, the stack trace might look like garbage if the instruction pointer (RIP) itself is corrupted. Use the Trap Frame to restore the registers to the exact state at the crash:

Plaintext

kd> .trap ffff8001`5521a000

Step 3: Disassembling the Failure

Now, look at the exact instruction that failed using u . (unassemble at current IP):

Plaintext

MyCameraDriver!StartCapture+0x7b:
fffff801`4a22047b 488b01          mov     rax,qword ptr [rcx]

The Insight: The CPU tried to move data from the address held in the RCX register into RAX.

If we check the registers:

kd> r rcx

rcx=0000000000000020

The Diagnosis: 0x20 is a “near-NULL” pointer. It usually means the driver has a structure pointer that is NULL, and it tried to access a member at offset 0x20.
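The arithmetic behind a near-NULL address is simply base pointer plus member offset. The sketch below shows it with a made-up structure; capture_context and its fields are hypothetical and were laid out so that one member lands at offset 0x20.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical device context; the layout is invented for illustration. */
struct capture_context {
    uint64_t device_id;      /* offset 0x00 */
    uint64_t flags;          /* offset 0x08 */
    uint64_t frame_count;    /* offset 0x10 */
    uint64_t buffer_size;    /* offset 0x18 */
    uint64_t frame_buffer;   /* offset 0x20 */
};

/* Address the CPU dereferences for ctx->frame_buffer, given the
 * numeric value of the ctx pointer. */
uint64_t member_address(uint64_t ctx_pointer_value)
{
    return ctx_pointer_value + offsetof(struct capture_context, frame_buffer);
}
```

With a NULL context pointer, member_address(0) is 0x20, which matches the value seen in RCX. In a real dump, the dt command with the structure's type name shows which member sits at the faulting offset.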


3. Debugging “Use-After-Free” with Special Pool

Standard memory pools are recycled quickly. If Driver A frees memory and Driver B immediately allocates it, Driver A might still have a “dangling pointer” and overwrite Driver B’s data without a crash.

To catch this, use Verifier /flags 0x1 (Special Pool).

  • Windows will place the allocation at the very end of a page.
  • When the driver frees it, the page is marked “Invalid” immediately.
  • Any subsequent access by the dangling pointer triggers an immediate 0x50, catching the bug at the source rather than miles down the road.

4. Pro-Tips

  • Clear Your Pointers: Always set pointers to NULL after calling ExFreePool.
  • Check for NULL: Never assume ExAllocatePool succeeded. Always test the result: if (Pointer == NULL) return STATUS_INSUFFICIENT_RESOURCES;
  • Understand the “Non-Paged” Requirement: If you are at DISPATCH_LEVEL, your code and data must be in non-paged memory. If the OS has to “page in” your code from the disk while you are at a high IRQL, you get a 0xD1 (or 0x0A) instead of a normal page-in.

Summary Table: Access & Exception Bug Checks

Code | Name | Description
0x50 | PAGE_FAULT_IN_NONPAGED_AREA | Invalid system memory was referenced (bad or dangling pointer).
0x7E | SYSTEM_THREAD_EXCEPTION_NOT_HANDLED | A general “catch-all” for kernel-mode exceptions (like divide-by-zero).
0xD1 | DRIVER_IRQL_NOT_LESS_OR_EQUAL | (Recall Article 1) A driver touched pageable or invalid memory at high IRQL.
0x3B | SYSTEM_SERVICE_EXCEPTION | An exception happened while executing a system service on behalf of user mode.

With this, you have the foundational toolkit for 90% of the BSODs you will encounter in your career. Which of these scenarios have you seen most often in your labs?


The Silent Standoff — Deadlocks and Power State Failures (0x9F & 0x15F)

by dnaadmin March 29, 2026

In our final installment of this introductory series, we move from runaway loops to the opposite problem: The Deadlock. This is where the system isn’t “busy”—it’s “stuck.”

Two or more threads are waiting for resources held by each other, and neither can proceed. In the Windows kernel, this frequently manifests during power transitions (like sleeping or hibernating), leading to the infamous 0x9F: DRIVER_POWER_STATE_FAILURE.


1. The Locking Hierarchy

Kernel drivers use various synchronization primitives to protect shared data:

  • Spinlocks: Fast, used at DISPATCH_LEVEL.
  • Mutexes/ERESOURCEs: Used at PASSIVE_LEVEL, allowing threads to wait (sleep) if the resource is busy.

The Deadlock Trap: Thread A acquires Lock 1 and tries to get Lock 2. Simultaneously, Thread B acquires Lock 2 and tries to get Lock 1. Both threads sit forever, and if one of those threads is required for a system-wide operation (like a shutdown), the Watchdog eventually triggers a BSOD.


2. Real Use Case: The Sleep-Timer Deadlock

Scenario: A laptop fails to enter “Sleep” mode. The screen goes black, the fans stay on for 30 seconds, and then it crashes with 0x9F.

Step 1: Analyze the Power IRP

For 0x9F, Parameter 1 tells us the type of violation. Usually, it’s 0x3 (A device object has been blocking an IRP for too long).

Run: !analyze -v

Then, find the pending IRP:

!irp <address_from_analyze>

Step 2: Finding the Blocker

The !irp command will show which driver is currently “owning” the power request.

Plaintext

>[0, 0]   0  0 ffffe001`1a2b3c40 00000000 fffff801`4b331010-fffff801`4b442020
           \Driver\MyUsbFilter   nt!PopRequestCompletion

Here, MyUsbFilter received a “Set Power” IRP but never passed it down or completed it.

Step 3: Thread Analysis

Why is the driver stuck? We look at the thread handling that IRP:

!thread <address>

The stack trace might look like this:

Plaintext

nt!KeWaitForSingleObject
MyUsbFilter!StopTrafficAndLock+0x45
MyUsbFilter!PowerDispatch+0x12

The Discovery: The driver is waiting for a Mutex to “Stop Traffic.” However, the thread that holds that Mutex is currently blocked waiting for the Power IRP to finish. This is a classic Circular Dependency.


3. Debugging Tools for Deadlocks

If you suspect a lock issue, WinDbg has specialized tools:

  • !locks: Displays all kernel resources and which threads own them. Look for “Threads Waiting.”
  • !deadlock: If you have Driver Verifier’s Deadlock Detection enabled, this command will explicitly map out the circular chain for you.
  • !ready: Shows all threads in a “Ready” state to see if anyone is being starved of CPU time.

4. Designing for Stability

  • Lock Ordering: Always acquire locks in the exact same order across all functions in your driver. If you take Lock A then Lock B in one place, never take Lock B then Lock A elsewhere.
  • Don’t Block in Power Paths: Power dispatch routines should be fast. If you need to wait for hardware, use a timeout.
  • Use Passive-Level Interrupts: If your hardware allows, handle complexity at PASSIVE_LEVEL where you have more flexibility with synchronization.
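Lock ordering can be enforced mechanically by giving every lock a rank and refusing any acquisition that does not go strictly upward. The sketch below is a single-threaded model of that bookkeeping, similar in spirit to what a deadlock detector records; the tracker type, its fixed capacity of 8, and the rank numbers are all assumptions for this example.

```c
#include <stdbool.h>

/* Ranks of the locks the current thread holds, in acquisition order.
 * Ranks are hypothetical numbers a driver assigns to its own locks. */
struct lock_tracker {
    int ranks[8];
    int held;       /* number of entries in use */
};

/* Returns true if taking a lock of this rank keeps the order strictly
 * increasing; false flags an ordering violation (or a full tracker). */
bool tracker_acquire(struct lock_tracker *t, int rank)
{
    if (t->held == 8)
        return false;
    if (t->held > 0 && rank <= t->ranks[t->held - 1])
        return false;
    t->ranks[t->held++] = rank;
    return true;
}

/* Releases the most recently acquired lock. */
void tracker_release(struct lock_tracker *t)
{
    if (t->held > 0)
        t->held--;
}
```

A thread that takes rank 1 and then rank 2 passes. A thread that holds rank 2 and asks for rank 1 is rejected, which is exactly the Thread B half of the deadlock trap described earlier.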

Summary of Resource & Power Bug Checks

Code | Name | Description
0x9F | DRIVER_POWER_STATE_FAILURE | A driver is in an inconsistent power state or is too slow during a power state change.
0x15F | CONNECTED_STANDBY_WATCHDOG_TIMEOUT_LIVEDUMP | A watchdog timeout during Modern Standby, recorded as a live dump.
0xCB | DRIVER_LEFT_LOCKED_PAGES_IN_PROCESS | A driver failed to release locked pages after an I/O operation.
0xDE | POOL_CORRUPTION_IN_FILE_AREA | A driver corrupted pool memory that holds pages destined for disk.

Conclusion of the Series

Debugging is a mix of science and intuition. By mastering IRQLs, Pool Headers, DPCs, and Locking, you move from guessing why a system crashed to knowing exactly which line of code failed.

Happy Debugging!


Hunting the Infinite Loop — DPC Watchdog Violations (0x133)

by dnaadmin March 29, 2026

In the semiconductor and hardware world, timing is everything. While previous articles focused on memory safety, this one deals with responsiveness.

The Bug Check 0x133: DPC_WATCHDOG_VIOLATION occurs when the system detects a single Deferred Procedure Call (DPC) running for an excessive amount of time, or when the cumulative time spent at DISPATCH_LEVEL exceeds a threshold. Essentially, one processor is “stuck,” preventing the OS from scheduling other critical tasks.


1. The DPC and Interrupt Architecture

To understand this crash, we must understand how Windows handles hardware.

  • ISR (Interrupt Service Routine): High priority, very short. It tells the hardware “I hear you” and schedules a DPC.
  • DPC (Deferred Procedure Call): Lower priority than an ISR but still runs at DISPATCH_LEVEL (IRQL 2). It does the “heavy lifting” (like processing network packets or disk I/O).

The Golden Rule of Kernel Dev: Never block, sleep, or run long loops in a DPC. If a DPC runs too long, the Windows “Watchdog” timer barks, and the system bites with a BSOD.


2. Real Use Case: The Busy-Wait Loop

Scenario: A new PCIe driver works fine under light load, but during heavy stress tests, the system freezes for a second and then crashes with 0x133.

Step 1: The Watchdog Parameters

Run !analyze -v. For 0x133, Parameter 1 is critical:

  • Arg1 = 0: A single DPC exceeded the time limit.
  • Arg1 = 1: The cumulative time spent at DISPATCH_LEVEL was too high.

Step 2: Finding the “Hog”

If Arg1 is 0, the debugger usually points directly to the offender. We use the !dpcs command to see what’s queued, but more importantly, we look at the Processor Control Block (PRCB).

Plaintext

kd> !prcb
...
DpcRoutine: fffff801`4b331010  MyStorageDriver!RequestTimeoutHandler

Step 3: Analyzing the Code

We examine MyStorageDriver!RequestTimeoutHandler.

C

while (HardwareStatus != READY) {
    // Busy waiting for a register bit to flip
    // No timeout, no yielding
}

The Flaw: The driver is “spinning” in a while loop waiting for hardware that has hung. Because a DPC runs at DISPATCH_LEVEL, the scheduler cannot preempt it on this core; only higher-IRQL interrupts can run. The Watchdog timer expires because the CPU never drops back to a lower IRQL within the allowed time.


3. Debugging Tools for Timeouts

If the stack trace is unclear, use these commands:

  • !stacks 2: Look for threads stuck in DISPATCH_LEVEL.
  • !timer: See if any system timers were supposed to fire but were blocked by the runaway DPC.
  • !running -it: Shows the thread currently running on each processor, including idle ones, together with its stack.

4. How to Fix It

  • Use Hardware Timeouts: Never write a while loop without a maximum retry count or a timestamp check.
  • Offload to Worker Threads: If you have massive data processing to do, don’t do it in the DPC. Queue a System Worker Thread (which runs at IRQL 0, PASSIVE_LEVEL) so the scheduler can still breathe.
  • Use KeStallExecutionProcessor Sparingly: Only use stalls for microseconds, never milliseconds.
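The first point can be sketched as a bounded version of the busy-wait loop shown earlier. This is portable C for illustration: the function name, the mask parameter, and the retry budget are assumptions, and a real driver would read the register through its platform's register-access macros and stall briefly between polls.

```c
#include <stdbool.h>
#include <stdint.h>

/* Polls a status register until (value & ready_mask) is non-zero or the
 * retry budget is exhausted. Returns true if the device became ready. */
bool wait_for_ready(const volatile uint32_t *status_reg,
                    uint32_t ready_mask,
                    unsigned max_polls)
{
    for (unsigned i = 0; i < max_polls; i++) {
        if ((*status_reg & ready_mask) != 0)
            return true;
        /* A real driver would stall for a few microseconds here. */
    }
    return false;   /* hardware never answered: report it instead of spinning */
}
```

When the function returns false, the DPC can log the failure, mark the device as hung, and return, which keeps the watchdog quiet and leaves a far more useful trail than a 0x133.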

Summary Table: Timeout & Logic Bug Checks

Code | Name | Typical Cause
0x133 | DPC_WATCHDOG_VIOLATION | A DPC ran too long or the system stayed at DISPATCH_LEVEL too long.
0x101 | CLOCK_WATCHDOG_TIMEOUT | A secondary processor is not responding to interrupts (often hardware/voltage).
0x9F | DRIVER_POWER_STATE_FAILURE | A driver is taking too long to respond to a Power IRP (sleep/hibernate).
0x139 | KERNEL_SECURITY_CHECK_FAILURE | A stack buffer overrun was detected (modern replacement for some 0x19 cases).

In the next and final article of this introductory series, we will discuss Resource Deadlocks (0x15F and 0x9F)—where two threads are waiting for each other, and nobody is moving.


Tracking Down Memory Corruption (Bug Check 0x19)

by dnaadmin March 29, 2026

In our previous article, we looked at IRQL issues. Today, we dive into one of the most challenging areas of Windows debugging: Pool Corruption.

When a driver or kernel component writes past the end of its allocated memory buffer, it doesn’t always crash immediately. It might overwrite the “header” of the next memory block. The system only notices the damage later when it tries to allocate or free that next block, leading to a Bug Check 0x19: BAD_POOL_HEADER.


1. Understanding the Windows Kernel Pool

The Kernel Pool is the heap for kernel-mode drivers. It is divided into Pool Chunks. Each chunk starts with a Pool Header that contains metadata:

  • Previous Size: Size of the preceding chunk.
  • Pool Index: Used for tracking.
  • Block Size: Size of the current chunk.
  • Pool Tag: A 4-character “signature” (e.g., ExAl) identifying who allocated the memory.

If a driver performs an “off-by-one” error or a memcpy with an incorrect length, it overwrites the header of the adjacent chunk. When the Pool Manager later inspects that corrupted header, it triggers the BSOD to prevent further data loss.
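The consistency check that exposes this kind of damage can be sketched with a simplified model. The struct below keeps only the fields discussed above and is not the real Windows pool header layout; the names are invented for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified model of a pool chunk header (not the real layout). */
struct chunk_header {
    uint16_t previous_size;  /* size of the preceding chunk */
    uint16_t block_size;     /* size of this chunk */
    char     tag[4];         /* four-character pool tag */
};

/* Walks consecutive chunks and returns the index of the first chunk whose
 * previous_size disagrees with the block_size of the chunk before it,
 * or -1 if every link is consistent. */
int find_corrupt_chunk(const struct chunk_header *chunks, size_t count)
{
    for (size_t i = 1; i < count; i++) {
        if (chunks[i].previous_size != chunks[i - 1].block_size)
            return (int)i;
    }
    return -1;
}
```

The chunk the walk reports is the victim. The likely culprit is the chunk just before it, because that is the allocation whose data sits directly in front of the damaged header.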


2. Real Use Case: The “Ghost” Overwriter

Scenario: A server crashes randomly every few hours with 0x19. The stack trace usually points to nt!ExFreePoolWithTag, which is just the “victim” trying to clean up memory.

Step 1: Analyze the Parameters

In WinDbg, run !analyze -v. Look at the parameters for 0x19:

  • Arg1: 0x20 (The pool block header is corrupt)
  • Arg2: The address of the corrupted pool block.
  • Arg3/4: Internal tracking data.

Step 2: Inspecting the Neighborhood

We use the !pool command to look at the memory surrounding the crash address:

Plaintext

kd> !pool fffff801`4a220000
fffff801`4a2203c0 size:  40 previous size:   0  (Free)      ....
fffff801`4a220400 size:  60 previous size:  40  (Allocated)  Prot
fffff801`4a220460 size:  ?? previous size:  ??  (Corrupt)    Tag?

The debugger tells us the chunk at 0x460 is corrupt. This means the chunk immediately before it (at 0x400) is likely the one that overran its boundary.

Step 3: Identifying the Culprit via Pool Tags

Look at the tag for the chunk at 0x400. Let’s say it is Prot.

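To find which driver owns that tag, search your source code:

findstr /s "Prot" *.c

If you don’t have the source, look the tag up in the pooltag.txt file that ships with the Debugging Tools for Windows, or search the installed driver binaries for the tag string:

findstr /m /l Prot %SystemRoot%\System32\drivers\*.sys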

This identifies the “Protocol” driver as the one that likely wrote too much data into its 0x60 byte allocation, destroying the header of the next block.


3. Advanced Technique: Special Pool

Sometimes the corruption is so subtle that !pool isn’t enough. This is where Driver Verifier and Special Pool come in.

By enabling Special Pool for a specific driver tag, Windows places each allocation on a separate memory page, right against a “guard page.”

  • The Result: Instead of corrupting a neighbor and crashing later, the driver will trigger an immediate 0x50 (PAGE_FAULT_IN_NONPAGED_AREA) the very millisecond it tries to write one byte too far.

4. Best Practices

  • Always use Pool Tags: Never use ExAllocatePool. Use ExAllocatePoolWithTag (or ExAllocatePool2 on current Windows versions). The tag is your primary breadcrumb during a crash.
  • Validate Buffer Lengths: Before every RtlCopyMemory or memcpy, check your destination buffer size against the source length.
  • Use !poolval: This WinDbg command can help validate the entire pool structure if you suspect widespread corruption.
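The length check from the second point fits in a small wrapper. This is a minimal sketch in portable C; bounded_copy is a hypothetical helper name, and a driver would call RtlCopyMemory where this example calls memcpy.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Copies src_len bytes into dst only if they fit in dst_size bytes.
 * Returns false, leaving dst untouched, when the copy would overrun. */
bool bounded_copy(void *dst, size_t dst_size, const void *src, size_t src_len)
{
    if (dst == NULL || src == NULL)
        return false;
    if (src_len > dst_size)
        return false;    /* would run past dst into the next pool header */
    memcpy(dst, src, src_len);
    return true;
}
```

The caller has to handle the false return, typically by failing the request with an error status instead of truncating silently.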

Summary of Memory Bug Checks

Bug Check | Name | Description
0x19 | BAD_POOL_HEADER | A pool header was found to be invalid during a pool operation.
0x50 | PAGE_FAULT_IN_NONPAGED_AREA | Invalid system memory was referenced (often due to bad pointers).
0xC4 | DRIVER_VERIFIER_DETECTED_VIOLATION | Caught by Verifier; this is the “Gold Standard” for debugging.
0xBE | ATTEMPTED_WRITE_TO_READONLY_MEMORY | A driver tried to write to a segment of memory marked as read-only.

In the next article, we will tackle Deadlocks and Timeouts (0x133)—how to find out which thread is “hogging” the CPU and stalling the entire system.


Mastering the Blue Screen: A Guide to Windows Kernel Debugging

by Shameer Mohammed March 29, 2026

The “Blue Screen of Death” (BSOD) is often viewed with dread, but for a system engineer, it is a goldmine of diagnostic information. When Windows encounters a condition that compromises safe system operation, it halts and produces a Crash Dump.

In this first article, we will walk through the essential setup and a real-world analysis of a Driver IRQL Not Less or Equal bug check—one of the most common issues in the semiconductor and embedded space.


1. Setting the Stage: The Debugging Environment

Before diving into logs, you need the right tools. The industry standard is WinDbg (part of the Windows SDK).

  • Symbols: Ensure your symbol path is set correctly. Symbols translate memory addresses into human-readable function names.
    • Path: srv*C:\Symbols*https://msdl.microsoft.com/download/symbols
  • The Dump File: Locate your memory dump at %SystemRoot%\MEMORY.DMP (Complete Dump) or in %SystemRoot%\Minidump\.

2. Anatomy of a Bug Check

Every BSOD is defined by a Bug Check Code and four parameters. Let’s look at a classic case:

Bug Check 0xD1: DRIVER_IRQL_NOT_LESS_OR_EQUAL

This typically happens when a kernel-mode driver attempted to access pageable memory at a process IRQL (Interrupt Request Level) that was too high.

The Rule: You cannot access “paged” memory (memory that might be on the disk) when the CPU is running at DISPATCH_LEVEL or higher. Doing so triggers a fatal page fault.
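The rule reduces to a single comparison on the current IRQL. The sketch below models it in plain C using the standard numeric values of the three lowest levels; may_touch_pageable is a hypothetical helper, and a driver would obtain the level from KeGetCurrentIrql and usually assert the condition with PAGED_CODE().

```c
#include <stdbool.h>

/* The three lowest IRQL values on x86/x64 Windows. */
enum irql { PASSIVE_LEVEL = 0, APC_LEVEL = 1, DISPATCH_LEVEL = 2 };

/* Pageable code or data may be touched only below DISPATCH_LEVEL,
 * because servicing a page fault requires the thread to wait. */
bool may_touch_pageable(int current_irql)
{
    return current_irql < DISPATCH_LEVEL;
}
```

An ISR runs well above DISPATCH_LEVEL, so for it the answer is always false.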


3. Real Use Case: The Faulty Network Driver

Imagine a scenario where a system crashes every time a high-speed data transfer begins.

Step 1: Preliminary Analysis

Open the dump in WinDbg and run the “magic” command:

!analyze -v

Step 2: Interpreting the Output

The debugger identifies the faulting module:

Plaintext

MODULE_NAME: NetDriverX
FAULTING_MODULE: fffff801`4a220000 NetDriverX
PROCESS_NAME: System
TRAP_FRAME: ffff8001`5521a000

Step 3: Examining the Stack Trace

Look at the STACK_TEXT. This shows the sequence of function calls leading to the crash.

Plaintext

00 nt!KeBugCheckEx
01 nt!KiPageFault
02 NetDriverX!ProcessIncomingPackets+0x45
03 NetDriverX!IsrRoutine+0x12
04 nt!KiInterruptDispatch

Observation: The crash happened in NetDriverX!ProcessIncomingPackets called by an IsrRoutine (Interrupt Service Routine). ISRs run at a device IRQL (DIRQL), which is above DISPATCH_LEVEL.

Step 4: Finding the Culprit

By using kb (Display Stack Backtrace) and examining the code at the offset, we find that the driver tried to access a global configuration buffer that was marked as pageable. Since the ISR cannot wait for the disk to fetch that page, the system crashed.


4. Key Takeaways

  • IRQL Management: Always know your current IRQL. If you are at DISPATCH_LEVEL, your data must be in non-paged memory.
  • Analyze the Trap Frame: Use .trap followed by the address provided in the analysis to see the register state at the exact moment of the crash.
  • Verification: Use Driver Verifier during development to catch these IRQL violations before they reach the end-user.

Summary Table: Common Bug Checks

Code | Name | Typical Cause
0x1E | KMODE_EXCEPTION_NOT_HANDLED | Access violations or bad pointers in kernel code.
0x7B | INACCESSIBLE_BOOT_DEVICE | Missing storage drivers or hardware failure.
0x9F | DRIVER_POWER_STATE_FAILURE | Driver failing to handle sleep/wake transitions.
0x133 | DPC_WATCHDOG_VIOLATION | A single DPC running for too long, stalling the CPU.

In the next article, we will explore Memory Corruption (0x19) and how to use the “Pool” commands to track down “who” overwrote your buffer.


About Me

Shameer Mohammed, SoC Technologist

Shameer Mohammed believes that no topic is too complex if taught correctly. Backed by 21 years of industry experience launching Tier-1 chipsets and a solid foundation in Electronics and Communication Engineering, he has mastered the art of simplifying the complicated. His unique teaching style is scientifically grounded, designed to help students digest hard technical concepts and actually remember them. When he isn't decoding the secrets of silicon technologies, Shameer is exploring the inner workings of the human machine through his passion for Neuroscience and Bio-mechanics.
