The Memory Hierarchy — Caches, Coherency, and the Interconnect

by dnaadmin

 

In modern system design, the CPU is often a victim of its own speed. While processor frequencies have scaled into the GHz range, external DRAM remains orders of magnitude slower. As a System Architect, your primary job isn’t just to ensure the CPU can “think”—it’s to ensure the CPU is never “starving” for data.

This is where the Memory Hierarchy and Cache Coherency become the defining features of your SoC architecture.


1. The Pyramid of Latency

Every layer of memory is a trade-off between capacity and latency.

  • L1 Cache (Instruction/Data): Tiny (32-64KB), but accessible in ~1ns. This is the “workspace.”
  • L2 Cache: Larger (256KB-1MB), shared by a cluster of cores.
  • L3 Cache (LLC): Massive (8MB+), the final gatekeeper before the system bus.
  • System DRAM: Gigabytes of storage, but with a latency penalty of 100ns+.

The Architect’s Rule: Every “Cache Miss” is a performance catastrophe. If your firmware doesn’t respect spatial and temporal locality, your high-performance SoC will spend 90% of its time waiting for the bus.
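Spatial locality is easy to demonstrate. The sketch below (illustrative, not from the original post) sums the same 2D array twice: row-major order walks memory sequentially and reuses each fetched cache line, while column-major order jumps a full row's worth of bytes on every access, touching a new line almost every iteration.

```c
#include <stddef.h>
#include <assert.h>

#define ROWS 1024
#define COLS 1024

/* Row-major traversal: consecutive accesses fall in the same cache line,
 * so most loads are hits (spatial locality). */
long sum_row_major(int m[ROWS][COLS]) {
    long total = 0;
    for (size_t r = 0; r < ROWS; r++)
        for (size_t c = 0; c < COLS; c++)
            total += m[r][c];
    return total;
}

/* Column-major traversal: each access strides COLS * sizeof(int) bytes,
 * defeating the prefetcher and missing in cache far more often. */
long sum_col_major(int m[ROWS][COLS]) {
    long total = 0;
    for (size_t c = 0; c < COLS; c++)
        for (size_t r = 0; r < ROWS; r++)
            total += m[r][c];
    return total;
}
```

Both functions return the same value; only the traversal order, and therefore the miss rate, differs.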


2. The Invisible Traffic Cop: Cache Coherency

In a multi-core system, what happens when Core A modifies a variable that Core B also has in its local L1 cache? Without Coherency, Core B would read stale data, leading to silent corruption.

We manage this through a hardware protocol, most commonly MESI (Modified, Exclusive, Shared, Invalid).

  1. Modified: This core has the only valid copy and has changed it.
  2. Exclusive: This core has the only copy, and it matches main memory.
  3. Shared: Multiple cores have a copy; it matches main memory.
  4. Invalid: The data in this cache line is “garbage” and must be re-fetched.
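The four states above can be sketched as a tiny transition function. This is a simplified model for intuition only (real protocols add bus transactions, write-backs, and extra states like MOESI's Owned):

```c
#include <assert.h>

typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_t;

/* This core writes the line. E promotes to M silently; S and I must
 * first broadcast an invalidate / read-for-ownership, then take M. */
mesi_t on_local_write(mesi_t s) {
    (void)s;
    return MODIFIED;
}

/* Another core writes the address: any local copy is now stale. */
mesi_t on_remote_write(mesi_t s) {
    (void)s;
    return INVALID;
}

/* Another core reads the address. A Modified line must first supply
 * its dirty data (write-back or direct forward), then demote. */
mesi_t on_remote_read(mesi_t s) {
    if (s == MODIFIED || s == EXCLUSIVE)
        return SHARED;   /* no longer the sole owner */
    return s;            /* SHARED stays SHARED, INVALID stays INVALID */
}
```

Note that the expensive transitions are exactly the ones that move data: M demoting on a remote read, and S/I acquiring ownership on a write.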

3. The Interconnect: The Heart of the System

In a complex SoC, the Interconnect (like ARM’s AMBA CHI or AXI) is the “highway” that connects CPU clusters, GPUs, and high-speed I/O.

  • Snooping: The Interconnect monitors (“snoops”) the memory traffic. If Core A requests a memory address that Core B has “Modified,” the Interconnect forces Core B to write that data back or provide it directly to Core A.
  • Directory-Based Coherency: In massive data center chips with 64+ cores, “snooping” creates too much traffic. Architects use a Directory—a central database that tracks which core owns which memory line—to reduce bus congestion.
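A directory entry can be as simple as a presence bitmask plus ownership state. The struct and handler below are a hypothetical sketch for a 64-core system (field names and layout are invented for illustration); the point is that a read miss consults one table entry instead of snooping all 64 caches.

```c
#include <stdint.h>
#include <stdbool.h>
#include <assert.h>

/* Hypothetical directory entry for one cache line in a 64-core system:
 * one presence bit per core plus a dirty-ownership record. */
typedef struct {
    uint64_t sharers;  /* bit i set => core i holds a copy */
    uint8_t  owner;    /* meaningful only when dirty is true */
    bool     dirty;    /* true => owner holds the only (Modified) copy */
} dir_entry_t;

/* Handle a read miss from `core`. If the line is dirty, the interconnect
 * forwards the request to the single owning core (which writes back or
 * forwards the data); either way, `core` is recorded as a sharer. */
void dir_read_miss(dir_entry_t *e, unsigned core) {
    e->dirty = false;              /* owner demotes to Shared */
    e->sharers |= (1ULL << core);  /* new reader now tracked */
}
```

Compare this with snooping: the directory trades extra storage (one entry per tracked line) for point-to-point messages instead of broadcasts.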

4. Real-World Architectural Trade-offs

Tightly Coupled Memory (TCM) vs. Cache

For real-time embedded systems (like an SSD controller or an ABS braking system), caches are dangerous because they are non-deterministic: the same access might take ~1ns on a hit or 100ns+ on a miss, so worst-case execution time becomes unpredictable.

  • The Solution: Use TCM. This is a small slice of SRAM mapped to a fixed address. It has L1-like latency but zero jitter. You put your critical ISRs and stack here.
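In firmware, TCM placement is typically done with section attributes. The snippet below is a sketch: the section names `.tcm_code` and `.tcm_data` are hypothetical and must match whatever regions your linker script actually defines.

```c
#include <stdint.h>
#include <assert.h>

/* Hypothetical placement: pin the ISR's working buffer in TCM data RAM.
 * Section names must match your linker script. */
__attribute__((section(".tcm_data")))
static volatile uint32_t sample_buf[256];

/* Pin the ISR itself in TCM instruction RAM: it executes with L1-like
 * latency and zero cache jitter, because neither the code nor the
 * buffer can ever miss. */
__attribute__((section(".tcm_code")))
void dma_done_isr(void)
{
    sample_buf[0]++;
}
```

The same technique applies to the stack: reserve a TCM region in the linker script and point the stack pointer there at boot.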

False Sharing: The Firmware Performance Killer

If two cores are updating two different variables that happen to sit on the same 64-byte Cache Line, the hardware will constantly bounce that line between the cores.

  • Design Fix: Use compiler attributes (like __attribute__((aligned(64)))) to ensure high-frequency variables sit on their own cache lines.
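Concretely, the fix looks like the sketch below: each per-core counter is padded and aligned so it occupies a full 64-byte line by itself (assuming a 64-byte line size, which is typical but not universal).

```c
#include <stdint.h>
#include <assert.h>

/* Per-core counters, each on its own 64-byte cache line. Without the
 * alignment and padding, counters[0] and counters[1] would share a line,
 * and every increment would ping-pong that line between cores. */
struct padded_counter {
    uint64_t value;
    char pad[64 - sizeof(uint64_t)];  /* fill out the rest of the line */
} __attribute__((aligned(64)));

static struct padded_counter counters[4];  /* one slot per core */
```

The cost is wasted SRAM (56 bytes per counter here), which is why you reserve this treatment for genuinely high-frequency shared structures.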

5. Summary for the System Architect

| Feature | Design Goal | Impact on System |
| --- | --- | --- |
| Write-Back vs. Write-Through | Reduce bus traffic | Write-back is faster but requires complex coherency logic. |
| Inclusive vs. Exclusive Cache | Manage L3 utilization | Inclusive caches simplify snooping; exclusive caches provide more total storage. |
| Non-Maskable Interrupts (NMI) | Debugging hangs | Essential for extracting state when the interconnect is "locked up." |

Closing Thought

As we move toward Chiplets and CXL (Compute Express Link), the memory hierarchy is stretching outside the chip and across the data center rack. Understanding how to manage data consistency at the local level is the first step toward mastering the warehouse-scale computers of the future.


In our next article, we will tackle the debate that defines embedded software: Real-Time Determinism vs. Throughput—and how to choose the right OS for your architecture.
