Cache-Coherent Memory Subsystems

Updated 28 March 2026

Cache-coherent memory subsystems are multi-level architectures that ensure all processors see a single value per memory location using protocols such as MESI, MOESI, and their variants.
They implement coherence through methods like snooping and directory-based protocols to balance performance, scalability, and energy efficiency.
Advanced designs integrate programmable engines and disaggregated memory via interconnects like CXL, supporting heterogeneous systems and robust synchronization.

A cache-coherent memory subsystem is a multi-level memory architecture in which multiple computing elements, such as CPUs or accelerators, maintain a consistent view of shared memory through a hardware-coordinated protocol. This subsystem ensures that all processors see a single value for any memory location, preserving correctness in the face of parallel and distributed computation. Historically grounded in multiprocessor systems-on-chip (SoCs), cache coherence has evolved to encompass heterogeneous, manycore, and even disaggregated architectures, leveraging a variety of protocols, directory structures, and state machines to achieve scalability, performance, and programmability.

1. Core Principles and Protocol Taxonomy

Fundamental to cache-coherent memory subsystems is the enforcement of the single-writer/multiple-reader (SWMR) invariant: at any time, a memory block has either one writer or any number of readers. The invariant is enforced through snooping or directory-based mechanisms, using protocol families such as MSI, MESI, MOESI, and their extensions.
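As a minimal sketch of what the SWMR invariant means in practice, the check below inspects the per-cache states of a single block. It assumes a plain MSI state set for simplicity; the `State` enum and the `swmr_holds` helper are illustrative names, not part of any cited design.

```python
from enum import Enum

class State(Enum):
    M = "Modified"   # writable copy
    S = "Shared"     # read-only copy
    I = "Invalid"    # no copy

def swmr_holds(states):
    """Return True if the per-cache states of ONE block satisfy SWMR."""
    writers = sum(1 for s in states if s is State.M)
    readers = sum(1 for s in states if s is State.S)
    # Either no writer at all, or exactly one writer and no readers.
    return writers == 0 or (writers == 1 and readers == 0)
```

For example, `swmr_holds([State.S, State.S, State.I])` is true, while `swmr_holds([State.M, State.S])` is false because a reader coexists with a writer.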

Directory-Based Protocols: These schemes maintain per-block sharer and ownership state in a distributed or centralized directory; the BlackParrot BedRock protocol, for example, uses canonical MOESIF coherence states (Modified, Owned, Exclusive, Shared, Invalid, Forward), with transitions specified via explicit next-state functions at both the cache (LCE) and directory (CCE) controllers (Wyse et al., 2022, Wyse, 2 May 2025).
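The per-block sharer and ownership bookkeeping can be sketched as follows. This is a simplified MSI-style directory entry written for illustration, not the BedRock MOESIF protocol; the class name `DirectoryEntry` and its methods are assumptions.

```python
class DirectoryEntry:
    """Sharer/owner state that a directory keeps for one memory block."""

    def __init__(self):
        self.owner = None      # cache id holding the block writable, if any
        self.sharers = set()   # cache ids holding read-only copies

    def read(self, cache_id):
        """Handle a read miss; return the caches that must downgrade."""
        downgraded = []
        if self.owner is not None and self.owner != cache_id:
            # The previous writer keeps a read-only copy.
            downgraded.append(self.owner)
            self.sharers.add(self.owner)
            self.owner = None
        if self.owner != cache_id:
            self.sharers.add(cache_id)
        return downgraded

    def write(self, cache_id):
        """Handle a write miss or upgrade; return the caches to invalidate."""
        victims = set(self.sharers)
        if self.owner is not None:
            victims.add(self.owner)
        victims.discard(cache_id)
        self.sharers.clear()
        self.owner = cache_id
        return sorted(victims)
```

The directory sends invalidations only to the caches it has recorded, which is what lets directory schemes avoid broadcast.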

Snooping Protocols: These rely on broadcast buses/channels, such as in the 3D MPSoC cluster-local MOESI architecture, where every cache observes update and invalidation requests in real time (Cataldo et al., 28 Apr 2025).
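A broadcast-based scheme can be sketched in a few lines. The model below uses MSI rather than MOESI to stay short, and the names `SnoopyCache`, `Bus`, `BusRd`, and `BusRdX` follow common textbook usage rather than the cited 3D MPSoC design.

```python
class SnoopyCache:
    def __init__(self):
        self.state = {}   # addr -> "M" or "S"; an absent address means Invalid

    def snoop(self, addr, op):
        """React to a request observed on the shared bus."""
        if op == "BusRdX":
            self.state.pop(addr, None)        # another cache wants to write
        elif op == "BusRd" and self.state.get(addr) == "M":
            self.state[addr] = "S"            # another cache wants to read

class Bus:
    def __init__(self, n):
        self.caches = [SnoopyCache() for _ in range(n)]

    def _broadcast(self, src, addr, op):
        # Every cache except the requester observes the request.
        for i, cache in enumerate(self.caches):
            if i != src:
                cache.snoop(addr, op)

    def load(self, src, addr):
        if addr not in self.caches[src].state:
            self._broadcast(src, addr, "BusRd")
            self.caches[src].state[addr] = "S"

    def store(self, src, addr):
        if self.caches[src].state.get(addr) != "M":
            self._broadcast(src, addr, "BusRdX")
            self.caches[src].state[addr] = "M"
```

Every miss reaches every cache here, which is simple but explains why snooping is usually confined to small clusters.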

Emerging Models: Recent designs shift toward shrinking or eliminating directories to minimize area and energy, as in directoryless shared last-level caches (DLS) leveraging weak consistency and speculative mechanisms (Liu et al., 2012), embedding synchronization at the coherence protocol level for disaggregated memory (Yu et al., 2023), or integrating hybrid and programmable coherence engines (Wyse et al., 2022, Wyse, 2 May 2025).

2. Implementation Architectures

Architectural realizations span monolithic multicore, chiplet-based, heterogeneous, and globally distributed settings.

Tile-Based Multicore: Systems such as BlackParrot implement per-core private L1s, a directory-based shared cache (L2 or LLC), and a scalable interconnect, with MOESIF or MSI/MESI protocols augmented by programmable or fixed-function coherence engines (Wyse et al., 2022, Wyse, 2 May 2025, Zoni et al., 5 Aug 2025).

3D NoC and MPSoC: In cluster-based 3D MPSoCs, intra-cluster communication uses a crossbar (e.g., ARM CCI-400 derivative) and MOESI snooping, while inter-cluster communication over packet-switched NoCs (NORMA model) operates without coherence, requiring explicit messaging for data consistency (Cataldo et al., 28 Apr 2025).

Distributed Directories: Scalable manycore architectures (e.g., Intel KNL mesh) distribute directory state across the die, mapping blocks to "tiles" using complex pseudo-random hashing to balance load and avoid hotspots. Directory queries and coherence traffic span the entire mesh, impacting latency and software optimization opportunities (Kommrusch et al., 2020).
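The effect of a hashed block-to-tile mapping can be illustrated with a stand-in hash. The actual hardware hash is not given in the text, so `blake2b` is used here purely as a substitute, and `BLOCK_BYTES`, `NUM_TILES`, and `home_tile` are hypothetical names and values.

```python
import hashlib
from collections import Counter

BLOCK_BYTES = 64   # assumed cache-line size
NUM_TILES = 36     # assumed tile count, for illustration only

def home_tile(addr, num_tiles=NUM_TILES):
    """Map a physical address to the tile holding its directory entry."""
    block = addr // BLOCK_BYTES
    digest = hashlib.blake2b(block.to_bytes(8, "little"), digest_size=8).digest()
    return int.from_bytes(digest, "little") % num_tiles

# Count how 36,000 consecutive blocks spread over the tiles.
counts = Counter(home_tile(a) for a in range(0, BLOCK_BYTES * 36_000, BLOCK_BYTES))
```

Consecutive blocks land on unrelated tiles, which balances directory load but also means software cannot easily predict which tile is the home of a given address.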

Disaggregated Memory and CXL: Coherent disaggregated memory, enabled by interconnects like CXL (Compute Express Link), unifies hosts and remote memory or accelerators into a global shared address space. Protocols extend coherence domains using MESI/MOESI over CXL.cache and CXL.mem, supported by hardware-enforced directories located in the host's root complex, in switches, or in devices (Liu et al., 11 Jun 2025, Wang et al., 28 Nov 2025, Xu et al., 2024, Assa et al., 2024).

3. Protocol State Machines and Formal Models

Protocols are specified as state machines, with transitions defined for each cacheline state and event—load, store, invalidation, or external command.

BlackParrot BedRock Protocol Example:

MOESIF states: I, S, E, F, M, O.
State transition function for LCE:

$\delta_{\rm LCE}(I, \mathit{Ld}) = S, \quad \delta_{\rm LCE}(I, \mathit{St}) = M, \ldots$

CCE directs message phases: Request, Command, Fill, Response, atomizing all compound commands to prevent transient state races (Wyse, 2 May 2025).
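A next-state function of this kind can be written as a lookup table. In the sketch below only the two entries marked "source" come from the text; all other entries are textbook-style assumptions, and the O and F states are omitted because their transitions are not specified here. The names `DELTA_LCE` and `run` are illustrative.

```python
# (current state, event) -> next state.
# Events: Ld = load, St = store, Inv = invalidation from the directory.
DELTA_LCE = {
    ("I", "Ld"): "S",    # source
    ("I", "St"): "M",    # source
    ("S", "Ld"): "S",    # assumed: read hit
    ("S", "St"): "M",    # assumed: upgrade
    ("E", "Ld"): "E",    # assumed: read hit
    ("E", "St"): "M",    # assumed: silent upgrade
    ("M", "Ld"): "M",    # assumed: read hit
    ("M", "St"): "M",    # assumed: write hit
    ("S", "Inv"): "I",   # assumed
    ("E", "Inv"): "I",   # assumed
    ("M", "Inv"): "I",   # assumed: after writeback
}

def run(state, events):
    """Apply a sequence of events to a cache line, returning its final state."""
    for event in events:
        state = DELTA_LCE[(state, event)]
    return state
```

With this table, `run("I", ["Ld", "St", "Inv"])` walks the line through S and M and back to I.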

Programmability and Verification: Fixed-function FSMs minimize latency, while microcode engines allow protocol patching and experimentation with novel policies at a small area and performance penalty. Eliminating transient states dramatically reduces the verification complexity of the protocol model, cutting verification runtime by orders of magnitude (Wyse, 2 May 2025).

Generalized Models: Synchronization primitives are unified with cache coherence via embedded wait queues and variable-size protection domains for scalable lock handoff—captured in the Generalized Cache Coherence (GCS) formalism with directory tuples $(\mathrm{perm}, \mathrm{SH}, Q, \mathrm{RList})$ and explicit queue-driven transition rules (Yu et al., 2023).
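The idea of a wait queue embedded in the directory entry can be sketched as below. This is a simplified illustration, not the formalism from the cited work: it models only `perm`, `SH`, and `Q`, leaves out `RList` because the text does not define it, and the grant rules are assumptions.

```python
from collections import deque

class DirEntry:
    def __init__(self):
        self.perm = "I"     # "I" (free), "S" (read-held), "M" (write-held)
        self.SH = set()     # nodes currently holding the block
        self.Q = deque()    # waiting (node, mode) requests, FIFO

    def acquire(self, node, mode):
        """Grant at once if compatible, else enqueue. Returns True if granted."""
        free = self.perm == "I"
        can_share = mode == "S" and self.perm == "S" and not self.Q
        if free or can_share:
            self.perm = mode
            self.SH.add(node)
            return True
        self.Q.append((node, mode))
        return False

    def release(self, node):
        """Drop a hold and hand off to waiters; returns the newly granted nodes."""
        self.SH.discard(node)
        if self.SH:
            return []
        self.perm = "I"
        granted = []
        while self.Q:
            nxt, mode = self.Q[0]
            if self.perm == "I" or (mode == "S" and self.perm == "S"):
                self.Q.popleft()
                self.perm = mode
                self.SH.add(nxt)
                granted.append(nxt)
            else:
                break
        return granted
```

Because the queue lives with the directory state, a release can hand the block straight to the next waiter instead of forcing waiters to retry.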

State-Embedding in Disaggregated Memory: SELCC protocol uses a 64-bit latch-word to encode both owner and sharer sets atomically, ensuring MSI-consistent transitions realized entirely via RDMA atomics (Wang et al., 2024).
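How a single 64-bit word can carry both owner and sharer information is sketched below. The bit layout (high 8 bits for the writer, low 56 bits as a reader bitmap) is an assumption made for illustration, and `LatchWord.cas` is a local stand-in for an RDMA compare-and-swap, not a real RDMA API.

```python
SHARER_BITS = 56                       # assumed: low 56 bits are a reader bitmap
SHARER_MASK = (1 << SHARER_BITS) - 1   # assumed: high 8 bits hold writer id + 1

class LatchWord:
    """Local stand-in for a 64-bit word updated by remote compare-and-swap."""

    def __init__(self):
        self.value = 0   # 0 means no reader and no writer (Invalid everywhere)

    def cas(self, expected, new):
        if self.value == expected:
            self.value = new
            return True
        return False

def try_shared(latch, node):
    old = latch.value
    if old >> SHARER_BITS:   # a writer holds the line
        return False
    return latch.cas(old, old | (1 << node))

def try_exclusive(latch, node):
    # Succeeds only when the word is zero: no writer and no readers.
    return latch.cas(0, (node + 1) << SHARER_BITS)

def release_shared(latch, node):
    old = latch.value
    return latch.cas(old, old & ~(1 << node))

def release_exclusive(latch, node):
    return latch.cas((node + 1) << SHARER_BITS, 0)
```

In a real deployment the read-then-CAS pairs would need a retry loop, since another node may change the word between the two steps.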

4. Performance, Scalability, and Energy

Performance metrics are tightly coupled to protocol design, hierarchy depth, and interconnect characteristics.

Latency and Bandwidth Models: Analytical expressions, such as

$L_{\text{remote}} \approx 2 L_{\text{link}} + L_{\text{DRAM}} + L_{\text{coh}},$

quantify remote cacheline access over CXL-based or disaggregated subsystems, with latencies typically in the $300$–$600\,\mathrm{ns}$ range and bandwidths of $18$–$52\,\mathrm{GB/s}$ (Liu et al., 11 Jun 2025).
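The latency expression can be evaluated directly. The component values below are hypothetical and chosen only to show the arithmetic; they are not measurements from the cited work.

```python
def remote_latency_ns(l_link, l_dram, l_coh):
    """L_remote ~= 2 * L_link + L_DRAM + L_coh, all in nanoseconds."""
    return 2 * l_link + l_dram + l_coh

# Hypothetical component latencies: one-way link, DRAM access, coherence handling.
example = remote_latency_ns(l_link=120, l_dram=90, l_coh=60)
```

With these inputs the result is 390 ns. The link term counts twice because the request and the response each cross the interconnect once.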

Directory Overheads: BedRock's duplicate-tag directory introduces a modest, constant $6.25\%$ SRAM overhead, compared to $\geq 7.8\%$ for full bit-vector directories, scaling favorably with core count (Wyse et al., 2022).

Programmable Engines & Area: Microcode-programmable CCEs incur $\sim 4\%$ area/logic overhead while keeping performance close to that of FSM baselines for typical SPLASH-3 workloads (Wyse et al., 2022).

Efficiency in Disaggregation: GCS improves throughput of in-memory key-value stores on disaggregated memory by up to two orders of magnitude over POSIX reader-writer locks, by collapsing handoff messages and minimizing in-network transactions (Yu et al., 2023). SELCC achieves near-linear scaling up to 8 compute nodes, outperforming RPC-based protocols especially under limited server CPU availability (Wang et al., 2024).

Optimization Limitations: Hardware-enforced pseudo-random block-to-directory mappings in manycores balance load, but they impose large software overheads on locality-aware scheduling. The added instruction count and disrupted prefetching can offset the gains from reduced access latency (Kommrusch et al., 2020).

5. Programming Models and System Integration

Cache-coherent memory subsystems underpin seamless parallel programming, SVM, and accelerator support.

Shared Virtual Memory: Integrated SVM allows CPUs and accelerators (e.g., GPUs/MTTOPs) to transparently dereference shared data without explicit DMA, supporting sophisticated synchronization via directory-MOESI protocols and extensions of the pthreads model (xthreads) (Hechtman et al., 2013).

Heterogeneous and Extensible Subsystems: In SoCs with mixed CPU/accelerator components, RL-based orchestration systems dynamically select coherence modes (ranging from non-coherent DMA to fully-coherent caching) to optimize throughput and bandwidth utilization at runtime, independent of static hardware allocation (Zuckerman et al., 2021).
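Runtime selection among coherence modes can be illustrated with a very small learner. The sketch below uses an epsilon-greedy bandit as a simple stand-in for the RL-based orchestration described above, not the cited method; the mode labels (in particular the intermediate one) and the `ModeSelector` class are assumptions.

```python
import random

# Illustrative labels; only the two extremes are named in the text.
MODES = ["non-coherent-dma", "llc-coherent-dma", "fully-coherent"]

class ModeSelector:
    def __init__(self, modes, epsilon=0.1, seed=0):
        self.modes = list(modes)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.count = {m: 0 for m in self.modes}
        self.value = {m: 0.0 for m in self.modes}   # running mean reward

    def choose(self):
        """Explore with probability epsilon, otherwise pick the best mode so far."""
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.modes)
        return max(self.modes, key=lambda m: self.value[m])

    def update(self, mode, reward):
        """Fold an observed reward (e.g. measured throughput) into the mean."""
        self.count[mode] += 1
        self.value[mode] += (reward - self.value[mode]) / self.count[mode]
```

A driver would call `choose()` before each accelerator invocation and `update()` afterwards with a measured reward, so the preferred mode can shift as the workload changes.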

CXL-Specific Programming: The CXL0 abstraction formalizes remote/local stores, flushes, and persistence primitives, supporting coherent and durable operation across failure domains. Transformations like FliT extend standard linearizable data structures to guarantee durable linearizability under CXL's partial-failure model (Assa et al., 2024).

6. Advanced Topics: Synchronization, Disaggregation, and Security

Modern protocols address latency and energy bottlenecks in distributed and disaggregated memory systems, often embedding synchronization directly in the coherence substrate.

Scalable Synchronization: By merging lock semantics into the coherence layer (GCS), synchronization over variable-size protection domains becomes practical even at data center scale (Yu et al., 2023).

Zero-CPU Server Protocols: SELCC implements all metadata operations via one-sided RDMA atomics, eliminating remote CPU involvement and enabling low-latency, scalable, strongly consistent data access (Wang et al., 2024).

Programmability and Adaptivity: The movement toward programmable/flexible directories (via microcode engines or hybrid pipelines) supports in-field updates, experiment-friendly policy changes, and security feature insertion without silicon respins (Wyse, 2 May 2025).

Failure Tolerance and Persistence: On CXL, process and data failure domains are disjoint; techniques such as switch-offload replication, erasure coding, and checkpointing are necessary to achieve bandwidth-optimized durability, with trade-offs governed by bandwidth, latency, and recovery targets (Xu et al., 2024).
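The bandwidth trade-off behind erasure coding can be seen in its simplest form, single XOR parity, sketched below. This is a generic illustration rather than the scheme from the cited work, and the function names are hypothetical.

```python
def xor_parity(blocks):
    """Return the byte-wise XOR of equally sized blocks."""
    parity = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            parity[i] ^= byte
    return bytes(parity)

def recover(surviving, parity):
    """Rebuild the single missing block from the survivors and the parity."""
    return xor_parity(list(surviving) + [parity])

data = [b"abcd", b"efgh", b"ijkl"]
parity = xor_parity(data)
rebuilt = recover([data[0], data[2]], parity)   # data[1] was lost
```

Here three data blocks need one extra parity block, whereas full replication would need three extra copies. The price is that recovery must read all surviving blocks.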

7. Future Directions and Open Problems

Research continues to address open challenges in cache-coherent memory subsystems, including:

Hierarchical and region-based directory protocols to control coherence traffic explosion in many-device or disaggregated settings (Wang et al., 28 Nov 2025).
Adaptive, model-driven cache management policies using lightweight, hardware-amenable ML (e.g., GMM-based policies) for emerging coherence extensions (Chen et al., 2024).
Compositional verification of flexible/programmable coherence engines for rapid SoC prototyping (Zoni et al., 5 Aug 2025, Wyse, 2 May 2025).
Cross-layer co-design of OS, programming models, and hardware to facilitate portability, durability, and recoverability on volatile, coherent disaggregated memory (Assa et al., 2024, Xu et al., 2024).
Battle-tested, open-source reference platforms for cycle-accurate full-system simulation and FPGA-free validation of complex protocols at scale (Zoni et al., 5 Aug 2025, Wyse et al., 2022).

Cache-coherent memory subsystems remain central to scalable multicore, heterogeneous, and distributed computation, with modern protocols leveraging hardware mechanisms, programming abstractions, and system-level co-design to balance consistency, performance, power, and correctness in increasingly complex systems.