
Heterogeneous Memory System Architecture

Updated 8 December 2025
  • Heterogeneous Memory System Architecture is a unified framework that integrates various memory technologies to balance bandwidth, latency, capacity, and persistence for diverse applications.
  • Hardware-software co-design enables dynamic data placement and migration through tiered memory organization and task-aware allocation strategies.
  • Analytical models and benchmarks quantify performance gains and energy savings, guiding optimal resource allocation in high-performance computing, AI, and embedded systems.

A heterogeneous memory system architecture combines distinct memory technologies, capacities, and interconnects within a unified compute environment. The goal is to exploit the respective strengths of each technology (bandwidth, latency, capacity, or persistence) and to align application and system software with the best-fit memory device. The resulting architectures support more complex data placement, access, and migration decisions at both hardware and software levels, and are central to high-performance computing, embedded systems, AI, and disaggregated datacenter platforms.

1. Taxonomy of Heterogeneous Memory Systems

Heterogeneous memory systems can be classified by the combination and organization of their diverse device types:

  • Multi-Tiered DRAM/NVM Compositions: Architectures often include a fast, low-capacity memory (e.g., HBM2/3, eDRAM, SRAM) as a cache or near-memory, and a slow, high-capacity memory (e.g., DDR4/5, LPDDR, persistent memory such as Intel Optane DC, or STT-RAM) as a backing store (Babaie et al., 2023, Onsori et al., 2019, Hwang et al., 21 Apr 2025).
  • Disaggregated and Composable Pools: Platforms decouple compute from memory, enabling dynamic aggregation of heterogeneous pools across CXL-attached devices or composable servers (Wang et al., 5 Nov 2024).
  • Tightly Coupled UMA/NUMA Systems: Devices such as Grace Hopper GH200 superchips integrate CPU and GPU with both HBM and LPDDR/DDR, unified by high-bandwidth interconnects and coherent virtual memory (Fusco et al., 21 Aug 2024).
  • Hybrid On-Die/3D-Stacked Memory: 3D embedded chip-multiprocessors (eCMPs) stack eDRAM and STT-RAM, with bank-level allocation optimized via convex programming (Onsori et al., 2019).
  • Shared Memory for Accelerators/SoCs: ADAS SoCs and neuromorphic processors employ many-ported SRAM/DRAM banks whose arbitration algorithms and topologies support massive concurrent accesses from CPUs, GPUs, and AI accelerators (Luan et al., 2022, Moradi et al., 2017).

Within these architectures, memory management units (MMUs), page tables, and cross-device coherence models are tailored to the intended access patterns and performance/power constraints, giving rise to application-specific and energy-optimal design points (Kim et al., 2017).

2. Hardware-Software Co-Design and Data Placement

Effective utilization of heterogeneity requires dynamic, context-aware placement and migration policies. Key approaches include:

  • Automatic Tier Guidance: Software frameworks such as SICM identify allocation-site "arenas" and use sampled access statistics and ski-rental cost models to migrate pages between DRAM and NVM or persistent memory, optimizing for both bandwidth and capacity constraints with runtime-only profiling and no prior input knowledge (Olson et al., 2021); a simplified version of this cost-model reasoning is sketched after this list.
  • Task- and Latency-Aware Allocation: Dynamic memory management mechanisms solve integer linear programs or apply heuristics to assign multitasking workloads to appropriate memory models—copy-then-execute, zero-copy, or unified memory—minimizing response time and peak footprint within platform constraints (Wang et al., 2022).
  • Runtime Abstraction and Coherence: Solutions such as RIMMS expose hardware-agnostic APIs (e.g., hete_Malloc, hete_Sync) and track "last owner" metadata per object, efficiently synchronizing or migrating objects between CPU, GPU, and FPGA domains in response to per-task assignment and dataflow (Gener et al., 28 Jul 2025); an ownership-tracking sketch follows the table below.
  • Granular Kernel-to-Memory Mapping: Hardware-accelerated mechanisms (e.g., H2M2) dynamically balance kernels (GEMMs, GEMVs, attention heads) across bandwidth-centric (HBM) and capacity-centric (LPDDR) memory/accelerator pairs to minimize end-to-end inference latency under current capacity and bandwidth constraints (Hwang et al., 21 Apr 2025).
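
As a concrete illustration of the ski-rental reasoning SICM applies, the following C sketch decides when to migrate a page from the slow tier to DRAM: once the accumulated penalty of serving it from slow memory exceeds the one-time migration cost, migration pays off. The latency and migration-cost constants below are illustrative assumptions, not values taken from the paper.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative constants (assumptions, not values from the cited work). */
#define NVM_EXTRA_NS      220UL    /* extra latency paid per access served from NVM */
#define MIGRATION_COST_NS 40000UL  /* one-time cost of migrating a 4 KiB page to DRAM */

typedef struct {
    uint64_t nvm_accesses;  /* sampled accesses served from the slow tier */
} page_stats_t;

/* Ski-rental rule: keep "renting" (paying the NVM penalty per access) until the
 * total rent paid matches the "buy" price (one page migration), then migrate. */
static bool should_migrate(const page_stats_t *p)
{
    uint64_t rent_paid = p->nvm_accesses * NVM_EXTRA_NS;
    return rent_paid >= MIGRATION_COST_NS;
}
```

With these assumed numbers a page is promoted after roughly 180 sampled slow-tier accesses; a real policy would derive both constants from measured device timings and amortize the sampling overhead.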

The following table provides a brief mapping between primary memory placement models and their typical use contexts:

| Policy/Mechanism | Target Hardware/Platform | Optimization Target |
| --- | --- | --- |
| SICM Exploit (thermos, knapsack) | Cascade Lake + Optane DC | Bandwidth/capacity trade-off (Olson et al., 2021) |
| Latency-Aware Prefetching | Jetson AGX/Xavier, NVIDIA UVM | Startup latency hiding (Wang et al., 2022) |
| H2M2 Dynamic Mapping | Asymmetric LLM inference HW | Minimax latency/bandwidth (Hwang et al., 21 Apr 2025) |
| RIMMS "last writer"/fragment API | CPU+GPU+FPGA runtime | Minimal transfers, high-level API (Gener et al., 28 Jul 2025) |

3. Memory System Modeling and Performance Quantification

Rigorous theoretical models underlie design, evaluation, and run-time decisions in heterogeneous memory systems:

  • Cycle-Accurate Queue and Latency Models: gem5-based models parameterize DRAM caches in terms of AMAT ($\mathrm{AMAT} = h\,L_{\mathrm{cache}} + (1-h)\,L_{\mathrm{mem}}$, where $h$ is the DRAM-cache hit rate), per-request DRAM/NVM timings, and buffering strategies (ORB/CRB/WB) to enable reasoned trade-offs across associativity, line size, and replacement (Babaie et al., 2023).
  • Energy and Endurance Optimization: Convex programming is used to minimize $E_{\text{total}}$ under constraints on capacity, static power, and STT-RAM write cycles, selecting optimal per-bank allocation, type, and placement $\{\mathrm{DRC}_{m,x,y}, \mathrm{STC}_{n,x,y}\}$, leading to architectural points with up to 61.3% energy savings and 9% IPC improvement (Onsori et al., 2019).
  • Bandwidth/Latency Analysis: Source-level benchmarking (e.g., Heimdall, microbenchmarks) characterizes per-access latency (~80 ns DDR4, >300 ns Optane, >400 ns CXL.mem), bandwidth scaling under parallelism, and effect of allocation/interleaving on throughput (Wang et al., 5 Nov 2024, Fusco et al., 21 Aug 2024).
  • Application-Aware Migration and Placement: Rank- and cost-based placement algorithms apply knapsack or thermos rules at runtime, quantifying per-allocation-site "hotness" as access count per page to inform migration for maximum throughput under tight DRAM budgets (Olson et al., 2021); a simplified greedy variant is sketched below.
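
A minimal version of this rank-based packing: regions (pages or allocation-site arenas) are sorted by sampled access density and greedily admitted to DRAM until the capacity budget is exhausted, with everything else left in the slower tier. This is a simplified greedy knapsack that ignores migration cost; it is a sketch of the idea, not the algorithm from the cited paper.

```c
#include <stddef.h>
#include <stdlib.h>

typedef struct {
    unsigned long accesses;  /* sampled access count ("hotness") */
    size_t        bytes;     /* footprint of the page or arena */
    int           in_dram;   /* output: 1 if placed in the fast tier */
} region_t;

/* Sort by access density (accesses per byte), hottest first. */
static int by_hotness_desc(const void *a, const void *b)
{
    const region_t *x = a, *y = b;
    double dx = (double)x->accesses / (double)x->bytes;
    double dy = (double)y->accesses / (double)y->bytes;
    return (dy > dx) - (dy < dx);
}

/* Greedy placement: admit regions into DRAM in hotness order until the budget runs out. */
static void place_regions(region_t *r, size_t n, size_t dram_budget)
{
    qsort(r, n, sizeof *r, by_hotness_desc);
    for (size_t i = 0; i < n; i++) {
        r[i].in_dram = (r[i].bytes <= dram_budget);
        if (r[i].in_dram)
            dram_budget -= r[i].bytes;
    }
}
```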

4. System Software and Runtime Mechanisms

Software layers play a critical role in making heterogeneous memory accessible and performant:

  • Unified Virtual Addressing/Shared Memory: Modern CPUs/GPUs expose a shared address space. For tightly-coupled systems (e.g., GH200), transparent fine-grained access is achieved through ATS-enabled hardware MMUs and page-table sharing; in software distributed shared memory (e.g., SAT S-DSM), chunk-based allocation and automated MESI coherence enforce global consistency (Cudennec, 2020, Fusco et al., 21 Aug 2024).
  • Transactional and Scope-Based Memory Models: Systems such as HeTM introduce a global, device-replicated transactional region, supporting speculative execution, batched validation, shadow buffering, and low-overhead merge to present the illusion of atomic updates across CPU and GPU (Castro et al., 2019).
  • Flexible Consistency and Coherence: Fine-grain coherence specialization and per-chunk protocol selection allow the system to match protocol strength to workload phase, reducing network traffic (99% reduction possible) and ensuring efficient data movement (Cudennec, 2020, Alsop et al., 2021).
  • Energy-Aware Polling and Micro-Sleep: Hybrid programming models employ event-driven subscription combined with adaptive micro-sleep loops, substantially reducing core wakeups and energy use during communication-heavy runs (Cudennec, 2020); a minimal back-off loop of this kind is sketched after this list.
  • Accelerator-Specific MMU Design: Research demonstrates that accelerators require tailored MMUs: even ideal L1 TLBs introduce significant overhead if translation is offloaded to CPUs, necessitating independent, often application-specific MMU design (Kim et al., 2017). Integration with SVM demands TLB prefetching, miss-handling helper threads, and DMA engines aware of translation faults (Kurth et al., 2018).
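
The adaptive micro-sleep idea can be approximated by a loop that backs off exponentially while a flag stays idle, trading a small amount of wake-up latency for far fewer busy-poll iterations. The bounds below are illustrative assumptions, and the cited work combines this with event subscription rather than pure polling.

```c
#define _POSIX_C_SOURCE 199309L
#include <stdatomic.h>
#include <time.h>

/* Wait for `flag` to become nonzero, sleeping progressively longer while idle. */
void wait_with_microsleep(atomic_int *flag)
{
    long sleep_ns = 1000;             /* start at 1 us (assumed starting point) */
    const long max_ns = 1000000L;     /* cap the back-off at 1 ms (assumed) */

    while (atomic_load_explicit(flag, memory_order_acquire) == 0) {
        struct timespec ts = { 0, sleep_ns };
        nanosleep(&ts, NULL);         /* yield the core instead of busy-spinning */
        if (sleep_ns < max_ns)
            sleep_ns *= 2;            /* exponential back-off while nothing arrives */
    }
}
```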

5. Hardware Structures and Architectural Considerations

Hardware design for heterogeneous memory exhibits diversity in topology, arbitration, and integration:

  • Split-and-Dispatch Topologies: Many-ported SRAM subsystems for ADAS SoCs hierarchically split accesses across clusters and banks, scheduling via round-robin and weighted arbitration to ensure deterministic latency and near-ideal throughput even for 16+ heterogeneous masters (>95% of theoretical bandwidth) (Luan et al., 2022); a simplified weighted round-robin arbiter is sketched after this list.
  • 3D-Stacked and Mesh/Heterogeneous Routing: DYNAP-style neuromorphic processors combine local CAM/SRAM for synapse storage with hierarchical-mesh routers, delivering 100× memory footprint savings, <15 ns hop latency, and ultra-low energy per routed event in asynchronous QDI circuits (Moradi et al., 2017).
  • Modular and Disaggregated Architectures: CXL-based clusters support direct memory pooling and composable expansion, with CXL Type 3 memory expanders offering 256 GiB+ DDR5 per device, PCIe5 x8 links, and disaggregated protocol stacks (Wang et al., 5 Nov 2024).
  • Asymmetric/Paired Device Memory: LLM inference accelerators pair bandwidth-centric (HBM3) and capacity-centric (LPDDR5X) memory, with independent kernels dispatched to each, fast MMU support, and a protocol for efficient cross-device page migration, achieving sublinear cost scaling with model size and batch (Hwang et al., 21 Apr 2025).
  • Staged Controllers: The SMS controller design decouples batch formation, scheduler arbitration, and DRAM command issuance, achieving +41% throughput and 4.8x better fairness compared to prior monolithic designs, with over 40% area and 66% power savings (Ausavarungnirun et al., 2018).
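
A toy software model of the weighted round-robin arbitration mentioned above: each master receives a credit budget per round proportional to its weight, and the arbiter rotates through requesters, granting while credits remain. Port counts and weights here are placeholders, not parameters from the cited designs.

```c
#include <stdbool.h>

#define NUM_MASTERS 4

typedef struct {
    int weight[NUM_MASTERS];   /* grants allowed per round for each master */
    int credit[NUM_MASTERS];   /* remaining grants in the current round */
    int last;                  /* last granted master, for round-robin rotation */
} wrr_arbiter_t;

/* Refill every master's credits at the start of a new arbitration round. */
static void wrr_new_round(wrr_arbiter_t *a)
{
    for (int i = 0; i < NUM_MASTERS; i++)
        a->credit[i] = a->weight[i];
}

/* Pick the next requesting master that still has credit, rotating from the last
 * grant. Returns the master index, or -1 if nothing can be granted this cycle. */
static int wrr_grant(wrr_arbiter_t *a, const bool req[NUM_MASTERS])
{
    for (int k = 1; k <= NUM_MASTERS; k++) {
        int m = (a->last + k) % NUM_MASTERS;
        if (req[m] && a->credit[m] > 0) {
            a->credit[m]--;
            a->last = m;
            return m;
        }
    }
    return -1;  /* all requesters out of credit: caller starts a new round */
}
```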

6. Challenges and Open Problems

While heterogeneous memory systems unlock significant gains in performance, energy, and flexibility, multiple areas remain open for research:

  • Fine-Grain Consistency and Migration: Efficient migration protocols at sub-page or object granularity pose implementation and correctness challenges, particularly as remote/fabric-based memory becomes prominent (Wang et al., 5 Nov 2024, Babaie et al., 2023).
  • Security and Side-Channel Resistance: Unified memory exposes new attack surfaces: co-located attackers can observe page faults to reconstruct DNN models, driving research in in-flight AES-GCM co-encryption and hardware-assisted security (Wang et al., 2022).
  • Compiler, OS, and Runtime Integration: Full utilization demands automatic, low-overhead, and portable integration of memory management schemes into compilers, system allocators, and application binaries, with minimal or no manual tuning (Olson et al., 2021, Gener et al., 28 Jul 2025).
  • Disaggregated/Fabric Topologies: CXL 3.0 and beyond introduce multi-hop, composable, multi-tenant deployments, requiring new algorithms for page placement, load balancing, latency hiding, and security enforcement (Wang et al., 5 Nov 2024). The integration of distributed shared memory or event-based models provides a flexible software substrate, but also adds new performance and correctness constraints (Cudennec, 2020).
  • Energy-Performance-Endurance Trade-offs: Methods for dynamically balancing STT-RAM versus eDRAM allocation, bank placement, and the enforcement of endurance constraints remain critical as NVM is adopted more widely (Onsori et al., 2019).

7. Exemplary Benchmarks and Applications

The evaluation of heterogeneous memory spans microbenchmarks, synthetic workloads, and large application domains:

  • Heimdall Benchmark Suite: Used to systematically stress CXL-attached memory devices via pointer chasing, strided access, atomic, and bandwidth-saturation patterns, as well as full-scale LLM inference, graph kernels, and KV-store end-to-end performance (Wang et al., 5 Nov 2024); a minimal pointer-chasing kernel is sketched after this list.
  • Real-World AI/ML Use Cases: H2M2 is evaluated on LLMs (GPT3-175B, Chinchilla-70B, LLaMA2-70B), showing 1.46–2.94× speedup over baseline homogeneous LPDDR in single-iteration inference (Hwang et al., 21 Apr 2025).
  • SPEC, PARSEC, CORAL: Profiling and migration tools demonstrate large (up to 7×) speedups over naive first-touch policies when running real-world scientific, graph, or vision codes under DRAM-scarce conditions (Olson et al., 2021, Onsori et al., 2019).
  • Embedded and Neuromorphic Systems: DYNAP chips and ADAS memory systems demonstrate functional correctness and throughput on challenging real-time tasks such as high-speed symbol recognition (Poker-DVS) and multiple concurrent sensor/data processing pipelines (Moradi et al., 2017, Luan et al., 2022).
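
Pointer chasing of the kind Heimdall uses makes each load's address depend on the previous load, defeating prefetching so that time per step approximates the raw access latency of whatever tier backs the buffer. The sketch below builds a random cyclic permutation and times the chase; binding the buffer to a specific NUMA node or CXL device (e.g., via numactl) is omitted for brevity, and the buffer size is an assumed example.

```c
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Build a random cyclic permutation: each slot stores the index of the next hop,
 * so every load depends on the previous one and hardware prefetching is defeated. */
static void build_chain(size_t *next, size_t n)
{
    size_t *perm = malloc(n * sizeof *perm);
    if (!perm) exit(1);
    for (size_t i = 0; i < n; i++) perm[i] = i;
    for (size_t i = n - 1; i > 0; i--) {             /* Fisher-Yates shuffle */
        size_t j = (size_t)rand() % (i + 1);
        size_t t = perm[i]; perm[i] = perm[j]; perm[j] = t;
    }
    for (size_t i = 0; i < n; i++)
        next[perm[i]] = perm[(i + 1) % n];
    free(perm);
}

int main(void)
{
    size_t n = (size_t)1 << 24;                      /* 128 MiB of 8-byte slots, well past the LLC */
    size_t *next = malloc(n * sizeof *next);
    if (!next) return 1;
    build_chain(next, n);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    size_t idx = 0;
    for (size_t i = 0; i < n; i++)                   /* one dependent load per iteration */
        idx = next[idx];
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (double)(t1.tv_nsec - t0.tv_nsec);
    printf("avg latency: %.1f ns/access (idx=%zu)\n", ns / (double)n, idx);
    free(next);
    return 0;
}
```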

The body of research on heterogeneous memory system architecture thus encompasses a broad and rapidly advancing set of hardware and software innovations—spanning device integration, contention-aware scheduling, energy and endurance optimization, application-guided data placement, and secure runtime support—all geared toward maximizing the delivered bandwidth, minimizing effective latency, and maintaining programmer productivity across increasingly diversified compute substrates.
