Heterogeneous Memory Management
- Heterogeneous Memory Management is a system-level approach that dynamically places and migrates data across multiple memory types—such as DRAM, NVM, and HBM—based on workload demands and tier characteristics.
- It employs adaptive data placement and migration policies using online profiling, cost-benefit models, and low-overhead monitoring to optimize resource usage and minimize latency.
- Recent evaluations show that HMM frameworks can achieve up to 2.5× performance improvement and significant efficiency gains by reducing resource contention and migration overhead.
Heterogeneous Memory Management (HMM) encompasses the system-level techniques, algorithms, and abstractions required to manage multiple memory types—such as DRAM, non-volatile memory (NVM), high-bandwidth memory (HBM), LPDDR, and CXL-attached memory—present in modern compute platforms. HMM aims to maximize application performance, resource efficiency, and programmer productivity by dynamically, transparently, and adaptively placing and migrating data among these disparate memory tiers according to their latency, bandwidth, and capacity characteristics. The scope includes OS- and runtime-based HMM, hardware-managed tiering, and platform-independent memory-abstraction frameworks.
1. Foundations, Drivers, and Challenges
HMM arises due to increasing heterogeneity in system memory hierarchies. Conventional systems provided one flat DRAM space. Contemporary servers and accelerators increasingly blend low-latency, limited-capacity DRAM with higher-latency/lower-bandwidth, high-capacity NVM (e.g., Optane DC PMem), HBM for bandwidth-constrained workloads, slow capacity-centric LPDDR, or pooled CXL- or FPGA-attached memory (Liu, 2017, Chen et al., 26 Feb 2025). The main challenges include:
- Resource contention and diversity: Mixed workloads exhibit divergent sensitivities to memory bandwidth and latency. Static allocation leads to contention or resource underutilization (Liu, 2017).
- Data placement and migration: Choosing where data resides—fast or slow tier—must match dynamic access patterns, minimize migration overhead, and adapt to phase changes (Nonell et al., 2020, Olson et al., 2021).
- Granularity and profiling: Fine-grained object or page-level profiling is needed; coarse page movement can waste bandwidth or cause false sharing (Kadekodi et al., 26 Oct 2025).
- System noise and monitoring overhead: Profiling must be sufficiently accurate without introducing overhead that perturbs real-time or large-scale applications (Nonell et al., 2020).
Key requirements for effective HMM are transparency, online adaptivity, low-monitoring overhead, and runtime policy flexibility.
2. Core Concepts and Policy Mechanisms
Data Placement and Migration
- HMM policies operate primarily at the object, page, or region granularity. Policies may apply cost/benefit models that weigh the expected benefit of moving data to a fast tier against the cost of migration (e.g., runtime gain per byte vs. bandwidth usage) (Ren et al., 2019, Wu et al., 2017).
- The hotness of data (frequency and recency of access) is typically measured online via hardware events (PEBS, A/D bits, page-faults) and used to drive migration. Bucket-based or histogram-based tracking and smooth exponential decay avoid thrashing and oscillation (Kadekodi et al., 26 Oct 2025, Sha et al., 2022).
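As a concrete illustration, the following minimal C++ sketch shows exponential-decay ("cooling") hotness tracking of the kind described above; the class and constant names (`HotnessTracker`, `kDecay`) are hypothetical and not taken from any cited system.

```cpp
#include <cstdint>
#include <unordered_map>

// Illustrative sketch: smooth exponential decay of per-page access counters.
// Names (HotnessTracker, kDecay) are hypothetical, not from a cited framework.
class HotnessTracker {
public:
    // Record `count` accesses to `page` observed in the current interval.
    void record(uint64_t page, double count) { pending_[page] += count; }

    // At each profiling interval: decay old heat, then fold in new samples.
    // heat = kDecay * heat + (1 - kDecay) * new_count, so hot/cold boundaries
    // follow working-set shifts gradually instead of oscillating.
    void end_interval() {
        for (auto& [page, h] : heat_) h *= kDecay;
        for (auto& [page, count] : pending_)
            heat_[page] += (1.0 - kDecay) * count;
        pending_.clear();
    }

    double heat(uint64_t page) const {
        auto it = heat_.find(page);
        return it == heat_.end() ? 0.0 : it->second;
    }

private:
    static constexpr double kDecay = 0.8;  // closer to 1.0 = smoother, slower
    std::unordered_map<uint64_t, double> heat_;     // smoothed per-page heat
    std::unordered_map<uint64_t, double> pending_;  // samples this interval
};
```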
Allocation and Abstraction
- Vertical and horizontal allocation policies (editor's term) partition memory by cache/bank/region/capacity to align with application sensitivity, mitigating multi-level contention (Liu, 2017).
- Memory abstractions—provided at the OS, runtime, or library level—hide hardware details and present a unified API for allocations across tiers (e.g., "hete" buffers in RIMMS, arena allocators in SICM) (Gener et al., 28 Jul 2025, Olson et al., 2021).
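To make the abstraction concrete, here is a hedged sketch of what a tier-agnostic allocation API might look like; `Tier`, `TieredAllocator`, and the hint semantics are illustrative assumptions, not the actual RIMMS or SICM interfaces.

```cpp
#include <cstddef>
#include <cstdlib>

// Hypothetical tier-aware allocation API in the spirit of SICM arenas or
// RIMMS "hete" buffers; names and signatures are illustrative only.
enum class Tier { Fast, Capacity, Any };  // e.g., HBM/DRAM vs. NVM/CXL

struct TieredAllocator {
    // Allocate `bytes` with a placement *hint*; the runtime remains free to
    // place the data elsewhere and migrate it later, keeping callers
    // hardware-agnostic.
    void* alloc(std::size_t bytes, Tier hint) {
        // A real implementation would select a per-tier arena or pool here;
        // this sketch just falls back to malloc.
        (void)hint;
        return std::malloc(bytes);
    }
    void free(void* p) { std::free(p); }
};

int main() {
    TieredAllocator a;
    // The caller expresses sensitivity, not a physical address range.
    float* weights = static_cast<float*>(a.alloc(1 << 20, Tier::Fast));
    a.free(weights);
}
```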
Profiling and Monitoring
- Robust, low-overhead profiling is enabled through hardware features such as Intel PEBS (Processor Event-Based Sampling), A/D bits in page tables, or OS hooks for working-set tracking (Nonell et al., 2020, Chen et al., 26 Feb 2025, Sha et al., 2022).
- Online, continuous profiling can cluster per-object, per-page, or per-context access statistics, enabling real-time adaptivity without offline runs (Olson et al., 2021, Wu et al., 2017).
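A minimal sketch of configuring such sampling on Linux via `perf_event_open(2)` follows; the raw event encoding (`0x1cd` with an `ldlat` threshold) is Intel- and microarchitecture-specific and shown only as an illustrative placeholder, and the ring-buffer parsing of sampled addresses is omitted for brevity.

```cpp
// Sketch: a precise (PEBS-backed) memory-load sampling event on Linux.
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>
#include <cstdio>

static int perf_event_open(perf_event_attr* attr, pid_t pid, int cpu,
                           int group_fd, unsigned long flags) {
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main() {
    perf_event_attr attr{};
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_RAW;
    attr.config = 0x1cd;         // e.g., MEM_TRANS_RETIRED.LOAD_LATENCY (Intel;
                                 // illustrative, platform-specific)
    attr.config1 = 64;           // load-latency threshold in cycles (ldlat)
    attr.sample_period = 10007;  // sparse sampling keeps overhead low
    attr.sample_type = PERF_SAMPLE_ADDR | PERF_SAMPLE_TID;  // sampled address
    attr.precise_ip = 2;         // request precise (PEBS) samples
    attr.disabled = 1;
    attr.exclude_kernel = 1;

    int fd = perf_event_open(&attr, 0 /*self*/, -1 /*any cpu*/, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }
    // A profiler would now mmap the ring buffer, enable the event, and
    // periodically aggregate sampled addresses into per-page hotness counts.
    close(fd);
}
```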
Migration and Consistency
- Data-migration engines use asynchronous copies, parallel migrations, and atomic remapping to minimize application pause and ensure data consistency (Sha et al., 2022, Ichimura et al., 3 Apr 2026, Chen et al., 26 Feb 2025).
- Device-side hardware tiering (e.g., HeteroMem) achieves ultra-low-latency migration by managing page remapping and hot/cold tracking within the device, transparent to the host CPU/OS (Chen et al., 26 Feb 2025).
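On Linux hosts where the slow tier is exposed as a CPU-less NUMA node (as with Optane PMem or CXL memory), the promotion path of such an engine can be sketched with the `move_pages(2)` syscall; real engines additionally batch and overlap these calls asynchronously to bound application pause. Build with `-lnuma`.

```cpp
#include <numaif.h>   // move_pages, MPOL_MF_MOVE
#include <cstdio>
#include <vector>

// Promote the given page-aligned addresses to `fast_node` (e.g., the local
// DRAM node). Returns the number of pages whose move failed.
int promote_pages(std::vector<void*>& pages, int fast_node) {
    std::vector<int> nodes(pages.size(), fast_node);  // target node per page
    std::vector<int> status(pages.size(), 0);         // per-page result code
    long rc = move_pages(0 /*self*/, pages.size(), pages.data(),
                         nodes.data(), status.data(), MPOL_MF_MOVE);
    if (rc < 0) { perror("move_pages"); return (int)pages.size(); }
    int failed = 0;
    for (int s : status)
        if (s < 0) ++failed;  // e.g., -EBUSY: page pinned or under writeback
    return failed;
}
```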
3. Representative Frameworks and Implementations
| Framework/System | Level | Main Features |
|---|---|---|
| SICM (Olson et al., 2021) | OS/runtime | Arena-based heap, periodic profiling, per-site binpacking, online ski-rental migration |
| Sentinel (Ren et al., 2019) | Runtime (DNN) | Object-based profiling, cost-model, association with DNN topology, proactive migration |
| Jenga (Kadekodi et al., 26 Oct 2025) | OS/lib, allocator | Context-based allocation, smooth hotness tracking, minimal thrashing |
| KLOC (Kannan et al., 2020) | OS (kernel) | Kernel-object semantic contexts, sub-100ms migration, per-context heat/liveness scoring |
| RIMMS (Gener et al., 28 Jul 2025) | Runtime/library | Hardware-agnostic "hete" abstraction, low (1–2 cycle) overhead, automated consistency |
| HMM-V (Sha et al., 2022) | Hypervisor | PML/A-D bits, fast parallel migration, multi-VM DRAM pooling |
| HeteroBox/HeteroMem (Chen et al., 26 Feb 2025) | Device/FPGA | Hardware emulation of tiers, device-managed migration/profiling, host transparency |
| Marionette (Fernandes et al., 6 Nov 2025) | Library (C++) | Compile-time memory context, device/host unified API |
Sentinel, for example, intercepts allocation at the tensor/object level, runs full-instrumentation profiling in a first epoch, then reuses these access statistics to drive a lightweight, knapsack-based optimizer for data placement in subsequent epochs. Semantic association (layer IDs) enables proactive prefetching to hide data-transfer latency (Ren et al., 2019).
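The placement step of such an optimizer can be approximated by a greedy benefit-per-byte (knapsack-style) pass, sketched below under assumed names (`Object`, `plan_placement`); this mirrors the cost model described above but is not Sentinel's actual code.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct Object {
    uint64_t id;
    uint64_t bytes;          // footprint (assumed > 0)
    double   expected_gain;  // predicted runtime saved if placed in fast tier
    bool     in_fast = false;
};

// Greedy knapsack heuristic: rank by benefit density (runtime saved per byte
// of scarce fast memory) and fill the fast tier to capacity.
void plan_placement(std::vector<Object>& objs, uint64_t fast_capacity) {
    std::sort(objs.begin(), objs.end(), [](const Object& a, const Object& b) {
        return a.expected_gain / a.bytes > b.expected_gain / b.bytes;
    });
    uint64_t used = 0;
    for (auto& o : objs) {
        if (used + o.bytes <= fast_capacity) {
            o.in_fast = true;
            used += o.bytes;
        }
    }
}
```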
In the virtualization domain, HMM-V leverages Intel's PML (Page Modification Logging) to efficiently record guest page accesses, uses a temperature quantifier, and migrates the hottest K pages with minimized pause, supporting proactive DRAM pooling across multiple VMs (Sha et al., 2022).
Device-side approaches (HeteroMem) can move the entire tiering decision logic into the CXL-attached device, using count-min-sketches and ping-pong bitmaps for hot/cold detection with <1% overhead, completely transparent to both OS and applications (Chen et al., 26 Feb 2025).
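A software sketch of the count-min-sketch half of such hot/cold detection is given below; the table sizes and hash mixing are illustrative, and hardware implementations would use simpler fixed hash functions.

```cpp
#include <algorithm>
#include <cstdint>
#include <functional>
#include <vector>

// Count-min sketch: fixed memory, O(depth) per-access updates, estimates
// never undercount (the row minimum bounds hash-collision overcounting).
class CountMinSketch {
public:
    CountMinSketch() : table_(kDepth * kWidth, 0) {}

    void add(uint64_t page) {
        for (size_t d = 0; d < kDepth; ++d)
            ++table_[d * kWidth + index(page, d)];
    }

    uint32_t estimate(uint64_t page) const {
        uint32_t m = UINT32_MAX;
        for (size_t d = 0; d < kDepth; ++d)
            m = std::min(m, table_[d * kWidth + index(page, d)]);
        return m;
    }

private:
    static constexpr size_t kDepth = 4, kWidth = 1 << 14;
    size_t index(uint64_t page, size_t d) const {
        // Illustrative hash mix; hardware would use fixed simple hashes.
        return std::hash<uint64_t>{}(page ^ (0x9e3779b97f4a7c15ull * (d + 1)))
               % kWidth;
    }
    std::vector<uint32_t> table_;
};

// A device controller would promote a page once estimate(page) crosses a
// threshold, then periodically reset; keeping two sketches "ping-pong" style
// lets one be cleared while the other keeps counting.
```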
4. Monitoring, Profiling, and Hotness Measurement
High-fidelity, low-cost monitoring mechanisms are a cornerstone of HMM:
- PEBS-based sampling collects sparse but representative memory-access traces at line or page granularity, with tunable sampling rate and buffer depth, enabling accurate hotness estimation at <2.3% average runtime overhead, demonstrated at up to 128k-core scale (Nonell et al., 2020).
- Context-based and semantic tracking: Grouping allocations by call context, object type (weights, activations, gradients), or OS semantics (inode-based KLOCs) allows for finer correlation of access pattern and migration decision, increasing hit-rate and reducing false positives (Kadekodi et al., 26 Oct 2025, Kannan et al., 2020).
- Bucketed decay and smooth cooling: Adaptive smoothing or exponential decay of access counters reduces migration thrashing, allowing hot/cold boundaries to dynamically follow working-set shifts without abrupt oscillations (Kadekodi et al., 26 Oct 2025).
Bucket- or histogram-based policy mechanisms automatically select thresholds that fit the fast tier's capacity and observed access distribution, as in HMM-V's hot-set extraction or Jenga's logarithmic binning (Sha et al., 2022, Kadekodi et al., 26 Oct 2025).
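The threshold-selection step can be sketched as a walk over a logarithmic hotness histogram from hottest to coldest bin, stopping once the fast tier would overflow; the bin layout and names below are illustrative assumptions, not any cited system's exact scheme.

```cpp
#include <array>
#include <cstdint>

constexpr size_t kBins = 16;  // bin b holds pages with heat in [2^b, 2^(b+1))

// counts[b] = number of pages in bin b. Returns the lowest bin whose pages,
// together with all hotter bins, still fit within fast_capacity_pages.
size_t pick_threshold_bin(const std::array<uint64_t, kBins>& counts,
                          uint64_t fast_capacity_pages) {
    uint64_t cumulative = 0;
    for (size_t b = kBins; b-- > 0;) {       // walk from hottest to coldest
        if (cumulative + counts[b] > fast_capacity_pages)
            return b + 1;                     // bin b overflows: cut above it
        cumulative += counts[b];
    }
    return 0;                                 // everything fits the fast tier
}
```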
5. Policy Algorithms and Decision Strategies
Most policies are variants of cost-benefit optimization that weigh per-object/page expected gain against migration cost:
- Greedy/knapsack heuristics: Rank by benefit per unit size and fill the fast tier until capacity (Ren et al., 2019, Wu et al., 2017).
- Break-even/ski-rental models: Compare (i) the cost of keeping data in slow memory (misses, latency) and (ii) the cost of moving it to fast memory (migration bandwidth, remapping), migrating only when the net gain is positive (Olson et al., 2021); a minimal break-even check is sketched after this list.
- Proactive/preemptive migration: Use semantic knowledge (e.g., when a tensor will next be needed by a DNN layer) to prefetch in advance and hide transfer latency, calculated as δ = (size/BW) + ε (Ren et al., 2019).
- Device-side promotion/demotion: In hardware-managed systems (e.g., HeteroMem), use count-min-sketch and ping-pong bitmap-based sampling, triggering migration atomically when page hotness crosses a threshold, with device-managed remapping (Chen et al., 26 Feb 2025).
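The break-even test from the list above reduces to a few lines; the sketch below also expresses the prefetch lead-time formula δ = (size/BW) + ε from the proactive-migration item. All names, units, and parameters are assumptions for illustration.

```cpp
#include <cstdint>

struct TierCosts {
    double slow_penalty_per_access;  // extra latency per slow-tier access (s)
    double migration_cost;           // one-time copy + remap cost (s)
};

// Ski-rental-style check: promote only if the projected slow-tier penalty
// over the remaining residency exceeds the one-time migration cost.
bool should_promote(const TierCosts& c, double expected_accesses) {
    double stay_cost = c.slow_penalty_per_access * expected_accesses;
    return stay_cost > c.migration_cost;  // positive net gain => migrate
}

// Lead time needed to hide a proactive transfer: delta = size/BW + epsilon,
// where epsilon absorbs remapping and scheduling slack.
double prefetch_lead_time(double bytes, double bandwidth_bytes_per_s,
                          double epsilon_s) {
    return bytes / bandwidth_bytes_per_s + epsilon_s;
}
```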
Hybrid approaches (e.g., combining context-based page allocation with smooth hotness tracking) demonstrate superior fast-tier hit rates and reduced migration volume, directly impacting application throughput (Kadekodi et al., 26 Oct 2025).
6. Evaluation Highlights and Practical Impact
Quantitative experiments consistently demonstrate significant performance gains of HMM frameworks over baseline non-tiered or hardware-only caching approaches:
| System / Scenario | Performance Gain |
|---|---|
| Sentinel (Ren et al., 2019) | within 8% of fast-memory-only; +18% over prior software schemes |
| SICM Online (Olson et al., 2021) | 1.8–2.5× over unguided first-touch |
| KLOC (Kannan et al., 2020) | 1.4–4.0× over migration-only baseline |
| Jenga (Kadekodi et al., 26 Oct 2025) | ~28% over 2nd-best tiered manager |
| HeteroMem (Chen et al., 26 Feb 2025) | +5.1–16.2% over CPU-side tiered schemes |
| HMM-V (Sha et al., 2022) | +31–51% over Optane MM/NUMA_B |
Profiling and migration overheads are kept minimal: e.g., ≤3% CPU, <0.3% memory usage for hotness counters, and ≤2% VM slowdown due to page tracking (Kadekodi et al., 26 Oct 2025, Sha et al., 2022). PEBS and hardware-assisted sampling scale to the largest HPC clusters with ≤10% worst-case overhead (Nonell et al., 2020).
Beyond general-purpose scenarios, HMM strategies enable workload classes—such as massive ensemble simulations with CPU+GPU streaming (Ichimura et al., 3 Apr 2026), object-oriented data layout for hardware acceleration (Fernandes et al., 6 Nov 2025), and virtualization with pooled DRAM (Sha et al., 2022)—to run at scales and efficiencies that would be unattainable under a homogeneous memory model.
7. Limitations, Open Challenges, and Future Directions
Among identified limitations:
- Startup/convergence penalty: Online profilers may take several intervals to adapt, resulting in transient suboptimality (Olson et al., 2021).
- Granularity and fragmentation: Monolithic allocation sites and coarse granularity can restrict fine-grained tiering (Olson et al., 2021, Sha et al., 2022).
- Tier scaling and policy extensibility: Many current systems address two-way tiering only; extension to multi-level or pooled/shared memory (CXL, network-attached) introduces new complexity (Olson et al., 2021, Chen et al., 26 Feb 2025).
- Integration with managed runtimes and accelerators: Bridging OS-level tiering with managed runtimes (JVM, .NET) or with heterogeneous accelerators (FPGA, GPU, custom ASIC) requires cross-layer interfaces (Gener et al., 28 Jul 2025, Fernandes et al., 6 Nov 2025).
Emerging work focuses on:
- Proactive migration using dynamic phase detection and machine-learning classifiers (Nonell et al., 2020).
- Hardware/software co-design of tiered memory and page-table management (Hwang et al., 21 Apr 2025).
- Dynamic tier resizing, directory-based coherence for accelerator-rich systems, and automated prefetching (Gener et al., 28 Jul 2025, Kadekodi et al., 26 Oct 2025, Chen et al., 26 Feb 2025).
- Scalability across multi-tenant environments, isolation, and distributed tiered memory over high-speed fabrics or CXL (Chen et al., 26 Feb 2025, Hwang et al., 21 Apr 2025).
A plausible implication is that as system heterogeneity proliferates and bandwidth/latency asymmetries deepen—across CPUs, GPUs, FPGAs, and CXL/PCIe domains—HMM will become essential for both system software and application-level frameworks to achieve efficient, transparent utilization of all memory resources. The field continues to evolve rapidly along dimensions of granularity, adaptivity, abstraction, and cross-layer integration.