Tiered Memory Service Overview
- Tiered Memory Service is a mechanism that orchestrates data movement across different memory tiers, balancing speed, capacity, and cost for optimal system performance.
- It employs adaptive data placement and migration policies using hardware counters and statistical models to promote frequently accessed data to faster tiers while demoting infrequently used data.
- Integration of hardware/software co-design and QoS controls ensures efficient resource allocation, reduced latency, and enhanced performance in multi-tenant and high-performance applications.
A tiered memory service is a system-level or user-level mechanism that orchestrates the placement, migration, and access of data across multiple physical memory tiers, each with distinct performance, capacity, reliability, and cost characteristics. In contrast to hardware caching, a tiered memory service manages page or region remapping between main memory technologies such as DRAM, non-volatile memory (NVM), CXL-attached pooled DRAM, compressed memory, or device-backed far memory, often within a single address space. The core objective is to maximize application performance and system cost-efficiency by dynamically positioning hot (frequently accessed) data in faster, lower-latency tiers while relegating cold data to slower, higher-capacity tiers, typically with adaptive, workload- and hardware-aware policies.
1. Architectural Foundations of Tiered Memory Services
Modern tiered memory services are organized around one or more fast (low-latency, high-bandwidth) memory tiers—commonly DDR4/DDR5 DRAM or HBM co-packaged with the CPU—and one or more slower tiers: e.g., persistent memory (Optane DCPMM), remote DRAM accessed via CXL.mem, compressed memory pools, NVM, or even NVMe SSD in advanced systems.
Key architectural characteristics:
- Address space unification: Applications perceive a single virtual address space, while the memory service transparently manages physical page placement among tiers (Maruf et al., 2022, Kumar et al., 2024).
- NUMA abstraction: Tiers appear as separate NUMA nodes to the kernel; Linux and Windows integrate tiered NUMA-aware memory management in recent versions (Maruf et al., 2022, Zhou et al., 2024).
- Device-level heterogeneity: Tiers can vary by latency (e.g., DRAM 80–100 ns, CXL/Optane 145–400 ns), bandwidth (tens to hundreds of GB/s for DRAM, <36 GB/s for CXL DRAM), and reliability characteristics (Yang et al., 22 Mar 2025, Zhou et al., 2024, Song et al., 2020, Pan et al., 6 Oct 2025).
- Hardware support: Recent platforms leverage hardware counters (PEBS/IBS), device-side profiling (CM-Sketch in NeoProf), and memory controller partitioning for hotness measurement and efficient telemetry (Zhou et al., 2024, Yang et al., 22 Mar 2025).
- Kernel and user-level partitioning: Some mechanisms operate inside the kernel MMU (e.g., TPP, KLOC, ARMS), others purely in user space as LD_PRELOAD/interposed allocators (MaxMem, ARMS), or leverage device offload (NeoMem).
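The tier characteristics above can be made concrete with a minimal sketch: a descriptor per tier carrying the latency/bandwidth/capacity figures from the characterization, and a first-fit-by-latency allocation rule. The class and function names here are illustrative, not any system's API; the numbers are the rough figures cited above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MemoryTier:
    """Illustrative descriptor for one memory tier (names hypothetical)."""
    name: str
    latency_ns: float      # typical load latency
    bandwidth_gbs: float   # peak bandwidth
    capacity_gib: int

# Rough figures matching the characterization above
TIERS = [
    MemoryTier("local_dram", latency_ns=90,  bandwidth_gbs=200, capacity_gib=128),
    MemoryTier("cxl_dram",   latency_ns=250, bandwidth_gbs=30,  capacity_gib=512),
    MemoryTier("nvm",        latency_ns=350, bandwidth_gbs=8,   capacity_gib=1024),
]

def fastest_tier_with_space(tiers, used_gib):
    """Pick the lowest-latency tier that still has free capacity."""
    for tier in sorted(tiers, key=lambda t: t.latency_ns):
        if used_gib.get(tier.name, 0) < tier.capacity_gib:
            return tier
    raise RuntimeError("all tiers full")
```

This mirrors how the kernel treats tiers as NUMA nodes ordered by distance: new allocations prefer the fast node and spill to slower nodes under pressure.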
2. Core Data Management and Migration Policies
Tiered memory services center on efficient, adaptive data placement and migration strategies over multiple tiers:
- Hot/Cold Identification: Hot/cold classification uses a combination of hardware-access counters (PEBS, Count-Min Sketch in NeoProf (Zhou et al., 2024)), page-fault statistics, or page-table Access bit scanning. Approaches range from threshold-based (TPP (Maruf et al., 2022)) to statistical and moving average models (ARMS (Yadalam et al., 6 Aug 2025)).
- Promotion/Demotion Mechanisms:
- Proactive demotion: Idle/cold pages in fast memory are proactively migrated to slow tiers, freeing space and ensuring headroom for hot or newly allocated pages (Maruf et al., 2022, Raybuck et al., 2023).
- On-demand promotion: Hot accesses to slow-tier pages trigger promotion, with NUMA hint faults as a low-overhead trigger in TPP (Maruf et al., 2022), or device-offloaded indication in NeoMem (Zhou et al., 2024).
- Batch migration and throttling: Migrations are batched and throttled with awareness of available bandwidth, sized to the bandwidth budget and scheduled at a fixed periodicity (ARMS (Yadalam et al., 6 Aug 2025)).
- Non-exclusive shadowing: Nomad demonstrates shadow-copy retention that enables fast demotion via PTE updates without full copy/remigration under pressure, sharply reducing thrashing (Xiang et al., 2024).
- Threshold Adaptation and Self-tuning:
- Multi-level moving average: ARMS computes page hotness based on short/long EWMA, dynamically re-ranking and adapting to hot-set changes without static thresholds (Yadalam et al., 6 Aug 2025).
- Parameter tuning/Bayesian optimization: BO discovers optimal hot/cold/migration threshold vectors for specific workloads/hardware combinations, delivering up to 2× performance over default heuristics (Kanellis et al., 25 Apr 2025).
- Migration-friendly filtering: Per-process ping-pong detection in multi-tenant settings halts migration for applications exhibiting pathological demote/promote cycles, limiting migration cost and interference (Cho et al., 14 May 2025).
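The multi-level moving-average approach can be sketched as follows: each page folds its per-epoch access count into a short-horizon and a long-horizon EWMA, and the service re-ranks pages each epoch so that the top of the ranking occupies fast memory, with no static hot/cold threshold. This is a sketch in the spirit of ARMS; the alpha values and the 50/50 score weighting are illustrative assumptions, not ARMS's actual parameters.

```python
class PageHotness:
    """Two-timescale EWMA hotness tracker (parameters illustrative)."""
    def __init__(self, alpha_short=0.5, alpha_long=0.05):
        self.alpha_short = alpha_short
        self.alpha_long = alpha_long
        self.short = 0.0
        self.long = 0.0

    def record_epoch(self, accesses):
        # Fold this epoch's access count into both averages.
        self.short = self.alpha_short * accesses + (1 - self.alpha_short) * self.short
        self.long = self.alpha_long * accesses + (1 - self.alpha_long) * self.long

    def score(self):
        # Weight recency (short) while retaining history (long).
        return 0.5 * self.short + 0.5 * self.long

def rank_for_promotion(pages, fast_capacity):
    """Re-rank pages each epoch; the top `fast_capacity` pages belong in
    fast memory. No static threshold: the hot set adapts automatically."""
    ranked = sorted(pages.items(), key=lambda kv: kv[1].score(), reverse=True)
    return {page for page, _ in ranked[:fast_capacity]}
```

Because the ranking is relative, a shift in the hot set reorders pages within one or two epochs without any threshold retuning.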
3. Service-Level Objectives, Resource Control, and QoS
With the rise of multi-tenant and performance-critical workloads, advanced services introduce explicit service-level objectives (SLOs) and per-application controls:
- QoS and admission control: Mercury implements per-tier page limits and bandwidth throttles at the cgroup level, driven by offline profiled app requirements (minimum local DRAM, minimum CPU fraction) and priorities (Lu et al., 2024).
- Bandwidth and contention awareness: Mercury's bandwidth management module decomposes intra-tier (local DRAM) and inter-tier (CXL) interference, enforcing budgeted resource allocation, live throttling, and real-time feedback adaptation with a 200 ms control loop (Lu et al., 2024).
- Proportional occupancy and QoS guarantees: MaxMem maximizes colocation by maintaining per-application fast-memory miss ratios (FMMR) and fairly redistributing hot DRAM regions to meet user-specified latency targets, with user-space enforcement of migration and occupancy policies (Raybuck et al., 2023).
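A miss-ratio-driven occupancy controller of the kind MaxMem describes can be sketched as a simple feedback loop: each iteration moves a block of fast-tier quota from the application furthest under its FMMR target to the application furthest over it. The dictionary schema, step size, and function name here are illustrative assumptions, not MaxMem's interface.

```python
def rebalance_fast_memory(apps, total_pages, step=64):
    """One iteration of a MaxMem-style occupancy controller (sketch).
    apps: name -> {"fmmr": observed fast-memory miss ratio,
                   "target": SLO miss-ratio target,
                   "quota": current fast-tier page quota}.
    Moves `step` pages of quota from the app furthest under target
    to the app furthest over target."""
    over = max(apps, key=lambda a: apps[a]["fmmr"] - apps[a]["target"])
    under = min(apps, key=lambda a: apps[a]["fmmr"] - apps[a]["target"])
    if apps[over]["fmmr"] <= apps[over]["target"] or apps[under]["quota"] < step:
        return apps  # every app meets its SLO, or nothing left to take
    apps[under]["quota"] -= step
    apps[over]["quota"] += step
    # Fast-tier capacity is conserved across the transfer.
    assert sum(a["quota"] for a in apps.values()) == total_pages
    return apps
```

Run periodically (MaxMem-style controllers operate on millisecond-to-second epochs), this converges toward quotas where each application's miss ratio sits at or below its target, colocating as many tenants as the fast tier allows.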
4. Implementation Paradigms and Hardware/Software Co-design
The realization of a tiered memory service spans a spectrum from pure software to hybrid hardware/software co-design:
- Kernel-level integration: Systems like TPP (Maruf et al., 2022), KLOC (Kannan et al., 2020), and Nomad (Xiang et al., 2024) provide kernel-patch implementations, integrating page promotion/demotion, LRU-based cold detection, and PTE manipulation within the mainline kernel.
- User-space management: MaxMem (Raybuck et al., 2023) and ARMS (Yadalam et al., 6 Aug 2025) intercept allocations/mmap and page faults in user space, leverage userfaultfd for migration, and manage per-process/region metadata transparently—accelerating deployment and policy iteration.
- Hardware-centric profiling: NeoMem (Zhou et al., 2024) offloads per-page hotness tracking to a CM-Sketch-based block in the CXL device, delivering high-resolution and low-overhead profiling visible to the host via MMIO, enabling high-frequency, accurate hot-page promotion decisions.
- Near-memory processing and tiering: Stratum (Pan et al., 6 Oct 2025) combines Mono3D Stackable DRAM (with internal “tiers” due to Z-dimension RC delay) and embedded logic for in situ data placement of expert parameters in MoE models, exploiting knowledge of access probability distributions to optimize tier assignments.
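The Count-Min Sketch that NeoMem offloads to the CXL device is a standard probabilistic counter: each page address hashes into one slot per row, increments on access, and is estimated as the minimum across rows, which can overcount due to collisions but never undercount. A software sketch of the data structure (the width/depth and hash choice here are illustrative, not NeoProf's hardware dimensions):

```python
import hashlib

class CountMinSketch:
    """Count-Min Sketch for page-hotness estimation (dimensions illustrative)."""
    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _slots(self, page_addr):
        # One independent hash per row, derived by salting with the row index.
        for row in range(self.depth):
            digest = hashlib.blake2b(f"{row}:{page_addr}".encode(), digest_size=8)
            yield row, int.from_bytes(digest.digest(), "little") % self.width

    def record_access(self, page_addr):
        for row, col in self._slots(page_addr):
            self.table[row][col] += 1

    def estimate(self, page_addr):
        # Minimum across rows: collisions can only inflate, never deflate.
        return min(self.table[row][col] for row, col in self._slots(page_addr))
```

The appeal for device offload is that the structure is fixed-size and update-only on the access path; the host merely reads estimates (via MMIO in NeoMem's design) when making promotion decisions.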
5. Advanced and Specialized Service Designs
Research extends the basic tiered-memory concept to multiple specialized and emerging domains:
- Multi-tier compressed memory: TierScape (ntier) slices compressed memory into up to six tiers defined by (algorithm, allocator, backing media), couples this with ILP-driven optimal placement policies, and achieves TCO savings far beyond two-tier systems (Kumar et al., 2024).
- Virtualized environments: Guest Physical Address Consolidation (GPAC) leverages nested page-table control in VMs to coalesce hot but spatially scattered subpages, minimizing host DRAM footprint and improving multi-VM performance by 50–70% in isolation, 10–13% at scale (Prakash et al., 6 Jun 2025).
- LLM and retrieval-augmented architectures: ShardMemo (Zhao et al., 29 Jan 2026) and FaTRQ (Zhang et al., 15 Jan 2026) illustrate multi-tier memory in agentic LLM systems—tiering state, long-term sharded evidence, and skill artifacts, as well as deploying CXL-attached residual vector refinement for fast, budgeted ANN.
- Fine-grained hardware segmentation: MNEME demonstrates combined inter- and intra-memory asymmetry exploitation via segmented bitline DRAM/PCM, enabling multiple sub-tiers within each memory technology and lowering both performance and reliability costs (Song et al., 2020).
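Both TierScape's ILP-driven placement and Stratum's probability-driven expert placement reduce to assigning objects to tiers by expected access benefit under capacity constraints. A greedy approximation conveys the shape of the problem (this is a sketch, not either system's actual solver, and the argument shapes are hypothetical):

```python
def greedy_placement(objects, tiers):
    """Greedy tier assignment: hottest objects fill the fastest tiers first.
    objects: {name: access probability or frequency}.
    tiers: ordered list of (tier_name, capacity_in_objects), fastest first."""
    placement = {}
    remaining = dict(tiers)
    tier_order = [name for name, _ in tiers]
    for obj in sorted(objects, key=objects.get, reverse=True):
        for tier in tier_order:
            if remaining[tier] > 0:
                placement[obj] = tier
                remaining[tier] -= 1
                break
    return placement
```

An ILP formulation (as in TierScape) additionally weighs per-tier compression ratio and migration cost, so the optimum can diverge from greedy when object sizes and tier costs vary; the greedy form is only the intuition.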
6. Performance Outcomes, Design Trade-offs, and Open Challenges
Comprehensive evaluation across real systems and production workloads has demonstrated major gains from tiered-memory services:
- Performance and TCO: ntier improves memory TCO savings by 22–40 percentage points over two-tier systems at iso-performance; MaxMem improves throughput by 11–38% versus alternatives; Mercury yields up to 53% throughput improvement with strict SLO adherence (Kumar et al., 2024, Raybuck et al., 2023, Lu et al., 2024).
- Tail latency: MaxMem, Mercury, and TPP all demonstrate large reductions (10× in some cases) in p99 latency for key-value and multi-tenant DB workloads (Raybuck et al., 2023, Lu et al., 2024, Maruf et al., 2022).
- Overhead: Well-designed tiered services consistently keep CPU utilization for background policy logic to <3% of a single core, and per-page metadata overhead to <0.3% of system memory (Jenga (Kadekodi et al., 26 Oct 2025), ARMS (Yadalam et al., 6 Aug 2025)).
- Migration efficiency and thrashing control: Nomad and ARMS demonstrate up to 6× performance improvement in heavy-pressure (thrashing-prone) regimes versus synchronous migration approaches or static threshold systems (Xiang et al., 2024, Yadalam et al., 6 Aug 2025).
- Scalability and co-location: MaxMem and Mercury extend to dozens of applications, supporting transparent adaptation to hot set changes and bandwidth contention (Raybuck et al., 2023, Lu et al., 2024).
Open challenges remain:
- Fully knob-free self-tuning that matches or exceeds workload-specific Bayesian optimization (Kanellis et al., 25 Apr 2025, Yadalam et al., 6 Aug 2025).
- Hardware/OS co-design for robust multi-device, multi-tier CXL and NVM handling (e.g., aggregating per-DIMM profiling, handling huge pages (Zhou et al., 2024)).
- Fine-grained or sub-page granularity migration to better utilize expensive near-memory, especially in highly-skewed or irregular access patterns (as observed in virtualized GPAC (Prakash et al., 6 Jun 2025) and context allocators (Kadekodi et al., 26 Oct 2025)).
- Generalizing tiered policies beyond page granularity to encompass kernel objects, device buffers, or multi-application “contexts” as in KLOC (Kannan et al., 2020).
7. Design Guidelines and Best Practices
Synthesis across systems yields the following best practices:
- Leverage lightweight kernel machinery (LRU, NUMA, hint faults) for core cold/hot detection (Maruf et al., 2022, Kannan et al., 2020); offload expensive sampling or counting to device logic where possible (Zhou et al., 2024).
- Implement migration policies that combine multi-epoch history and recency, reject static thresholds, and incorporate migration cost–benefit analysis (Yadalam et al., 6 Aug 2025, Xiang et al., 2024).
- Provision per-application or per-tenant SLO/occupancy controls, enabling fair resource division and predictable performance (Lu et al., 2024, Raybuck et al., 2023).
- Enable migration throttling and per-process controls to avoid negative interference and ping-ponging, especially in mixed or multi-tenant environments (Cho et al., 14 May 2025, Xiang et al., 2024).
- In virtualized systems, coalesce hot subpages in the guest to improve near-memory utilization with no host changes (Prakash et al., 6 Jun 2025).
- Optimize background policy and migration/batching to minimize both compute and bandwidth overheads—periodic background threads at 10–500 ms intervals are common (Yadalam et al., 6 Aug 2025, Kannan et al., 2020, Kadekodi et al., 26 Oct 2025).
- Adopt increasingly automated parameter tuning and, where possible, move toward knob-free, adaptive tiering mechanisms (Yadalam et al., 6 Aug 2025, Kanellis et al., 25 Apr 2025).
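The cost-benefit and anti-ping-pong guidelines above combine naturally into a single admission check applied before each promotion. The sketch below is an illustrative distillation of those practices, not any cited system's policy; the latency figures, window semantics, and ping-pong limit are assumed parameters.

```python
def should_migrate(page_stats, copy_cost_ns, expected_accesses,
                   fast_latency_ns=90, slow_latency_ns=300,
                   ping_pong_limit=3):
    """Gate a promotion on (1) a ping-pong filter and (2) a cost-benefit
    test: projected latency savings must repay the page-copy cost.
    page_stats: {"recent_migrations": promote/demote flips in the
    current window}. All thresholds are illustrative."""
    # Refuse pages that keep bouncing between tiers.
    if page_stats["recent_migrations"] >= ping_pong_limit:
        return False
    # Promote only if expected savings exceed the one-time copy cost.
    savings = expected_accesses * (slow_latency_ns - fast_latency_ns)
    return savings > copy_cost_ns
```

The same gate, with the inequality reversed on demotion, keeps migration traffic bounded in mixed tenancies where one application's thrashing would otherwise consume the shared migration bandwidth budget.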
For a comprehensive technical profile, see TPP (Maruf et al., 2022), ARMS (Yadalam et al., 6 Aug 2025), Mercury (Lu et al., 2024), and NeoMem (Zhou et al., 2024), which collectively encompass leading practices in kernel, user-space, hardware-offload, and QoS-centric tiered memory service design.