Efficient Page-Based Management

Updated 14 May 2026

Efficient page-based management is a framework for dynamic allocation, tracking, migration, and reclamation of memory pages in systems with fluctuating capacities.
It combines classic paging algorithms with advanced translation, eviction, and cache coordination methods to achieve high throughput and low latency in multi-tiered and distributed environments.
The research emphasizes practical design insights and policy decoupling strategies that optimize resource utilization in databases, LLM inference, and accelerator-driven workloads.

Efficient page-based management refers to a suite of data structures, algorithms, and system-level policies for allocating, tracking, migrating, and reclaiming pages in memory hierarchies, storage devices, distributed systems, and high-level application runtimes. It aims to maximize throughput, minimize latency, bound fragmentation, and adapt to diverse workloads and hardware configurations. Research in this area ranges from online paging algorithms with dynamic memory capacities, through translation infrastructure in buffer managers and user-space page management for out-of-core workloads, to domain-specific problems such as KV-cache management for LLM inference. This article synthesizes key methods, analytical frameworks, and experimental results from the recent literature, centering on principles, algorithmic techniques, performance trade-offs, and practical guidelines.

1. Dynamic-Capacity and Multi-Tier Paging Models

Traditional paging models assume a fixed-capacity buffer or main memory, but in modern systems (cloud, multi-core, power-gated caches) capacity may fluctuate dynamically. The model formalized by Peserico (2013) extends the classical setting:

Time is discrete; each step is a page request, a "growth" (+, increase capacity by 1), or "shrink" (–, decrease by 1, potentially evicting a page).
The induced page-request sequence and capacity sequence $(T, m_1, m_2,\ldots)$ govern eviction/fault events.
The dynamic $(h,k)$-competitive ratio compares an online algorithm (ALG, capacity $k$) to an offline OPT with per-step capacity $\lfloor (h/k)m_i \rfloor$.
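The event model above can be illustrated with a small simulation. The following is a minimal sketch, assuming an event encoding in which "+" and "-" denote growth and shrink steps and any other value is a page request; the function name and encoding are illustrative choices, not taken from Peserico's paper.

```python
from collections import OrderedDict


def simulate_lru(events, initial_capacity=0):
    """Run LRU over a mixed sequence of page requests and capacity changes.

    events: iterable in which "+" grows capacity by 1, "-" shrinks it by 1,
    and any other value is treated as a page request (assumed encoding).
    Returns the number of page faults.
    """
    capacity = initial_capacity
    cache = OrderedDict()  # keys are resident pages, least recently used first
    faults = 0
    for ev in events:
        if ev == "+":
            capacity += 1
        elif ev == "-":
            if capacity == 0:
                raise ValueError("capacity cannot drop below zero")
            capacity -= 1
            # A shrink may force an eviction if the cache is now over capacity.
            while len(cache) > capacity:
                cache.popitem(last=False)
        elif ev in cache:
            cache.move_to_end(ev)  # hit: mark as most recently used
        else:
            faults += 1
            if capacity > 0:
                while len(cache) >= capacity:
                    cache.popitem(last=False)
                cache[ev] = True
    return faults


trace = ["+", "+", "a", "b", "a", "-", "b", "+", "b", "c", "a"]
print(simulate_lru(trace))  # 5
```

In this trace the shrink step evicts page b, so the following request to b faults even though it would have hit under a fixed capacity of 2.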

Notably, classic paging algorithms (LRU, FIFO, MARK, CLOCK, RAND) maintain near-optimal competitive ratios even when capacity varies adversarially, matching

$P_L(h,k) = \max_{1 \le k' \le k} \left\{ \frac{k'}{ k' - \lfloor (h/k)k' \rfloor } \right\}$

The worst-case ratio for static capacities, $k/(k-h+1)$, is closely approximated by $P_L(h,k)$ for realistic parameters, and classic marking and conservative algorithms achieve this bound.

For systems with multi-level or tiered memory (e.g., DRAM-NVM-SCM), page management schemes have adapted classic algorithms. The N-level Aging algorithm (Oren, 2017) extends two-level recency counters to map demoted pages to appropriate levels using a leading-zero-based target calculation, maintaining k-bit utilization per page. Simulation shows up to N× improvement in hit ratio compared to conventional two-level aging in random-access workloads.
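As a rough illustration of how a leading-zero count can select a target tier, the sketch below uses the textbook aging counter (shift right each tick, reference bit in the top position). The mapping from leading zeros to a level, the 8-bit counter width, and all function names are assumptions made for this example; the exact rule used by the N-level Aging algorithm is not specified above.

```python
def age(counter, referenced, bits=8):
    """One aging tick: shift right and place the reference bit in the top position."""
    top = (1 << (bits - 1)) if referenced else 0
    return (counter >> 1) | top


def leading_zeros(counter, bits=8):
    """Number of leading zero bits in a bits-wide counter."""
    return bits - counter.bit_length()


def target_level(counter, levels, bits=8):
    """Hypothetical mapping: more leading zeros means a colder page and a lower tier.

    Level 0 is the fastest tier; the result is clamped to the last level.
    """
    return min(levels - 1, leading_zeros(counter, bits))


c = age(0, referenced=True)        # just referenced
print(target_level(c, levels=3))   # 0
c = age(c, referenced=False)       # idle for one tick
print(target_level(c, levels=3))   # 1
```

A page that stays idle accumulates leading zeros and drifts toward slower tiers, while a single reference returns it to level 0.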

2. Buffer Pool Translation and Array-Based Techniques

Modern buffer managers must map logical page IDs (PIDs) to buffer frames efficiently for both OLTP (B-tree, point lookup) and OLAP/graph/vector workloads (scan, high-fanout traversal). CALICO (Zhou et al., 1 Apr 2026) demonstrates that array-based translation can match or exceed performance of hash-based and OS-page-table-backed buffers by:

Employing a multi-level translation: a small upper-level index (trie, radix tree, hash table) routes the "prefix" of a hierarchical PID to a last-level translation array; the "suffix" indexes into the array.
Using an 8-byte TranslationEntry per page (frame_id/version/state), enabling lock-free operations and fine-grained pinning/unpinning.
Leveraging a thread-local path cache to exploit spatial locality in translation, reducing the average access cost to $E[T] = (1-\alpha)T_{array} + \alpha (T_{array} + T_{up}) = T_{array} + \alpha T_{up}$ for prefix switch rate $\alpha$.
Reclaiming cold translation arrays by "hole punching": tracking live-entry counts at OS-page granularity and invoking madvise(MADV_DONTNEED) when counts reach zero.
Enabling efficient batch prefetch for vector/graph workloads, issuing parallel software/hardware prefetches and a single async I/O for all non-residents.
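The prefix/suffix translation and the packed entry can be sketched as follows. This is a single-threaded illustration, not CALICO's implementation: the 40/16/8 bit split, the state codes, the class and method names, and the use of a dict for the upper level are all assumptions, and the lock-free update protocol is not modeled.

```python
# Assumed split of the 8-byte entry; the actual field widths are not given above.
FRAME_BITS, VERSION_BITS, STATE_BITS = 40, 16, 8
EMPTY, RESIDENT = 0, 1  # hypothetical state codes


def pack_entry(frame_id, version, state):
    """Pack (frame_id, version, state) into one 64-bit integer."""
    assert 0 <= frame_id < (1 << FRAME_BITS)
    assert 0 <= version < (1 << VERSION_BITS)
    assert 0 <= state < (1 << STATE_BITS)
    return (frame_id << (VERSION_BITS + STATE_BITS)) | (version << STATE_BITS) | state


def unpack_entry(entry):
    """Inverse of pack_entry."""
    state = entry & ((1 << STATE_BITS) - 1)
    version = (entry >> STATE_BITS) & ((1 << VERSION_BITS) - 1)
    frame_id = entry >> (VERSION_BITS + STATE_BITS)
    return frame_id, version, state


class Translator:
    """Two-level PID translation with a one-entry path cache."""

    def __init__(self, suffix_bits=10):
        self.suffix_bits = suffix_bits
        self.mask = (1 << suffix_bits) - 1
        self.upper = {}            # prefix -> last-level array (stand-in for a trie or radix tree)
        self.cached_prefix = None  # would be thread-local in a real system
        self.cached_array = None
        self.upper_lookups = 0     # counts path-cache misses (prefix switches)

    def _array_for(self, pid, create=False):
        prefix = pid >> self.suffix_bits
        if prefix != self.cached_prefix:
            self.upper_lookups += 1
            array = self.upper.get(prefix)
            if array is None:
                if not create:
                    return None
                array = [pack_entry(0, 0, EMPTY)] * (1 << self.suffix_bits)
                self.upper[prefix] = array
            self.cached_prefix, self.cached_array = prefix, array
        return self.cached_array

    def map(self, pid, frame_id, version=0):
        array = self._array_for(pid, create=True)
        array[pid & self.mask] = pack_entry(frame_id, version, RESIDENT)

    def lookup(self, pid):
        """Return the frame id for pid, or None if the page is not resident."""
        array = self._array_for(pid)
        if array is None:
            return None
        frame_id, _version, state = unpack_entry(array[pid & self.mask])
        return frame_id if state == RESIDENT else None
```

Consecutive lookups that share a prefix touch only the last-level array, which is the case the $T_{array}$ term in the cost expression describes; `upper_lookups` counts the prefix switches that add $T_{up}$.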

In large-scale experiments, CALICO achieves up to 3.9× in-memory and 6.5× out-of-memory speedup for vector search in PostgreSQL, matches or exceeds LeanStore and standard vmcache in B-tree and scan workloads, and reduces translation-memory overhead to near-optimal for sparse PID spaces.

3. Advanced Page Replacement Algorithms and Machine-Driven Eviction

The EEvA family (Demin et al., 2024) introduces expert-based buffer page replacement algorithms, interpolating between fast but pattern-blind heuristics (LRU, FIFO) and adaptive but costly ML oracles:

Each page (primary expert) and table (support expert) maintains an exponentially-weighted reward; table-level initialization enables "warm starts" for reloaded pages.
The eviction strategy uses mirror descent and exponential-weights to probabilistically sample which page to evict, balancing hit-rate regret and time cost across workload types.
Deterministic, table-aware, and averaged randomized variants provide design flexibility.
In synthetic and TPC-C benchmarks, EEvA (especially the sequential "clock" variant) consistently outperforms both LRU and TinyLFU, with up to 40% more hits and 24% higher TPS in read-heavy scenarios, without degradation in write-dominated workloads.
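To make the expert-weighting idea concrete, the sketch below samples an eviction victim with plain exponential weights over per-page rewards and warm-starts new pages from a per-table reward. It is a generic illustration under assumed update rules (the decay factor, learning rate `eta`, and reward increments are arbitrary), not the EEvA algorithm itself.

```python
import math
import random


class ExpWeightsEvictor:
    """Toy eviction policy: pages with lower reward are more likely to be evicted."""

    def __init__(self, eta=0.5, decay=0.9, rng=None):
        self.eta = eta          # assumed learning rate
        self.decay = decay      # assumed exponential-weighting factor
        self.rng = rng or random.Random()
        self.page_reward = {}   # primary experts: one reward per resident page
        self.table_reward = {}  # support experts: one reward per table
        self.table_of = {}

    def admit(self, page, table):
        # Warm start: a newly loaded page inherits its table's current reward.
        self.table_of[page] = table
        self.page_reward[page] = self.table_reward.get(table, 0.0)

    def hit(self, page):
        table = self.table_of[page]
        self.page_reward[page] = self.decay * self.page_reward[page] + 1.0
        old = self.table_reward.get(table, 0.0)
        self.table_reward[table] = self.decay * old + (1.0 - self.decay)

    def eviction_probabilities(self):
        if not self.page_reward:
            raise ValueError("no resident pages")
        lowest = min(self.page_reward.values())
        # Subtracting the minimum keeps the exponentials numerically tame.
        weights = {p: math.exp(-self.eta * (r - lowest))
                   for p, r in self.page_reward.items()}
        total = sum(weights.values())
        return {p: w / total for p, w in weights.items()}

    def evict(self):
        probs = self.eviction_probabilities()
        pages = list(probs)
        victim = self.rng.choices(pages, weights=[probs[p] for p in pages])[0]
        del self.page_reward[victim]
        del self.table_of[victim]
        return victim
```

Sampling rather than always evicting the lowest-reward page is what gives the randomized variants their regret-style behavior; a deterministic variant would simply take the argmin.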

4. Page Management in Non-Relational and Accelerator-Oriented Workloads

In LLM inference, the bottleneck is dynamic, high-volume growth and sharing of key-value (KV) caches on GPU memory.

PagedAttention/vLLM (Kwon et al., 2023) splits each sequence's KV cache into fixed-size pages, managed by a virtual page table (block table), with on-demand allocation and copy-on-write sharing, reducing typical fragmentation from 60–80% to <1% and achieving 2–4× throughput improvement.
vTensor (Xu et al., 2024) extends this model by decoupling memory scheduling (CPU, CUDA VMM) from computation (unmodified attention/Triton/FlashAttention kernels): it reserves virtual address space up front and maps only the physical pages that are needed. This frees an average of 71.25% (57 GB) of GPU memory versus vLLM's paged-attention and improves end-to-end throughput by up to 1.86× in multi-turn chat.
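The block-table mechanism can be sketched on the host side as below. This is a bookkeeping-only illustration, assuming a reference-counted free list and a block size of 16 tokens by default; the class and method names are hypothetical and do not correspond to vLLM's API, and no KV data is actually stored or copied.

```python
class BlockAllocator:
    """Free list of fixed-size physical blocks with reference counts."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.refcount = {}

    def allocate(self):
        if not self.free:
            raise MemoryError("out of KV-cache blocks")
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def share(self, block):
        self.refcount[block] += 1

    def release(self, block):
        self.refcount[block] -= 1
        if self.refcount[block] == 0:
            del self.refcount[block]
            self.free.append(block)


class Sequence:
    """One sequence's block table: logical block index -> physical block id."""

    def __init__(self, allocator, block_size=16):
        self.allocator = allocator
        self.block_size = block_size
        self.block_table = []
        self.length = 0

    def append_token(self):
        """Reserve a slot for one more token; returns (physical_block, offset)."""
        offset = self.length % self.block_size
        if offset == 0:
            # On-demand allocation: a new block only when the last one is full.
            self.block_table.append(self.allocator.allocate())
        else:
            last = self.block_table[-1]
            if self.allocator.refcount[last] > 1:
                # Copy-on-write: the shared block gets a private replacement.
                # A real system would copy the block's KV data at this point.
                fresh = self.allocator.allocate()
                self.allocator.release(last)
                self.block_table[-1] = fresh
        self.length += 1
        return self.block_table[-1], offset

    def fork(self):
        """Create a child that shares all existing blocks with this sequence."""
        child = Sequence(self.allocator, self.block_size)
        child.block_table = list(self.block_table)
        child.length = self.length
        for block in child.block_table:
            self.allocator.share(block)
        return child
```

Because blocks are allocated only when the previous one fills, at most one partially filled block per sequence is wasted, which is the source of the low fragmentation described above.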

Key design insights include reserving full virtual space up-front, lazy page mapping, CPU-driven page-table updates, asynchronous prefetch/extension during GPU compute phases, and defragmentation via radix-tree–based prefix reuse.

5. Layered and Distributed Page-Cache Mechanisms

Efficient page-based management also encompasses distributed and tiered environments:

DPC (Bergman et al., 21 Apr 2026), a distributed page cache over CXL 3.0, unifies the aggregate cluster DRAM budget by enforcing a single-copy invariant: each file page has exactly one local owner, and remote nodes access via CXL-based mapping rather than replication. A global directory manages page-ownership and coherency, with batched invalidation to amortize per-page overhead. Empirically, DPC delivers up to 12.4× speedups and a 5.6× geometric-mean throughput increase across representative distributed data and model-serving workloads.
Nomad (Xiang et al., 2024) introduces non-exclusive memory tiering with transactional page migration and page shadowing, mitigating thrashing in fast memory under pressure by retaining shadow copies in slow memory. Transactional migration is decoupled from the critical path, and remapping to slow memory is performed without copying if the master is clean. In micro- and macro-benchmarks, Nomad achieves up to 6× better bandwidth than exclusive migration under high pressure.
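The benefit of page shadowing can be seen in a toy model that counts page copies. The sketch below is a simplification under stated assumptions: it tracks only which tier holds a copy and whether the fast-tier master is dirty, it assumes the shadow is dropped on the first write to the master, and it does not model the transactional migration protocol. All names are illustrative.

```python
class NonExclusiveTiering:
    """Toy two-tier model in which promoted pages keep a shadow in slow memory."""

    def __init__(self, slow_pages=()):
        self.fast = {}                 # page -> dirty flag for the fast-tier master
        self.slow = set(slow_pages)    # pages with a valid copy in slow memory
        self.copies = 0                # number of page copies performed

    def promote(self, page):
        """Copy a page into fast memory and keep the slow copy as a shadow."""
        if page not in self.slow:
            raise KeyError(page)
        self.fast[page] = False
        self.copies += 1

    def write(self, page):
        """A write to the master makes the shadow stale, so it is dropped (assumed)."""
        if page in self.fast:
            self.fast[page] = True
            self.slow.discard(page)

    def demote(self, page):
        """Move a page out of fast memory, copying only if no valid shadow exists."""
        dirty = self.fast.pop(page)
        if dirty or page not in self.slow:
            self.slow.add(page)
            self.copies += 1
        # Otherwise the master is clean and a shadow exists: remap without copying.


tiers = NonExclusiveTiering(slow_pages=["a", "b"])
tiers.promote("a")
tiers.promote("b")
tiers.write("b")
tiers.demote("a")    # clean, shadow present: no copy
tiers.demote("b")    # dirty: must be copied back
print(tiers.copies)  # 3
```

Under memory pressure with mostly clean pages, demotions become remap-only operations, which is how retaining shadow copies reduces the cost of thrashing.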

For virtualization and huge-page workloads, FHPM (Li et al., 2023) introduces companion-page redirection to track access at fine granularity, allowing base-page–level hot/cold detection, dynamic split/collapse with proactive EPT refills (60% VM-exit reduction), and up to 33–61% performance gain in tiered DRAM/PMem setups compared to pure huge-page or base-page management.

6. Practical Considerations, Algorithmic Tuning, and Policy Decoupling

Several best practices and open challenges have emerged across these lines of work:

In dynamic-capacity paging (Peserico, 2013), replacement and capacity allocation can be decoupled: one may fix the per-VM/CPU/core allocation curve and still use LRU or equivalent for selection with optimality guarantees.
For log-structured and flash-based storage, page-differential logging (Kim et al., 2010) and cost-based log cleaning (Lomet et al., 2020) use algebraic cost models and per-segment prioritization (minimizing $|dc_i/du|$ via live/empty page fractions and update rates) to reduce write amplification, extend device lifetime, and minimize background overhead.
move_pages2 (Rayhan et al., 22 Mar 2025) exposes batch size and migration mode as user-tunable parameters for database-driven control of NUMA/page-tier migration rates, yielding up to 2.09× faster migration and 1.84× higher throughput. Batching, which amortizes TLB shootdowns across many pages, is shown to be the dominant lever for practical performance; per-page error logging and flexible asynchronous/synchronous modes allow adaptation to varied workloads.
UMap (Peng et al., 2019) and similar userfaultfd/user-space approaches provide explicit interfaces for per-region page size choice, prefetch/readahead degree, and user-defined eviction callbacks, achieving 1.25–2.5× speedups over kernel mmap in storage-/network-bound HPC workloads.
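The effect of batching on migration cost can be illustrated with a back-of-the-envelope model in which each batch pays one fixed shootdown cost and each page pays one copy cost. The function and the cost values are hypothetical and are not measurements from the work cited above.

```python
import math


def migration_cost(num_pages, batch_size, copy_cost, shootdown_cost):
    """Total cost when each batch triggers one TLB shootdown (assumed model)."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batches = math.ceil(num_pages / batch_size)
    return num_pages * copy_cost + batches * shootdown_cost


# Arbitrary units: a shootdown is assumed to cost 10x a single page copy.
print(migration_cost(1024, 1, 1.0, 10.0))   # 11264.0
print(migration_cost(1024, 64, 1.0, 10.0))  # 1184.0
```

In this model the per-page copy cost is a floor that batching cannot reduce, so returns diminish once the shootdown term becomes small relative to the copy term.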

A consistent theme in these advances is separating the logical policy (which pages to keep, which to migrate, when to split/collapse, how to batch operations) from mechanism (memory and translation data structures, OS or accelerator APIs, migration protocols), enabling domain-specific and workload-driven optimization with minimal loss of generality or efficiency.

7. Future Directions and Theoretical Implications

Open research questions and directions include:

Empirical profiling and modeling of real-world memory/capacity fluctuation patterns in multitenant clouds, power-managed multi-core, and emerging accelerator systems (Peserico, 2013).
Extending competitive analysis for dynamic resources to scheduling, admission control, and other online problems (Peserico, 2013).
Characterizing and minimizing coherence/shootdown/locking bottlenecks in distributed and NUMA environments via fine-grained tracking and ownership models (Gao et al., 2024, Bergman et al., 21 Apr 2026).
Automating the tuning of controller parameters (batch size, replacement intervals, migration thresholds) based on measured cost and throughput models (Rayhan et al., 22 Mar 2025).
Integrating page-based principles with new hardware capabilities: on-chip page-table caches, CXL "region controllers," and fine-grained accelerator-side VMM (Zhou et al., 1 Apr 2026, Xu et al., 2024).
Generalizing hybrid page-based management to N-way memory hierarchies (SCM, tiered CXL, disaggregated pools) and workload-specific buffer or activation caches (Oren, 2017, Xu et al., 2024).
Refining algorithm selection using workload-driven access graphs or locality models to distinguish between marking/conservative/online randomized policies (Peserico, 2013).

These problems remain central to advancing system throughput, responsiveness, and resource efficiency in contemporary and future page-managed environments.