
Paged KvCache Strategy

Updated 5 November 2025
  • Paged KvCache strategy is a method that partitions the key-value cache into fixed-size pages to manage memory efficiently and support long-context LLM inference.
  • It employs operating system paging concepts like dynamic page allocation, lock-free management, and block-wise eviction to drastically reduce memory fragmentation.
  • Empirical benchmarks show reductions in memory wastage from 60–80% to below 5% and throughput improvements up to 5.18×, enabling scalable concurrent serving.

A paged KvCache strategy refers to the systematic partitioning and management of the key-value (KV) cache in LLMs into fixed-size pages, blocks, or segments for optimized memory allocation, retrieval, eviction, and transfer throughout the model inference and serving pipeline. This strategy enables modular, dynamic, and resource-efficient operation for long-context inference, concurrent serving, and large-scale distributed model deployments. Paged KvCache solutions span in-GPU memory layout, inter-device transfers, tiered storage hierarchies, advanced eviction/compression algorithms, and workload-aware scheduling. These approaches are pivotal in containing the KV cache's linear growth with sequence length and batch size, and in meeting diverse throughput, latency, and memory-constraint requirements in both research and production systems.

1. Foundations of Paged KvCache: Concepts and Mechanisms

Paged KvCache management fundamentally draws from operating system concepts of paged virtual memory. The KV cache, comprising the persistent key and value tensors for self-attention layers, is divided into fixed-size pages (or blocks), each storing K/V tuples for a segment of token positions. Each request or sequence is associated with a page/block table mapping logical token offsets to physical pages in memory. Key mechanisms include:

  • Dynamic page/block allocation and deallocation: Memory is provisioned only as needed per sequence/request, eliminating the need for large monolithic preallocations and reducing both internal and external fragmentation to below 5% in practice (Joshi et al., 8 Jun 2025).
  • Page/block table indirection: Enables non-contiguous physical memory layout of logical KV cache, supporting prefix sharing and efficient reuse of common prompt blocks across requests.
  • Concurrent lock-free management: Allocation and reclamation rely on efficient free lists or bump-pointer allocators stored in device global memory, allowing constant-time per-batch operations and high concurrency.

Pseudocode for page allocation and mapping (following the scheme used in vLLM/FMS):

# P is the page size in tokens, F is the global free list of physical block ids,
# and K/V are flat physical buffers indexed by block_id * P + offset.

def reserve(seq_id, length):
    n = (length + P - 1) // P                # ceil(length / P) pages needed
    blocks = [F.pop() for _ in range(n)]     # obtain n blocks from the free list
    page_table[seq_id] = blocks              # logical-to-physical block mapping

def assign(seq_id, positions, K_new, V_new):
    for t in positions:
        b = t // P                           # logical block index
        o = t % P                            # offset within the block
        p = page_table[seq_id][b] * P + o    # physical slot in the KV buffers
        K[p] = K_new[t]
        V[p] = V_new[t]

This block-based abstraction is orthogonal to the underlying model architecture—MHA, GQA, MQA, or custom attention variants—making it widely deployable (Joshi et al., 8 Jun 2025, Rehg, 30 Sep 2024).

2. Paged KvCache in Inference: Memory Efficiency, Fragmentation, and Throughput

Paged KvCache strategies directly address the memory-inefficiency and fragmentation issues endemic to monolithic KV preallocation. Production settings with variable-length or concurrent requests typically waste 60–80% of KV memory under monolithic allocation; paged designs reduce this wastage to less than 5% (Joshi et al., 8 Jun 2025). Empirical studies report:

Metric                    | Monolithic KV    | Paged KvCache
Internal fragmentation    | 60–80%           | < 5%
Prefix sharing            | None             | Supported
Allocation latency        | Linear/global    | Constant/lock-free
Throughput (tokens/sec)   | Baseline         | Up to 5.18× (vLLM + KV-Compress) (Rehg, 30 Sep 2024)

Paged retrieval enables batched, long-context inference to be served with high GPU utilization on commodity hardware (24–80 GB VRAM). It supports variable-length, multi-turn, and concurrent-request deployment scenarios without statically over-allocating GPU memory per job, and the page-table indirection also enables prefix sharing across requests, as sketched below.
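Because the page table decouples logical from physical layout, prefix sharing amounts to mapping identical prompt pages to the same physical block under a content hash with reference counting (cf. the hash-chained blocks discussed in Section 5). The sketch below is illustrative only: the hashing scheme, the refcounting, and the reuse of the assumed names F and page_table from the pseudocode above are not a specific system's implementation.

import hashlib

block_by_hash = {}    # prefix-chained content hash -> physical block id
refcount = {}         # physical block id -> number of sequences mapping it

def map_prompt_page(seq_id, page_tokens, prev_hash=""):
    """Map one page of prompt tokens, sharing the physical page with any identical prefix.

    The hash chains in the previous page's hash because the K/V of a page depends on
    the entire preceding context, not just the tokens inside the page itself.
    """
    h = hashlib.sha256((prev_hash + repr(page_tokens)).encode()).hexdigest()
    block = block_by_hash.get(h)
    if block is None:
        block = F.pop()                    # allocate a fresh physical page
        block_by_hash[h] = block
        # ...prefill K/V for `page_tokens` into this page...
    refcount[block] = refcount.get(block, 0) + 1
    page_table.setdefault(seq_id, []).append(block)
    return block, h                        # return h so the caller can chain the next page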

Technical integration is facilitated by fused or JIT-compiled attention kernels (e.g., FlexAttention in FMS or dynamic kernels in vLLM) that support scattered KV cache layout while retaining high memory bandwidth efficiency (Joshi et al., 8 Jun 2025). Virtual tensor infrastructure (vTensor (Xu et al., 22 Jul 2024)) further decouples physical allocation from computation, eliminating kernel overhead from address translation and enabling 1.86–2.4× throughput improvement through dynamic (de)allocation.
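To make the scattered-layout point concrete, the following NumPy sketch shows how a per-sequence page table lets attention read keys and values from non-contiguous physical pages. It is illustrative only: production kernels such as PagedAttention or FlexAttention index pages inside a fused kernel rather than materializing a contiguous copy, and the names K_pool, V_pool, page_table, and P are assumptions for this example.

import numpy as np

P, d = 16, 64                      # page size in tokens and head dimension (illustrative)

# Physical pools of pages plus a per-sequence page table.
K_pool = np.random.randn(1024, P, d).astype(np.float32)
V_pool = np.random.randn(1024, P, d).astype(np.float32)
page_table = {0: [17, 3, 942]}     # sequence 0 lives in three non-contiguous pages

def gather_kv(seq_id, seq_len):
    """Assemble the logically contiguous KV cache of a sequence from scattered pages."""
    blocks = page_table[seq_id]
    K = np.concatenate([K_pool[b] for b in blocks], axis=0)[:seq_len]
    V = np.concatenate([V_pool[b] for b in blocks], axis=0)[:seq_len]
    return K, V

def attend(q, seq_id, seq_len):
    """Single-query attention over the gathered cache: softmax(K q / sqrt(d)) @ V."""
    K, V = gather_kv(seq_id, seq_len)
    scores = K @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

# Example: attend with a random query over the first 40 cached tokens of sequence 0.
out = attend(np.random.randn(d).astype(np.float32), seq_id=0, seq_len=40)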

3. Block-wise and Page-aware Eviction/Compression Algorithms

Paged KvCache provides the substrate for block-level eviction and fine-grained compression methods—prerequisites for supporting ultra-long contexts (128K–1M tokens) under tight memory budgets. Notable eviction and compression strategies aligned to paging include:

  • PagedEviction: Implements block-wise eviction based on attention-free importance proxies (e.g., the value-to-key norm ratio ||V_i||_2 / ||K_i||_2). Evictions are block-aligned, so freeing memory never fragments pages, preserving memory efficiency and throughput (Chitty-Venkata et al., 4 Sep 2025); see the sketch at the end of this section.
  • KV-Compress: Supports variable-rate, per-head/per-layer contiguous block eviction that physically reclaims memory under PagedAttention. Per-head importance metrics (squared attention weights) guide block selection. Compression ratios of 8× and 64× incur negligible and ≤10% accuracy degradation, respectively, with throughput gains up to 5.18× (Rehg, 30 Sep 2024).
  • KVCrush: Uses binary head-wise attention-pattern representations to cluster and select representative tokens, maintaining high accuracy at 4× compression with <0.5% latency overhead. A page-wise extension preserves accuracy and efficiency in end-to-end paged systems (Jha et al., 24 Feb 2025).
  • DefensiveKV/Layer-DefensiveKV: Applies worst-case risk aggregation (max-pooling importance) with per-layer/global top-K selection, synergizing with page/block-based memory layouts to robustly manage temporal dynamics in token importance (Feng et al., 15 Oct 2025).

These algorithms are intrinsic to paged frameworks (as opposed to older token-level policies that induce widespread fragmentation) and are essential for matching realized memory savings to theoretical compression rates (Rehg, 30 Sep 2024).
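As a concrete illustration of block-aligned eviction, the sketch below scores whole pages with an attention-free norm proxy and returns evicted pages to the free list. It is a minimal, hedged approximation in the spirit of PagedEviction, not the paper's exact algorithm, and it reuses the illustrative K_pool, V_pool, page_table, and free list F from the earlier examples.

import numpy as np

def block_importance(block_id):
    """Attention-free proxy for a page: mean ||V_i||_2 / ||K_i||_2 over its tokens."""
    k_norms = np.linalg.norm(K_pool[block_id], axis=-1) + 1e-6   # avoid division by zero
    v_norms = np.linalg.norm(V_pool[block_id], axis=-1)
    return float(np.mean(v_norms / k_norms))

def evict_to_budget(seq_id, budget_blocks, protect_last=1):
    """Evict whole pages, least important first, until the sequence fits its page budget.

    The most recent `protect_last` pages are never evicted, since tokens near the
    generation frontier are usually still being attended to.
    """
    blocks = page_table[seq_id]
    while len(blocks) > max(budget_blocks, protect_last):
        victim = min(blocks[:-protect_last] if protect_last else blocks,
                     key=block_importance)
        blocks.remove(victim)
        F.append(victim)          # the physical page is immediately reusable
    page_table[seq_id] = blocks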

4. Adaptive, Personalized, and Importance-driven Paging Strategies

Beyond structural page/block management, modern paged KvCache systems leverage task, model, and workload adaptivity to further improve efficiency:

  • Cake-slicing (CAKE): Allocates memory globally across layers by quantifying spatial/temporal attention dynamics to compute per-layer "preference scores", adaptively distributing the total cache ("cake") proportional to live requirements. Layer budgets are refined in a cascading manner with attention-shift-tolerant eviction indicators (Qin et al., 16 Mar 2025).
  • Personalized/Optimal Allocation (XKV, BaKlaVa, Ada-KV): Use per-layer/head profiling (one-time or mini-prefill) and combinatorial optimization to assign personalized cache budgets guided by importance vectors, enabling up to 61.6% and 70% memory reduction over paged methods with static per-head/layer budgets (Li et al., 8 Dec 2024, Gulhan et al., 18 Feb 2025, Feng et al., 16 Jul 2024).
  • Workload-aware and semantic-aware paging: Empirical studies at scale reveal highly skewed cache reuse and short block lifespans per request/user category, motivating eviction priority based on predicted per-type/per-turn reuse probability. Fine-grained adaptive retrieval (LouisKV) triggers sub-page segment transfers only at semantic boundaries, amortizing retrieval overheads and improving end-to-end latency by up to 4.7× (Wu et al., 13 Oct 2025, Wang et al., 3 Jun 2025).

Such strategies enhance memory and compute scheduling (i) by aligning cache allocation to actual utility, (ii) by prioritizing blocks likely to be reused, and (iii) by supporting dynamic adaptation to observed or predicted changes in request/workload characteristics.
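The allocation side of these strategies can be sketched as a proportional split of a global page budget across layers. This is an illustrative simplification of CAKE/XKV-style allocation, not any paper's exact scoring or optimization procedure; layer_scores stands in for whatever per-layer importance profile the system measures.

def allocate_layer_budgets(layer_scores, total_blocks, min_blocks=1):
    """Split a global page budget across layers in proportion to per-layer importance.

    layer_scores: non-negative importance scores, one per layer (e.g. from a
                  one-time profiling or mini-prefill pass).
    total_blocks: total number of KV-cache pages available across all layers.
    """
    floor = min_blocks * len(layer_scores)
    assert total_blocks >= floor, "budget too small for the per-layer floor"
    total_score = sum(layer_scores) or 1.0
    spare = total_blocks - floor
    budgets = [min_blocks + int(spare * s / total_score) for s in layer_scores]
    # Hand out pages lost to integer rounding, favouring the highest-scoring layers.
    order = sorted(range(len(layer_scores)), key=lambda i: -layer_scores[i])
    for j in range(total_blocks - sum(budgets)):
        budgets[order[j % len(order)]] += 1
    return budgets

For example, allocate_layer_budgets([2.0, 1.0, 1.0], total_blocks=40) yields [20, 10, 10]; real systems additionally refine such budgets online as attention patterns shift (the cascading refinement in CAKE is one instance).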

5. Hierarchical Paging: Disaggregation, Tiered Storage, and Distributed Serving

Paged KvCache is foundational for multi-tier and disaggregated inference architectures, enabling partitioning and transfer of cache blocks between local (GPU, DRAM) and remote (CPU, SSD, network) storage at block/page granularity.

  • Disaggregated serving (Mooncake, TransferEngine): Decouples prefill and decode clusters, leveraging a centralized but paged KVCache pool. Pages (hash-chained blocks, e.g., 512 tokens) are deduplicated, migrated across DRAM/SSD and compute nodes using RDMA-based messenger services (Qin et al., 24 Jun 2024, Licker et al., 31 Oct 2025).
  • Hierarchical storage (AdaptCache): Applies per-entry adaptive compression and device (DRAM, SSD) placement using utility-based greedy algorithms. Compressing high-reuse, highly compressible pages more aggressively lets more of them fit in DRAM, raising hit rates (up to 81%) and reducing time-to-first-token by 1.4–2.4× versus static baselines (Feng et al., 28 Aug 2025).
  • RDMA-optimized paging (TransferEngine): Supports efficient cross-server paged KvCache transfer for dynamic elastic scaling, using heads-first, contiguous page layout to maximize RDMA throughput and custom notification mechanics (ImmCounter, UVM watcher), saturating 400 Gbps per GPU and overlapping transfer with computation (Licker et al., 31 Oct 2025).

These multi-tier architectures dynamically orchestrate prefix reuse, block-wise transfer, and eviction for SLO-compliant throughput under resource and concurrency constraints.
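The placement decision in such tiered designs can be viewed as a knapsack-style problem; the greedy sketch below keeps the pages with the highest expected benefit per byte in DRAM and spills the rest to SSD. It is a hedged illustration of the general utility-based idea with an assumed utility model, not AdaptCache's actual formulation.

from dataclasses import dataclass

@dataclass
class CachedPage:
    page_id: int
    size_bytes: int
    reuse_prob: float        # predicted probability the page is reused soon
    recompute_cost_s: float  # time saved per hit when served from the fast tier

def place_pages(pages, dram_capacity_bytes):
    """Greedy placement: highest utility-per-byte pages stay in DRAM, the rest go to SSD."""
    def utility_per_byte(p):
        return p.reuse_prob * p.recompute_cost_s / p.size_bytes

    dram, ssd, used = [], [], 0
    for p in sorted(pages, key=utility_per_byte, reverse=True):
        if used + p.size_bytes <= dram_capacity_bytes:
            dram.append(p)
            used += p.size_bytes
        else:
            ssd.append(p)
    return dram, ssd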

6. Practical Design Considerations and Limitations

Paged KvCache strategies must balance operational constraints such as page/block size, device granularity, kernel integration, and scheduling. Key factors include:

  • Page size and block partitioning: Page/block size directly modulates fragmentation, allocation overhead, and transfer efficiency. Optimal sizes are typically grid-searched (common values are 64–512 tokens) to match workload batch sizes and model memory-access patterns (Joshi et al., 8 Jun 2025); a footprint sketch follows this list.
  • Attention kernel compatibility: Fusing or JIT-compiling attention kernels (e.g., via FlexAttention) supports scattered memory layouts but requires careful mask/offset management to avoid host-device memory copies or compute bottlenecks (Joshi et al., 8 Jun 2025).
  • Cache management overhead: The major algorithms (PagedEviction, KV-Compress, DefensiveKV, KVCrush) are designed to run in time linear in the cache size, with reported runtime overheads on the order of 0.5–1.5%.
  • Workflow adaptation: Integration with LLM serving stacks (e.g., vLLM, FMS, Mooncake) is typically drop-in, not requiring model retraining or architecture modification.
  • Limitations: Pure paging cannot, on its own, select which tokens to evict or retain at granularity finer than a block; it must be paired with importance-driven retention/compression methods for maximal gains under ultra-tight budgets. Slower storage tiers (NVMe/SSD) can also add latency absent from GPU- or DRAM-only serving environments.
  • Compatibility: Paged KvCache is compatible and often orthogonal to quantization, tensor offload/compression, and model adaptation techniques.
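As a back-of-the-envelope aid for the page-size discussion above, the helper below computes the memory footprint of one page across all layers under a standard FP16 KV layout (2 bytes per element, keys plus values); the example configuration is illustrative rather than tied to any particular system.

def page_bytes(page_size_tokens, num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Memory footprint of one KV-cache page across all layers (keys + values)."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes * page_size_tokens

# Example: a Llama-3-8B-style configuration (32 layers, 8 KV heads of dim 128, FP16)
# with 256-token pages costs 32 MiB per page, so an 80 GB device fits on the order of
# 2,500 such pages at most, and substantially fewer once weights and activations are
# accounted for.
print(page_bytes(256, num_layers=32, num_kv_heads=8, head_dim=128) / 2**20, "MiB")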

7. Impact, Experimental Benchmarks, and Future Directions

Paged KvCache technologies are now pervasive in LLM inference research and production deployments, undergirding advances in scaling, efficiency, and robustness:

  • Throughput and memory: Up to 5.18× throughput increases (KV-Compress (Rehg, 30 Sep 2024)), >10× decoding speedup over full-KV (CAKE (Qin et al., 16 Mar 2025)), stable memory usage at 3% of full cache while retaining task accuracy.
  • Quality preservation: State-of-the-art adaptive/compressed paged KvCache strategies (CAKE, XKV, Layer-DefensiveKV, KVCrush) consistently maintain or surpass full-KV accuracy, even at extreme compression ratios.
  • Scalability and concurrency: Decoupled memory/computation (vTensor (Xu et al., 22 Jul 2024)), hierarchical pools, and RDMA-based paging (TransferEngine (Licker et al., 31 Oct 2025)) democratize serving of million-token contexts and large numbers of concurrent requests on commodity and cloud hardware.
  • Multi-disciplinary impact: Techniques cross-fertilize OS, database, networking, and deep learning system design, with applications in edge, real-time, and elastic LLM deployment settings.

The paged KvCache paradigm continues to evolve, with active areas including page/block-aware retraining, kernel/hardware co-optimization, multi-tier predictive scheduling, and integration with model-side adaptation to memory and bandwidth constraints.

