KVCache Management in LLMs

Updated 15 October 2025
  • KVCache management is the process of organizing intermediate key and value tensors used in transformer self-attention, critical for LLM throughput and latency optimization.
  • Innovative techniques like disaggregated memory pooling, mixed-precision compression, and dynamic eviction enable significant improvements in context length and computational efficiency.
  • Adaptive strategies including layer-wise allocation, game-theoretic head budgeting, and workload-aware eviction policies drive scalability and efficient resource use on constrained hardware.

Key-Value (KV) Cache management in the context of LLMs and related transformer-based architectures refers to the design, allocation, compression, sharing, and scheduling policies governing the intermediate key and value tensor storage used by self-attention mechanisms during autoregressive inference. With model and context scaling, the KV cache constitutes the largest and most dynamic memory footprint in practical LLM serving and inference pipelines. Its efficient management directly impacts model throughput, latency, scalability, and the feasibility of processing long contexts on constrained hardware resources.
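
To ground the terminology, the toy sketch below shows a per-layer KV cache that is appended to at every decoding step, making the linear growth of memory with sequence length (and with layers × heads × head dimension) concrete. The class, shapes, and sizes are illustrative only and are not drawn from any cited system.

```python
import numpy as np

# Toy per-layer KV cache: one (key, value) slice is appended per generated token,
# so memory grows linearly with sequence length and with layers * heads * head_dim.
class KVCache:
    def __init__(self, num_layers, num_heads, head_dim, dtype=np.float16):
        self.dtype = dtype
        self.keys = [np.zeros((num_heads, 0, head_dim), dtype=dtype) for _ in range(num_layers)]
        self.values = [np.zeros((num_heads, 0, head_dim), dtype=dtype) for _ in range(num_layers)]

    def append(self, layer, k, v):
        # k, v: (num_heads, 1, head_dim) projections of the newest token
        self.keys[layer] = np.concatenate([self.keys[layer], k], axis=1)
        self.values[layer] = np.concatenate([self.values[layer], v], axis=1)

    def bytes_used(self):
        itemsize = np.dtype(self.dtype).itemsize
        return sum(k.size + v.size for k, v in zip(self.keys, self.values)) * itemsize

# Deliberately small configuration for illustration; production models are far larger.
cache = KVCache(num_layers=8, num_heads=32, head_dim=128)
step_kv = np.zeros((32, 1, 128), dtype=np.float16)
for _ in range(128):                      # decode 128 tokens
    for layer in range(8):
        cache.append(layer, step_kv, step_kv)
print(f"{cache.bytes_used() / 2**20:.0f} MiB")   # footprint scales linearly with tokens decoded
```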

1. Architectural Disaggregation and Resource Pooling

A fundamental evolution in KVCache management is the architectural decoupling of attention memory from the rest of the transformer pipeline and the pooling of memory resources across distributed hardware. Infinite-LLM (Lin et al., 5 Jan 2024) introduces DistAttention, where each attention layer’s KV cache is partitioned into fixed-size sub-units (“rBlocks”), enabling both fine-grained management and distributed computation. Each rBlock can be independently allocated, swapped, or scheduled across devices. The system orchestrates this via a two-tiered control plane: a local rManager virtualizes per-device memory for sub-block allocation, while a global gManager maintains a distributed ledger of available and “borrowed” memory across the cluster. This allows the aggregate GPU (and even CPU) memory pool to be used flexibly in response to workload peaks, enabling nearly 2M-token contexts and a 1.35–3.4× improvement in throughput compared to node-local page-based swapping strategies. Mooncake (Qin et al., 24 Jun 2024) further decouples prefill and decode clusters, building a tiered cache hierarchy using underutilized CPU, DRAM, and SSD resources, coordinated via a KVCache-centric scheduler (“Conductor”) that manages cache-aware routing and dynamic hot-spot migration between nodes.
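
As a concrete illustration of the two-tier control plane, the toy sketch below pairs a per-device block allocator with a global ledger that lets one device borrow rBlocks from a peer. The class names echo the paper's rManager/gManager terminology, but the interfaces and the borrowing policy (take from the peer with the most free blocks) are simplifications, not Infinite-LLM's actual implementation.

```python
# Toy two-tier pooling sketch: a per-device "rManager" hands out fixed-size rBlocks,
# and a global "gManager" ledger lets a device borrow blocks from a peer when it runs out.
class RManager:
    def __init__(self, device_id, total_blocks):
        self.device_id = device_id
        self.free_blocks = list(range(total_blocks))

    def allocate(self, n):
        if len(self.free_blocks) < n:
            return None                              # not enough local memory; caller must borrow
        taken, self.free_blocks = self.free_blocks[:n], self.free_blocks[n:]
        return [(self.device_id, b) for b in taken]


class GManager:
    def __init__(self, rmanagers):
        self.rmanagers = {rm.device_id: rm for rm in rmanagers}

    def allocate(self, device_id, n_blocks):
        # Try the requesting device first, then borrow from the peer with the most free blocks.
        local = self.rmanagers[device_id].allocate(n_blocks)
        if local is not None:
            return local
        donor = max(self.rmanagers.values(), key=lambda rm: len(rm.free_blocks))
        return donor.allocate(n_blocks)


pool = GManager([RManager("gpu0", 4), RManager("gpu1", 16)])
print(pool.allocate("gpu0", 8))   # gpu0 is full, so the blocks are borrowed from gpu1
```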

2. Memory Compression and Token Retention Policies

To address the linear growth of the KV cache with sequence length, recent developments employ a spectrum of compression and pruning algorithms. One approach is product quantization, as seen in PQCache (Zhang et al., 1 Jul 2024): keys (and optionally values) are partitioned and quantized per sub-vector, transforming retrieval into a maximum inner-product search over centroids. Selectively attending to only the top-k tokens, as determined by the approximate scores, reduces both computation and communication overhead. LeanKV (Zhang et al., 4 Dec 2024) introduces a unified framework combining mixed-precision quantization (e.g., higher precision for keys, lower for values) with per-token significance-driven pruning, while dynamic per-head sparsity adapts cache budgets to the observed attention distributions at run time. Channel shrinking via low-rank decomposition (CSKV; Wang et al., 16 Sep 2024) compresses along the channel dimension based on singular value analyses and, combined with quantization-aware fine-tuning, can yield up to 95% reduction in KV memory with limited accuracy degradation.
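
The sketch below illustrates the product-quantization retrieval path under simplified assumptions: centroids are random rather than trained, only keys are quantized, and scores are approximated from a query-centroid lookup table. The function names are hypothetical and this is not PQCache's implementation.

```python
import numpy as np

# PQ-style key compression and approximate top-k token selection (illustrative sketch).
def pq_encode(keys, centroids):
    # keys: (T, d); centroids: (m, k, d/m) -> codes: (T, m), one centroid id per sub-vector
    T, d = keys.shape
    m, k, sub = centroids.shape
    codes = np.empty((T, m), dtype=np.int32)
    for j in range(m):
        sub_keys = keys[:, j * sub:(j + 1) * sub]                      # (T, sub)
        dists = ((sub_keys[:, None, :] - centroids[j]) ** 2).sum(-1)   # (T, k)
        codes[:, j] = dists.argmin(axis=1)
    return codes

def approx_topk(query, codes, centroids, topk):
    # Precompute query-centroid inner products per sub-space, then sum them by code.
    m, k, sub = centroids.shape
    lut = np.stack([centroids[j] @ query[j * sub:(j + 1) * sub] for j in range(m)])  # (m, k)
    scores = lut[np.arange(m)[None, :], codes].sum(axis=1)                            # (T,)
    return np.argsort(-scores)[:topk]

rng = np.random.default_rng(0)
keys = rng.standard_normal((1024, 128)).astype(np.float32)
centroids = rng.standard_normal((8, 256, 16)).astype(np.float32)   # m=8 sub-spaces, k=256 codes
codes = pq_encode(keys, centroids)
query = rng.standard_normal(128).astype(np.float32)
print(approx_topk(query, codes, centroids, topk=32))                # indices of tokens to attend to
```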

Adaptive eviction and allocation strategies are central to methods such as CAKE (Qin et al., 16 Mar 2025), which frames cache allocation as a “cake slicing” problem using layer-specific preference scores derived from spatial attention entropy and temporal shift. The allocation per layer is formalized as $B_l = \frac{P_l}{\sum_k P_k} \times B_{\text{total}}$, with cascading updates during prefill and dynamic eviction indicators that combine the mean and variance of attention to retain or evict tokens.
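
The snippet below simply evaluates the budget rule above on hypothetical preference scores; in CAKE the scores $P_l$ come from measured attention entropy and temporal-shift statistics, which are not modeled here.

```python
# B_l = (P_l / sum_k P_k) * B_total, applied to made-up per-layer preference scores.
def layer_budgets(preferences, total_budget):
    denom = sum(preferences)
    return [int(round(p / denom * total_budget)) for p in preferences]

prefs = [4.0, 2.5, 1.0, 0.5]           # hypothetical preference scores P_l for 4 layers
print(layer_budgets(prefs, 4096))       # -> [2048, 1280, 512, 256] cached tokens per layer
```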

3. Layer-wise and Attention-Head Adaptive Allocation

Layer-wise and head-wise budget assignment has supplanted uniform cache allocation for more optimal trade-offs between fidelity and resource use. XKV (Li et al., 8 Dec 2024) formalizes KV cache allocation as a combinatorial optimization problem: for each layer, the importance retention ratio $R_i = \frac{\mathrm{Sum}(\mathrm{TopK}(n_i, w_i))}{\mathrm{Sum}(w_i)} \times 100\%$ is maximized under a global memory constraint. A greedy “dynamic differences of importance distribution” (DDID) strategy computes the optimal per-layer allocation, yielding up to 61.6% reduction in memory use while preserving task accuracy.
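
As a hedged illustration of greedy layer-wise budgeting (not XKV's exact DDID procedure), the allocator below repeatedly grants one more cached token to the layer whose retention ratio $R_i$ gains the most from it; the per-token attention weights are synthetic.

```python
import heapq

# Greedy per-layer budget assignment: each extra cached token goes to the layer whose
# retention ratio (share of total attention weight covered by its kept tokens) rises most.
def greedy_allocate(layer_weights, total_budget):
    sorted_w = [sorted(w, reverse=True) for w in layer_weights]   # weights sorted descending
    totals = [sum(w) for w in sorted_w]
    alloc = [0] * len(sorted_w)
    # Max-heap keyed by the marginal retention gain of caching one more token in that layer.
    heap = [(-w[0] / t, l) for l, (w, t) in enumerate(zip(sorted_w, totals))]
    heapq.heapify(heap)
    for _ in range(total_budget):
        _, l = heapq.heappop(heap)
        alloc[l] += 1
        if alloc[l] < len(sorted_w[l]):
            heapq.heappush(heap, (-sorted_w[l][alloc[l]] / totals[l], l))
    return alloc

weights = [[5, 3, 1, 1], [9, 0.5, 0.3, 0.2], [2, 2, 2, 2]]   # hypothetical per-layer weights
print(greedy_allocate(weights, total_budget=6))               # uneven budgets, e.g. [2, 1, 3]
```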

At the attention-head level, BaKlaVa (Gulhan et al., 18 Feb 2025) and CoKV (Sun et al., 21 Feb 2025) assign cache memory non-uniformly per head. BaKlaVa uses one-time profiling to estimate the cosine similarity between per-head input and output activations, allocating larger budgets to more “critical” heads. CoKV advances this by leveraging cooperative game theory and Shapley value approximations, modeling heads’ contributions as part of a coalition rather than in isolation, and allocating cache budgets to maximize the global model utility.
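
A simplified take on profiling-based head budgeting in the spirit of BaKlaVa is sketched below. The scoring direction (treating heads whose outputs diverge more from their inputs as more critical) and the profiling data are assumptions made for illustration, and CoKV's Shapley-value machinery is not modeled.

```python
import numpy as np

# Profiling-based per-head budgeting sketch: score each head from the cosine similarity
# between its input and output activations on a profiling batch, then split the cache
# budget in proportion to the (assumed) importance scores.
def head_budgets(head_inputs, head_outputs, total_budget):
    scores = []
    for x, y in zip(head_inputs, head_outputs):        # x, y: (tokens, head_dim)
        cos = np.sum(x * y, axis=-1) / (
            np.linalg.norm(x, axis=-1) * np.linalg.norm(y, axis=-1) + 1e-8)
        scores.append(1.0 - float(np.mean(cos)))       # hypothetical importance proxy
    scores = np.maximum(np.array(scores), 1e-6)
    return np.round(scores / scores.sum() * total_budget).astype(int)

rng = np.random.default_rng(1)
inputs = [rng.standard_normal((64, 128)) for _ in range(8)]                 # 8 heads
outputs = [x + 0.1 * h * rng.standard_normal((64, 128)) for h, x in enumerate(inputs)]
print(head_budgets(inputs, outputs, total_budget=8192))   # more budget for "critical" heads
```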

4. Decoding-Efficient KVCache Policies and System Design

Several systems improve the practical aspects of KVCache management beyond memory reduction, focusing on cache transfer, scheduling, and throughput optimization. P/D-Serve (Jin et al., 15 Aug 2024) shifts from block-fixed to contiguous buffer transfers, enabling single-burst KVCache device-to-device migration over RDMA (RoCE), with a 46% reduction in D2D KVCache transfer time and a 60% improvement in throughput at scale. PiKV (Liu et al., 2 Aug 2025) extends this line for mixture-of-experts (MoE) models via expert-sharded KV storage, reducing redundant cache replication and integrating modular compression, adaptive scheduling, and query-aware routing, all validated against hardware-optimized primitives (e.g., via Nvidia kvpress). KunServe (Cheng et al., 24 Dec 2024) departs from KVCache-centric paradigms to a parameter-centric approach: during memory throttling, selectively dropping replicated model parameters releases immediate memory for KV allocations, with live KVCache exchange and remote attention across the cluster, reducing tail TTFT by up to 72× under bursty workloads.
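
The contrast between block-fixed and contiguous transfers can be made concrete with the packing sketch below: a request's per-layer KV blocks are flattened into one contiguous buffer so they can be shipped in a single burst rather than many block-sized transfers. The header-plus-flat-payload framing is an assumption for illustration, not P/D-Serve's wire format.

```python
import numpy as np

# Pack per-layer KV blocks into one contiguous buffer for a single-burst transfer,
# and unpack them on the receiving side using a small shape header.
def pack(kv_blocks):
    shapes = [b.shape for b in kv_blocks]
    payload = np.concatenate([b.ravel() for b in kv_blocks])
    return shapes, payload                       # ship `payload` as one contiguous region

def unpack(shapes, payload):
    blocks, offset = [], 0
    for shape in shapes:
        n = int(np.prod(shape))
        blocks.append(payload[offset:offset + n].reshape(shape))
        offset += n
    return blocks

blocks = [np.random.rand(2, 16, 128).astype(np.float16) for _ in range(4)]   # 4 layers of KV
shapes, buf = pack(blocks)
restored = unpack(shapes, buf)
assert all(np.array_equal(a, b) for a, b in zip(blocks, restored))
print(buf.nbytes, "bytes moved in one contiguous transfer")
```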

5. Workload-Aware and Semantic Cache Reuse

Empirical analysis of production-scale serving (Wang et al., 3 Jun 2025) demonstrates that KVCache reuse patterns are highly workload-dependent: while multi-turn requests drive much of the reuse in some workloads, even single-turn requests (with overlapping system prompts) constitute up to 97% of cache hits in API-driven, business-facing (to-B) scenarios. The reuse time for KV cache blocks is typically well captured by exponential models, supporting predictive, workload-aware eviction policies that compute priorities from a tuple of reuse probability and spatial offset. Compared to LRU and LFU, this policy yields a 1.5–3.9% higher cache hit rate and up to 41.9% lower QTTFT under realistic request traces.
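
A minimal sketch of such a predictive priority is given below, assuming each cached block carries a reuse-rate estimate fit from an exponential reuse-time model for its workload class; the rates, horizon, and tie-breaking by spatial offset are illustrative assumptions rather than the published policy.

```python
import math

# Workload-aware eviction sketch: priority = (estimated reuse probability within a horizon,
# spatial offset); the block with the lowest priority tuple is evicted first.
def reuse_probability(rate_per_s, horizon_s=600.0):
    # P(reuse within horizon) under an exponential reuse-time model with the given rate.
    return 1.0 - math.exp(-rate_per_s * horizon_s)

def eviction_priority(block):
    return (reuse_probability(block["reuse_rate"]), -block["offset"])

blocks = [
    {"id": "system-prompt", "reuse_rate": 1 / 60.0,   "offset": 0},   # hot shared prefix
    {"id": "turn-3",        "reuse_rate": 1 / 1800.0, "offset": 40},  # stale conversation turn
    {"id": "turn-7",        "reuse_rate": 1 / 300.0,  "offset": 90},
]
victim = min(blocks, key=eviction_priority)
print("evict:", victim["id"])     # the block least likely to be reused goes first
```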

Expanding from exact to fuzzy sharing, SemShareKV (Zhao et al., 29 Sep 2025) applies token-level locality-sensitive hashing (LSH) to match tokens between lexically distinct but semantically similar prompts, injecting relevant cached KV states with rotary position embedding alignment. Ablation studies show negligible quality loss yet a 6.25× TTFT speedup and 42% GPU memory reduction in multi-document summarization.
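
The toy sketch below shows the token-level LSH matching step under simplified assumptions: random-hyperplane signatures over synthetic embeddings stand in for the model's hidden states, and the rotary position re-alignment of borrowed KV entries is omitted. It is not SemShareKV's implementation.

```python
import numpy as np

# Token-level LSH matching: tokens in a new prompt that hash to the same signature as
# tokens of a cached prompt can (in principle) borrow that token's cached KV state.
def lsh_signatures(embeddings, planes):
    return (embeddings @ planes.T > 0).astype(np.uint8)     # (tokens, n_bits) sign bits

def match_tokens(new_emb, cached_emb, planes):
    new_sig = lsh_signatures(new_emb, planes)
    cached_sig = lsh_signatures(cached_emb, planes)
    index = {sig.tobytes(): i for i, sig in enumerate(cached_sig)}
    return {i: index[sig.tobytes()] for i, sig in enumerate(new_sig) if sig.tobytes() in index}

rng = np.random.default_rng(2)
planes = rng.standard_normal((16, 64))                       # 16-bit hash over 64-dim embeddings
cached = rng.standard_normal((32, 64))
new = cached + 0.01 * rng.standard_normal((32, 64))          # semantically near-duplicate prompt
print(match_tokens(new, cached, planes))                     # new-token -> cached-token reuse map
```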

6. Comparative Efficacy and Scaling Considerations

Recent comparative studies report dramatic improvements in both efficiency and capacity. Infinite-LLM (Lin et al., 5 Jan 2024) supports nearly 1.9 million token contexts with a 2×–19× increase in maximum context over prior designs; Mooncake (Qin et al., 24 Jun 2024) achieves up to 525% throughput gain. Compression-centric methods such as PQCache (Zhang et al., 1 Jul 2024), DynamicKV (Zhou et al., 19 Dec 2024), and CAKE (Qin et al., 16 Mar 2025) sustain full or even improved model performance with cache retention ratios as low as 1–3.2%. LeanKV (Zhang et al., 4 Dec 2024) and KVCrush (Jha et al., 24 Feb 2025) are shown to reduce memory use by 4×–11× and accelerate generation with minimal or sub-1% accuracy compromise. Importantly, the design and evaluation of these techniques account for the tight coupling between cache size, latency, throughput, attention kernel performance, and the statistical distribution of attention across layers, heads, and query types.

| System/Paper | Compression Ratio / Memory Reduction | Throughput/Latency Impact | Quality Impact |
| --- | --- | --- | --- |
| Infinite-LLM (Lin et al., 5 Jan 2024) | Up to 19× longer contexts | 1.03–2.4× throughput gain | No reported accuracy drop |
| Mooncake (Qin et al., 24 Jun 2024) | N/A | 525% throughput gain (simulated); 75% (real) | Meets SLOs |
| PQCache (Zhang et al., 1 Jul 2024) | 5–10× token reduction | Minimal added latency; scalable | Maintains/improves scores |
| LeanKV (Zhang et al., 4 Dec 2024) | 3–5× (up to 11× with ∼5% loss) | 1.9–6.9× throughput gain | ∼Lossless (<3% in sub-optimal settings) |
| CAKE (Qin et al., 16 Mar 2025) | ≈3.2% cache retention | >10× decoding speedup (128K context) | Full or improved performance |
| CoKV (Sun et al., 21 Feb 2025) | Memory usage ↓ up to 64% | <50% decoding latency (relative to full cache) | Maintains/exceeds baseline |
| SemShareKV (Zhao et al., 29 Sep 2025) | 42% GPU memory saving (5k tokens) | 6.25× TTFT speedup | Negligible quality loss |

7. Task and Modality Adaptivity

Many works now emphasize adaptive, context-aware, and even modality-aware KVCache management. VL-Cache (Tu et al., 29 Oct 2024) demonstrates that directly importing LLM KV compression methods is suboptimal for vision-LLMs, due to their divergent attention and sparsity patterns. VL-Cache dynamically computes layer-wise sparsity and adopts a modality-aware token scoring policy, allocating cache budget per layer explicitly proportional to information density and focusing on post-visual context. DynamicKV (Zhou et al., 19 Dec 2024) introduces progressive per-layer token selection policies, with periodic, task-aware budget normalization and buffer updates, retaining only ∼1.7% of the cache with 85% of full-quality performance, outperforming fixed-pattern methods under extreme compression.
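
A rough sketch of sparsity- and modality-aware budgeting in this spirit appears below; the sparsity values, text-token bonus, and scoring rule are illustrative assumptions rather than the published VL-Cache or DynamicKV algorithms.

```python
import numpy as np

# Sparsity/modality-aware budgeting sketch: each layer's retained-token budget is
# proportional to its attention density (1 - sparsity), and within a layer tokens are
# ranked by accumulated attention with a bonus for post-visual (text) positions.
def vl_layer_budgets(layer_sparsity, total_budget):
    density = 1.0 - np.asarray(layer_sparsity)
    return np.round(density / density.sum() * total_budget).astype(int)

def select_tokens(attn_scores, is_text_token, budget, text_bonus=0.2):
    scores = attn_scores + text_bonus * is_text_token.astype(float)
    return np.argsort(-scores)[:budget]

sparsity = [0.55, 0.80, 0.95, 0.98]                      # deeper layers attend more sparsely
budgets = vl_layer_budgets(sparsity, total_budget=2048)
print(budgets)                                           # most budget goes to the densest layers

rng = np.random.default_rng(3)
attn = rng.random(1024)                                  # per-token accumulated attention (one layer)
is_text = np.arange(1024) >= 576                         # tokens after a 576-patch image
print(select_tokens(attn, is_text, budgets[1])[:10])     # retained token ids for layer 1
```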

Conclusion

KVCache management has progressed from rigid, uniform allocation and single-node caching to distributed, dynamic, and modality/task-adaptive methods. Techniques span system-level architectural innovations (disaggregated attention, pooled cluster memory, tiered storage), fine-grained memory compression (mixed-precision, quantization, low-rank, PQ, pruning, binary feature clustering), workload-aware and game-theoretic allocation (per-head/layer, Shapley, DDID), and efficient transfer and scheduling policies (contiguous RDMA transfer, remote attention, cooperative expert shard placement). Recent empirical studies quantify the benefits of these strategies in terms of memory reduction, throughput gains, minimum quality loss, and operational scalability in production environments. As model sizes, context lengths, and deployment scales continue to increase, KVCache management remains a central and rapidly evolving foundational challenge in efficient LLM serving.
