KVCache Management in LLMs

Updated 15 October 2025
  • KVCache management is the process of organizing intermediate key and value tensors used in transformer self-attention, critical for LLM throughput and latency optimization.
  • Innovative techniques like disaggregated memory pooling, mixed-precision compression, and dynamic eviction enable significant improvements in context length and computational efficiency.
  • Adaptive strategies including layer-wise allocation, game-theoretic head budgeting, and workload-aware eviction policies drive scalability and efficient resource use on constrained hardware.

Key-Value (KV) Cache management in the context of LLMs and related transformer-based architectures refers to the design, allocation, compression, sharing, and scheduling policies governing the intermediate key and value tensor storage used by self-attention mechanisms during autoregressive inference. With model and context scaling, the KV cache constitutes the largest and most dynamic memory footprint in practical LLM serving and inference pipelines. Its efficient management directly impacts model throughput, latency, scalability, and the feasibility of processing long contexts on constrained hardware resources.
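
To ground the terminology, the toy sketch below shows a per-layer KV cache that is appended to at every decoding step, making the linear growth of memory with sequence length (and with layers × heads × head dimension) concrete. The class, shapes, and sizes are illustrative only and are not drawn from any cited system.

```python
import numpy as np

# Toy per-layer KV cache: one (key, value) slice is appended per generated token,
# so memory grows linearly with sequence length and with layers * heads * head_dim.
class KVCache:
    def __init__(self, num_layers, num_heads, head_dim, dtype=np.float16):
        self.dtype = dtype
        self.keys = [np.zeros((num_heads, 0, head_dim), dtype=dtype) for _ in range(num_layers)]
        self.values = [np.zeros((num_heads, 0, head_dim), dtype=dtype) for _ in range(num_layers)]

    def append(self, layer, k, v):
        # k, v: (num_heads, 1, head_dim) projections of the newest token
        self.keys[layer] = np.concatenate([self.keys[layer], k], axis=1)
        self.values[layer] = np.concatenate([self.values[layer], v], axis=1)

    def bytes_used(self):
        itemsize = np.dtype(self.dtype).itemsize
        return sum(k.size + v.size for k, v in zip(self.keys, self.values)) * itemsize

# Deliberately small configuration for illustration; production models are far larger.
cache = KVCache(num_layers=8, num_heads=32, head_dim=128)
step_kv = np.zeros((32, 1, 128), dtype=np.float16)
for _ in range(128):                      # decode 128 tokens
    for layer in range(8):
        cache.append(layer, step_kv, step_kv)
print(f"{cache.bytes_used() / 2**20:.0f} MiB")   # footprint scales linearly with tokens decoded
```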

1. Architectural Disaggregation and Resource Pooling

A fundamental evolution in KVCache management is the architectural decoupling of attention memory from the rest of the transformer pipeline and the pooling of memory resources across distributed hardware. Infinite-LLM (Lin et al., 5 Jan 2024) introduces DistAttention, where each attention layer’s KV cache is partitioned into fixed-size sub-units (“rBlocks”), enabling both fine-grained management and distributed computation. Each rBlock can be independently allocated, swapped, or scheduled across devices. The system orchestrates this via a two-tiered control plane: a local rManager virtualizes per-device memory for sub-block allocation, while a global gManager maintains a distributed ledger of available and “borrowed” memory across the cluster. This allows the aggregate GPU (and even CPU) memory pool to be used flexibly in response to workload peaks, enabling nearly 2M-token contexts and a 1.35–3.4× improvement in throughput compared to node-local page-based swapping strategies. Mooncake (Qin et al., 24 Jun 2024) further decouples prefill and decode clusters, building a tiered cache hierarchy using underutilized CPU, DRAM, and SSD resources, coordinated via a KVCache-centric scheduler (“Conductor”) that manages cache-aware routing and dynamic hot-spot migration between nodes.
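
As a concrete illustration of the two-tier control plane, the toy sketch below pairs a per-device block allocator with a global ledger that lets one device borrow rBlocks from a peer. The class names echo the paper's rManager/gManager terminology, but the interfaces and the borrowing policy (take from the peer with the most free blocks) are simplifications, not Infinite-LLM's actual implementation.

```python
# Toy two-tier pooling sketch: a per-device "rManager" hands out fixed-size rBlocks,
# and a global "gManager" ledger lets a device borrow blocks from a peer when it runs out.
class RManager:
    def __init__(self, device_id, total_blocks):
        self.device_id = device_id
        self.free_blocks = list(range(total_blocks))

    def allocate(self, n):
        if len(self.free_blocks) < n:
            return None                              # not enough local memory; caller must borrow
        taken, self.free_blocks = self.free_blocks[:n], self.free_blocks[n:]
        return [(self.device_id, b) for b in taken]


class GManager:
    def __init__(self, rmanagers):
        self.rmanagers = {rm.device_id: rm for rm in rmanagers}

    def allocate(self, device_id, n_blocks):
        # Try the requesting device first, then borrow from the peer with the most free blocks.
        local = self.rmanagers[device_id].allocate(n_blocks)
        if local is not None:
            return local
        donor = max(self.rmanagers.values(), key=lambda rm: len(rm.free_blocks))
        return donor.allocate(n_blocks)


pool = GManager([RManager("gpu0", 4), RManager("gpu1", 16)])
print(pool.allocate("gpu0", 8))   # gpu0 is full, so the blocks are borrowed from gpu1
```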

2. Memory Compression and Token Retention Policies

To address the linear growth of the KV cache with sequence length, recent developments employ a spectrum of compression and pruning algorithms. One approach is product quantization, as seen in PQCache (Zhang et al., 1 Jul 2024): keys (and optionally values) are partitioned and quantized per sub-vector, transforming retrieval into a maximum inner-product search over centroids. Selectively attending to only the top-k tokens, as determined by the approximate scores, reduces both computation and communication overhead. LeanKV (Zhang et al., 4 Dec 2024) introduces a unified framework combining mixed-precision quantization (e.g., higher precision for keys, lower for values) with per-token significance-driven pruning, while dynamic per-head sparsity adapts cache budgets to the observed attention distributions at run time. Channel shrinking via low-rank decomposition (CSKV; Wang et al., 16 Sep 2024) compresses along the channel dimension based on singular value analyses and, combined with quantization-aware fine-tuning, can yield up to 95% reduction in KV memory with limited accuracy degradation.
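
The sketch below illustrates the product-quantization retrieval path under simplified assumptions: centroids are random rather than trained, only keys are quantized, and scores are approximated from a query-centroid lookup table. The function names are hypothetical and this is not PQCache's implementation.

```python
import numpy as np

# PQ-style key compression and approximate top-k token selection (illustrative sketch).
def pq_encode(keys, centroids):
    # keys: (T, d); centroids: (m, k, d/m) -> codes: (T, m), one centroid id per sub-vector
    T, d = keys.shape
    m, k, sub = centroids.shape
    codes = np.empty((T, m), dtype=np.int32)
    for j in range(m):
        sub_keys = keys[:, j * sub:(j + 1) * sub]                      # (T, sub)
        dists = ((sub_keys[:, None, :] - centroids[j]) ** 2).sum(-1)   # (T, k)
        codes[:, j] = dists.argmin(axis=1)
    return codes

def approx_topk(query, codes, centroids, topk):
    # Precompute query-centroid inner products per sub-space, then sum them by code.
    m, k, sub = centroids.shape
    lut = np.stack([centroids[j] @ query[j * sub:(j + 1) * sub] for j in range(m)])  # (m, k)
    scores = lut[np.arange(m)[None, :], codes].sum(axis=1)                            # (T,)
    return np.argsort(-scores)[:topk]

rng = np.random.default_rng(0)
keys = rng.standard_normal((1024, 128)).astype(np.float32)
centroids = rng.standard_normal((8, 256, 16)).astype(np.float32)   # m=8 sub-spaces, k=256 codes
codes = pq_encode(keys, centroids)
query = rng.standard_normal(128).astype(np.float32)
print(approx_topk(query, codes, centroids, topk=32))                # indices of tokens to attend to
```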

Adaptive eviction and allocation strategies are central to methods such as CAKE (Qin et al., 16 Mar 2025), which frames cache allocation as a “cake slicing” problem using layer-specific preference scores derived from spatial attention entropy and temporal shift. The allocation per layer is formalized as $B_l = \frac{P_l}{\sum_k P_k} \times B_{\text{total}}$, with cascading updates during prefill and dynamic eviction indicators that combine the mean and variance of attention to retain or evict tokens.
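
The snippet below simply evaluates the budget rule above on hypothetical preference scores; in CAKE the scores $P_l$ come from measured attention entropy and temporal-shift statistics, which are not modeled here.

```python
# B_l = (P_l / sum_k P_k) * B_total, applied to made-up per-layer preference scores.
def layer_budgets(preferences, total_budget):
    denom = sum(preferences)
    return [int(round(p / denom * total_budget)) for p in preferences]

prefs = [4.0, 2.5, 1.0, 0.5]           # hypothetical preference scores P_l for 4 layers
print(layer_budgets(prefs, 4096))       # -> [2048, 1280, 512, 256] cached tokens per layer
```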

3. Layer-wise and Attention-Head Adaptive Allocation

Layer-wise and head-wise budget assignment has supplanted uniform cache allocation for more optimal trade-offs between fidelity and resource use. XKV (Li et al., 8 Dec 2024) formalizes KV cache allocation as a combinatorial optimization problem: for each layer, the importance retention ratio $R_i = \frac{\mathrm{Sum}(\mathrm{TopK}(n_i, w_i))}{\mathrm{Sum}(w_i)} \times 100\%$ is maximized under a global memory constraint. A greedy “dynamic differences of importance distribution” (DDID) strategy computes the optimal per-layer allocation, yielding up to 61.6% reduction in memory use while preserving task accuracy.
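
As a hedged illustration of greedy layer-wise budgeting (not XKV's exact DDID procedure), the allocator below repeatedly grants one more cached token to the layer whose retention ratio $R_i$ gains the most from it; the per-token attention weights are synthetic.

```python
import heapq

# Greedy per-layer budget assignment: each extra cached token goes to the layer whose
# retention ratio (share of total attention weight covered by its kept tokens) rises most.
def greedy_allocate(layer_weights, total_budget):
    sorted_w = [sorted(w, reverse=True) for w in layer_weights]   # weights sorted descending
    totals = [sum(w) for w in sorted_w]
    alloc = [0] * len(sorted_w)
    # Max-heap keyed by the marginal retention gain of caching one more token in that layer.
    heap = [(-w[0] / t, l) for l, (w, t) in enumerate(zip(sorted_w, totals))]
    heapq.heapify(heap)
    for _ in range(total_budget):
        _, l = heapq.heappop(heap)
        alloc[l] += 1
        if alloc[l] < len(sorted_w[l]):
            heapq.heappush(heap, (-sorted_w[l][alloc[l]] / totals[l], l))
    return alloc

weights = [[5, 3, 1, 1], [9, 0.5, 0.3, 0.2], [2, 2, 2, 2]]   # hypothetical per-layer weights
print(greedy_allocate(weights, total_budget=6))               # uneven budgets, e.g. [2, 1, 3]
```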

At the attention-head level, BaKlaVa (Gulhan et al., 18 Feb 2025) and CoKV (Sun et al., 21 Feb 2025) assign cache memory non-uniformly per head. BaKlaVa uses one-time profiling to estimate the cosine similarity between per-head input and output activations, allocating larger budgets to more “critical” heads. CoKV advances this by leveraging cooperative game theory and Shapley value approximations, modeling heads’ contributions as part of a coalition rather than in isolation, and allocating cache budgets to maximize the global model utility.
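
A simplified take on profiling-based head budgeting in the spirit of BaKlaVa is sketched below. The scoring direction (treating heads whose outputs diverge more from their inputs as more critical) and the profiling data are assumptions made for illustration, and CoKV's Shapley-value machinery is not modeled.

```python
import numpy as np

# Profiling-based per-head budgeting sketch: score each head from the cosine similarity
# between its input and output activations on a profiling batch, then split the cache
# budget in proportion to the (assumed) importance scores.
def head_budgets(head_inputs, head_outputs, total_budget):
    scores = []
    for x, y in zip(head_inputs, head_outputs):        # x, y: (tokens, head_dim)
        cos = np.sum(x * y, axis=-1) / (
            np.linalg.norm(x, axis=-1) * np.linalg.norm(y, axis=-1) + 1e-8)
        scores.append(1.0 - float(np.mean(cos)))       # hypothetical importance proxy
    scores = np.maximum(np.array(scores), 1e-6)
    return np.round(scores / scores.sum() * total_budget).astype(int)

rng = np.random.default_rng(1)
inputs = [rng.standard_normal((64, 128)) for _ in range(8)]                 # 8 heads
outputs = [x + 0.1 * h * rng.standard_normal((64, 128)) for h, x in enumerate(inputs)]
print(head_budgets(inputs, outputs, total_budget=8192))   # more budget for "critical" heads
```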

4. Decoding-Efficient KVCache Policies and System Design

Several systems improve the practical aspects of KVCache management beyond memory reduction, focusing on cache transfer, scheduling, and throughput optimization. P/D-Serve (Jin et al., 15 Aug 2024) shifts from block-fixed to contiguous buffer transfers, enabling single-burst KVCache device-to-device migration over RDMA (RoCE), with a 46% reduction in D2D KVCache transfer time and a 60% improvement in throughput at scale. PiKV (Liu et al., 2 Aug 2025) extends this line for mixture-of-experts (MoE) models via expert-sharded KV storage, reducing redundant cache replication and integrating modular compression, adaptive scheduling, and query-aware routing, all validated against hardware-optimized primitives (e.g., via Nvidia kvpress). KunServe (Cheng et al., 24 Dec 2024) departs from KVCache-centric paradigms to a parameter-centric approach: during memory throttling, selectively dropping replicated model parameters releases immediate memory for KV allocations, with live KVCache exchange and remote attention across the cluster, reducing tail TTFT by up to 72× under bursty workloads.
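
The contrast between block-fixed and contiguous transfers can be made concrete with the packing sketch below: a request's per-layer KV blocks are flattened into one contiguous buffer so they can be shipped in a single burst rather than many block-sized transfers. The header-plus-flat-payload framing is an assumption for illustration, not P/D-Serve's wire format.

```python
import numpy as np

# Pack per-layer KV blocks into one contiguous buffer for a single-burst transfer,
# and unpack them on the receiving side using a small shape header.
def pack(kv_blocks):
    shapes = [b.shape for b in kv_blocks]
    payload = np.concatenate([b.ravel() for b in kv_blocks])
    return shapes, payload                       # ship `payload` as one contiguous region

def unpack(shapes, payload):
    blocks, offset = [], 0
    for shape in shapes:
        n = int(np.prod(shape))
        blocks.append(payload[offset:offset + n].reshape(shape))
        offset += n
    return blocks

blocks = [np.random.rand(2, 16, 128).astype(np.float16) for _ in range(4)]   # 4 layers of KV
shapes, buf = pack(blocks)
restored = unpack(shapes, buf)
assert all(np.array_equal(a, b) for a, b in zip(blocks, restored))
print(buf.nbytes, "bytes moved in one contiguous transfer")
```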

5. Workload-Aware and Semantic Cache Reuse

Empirical analysis of production-scale serving (Wang et al., 3 Jun 2025) demonstrates that KVCache reuse patterns are highly workload-dependent: while multi-turn requests drive much of the reuse in some workloads, even single-turn requests (with overlapping system prompts) constitute up to 97% of cache hits in API-driven, business-facing (to-B) scenarios. The reuse time for KV cache blocks is typically well captured by exponential models, supporting predictive, workload-aware eviction policies that compute priorities from a tuple of reuse probability and spatial offset. Compared to LRU and LFU, this policy yields a 1.5–3.9% higher cache hit rate and up to 41.9% lower QTTFT under realistic request traces.
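
A minimal sketch of such a predictive priority is given below, assuming each cached block carries a reuse-rate estimate fit from an exponential reuse-time model for its workload class; the rates, horizon, and tie-breaking by spatial offset are illustrative assumptions rather than the published policy.

```python
import math

# Workload-aware eviction sketch: priority = (estimated reuse probability within a horizon,
# spatial offset); the block with the lowest priority tuple is evicted first.
def reuse_probability(rate_per_s, horizon_s=600.0):
    # P(reuse within horizon) under an exponential reuse-time model with the given rate.
    return 1.0 - math.exp(-rate_per_s * horizon_s)

def eviction_priority(block):
    return (reuse_probability(block["reuse_rate"]), -block["offset"])

blocks = [
    {"id": "system-prompt", "reuse_rate": 1 / 60.0,   "offset": 0},   # hot shared prefix
    {"id": "turn-3",        "reuse_rate": 1 / 1800.0, "offset": 40},  # stale conversation turn
    {"id": "turn-7",        "reuse_rate": 1 / 300.0,  "offset": 90},
]
victim = min(blocks, key=eviction_priority)
print("evict:", victim["id"])     # the block least likely to be reused goes first
```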

Expanding from exact to fuzzy sharing, SemShareKV (Zhao et al., 29 Sep 2025) applies token-level locality-sensitive hashing (LSH) to match tokens between lexically distinct but semantically similar prompts, injecting relevant cached KV states with rotary position embedding alignment. Ablation studies show negligible quality loss yet a 6.25× TTFT speedup and 42% GPU memory reduction in multi-document summarization.
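
The toy sketch below shows the token-level LSH matching step under simplified assumptions: random-hyperplane signatures over synthetic embeddings stand in for the model's hidden states, and the rotary position re-alignment of borrowed KV entries is omitted. It is not SemShareKV's implementation.

```python
import numpy as np

# Token-level LSH matching: tokens in a new prompt that hash to the same signature as
# tokens of a cached prompt can (in principle) borrow that token's cached KV state.
def lsh_signatures(embeddings, planes):
    return (embeddings @ planes.T > 0).astype(np.uint8)     # (tokens, n_bits) sign bits

def match_tokens(new_emb, cached_emb, planes):
    new_sig = lsh_signatures(new_emb, planes)
    cached_sig = lsh_signatures(cached_emb, planes)
    index = {sig.tobytes(): i for i, sig in enumerate(cached_sig)}
    return {i: index[sig.tobytes()] for i, sig in enumerate(new_sig) if sig.tobytes() in index}

rng = np.random.default_rng(2)
planes = rng.standard_normal((16, 64))                       # 16-bit hash over 64-dim embeddings
cached = rng.standard_normal((32, 64))
new = cached + 0.01 * rng.standard_normal((32, 64))          # semantically near-duplicate prompt
print(match_tokens(new, cached, planes))                     # new-token -> cached-token reuse map
```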

6. Comparative Efficacy and Scaling Considerations

Recent comparative studies report dramatic improvements in both efficiency and capacity. Infinite-LLM (Lin et al., 5 Jan 2024) supports nearly 1.9 million token contexts with a 2×–19× increase in maximum context over prior designs; Mooncake (Qin et al., 24 Jun 2024) achieves up to 525% throughput gain. Compression-centric methods such as PQCache (Zhang et al., 1 Jul 2024), DynamicKV (Zhou et al., 19 Dec 2024), and CAKE (Qin et al., 16 Mar 2025) sustain full or even improved model performance with cache retention ratios as low as 1–3.2%. LeanKV (Zhang et al., 4 Dec 2024) and KVCrush (Jha et al., 24 Feb 2025) are shown to reduce memory use by 4×–11× and accelerate generation with minimal or sub-1% accuracy compromise. Importantly, the design and evaluation of these techniques account for the tight coupling between cache size, latency, throughput, attention kernel performance, and the statistical distribution of attention across layers, heads, and query types.

| System/Paper | Compression Ratio / Memory Reduction | Throughput/Latency Impact | Quality Impact |
| --- | --- | --- | --- |
| Infinite-LLM (Lin et al., 5 Jan 2024) | Up to 19× longer contexts | 1.03–2.4× throughput gain | No reported accuracy drop |
| Mooncake (Qin et al., 24 Jun 2024) | N/A | 525% throughput gain (simulated); 75% (real) | Meets SLOs |
| PQCache (Zhang et al., 1 Jul 2024) | 5–10× token reduction | Minimal added latency; scalable | Maintains/improves scores |
| LeanKV (Zhang et al., 4 Dec 2024) | 3–5× (up to 11× with ∼5% loss) | 1.9–6.9× throughput gain | ∼Lossless (<3% in sub-optimal settings) |
| CAKE (Qin et al., 16 Mar 2025) | ≈3.2% cache retention | >10× decoding speedup (128K context) | Full or improved performance |
| CoKV (Sun et al., 21 Feb 2025) | Memory usage ↓ up to 64% | <50% decoding latency (relative to full cache) | Maintains/exceeds baseline |
| SemShareKV (Zhao et al., 29 Sep 2025) | 42% GPU memory saving (5k tokens) | 6.25× TTFT speedup | Negligible quality loss |

7. Task and Modality Adaptivity

Many works now emphasize adaptive, context-aware, and even modality-aware KVCache management. VL-Cache (Tu et al., 29 Oct 2024) demonstrates that directly importing LLM KV compression methods is suboptimal for vision-LLMs, due to their divergent attention and sparsity patterns. VL-Cache dynamically computes layer-wise sparsity and adopts a modality-aware token scoring policy, allocating cache budget per layer explicitly proportional to information density and focusing on post-visual context. DynamicKV (Zhou et al., 19 Dec 2024) introduces progressive per-layer token selection policies, with periodic, task-aware budget normalization and buffer updates, retaining only ∼1.7% of the cache with 85% of full-quality performance, outperforming fixed-pattern methods under extreme compression.
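
A rough sketch of sparsity- and modality-aware budgeting in this spirit appears below; the sparsity values, text-token bonus, and scoring rule are illustrative assumptions rather than the published VL-Cache or DynamicKV algorithms.

```python
import numpy as np

# Sparsity/modality-aware budgeting sketch: each layer's retained-token budget is
# proportional to its attention density (1 - sparsity), and within a layer tokens are
# ranked by accumulated attention with a bonus for post-visual (text) positions.
def vl_layer_budgets(layer_sparsity, total_budget):
    density = 1.0 - np.asarray(layer_sparsity)
    return np.round(density / density.sum() * total_budget).astype(int)

def select_tokens(attn_scores, is_text_token, budget, text_bonus=0.2):
    scores = attn_scores + text_bonus * is_text_token.astype(float)
    return np.argsort(-scores)[:budget]

sparsity = [0.55, 0.80, 0.95, 0.98]                      # deeper layers attend more sparsely
budgets = vl_layer_budgets(sparsity, total_budget=2048)
print(budgets)                                           # most budget goes to the densest layers

rng = np.random.default_rng(3)
attn = rng.random(1024)                                  # per-token accumulated attention (one layer)
is_text = np.arange(1024) >= 576                         # tokens after a 576-patch image
print(select_tokens(attn, is_text, budgets[1])[:10])     # retained token ids for layer 1
```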

Conclusion

KVCache management has progressed from rigid, uniform allocation and single-node caching to distributed, dynamic, and modality/task-adaptive methods. Techniques span system-level architectural innovations (disaggregated attention, pooled cluster memory, tiered storage), fine-grained memory compression (mixed-precision, quantization, low-rank, PQ, pruning, binary feature clustering), workload-aware and game-theoretic allocation (per-head/layer, Shapley, DDID), and efficient transfer and scheduling policies (contiguous RDMA transfer, remote attention, cooperative expert shard placement). Recent empirical studies quantify the benefits of these strategies in terms of memory reduction, throughput gains, minimum quality loss, and operational scalability in production environments. As model sizes, context lengths, and deployment scales continue to increase, KVCache management remains a central and rapidly evolving foundational challenge in efficient LLM serving.
