Key-Value Cache Management

Updated 21 October 2025
  • Key-value cache management is the systematic regulation of key and value tensors using admission and eviction policies to enhance throughput and memory efficiency in transformer models.
  • Compression techniques such as mixed-precision and product quantization enable significant memory reduction while preserving accuracy in large-scale AI applications.
  • Adaptive hierarchical budget allocation and distributed metadata systems ensure scalable, real-time performance improvements in both database and LLM inference environments.

Key-value cache (KVC) management refers to the suite of algorithms, architectural strategies, and system-level optimizations that govern the retention, eviction, compression, and retrieval of intermediate state—key and value tensors—across a spectrum of hardware and software environments. KVC fundamentally shapes the throughput, memory efficiency, and context capability of transformer-based models, both in web-scale storage and in generative inference for LLMs. The research landscape encompasses methods for size-heterogeneous object stores, high-throughput proxy caches, distributed metadata systems, and the memory-limited regimes of GPU-based LLM inference, reflecting the centrality of KVC management to efficient, scalable AI and storage infrastructure.

1. KVC Management Principles: Policy Formulation and Theoretical Underpinnings

The essence of KVC management lies in the precise regulation of which keys and values are retained under fixed or variable resource budgets. Admission and eviction policies often draw from classic cache theory, but modern settings require generalizations that embrace variable object sizes, heterogeneous access frequencies, dynamic importance, and structural dependencies.

Size-aware admission: In traditional caches, admission decisions rely on frequency or recency metrics, as in TinyLFU. Extensions for variable-sized objects (e.g., "Lightweight Robust Size Aware Cache Management" (Einziger et al., 2021)) introduce comparison schemes that aggregate frequencies across variable-sized victims, ensuring that byte hit ratio, not just object hit ratio, is optimized. Admission decisions may be formalized as:

$$f_{\text{candidate}} \geq \sum_{i} f_{\text{victim}_i} \implies \text{admit candidate}$$

where $f$ denotes frequency estimates derived via count-min sketches or similar structures.
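
This comparison can be sketched with a small count-min sketch and an admission check; the class and function names below are illustrative rather than the paper's API.

```python
import hashlib

class CountMinSketch:
    """Approximate frequency counter: depth rows of width counters, min over rows."""
    def __init__(self, width=2048, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _indices(self, key):
        for row in range(self.depth):
            digest = hashlib.blake2b(f"{row}:{key}".encode(), digest_size=8).digest()
            yield row, int.from_bytes(digest, "little") % self.width

    def add(self, key):
        for row, col in self._indices(key):
            self.table[row][col] += 1

    def estimate(self, key):
        return min(self.table[row][col] for row, col in self._indices(key))

def should_admit(sketch, candidate_key, victim_keys):
    """Admit the candidate only if its estimated frequency is at least the
    aggregate frequency of every victim it would displace (size-aware: a large
    candidate may need to evict several smaller objects)."""
    return sketch.estimate(candidate_key) >= sum(sketch.estimate(v) for v in victim_keys)
```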

Importance-aware retention: Modern policies for LLMs compute token "importance" based on future attention received, recency, or more sophisticated temporal and statistical signals (Ren et al., 9 Feb 2024). The mean attention score (MAS) emerges as a robust metric:

$$\text{MAS}(x) = \frac{\sum_{\text{future } t} \text{AttentionScore}(x, t)}{\#\,\text{future references}}$$

This mitigates recency bias and ties retention more directly to future computational demand.
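
As an illustration, MAS can be computed from a matrix of recorded attention weights (e.g., collected in an analysis pass); the array layout and function names here are assumptions, not the authors' implementation.

```python
import numpy as np

def mean_attention_score(attn, token_idx, current_step):
    """Mean attention a cached token receives from all future query positions.

    attn: (num_steps, num_tokens) matrix where attn[t, x] is the attention
    that the query at step t pays to cached token x.
    """
    future = attn[current_step + 1:, token_idx]
    return float(future.mean()) if future.size else 0.0

def retention_scores(attn, current_step):
    """Score every cached token by its MAS; higher-scoring tokens are retained."""
    return np.array([mean_attention_score(attn, x, current_step)
                     for x in range(attn.shape[1])])
```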

Eviction scope and robustness: The definition of "eviction candidate" sets is now contextual. For example, tokens with high standard deviation in the attention they receive (reflecting volatile or unpredictable impact) are excluded from eviction, since their removal risks destabilizing the computation (Ren et al., 9 Feb 2024).

Graph- and Game-Theoretic Models: Emerging work explores graph-based importance propagation (GraphKV (Li et al., 30 Aug 2025)) where token redundancy is dynamically suppressed by constructing similarity graphs among tokens, and attention head budget allocation via cooperative game theory (CoKV (Sun et al., 21 Feb 2025)) using approximated Shapley values to capture joint contributions.

2. Compression, Quantization, and Merging for Memory Efficiency

Given that KVC size grows as O(sequence length × model layers × attention heads × head dimension), direct retention of all KVs quickly becomes intractable. To counter this, the field has developed multifaceted compression schemes:

Heterogeneous Mixed-Precision Quantization: Methods such as LeanKV (Zhang et al., 4 Dec 2024) and MiKV (Yang et al., 28 Feb 2024) employ non-uniform quantization, storing keys—which influence all attention scores—at higher precision than values, which only scale outputs:

  • Keys: typically 8 bits (K8)
  • Values: typically 4 bits (V4)

Tokens crossing a significance threshold may be dropped (pruned) or further compressed; a minimal sketch of the mixed-precision scheme appears below.
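
The sketch uses simple symmetric per-tensor quantization; real systems typically quantize per group or per channel, pack 4-bit codes, and add importance-based routing.

```python
import numpy as np

def quantize_tensor(x, num_bits):
    """Symmetric per-tensor quantization: return integer codes and a scale factor."""
    qmax = 2 ** (num_bits - 1) - 1
    max_abs = float(np.abs(x).max())
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return codes, scale  # 4-bit codes are stored in int8 here; real systems pack them

def dequantize_tensor(codes, scale):
    return codes.astype(np.float32) * scale

def compress_kv(keys, values, key_bits=8, value_bits=4):
    """Keys kept at higher precision (K8) than values (V4)."""
    return quantize_tensor(keys, key_bits), quantize_tensor(values, value_bits)
```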

Residual and Predictor-Based Quantization: AQUA-KV (Shutova et al., 31 Jan 2025) exploits strong layer-to-layer dependencies in the transformer. By fitting predictors $f_k$ and $f_v$ to reconstruct keys and values in layer $l$ from layer $l-1$, only the unpredictable residuals are quantized. This allows compression to 2–2.5 bits per value with less than 1% accuracy loss.
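
A rough sketch of the predictor-plus-residual idea, using a least-squares linear predictor and naive scalar quantization in place of AQUA-KV's actual predictors and quantizers:

```python
import numpy as np

def fit_linear_predictor(prev_layer, curr_layer):
    """Least-squares map W such that prev_layer @ W approximates curr_layer."""
    W, *_ = np.linalg.lstsq(prev_layer, curr_layer, rcond=None)
    return W

def quantize_residual(x, num_bits=2):
    qmax = 2 ** (num_bits - 1) - 1
    max_abs = float(np.abs(x).max())
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return np.clip(np.round(x / scale), -qmax - 1, qmax), scale

def encode_layer(prev_keys, curr_keys, num_bits=2):
    """Store only the predictor and the quantized residual for the current layer."""
    W = fit_linear_predictor(prev_keys, curr_keys)
    codes, scale = quantize_residual(curr_keys - prev_keys @ W, num_bits)
    return W, codes, scale

def decode_layer(prev_keys, W, codes, scale):
    """Reconstruct the current layer's keys from the previous layer plus residual."""
    return prev_keys @ W + codes * scale
```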

Product Quantization (PQ): PQCache (Zhang et al., 1 Jul 2024) applies product quantization to key vectors—subdividing each key into $m$ sub-vectors, encoding each via learned codebooks, and facilitating approximate maximum inner product search (MIPS) for subsequent retrieval. This is particularly efficient for large, fixed context windows as in LLM prefill phases.
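
The encoding step can be illustrated with a toy k-means product quantizer; codebook sizes and function names here are arbitrary, and a production system would use an optimized PQ library.

```python
import numpy as np

def train_pq_codebooks(keys, m=4, codebook_size=256, iters=10, seed=0):
    """Toy k-means codebook per sub-vector group (a stand-in for an optimized PQ library)."""
    rng = np.random.default_rng(seed)
    n, d = keys.shape
    assert d % m == 0, "key dimension must split evenly into m sub-vectors"
    sub = keys.reshape(n, m, d // m)
    codebooks = []
    for g in range(m):
        x = sub[:, g, :]
        cents = x[rng.choice(n, size=min(codebook_size, n), replace=False)].copy()
        for _ in range(iters):
            assign = np.argmin(((x[:, None, :] - cents[None]) ** 2).sum(-1), axis=1)
            for c in range(len(cents)):
                if np.any(assign == c):
                    cents[c] = x[assign == c].mean(axis=0)
        codebooks.append(cents)
    return codebooks

def pq_encode(keys, codebooks):
    """Replace each d-dimensional key with m small codebook indices."""
    n, d = keys.shape
    m = len(codebooks)
    sub = keys.reshape(n, m, d // m)
    return np.stack(
        [np.argmin(((sub[:, g, None, :] - codebooks[g][None]) ** 2).sum(-1), axis=1)
         for g in range(m)], axis=1)  # shape (n, m)
```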

Merging with Consistency Guarantees: KeepKV (Tian et al., 14 Apr 2025) replaces convex merging with the ZIP-Merging algorithm, which, via the Electoral Votes mechanism, exactly preserves the weighted contribution of merged KVs in softmax computation. This eliminates output perturbation:

$$o_t = \frac{\sum_i p_i s_i^t v_i}{\sum_i p_i s_i^t}$$

where $p_i$ accumulates the vote counts of merged entries.
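
A simplified sketch of vote-weighted attention and merging follows; it keeps the accounting of the formula above but is not the exact ZIP-Merging update, which additionally corrects the merged key so that exponentiated scores are preserved exactly.

```python
import numpy as np

def attention_output(query, keys, values, votes):
    """Attention output where each cached entry carries a vote count p_i that
    scales its exponentiated score: o = sum_i p_i s_i v_i / sum_i p_i s_i."""
    scores = np.exp(keys @ query / np.sqrt(keys.shape[1]))   # s_i
    weights = votes * scores                                  # p_i * s_i
    return (weights[:, None] * values).sum(axis=0) / weights.sum()

def merge_entries(keys, values, votes, i, j):
    """Merge entry j into entry i: vote counts add, key/value become vote-weighted
    means (a simplification of the ZIP-Merging update)."""
    p = votes[i] + votes[j]
    k = (votes[i] * keys[i] + votes[j] * keys[j]) / p
    v = (votes[i] * values[i] + votes[j] * values[j]) / p
    keep = [t for t in range(len(votes)) if t != j]
    keys, values, votes = keys[keep], values[keep], votes[keep]
    idx = keep.index(i)
    keys[idx], values[idx], votes[idx] = k, v, p
    return keys, values, votes
```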

Token Grouping and Alternate Representations: KVCrush (Jha et al., 24 Feb 2025) encodes token attention patterns as binary feature vectors (across heads), clustering tokens using efficient Hamming distance. A small set of representative tokens then proxies for groups of dropped tokens, minimizing accuracy loss under tight budgets.
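
A sketch of the binary-signature idea, where a median threshold and greedy farthest-point selection stand in for KVCrush's exact thresholding and grouping rules:

```python
import numpy as np

def binary_signatures(attn_per_head):
    """Per-token binary feature vector across heads: bit h is set if the token's
    attention mass on head h exceeds that head's median (a simple thresholding
    choice; KVCrush's exact rule may differ).

    attn_per_head: (num_heads, num_tokens) attention mass each token received per head.
    """
    thr = np.median(attn_per_head, axis=1, keepdims=True)
    return (attn_per_head > thr).T.astype(np.uint8)  # (num_tokens, num_heads)

def hamming_representatives(signatures, num_reps):
    """Pick a small set of tokens whose bit patterns cover the dropped tokens,
    using greedy farthest-point selection under Hamming distance (standing in
    for the paper's clustering step)."""
    dist = (signatures[:, None, :] != signatures[None, :, :]).sum(-1)
    reps = [0]
    while len(reps) < num_reps:
        nearest = dist[:, reps].min(axis=1)   # distance to closest chosen representative
        reps.append(int(nearest.argmax()))    # take the most poorly covered token
    return reps
```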

3. Hierarchical and Adaptive Budget Allocation

Modern LLMs require differentiated KVC allocation across layers and attention heads:

Layer-wise Cake-Slicing: CAKE (Qin et al., 16 Mar 2025) assigns cache budgets to layers proportionally to empirical measurements of spatial (entropy of attention distribution) and temporal (variance over time) attention dynamics. The allocation for layer $l$ is:

$$B_l = \frac{\mathcal{P}_l}{\sum_{k=0}^{L-1} \mathcal{P}_k} \cdot B_{\text{total}}$$

with

$$\mathcal{P}_l = H_l^{1/\tau_1} \cdot V_l^{1/\tau_2}$$

where $H_l$ and $V_l$ measure attention entropy and variance, respectively.
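
In code, the layer-wise split reduces to a few lines; parameter names and example numbers are illustrative:

```python
import numpy as np

def layer_budgets(entropies, variances, total_budget, tau1=1.0, tau2=1.0):
    """Split the total KV-cache budget across layers in proportion to
    P_l = H_l^(1/tau1) * V_l^(1/tau2)."""
    P = np.asarray(entropies, float) ** (1.0 / tau1) * np.asarray(variances, float) ** (1.0 / tau2)
    return np.floor(total_budget * P / P.sum()).astype(int)

# Example: three layers with measured attention entropy and temporal variance.
print(layer_budgets([2.1, 3.4, 1.2], [0.30, 0.10, 0.05], total_budget=4096))
```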

Head-wise Cooperative Allocation: CoKV (Sun et al., 21 Feb 2025) uses Sliced Shapley Values (SSV) to score the cooperative importance of each head and modulates budgets such that high-SSV heads receive a greater share:

$$c_i = B \cdot \left(\frac{NSV_i^{\mathcal{H}}}{\sum_j NSV_j^{\mathcal{H}}}\right) + s$$

where $s$ is the fixed local window and $B$ is the shared cache pool.
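
A corresponding sketch for head-wise shares, assuming non-negative head importance scores:

```python
import numpy as np

def head_budgets(head_scores, shared_pool, local_window):
    """Per-head budget c_i = B * (score_i / sum_j score_j) + s, where the scores
    stand in for normalized Sliced Shapley Values."""
    scores = np.asarray(head_scores, float)
    return np.floor(shared_pool * scores / scores.sum()).astype(int) + local_window

print(head_budgets([0.9, 0.4, 0.4, 0.1], shared_pool=2048, local_window=32))
```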

Phase-Partitioned Optimization: SCOPE (Wu et al., 18 Dec 2024) introduces a two-pool model (prefill $\Phi^p$, decoding $\Phi^d$). Prefill is minimally compressed to retain reasoning capabilities, while decoding is dynamically managed with top-K, sliding-window, and adaptive window strategies.
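
One way to picture the decoding-side pool is a sliding window plus a small set of salient tokens; the sketch below is an illustrative approximation, not SCOPE's exact adaptive policy.

```python
from collections import deque

class DecodePool:
    """Decoding-phase KV pool kept separate from the (lightly compressed) prefill
    pool: a fixed-size sliding window plus a small top-K set of high-attention tokens."""
    def __init__(self, window=256, top_k=64):
        self.window = deque(maxlen=window)          # most recent decode tokens
        self.top_k, self.salient = top_k, {}        # token_id -> best attention seen

    def add(self, token_id, attention_score):
        self.window.append(token_id)
        self.salient[token_id] = max(self.salient.get(token_id, 0.0), attention_score)
        if len(self.salient) > self.top_k:          # evict the least salient token
            weakest = min(self.salient, key=self.salient.get)
            del self.salient[weakest]

    def retained(self):
        """Token ids whose KV entries are kept in the decoding pool."""
        return set(self.window) | set(self.salient)
```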

4. System-Level Implementations and Distributed Metadata Management

Real-world deployments in databases and distributed LLM serving require efficient system-level KVC management.

Expert-Sharded KV Storage for MoE Models: PiKV (Liu et al., 2 Aug 2025) distributes KVs across expert shards, aligning with expert activations to avoid full cache replication on all devices, thus reducing both GPU memory and cross-device communication. Adaptive routing minimizes unnecessary token-to-expert accesses, and modular compression reduces local storage.

Efficient Retrieval and Clustering: LouisKV (Wu et al., 13 Oct 2025) arranges semantic-aware cache retrieval at variable granularity, using cosine similarity of query vectors to define semantic boundaries and k-means clustering for input grouping. A decoupled approach manages prefill and decode phases separately to fit observed attention locality.
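
The boundary-detection step can be sketched as a cosine-similarity test between consecutive query vectors; the threshold value and function name are assumptions:

```python
import numpy as np

def semantic_boundaries(query_vectors, threshold=0.85):
    """Mark a segment boundary whenever consecutive query vectors drift apart in
    direction, i.e., their cosine similarity falls below the threshold."""
    q = np.asarray(query_vectors, float)
    q = q / (np.linalg.norm(q, axis=1, keepdims=True) + 1e-12)
    sims = (q[1:] * q[:-1]).sum(axis=1)
    return [i + 1 for i, s in enumerate(sims) if s < threshold]
```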

Cache-Optimized Metadata: MetaHive (Heidari et al., 26 Jul 2024) in storage systems like RocksDB dissociates metadata from data but ensures physical proximity by key-suffix rules, enabling both rapid validation and extensible, compatible metadata in heterogeneous environments.

Scalable Distributed KVC for Prefill: Analysis in (Zhu et al., 28 May 2025) shows that existing database backends (e.g., Redis, CHIME, Sherman) fail to match the mixed range query and random get() access patterns of LLM KVC prefill workloads. Range queries dominate due to high block hit rates and sequence locality. Future metadata management must balance low-latency sequential access, efficient random retrieval, and hierarchical hotness-aware placement.

5. Benchmarks, Evaluation Metrics, and Empirical Findings

KVC management approaches are rigorously evaluated using a mix of synthetic and real-world traces, with the following primary metrics:

| Metric | Description | Significance |
|---|---|---|
| Hit Ratio | Fraction of cache accesses resulting in a cache hit | Latency and throughput proxy |
| Byte Hit Ratio | Fraction of bytes served from cache vs. total bytes requested | Bandwidth/cost efficiency proxy |
| Perplexity/Accuracy | Standard downstream metrics on benchmarks (LongBench, NeedleBench, InfiniteBench, etc.) | Fidelity of compression/eviction methods |
| Latency/Throughput | System-level time to process tokens (e.g., decoding speedups, TTFT) | Real-time and batch performance |
| Memory Reduction | Reduction in raw KVC footprint (e.g., 4×, 11×, 83%) | Feasibility of scaling |
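
The distinction between the first two metrics is easy to see in a short computation: a cache can score well on object hit ratio while serving only a small fraction of the requested bytes.

```python
def cache_metrics(trace):
    """Object hit ratio vs. byte hit ratio over a trace of (hit, size_bytes) pairs."""
    hits = sum(1 for hit, _ in trace if hit)
    hit_bytes = sum(size for hit, size in trace if hit)
    total_bytes = sum(size for _, size in trace)
    return hits / len(trace), hit_bytes / total_bytes

# Three tiny hits and one large miss: 75% object hit ratio, but under 1% byte hit ratio.
print(cache_metrics([(True, 1), (True, 1), (True, 1), (False, 1000)]))
```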

Key empirical findings include:

6. Practical Applications and Real-World Impacts

KVC management techniques are deployed across a variety of production and research environments:

  • Databases, Object Stores, and CDNs: Size-aware policies and metadata-optimized cache layouts prevent bottlenecks from large, infrequently accessed blobs and enable efficient data integrity checking with minimal overhead (Einziger et al., 2021, Heidari et al., 26 Jul 2024).
  • LLM Inference: Token-level, model-level, and system-level KVC optimizations are central to scaling context lengths to 128K+ tokens in models ranging from Llama-2 to Mixtral, and to enabling multi-turn conversational systems without catastrophic forgetting (Li et al., 27 Dec 2024, Liu et al., 21 May 2025).
  • Multi-GPU/Node and MoE Architectures: Expert-sharded, routed, and compressed KVC design as in PiKV allows inference at unprecedented context and model scales (Liu et al., 2 Aug 2025).
  • Benchmarking: Datasets such as LongBench, InfiniteBench, NeedleBench, and others drive evaluation, with detailed taxonomy work situating new methods in the broader ecosystem (Li et al., 27 Dec 2024).

7. Future Directions and Open Challenges

Several trends and research challenges continue to shape the KVC management landscape:

  • Dynamic, Relational Selection: Propagation-based, graph neural network–inspired retention continues to evolve (see GraphKV (Li et al., 30 Aug 2025)).
  • Hierarchical, Hotness-Aware Caching: System-level KVC managers will increasingly use hotness-driven tiered storage, supporting efficient, workload-aware index structures (Zhu et al., 28 May 2025).
  • Zero-Perturbation Compression: ZIP-Merging and voting-based consistency signals (as in KeepKV (Tian et al., 14 Apr 2025)) are likely to become standard to eliminate hallucinations and attention inconsistency under heavy compression.
  • Efficient MoE and Grouped Architectures: Ongoing work in expert-sharded pooling, adaptive routing, and compression-aware scheduling (PiKV (Liu et al., 2 Aug 2025)) is foundational for multi-expert LLM deployments.
  • Cross-Device and Real-Time Adaptation: The co-design of compression, retrieval, and memory management at the hardware- and system-software boundary will see increasing focus as model sizes and batch sizes outpace current memory and bandwidth limits.
  • Benchmarks, Standardization, and Reproducibility: As the space matures, continued work in standardized evaluation (as in (Li et al., 27 Dec 2024)) and open-source implementations (Many authors, e.g., EasyKV, CAKE, PiKV) will play a central role in guiding adoption and future research.

KVC management, at the intersection of algorithm design, systems engineering, and AI modeling, remains foundational for scaling both data infrastructure and large-scale LLMs.
