KV Cache Optimization in Transformers
- KV cache optimization is a family of techniques that mitigates the otherwise linear growth of key-value memory during transformer inference by selectively retaining, compressing, and merging key-value pairs.
- It employs methods like selective token caching, head-wise quantization, and dynamic budget allocation, achieving up to 20× compression with minimal accuracy degradation.
- These approaches enable sustained high-throughput inference and extended context lengths in large language models while managing on-device resource constraints.
A key-value (KV) cache in transformers stores the intermediate keys and values from previous decoding steps to accelerate autoregressive inference by eliminating redundant computation. While essential for throughput at long context lengths, KV cache memory grows linearly with sequence length and model width, quickly exhausting on-device resources and limiting practical deployment of LLMs. Consequently, KV cache optimization—reducing the storage and computational burden of KV caches without significant accuracy degradation—has emerged as a major focus for both foundational research and production systems.
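As a concrete illustration of the mechanism, the following minimal sketch (NumPy, single head, batch size one; all names and shapes are assumptions for illustration, not any library's API) appends each step's key/value to a growing cache and attends the new query over it:

```python
import numpy as np

def attend_with_kv_cache(q_t, k_t, v_t, k_cache, v_cache):
    """One decoding step: append the new key/value to the cache and
    attend the current query over all cached positions."""
    k_cache = np.concatenate([k_cache, k_t[None, :]], axis=0)  # (t, d)
    v_cache = np.concatenate([v_cache, v_t[None, :]], axis=0)  # (t, d)
    scores = k_cache @ q_t / np.sqrt(q_t.shape[-1])            # (t,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                                   # softmax over cached positions
    out = weights @ v_cache                                    # (d,)
    return out, k_cache, v_cache

# Toy usage: decode three steps with head dimension 64, starting from empty caches.
d = 64
k_cache, v_cache = np.empty((0, d)), np.empty((0, d))
for _ in range(3):
    q, k, v = (np.random.randn(d) for _ in range(3))
    out, k_cache, v_cache = attend_with_kv_cache(q, k, v, k_cache, v_cache)
```

The cache grows by one row per generated token, which is exactly the linear memory growth that the optimizations below target.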
1. Characterization of KV Cache Bottlenecks
In modern transformer-based LLMs, each decoder layer stores a set of keys $K \in \mathbb{R}^{t \times d}$ and values $V \in \mathbb{R}^{t \times d}$, where $t$ is the accumulated sequence length and $d$ the head or hidden dimension. The aggregate memory footprint for an $L$-layer, $d$-dimensional model after decoding $t$ tokens is $2 \cdot L \cdot t \cdot d \cdot b$ bytes, where $b$ is the bytes per element (e.g., 2 for FP16). For example, a 32-layer model with $d = 4096$ decoding a 64K-token context at FP16 requires approximately 32 GB for the KV cache alone (Liu et al., 8 Aug 2025). This linear scaling quickly exceeds common GPU memory limits, especially for large-batch inference or extended contexts.
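The footprint formula can be checked with a short helper; the specific configuration in the usage line is an illustrative assumption (batch size one):

```python
def kv_cache_bytes(n_layers: int, seq_len: int, d_model: int, bytes_per_elem: int = 2) -> int:
    """Total KV cache size in bytes: keys and values (factor 2) per layer,
    per cached token, per hidden dimension, at the given element width."""
    return 2 * n_layers * seq_len * d_model * bytes_per_elem

# Illustrative: 32 layers, d_model = 4096, 64K-token context, FP16.
size = kv_cache_bytes(n_layers=32, seq_len=64 * 1024, d_model=4096)
print(f"{size / 2**30:.1f} GiB")  # -> 32.0 GiB
```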
Empirical traces from production-scale LLM services demonstrate that high cache reuse rates (54–62%) can be harnessed for substantial throughput gains, but also reveal that ideal cache sizing is workload-dependent and that cache thrashing (repeated KV eviction and regeneration) can bottleneck latency (Wang et al., 3 Jun 2025). The complexity of the attention operation, where each new token must attend over all prior KV pairs, makes optimizing both memory and inference compute essential.
2. Selective Token and Head-wise KV Cache Compression
The dominant approach for KV cache optimization targets selective retention (or merging) of KV entries judged as critical for model fidelity, based on attention or importance metrics.
Selective Token Caching
Methods such as StreamingLLM, SnapKV, H₂O, and RazorAttention retain a subset of tokens according to cumulative attention scores, recency, or submodular optimization. For instance, H₂O solves a submodular maximization problem to retain heavy-hitter tokens (Liu et al., 8 Aug 2025). Token eviction can be combined with merging, as in LOOK-M, which compensates for evictions by assigning dropped KV pairs to retained ones via cosine similarity and then applying averaged, pivotal, or similarity-weighted merging (Wan et al., 26 Jun 2024).
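A heavily simplified sketch of attention-score-based retention in the spirit of these heavy-hitter methods follows; the greedy cumulative-score criterion, the recency window, and all names are assumptions rather than any single paper's algorithm:

```python
import numpy as np

def select_tokens_to_keep(attn_history, budget, n_recent=8):
    """Keep the most recent `n_recent` tokens plus the highest
    cumulative-attention ("heavy hitter") tokens, up to `budget` total.
    attn_history: (num_steps, seq_len) attention weights over past tokens."""
    seq_len = attn_history.shape[1]
    cumulative = attn_history.sum(axis=0)                      # importance per cached token
    recent = set(range(max(0, seq_len - n_recent), seq_len))   # always-kept recency window
    remaining = budget - len(recent)
    ranked = [i for i in np.argsort(-cumulative) if i not in recent]
    keep = sorted(recent | set(ranked[:max(remaining, 0)]))
    return keep  # indices whose K/V entries are retained; the rest are evicted or merged

# Toy usage: 20 decoding steps of attention over 64 cached positions, budget of 16.
history = np.abs(np.random.randn(20, 64))
kept = select_tokens_to_keep(history, budget=16)
```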
Attention Head Heterogeneity and Budget Allocation
Not all attention heads exhibit equal utility for long-range retrieval. RazorAttention identifies "retrieval heads" that attend over the full context (detected via probing with echo/induction scores), retains a full cache only for these, and restricts non-retrieval heads to a fixed-length buffer (recent tokens plus attention sinks). A lightweight "compensation token," built as the mean of the dropped KV pairs, is appended to preserve information needed by future queries (Tang et al., 22 Jul 2024). Task-KV introduces further semantic differentiation at the attention-head level, using the distance of each head's vector to a semantic center: heterogeneous heads receive a full cache, while the remainder receive a small truncated buffer with "middle activations." The resulting allocation delivers 40–60% memory reduction without measurable accuracy drop (He et al., 25 Jan 2025).
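The head-wise policy can be sketched roughly as follows; the retrieval-head flag, window and sink sizes, and function names are placeholder assumptions, with the compensation token formed as the mean of dropped entries as described above:

```python
import numpy as np

def compress_head_cache(k, v, is_retrieval_head, window=128, n_sink=4):
    """Per-head compression: retrieval heads keep the full cache; other
    heads keep attention sinks + a recent window, plus one compensation
    token formed as the mean of the dropped K/V pairs."""
    if is_retrieval_head or k.shape[0] <= window + n_sink:
        return k, v                                   # full cache, (t, d)
    sinks, recent = slice(0, n_sink), slice(-window, None)
    dropped_k, dropped_v = k[n_sink:-window], v[n_sink:-window]
    comp_k = dropped_k.mean(axis=0, keepdims=True)    # compensation token (1, d)
    comp_v = dropped_v.mean(axis=0, keepdims=True)
    k_small = np.concatenate([k[sinks], comp_k, k[recent]], axis=0)
    v_small = np.concatenate([v[sinks], comp_v, v[recent]], axis=0)
    return k_small, v_small

# Toy usage on a cache of 1,024 tokens with head dimension 64.
k, v = np.random.randn(1024, 64), np.random.randn(1024, 64)
k_c, v_c = compress_head_cache(k, v, is_retrieval_head=False)  # -> shape (133, 64)
```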
Per-layer and Per-head Profiling
Profiling-guided budget allocation schemes such as BaKlaVa and EvolKV explicitly assign per-head and per-layer cache budgets based on measured or optimized utility. BaKlaVa estimates each head/group’s importance by cosine similarity between input and output vectors under softmaxed attention, then solves a resource allocation problem to maximize head-importance-weighted utility under a global budget (Gulhan et al., 18 Feb 2025). EvolKV casts the per-layer budget assignment as a multi-objective optimization, leveraging evolutionary search (CMA-ES) to maximize downstream task performance under memory constraints, often outperforming heuristic layer assignments (Yu et al., 10 Sep 2025).
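A minimal sketch of profiling-guided budget allocation is shown below; the proportional rule and the per-head floor are illustrative assumptions, not BaKlaVa's resource-allocation solver or EvolKV's evolutionary search:

```python
import numpy as np

def allocate_head_budgets(importance, total_budget, min_per_head=16):
    """Split a global KV budget (in tokens) across heads in proportion to
    their measured importance, with a small guaranteed floor per head."""
    importance = np.asarray(importance, dtype=float)
    n_heads = importance.size
    floor = min_per_head * n_heads                       # tokens reserved by the floor
    spare = max(total_budget - floor, 0)
    shares = importance / importance.sum()
    budgets = min_per_head + np.floor(shares * spare).astype(int)
    return budgets  # per-head token budgets summing to at most total_budget

# Toy usage: 8 heads with profiled importance scores, 1,024 tokens of global budget.
scores = [0.9, 0.1, 0.4, 0.2, 0.8, 0.05, 0.3, 0.6]
print(allocate_head_budgets(scores, total_budget=1024))
```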
3. Quantization and Mixed-Precision Compression
A second axis of KV optimization exploits the relative resilience of transformer attention to numerical precision reduction in stored KV states:
- Uniform Quantization: Methods such as KVQuant, KIVI, AsymKV, and MiniKV use per-channel or per-group minimum/maximum scaling, quantizing keys and values to 2–8-bit integers and storing the necessary zero-points and scales. MiniKV demonstrates that per-layer, per-group (e.g., groups of 16 elements) 2-bit quantization with custom scale/offset achieves 86% memory reduction and over 98.5% accuracy recovery, with throughput exceeding that of uncompressed FP16 (Sharma et al., 27 Nov 2024, Liu et al., 23 May 2024).
- Mixed-Precision and Heterogeneous Quantization: LeanKV empirically documents that attention scores vary over several orders of magnitude, while value norms are less dynamic, motivating use of higher precision for keys (e.g., 8-bit) and lower for values (e.g., 4-bit) (Zhang et al., 4 Dec 2024). The compression algorithm dynamically assigns quantization levels on a per-token, per-head basis with further per-level pruning where the cumulative significance falls below a low-precision cutoff.
- Quantization with Error Correction: Techniques such as GEAR supplement aggressive quantization with a low-rank residual patch and outlier tokens stored at full or higher precision, enabling 3–4× memory compression at near-lossless accuracy (Gao et al., 31 Mar 2025).
Quantization can be stacked after structural or sparsity-based compression for further gains. Best practices dictate quantizing after pruning/merging operations and retaining higher precision for critical tokens.
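A minimal sketch of per-group asymmetric min/max quantization of a KV tensor, following the general recipe above rather than any specific method; the group size, bit width, and names are assumptions:

```python
import numpy as np

def quantize_groups(x, group_size=16, bits=4):
    """Asymmetric min/max quantization: reshape into groups, store one
    scale and zero-point per group, and round to `bits`-bit integers.
    Assumes x.size is divisible by group_size."""
    levels = 2**bits - 1
    g = x.reshape(-1, group_size)
    lo, hi = g.min(axis=1, keepdims=True), g.max(axis=1, keepdims=True)
    scale = np.maximum(hi - lo, 1e-8) / levels
    q = np.clip(np.round((g - lo) / scale), 0, levels).astype(np.uint8)
    return q, scale, lo

def dequantize_groups(q, scale, lo, shape):
    """Reconstruct an approximation of the original tensor."""
    return (q * scale + lo).reshape(shape)

# Toy usage: quantize a (seq_len, d) key tensor to 4 bits and measure the error.
k = np.random.randn(256, 64).astype(np.float32)
q, s, z = quantize_groups(k)
k_hat = dequantize_groups(q, s, z, k.shape)
print(np.abs(k - k_hat).max())  # small per-element reconstruction error
```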
4. Advanced and Hybrid Compression Algorithms
Hybrid approaches combine multiple compression levers or offer more adaptive policies:
- Head-behaviour similarity: KVCrush leverages ultra-compact per-token binary signatures indicating which heads would have retained each KV position under per-head thresholds, then clusters and selects representative tokens for evicted groups in linear time, achieving 4× memory reduction at <1% accuracy degradation (Jha et al., 24 Feb 2025).
- Residual Merging and Attention Compensation: ZeroMerge (ZSMerge) introduces a multi-dimensional, momentum-smoothed, head-wise token importance metric for fine-grained buffer allocation, plus a residual-slot mechanism that absorbs evicted tokens via momentum-based averaging, with a compensated attention term correcting for the merged nature of those slots (a simplified sketch follows this list). This allows up to 20× compression, maintains accuracy and throughput, and is architecture-agnostic (Liu et al., 13 Mar 2025).
- Dynamic Budget Adjustment: DBudgetKV proposes a dynamic, attention-norm-based halting rule: tokens are pruned only as long as the norm of recent attention remains within a set threshold of the full-cache value, guaranteeing lossless or near-lossless compression on a per-input, per-layer basis (Ni et al., 24 Feb 2025).
- Multi-modal and Semantic Diversity-aware Compression: MixKV in the multi-modal regime augments standard importance scoring by a diversity measure, with mixing coefficients determined by empirical redundancy within each head; this consistently improves baseline importance-only methods on LVLMs at extreme compression ratios, particularly for modules with high intra-head redundancy (Liu et al., 23 Oct 2025).
- Depth-dimension Compression: MiniCache merges adjacent layers in the middle-to-deep stack where per-token vectors are highly similar, using SLERP interpolation for unit directions and norm preservation, subject to an angular distance-based retention exception, yielding up to 5× memory savings at near-lossless accuracy (Liu et al., 23 May 2024).
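A simplified sketch of the residual-slot merging idea referenced above; the momentum constant, merge-count bookkeeping, and the log-count attention offset mentioned in the comments are illustrative assumptions, not the exact ZeroMerge formulation:

```python
import numpy as np

def evict_into_residual(k_res, v_res, n_merged, k_evicted, v_evicted, momentum=0.9):
    """Fold an evicted token's K/V into a residual slot via momentum averaging.
    The merge count is tracked so that, at attention time, the residual slot's
    logit could be offset (e.g., by log(n_merged)) to stand in for the merged
    tokens -- an illustrative assumption about the compensation term."""
    if n_merged == 0:
        return k_evicted.copy(), v_evicted.copy(), 1
    k_res = momentum * k_res + (1.0 - momentum) * k_evicted
    v_res = momentum * v_res + (1.0 - momentum) * v_evicted
    return k_res, v_res, n_merged + 1

# Toy usage: merge three evicted tokens of dimension 64 into one residual slot.
k_res = v_res = None
n = 0
for _ in range(3):
    k_e, v_e = np.random.randn(64), np.random.randn(64)
    k_res, v_res, n = evict_into_residual(k_res, v_res, n, k_e, v_e)
```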
A selection of these methods is summarized below:
| Method | Memory Reduction | Accuracy Loss | Latency / Throughput Impact |
|---|---|---|---|
| RazorAttention | ~3× | <2 points | Negligible |
| MiniKV | 86% reduction | <1.5% | Up to 66% higher throughput |
| LeanKV | 2.7–5.7× | <1% | <1% latency overhead |
| ZeroMerge | 20× | Minimal | Constant, small |
| KVCrush | 4× | <1% | <0.5% latency overhead |
| BaKlaVa | ~70% | <5% | <2% latency overhead |
| Task-KV | 40–60% | <1% | None |
5. System and Production-scale Optimization
Real-world LLM serving systems must address system-level constraints such as memory tiering (GPU+CPU), cache paging, and multi-tenant sharing.
- Workload-aware Cache Management: Empirical analysis of production traces shows that workload distributions are highly heterogeneous; effective cache eviction leverages exponential models of per-request-type reuse probability, yielding up to 24% higher throughput and 42% lower tail first-token latency than LRU/FIFO/LFU, especially at moderate capacities (Wang et al., 3 Jun 2025). A toy sketch of reuse-probability-aware eviction follows this list.
- Memory Tier Exploitation: MIRAGE introduces parameter remapping, enabling dynamic reuse of inactive parameter tensors' GPU-address space for temporary expansion of the KV cache, thus avoiding expensive bidirectional swapping to/from CPU RAM and achieving up to 86.7% throughput gain and dramatic tail-latency reduction in multi-tenant environments when paired with high bandwidth CPU-GPU links (Li et al., 15 Jul 2025).
- Attention Kernel and Cache Layout: Compression effectiveness in production critically depends on compatibility with attention kernel implementations (FlashAttention, PagedAttention). Some sparsity-based methods require recomputation of attention scores not stored for efficient kernel implementations, limiting practical gains (Gao et al., 31 Mar 2025).
- Paging, Quantization, and Hybridization: Approaches such as KVCrush or LeanKV are fully compatible with paged cache management (as in vLLM) and can be combined with quantization for maximal gains, since the retained token set is decoupled from the physical cache layout and datatype (Jha et al., 24 Feb 2025, Zhang et al., 4 Dec 2024).
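The toy sketch of reuse-probability-aware eviction referenced above; the exponential decay form, per-category rates, and class names are assumptions for illustration, not the production policy:

```python
import math

class ReuseAwareCache:
    """Evict the cached prefix whose estimated reuse probability is lowest,
    modelled as an exponential decay over time since last access with a
    per-request-category rate (decay form and rates are assumptions)."""
    def __init__(self, capacity, decay_per_category):
        self.capacity = capacity
        self.decay = decay_per_category              # category -> decay rate (1/s)
        self.entries = {}                            # key -> (category, last_access)

    def _reuse_prob(self, key, now):
        category, last = self.entries[key]
        return math.exp(-self.decay[category] * (now - last))

    def access(self, key, category, now):
        if key not in self.entries and len(self.entries) >= self.capacity:
            victim = min(self.entries, key=lambda k: self._reuse_prob(k, now))
            del self.entries[victim]                 # least likely to be reused
        self.entries[key] = (category, now)

# Chat prefixes are reused often (slow decay); one-off batch jobs are not.
cache = ReuseAwareCache(2, {"chat": 0.01, "batch": 1.0})
cache.access("prefix-A", "chat", now=0.0)
cache.access("prefix-B", "batch", now=1.0)
cache.access("prefix-C", "chat", now=5.0)   # evicts prefix-B, not the older prefix-A
```

Unlike pure LRU, which would evict the oldest prefix regardless of request type, the reuse-probability model keeps frequently revisited prefixes resident even when they were last touched earlier.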
6. Limitations, Trade-Offs, and Future Directions
Despite substantial progress, challenges remain. Fixed-budget schemes often underperform on tasks or inputs with high context-sensitivity. Quantization at extreme low bitwidths can destabilize perplexity in tasks requiring precise long-range composition (e.g., reasoning). Streaming or continual inference workflows may require new algorithmic primitives beyond the prevalent static prompt-based setting. Hardware/software codesign—e.g., exposing unified “KV_Compress” APIs at the kernel level, integrating quantization and sparsity with on-chip memory managers—represents a frontier for further performance gains (Liu et al., 8 Aug 2025).
Research directions include hybrid schedulers that dynamically blend pruning, quantization, and structured attention based on real-time resource-versus-fidelity trade-offs, and learned or reinforcement-learning-based policies for optimal budget allocation (e.g., as outlined for BaKlaVa and LeanKV). Co-optimization with hardware scheduling, fine-tuning per-head or per-token precision, and robust benchmarking under diverse user workloads are active areas (Liu et al., 8 Aug 2025).
7. Empirical Performance and Application Scenarios
Across LLM and multi-modal regimes, state-of-the-art KV cache optimization achieves:
- Compression ratios ranging from 2.5× to >20×, depending on strategy and tolerance for minor degradation.
- Task scores reduced by ≤1–2 points at high compression (RazorAttention, MiniKV, LeanKV, ZeroMerge), with certain methods (EvolKV, ZeroMerge) surpassing full-cache accuracy under tight memory constraints for particular tasks (Yu et al., 10 Sep 2025, Liu et al., 13 Mar 2025).
- Throughput increases of up to 5×, memory savings enabling context windows up to 80K tokens, and latency overheads typically ≤1%.
- In multi-modal (LVLM/AV-LLM) contexts, tailored techniques such as MixKV, PureKV, and AccKV address otherwise crippling semantic redundancy, spatial/temporal noise, or modality misalignment, with empirical gains of 1–9% over importance-only compression under tight budgets (Jiang et al., 29 Oct 2025, Jiang et al., 14 Nov 2025, Liu et al., 23 Oct 2025).
Production deployments must carefully balance compression ratio, accuracy targets, and system-level efficiency, often requiring empirical, task-specific tuning or online adaptation.
In summary, KV cache optimization encompasses a spectrum of algorithmic and system-level strategies—selective retention, quantization, merging, memory management, and adaptive scheduling—that collectively enable long-context, high-throughput LLM inference and deployment at scale, with empirically validated, minimal impact on downstream quality (Liu et al., 8 Aug 2025, Tang et al., 22 Jul 2024, Sharma et al., 27 Nov 2024, Li et al., 15 Jul 2025).