KV Cache Compression in Transformer LLMs

Updated 4 August 2025
  • KV Cache Compression is a set of techniques that reduce the memory footprint of Transformer LLM key-value caches by leveraging sparsity, dimensionality reduction, and quantization.
  • These methods combine strategies like token pruning, low-rank factorization, and mixed-precision quantization to maintain accuracy while managing long-context sequences.
  • They enable scalable, efficient inference for applications such as long document processing and retrieval-augmented generation while preserving output fidelity and safety.

Key–Value (KV) cache compression refers to the set of algorithmic strategies and practical techniques for reducing the memory and computational footprint of the KV caches held by Transformer-based LLMs during autoregressive inference. The KV cache retains, for each previously generated token and every attention layer, the key and value activations required for subsequent attention computations, enabling fast generation and long-range context modeling. However, KV cache size grows linearly with both context length and batch size, often becoming the primary bottleneck for scaling LLMs to long document processing, retrieval-augmented generation, and cost-efficient web deployments. Recent advances in the field have focused on reliably reducing KV cache requirements—across multiple dimensions—while preserving or minimally impacting downstream accuracy, safety, and coherent context reasoning, with techniques spanning quantization, sparsification, layer-wise compression, semantic chunking, head-specific retention, and more.
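
To make the linear growth concrete, here is a back-of-the-envelope sketch (not taken from the article) that estimates cache size from model shape and sequence length; the layer and head dimensions are illustrative assumptions loosely modeled on an 8B-parameter model with grouped-query attention.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int, bytes_per_elem: int = 2) -> int:
    """Total size of the key and value caches, in bytes.

    The leading factor of 2 accounts for storing both keys and values;
    bytes_per_elem=2 corresponds to FP16/BF16 storage.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem


# Assumed dimensions (32 layers, 8 KV heads of size 128) for illustration only.
size_gb = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128,
                         seq_len=128_000, batch_size=1) / 1e9
print(f"~{size_gb:.1f} GB of KV cache at a 128k-token context")  # ~16.8 GB
```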

1. Fundamental Techniques for KV Cache Compression

The literature categorizes KV cache compression methods along three primary axes: (i) sparsity, (ii) channel/dimensionality reduction, and (iii) quantization. Many approaches combine these elements to optimize trade-offs.

Hybrid frameworks interleave these axes, for example combining latent compression with quantization (Yang et al., 20 Oct 2024) or semantic chunking with aggressive pruning (Liu et al., 1 Feb 2025).
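
As a purely conceptual illustration of how the three axes compose, the sketch below prunes low-importance tokens (sparsity), projects the survivors onto a low-rank basis (dimensionality reduction), and quantizes the latent values; the shapes, importance scores, and bit allocation are assumptions, not a reproduction of any cited method.

```python
import numpy as np

def compress_keys(keys, scores, keep_ratio=0.25, rank=64, num_bits=4):
    """Toy composition of the three compression axes on one layer's key cache.

    keys:   (seq_len, d) key activations
    scores: (seq_len,) per-token importance, e.g. accumulated attention
    """
    # (i) Sparsity: keep only the highest-scoring tokens.
    k = max(1, int(keep_ratio * len(scores)))
    kept = np.argsort(scores)[-k:]
    keys = keys[kept]

    # (ii) Dimensionality reduction: project onto a low-rank SVD basis.
    _, _, vt = np.linalg.svd(keys, full_matrices=False)
    basis = vt[:rank]                          # (r, d), r <= rank
    latent = keys @ basis.T                    # (k, r)

    # (iii) Quantization: symmetric uniform quantization of the latent values.
    scale = max(np.abs(latent).max(), 1e-8) / (2 ** (num_bits - 1) - 1)
    quantized = np.round(latent / scale).astype(np.int8)

    # (quantized * scale) @ basis approximately reconstructs the retained keys.
    return quantized, scale, basis, kept
```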

2. Importance Scoring and Retention Strategies

Central to most sparsity and retention schemes is the definition of “importance” for tokens or groups:

  • Attention-Based Metrics: Accumulated attention score, either over the global sequence or within a moving local window, is common (Yang et al., 28 Feb 2024, Behnam et al., 19 Feb 2025, Cai et al., 4 Jun 2024). However, naive approaches can bias toward early or recently-added tokens, potentially yielding context loss.
  • Global-Local Balancing: To mitigate positional and cumulative attention biases, recent work aggregates both global (accumulated attention across the full context) and local (recent window) metrics, normalizing and combining them into a unified importance score (Li et al., 11 Dec 2024); a code sketch follows this list. For example:

s_{\text{Glo-Loc}} = \max\left( s_{\text{Glo}} \cdot \frac{\text{mean}(s_{\text{Loc}})}{\text{mean}(s_{\text{Glo}})},\ s_{\text{Loc}} \right)

  • Semantic Chunking: Rather than scoring tokens individually, ChunkKV (Liu et al., 1 Feb 2025) aggregates importance for contiguous chunks, enhancing preservation of linguistic and contextual integrity.
  • Head/Layer Adaptive Retention: Several methods assign retention budgets individually to layers (Cai et al., 4 Jun 2024, Liu et al., 23 May 2024, Zhou et al., 19 Dec 2024) or attention heads (Rehg, 30 Sep 2024, Tang et al., 22 Jul 2024), based on per-layer attention patterns (such as pyramidal funneling), headwise attention spans, or groupwise token redundancy.
  • Modality-Awareness: For multi-modal models, such as vision-LLMs, scoring policies that explicitly differentiate visual from textual token roles are critical (Tu et al., 29 Oct 2024). Methods like VL-Cache and AirCache target redundancy in visual tokens by leveraging inter-modal attention matrices and tailored scoring windows (Huang et al., 31 Mar 2025).
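
The following sketch illustrates the global-local fusion formula above for a single attention head; the window size and the single-head framing are simplifying assumptions, and real implementations score per head and per layer.

```python
import numpy as np

def global_local_score(attn, window=32, eps=1e-8):
    """Combine global and local accumulated attention into one importance
    score per cached token, following the max-based fusion formula above.

    attn: (num_queries, num_keys) attention weights for one head.
    """
    s_glo = attn.sum(axis=0)              # accumulated over the full context
    s_loc = attn[-window:].sum(axis=0)    # accumulated over the recent window
    # Rescale the global score so its mean matches the local mean, then take
    # the elementwise maximum of the two.
    rescaled = s_glo * (s_loc.mean() / (s_glo.mean() + eps))
    return np.maximum(rescaled, s_loc)

def select_tokens(attn, budget, window=32):
    """Indices of the tokens to retain under a fixed cache budget."""
    return np.argsort(global_local_score(attn, window))[-budget:]
```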

3. Quantization and Mixed-Precision Methods

  • Scalar, Vector, and Mixed-Precision Quantization: Early methods applied uniform per-channel quantization, using the same precision (e.g., FP16 or INT4) for every token and channel. Modern approaches apply importance-aware mixed schemes that allocate higher bitwidths to crucial tokens or latents and aggressively quantize the rest, e.g., MiKV (Yang et al., 28 Feb 2024) and SVDq (Yankun et al., 21 Feb 2025); a simplified sketch follows this list.
  • Latent Channel Quantization: SVDq projects key matrices onto SVD-derived latent bases, then applies bitwidth schedules that rapidly decay with decreasing singular values—often skipping near-zero channels, enabling compression ratios up to 400x with minimal quantization error.
  • Vector Quantization: CommVQ (Li et al., 23 Jun 2025), employing residual additive quantization, partitions KV vectors, encodes them using compact binary sequences via learned codebooks, and is specifically designed so the codebook commutes with the RoPE transformation. The additive scheme is trained via expectation-maximization, achieving 87.5% memory reduction at 2-bit without significant accuracy loss.
  • Residual Vector Quantization: Built on high-fidelity audio compression paradigms, RVQ divides scaled KV vectors into groups (using non-contiguous grouping for keys, contiguous for values) and applies multi-stage (typically depth=8) quantization with codebooks updated by exponential moving average (Kumar, 21 Oct 2024).
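
As a simplified illustration of importance-aware mixed precision in the spirit of MiKV, the sketch below keeps the top-scoring cache entries in full precision and stores the rest in a low-bit quantized form instead of evicting them; the per-token scaling and the 2-bit setting are assumptions, not the paper's exact algorithm.

```python
import numpy as np

def mixed_precision_cache(values, scores, keep_ratio=0.1, low_bits=2):
    """Split a cache tensor into full-precision 'important' entries and
    low-bit quantized 'unimportant' entries.

    values: (seq_len, d) cached K or V activations
    scores: (seq_len,) per-token importance
    """
    n = len(scores)
    k = max(1, int(keep_ratio * n))
    mask = np.zeros(n, dtype=bool)
    mask[np.argsort(scores)[-k:]] = True     # top-k entries stay full precision

    hi = values[mask]
    lo = values[~mask]
    # Per-token symmetric quantization of the low-importance entries.
    scale = np.abs(lo).max(axis=1, keepdims=True) / (2 ** (low_bits - 1) - 1)
    scale = np.where(scale == 0, 1.0, scale)
    lo_q = np.round(lo / scale).astype(np.int8)
    return hi, (lo_q, scale), mask

def dequantize(lo_q, scale):
    """Recover approximate low-importance entries for use in attention."""
    return lo_q.astype(np.float32) * scale
```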

4. Layerwise, Semantic, and Adaptive Compression Frameworks

  • Depth-wise and Cross-layer Strategies: MiniCache (Liu et al., 23 May 2024) exploits the high similarity of KV states between adjacent layers, interpolating their directions with SLERP and merging their magnitudes (see the sketch after this list), achieving up to 5× compression with minimal loss. Cross-Layer Latent Attention (CLLA) (Yang et al., 20 Oct 2024) shares low-dimensional compressed representations across layers, further quantized for up to 98% space reduction.
  • Semantic-Preserving Compression: ChunkKV (Liu et al., 1 Feb 2025) selects full semantic chunks, preserving subject–predicate–object and phrase boundary information, and reuses index mappings across layers to reduce computational overhead, yielding up to 26.5% throughput improvement and robustness on retrieval and reasoning tasks.
  • Partial and Headwise Retention: RazorAttention (Tang et al., 22 Jul 2024) keeps full caches for “retrieval heads” (which genuinely reference long-range tokens), while truncating others, using a compensation token to aggregate dropped information. KV-Compress (Rehg, 30 Sep 2024) permits different eviction rates per head/layer in a paged attention cache, realizing up to 8× compression and >5× throughput gains in vLLM deployments.
  • Dynamic, Task-Adaptive Budgets: DynamicKV (Zhou et al., 19 Dec 2024) adaptively adjusts per-layer, per-task cache sizes, maintaining ∼85% of full-cache performance while retaining only 1.7% of tokens, and strongly outperforming state-of-the-art baselines in retrieval-heavy settings. PyramidKV (Cai et al., 4 Jun 2024) further illustrates dynamic cache allocation: most KV pairs are retained in lower layers, where attention is broadly distributed, with shrinking budgets in higher layers, where attention concentrates on a few "attention sink" tokens.
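
The sketch below illustrates the cross-layer merging idea described for MiniCache: interpolate the directions of per-token KV states from two adjacent layers with SLERP and average their magnitudes. It is a simplified reading under assumed shapes and omits the paper's handling of tokens too dissimilar to merge.

```python
import numpy as np

def slerp(x, y, t=0.5, eps=1e-8):
    """Spherical linear interpolation between the directions of x and y."""
    x_n = x / (np.linalg.norm(x) + eps)
    y_n = y / (np.linalg.norm(y) + eps)
    theta = np.arccos(np.clip(np.dot(x_n, y_n), -1.0, 1.0))
    if theta < 1e-6:                      # nearly parallel: plain interpolation
        return (1 - t) * x_n + t * y_n
    return (np.sin((1 - t) * theta) * x_n + np.sin(t * theta) * y_n) / np.sin(theta)

def merge_adjacent_layers(kv_a, kv_b, t=0.5):
    """Merge per-token KV states of two adjacent layers into one shared state.

    kv_a, kv_b: (seq_len, head_dim) states from layers l and l+1.
    """
    merged = np.empty_like(kv_a)
    for i in range(kv_a.shape[0]):
        direction = slerp(kv_a[i], kv_b[i], t)
        magnitude = 0.5 * (np.linalg.norm(kv_a[i]) + np.linalg.norm(kv_b[i]))
        merged[i] = magnitude * direction
    return merged
```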

5. Output Fidelity, Safety, and Theoretical Guarantees

  • Information Preservation and Context Fidelity: Completely evicting “unimportant” KV pairs can introduce safety breaches, hallucinations, or context loss (Yang et al., 28 Feb 2024). Storing quantized versions of evicted pairs mitigates these risks, retaining enough information to maintain generation and safety performance.
  • Zero-Perturbation and Consistency: Merging-based approaches can introduce output inconsistency (“attention sag”) due to mismatches in attention mass before and after a merge. KeepKV (Tian et al., 14 Apr 2025) introduces the Electoral Votes mechanism and Zero Inference-Perturbation Merging (ZIP), ensuring that attention outputs remain exactly unchanged at merge time and that cumulative influence is preserved; a simplified sketch follows this list. Error bounds for future steps are analytically characterized via the accuracy of attention predictions.
  • Layerwise and Modal Consistency: For VLMs, methods such as AirCache (Huang et al., 31 Mar 2025) employ statistics like strength and skewness of importance to allocate per-layer budgets, further improving efficiency by focusing token preservation in critical layers where attention distributions are highly non-uniform.
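
One way to see how a merge can leave the attention output exactly unchanged at merge time is the construction below: the merged value is the attention-weight-weighted average of the two originals, and the merged key is chosen so that its logit equals the log-sum-exp of the two original logits. This is a generic illustration of the zero-perturbation idea under assumed shapes, not KeepKV's actual ZIP mechanism, and exactness holds only for the query available at merge time.

```python
import numpy as np

def merge_pair_zero_perturbation(q, K, V, i, j):
    """Merge cache entries i and j so that softmax(K q) @ V is unchanged
    for the current query q.

    q: (d,) query; K, V: (n, d) cached keys and values.
    """
    scale = 1.0 / np.sqrt(K.shape[1])
    logits = (K @ q) * scale
    w_i, w_j = np.exp(logits[i]), np.exp(logits[j])

    # Merged value: weighted average, so the softmax numerator is preserved.
    v_m = (w_i * V[i] + w_j * V[j]) / (w_i + w_j)

    # Merged key: any vector whose scaled dot product with q equals
    # logsumexp(logits[i], logits[j]), preserving the softmax denominator.
    target = np.logaddexp(logits[i], logits[j]) / scale
    k_m = K[i] + ((target - K[i] @ q) / (q @ q)) * q

    keep = [t for t in range(K.shape[0]) if t not in (i, j)]
    return np.vstack([K[keep], k_m]), np.vstack([V[keep], v_m])
```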

6. Computational and Practical Impact

  • Memory and Throughput Gains: In practice, KV cache compression methods enable LLMs with context windows of 128k–170k tokens to run on commodity and production GPUs. For example, CommVQ's 1-bit quantization allows LLaMA-3.1-8B to run with a 128K context on a single RTX 4090 (Li et al., 23 Jun 2025). Reductions of more than 80% in KV cache size and roughly 5× throughput improvements are demonstrated on industry-standard tasks and hardware (Liu et al., 23 May 2024, Rehg, 30 Sep 2024, Li et al., 23 Jun 2025).
  • Minimal Accuracy Loss: Mixed-precision, adaptive, semantic, and headwise methods regularly match or nearly match full-cache baselines, even with <10% of the original tokens or bits retained, across tasks including line retrieval, code completion, multi-turn dialogue, and factual QA (Yang et al., 28 Feb 2024, Yang et al., 20 Oct 2024, Li et al., 11 Dec 2024, Zhou et al., 19 Dec 2024, Yankun et al., 21 Feb 2025, Liu et al., 1 Feb 2025).
  • Integration and Scalability: Techniques such as paged attention caches and vectorized quantization fuse compression into the decoding and attention steps with negligible runtime overhead, often requiring only a lightweight codebook and block table management. Notably, methods like KV-Compress have been integrated into production-grade inference stacks (e.g., vLLM (Rehg, 30 Sep 2024)) to enable batch size scaling and session concurrency under fixed memory budgets.
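
The sketch below shows the kind of bookkeeping that makes per-head and per-layer eviction compatible with a paged cache: every (layer, head) pair owns its own block table over a shared physical block pool, so different heads can hold different numbers of tokens. It is a conceptual illustration with hypothetical names, not the actual vLLM or KV-Compress data structures, and the K/V tensor contents themselves are elided.

```python
from collections import defaultdict

class PagedKVCacheTable:
    """Block-table bookkeeping for a paged KV cache with per-(layer, head)
    cache lengths. Tracks only block ownership and token counts."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = defaultdict(list)   # (layer, head) -> [block ids]
        self.token_counts = defaultdict(int)    # (layer, head) -> tokens held

    def append_token(self, layer: int, head: int) -> None:
        key = (layer, head)
        if self.token_counts[key] == len(self.block_tables[key]) * self.block_size:
            self.block_tables[key].append(self.free_blocks.pop())  # new block
        self.token_counts[key] += 1

    def evict_tokens(self, layer: int, head: int, count: int) -> None:
        """Release `count` tokens from one head's cache, returning any blocks
        that become empty to the shared free pool."""
        key = (layer, head)
        self.token_counts[key] = max(0, self.token_counts[key] - count)
        needed = -(-self.token_counts[key] // self.block_size)  # ceil division
        while len(self.block_tables[key]) > needed:
            self.free_blocks.append(self.block_tables[key].pop())
```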

7. Frontiers and Research Directions

  • Continual and Multi-turn Compression: Some approaches extend budgeted KV cache allocation across multi-turn or streaming conversation contexts (Behnam et al., 19 Feb 2025), seeking to balance history retention and computational tractability.
  • Offline and Query-Agnostic Compression: Methods like KVzip (Kim et al., 29 May 2025) use context reconstruction scores in a query-agnostic fashion, facilitating cache reuse across divergent future queries and substantially outperforming query-aware token selection methods when cache is reused in batch or serving contexts.
  • Open Problems: Challenges remain in the accurate prediction of attention score evolution for multi-step merging schemes, aggressive compression at sub-1% budget ratios, avoiding privacy leakage during context reconstruction-based scoring, and further composability between quantization, low-rank, sparsity, and semantic-aware frameworks.

In summary, the field has rapidly evolved to support highly efficient LLM inference under challenging memory and throughput constraints, with algorithmic solutions that preserve semantic, syntactic, and generative fidelity by combining importance-based selection, adaptive compression budgets, and innovative quantization and merging strategies. The corpus demonstrates that with appropriate methodological choices, up to 98% cache size reduction is achievable, and >400× effective key compression ratios can be approached while maintaining nearly full downstream performance.
