
KV-Cache Compression Techniques

Updated 2 February 2026
  • KV-cache compression techniques are algorithmic approaches that reduce memory and compute bottlenecks in autoregressive models by applying quantization, low-rank factorization, token pruning, and cross-layer sharing.
  • Combinations of low-bit quantization, residual vector quantization, pruning, and cross-layer sharing can achieve up to 98% memory reduction with minimal accuracy loss, demonstrating robust improvements in throughput and efficiency.
  • System-level optimizations like fused kernel design and hardware-aware co-design integrate these techniques into production environments, balancing cache efficiency with low-latency, high-throughput performance.

Key–Value (KV) cache compression techniques are algorithmic strategies and system-level frameworks developed to mitigate the memory, bandwidth, and compute bottlenecks imposed by the growth of the KV cache during inference of large-scale autoregressive models, including language, vision, and multi-modal transformers. The KV cache stores past attention keys and values for each layer and token, avoiding quadratic recomputation during sequential decoding at the cost of memory that grows linearly with sequence length, model depth, and batch size. As context and batch sizes increase, managing the KV cache becomes critical for memory efficiency, throughput, and scalability, especially in resource-constrained or production environments. A diverse spectrum of techniques has been developed to reduce the KV cache footprint while preserving model quality and computational efficiency; these include quantization, low-rank factorization, cross-layer/state-space sharing, token pruning and dynamic retention, vector quantization, hybrid systems, and system–hardware co-designs.

1. Taxonomy and Mathematical Foundations

KV-cache compression methods can be organized across four principal axes: storage precision (quantization), architectural redundancy (low-rank/channel compression, cross-layer/attention sharing), selective information retention (token pruning/eviction), and algorithm–system integration (blockwise encoding, fused kernels). The full KV cache for a decoder of L layers, H heads, head dimension d, and sequence length T requires O(L·H·T·d) elements, typically in float16 or float32. For very long contexts, this can exceed available GPU memory, as in VAR models with upwards of 90 GB needed at T > 10,000 and standard LLMs at T > 128,000 tokens (Qin et al., 12 Apr 2025).
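Plugging numbers into the O(L·H·T·d) count makes the scale concrete. The sketch below uses an illustrative 7B-class decoder configuration (the specific layer/head/dimension values are assumptions, not taken from any cited paper):

```python
# Back-of-envelope KV-cache sizing from the O(L*H*T*d) element count above.
# The configuration is an illustrative 7B-class decoder, not from a specific paper.

def kv_cache_bytes(layers: int, heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """Total bytes for keys plus values at the given precision (fp16 = 2 bytes)."""
    return 2 * layers * heads * head_dim * seq_len * bytes_per_elem  # 2 = K and V

gib = kv_cache_bytes(layers=32, heads=32, head_dim=128, seq_len=128_000) / 2**30
print(f"fp16 KV cache at 128k tokens: {gib:.1f} GiB")  # 62.5 GiB
```

At 128k tokens the cache alone already dwarfs a single accelerator's memory, which is the motivation for every technique that follows.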

The technical approaches map onto the categories surveyed in the following sections: quantization and vector quantization; low-rank, latent, and cross-layer compression; token pruning and adaptive retention; and systems-level, hardware-aware design.

2. Quantization and Vector Quantization Techniques

Quantization remains a foundational strategy. Scalar quantization methods (per-channel, per-token) assign each key/value element to a codebook entry, typically stored as b-bit integers with symmetric or affine scaling. More advanced schemes address outliers and distributional heterogeneity through:

  • Low-Bit Quantization with Matrix/Tensor Decomposition (DecoQuant (Liu et al., 2024)): Perform an MPO (matrix product operator) factorization to separate outliers into a small auxiliary tensor (T_S) stored at full precision, allowing the central tensor (T_L) to be quantized aggressively. A fused dequantization–GeMM kernel achieves a 4× (75%) reduction in cache size at 4 bits, with a <0.3 point drop in accuracy for LLaMA-7B and OPT-6.7B.
  • Residual Vector Quantization (RVQ) (Kumar, 2024): Divide the channel dimension into groups (d_g) and iteratively quantize sub-vectors in each group using a sequence of vector quantizers; a depth of 8 quantizers suffices to recover nearly all accuracy, yielding 5.5× compression vs. fp16. Non-contiguous grouping (stride-based) further improves key compression, and light attention-block finetuning can close the remaining performance gap.
  • Commutative Vector Quantization (CommVQ) (Li et al., 23 Jun 2025): Applies additive quantization with a learned encoder–codebook pair that, when specifically structured, commutes with rotary position embedding (RoPE), enabling fused attention and rapid decoding. Achieves 87.5% compression at 2-bit and 93.75% at 1-bit, with nearly lossless quality on long-context tasks, enabled by Triton kernels fusing decode and attention.
  • Importance-Aware Mixed-Precision Quantization in Latent Space (SVDq) (Yankun et al., 21 Feb 2025): Project K onto the SVD basis, assign higher bitwidths to the dominant singular directions whose energy decays rapidly, and combine with token sparsity for up to 410× key cache compression at near-lossless performance. The quantization error is an order of magnitude lower than per-channel quantization in the original basis.

These quantization methods routinely require efficient in-situ dequantization, integrated with the attention matmul or fused with Huffman encoding for further entropy-based reduction, as in PackKV (15×–19× raw reduction with <5% accuracy drop, and up to 171% throughput improvement versus cuBLAS matvec) (Jiang et al., 30 Dec 2025).
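Among the vector-quantization variants above, the residual (multi-stage) scheme can be sketched as a toy: each stage quantizes the residual left by the previous stage against its own codebook. The random, shrinking-scale codebooks here are purely illustrative assumptions; real systems learn them (e.g. via k-means) from calibration data:

```python
import numpy as np

# Toy residual vector quantization (RVQ). Each codebook gets a zero entry so
# a stage can "pass" without increasing the residual; scales shrink per stage.

rng = np.random.default_rng(0)
dim, codebook_size, depth = 8, 256, 8
codebooks = [
    np.vstack([np.zeros(dim),
               rng.standard_normal((codebook_size - 1, dim)) * 0.5 ** t])
    for t in range(depth)
]

def rvq_encode(v: np.ndarray) -> list:
    codes, residual = [], v.copy()
    for cb in codebooks:
        idx = int(np.argmin(((residual - cb) ** 2).sum(axis=1)))  # nearest entry
        codes.append(idx)
        residual = residual - cb[idx]
    return codes

def rvq_decode(codes: list) -> np.ndarray:
    return sum(cb[i] for cb, i in zip(codebooks, codes))

v = rng.standard_normal(dim)
codes = rvq_encode(v)
err = float(np.linalg.norm(v - rvq_decode(codes)))
print(f"reconstruction error after {depth} stages: {err:.3f}")
```

Storage drops from 8 fp16 values (16 bytes) to 8 one-byte indices, and deeper stacks of quantizers trade decode work for accuracy, mirroring the depth-versus-quality trade-off reported for RVQ.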

3. Low-Rank, Latent, and Cross-Layer Compression

Low-rank and latent-dimension approaches explicitly decompose the KV transformation or the cache for storage and reconstruction efficiency:

  • Channel Shrinking via SVD/Factorization (CSKV, Palu, ReCalKV): SVD-based replacement of key/value projections by a low-rank product A B, where A ∈ R^(d_in × r), B ∈ R^(r × d_kv), and r ≪ d_kv (Wang et al., 2024, Chang et al., 2024, Yan et al., 30 May 2025). Layerwise fine-tuning of A, B via MSE between original and reconstructed K/V enables 80% channel reduction with >90% accuracy retention, extendable to 95% savings by post-quantization.
  • Group/Head-Similarity-Aware SVD (ReCalKV): Headwise grouping via CKA similarity, followed by group-SVD, is used for keys; values use offline calibration and matrix fusion with the downstream output projection to remove extra computation (Yan et al., 30 May 2025). ReCalKV consistently outperforms Palu at high compression ratios (50–70%), showing more gradual performance degradation.
  • Cross-Layer SVD and Latent Sharing (xKV, CommonKV, CLLA): Merge K/V or their latent bottleneck representations across contiguous layers via SVD (xKV) or joint projection (CommonKV, CLLA) (Chang et al., 24 Mar 2025, Wang et al., 22 Aug 2025, Yang et al., 2024). Empirically, dominant singular vectors remain aligned across layers, enabling aggressive per-group reduction (G = 2 or 4) and 6.8× higher compression rates than previous inter-layer sharing methods, without accuracy loss.
  • Latent Attention and Mixture-of-Experts Integration (CLLA): Projects hidden representations to a small latent via W^c, applies per-group quantization, and shares latents across layer groups, yielding 2–5.2% storage (CLLA-quant, 4-bit) while maintaining or improving accuracy (Yang et al., 2024).
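The channel-shrinking idea in the first bullet can be sketched with a truncated SVD of a synthetic projection matrix (the dimensions and the decaying spectrum are illustrative assumptions; real methods further fine-tune A and B against calibration data):

```python
import numpy as np

# Replace a KV projection W (d_in x d_kv) with a rank-r factorization A @ B.
# The synthetic W has a geometrically decaying spectrum, so a low rank
# captures most of its energy.

rng = np.random.default_rng(0)
d_in, d_kv, r = 256, 128, 32

U, _ = np.linalg.qr(rng.standard_normal((d_in, d_kv)))   # orthonormal columns
V, _ = np.linalg.qr(rng.standard_normal((d_kv, d_kv)))
s = 0.9 ** np.arange(d_kv)                                # decaying singular values
W = U @ np.diag(s) @ V.T

Us, Ss, Vt = np.linalg.svd(W, full_matrices=False)
A = Us[:, :r] * Ss[:r]      # d_in x r
B = Vt[:r, :]               # r x d_kv

# Storage: r*(d_in + d_kv) = 12,288 values vs d_in*d_kv = 32,768 (62.5% saved).
rel_err = float(np.linalg.norm(W - A @ B) / np.linalg.norm(W))
print(f"rank-{r} relative reconstruction error: {rel_err:.4f}")
```

Because the cache stores activations in the r-dimensional latent rather than the full d_kv channels, the same factorization shrinks runtime memory, not just weights.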

For all these methods, quantization and pruning/eviction can be stacked without interference, enabling compound savings of up to 98% (Wang et al., 22 Aug 2025).

4. Token Pruning, Adaptive Retention, and Task-Aware Compression

Selective eviction of less important tokens from the cache is critical, particularly in long-context or retrieval scenarios where quadratic memory scaling is prohibitive:

  • Per-Token Importance Scoring and Adaptive Retention: Importance is variously measured via average attention scores, gradient-based saliency, or layer-wise/attention-head-specific statistics, and applied as static hard budgets (H2O, SnapKV), learned patterns (ZigZagKV), or adaptive dynamic policies (Liu et al., 8 Aug 2025, Zhou et al., 2024, Zhang et al., 2024).
  • Dynamic Budgeting (DBudgetKV, DynamicKV): Establishes global and per-layer budgets that are updated dynamically at inference in response to attention patterns or performance proxies. DBudgetKV uses an attention-row Frobenius norm proxy to halt pruning prior to observable degradation, enabling lossless retention on a per-input basis, robust to domain, context length, and model size (Ni et al., 24 Feb 2025). DynamicKV trains an adaptive per-layer retention curve, redistributing tokens according to task and input properties, often matching or outperforming fixed methods at 1.7%–6.9% cache budgets (Zhou et al., 2024).
  • Hybrid and Per-Head Adaptive Pipelines (LeanKV): Allocates higher precision to keys than values, assigns token precision/budget via headwise dynamic sparsity, and employs a unified page-based on-GPU memory manager to efficiently compact and repack variable-precision entries (Zhang et al., 2024). LeanKV traces out a Pareto-optimal frontier, yielding 3–11× compression with <5% loss and 2–7× throughput improvement.

Token-pruning methods show high efficiency and low memory at moderate compression, with ablations indicating that extremely aggressive pruning only becomes viable with adaptive, per-layer schemes (Liu et al., 8 Aug 2025, Ni et al., 24 Feb 2025, Zhou et al., 2024).
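A minimal sketch of accumulated-attention-score retention in the spirit of the static-budget methods above (toy single-head shapes, random stand-in attention weights, and the recency-window heuristic are all illustrative assumptions, not any paper's exact policy):

```python
import numpy as np

# Rank cached tokens by accumulated attention mass, keep a fixed budget,
# and always retain a small window of the most recent tokens.

rng = np.random.default_rng(0)
T, d, budget, recent = 64, 32, 16, 4

K = rng.standard_normal((T, d))                    # cached keys (toy)
attn = rng.random((8, T))                          # stand-in attention rows
attn /= attn.sum(axis=1, keepdims=True)            # rows sum to 1, like softmax
scores = attn.sum(axis=0)                          # accumulated attention mass

keep = set(range(T - recent, T))                   # recency window first
for idx in np.argsort(scores)[::-1]:               # then highest-scoring tokens
    if len(keep) >= budget:
        break
    keep.add(int(idx))

kept = sorted(keep)
K_pruned = K[kept]
print(f"kept {len(kept)}/{T} tokens")
```

Dynamic-budget methods replace the fixed `budget` with a per-layer, per-input quantity derived from attention statistics, which is where most of the additional quality at aggressive ratios comes from.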

5. Systems-Level Techniques and Hardware–Aware Design

A major challenge in deploying advanced KV-cache compression arises from the need for high-throughput, low-latency decoding and compatibility with production-grade serving stacks:

  • Blockwise Bitpacking and Entropy Coding: PackKV, KVComp, and similar frameworks combine aggressive quantization with bit-packing and optionally Huffman (or ANS/FSE) coding (Jiang et al., 30 Dec 2025, Jiang et al., 30 Aug 2025). By exploiting block permutation invariance of attention, repacking, and a compressed storage layout, PackKV achieves 15–19× reduction, robust performance, and up to 175% throughput gain versus cuBLAS matvec, with negligible decompression overhead.
  • Fused Kernel Design: Modern methods implement single-pass kernels on GPU that jointly decompress, dequantize, and perform matrix-vector multiplies for attention (reconstruction-free pipelines), removing global memory roundtrips and exploiting coalesced loads (Jiang et al., 30 Dec 2025, Jiang et al., 30 Aug 2025, Li et al., 23 Jun 2025).
  • Unified Page-Table and Memory Management: LeanKV synchronizes per-head, per-request allocation and recycling of variable-precision pages via parallel prefix-sum and circular lists, achieving negligible latency overhead and high cache utilization (Zhang et al., 2024).
  • Negative-Sample and Latency Prediction: Production evaluations reveal that naive application of compression may not yield throughput improvements in Flash-/PagedAttention environments, and can elongate output rather than merely reducing memory (Gao et al., 31 Mar 2025). Automated throughput and response-length predictors, as in “rethink-kv-compression,” are now essential for adaptive request routing and minimizing production tail-latency.

The net result is that the best compression strategies now seek not just memory reduction, but alignment of encoding formats, cache-growth dynamics, and dequantization–attention throughput with system and hardware constraints.
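As a concrete instance of the bitpacking step mentioned above (entropy coding and kernel fusion omitted; the NumPy nibble layout is an illustrative assumption), 4-bit codes can be packed two per byte:

```python
import numpy as np

# Pack 4-bit quantization codes pairwise into uint8 words, low nibble first.
# Assumes an even number of codes, each in [0, 15].

def pack4(q: np.ndarray) -> np.ndarray:
    q = q.astype(np.uint8).reshape(-1, 2)
    return q[:, 0] | (q[:, 1] << 4)

def unpack4(packed: np.ndarray) -> np.ndarray:
    low = packed & 0x0F
    high = packed >> 4
    return np.stack([low, high], axis=1).reshape(-1)

q = np.arange(16, dtype=np.uint8)        # 16 four-bit codes
packed = pack4(q)
assert packed.nbytes == q.size // 2      # 2x denser than one byte per code
assert np.array_equal(unpack4(packed), q)
print("round-trip OK, packed bytes:", packed.nbytes)
```

Production kernels perform the equivalent unpacking in registers during the attention matvec, which is why the storage layout must match the access pattern of the fused kernel.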

6. Empirical Performance and Trade-offs

Empirical results reported across the methods above include: 4× cache reduction at 4-bit precision with a <0.3 point accuracy drop (DecoQuant), 15–19× reduction with <5% accuracy loss and up to 171% throughput gain (PackKV), 3–11× compression with 2–7× throughput improvement (LeanKV), and compound savings of up to 98% when quantization, pruning, and latent sharing are stacked (CommonKV). The recurring trade-off is between compression ratio, accuracy retention, and the dequantization or reconstruction overhead paid at decode time.

7. Limitations, Open Problems, and Research Directions

Despite the substantial progress in KV-cache compression, ongoing research and operational deployments highlight unresolved challenges:

  • Input and Task Adaptivity: Static token-retention or quantization budgets fail to exploit context- and task-specific information-density profiles, leading to either wasted memory or quality loss. Dynamic approaches (DynamicKV, LeanKV, DBudgetKV) address this at the cost of added complexity or occasional budget-estimation errors (Zhou et al., 2024, Zhang et al., 2024, Ni et al., 24 Feb 2025).
  • System Integration and Production Robustness: The real-world throughput and latency gains of compression are nontrivial to realize and may be nullified by attention kernel or page-fragmentation mismatches, or by increased output length (Gao et al., 31 Mar 2025).
  • Hybrid and Unified Techniques: Future methods are anticipated to unify quantization, pruning, latent sharing, and blockwise encoding within a budget-aware, latency-controlled scheduler; reinforcement or Bayesian optimization may automate hyperparameter tuning (e.g., HACK extensions (Qin et al., 12 Apr 2025), LeanKV adaptive controllers (Zhang et al., 2024)).
  • Hardware and Algorithm Co-design: Exposing quantize, prune, and merge primitives to device libraries, designing bitwidth-reconfigurable datapaths, and leveraging kernel fusion remain key for scaling on next-generation hardware (Jiang et al., 30 Dec 2025, Jiang et al., 30 Aug 2025).
  • Negative-Sample Prediction and Robustness: Automated negative-sample evaluators and length/throughput predictors inform request routing and algorithm fallback strategies, providing resilience to task-specific or edge-case degradation (Gao et al., 31 Mar 2025).

In summary, KV-cache compression for modern autoregressive models encompasses a growing ecosystem of algorithmic, architectural, and system-level innovations. The research trajectory is toward ever more adaptive, robust, and hardware-conscious solutions, enabling unprecedented context lengths and throughput while maintaining the scientific rigor and performance required for state-of-the-art AI deployment.
