
KV Cache Slimming Techniques

Updated 19 November 2025
  • KV Cache Slimming is a collection of techniques that reduce the memory and compute footprint of transformer models by compressing key/value caches during inference.
  • It utilizes methods such as token grouping, attention-guided pruning, and mixed-precision quantization to achieve up to 40× compression with negligible accuracy drop.
  • These strategies enable longer context windows, higher batch sizes, and accelerated generation in both language and multimodal models by managing cache redundancy effectively.

Key-value (KV) cache slimming encompasses a diverse portfolio of techniques for reducing the memory and compute footprint of the key/value states in Transformer-based LLMs and multimodal models during inference. As LLMs scale to longer context windows and higher batch sizes, the cost of storing and manipulating KV caches becomes the primary bottleneck for throughput and deployability. KV cache slimming directly enables higher batch sizes, lower latency, and support for long-context tasks, with negligible accuracy loss. Techniques span token-level, head-level, channel-level, layer-level, and cross-modality strategies. This article reviews the state of the art, theoretical principles, algorithmic designs, and quantitative performance for representative approaches, with particular emphasis on recent advances in binary signatures (KVCrush), attention-guided retention, layer-level reuse, token grouping, and principled mixed-precision/quantization frameworks.

1. Principle and Taxonomy of KV Cache Slimming

The KV cache stores the key and value vectors corresponding to previously generated or input tokens across attention layers and heads: $K^{(\ell)}_t, V^{(\ell)}_t \in \mathbb{R}^D$ for each layer $\ell$ and token position $t$. Naively, this cache grows as $O(B \, L_{\text{seq}} \, N_L \, N_H \, D \, p)$, where $B$ is the batch size, $L_{\text{seq}}$ the sequence length, $N_L$ the number of layers, $N_H$ the number of heads, $D$ the head dimensionality, and $p$ the precision in bytes per element. Without slimming, caches for modern LLMs quickly reach several terabytes.
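To make this scaling concrete, the following back-of-the-envelope Python sketch evaluates the growth formula above; the model settings are hypothetical (full multi-head attention, no grouped-query sharing) and are not taken from any cited paper.

```python
def kv_cache_bytes(batch: int, seq_len: int, n_layers: int,
                   n_heads: int, head_dim: int, bytes_per_elem: int = 2) -> int:
    """Bytes needed to store K and V for every token, layer, and head
    (the factor of 2 covers the separate key and value tensors)."""
    return 2 * batch * seq_len * n_layers * n_heads * head_dim * bytes_per_elem

# Hypothetical 70B-class model with full multi-head attention:
# 80 layers, 64 heads, head_dim 128, FP16 storage, batch 8, 128k-token context.
size = kv_cache_bytes(batch=8, seq_len=128_000, n_layers=80, n_heads=64, head_dim=128)
print(f"{size / 1e12:.2f} TB")   # ~2.7 TB before any slimming
```

Every slimming family below shrinks one or more of these factors: tokens ($L_{\text{seq}}$), heads ($N_H$), channels ($D$), layers ($N_L$), or precision ($p$).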

KV cache slimming schemes fall into several categories:

| Slimming Principle | Method Type | Key Example |
| --- | --- | --- |
| Token-level retention and grouping | Pruning/eviction | KVCrush, SAGE-KV |
| Head-level dynamic budgeting | Per-head pruning | RazorAttention, Task-KV |
| Channel-level compression | Dimension pruning / low-rank | CSKV, LeanK, KV-Latent |
| Layer-level cache sharing/reuse | Shallow/deep merge | MiniCache, KVSharer |
| Mixed-precision quantization | Precision control | MiKV, LeanKV |
| Attention-pattern analysis | Guided retention | SAGE-KV, Task-KV |

Significant variants also target cross-modal caches in VLMs/LVLMs (Efficient LLaMA-3.2-Vision, InfiniPot-V, PureKV).

2. Binary Signature-Based Token Grouping and Pruning (KVCrush)

KVCrush introduces a binary signature $r_t \in \{0,1\}^H$ per token, mapping the original high-dimensional $K_t, V_t \in \mathbb{R}^D$ to $H$ bits via head-wise attention-score thresholding. For head $h$ and token $t$ at layer $\ell$, define

$$ r_{th} = \begin{cases} 1 & \dfrac{A^{(\ell)}[h, t] - \min A^{(\ell)}[h, :]}{\max A^{(\ell)}[h, :] - \min A^{(\ell)}[h, :]} \geq \tau_h \\ 0 & \text{otherwise} \end{cases} $$

Tokens are then grouped by Hamming distance to a single anchor signature, bucketing the signatures and retaining only one representative per bucket ($B_{\text{rep}}$ buckets). The slimmed cache comprises only these representatives, with evicted tokens mapped to their group representative. This process maintains head-behavioral similarity and preserves the attention distribution at a compressed memory scale. Empirically, KVCrush reduces cache size by 4× with <1% accuracy drop and <0.5% latency overhead on LongBench (Jha et al., 24 Feb 2025).
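A minimal PyTorch sketch of the signature-and-bucket idea follows; the fixed threshold, first-token anchor, and first-in-bucket representative choice are simplifying assumptions, not KVCrush's exact procedure.

```python
import torch

def binary_signatures(attn: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    """attn: (H, T) per-head attention mass on each cached token.
    Min-max normalize per head, then threshold to get an H-bit code per token."""
    lo = attn.min(dim=1, keepdim=True).values
    hi = attn.max(dim=1, keepdim=True).values
    norm = (attn - lo) / (hi - lo + 1e-9)
    return (norm >= tau).to(torch.uint8).T            # (T, H) binary signatures

def crush(sigs: torch.Tensor, n_buckets: int):
    """Bucket tokens by Hamming distance to a single anchor signature and keep
    one representative token per bucket; evicted tokens map to their representative."""
    anchor = sigs[0]                                   # first token as anchor (assumption)
    dist = (sigs != anchor).sum(dim=1)                 # Hamming distance to the anchor
    buckets = torch.clamp(dist * n_buckets // (sigs.shape[1] + 1), max=n_buckets - 1)
    reps, remap = [], {}
    for b in range(n_buckets):
        idx = torch.nonzero(buckets == b).flatten()
        if idx.numel():
            rep = int(idx[0])                          # keep the first token in the bucket
            reps.append(rep)
            remap.update({int(i): rep for i in idx})
    return reps, remap

attn = torch.rand(8, 64)                               # 8 heads, 64 cached tokens
kept, remap = crush(binary_signatures(attn), n_buckets=16)
print(f"kept {len(kept)} representative tokens out of {attn.shape[1]}")
```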

3. Layer-Wise and Depth-Dimension Cache Sharing and Merging

Layer-wise slimming leverages the high similarity of key/value states across adjacent (or strategically chosen) layers. MiniCache separates each cached vector into magnitude and direction and performs SLERP (spherical linear interpolation) between adjacent layers' unit-norm directions; tokens with high inter-layer angular distance are retained unmerged to prevent semantic loss. Compression achieves 3.5–5× reduction with near-lossless accuracy (Liu et al., 23 May 2024).
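The following PyTorch sketch illustrates the magnitude/direction split and SLERP merge for one pair of adjacent layers; it omits MiniCache's token-retention and calibration steps and is only a simplified illustration of the merging step.

```python
import torch

def slerp_merge(kv_shallow: torch.Tensor, kv_deep: torch.Tensor,
                t: float = 0.5, eps: float = 1e-8) -> torch.Tensor:
    """Merge per-token key (or value) vectors of two adjacent layers:
    separate magnitude from direction, SLERP the unit directions,
    and interpolate the magnitudes linearly."""
    m_s = kv_shallow.norm(dim=-1, keepdim=True)
    m_d = kv_deep.norm(dim=-1, keepdim=True)
    d_s, d_d = kv_shallow / (m_s + eps), kv_deep / (m_d + eps)
    cos = (d_s * d_d).sum(-1, keepdim=True).clamp(-1 + 1e-6, 1 - 1e-6)
    theta = torch.acos(cos)                      # inter-layer angular distance
    w_s = torch.sin((1 - t) * theta) / torch.sin(theta)
    w_d = torch.sin(t * theta) / torch.sin(theta)
    merged = ((1 - t) * m_s + t * m_d) * (w_s * d_s + w_d * d_d)
    return merged                                # one cache entry replaces two

k_l = torch.randn(64, 128)                       # 64 tokens, head_dim 128, layer l
k_l1 = torch.randn(64, 128)                      # same tokens, layer l+1
merged_k = slerp_merge(k_l, k_l1)
```

Tokens whose `theta` exceeds a retention threshold would be kept unmerged, which is the semantic-loss safeguard described above.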

KVSharer employs a strategy of dissimilar cache sharing, identifying pairs of layers with maximal Frobenius-norm distance in their average KV representations. Sharing dissimilar caches minimizes hidden-state interference, allowing up to 30% compute reduction and 1.3×–1.65× generation acceleration with >90% quality preservation (Yang et al., 24 Oct 2024).
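As a toy illustration of the dissimilarity-driven pairing, the sketch below ranks layer pairs by Frobenius distance between averaged KV matrices and greedily picks non-overlapping pairs; the real method also validates candidate pairs against generation quality, which is omitted here.

```python
import torch

def dissimilar_layer_pairs(avg_kv: list[torch.Tensor], n_share: int):
    """avg_kv: one averaged KV matrix (tokens, dim) per layer, from calibration data.
    Rank layer pairs by Frobenius distance (largest first) and greedily select
    n_share non-overlapping pairs whose caches will be shared."""
    L = len(avg_kv)
    dists = [(torch.norm(avg_kv[i] - avg_kv[j]).item(), i, j)
             for i in range(L) for j in range(i + 1, L)]
    dists.sort(reverse=True)                      # most dissimilar pairs first
    used, pairs = set(), []
    for _, i, j in dists:
        if i not in used and j not in used:
            pairs.append((i, j))
            used.update((i, j))
        if len(pairs) == n_share:
            break
    return pairs

layers = [torch.randn(64, 128) for _ in range(8)]  # 8 layers of averaged KV states
print(dissimilar_layer_pairs(layers, n_share=2))   # e.g. [(2, 5), (0, 7)]
```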

LightTransfer (SimLayerKV) uses attention-pattern detection (on the initial/final context tokens) to identify “lazy layers” that do not utilize the full cache. Those layers retain only a small window of tokens while aggressive slimming removes the rest, yielding up to 2.3× throughput improvement and 5× memory compression (Zhang et al., 17 Oct 2024).
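A rough sketch of lazy-layer detection, under the assumption that "lazy" means attention mass concentrates on the initial and most recent tokens; the threshold and window sizes are illustrative, not the paper's values.

```python
import torch

def is_lazy_layer(attn: torch.Tensor, n_initial: int = 4,
                  n_recent: int = 64, threshold: float = 0.9) -> bool:
    """attn: (H, Q, T) attention of recent decoding steps in one layer.
    The layer is 'lazy' if most attention mass lands on the first n_initial
    and last n_recent cached tokens; such layers keep only that small window."""
    window_mass = attn[..., :n_initial].sum(-1) + attn[..., -n_recent:].sum(-1)
    return bool(window_mass.mean() > threshold)

attn = torch.softmax(torch.randn(8, 16, 4096), dim=-1)   # 8 heads, 16 queries, 4096 tokens
print(is_lazy_layer(attn))                                # random attention -> likely False
```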

4. Attention-Guided Pruning, Task-Aware Token/Head Budgeting

Several methods use attention matrices to identify tokens and heads with global (retrieval) versus strictly local focus:

  • RazorAttention separates “retrieval heads” (∼15% of heads, crucial to global context) from “local heads” and applies full caching only to the former. For local heads, a thin cache plus a compensation token (the mean over discarded tokens) ensures little loss. This approach realizes >70% reduction with negligible accuracy loss (Tang et al., 22 Jul 2024).
  • Task-KV dynamically allocates differentiated budgets based on distance from the “semantic center.” Heads far from the center (“heterogeneous”) receive full KV caches, while non-heterogeneous heads retain only recent tokens, global sinks, and middle activations. This achieves 2–3× reduction with nearly full task accuracy across long-context QA, summarization, and reasoning (He et al., 25 Jan 2025).
  • SAGE-KV and DBudgetKV exploit one-shot post-prefill attention scores to rank token importance and perform a single top-k selection at both the token and head levels, resulting in efficient cache compression with 2–4× memory savings and an insignificant accuracy drop (Wang et al., 11 Mar 2025, Ni et al., 24 Feb 2025); a minimal top-k sketch follows this list.
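The sketch below shows one-shot post-prefill top-k retention in the spirit of SAGE-KV; the max-over-heads pooling and the keep ratio are assumptions for illustration.

```python
import torch

def one_shot_topk_retention(attn_last: torch.Tensor, keep_ratio: float = 0.25):
    """attn_last: (H, T) attention of the last prefill query over all cached tokens.
    Score each token by its maximum attention across heads, then keep the top-k."""
    scores = attn_last.max(dim=0).values          # (T,) per-token importance
    k = max(1, int(keep_ratio * scores.numel()))
    keep = torch.topk(scores, k).indices.sort().values
    return keep                                   # indices of tokens to retain

attn = torch.softmax(torch.randn(8, 1024), dim=-1)   # 8 heads, 1024 prefill tokens
kept = one_shot_topk_retention(attn, keep_ratio=0.25)
print(kept.shape)                                    # ~256 retained token indices
```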

5. Channel-Level Slimming and Static/Adaptive Dimensionality Pruning

Significant redundancy exists across the channel dimension in the KV cache.

  • LeanK and CSKV both leverage per-channel static sparsity. LeanK uses a staged training process to learn binary channel masks for keys that match hardware alignment constraints (a toy channel-mask sketch follows this list). Up to 70% K-cache and 16–18% V-cache compression is achieved without accuracy loss, and a 1.3–1.6× kernel speedup is realized (Zhang et al., 4 Aug 2025). CSKV applies low-rank decomposition (SVD or ASVD) to compress projection matrices, with a local/global bi-branch cache for recent and older tokens to preserve performance. The KV cache can be reduced by 80–95% with negligible loss (Wang et al., 16 Sep 2024).
  • KV-Latent downsamples key and value dimensions into a latent space, allowing for a reduced KV cache width. The method addresses rotary positional embedding frequency stability at low dimension by redensifying the frequency schedule, thereby avoiding the noise of high frequencies. Reducing value dimensionality degrades perplexity more than reducing keys. Memory savings of 2×–7× and latency benefits of 8–17% are possible (Shi et al., 15 Jul 2025).
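A toy stand-in for a static key-channel mask with a hardware-alignment constraint; real LeanK learns the mask during a staged training process rather than scoring channels by magnitude as done here.

```python
import torch

def static_key_channel_mask(k_cache: torch.Tensor, keep_ratio: float = 0.3,
                            align: int = 32) -> torch.Tensor:
    """k_cache: (tokens, d_k) calibration keys for one head.
    Score channels by mean absolute magnitude, keep the top fraction, and round
    the kept count up to a hardware-friendly multiple."""
    d = k_cache.shape[-1]
    keep = int(keep_ratio * d)
    keep = min(d, ((keep + align - 1) // align) * align)   # alignment constraint
    scores = k_cache.abs().mean(dim=0)                     # per-channel importance
    idx = torch.topk(scores, keep).indices
    mask = torch.zeros(d, dtype=torch.bool)
    mask[idx] = True
    return mask                                            # apply as k_cache[:, mask]

k = torch.randn(4096, 128)                                 # calibration keys
mask = static_key_channel_mask(k)
print(int(mask.sum()), "of", k.shape[-1], "key channels kept")
```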

6. Mixed-Precision, Quantization, and Adaptive Storage Layouts

Mixed-precision approaches are central to high compression ratios:

  • MiKV and LeanKV assign precision per token based on importance: highly ranked tokens remain at FP16 or higher, while less important tokens are stored at INT2/INT4 or lower. Retaining “unimportant” KV pairs in aggressively quantized form, rather than evicting them, substantially recovers the degradation seen under pure eviction; the overall cache is reduced to 20–25% of baseline with <1% accuracy drop (Yang et al., 28 Feb 2024, Zhang et al., 4 Dec 2024). Outlier-aware channel rescaling improves reliability at very low bit widths (a minimal per-token precision sketch follows this list).
  • LeanKV’s on-GPU memory manager exploits per-token, per-head sparsity by dynamically compacting the fragmented cache into contiguous blocks for each precision level, delivering performance-neutral settings (K8V4, K4V2) and up to 11× slimmer KV caches routinely for models up to 70B parameters (Zhang et al., 4 Dec 2024).
  • Transform coding (KVTC (Staniszewski et al., 3 Nov 2025)) applies PCA-based decorrelation, dynamic bit allocation, and DEFLATE entropy coding over combined K/V vectors, achieving up to 40× compression with nearly no loss and an 8× reduction in Time-to-First-Token.
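A minimal sketch of importance-driven mixed precision: the top-scoring tokens stay in full precision while the rest are round-tripped through symmetric INT4 quantization. The scoring, bit widths, and keep ratio are illustrative assumptions, not the exact MiKV/LeanKV recipes.

```python
import torch

def quantize_per_token(x: torch.Tensor, bits: int) -> torch.Tensor:
    """Symmetric per-token integer quantization, returned in dequantized form
    (a stand-in for actually storing the low-bit codes)."""
    qmax = 2 ** (bits - 1) - 1
    scale = x.abs().amax(dim=-1, keepdim=True) / qmax + 1e-8
    q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax)
    return q * scale

def mixed_precision_cache(kv: torch.Tensor, importance: torch.Tensor,
                          keep_ratio: float = 0.2) -> torch.Tensor:
    """kv: (T, D) cached keys or values; importance: (T,) per-token scores.
    Keep the most important tokens at full precision and quantize the rest to INT4."""
    T = kv.shape[0]
    k = max(1, int(keep_ratio * T))
    hot = torch.topk(importance, k).indices
    cold = torch.ones(T, dtype=torch.bool)
    cold[hot] = False
    out = kv.clone()
    out[cold] = quantize_per_token(kv[cold], bits=4)
    return out

kv = torch.randn(1024, 128)                 # cached keys or values for 1024 tokens
scores = torch.rand(1024)                   # e.g. accumulated attention per token
slim = mixed_precision_cache(kv, scores)    # ~20% full-precision, ~80% INT4-dequantized
```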

7. Multimodal and Streaming Contexts: Visual, Video, and Cross-Modality Caches

For vision-language (LVLM) and video models, attention sparsity and functional specialization are even more prominent:

  • Efficient LLaMA-3.2-Vision replicates attention-based retention strategies in cross-attention layers, selecting only the top-k visual tokens after the first layer and achieving a 50–75% reduction with <0.5% performance degradation (Lee et al., 1 Apr 2025).
  • InfiniPot-V establishes a length-independent memory cap for streaming video by alternating cache-append and compression passes, using temporal-axis redundancy (TaR) and value-norm (VaN) scoring to select tokens. Up to 94% cache reduction with matched accuracy and real-time generation is reported (Kim et al., 18 Jun 2025).
  • PureKV employs cross-layer importance estimation (using lower-layer attention scores as a proxy for higher-layer token importance), combined with a spatial-temporal sparse attention purifier for video streams. The architecture integrates with block-sparse/FlashAttention kernels, enabling 5× slimming and 3.2× prefill acceleration in VideoLLaMA2 with negligible degradation, outperforming prior baselines (Jiang et al., 29 Oct 2025); a toy cross-layer selection sketch follows this list.
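A toy sketch of the cross-layer importance proxy: attention from a lower layer's queries over visual tokens is pooled and reused to pick the visual tokens retained in higher layers. The mean pooling and fixed budget are assumptions, and the spatial-temporal purifier is omitted.

```python
import torch

def cross_layer_visual_selection(low_layer_attn: torch.Tensor, n_keep: int) -> torch.Tensor:
    """low_layer_attn: (H, Q, V) attention from a lower layer's text queries over
    visual tokens. Pool it into a per-visual-token score and use that as a proxy
    for higher-layer importance, keeping only the top n_keep visual tokens there."""
    scores = low_layer_attn.mean(dim=(0, 1))          # (V,) pooled visual-token score
    return torch.topk(scores, n_keep).indices.sort().values

attn = torch.softmax(torch.randn(8, 32, 2048), dim=-1)  # 8 heads, 32 queries, 2048 visual tokens
kept_visual = cross_layer_visual_selection(attn, n_keep=256)
print(kept_visual.shape)                                 # 256 retained visual-token indices
```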

8. Query-Agnostic and Multi-Query Cache Reusability

KVzip tackles the challenge of multi-query cache reusability by quantifying importance via context reconstruction. Importance is scored by the LLM’s effectiveness at reconstructing the original prompt conditioned on retained KV pairs; tokens receiving low reconstruction attention are evicted. The offline scoring procedure is computationally efficient, and KVzip achieves 3–5× compression with negligible loss on global QA and reasoning (Kim et al., 29 May 2025). In multi-query, long-context settings, query-agnostic slimming sustains 99% accuracy where competing methods degrade sharply.

9. Compensation for Saliency Shift and Marginal Tokens

Permanent eviction can destroy the future relevance of previously unimportant tokens (“saliency shift”) and overcompress “marginal” tokens that contribute to accumulated context. SmallKV introduces a small auxiliary model whose attention patterns are matched to the main model’s, forming a similarity map over attention indices; it then (1) restores global importance scores (preventing premature loss) and (2) approximates marginal tokens’ scores, compensating outputs via blending coefficients. This mechanism preserves accuracy at cache budgets as low as 5% and improves throughput by up to 2.56× (Zhao et al., 3 Aug 2025).

10. Trade-offs, Overhead, and Best Practices

Slimming benefits are typically realized at 2×–5× memory reduction and 1.3×–5× throughput gains, with practical configurations driven by context length, batch size, hardware capabilities (e.g., alignment requirements), and tolerance for <1% task degradation. Overheads remain below 1% added latency at 4× slimming (KVCrush, MiniCache, RazorAttention, LeanK), compaction and quantization are vectorizable and non-blocking, and compensation or cross-layer estimation introduces negligible kernel interference. When stacking methods (e.g., KVCrush + mixed precision), 8–16× total savings can be achieved. Hyperparameters (chunk size, anchor choice, bucketing, channel mask, retention ratio) are robust to moderate variation; per-task adaptation is recommended only under adversarial long-context workloads.

Common best practices include reserving sliding windows for recent tokens, freezing early layers, combining static-dynamic budgeting, and monitoring core metrics on target hardware with in-domain prompt mixes. Integration is plug-and-play for most frameworks (vLLM, FlashAttention, XAttention), and mixing slimming with streaming attention and head pruning is orthogonal in effect.


These works collectively define the best practices and boundaries for current KV cache slimming research in large language and multimodal models.

