KV Cache Slimming Techniques
- KV Cache Slimming is a collection of techniques that reduce the memory and compute footprint of transformer models by compressing key/value caches during inference.
- It utilizes methods such as token grouping, attention-guided pruning, and mixed-precision quantization to achieve up to 40× compression with negligible accuracy drop.
- These strategies enable longer context windows, higher batch sizes, and accelerated generation in both language and multimodal models by managing cache redundancy effectively.
Key-value (KV) cache slimming covers a diverse family of techniques for reducing the memory and compute footprint of the key/value states in Transformer-based LLMs and multimodal models during inference. As LLMs scale to longer context windows and higher batch sizes, the cost of storing and manipulating KV caches becomes the primary bottleneck for throughput and deployability. KV cache slimming directly enables higher batch sizes, lower latency, and support for long-context tasks, with negligible accuracy loss. Techniques span token-level, head-level, channel-level, layer-level, and cross-modality strategies. This article reviews the state of the art, theoretical principles, algorithmic designs, and quantitative performance for representative approaches, with particular emphasis on recent advances in binary signatures (KVCrush), attention-guided retention, layer/lateral reuse, token grouping, and principled mixed-precision/quantization frameworks.
1. Principle and Taxonomy of KV Cache Slimming
The KV cache stores the key and value vectors $k_\ell^{(t)}, v_\ell^{(t)}$ for each layer $\ell$ and token position $t$, across all attention layers and heads. Naively, this cache grows as $2 \cdot B \cdot S \cdot L \cdot H \cdot d_h \cdot p$ bytes, where $B$ is the batch size, $S$ the sequence length, $L$ the number of layers, $H$ the number of heads, $d_h$ the per-head dimensionality, and $p$ the bytes per element (precision). Without slimming, caches for modern LLMs quickly reach several terabytes.
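To make the formula concrete, the following minimal Python sketch computes the naive cache size from the factors above. The model dimensions (layers, KV heads, head dimension) are illustrative assumptions, not tied to a specific checkpoint.

```python
# Back-of-envelope KV cache sizing using the 2 * B * S * L * H * d_h * p formula above.
# All model dimensions below are illustrative assumptions.

def kv_cache_bytes(batch: int, seq_len: int, layers: int,
                   kv_heads: int, head_dim: int, bytes_per_elem: int) -> int:
    """Total bytes for keys + values (factor of 2) across all layers and heads."""
    return 2 * batch * seq_len * layers * kv_heads * head_dim * bytes_per_elem

# Example: 80 layers, 64 KV heads, head_dim 128, FP16 (2 bytes),
# batch 8, 32k-token context.
size = kv_cache_bytes(batch=8, seq_len=32_768, layers=80,
                      kv_heads=64, head_dim=128, bytes_per_elem=2)
print(f"{size / 2**30:.1f} GiB")   # 640.0 GiB before any slimming
```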
KV cache slimming schemes fall into several categories:
| Slimming Principle | Method Type | Key Example |
|---|---|---|
| Token-level retention and grouping | Pruning/Eviction | KVCrush, SAGE-KV |
| Head-level dynamic budgeting | Per-head pruning | RazorAttention, Task-KV |
| Channel-level compression | Dim. pruning/low-rank | CSKV, LeanK, KV-Latent |
| Layer-level cache sharing/reuse | Shallow/deep merge | MiniCache, KVSharer |
| Mixed-precision quantization | Precision control | MiKV, LeanKV |
| Attention-pattern analysis | Guided retention | SAGE-KV, Task-KV |
Significant variants also target cross-modal caches in VLMs/LVLMs (Efficient LLaMA-3.2-Vision, InfiniPot-V, PureKV).
2. Binary Signature-Based Token Grouping and Pruning (KVCrush)
KVCrush introduces a compact binary signature per token, mapping the original high-dimensional key/value representation to an $H$-bit vector (one bit per head) via head-wise attention score thresholding. For head $h$ and token $i$ at layer $\ell$, the signature bit is, schematically, $b_{\ell,h}(i) = \mathbf{1}\big[a_{\ell,h}(i) > \tau_{\ell,h}\big]$, where $a_{\ell,h}(i)$ is the attention score head $h$ assigns to token $i$ and $\tau_{\ell,h}$ is a per-head threshold.
Tokens are then grouped by Hamming distance to a single anchor signature, bucketing the signatures and retaining only one representative per bucket. The slimmed cache comprises only these representatives, with evicted tokens mapped to their group representative. This process maintains head-behavioral similarity and approximately preserves the attention distribution at a compressed memory scale. Empirically, KVCrush reduces cache size severalfold with a negligible accuracy drop and minimal latency overhead on LongBench (Jha et al., 24 Feb 2025).
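A minimal sketch of the signature-and-bucket idea (not the exact KVCrush algorithm): per-head thresholded attention yields an $H$-bit signature per token, tokens are bucketed by Hamming distance to a single anchor, and one representative per bucket is retained. The threshold values, anchor choice, and representative selection are simplifying assumptions.

```python
import numpy as np

def binary_signatures(attn: np.ndarray, thresh: np.ndarray) -> np.ndarray:
    """attn: (H, T) attention mass per head/token; thresh: (H,) per-head cutoffs.
    Returns a (T, H) boolean signature matrix."""
    return (attn > thresh[:, None]).T

def bucket_by_anchor(sigs: np.ndarray, n_buckets: int) -> np.ndarray:
    """Assign each token to a bucket by Hamming distance to a single anchor signature."""
    anchor = sigs[0]                                # assumption: first token as anchor
    hamming = (sigs != anchor).sum(axis=1)          # distance in [0, H]
    return np.minimum(hamming * n_buckets // (sigs.shape[1] + 1), n_buckets - 1)

def pick_representatives(buckets: np.ndarray) -> np.ndarray:
    """Keep one token index per non-empty bucket (here: the first occurrence)."""
    _, idx = np.unique(buckets, return_index=True)
    return np.sort(idx)

# Toy example: 4 heads, 10 tokens.
rng = np.random.default_rng(0)
attn = rng.random((4, 10))
sigs = binary_signatures(attn, thresh=np.full(4, 0.5))
keep = pick_representatives(bucket_by_anchor(sigs, n_buckets=4))
print("retained token indices:", keep)
```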
3. Layer-Wise and Depth-Dimension Cache Sharing and Merging
Layer-wise slimming leverages the high similarity of key/value states across adjacent (or strategically chosen) layers. MiniCache performs SLERP (spherical linear interpolation) between adjacent layers' unit-norm directions after separating out magnitude and direction. Tokens with high inter-layer angular distance are retained separately rather than merged, to prevent semantic loss. Compression reaches roughly $3.5\times$ or greater reduction with near-lossless accuracy (Liu et al., 23 May 2024).
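A minimal sketch of magnitude/direction-separated SLERP merging of two layers' value vectors, in the spirit of MiniCache; the interpolation weight `t` and the angular-distance retention threshold are assumptions.

```python
import numpy as np

def slerp_merge(x: np.ndarray, y: np.ndarray, t: float = 0.5,
                max_angle: float = np.pi / 3) -> tuple[np.ndarray, np.ndarray]:
    """Merge per-token vectors from two adjacent layers.
    Returns (merged_vectors, mask of tokens kept unmerged due to large angle)."""
    nx = np.linalg.norm(x, axis=-1, keepdims=True)
    ny = np.linalg.norm(y, axis=-1, keepdims=True)
    ux, uy = x / nx, y / ny                              # unit directions
    cos = np.clip((ux * uy).sum(-1, keepdims=True), -1.0, 1.0)
    theta = np.arccos(cos)
    keep = theta.squeeze(-1) > max_angle                 # high angular distance: do not merge
    sin = np.sin(theta) + 1e-8
    u = (np.sin((1 - t) * theta) * ux + np.sin(t * theta) * uy) / sin  # SLERP on directions
    merged = u * ((1 - t) * nx + t * ny)                 # re-attach interpolated magnitude
    return merged, keep

# Toy: 5 tokens, 8-dim value vectors from two adjacent layers.
rng = np.random.default_rng(1)
v_lower, v_upper = rng.normal(size=(5, 8)), rng.normal(size=(5, 8))
merged, keep_separate = slerp_merge(v_lower, v_upper)
print("tokens retained separately:", np.flatnonzero(keep_separate))
```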
KVSharer employs a strategy of dissimilar cache sharing, identifying pairs of layers with maximal Frobenius-norm distance in their average KV representations. Sharing dissimilar caches minimizes hidden-state interference, allowing substantial compute reduction and measurable generation acceleration with quality preservation (Yang et al., 24 Oct 2024).
LightTransfer (SimLayerKV) uses attention pattern detection (on the initial/final context tokens) to identify “lazy layers” that do not utilize the full cache. Those layers retain only a small window of tokens while aggressive slimming removes the rest, yielding notable throughput improvements and memory compression (Zhang et al., 17 Oct 2024).
4. Attention-Guided Pruning, Task-Aware Token/Head Budgeting
Several methods use attention matrices to identify tokens and heads with global (retrieval) versus strictly local focus:
- RazorAttention separates “retrieval heads” (∼15% of heads, crucial to global context) from “local heads” and applies full caching only to the former. For local heads, a thin cache plus a compensation token (the mean over discarded tokens) ensures little loss. This approach realizes a substantial cache reduction with negligible accuracy loss (Tang et al., 22 Jul 2024).
- Task-KV dynamically allocates differentiated budgets based on the distance from the “semantic center.” Heads far from the center (“heterogeneous”) receive full KV caching, while non-heterogeneous heads receive only recent tokens, global sinks, and middle activations. This achieves a reduction of $2\times$ or more with nearly full task accuracy across long-context QA, summarization, and reasoning (He et al., 25 Jan 2025).
- SAGE-KV and DBudgetKV exploit one-shot post-prefill attention scores to rank token importance and perform a single top-$k$ selection at both the token and head levels, resulting in efficient cache compression with memory savings of $2\times$ or more and an insignificant accuracy drop (Wang et al., 11 Mar 2025, Ni et al., 24 Feb 2025); a minimal sketch of this one-shot selection follows the list.
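The sketch below illustrates one-shot attention-guided retention: tokens are ranked by accumulated post-prefill attention and only the top-$k$ plus a recency window are kept. The keep ratio and window size are assumptions, not values from any of the cited papers.

```python
import numpy as np

def topk_retention(attn_scores: np.ndarray, keep_ratio: float = 0.25,
                   window: int = 8) -> np.ndarray:
    """attn_scores: (T,) accumulated post-prefill attention received by each token.
    Keep the top-k tokens by score plus a recency window; return sorted indices."""
    T = attn_scores.shape[0]
    k = max(1, int(T * keep_ratio))
    recent = np.arange(max(0, T - window), T)            # always keep recent tokens
    top = np.argsort(attn_scores)[::-1][:k]              # highest-scoring tokens
    return np.unique(np.concatenate([top, recent]))

# Toy: 64 prefilled tokens, keep ~25% plus an 8-token sliding window.
rng = np.random.default_rng(2)
keep = topk_retention(rng.random(64))
print(f"kept {keep.size}/64 tokens")
```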
5. Channel-Level Slimming and Static/Adaptive Dimensionality Pruning
Significant redundancy exists across the channel dimension in the KV cache.
- LeanK and CSKV both leverage per-channel static sparsity. LeanK uses a staged training process to learn binary channel masks for keys that match hardware alignment constraints, achieving substantial K-cache compression and additional V-cache savings without accuracy loss, along with kernel speedups of roughly $1.3\times$ and above (Zhang et al., 4 Aug 2025). CSKV applies low-rank decomposition (SVD or ASVD) to compress the key/value projection matrices, with a local/global bi-branch cache for recent and older tokens to preserve performance; the KV cache can be reduced by roughly 80% or more with negligible loss (Wang et al., 16 Sep 2024). A minimal low-rank sketch follows this list.
- KV-Latent downsamples key and value dimensions into a latent space, allowing for a reduced KV cache width. The method addresses rotary positional embedding frequency stability at low dimension by re-densifying the frequency schedule, thereby avoiding noise from high frequencies. Reducing value dimensionality degrades perplexity more than reducing key dimensionality. Meaningful memory and latency benefits are reported (Shi et al., 15 Jul 2025).
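The following sketch shows the low-rank (truncated SVD) channel-shrinking idea in the spirit of CSKV: the projection matrix is factored so that cached states live in a smaller latent dimension and are expanded only at attention time. The rank and matrix sizes are illustrative assumptions.

```python
import numpy as np

def low_rank_factor(W: np.ndarray, rank: int) -> tuple[np.ndarray, np.ndarray]:
    """Factor a key/value projection W (d_model x d_head) into A (d_model x r) and
    B (r x d_head) via truncated SVD, so cached states can live in the r-dim space."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]          # down-projection absorbs singular values
    B = Vt[:rank]                       # up-projection applied at attention time
    return A, B

# Toy projection: d_model=512, d_head=128, compressed to rank 32 (75% channel reduction).
rng = np.random.default_rng(3)
W = rng.normal(size=(512, 128))
A, B = low_rank_factor(W, rank=32)
x = rng.normal(size=(1, 512))           # one hidden state
k_compressed = x @ A                    # (1, 32) stored in the slim cache
k_full_approx = k_compressed @ B        # (1, 128) reconstructed when attending
print("reconstruction error:", np.linalg.norm(x @ W - k_full_approx))
```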
6. Mixed-Precision, Quantization, and Adaptive Storage Layouts
Mixed-precision approaches are central to high compression ratios:
- MiKV and LeanKV assign precision per token based on importance: highly ranked tokens remain at FP16 or higher precision, while less important tokens are stored at INT2/INT4 or lower. Storing “unimportant” KV pairs in aggressively quantized form, rather than evicting them outright, recovers much of the degradation seen with pure eviction, with the overall cache reduced to roughly 20% of baseline and a negligible accuracy drop (Yang et al., 28 Feb 2024, Zhang et al., 4 Dec 2024). Outlier-aware channel rescaling improves reliability at very low bit widths; a minimal per-token mixed-precision sketch follows this list.
- LeanKV’s on-GPU memory manager exploits per-token, per-head sparsity by dynamically compacting the fragmented memory list into contiguous blocks for high- and low-precision entries, supporting performance-neutral configurations (K8V4, K4V2) as well as more aggressively slimmed KV caches for models up to $70$B parameters (Zhang et al., 4 Dec 2024).
- Transform coding (KVTC; Staniszewski et al., 3 Nov 2025) applies PCA-based decorrelation, dynamic bit allocation, and DEFLATE entropy coding over combined key/value vectors, achieving high compression ratios with nearly no loss and a reduction in time-to-first-token.
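A minimal sketch of importance-ranked per-token precision assignment: top-ranked tokens stay in full precision, the rest are fake-quantized to 4 bits. The round-to-nearest scheme, symmetric scaling, and 20% high-precision ratio are assumptions, not the MiKV/LeanKV implementations.

```python
import numpy as np

def quantize_int4(x: np.ndarray) -> np.ndarray:
    """Per-token symmetric fake-quantization to 4 bits (values returned dequantized)."""
    scale = np.abs(x).max(axis=-1, keepdims=True) / 7.0 + 1e-8
    return np.clip(np.round(x / scale), -8, 7) * scale

def mixed_precision_cache(kv: np.ndarray, importance: np.ndarray,
                          hi_ratio: float = 0.2) -> np.ndarray:
    """Keep the top hi_ratio of tokens (by importance) in full precision,
    store the rest with 4-bit fake-quantization."""
    cutoff = np.quantile(importance, 1.0 - hi_ratio)
    out = kv.copy()
    low = importance < cutoff
    out[low] = quantize_int4(kv[low])
    return out

# Toy: 128 tokens with 64-dim values and random importance scores.
rng = np.random.default_rng(4)
kv = rng.normal(size=(128, 64)).astype(np.float32)
cache = mixed_precision_cache(kv, importance=rng.random(128))
print("mean abs error on low-importance tokens:", np.abs(cache - kv).mean())
```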
7. Multimodal and Streaming Contexts: Visual, Video, and Cross-Modality Caches
For vision-language (LVLM) and video models, attention sparsity and functional specialization are even more prominent:
- Efficient LLaMA-3.2-Vision replicates attention-based retention strategies in cross-attention layers, selecting only the top-$k$ visual tokens after the first cross-attention layer and achieving a reduction of roughly 50% or more with minimal performance degradation (Lee et al., 1 Apr 2025).
- InfiniPot-V establishes a length-independent memory cap for streaming video by alternating cache-append and compression passes, using temporal-axis redundancy (TaR) and value-norm (VaN) scoring to select tokens. Substantial cache reduction with matched accuracy and real-time generation is reported (Kim et al., 18 Jun 2025); a schematic capped-cache loop appears after this list.
- PureKV employs cross-layer importance estimation (using lower-layer attention scores as a proxy for higher-layer token importance), combined with a spatial-temporal sparse attention purifier for video streams. The design is compatible with block-sparse/FlashAttention kernels, enabling KV cache slimming and prefill acceleration in VideoLLaMA2 with negligible degradation and consistent gains over prior baselines (Jiang et al., 29 Oct 2025).
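The sketch below illustrates the capped streaming-cache pattern referenced above: chunks are appended and, whenever the cap is exceeded, a compression pass keeps only the highest-scoring tokens. Value-norm scoring stands in for the TaR/VaN combination, and the cap, chunk size, and feature dimension are assumptions.

```python
import numpy as np

class CappedStreamingCache:
    """Append incoming frame tokens; when the cap is exceeded, keep the tokens
    with the largest value norms (a stand-in for TaR/VaN scoring)."""
    def __init__(self, cap: int):
        self.cap = cap
        self.values = np.empty((0, 64), dtype=np.float32)

    def append(self, chunk: np.ndarray) -> None:
        self.values = np.concatenate([self.values, chunk])
        if len(self.values) > self.cap:                       # compression pass
            scores = np.linalg.norm(self.values, axis=-1)     # value-norm scoring
            keep = np.sort(np.argsort(scores)[::-1][:self.cap])
            self.values = self.values[keep]

# Toy stream: 10 chunks of 32 tokens under a 128-token cap.
rng = np.random.default_rng(5)
cache = CappedStreamingCache(cap=128)
for _ in range(10):
    cache.append(rng.normal(size=(32, 64)).astype(np.float32))
print("cache size after stream:", len(cache.values))   # never exceeds 128
```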
8. Query-Agnostic and Multi-Query Cache Reusability
KVzip tackles the challenge of multi-query cache reusability by quantifying importance via context reconstruction. Importance is scored by the LLM’s effectiveness at reconstructing the original prompt conditioned on retained KV pairs; tokens with low reconstruction attention are evicted. The offline scoring procedure is computationally efficient, and KVzip achieves compression of roughly $3\times$ or more with negligible loss on global QA and reasoning (Kim et al., 29 May 2025). In multi-query, long-context settings, query-agnostic slimming sustains accuracy where competing methods degrade sharply.
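A schematic of query-agnostic scoring in the spirit of KVzip: each cached token is scored by the maximum attention it receives during a reconstruction pass, and the lowest-scoring tokens are evicted. The mock random attention maps below stand in for the model's actual attention when repeating the context; the eviction fraction is an assumption.

```python
import numpy as np

def reconstruction_scores(attn_maps: list[np.ndarray]) -> np.ndarray:
    """attn_maps: per-layer (T_recon, T_ctx) attention from reconstruction queries
    to cached context tokens. Score each cached token by the max attention it
    receives anywhere, which is query-agnostic with respect to future prompts."""
    stacked = np.stack([a.max(axis=0) for a in attn_maps])   # (layers, T_ctx)
    return stacked.max(axis=0)

# Mocked attention from a reconstruction pass over a 256-token context, 4 layers.
rng = np.random.default_rng(6)
attn_maps = [rng.dirichlet(np.ones(256), size=256) for _ in range(4)]
scores = reconstruction_scores(attn_maps)
evict = np.argsort(scores)[: 256 // 3]          # drop the lowest-scoring third
print("evicting", evict.size, "of 256 cached tokens")
```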
9. Compensation for Saliency Shift and Marginal Tokens
Permanent eviction can destroy the future relevance of previously unimportant tokens (“saliency shift”) and overcompress “marginal” tokens that contribute to accumulated context. SmallKV introduces a small auxiliary model whose attention patterns are aligned with the main model via a similarity map over attention indices; it then (1) restores global importance estimates (preventing premature loss) and (2) approximates marginal tokens’ scores, compensating outputs via blending coefficients. This mechanism preserves accuracy even at small cache budgets and improves throughput substantially (Zhao et al., 3 Aug 2025).
10. Trade-offs, Overhead, and Best Practices
Slimming benefits typically combine multi-fold memory reduction with corresponding throughput gains, with practical configurations driven by context length, batch size, hardware capabilities (e.g., alignment requirements), and tolerance for task degradation. Overheads generally remain a small fraction of end-to-end latency (KVCrush, MiniCache, RazorAttention, LeanK); compaction and quantization are vectorizable and non-blocking, and compensation or cross-layer estimation introduces negligible kernel interference. When stacking methods (e.g., KVCrush + mixed precision), total savings of roughly $8\times$ or more can be achieved. Hyperparameters (chunk size, anchor choice, bucketing, channel mask, retention ratio) are robust to moderate variation; per-task adaptation is recommended only under adversarial long-context workloads.
Common best practices include reserving sliding windows for recent tokens, freezing early layers, combining static-dynamic budgeting, and monitoring core metrics on target hardware with in-domain prompt mixes. Integration is plug-and-play for most frameworks (vLLM, FlashAttention, XAttention), and mixing slimming with streaming attention and head pruning is orthogonal in effect.
References
- "KVCrush: Key value cache size-reduction using similarity in head‑behaviour" (Jha et al., 24 Feb 2025)
- "LightTransfer: Your Long-Context LLM is Secretly a Hybrid Model with Effortless Adaptation" (Zhang et al., 17 Oct 2024)
- "Task-KV: Task-aware KV Cache Optimization via Semantic Differentiation of Attention Heads" (He et al., 25 Jan 2025)
- "MiniCache: KV Cache Compression in Depth Dimension for LLMs" (Liu et al., 23 May 2024)
- "Mustafar: Promoting Unstructured Sparsity for KV Cache Pruning in LLM Inference" (Joo et al., 28 May 2025)
- "Efficient LLaMA‑3.2‑Vision by Trimming Cross-attended Visual Features" (Lee et al., 1 Apr 2025)
- "RazorAttention: Efficient KV Cache Compression Through Retrieval Heads" (Tang et al., 22 Jul 2024)
- "LLMs Know What to Drop: Self‑Attention Guided KV Cache Eviction for Efficient Long‑Context Inference" (Wang et al., 11 Mar 2025)
- "KVSharer: Efficient Inference via Layer-Wise Dissimilar KV Cache Sharing" (Yang et al., 24 Oct 2024)
- "CSKV: Training-Efficient Channel Shrinking for KV Cache in Long‑Context Scenarios" (Wang et al., 16 Sep 2024)
- "LeanK: Learnable K Cache Channel Pruning for Efficient Decoding" (Zhang et al., 4 Aug 2025)
- "KV‑Latent: Dimensional-level KV Cache Reduction with Frequency‑aware Rotary Positional Embedding" (Shi et al., 15 Jul 2025)
- "DBudgetKV: Dynamic Budget in KV Cache Compression for Ensuring Optimal Performance" (Ni et al., 24 Feb 2025)
- "PureKV: Plug-and-Play KV Cache Optimization with Spatial-Temporal Sparse Attention for Vision-Language Large Models" (Jiang et al., 29 Oct 2025)
- "KVzip: Query-Agnostic KV Cache Compression with Context Reconstruction" (Kim et al., 29 May 2025)
- "SmallKV: Small Model Assisted Compensation of KV Cache Compression for Efficient LLM Inference" (Zhao et al., 3 Aug 2025)
- "KV Cache Transform Coding for Compact Storage in LLM Inference" (Staniszewski et al., 3 Nov 2025)
- "Unifying KV Cache Compression for LLMs with LeanKV" (Zhang et al., 4 Dec 2024)
- "No Token Left Behind: Reliable KV Cache Compression via Importance‑Aware Mixed Precision Quantization" (Yang et al., 28 Feb 2024)
- "InfiniPot‑V: Memory-Constrained KV Cache Compression for Streaming Video Understanding" (Kim et al., 18 Jun 2025)
These works collectively define the best practices and boundaries for current KV cache slimming research in large language and multimodal models.