KV Cache Pruning in Transformers
- KV cache pruning is a technique that selectively removes or compresses cached key and value tensors in transformer models to reduce memory growth and computational overhead.
- Methods span channel-, token-, block-, and modality-level pruning, guided by criteria such as attention scores, mutual information, and optimization-derived saliency.
- Empirical benchmarks show memory reductions up to 5.7× and latency improvements of 19–67% while maintaining accuracy during long-context and multimodal inference.
KV cache pruning refers to the selective removal or compression of cached key (K) and value (V) tensors in transformer-based models to control memory growth and reduce computational burden during autoregressive inference, especially for long contexts or high-dimensional multimodal inputs. As large language models (LLMs) and vision-language models (VLMs) scale to longer sequences and multi-image inputs, naive caching of all keys and values at every layer results in prohibitive memory consumption and increasingly inefficient attention computation. A broad ecosystem of methods has emerged to address this bottleneck, spanning structured and unstructured sparsity, learned and training-free criteria, and static and dynamic scheduling, often yielding substantial acceleration and memory savings with limited or negligible accuracy degradation.
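To make the memory pressure concrete, the following minimal sketch estimates the raw KV cache footprint of a decoder-only model; the model configuration (layer count, head count, head dimension, precision) is an illustrative assumption rather than a reference to any particular system.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    """Raw KV cache size: keys and values cached at every layer (fp16 by default)."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Hypothetical 7B-class configuration with full multi-head KV caching.
size = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=32_768, batch=8)
print(f"{size / 2**30:.1f} GiB")  # 128.0 GiB for this assumed setting
```

Even at batch size 1, this assumed configuration caches 16 GiB per 32K-token sequence; that growth is what the pruning methods below target.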
1. Theoretical Basis and Types of KV Cache Pruning
The optimal retention of key-value states in transformers relies on formalizing a trade-off between memory footprint, computational efficiency, and preservation of model performance. Broadly, KV pruning methods exploit redundancy at different levels and axes of the cache:
- Dimensionality axis: Removal of unimportant feature channels within each token (channel pruning).
- Temporal (token) axis: Discarding or merging less salient tokens as measured by query-key interaction or attention-based heuristics (token eviction).
- Structural (block-wise or page-wise) axis: Aligned removal of fixed-size memory blocks or pages for compatibility with paged memory systems.
- Modality-specific axis: Differentiated pruning for multimodal models, accounting for distinct distributional statistics between modalities.

Across these axes, decision criteria include attention mass, perturbation of attention outputs, mutual information, singular-value energy, and hardware-driven metrics (e.g., accumulative similarity, magnitude, or optimizer-derived norms).
Notably, token importance has historically been predicted using self-attention scores, token ages, or heuristic proxies (norms, cumulative gradients). Recent works formalize pruning as an optimization problem with closed-form scores derived from surrogate perturbation analysis (e.g., "OBCache" (Gu et al., 9 Oct 2025)), per-head significance statistics ("LeanKV" (Zhang et al., 2024)), or attention decomposition for modality-aware filtering (CSP (Pei et al., 2024)).
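As a concrete illustration of the attention-mass heuristics above, the sketch below evicts the cached tokens that have received the least accumulated attention while protecting a recent-token window; the function, thresholds, and tensor layout are illustrative assumptions, not the procedure of any specific paper cited here.

```python
import torch

def evict_by_attention_mass(keys, values, attn_weights, budget, recent_window=32):
    """Keep the `budget` cached tokens with the highest accumulated attention mass,
    always protecting the most recent `recent_window` tokens from eviction.

    keys, values: [num_heads, seq_len, head_dim]
    attn_weights: [num_heads, num_queries, seq_len] post-softmax attention.
    """
    seq_len = keys.shape[1]
    if seq_len <= budget:
        return keys, values

    # Attention mass received by each cached token, summed over heads and queries.
    score = attn_weights.sum(dim=(0, 1))          # [seq_len]
    score[-recent_window:] = float("inf")         # recent tokens are never evicted

    keep = torch.topk(score, k=budget).indices.sort().values
    return keys[:, keep, :], values[:, keep, :]
```

In practice, such scores are typically accumulated incrementally during decoding rather than recomputed from full attention matrices.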
2. Methodological Advances: Structured, Channel, and Modality-Aware Pruning
Channel and Structural Pruning
Both channel-wise and block-structured pruning address redundancy within the feature and memory organization of the cache:
- Channel pruning (ThinK, KVPruner, LeanK, SparK, Mustafar): Selectively removes feature channels per head, often using query-aware or norm-based scores. ThinK (Xu et al., 2024) uses a query-dependent importance score; KVPruner (Lv et al., 2024) exploits global perplexity sensitivities to set per-block/channel budgets; LeanK (Zhang et al., 4 Aug 2025) learns static masks through a two-stage process optimized for hardware alignment. Mustafar (Joo et al., 28 May 2025) applies unstructured, elementwise pruning across both K and V with bitmap-based encoding for efficient custom sparse attention kernels. A simplified sketch of query-aware channel scoring appears after this list.
- Block/page pruning (PagedEviction): PagedEviction (Chitty-Venkata et al., 4 Sep 2025) aligns all pruning with the fixed paging of vLLM and similar systems, selecting blocks/pages for eviction based on static per-block value-to-key norm ratios, so that no kernel modification is needed.
- Joint pruning and quantization: Methods such as Titanus (Chen et al., 23 May 2025) and XStreamVGGT (Su et al., 3 Jan 2026) cascade channel/token-level pruning with mixed-bit quantization, leveraging hardware-software codesign to maximize data transfer and storage efficiency.
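Below is a minimal sketch of query-aware key-channel pruning in the spirit of the channel-level methods above; the scoring rule (magnitude of mean query times mean key per channel) and all names are simplifying assumptions and do not reproduce the exact criterion of ThinK or any other cited method.

```python
import torch

def prune_key_channels(keys, queries, keep_ratio=0.6):
    """Zero out the least important key channels per head, scored by a rough estimate
    of each channel's contribution to recent query-key dot products.

    keys:    [num_heads, seq_len, head_dim]
    queries: [num_heads, num_recent_queries, head_dim] recent query window.
    """
    head_dim = keys.shape[-1]
    k_keep = max(1, int(keep_ratio * head_dim))

    # Crude per-channel contribution estimate: |mean q_c| * |mean k_c| per head.
    contrib = queries.abs().mean(dim=1) * keys.abs().mean(dim=1)   # [num_heads, head_dim]

    keep = torch.topk(contrib, k=k_keep, dim=-1).indices           # [num_heads, k_keep]
    mask = torch.zeros_like(contrib).scatter_(-1, keep, 1.0)       # binary channel mask
    return keys * mask.unsqueeze(1)                                # pruned channels zeroed
```

Zeroing rather than physically dropping channels keeps the example simple; practical systems instead gather the retained channels into a smaller contiguous buffer to realize actual memory savings.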
Modality-Aware and Multimodal Pruning
Multimodal vision-language and video models require pruning mechanisms sensitive to the distributional differences between text and visual tokens:
- Cross/self attention decomposition (CSP): "Cross-Self Pruning" (CSP) (Pei et al., 2024) partitions the attention matrix into intra-modality and inter-modality blocks, separately scoring and masking tokens based on self- and cross-attention mass. This prevents over-pruning of visual tokens, which typically receive weaker attention mass than text tokens.
- Cosine-similarity–driven hierarchical pruning (SharpV): In SharpV (Qin et al., 11 Nov 2025), the KV caches of visual tokens or entire layers are pruned once their representations drift, as measured by cosine similarity, from the original visual embeddings; this keeps the method compatible with hardware kernels that do not expose attention matrices. A simplified sketch of such a drift criterion appears after this list.
- Adaptive visual token survival for hallucination mitigation (PruneHal): PruneHal (Sun et al., 22 Oct 2025) adaptively discards low-attention visual tokens at layer-specific thresholds, dynamically adjusting retention based on the observed distribution of attention mass to enhance factual reliability without loss of visual coverage.
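The following is a minimal sketch of the cosine-similarity drift idea referenced in the SharpV bullet above; the threshold, tensor layout, and the assumption that keys/values are indexed in the same order as the visual tokens are illustrative simplifications rather than the paper's actual procedure.

```python
import torch
import torch.nn.functional as F

def prune_drifted_visual_tokens(keys, values, hidden_states, visual_embeds, tau=0.5):
    """Drop cached visual tokens whose current hidden representations have drifted
    (low cosine similarity) away from their original visual embeddings.

    keys, values:  [num_heads, num_visual_tokens, head_dim]
    hidden_states: [num_visual_tokens, hidden_dim] current-layer representations
    visual_embeds: [num_visual_tokens, hidden_dim] original projected visual features
    """
    sim = F.cosine_similarity(hidden_states, visual_embeds, dim=-1)  # [num_visual_tokens]
    keep = (sim >= tau).nonzero(as_tuple=True)[0]                    # retained token indices
    return keys[:, keep, :], values[:, keep, :]
```

Because the criterion only needs hidden states and the original embeddings, it works unchanged under fused attention kernels that never materialize attention weights.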
3. Pruning Criteria, Algorithms, and Scheduling
Different classes of methods characterize token or channel importance via diverse criteria:
| Criterion / Method | Pruning Target | Key Mathematical/Algorithmic Feature |
|---|---|---|
| Query-driven attention contribution ("ThinK" (Xu et al., 2024), "SparK" (Liao et al., 21 Aug 2025)) | Channel | Prune based on query-key contribution; dynamic or approximate averaging in local window |
| Global perplexity- or Taylor-based ("KVPruner" (Lv et al., 2024)) | Channel, block | Per-channel sensitivity (L1, L2, Taylor, 0–1) aggregated at block level, with LoRA-based recovery |
| Cross/self attention mass decomposition (CSP (Pei et al., 2024)) | Token (modality) | Prune using separate cross- and self-attention blocks for text/vision; n-softmax for distribution fix |
| Frobenius-norm attention preservation (DBudgetKV (Ni et al., 24 Feb 2025)) | Token | Prune maximally while maintaining specific attention-matrix norm (performance-neutral) |
| Hessian/Taylor second-order output sensitivity ("OBCache" (Gu et al., 9 Oct 2025)) | Token | Closed-form pruning scores derived from second-order OBD (Optimal Brain Damage) expansion |
| Mutual information / cosine similarity to input ("SharpV" (Qin et al., 11 Nov 2025)) | Layer, token | Remove KV cache once drift from original visual features exceeds threshold |
| Binary "head-behaviour" clustering (KVCrush (Jha et al., 24 Feb 2025)) | Token (proxy) | Cluster binary head-importance codes, retain a set of token representatives by Hamming similarity |
| Accumulative similarity (UniCAIM (Xu et al., 10 Apr 2025)) | Token (hardware) | On-chip in-place accumulative attention for static/dynamic in-memory pruning |
Implementation schedules may be static (set up front, e.g., offline mask learning in LeanK (Zhang et al., 4 Aug 2025)), dynamic (as in DBudgetKV (Ni et al., 24 Feb 2025), which adapts pruning per layer and input using attention-based stopping metrics), or staged (e.g., the two-phase pipeline of CSP (Pei et al., 2024), with user-tunable hyperparameters Ks, Kc, and recent buffer R).
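As an illustration of a dynamic, attention-preserving stopping rule of the kind DBudgetKV describes, the sketch below grows the kept-token set in score order until the retained columns preserve a target fraction of the full attention matrix's squared Frobenius norm; the exact rule, threshold, and interfaces are assumptions for exposition, not the paper's formulation.

```python
import torch

def dynamic_budget_keep_set(attn_weights, norm_ratio=0.99, recent_window=16):
    """Select the smallest token set whose attention columns preserve `norm_ratio`
    of the full matrix's squared Frobenius norm, always keeping recent tokens.

    attn_weights: [num_heads, num_queries, seq_len] post-softmax attention.
    Returns sorted indices of the tokens to keep.
    """
    seq_len = attn_weights.shape[-1]
    col_energy = attn_weights.pow(2).sum(dim=(0, 1))       # per-token squared contribution
    total = col_energy.sum()

    protected = list(range(seq_len - recent_window, seq_len))
    kept = set(protected)
    acc = col_energy[protected].sum()

    for idx in col_energy.argsort(descending=True).tolist():
        if acc / total >= norm_ratio:
            break
        if idx not in kept:
            kept.add(idx)
            acc = acc + col_energy[idx]
    return torch.tensor(sorted(kept))
```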
In quantized-pruning schemes (Zhang et al., 2024), token count and quantization bitwidth are jointly selected under a fixed memory budget constraint to maximize downstream performance; empirical findings consistently show that "more tokens at lower precision" (e.g., 4× the tokens at 4 bits) outperforms "fewer tokens at higher precision" given the same total memory.
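A small worked example of the memory accounting behind this finding: at a fixed byte budget, halving the bitwidth doubles the number of cacheable tokens, so 4-bit storage holds 4× as many tokens as fp16. The budget and tensor shape below are assumed purely for illustration.

```python
def kv_tokens_under_budget(budget_bytes, n_layers, n_kv_heads, head_dim, bits):
    """Number of cached tokens that fit in a fixed memory budget at a given bitwidth."""
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bits / 8   # K and V
    return int(budget_bytes // bytes_per_token)

budget = 4 * 2**30                                              # assumed 4 GiB KV budget
cfg = dict(n_layers=32, n_kv_heads=8, head_dim=128)
print(kv_tokens_under_budget(budget, bits=16, **cfg))           # 32768 tokens at fp16
print(kv_tokens_under_budget(budget, bits=4, **cfg))            # 131072 tokens at 4 bits
```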
4. Hardware and Systems Integration
A substantial challenge for deployment is achieving memory and compute savings without disrupting backend acceleration, e.g., FlashAttention, block-sparse kernels, or compute-in-memory arrays:
- Paged and block-aligned methods: PagedEviction (Chitty-Venkata et al., 4 Sep 2025) and LeanKV (Zhang et al., 2024) operate natively on paged/block memory layouts, matching attention kernels’ expectations.
- Bitmap and sparse format encoding: Mustafar (Joo et al., 28 May 2025) introduces bitmap-based sparse layouts, combined with custom SpMV-based attention kernels, to exploit unstructured sparsity patterns at runtime; a simplified illustration of bitmap encoding appears after this list.
- CIM/CAM hardware with static-dynamic hybrid schedules: UniCAIM (Xu et al., 10 Apr 2025) implements both static charge–domain and O(1) dynamic CAM-based pruning in a unified FeFET array, supporting dynamic top-k token selection and in-situ similarity accumulation.
- Portability and compatibility with efficient accelerators: PureKV (Jiang et al., 29 Oct 2025) and SharpV (Qin et al., 11 Nov 2025) design token-selection and cache-reduction schemes that do not require access to attention matrices, remaining compatible with fused or block-sparse attention kernels.
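The sketch below illustrates bitmap-style storage for an unstructured-sparse KV tensor: nonzero values are stored densely alongside a presence mask. This is a conceptual toy, not Mustafar's actual on-GPU format; a real kernel would pack the mask to one bit per element rather than using a bool tensor.

```python
import torch

def bitmap_compress(tensor):
    """Store only nonzero elements plus a per-element presence mask."""
    mask = tensor != 0
    return tensor[mask], mask            # values: 1-D (row-major order), mask: bool tensor

def bitmap_decompress(values, mask):
    """Reconstruct the dense tensor from (values, mask)."""
    out = torch.zeros(mask.shape, dtype=values.dtype, device=values.device)
    out[mask] = values
    return out

# Example: a key-cache tile with ~70% unstructured sparsity after pruning (assumed shape).
k_tile = torch.randn(128, 128)
k_tile[torch.rand_like(k_tile) < 0.7] = 0.0
vals, bm = bitmap_compress(k_tile)
assert torch.equal(bitmap_decompress(vals, bm), k_tile)
```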
5. Empirical Impact, Benchmarks, and Trade-offs
KV cache pruning consistently achieves drastic memory and throughput improvements across model, benchmark, and hardware configurations:
- Compression/acceleration factors: Memory reductions of 2.7×–5.7× (LeanKV (Zhang et al., 2024)), up to 50–70% (Mustafar (Joo et al., 28 May 2025), CSP (Pei et al., 2024), XStreamVGGT (Su et al., 3 Jan 2026)), and up to 11× under tolerable accuracy drop (Zhang et al., 2024); latency improvements of 19–67% (CSP, Mustafar, XStreamVGGT).
- Accuracy preservation: Most schemes retain near full-cache performance under moderate pruning (pruning ratios up to roughly 70% for LeanK/ThinK/Mustafar; cache retention as low as 25–40% for CSP/PureKV in multimodal setups), with degradation appearing only at extreme pruning ratios.
- Generalization and stability: Methods such as quantized-pruning (Zhang et al., 2024) and KVCrush (Jha et al., 24 Feb 2025) are robust under varying token budgets, quantization, and model scales, with plug-and-play integration with existing token eviction and quantization pipelines.
- Benchmarks: LongBench, MileBench, RULER, Needle-in-a-Haystack, and complex multi-modal datasets remain standard fixtures for quality assessment.
Several practical guidelines have been empirically validated, such as maintaining a non-pruned recent-token window (R = 64–128 (Pei et al., 2024)), task- and modality-adaptive allocation (balancing cross/self tokens, layer- and head-wise thresholds), and choosing per-layer, per-head dynamic schedules based on calibration runs.
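As a toy illustration of calibration-driven per-layer scheduling, the sketch below splits a global token budget across layers in proportion to per-layer importance scores measured on a calibration run; the scores, the scoring metric, and the minimum floor are hypothetical.

```python
def allocate_layer_budgets(calib_scores, total_budget, min_per_layer=128):
    """Split a global KV token budget across layers proportionally to
    calibration-run importance scores (e.g., average attention entropy)."""
    total = sum(calib_scores)
    # Proportional share per layer, clamped to a floor; rounding means the sum may
    # deviate slightly from total_budget (acceptable for a sketch).
    return [max(min_per_layer, int(total_budget * s / total)) for s in calib_scores]

# Hypothetical calibration scores for a 4-layer toy model.
print(allocate_layer_budgets([0.9, 0.5, 0.3, 0.3], total_budget=4096))
# -> [1843, 1024, 614, 614]
```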
6. Toward Unified and Future-Principled Pruning
KV cache pruning is evolving towards unified policies combining multi-axis sparsification (token, channel, modality, and layer), multi-tier quantization, and dynamic, input-adaptive budget mechanisms. Representative directions include:
- Joint token-channel-precision schedules as in LeanKV (Zhang et al., 2024), integrating heterogeneous treatment of keys versus values ("hetero-KV"), token saliency, and per-head precision allocation.
- Performance-neutral, dynamic budgets (DBudgetKV (Ni et al., 24 Feb 2025)) that halt pruning just before any measurable degradation in attention fidelity or task performance, automatically tuning cache size per input and task.
- Output- and information-aware saliency metrics (OBCache (Gu et al., 9 Oct 2025), SharpV (Qin et al., 11 Nov 2025)) that go beyond attention weight proxies to explicitly predict the change in output or information bottleneck for each pruning candidate.
- Algorithm–hardware co-design in compute-in-memory, parallel prefix-sum allocators, and compressed-index updaters to close the gap between theoretical memory savings and realized throughput on modern accelerator platforms.
Limitations persist regarding the brittleness of heuristics under domain shift, instability under aggressive pruning (especially in early layers or for exact-match generative QA), and the need for further theoretical work connecting saliency estimation with global generation fidelity. Nevertheless, KV cache pruning has emerged as a key enabler for scaling sequence length, batch size, and multimodal context windows in practical LLM and VLM deployments while tightly controlling serving costs and latency.
References:
- Cross-Self Pruning (CSP) (Pei et al., 2024)
- KVPruner (Lv et al., 2024)
- ThinK (Xu et al., 2024)
- PureKV (Jiang et al., 29 Oct 2025)
- PagedEviction (Chitty-Venkata et al., 4 Sep 2025)
- Titanus (Chen et al., 23 May 2025)
- Mustafar (Joo et al., 28 May 2025)
- LeanK (Zhang et al., 4 Aug 2025)
- UniCAIM (Xu et al., 10 Apr 2025)
- KVCrush (Jha et al., 24 Feb 2025)
- XStreamVGGT (Su et al., 3 Jan 2026)
- PruneHal (Sun et al., 22 Oct 2025)
- LeanKV (Zhang et al., 2024)
- DBudgetKV (Ni et al., 24 Feb 2025)
- More Tokens, Lower Precision (Zhang et al., 2024)
- SharpV (Qin et al., 11 Nov 2025)
- SparK (Liao et al., 21 Aug 2025)
- OBCache (Gu et al., 9 Oct 2025)