KV Cache Pruning in Transformers
- KV cache pruning is a technique that selectively removes or compresses cached key and value tensors in transformer models to reduce memory growth and computational overhead.
- Methods span channel-, token-, block-, and modality-level pruning, guided by criteria such as attention scores, mutual information, and optimization-derived saliency.
- Empirical benchmarks show memory reductions up to 5.7× and latency improvements of 19–67% while maintaining accuracy during long-context and multimodal inference.
KV cache pruning refers to the selective removal or compression of cached key (K) and value (V) tensors in transformer-based models to control memory growth and reduce computational burden during autoregressive inference, especially for long contexts or high-dimensional multimodal inputs. As large language models (LLMs) and vision-language models (VLMs) scale to longer sequences and multi-image inputs, naive caching of all keys and values at every layer results in prohibitive memory consumption and increasingly inefficient attention computation. A broad ecosystem of methods has emerged to address this bottleneck, spanning structured and unstructured sparsity, learned and training-free criteria, and static and dynamic scheduling, often yielding substantial acceleration and memory savings with limited or negligible accuracy degradation.
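To make the memory pressure concrete, the following minimal sketch estimates the raw KV cache footprint of a decoder-only model; the model configuration (layer count, head count, head dimension, precision) is an illustrative assumption rather than a reference to any particular system.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    """Raw KV cache size: keys and values cached at every layer (fp16 by default)."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Hypothetical 7B-class configuration with full multi-head KV caching.
size = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=32_768, batch=8)
print(f"{size / 2**30:.1f} GiB")  # 128.0 GiB for this assumed setting
```

Even at batch size 1, this assumed configuration caches 16 GiB per 32K-token sequence; that growth is what the pruning methods below target.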
1. Theoretical Basis and Types of KV Cache Pruning
The optimal retention of key-value states in transformers relies on formalizing a trade-off between memory footprint, computational efficiency, and preservation of model performance. Broadly, KV pruning methods exploit redundancy at different levels and axes of the cache:
- Dimensionality axis: Removal of unimportant feature channels within each token (channel pruning).
- Temporal (token) axis: Discarding or merging less salient tokens as measured by query-key interaction or attention-based heuristics (token eviction).
- Structural (block-wise or page-wise) axis: Aligned removal of fixed-size memory blocks or pages for compatibility with paged memory systems.
- Modality-specific axis: Differentiated pruning for multimodal models, accounting for distinct distributional statistics between modalities.

Across these axes, decision criteria include attention mass, perturbation of attention outputs, mutual information, singular-value energy, and hardware-driven metrics (e.g., accumulative similarity, magnitude, or optimizer-derived norms).
Notably, token importance has historically been predicted using self-attention scores, token ages, or heuristic proxies (norms, cumulative gradients). Recent works formalize pruning as an optimization problem with closed-form scores derived from surrogate perturbation analysis (e.g., "OBCache" (Gu et al., 9 Oct 2025)), per-head significance statistics ("LeanKV" (Zhang et al., 2024)), or attention decomposition for modality-aware filtering (CSP (Pei et al., 2024)).
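As a concrete illustration of the attention-mass heuristics above, the sketch below evicts the cached tokens that have received the least accumulated attention while protecting a recent-token window; the function, thresholds, and tensor layout are illustrative assumptions, not the procedure of any specific paper cited here.

```python
import torch

def evict_by_attention_mass(keys, values, attn_weights, budget, recent_window=32):
    """Keep the `budget` cached tokens with the highest accumulated attention mass,
    always protecting the most recent `recent_window` tokens from eviction.

    keys, values: [num_heads, seq_len, head_dim]
    attn_weights: [num_heads, num_queries, seq_len] post-softmax attention.
    """
    seq_len = keys.shape[1]
    if seq_len <= budget:
        return keys, values

    # Attention mass received by each cached token, summed over heads and queries.
    score = attn_weights.sum(dim=(0, 1))          # [seq_len]
    score[-recent_window:] = float("inf")         # recent tokens are never evicted

    keep = torch.topk(score, k=budget).indices.sort().values
    return keys[:, keep, :], values[:, keep, :]
```

In practice, such scores are typically accumulated incrementally during decoding rather than recomputed from full attention matrices.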
2. Methodological Advances: Structured, Channel, and Modality-Aware Pruning
Channel and Structural Pruning
Both channel-wise and block-structured pruning address redundancy within the feature and memory organization of the cache:
- Channel pruning (ThinK, KVPruner, LeanK, SparK, Mustafar): Selectively removes feature channels per head, often using query-aware or norm-based scores. ThinK (Xu et al., 2024) uses a query-dependent importance score; KVPruner (Lv et al., 2024) exploits global perplexity sensitivities to set per-block/channel budgets; LeanK (Zhang et al., 4 Aug 2025) learns static masks through a two-stage process optimized for hardware alignment. Mustafar (Joo et al., 28 May 2025) applies unstructured, elementwise pruning across both K and V with bitmap-based encoding for efficient custom sparse attention kernels. A simplified sketch of query-aware channel scoring appears after this list.
- Block/page pruning (PagedEviction): PagedEviction (Chitty-Venkata et al., 4 Sep 2025) aligns all pruning with the fixed paging of vLLM and similar systems, selecting blocks/pages for eviction based on static per-block value-to-key norm ratios, so that no kernel modification is needed.
- Joint pruning and quantization: Methods such as Titanus (Chen et al., 23 May 2025) and XStreamVGGT (Su et al., 3 Jan 2026) cascade channel/token-level pruning with mixed-bit quantization, leveraging hardware-software codesign to maximize data transfer and storage efficiency.
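Below is a minimal sketch of query-aware key-channel pruning in the spirit of the channel-level methods above; the scoring rule (magnitude of mean query times mean key per channel) and all names are simplifying assumptions and do not reproduce the exact criterion of ThinK or any other cited method.

```python
import torch

def prune_key_channels(keys, queries, keep_ratio=0.6):
    """Zero out the least important key channels per head, scored by a rough estimate
    of each channel's contribution to recent query-key dot products.

    keys:    [num_heads, seq_len, head_dim]
    queries: [num_heads, num_recent_queries, head_dim] recent query window.
    """
    head_dim = keys.shape[-1]
    k_keep = max(1, int(keep_ratio * head_dim))

    # Crude per-channel contribution estimate: |mean q_c| * |mean k_c| per head.
    contrib = queries.abs().mean(dim=1) * keys.abs().mean(dim=1)   # [num_heads, head_dim]

    keep = torch.topk(contrib, k=k_keep, dim=-1).indices           # [num_heads, k_keep]
    mask = torch.zeros_like(contrib).scatter_(-1, keep, 1.0)       # binary channel mask
    return keys * mask.unsqueeze(1)                                # pruned channels zeroed
```

Zeroing rather than physically dropping channels keeps the example simple; practical systems instead gather the retained channels into a smaller contiguous buffer to realize actual memory savings.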
Modality-Aware and Multimodal Pruning
Multimodal vision-language and video models require pruning mechanisms sensitive to the distributional differences between text and visual tokens:
- Cross/self attention decomposition (CSP): "Cross-Self Pruning" (CSP) (Pei et al., 2024) partitions the attention matrix into intra-modality and inter-modality blocks, separately scoring and masking tokens based on self- and cross-attention mass. This prevents over-pruning of visual tokens, which typically receive weaker attention mass than text tokens.
- Cosine-similarity–driven hierarchical pruning (SharpV): In SharpV (Qin et al., 11 Nov 2025), the KV caches of visual tokens or entire layers are pruned once their representations drift, as measured by cosine similarity, from the original visual embeddings; this keeps the method compatible with hardware kernels that do not expose attention matrices. A simplified sketch of such a drift criterion appears after this list.
- Adaptive visual token survival for hallucination mitigation (PruneHal): PruneHal (Sun et al., 22 Oct 2025) adaptively discards low-attention visual tokens at layer-specific thresholds, dynamically adjusting retention based on the observed distribution of attention mass to enhance factual reliability without loss of visual coverage.
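The following is a minimal sketch of the cosine-similarity drift idea referenced in the SharpV bullet above; the threshold, tensor layout, and the assumption that keys/values are indexed in the same order as the visual tokens are illustrative simplifications rather than the paper's actual procedure.

```python
import torch
import torch.nn.functional as F

def prune_drifted_visual_tokens(keys, values, hidden_states, visual_embeds, tau=0.5):
    """Drop cached visual tokens whose current hidden representations have drifted
    (low cosine similarity) away from their original visual embeddings.

    keys, values:  [num_heads, num_visual_tokens, head_dim]
    hidden_states: [num_visual_tokens, hidden_dim] current-layer representations
    visual_embeds: [num_visual_tokens, hidden_dim] original projected visual features
    """
    sim = F.cosine_similarity(hidden_states, visual_embeds, dim=-1)  # [num_visual_tokens]
    keep = (sim >= tau).nonzero(as_tuple=True)[0]                    # retained token indices
    return keys[:, keep, :], values[:, keep, :]
```

Because the criterion only needs hidden states and the original embeddings, it works unchanged under fused attention kernels that never materialize attention weights.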
3. Pruning Criteria, Algorithms, and Scheduling
Different classes of methods characterize token or channel importance via diverse criteria:
| Criterion / Method | Pruning Target | Key Mathematical/Algorithmic Feature |
|---|---|---|
| Query-driven attention contribution ("ThinK" (Xu et al., 2024), "SparK" (Liao et al., 21 Aug 2025)) | Channel | Prune based on query-key contribution; dynamic or approximate averaging in local window |
| Global perplexity- or Taylor-based ("KVPruner" (Lv et al., 2024)) | Channel, block | Per-channel sensitivity (L1, L2, Taylor, 0–1) aggregated at block level, with LoRA-based recovery |
| Cross/self attention mass decomposition (CSP (Pei et al., 2024)) | Token (modality) | Prune using separate cross- and self-attention blocks for text/vision; n-softmax for distribution fix |
| Frobenius-norm attention preservation (DBudgetKV (Ni et al., 24 Feb 2025)) | Token | Prune maximally while maintaining specific attention-matrix norm (performance-neutral) |
| Hessian/Taylor second-order output sensitivity ("OBCache" (Gu et al., 9 Oct 2025)) | Token | Closed-form pruning scores derived from second-order OBD (Optimal Brain Damage) expansion |
| Mutual information / cosine similarity to input ("SharpV" (Qin et al., 11 Nov 2025)) | Layer, token | Remove KV cache once drift from original visual features exceeds threshold |
| Binary "head-behaviour" clustering (KVCrush (Jha et al., 24 Feb 2025)) | Token (proxy) | Cluster binary head-importance codes, retain a set of token representatives by Hamming similarity |
| Accumulative similarity (UniCAIM (Xu et al., 10 Apr 2025)) | Token (hardware) | On-chip in-place accumulative attention for static/dynamic in-memory pruning |
Implementation schedules may be static (set up front, e.g., offline mask learning in LeanK (Zhang et al., 4 Aug 2025)), dynamic (as in DBudgetKV (Ni et al., 24 Feb 2025), which adapts pruning per layer and input using attention-based stopping metrics), or staged (e.g., the two-phase pipeline of CSP (Pei et al., 2024), with user-tunable hyperparameters Ks, Kc, and recent buffer R).
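As an illustration of a dynamic, attention-preserving stopping rule of the kind DBudgetKV describes, the sketch below grows the kept-token set in score order until the retained columns preserve a target fraction of the full attention matrix's squared Frobenius norm; the exact rule, threshold, and interfaces are assumptions for exposition, not the paper's formulation.

```python
import torch

def dynamic_budget_keep_set(attn_weights, norm_ratio=0.99, recent_window=16):
    """Select the smallest token set whose attention columns preserve `norm_ratio`
    of the full matrix's squared Frobenius norm, always keeping recent tokens.

    attn_weights: [num_heads, num_queries, seq_len] post-softmax attention.
    Returns sorted indices of the tokens to keep.
    """
    seq_len = attn_weights.shape[-1]
    col_energy = attn_weights.pow(2).sum(dim=(0, 1))       # per-token squared contribution
    total = col_energy.sum()

    protected = list(range(seq_len - recent_window, seq_len))
    kept = set(protected)
    acc = col_energy[protected].sum()

    for idx in col_energy.argsort(descending=True).tolist():
        if acc / total >= norm_ratio:
            break
        if idx not in kept:
            kept.add(idx)
            acc = acc + col_energy[idx]
    return torch.tensor(sorted(kept))
```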
In quantized-pruning schemes (Zhang et al., 2024), token count and quantization bitwidth are jointly selected under a fixed memory budget constraint to maximize downstream performance; empirical findings consistently show that "more tokens at lower precision" (e.g., 4× the tokens at 4 bits) outperforms "fewer tokens at higher precision" given the same total memory.
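A small worked example of the memory accounting behind this finding: at a fixed byte budget, halving the bitwidth doubles the number of cacheable tokens, so 4-bit storage holds 4× as many tokens as fp16. The budget and tensor shape below are assumed purely for illustration.

```python
def kv_tokens_under_budget(budget_bytes, n_layers, n_kv_heads, head_dim, bits):
    """Number of cached tokens that fit in a fixed memory budget at a given bitwidth."""
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bits / 8   # K and V
    return int(budget_bytes // bytes_per_token)

budget = 4 * 2**30                                              # assumed 4 GiB KV budget
cfg = dict(n_layers=32, n_kv_heads=8, head_dim=128)
print(kv_tokens_under_budget(budget, bits=16, **cfg))           # 32768 tokens at fp16
print(kv_tokens_under_budget(budget, bits=4, **cfg))            # 131072 tokens at 4 bits
```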
4. Hardware and Systems Integration
A substantial challenge for deployment is achieving memory and compute savings without disrupting backend acceleration, e.g., FlashAttention, block-sparse kernels, or compute-in-memory arrays:
- Paged and block-aligned methods: PagedEviction (Chitty-Venkata et al., 4 Sep 2025) and LeanKV (Zhang et al., 2024) operate natively on paged/block memory layouts, matching attention kernels’ expectations.
- Bitmap and sparse format encoding: Mustafar (Joo et al., 28 May 2025) introduces bitmap-based sparse layouts, combined with custom SpMV-based attention kernels, to exploit unstructured sparsity patterns at runtime; a simplified illustration of bitmap encoding appears after this list.
- CIM/CAM hardware with static-dynamic hybrid schedules: UniCAIM (Xu et al., 10 Apr 2025) implements both static charge–domain and O(1) dynamic CAM-based pruning in a unified FeFET array, supporting dynamic top-k token selection and in-situ similarity accumulation.
- Portability and compatibility with efficient accelerators: PureKV (Jiang et al., 29 Oct 2025) and SharpV (Qin et al., 11 Nov 2025) design token-selection and cache-reduction schemes that do not require access to attention matrices, remaining compatible with fused or block-sparse attention kernels.
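The sketch below illustrates bitmap-style storage for an unstructured-sparse KV tensor: nonzero values are stored densely alongside a presence mask. This is a conceptual toy, not Mustafar's actual on-GPU format; a real kernel would pack the mask to one bit per element rather than using a bool tensor.

```python
import torch

def bitmap_compress(tensor):
    """Store only nonzero elements plus a per-element presence mask."""
    mask = tensor != 0
    return tensor[mask], mask            # values: 1-D (row-major order), mask: bool tensor

def bitmap_decompress(values, mask):
    """Reconstruct the dense tensor from (values, mask)."""
    out = torch.zeros(mask.shape, dtype=values.dtype, device=values.device)
    out[mask] = values
    return out

# Example: a key-cache tile with ~70% unstructured sparsity after pruning (assumed shape).
k_tile = torch.randn(128, 128)
k_tile[torch.rand_like(k_tile) < 0.7] = 0.0
vals, bm = bitmap_compress(k_tile)
assert torch.equal(bitmap_decompress(vals, bm), k_tile)
```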
5. Empirical Impact, Benchmarks, and Trade-offs
KV cache pruning consistently achieves drastic memory and throughput improvements across model, benchmark, and hardware configurations:
- Compression/acceleration factors: Memory reductions of 2.7×–5.7× (LeanKV (Zhang et al., 2024)), up to 50–70% (Mustafar (Joo et al., 28 May 2025), CSP (Pei et al., 2024), XStreamVGGT (Su et al., 3 Jan 2026)), and up to 11× under tolerable accuracy drop (Zhang et al., 2024); latency improvements of 19–67% (CSP, Mustafar, XStreamVGGT).
- Accuracy preservation: Most schemes retain near full-cache performance under moderate pruning (pruning ratios up to roughly 70% for LeanK/ThinK/Mustafar; cache retention as low as 25–40% for CSP/PureKV in multimodal setups), with degradation appearing only at extreme pruning ratios.
- Generalization and stability: Methods such as quantized-pruning (Zhang et al., 2024) and KVCrush (Jha et al., 24 Feb 2025) are robust under varying token budgets, quantization, and model scales, with plug-and-play integration with existing token eviction and quantization pipelines.
- Benchmarks: LongBench, MileBench, RULER, Needle-in-a-Haystack, and complex multi-modal datasets remain standard fixtures for quality assessment.
Several practical guidelines have been empirically validated, such as maintaining a non-pruned recent-token window (R = 64–128 (Pei et al., 2024)), task- and modality-adaptive allocation (balancing cross/self tokens, layer- and head-wise thresholds), and choosing per-layer, per-head dynamic schedules based on calibration runs.
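As a toy illustration of calibration-driven per-layer scheduling, the sketch below splits a global token budget across layers in proportion to per-layer importance scores measured on a calibration run; the scores, the scoring metric, and the minimum floor are hypothetical.

```python
def allocate_layer_budgets(calib_scores, total_budget, min_per_layer=128):
    """Split a global KV token budget across layers proportionally to
    calibration-run importance scores (e.g., average attention entropy)."""
    total = sum(calib_scores)
    # Proportional share per layer, clamped to a floor; rounding means the sum may
    # deviate slightly from total_budget (acceptable for a sketch).
    return [max(min_per_layer, int(total_budget * s / total)) for s in calib_scores]

# Hypothetical calibration scores for a 4-layer toy model.
print(allocate_layer_budgets([0.9, 0.5, 0.3, 0.3], total_budget=4096))
# -> [1843, 1024, 614, 614]
```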
6. Toward Unified and Future-Principled Pruning
KV cache pruning is evolving towards unified policies combining multi-axis sparsification (token, channel, modality, and layer), multi-tier quantization, and dynamic, input-adaptive budget mechanisms. Representative directions include:
- Joint token-channel-precision schedules as in LeanKV (Zhang et al., 2024), integrating heterogeneous treatment of keys versus values ("hetero-KV"), token saliency, and per-head precision allocation.
- Performance-neutral, dynamic budgets (DBudgetKV (Ni et al., 24 Feb 2025)) that halt pruning just before any measurable degradation in attention fidelity or task performance, automatically tuning cache size per input and task.
- Output- and information-aware saliency metrics (OBCache (Gu et al., 9 Oct 2025), SharpV (Qin et al., 11 Nov 2025)) that go beyond attention weight proxies to explicitly predict the change in output or information bottleneck for each pruning candidate.
- Algorithm–hardware co-design in compute-in-memory, parallel prefix-sum allocators, and compressed-index updaters to close the gap between theoretical memory savings and realized throughput on modern accelerator platforms.
Limitations persist regarding the brittleness of heuristics under domain shift, instability under aggressive pruning (especially in early layers or for exact-match generative QA), and the need for further theoretical work connecting saliency estimation with global generation fidelity. Nevertheless, KV cache pruning has emerged as a key enabler for scaling sequence length, batch size, and multimodal context windows in practical LLM and VLM deployments while tightly controlling serving costs and latency.
References:
- Cross-Self Pruning (CSP) (Pei et al., 2024)
- KVPruner (Lv et al., 2024)
- ThinK (Xu et al., 2024)
- PureKV (Jiang et al., 29 Oct 2025)
- PagedEviction (Chitty-Venkata et al., 4 Sep 2025)
- Titanus (Chen et al., 23 May 2025)
- Mustafar (Joo et al., 28 May 2025)
- LeanK (Zhang et al., 4 Aug 2025)
- UniCAIM (Xu et al., 10 Apr 2025)
- KVCrush (Jha et al., 24 Feb 2025)
- XStreamVGGT (Su et al., 3 Jan 2026)
- PruneHal (Sun et al., 22 Oct 2025)
- LeanKV (Zhang et al., 2024)
- DBudgetKV (Ni et al., 24 Feb 2025)
- More Tokens, Lower Precision (Zhang et al., 2024)
- SharpV (Qin et al., 11 Nov 2025)
- SparK (Liao et al., 21 Aug 2025)
- OBCache (Gu et al., 9 Oct 2025)