TRIM-KV: Efficient Transformer Cache Techniques

Updated 4 December 2025
  • TRIM-KV is a dual-method approach that prunes redundant visual tokens in LVLMs and employs learnable retention gates in LLMs for efficient KV cache management.
  • The visual token pruning method ranks tokens by attention weights to retain only the most informative keys and values, reducing memory usage and latency with minimal accuracy loss.
  • Retention-gated cache uses exponential decay of token importance with learned gates to evict less relevant tokens, enabling scalable and robust autoregressive decoding in long-context settings.

TRIM-KV refers to two distinct methods for key-value (KV) cache reduction in modern transformer architectures: (1) pruning redundant visual tokens in cross-attention-based large vision-language models (LVLMs) to shrink the visual feature cache without retraining (Lee et al., 1 Apr 2025), and (2) learnable retention-based KV eviction for memory-bounded inference in LLMs, in which retention gates score each token and decay its importance over time, enabling efficient and robust autoregressive decoding in long-context settings (Bui et al., 3 Dec 2025).

1. Cross-attention-based TRIM-KV in LVLMs

In cross-attention LVLMs (e.g., LLaMA-3.2-Vision-Instruct), visual tokens produced by a vision encoder are injected into the LLM’s hidden states at each cross-attention layer. For $n_k$ image tokens and hidden dimension $d$, cross-attention layers compute queries $Q \in \mathbb{R}^{n \times d}$ from text, and keys/values $K, V \in \mathbb{R}^{n_k \times d}$ from visual features via learned projections. These visual key–value (KV) pairs are cached at inference, leading to a memory bottleneck as $n_k$ becomes large.

Mathematical formulation

For each head, scaled dot-product attention is computed as:

$$A = \mathrm{softmax}\left( \frac{Q K^\top}{\sqrt{d}} \right)$$

where $A \in \mathbb{R}^{n \times n_k}$. The output is:

$$\mathrm{Attention}(Q, K, V) = A V$$

Empirical analysis reveals that $A$ is highly sparse: a subset of visual tokens dominates attention (visible as striped patterns), and within each head this sparsity pattern is consistent across layers beyond the initial blocks.
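The attention map described above can be reproduced on toy data. The following is a minimal single-head sketch in plain Python, assuming list-of-lists matrices; the function names and the input values are illustrative and not taken from the paper.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(Q, K, V):
    """Single-head scaled dot-product attention.

    Q: n x d text queries; K, V: n_k x d visual keys/values.
    Returns (A, out): A is n x n_k, out is n x d.
    """
    d = len(Q[0])
    scale = 1.0 / math.sqrt(d)
    A = [
        softmax([scale * sum(q * k for q, k in zip(q_row, k_row)) for k_row in K])
        for q_row in Q
    ]
    out = [
        [sum(a * v_row[c] for a, v_row in zip(a_row, V)) for c in range(len(V[0]))]
        for a_row in A
    ]
    return A, out

# Toy inputs (illustrative only): n = 2 queries, n_k = 3 visual tokens, d = 2.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
A, out = cross_attention(Q, K, V)
```

Each row of `A` is a probability distribution over the visual tokens. In this toy case each query concentrates most of its weight on a single key, which is the kind of sparsity the pruning method exploits.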

Attention-guided token importance

The cumulative importance of each visual token $i$ in head $h$ is:

$$s_i^{(h)} = \sum_{j=1}^{n} A_{j,i}^{(h)}$$

where $A_{j,i}^{(h)}$ is the attention weight assigned to visual token $i$ by query $j$ in head $h$, and $n$ is the number of query tokens. These $s_i^{(h)}$ are computed from the first cross-attention layer only.
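As a sketch of the importance score, the snippet below sums attention weights over queries for every head and visual token. The nested-list layout `A_heads[h][j][i]` (head, query, visual token) and the function name are assumptions made for illustration.

```python
def token_importance(A_heads):
    """Sum attention over queries for each (head, visual token) pair.

    A_heads[h][j][i] is the weight query j gives visual token i in head h.
    Returns scores[h][i].
    """
    return [[sum(column) for column in zip(*A_h)] for A_h in A_heads]

# Two heads, two queries, three visual tokens (illustrative values).
A_heads = [
    [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1]],  # head 0 favours token 0
    [[0.1, 0.1, 0.8], [0.2, 0.1, 0.7]],  # head 1 favours token 2
]
scores = token_importance(A_heads)
```

Because every attention row sums to 1, the scores of a head sum to the number of queries, so they can be compared across tokens within that head.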

TRIM-KV visual token pruning algorithm

Tokens are pruned by head-wise ranking:

  1. For each of the $H$ heads, select the top-$k$ visual tokens by $s_i^{(h)}$ (with $k = \rho\, n_k$ for a retention ratio $\rho \in (0, 1]$).
  2. The union $S = \bigcup_{h=1}^{H} S_h$ (where $S_h$ are the top-$k$ indices in head $h$) identifies the tokens to retain.
  3. The pruned KV cache in all subsequent cross-attention layers uses only these tokens: $K' = K[S, :]$, $V' = V[S, :]$.
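The three steps above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: `select_tokens` and `prune_kv` are hypothetical helper names, and per-head scores are assumed to be given as nested lists.

```python
def select_tokens(scores, k):
    """Union of per-head top-k visual token indices, sorted ascending."""
    keep = set()
    for head_scores in scores:
        ranked = sorted(
            range(len(head_scores)), key=lambda i: head_scores[i], reverse=True
        )
        keep.update(ranked[:k])
    return sorted(keep)

def prune_kv(K, V, keep):
    """Keep only the selected rows of the visual key and value matrices."""
    return [K[i] for i in keep], [V[i] for i in keep]

# Two heads, four visual tokens (illustrative values).
scores = [[1.3, 0.5, 0.2, 0.0], [0.3, 0.2, 1.5, 0.0]]
K = [[float(i), 0.0] for i in range(4)]
V = [[0.0, float(i)] for i in range(4)]

keep = select_tokens(scores, k=1)
K_pruned, V_pruned = prune_kv(K, V, keep)
```

Because the retained set is a union over heads, its size lies between `k` and the smaller of the head count times `k` and the number of visual tokens. The realised retention rate can therefore exceed the per-head ratio.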

Empirical performance

In LLaMA-3.2-11B-Vision-Instruct, TRIM-KV with a suitable per-head $k$ keeps benchmark accuracy (e.g., SEED-Bench, MME, MMVP, LLaVA-Bench) within 0.3–1.2% of baseline, even at 40% token retention. With 50% of visual features retained, first-token inference latency decreases by 4.1% (batch=1) and by up to 19.7% (batch=32); memory requirements for the visual KV cache decline proportionally (Lee et al., 1 Apr 2025).

Random pruning and fixed spatial sampling underperform attention-guided selection by 5–11% on core benchmarks, demonstrating the necessity of data-driven token importance ranking.

2. Retention-gated TRIM-KV for Self-attention in LLMs

In decoder-only transformers, each new token’s key and value vectors are appended to a KV cache that grows linearly in the context length $T$, while attention computation grows quadratically in $T$. This severely constrains scalability for long-range reasoning, prolonged dialog, and generative tasks.
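To give a sense of scale for the linear cache growth, here is a back-of-the-envelope sketch. The layer count, KV-head count, head dimension, and 2-byte element size are hypothetical values chosen for illustration; they are not taken from either paper.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Bytes needed to store keys and values for one sequence across all layers."""
    # The leading factor 2 accounts for storing both keys and values.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

# Hypothetical configuration: 32 layers, 8 KV heads, head dimension 128, fp16.
full = kv_cache_bytes(32, 8, 128, seq_len=32_768)
bounded = kv_cache_bytes(32, 8, 128, seq_len=4_096)

print(full / 2**30, "GiB for the full 32K-token cache")
print(bounded / 2**30, "GiB for a 4K-token bounded cache")
```

Under these assumptions the footprint scales exactly with the number of cached tokens, so capping the cache at one eighth of the context cuts memory by the same factor.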

Retention gate mechanism

Each attention head is augmented with a lightweight retention gate $\beta_i = g_\phi(x_i) \in [0, 1]$, where $x_i$ is the hidden state at timestep $i$. The decayed importance of token $i$ at a future position $t \ge i$ is modeled as an exponential decay, $\beta_i^{\,t-i}$, with higher $\beta_i$ meaning slower decay of importance (tokens “stick” in the cache longer).
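A short numerical sketch makes the decay behaviour concrete. It assumes the decayed importance of a token created at position i and viewed at position t is the gate output raised to the power t − i, which is one reading of the exponential decay described above; the function name and the example gate values are illustrative.

```python
def decayed_retention(beta, created_at, now):
    """Exponentially decayed importance: beta ** (now - created_at)."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    if now < created_at:
        raise ValueError("now must not precede created_at")
    return beta ** (now - created_at)

# Compare a "sticky" token with a quickly forgotten one (illustrative values).
sticky = [decayed_retention(0.99, 0, t) for t in (0, 10, 100)]
filler = [decayed_retention(0.50, 0, t) for t in (0, 10, 100)]
```

Both tokens start at importance 1 when created. After 100 steps the token with gate value 0.99 still retains roughly a third of its importance, while the one with 0.5 is effectively zero.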

Training protocol

Starting from a frozen pretrained backbone, only the retention-gate parameters $\phi$ are trained:

  • Distillation/quality loss: The retention-gated model’s output distributions ($\hat{p}_t$) are matched to the original model’s outputs ($p_t$) via KL divergence plus cross-entropy on next-token prediction.
  • Capacity loss: For a fixed budget $M$, encourage the sum of decayed retentions at each timestep $t$ not to exceed $M$, e.g. via a hinge penalty $\mathcal{L}_{\text{cap}} = \sum_{t} \max\!\left(0,\ \sum_{i \le t} \beta_i^{\,t-i} - M\right)$. Combined objective: $\mathcal{L} = \mathcal{L}_{\text{quality}} + \lambda\, \mathcal{L}_{\text{cap}}$.
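The capacity term can be sketched as a hinge penalty on the expected cache occupancy. The exact functional form, normalisation, and weighting used by the authors are not recoverable from the text here, so the hinge, the unnormalised sum over timesteps, and the weight `lam` below are all assumptions.

```python
def capacity_loss(betas, budget):
    """Hinge penalty on decayed-retention mass exceeding the budget, summed over t.

    betas[i] is the gate output for the token created at position i.
    """
    loss = 0.0
    for t in range(len(betas)):
        # Expected occupancy at step t: sum of decayed retentions of tokens 0..t.
        occupancy = sum(betas[i] ** (t - i) for i in range(t + 1))
        loss += max(0.0, occupancy - budget)
    return loss

def combined_loss(quality, betas, budget, lam=1.0):
    """Quality (distillation) loss plus a weighted capacity penalty (lam is hypothetical)."""
    return quality + lam * capacity_loss(betas, budget)

never_forget = capacity_loss([1.0, 1.0, 1.0], budget=2)  # occupancy 1, 2, 3
fast_decay = capacity_loss([0.5, 0.5, 0.5], budget=2)    # occupancy 1, 1.5, 1.75
```

Gates that never decay push occupancy past the budget and are penalised, while fast-decaying gates incur no penalty. The quality loss pulls in the opposite direction, so training has to trade the two off.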

Inference and eviction

At inference, only the retention scores $\beta_i$ are used for cache eviction within a fixed-size buffer:

  1. For each new token at position $t$, compute $\beta_t$ via the gate.
  2. Append the new $k_t$, $v_t$, $\beta_t$ to the cache.
  3. When capacity is exceeded, evict the token $i^\ast$ with minimal decayed retention $\beta_i^{\,t-i}$.
  4. Apply standard attention over the retained set.

Algorithmic complexity is $O(M)$ for the cache and $O(M)$ per step for eviction, with the MLP-based $\beta$ calculation fused with the QKV projection (Bui et al., 3 Dec 2025).
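The eviction loop can be sketched as a small fixed-capacity buffer. This is an illustrative sketch under stated assumptions, not the authors' kernel: the class name `RetentionCache` is hypothetical, gate outputs are supplied by the caller, and the decayed retention is assumed to be the gate value raised to the token's age.

```python
class RetentionCache:
    """Fixed-capacity KV buffer that evicts the entry with the lowest decayed retention."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.entries = []  # each entry: (position, key, value, beta)

    def append(self, position, key, value, beta):
        """Add a token; return the evicted position, or None if nothing was evicted."""
        self.entries.append((position, key, value, beta))
        if len(self.entries) <= self.capacity:
            return None
        # Linear scan over the buffer for the smallest beta ** age.
        victim = min(
            range(len(self.entries)),
            key=lambda idx: self.entries[idx][3] ** (position - self.entries[idx][0]),
        )
        return self.entries.pop(victim)[0]

    def positions(self):
        return [entry[0] for entry in self.entries]

cache = RetentionCache(capacity=2)
gate_outputs = [0.99, 0.2, 0.9, 0.5]  # illustrative values
evicted = [
    cache.append(t, key=f"k{t}", value=f"v{t}", beta=b)
    for t, b in enumerate(gate_outputs)
]
```

In this toy run the first token, which has a high gate value, survives while later tokens with lower values are evicted, so the buffer never holds more than `capacity` entries. A newly appended token has age zero and therefore decayed retention 1, so it is not evicted ahead of an older token.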

Quantitative results

Across mathematical reasoning (AIME24, GSM8K, MATH-500), procedural long-text generation, and long-memory benchmarks (LongMemEval, LongBench, SCBench), TRIM-KV consistently surpasses heuristic baselines (StreamingLLM, H2O, SnapKV, R-KV) and learnable retrieval (SeerAttn-R) at all budgets. Notably, under a bounded cache budget, TRIM-KV yields higher accuracy than full-cache inference (74% vs 65.5% on AIME24). On LongBench-v2, TRIM-KV exceeds full-cache inference by +6.5% overall accuracy, indicating that retention-based regularization can suppress noise from uninformative tokens.

Throughput matches or exceeds other efficient schemes: TRIM-KV achieves 130.5 tokens/s on a single H200 (batch=4, 1K generated tokens, 32K context), compared with 68.4 tokens/s for FullKV and 124.7 tokens/s for SnapKV.

3. Interpretability and Emergent Behaviors

Retention scores $\beta_i$ exhibit interpretable and structured layer/head-specific patterns:

  • Sliding window behavior in early layers (favoring recent tokens).
  • Persistent “sink” tokens in later layers (e.g., start or paragraph markers).
  • “A-shaped” retention around syntactic pivots in question answering.
  • Certain heads prioritize mathematical operators, numerics, or function words, while filler tokens are rapidly evicted (low $\beta_i$).

These heuristics are not hand-crafted but emerge solely from data-driven distillation and the capacity constraint.

4. Limitations and Practical Constraints

Both approaches have notable limitations:

  • Retention gates in LLMs are trained post hoc, with no co-adaptation of the underlying backbone; the attention mechanism remains independent of $\beta$.
  • Efficient hardware support is limited: current implementations assume uniform per-head cache length, and variable-length per-head support is not native to kernel libraries such as FlashAttention.
  • The exponential decay schedule is empirically effective but not necessarily optimal; richer retention dynamics may further improve performance, especially for retrieval-heavy or compressible language tasks where pure eviction-based retention may drop essential content if the capacity $M$ is set too low.
  • The multimodal (vision-language) application is currently based on attention-based heuristics for token selection, not on learnable or adaptive image token retention.

5. Future Directions

Proposed research avenues include:

  • End-to-end pretraining or finetuning where the retention gates and attention weights co-adapt, enhancing the utility of $\beta$ and potentially yielding further accuracy gains under strict relational or multimodal context.
  • Adaptive per-head and per-layer memory budgeting for KV caches subject to global resource constraints.
  • Generalization to richer retention functions (beyond exponential) or context-dependent re-strengthening of retention if tokens receive renewed attention.
  • Application to multimodal architectures (joint image and text), tool-use, and retrieval-augmented models.
  • Advancement of hardware/library support for non-uniform, dynamic-length cache management.

6. Comparative Overview

| Aspect | Visual Token Pruning (Lee et al., 1 Apr 2025) | Retention-Gated Cache (Bui et al., 3 Dec 2025) |
|---|---|---|
| Context | Cross-attention, LVLMs | Self-attention, LLMs |
| Method | Score-and-prune based on attention maps | Learnable gates, exponential decay, eviction |
| Training | No retraining required | Only gates trained (distillation + capacity loss) |
| Resource effect | ~50% memory/computation reduction | Strict $O(M)$ memory, scalable to long contexts |
| Performance | Within 0.3–1.2% of baseline accuracy at 40–60% tokens | Matches/exceeds full KV cache at modest budgets |

A plausible implication is that attention-guided trimming and retention-based cache bounding are converging towards hardware-efficient, highly scalable inference kernels for both text-only and multimodal foundation models, with emergent interpretability.

7. References

References (2)