TRIM-KV: Efficient Transformer Cache Techniques

Updated 4 December 2025
  • TRIM-KV covers two methods: attention-guided pruning of redundant visual tokens in LVLMs and learnable retention gates in LLMs for efficient KV cache management.
  • The visual token pruning method ranks tokens by attention weights to retain only the most informative keys and values, reducing memory usage and latency with minimal accuracy loss.
  • Retention-gated cache uses exponential decay of token importance with learned gates to evict less relevant tokens, enabling scalable and robust autoregressive decoding in long-context settings.

TRIM-KV refers to two distinct, state-of-the-art methods for key-value (KV) cache reduction in modern transformer architectures: (1) pruning redundant visual tokens in cross-attention-based large vision-language models (LVLMs) to optimize visual feature caching without retraining (Lee et al., 1 Apr 2025), and (2) learnable retention-based KV eviction for memory-bounded inference in LLMs, using retention gates that score and decay each token's importance, enabling efficient and robust autoregressive decoding in long-context settings (Bui et al., 3 Dec 2025).

1. Cross-attention-based TRIM-KV in LVLMs

In cross-attention LVLMs (e.g., LLaMA-3.2-Vision-Instruct), visual tokens produced by a vision encoder are injected into the LLM's hidden states at each cross-attention layer. For $n_k$ image tokens and hidden dimension $d$, cross-attention layers compute queries $Q \in \mathbb{R}^{n \times d}$ from text, and keys/values $K, V \in \mathbb{R}^{n_k \times d}$ from visual features via learned projections. These visual key–value (KV) pairs are cached at inference, leading to a memory bottleneck as $n_k$ becomes large.

Mathematical formulation

For each head, scaled dot-product attention is computed as
$$A = \mathrm{softmax}\!\left( \frac{Q K^\top}{\sqrt{d}} \right), \qquad A \in \mathbb{R}^{n \times n_k},$$
and the output is
$$\mathrm{Attention}(Q, K, V) = A V.$$
Empirical analysis reveals that $A$ is highly sparse: a subset of visual tokens dominates attention (striped patterns), and, head-wise, this sparsity pattern is consistent across layers beyond the initial blocks.

Attention-guided token importance

The cumulative importance of each visual token $i$ in head $h$ is
$$p_i^h = \sum_{j=0}^{m-1} \alpha_{i,j}^h,$$
where $\alpha_{i,j}^h$ is the attention weight assigned to visual token $i$ by query $j$ in head $h$, and $m$ is the number of query tokens. These $p_i^h$ are computed from the first cross-attention layer only.
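A minimal PyTorch sketch of this scoring step, assuming the first cross-attention layer's attention map is available as a tensor of shape (H, m, n_k); the function name and tensor layout are illustrative assumptions, not from the paper:

```python
import torch

def visual_token_importance(attn_weights: torch.Tensor) -> torch.Tensor:
    """Cumulative importance p_i^h of each visual token, per head.

    attn_weights: attention map of the first cross-attention layer,
                  shape (H, m, n_k) = (heads, text queries, visual tokens).
    Returns a tensor of shape (H, n_k) whose (h, i) entry is
    p_i^h = sum_j alpha_{i,j}^h, i.e. the total attention that head h's
    queries place on visual token i.
    """
    return attn_weights.sum(dim=1)  # sum over the query dimension j
```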

TRIM-KV visual token pruning algorithm

Tokens are pruned by head-wise ranking:

  1. For each of the $H$ heads, select the top-$k$ visual tokens by $p_i^h$, with $k = \lceil K_\text{ratio} \cdot n_k \rceil$.
  2. The union $T = \bigcup_{h=1}^{H} T_h$ (where $T_h$ is the set of top-$k$ indices in head $h$) identifies the tokens to retain.
  3. The pruned KV cache in all subsequent cross-attention layers uses only these tokens: $K_\text{pruned} = K[T]$, $V_\text{pruned} = V[T]$ (see the sketch after this list).
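A hedged sketch of the ranking-and-union step, reusing the per-head scores from the previous snippet; the (H, n_k, d_head) cache layout and the function names are illustrative assumptions:

```python
import math
import torch

def select_retained_tokens(importance: torch.Tensor, k_ratio: float) -> torch.Tensor:
    """Union of per-head top-k visual token indices, T = union_h T_h.

    importance: (H, n_k) per-head scores p_i^h.
    k_ratio:    K_ratio, the fraction of visual tokens kept per head.
    """
    H, n_k = importance.shape
    k = math.ceil(k_ratio * n_k)
    top_k = importance.topk(k, dim=-1).indices   # (H, k): T_h for each head
    return torch.unique(top_k.reshape(-1))       # sorted union across heads

def prune_visual_kv(K: torch.Tensor, V: torch.Tensor, retained: torch.Tensor):
    """Keep only the retained visual tokens in a layer's cached keys/values.

    K, V: (H, n_k, d_head) cached visual keys and values.
    """
    return K[:, retained, :], V[:, retained, :]
```

All subsequent cross-attention layers would then attend over the pruned keys and values only, which is what drives the memory and latency savings reported below.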

Empirical performance

In LLaMA-3.2-11B-Vision-Instruct, TRIM-KV with $K_\text{ratio} \approx 0.5$ maintains benchmark accuracy (e.g., SEED-Bench, MME, MMVP, LLaVA-Bench) within 0.3–1.2% of baseline, even at 40% token retention. First-token inference latency decreases by 4.1% (batch = 1) and up to 19.7% (batch = 32) when retaining 50% of visual features; memory requirements for the visual KV cache decline proportionally (Lee et al., 1 Apr 2025).

Random pruning and fixed spatial sampling underperform attention-guided selection by 5–11% on core benchmarks, demonstrating the necessity of data-driven token importance ranking.

2. Retention-gated TRIM-KV for Self-attention in LLMs

In decoder-only transformers, each new token's key and value vectors are appended to a KV cache that grows linearly with the context length $T$, while attention computation grows quadratically in $T$. This presents severe scalability constraints for long-range reasoning, prolonged dialogue, and generative tasks.
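To make the growth concrete under illustrative settings (not taken from the paper): a 32-layer decoder with hidden size 4096 that stores keys and values in fp16 needs roughly 2 × 32 × 4096 × 32,768 × 2 bytes ≈ 16 GiB of KV cache per sequence at $T = 32{,}768$, before any batching.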

Retention gate mechanism

Each attention head is augmented with a lightweight retention gate $g(x_t) \to \beta_t \in [0,1]$, where $x_t$ is the hidden state at timestep $t$. The decayed importance of token $i$ at a future position $t$ is modeled as an exponential decay:
$$\alpha_{t,i} = \beta_i^{\,t-i},$$
with higher $\beta$ meaning slower decay of importance (tokens "stick" in the cache longer).
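A minimal sketch of such a gate and its decayed score, assuming a single linear projection per head with a sigmoid (the paper describes an MLP fused with the QKV projection); class and function names are illustrative:

```python
import torch
import torch.nn as nn

class RetentionGate(nn.Module):
    """Lightweight gate g(x_t) -> beta_t in [0, 1], one value per head."""

    def __init__(self, hidden_dim: int, n_heads: int):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, n_heads)

    def forward(self, x_t: torch.Tensor) -> torch.Tensor:
        # x_t: (batch, hidden_dim) hidden state at timestep t.
        return torch.sigmoid(self.proj(x_t))  # (batch, n_heads), each in [0, 1]

def decayed_importance(beta_i: torch.Tensor, i: int, t: int) -> torch.Tensor:
    """alpha_{t,i} = beta_i^(t - i): how much token i still matters at position t."""
    return beta_i ** (t - i)
```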

Training protocol

Starting from a frozen pretrained backbone, only the retention gate parameters $\theta$ are trained:

  • Distillation/quality loss: the retention-gated model's outputs $q_\theta$ are matched to the original model's outputs $p$ via KL divergence plus cross-entropy on next-token prediction.
  • Capacity loss: for a fixed budget $M$, the sum of decayed retentions at each timestep $t$ is encouraged not to exceed $M$ (a sketch of this term follows the list):
$$\mathcal{L}_\text{cap} = \frac{1}{T(T-M)} \sum_{t=1}^T \max\left(0, \sum_{i=1}^t \beta_i^{\,t-i} - M\right).$$
The combined objective is $\mathcal{L}_\text{total}(\theta) = \mathcal{L}_\text{quality} + \lambda_\text{cap} \mathcal{L}_\text{cap}$.
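A hedged sketch of the capacity penalty for a single head over one training sequence (the distillation term is omitted, $T > M$ is assumed, and tensor names are illustrative):

```python
import torch

def capacity_loss(beta: torch.Tensor, M: int) -> torch.Tensor:
    """L_cap = 1/(T(T-M)) * sum_t max(0, sum_{i<=t} beta_i^(t-i) - M).

    beta: (T,) gate values beta_i for one head on one sequence, each in [0, 1].
    M:    cache budget in tokens (assumes T > M).
    """
    T = beta.shape[0]
    pos = torch.arange(T, device=beta.device)
    exponents = pos.unsqueeze(1) - pos.unsqueeze(0)        # (T, T), entry (t, i) = t - i
    valid = exponents >= 0                                 # only tokens already seen (i <= t)
    decay = beta.unsqueeze(0) ** exponents.clamp(min=0)    # beta_i^(t-i); clamp avoids 0**negative
    occupancy = (decay * valid).sum(dim=1)                 # expected cache occupancy at each t
    overflow = torch.clamp(occupancy - M, min=0.0)
    return overflow.sum() / (T * (T - M))
```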

Inference and eviction

At inference, only the $\beta$ scores are used for cache eviction within a fixed-size buffer (a code sketch follows below):

  1. For each new token, compute $\beta_{t+1}$ via the gate.
  2. Append the new $k$, $v$, and $\beta$ to the cache.
  3. When capacity is exceeded, evict the token $j$ with minimal $\beta_j^{\,t+1-j}$.
  4. Apply standard attention over the retained set.

Algorithmic complexity is $O(Md)$ for the cache and $O(M)$ for eviction, with the MLP-based $\beta$ calculation fused with the QKV projection (Bui et al., 3 Dec 2025).
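A minimal sketch of this eviction procedure for a single head, assuming the per-token $k$, $v$ vectors and gate value are available at each decoding step; the class name and the linear-scan eviction are illustrative simplifications:

```python
import torch

class RetentionKVCache:
    """Fixed-budget KV cache: evict the entry with the smallest decayed retention."""

    def __init__(self, budget: int):
        self.budget = budget
        self.keys, self.values = [], []
        self.betas, self.positions = [], []

    def append(self, k: torch.Tensor, v: torch.Tensor, beta: float, t: int):
        self.keys.append(k)
        self.values.append(v)
        self.betas.append(beta)
        self.positions.append(t)
        if len(self.keys) > self.budget:
            # Score each cached token j by beta_j^(t + 1 - j) and drop the minimum.
            scores = [b ** (t + 1 - p) for b, p in zip(self.betas, self.positions)]
            j = min(range(len(scores)), key=scores.__getitem__)
            for buf in (self.keys, self.values, self.betas, self.positions):
                buf.pop(j)

    def retained(self):
        """Keys/values over which standard attention is applied at the next step."""
        return torch.stack(self.keys), torch.stack(self.values)
```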

Quantitative results

Across mathematical reasoning (AIME24, GSM8K, MATH-500), procedural long-text generation, and long-memory benchmarks (LongMemEval, LongBench, SCBench), TRIM-KV consistently surpasses heuristic baselines (StreamingLLM, H2O, SnapKV, R-KV) and learnable retrieval (SeerAttn-R) at all budgets. Notably, at $M = 4096$ tokens, TRIM-KV yields higher accuracy than full-cache inference (74% vs. 65.5% on AIME24). On LongBench-v2, TRIM-KV exceeds the full cache by +6.5% overall accuracy, indicating that retention-based regularization can suppress noise from uninformative tokens.

Throughput matches or exceeds other efficient schemes: TRIM-KV achieves 130.5 tokens/s on a single H200 (batch = 4, 1K generated tokens, 32K context), compared to 68.4 tokens/s for FullKV and 124.7 tokens/s for SnapKV.

3. Interpretability and Emergent Behaviors

Retention scores $\beta$ exhibit interpretable and structured layer- and head-specific patterns:

  • Sliding window behavior in early layers (favoring recent tokens).
  • Persistent “sink” tokens in later layers (e.g., start or paragraph markers).
  • “A-shaped” retention around syntactic pivots in question answering.
  • Certain heads prioritize mathematical operators, numerics, or function words, while filler tokens are rapidly evicted ($\beta \approx 0$).

These heuristics are not hand-crafted but emerge solely from data-driven distillation and the capacity constraint.

4. Limitations and Practical Constraints

Both approaches have notable limitations:

  • Retention gates in LLMs are trained post hoc, with no co-adaptation of the underlying backbone; the attention mechanism remains independent of $\beta$.
  • Efficient hardware support is limited: current implementations assume uniform per-head cache length, and variable-length per-head support is not native to kernel libraries such as FlashAttention.
  • The exponential decay schedule is empirically effective but not necessarily optimal; richer retention dynamics may further improve performance. This applies especially to retrieval-heavy or compressible language tasks, where pure eviction-based retention may drop essential content if the capacity $M$ is set too low.
  • The multimodal (vision-language) application is currently based on attention-based heuristics for token selection, not on learnable or adaptive image token retention.

5. Future Directions

Proposed research avenues include:

  • End-to-end pretraining or finetuning in which the retention gates and attention weights co-adapt, enhancing the utility of $\beta$ and potentially yielding further accuracy gains under strict budgets in relational or multimodal contexts.
  • Adaptive per-head and per-layer memory budgeting for KV caches subject to global resource constraints.
  • Generalization to richer retention functions (beyond exponential) or context-dependent re-strengthening of retention if tokens receive renewed attention.
  • Application to multimodal architectures (joint image and text), tool-use, and retrieval-augmented models.
  • Advancement of hardware/library support for non-uniform, dynamic-length cache management.

6. Comparative Overview

| | Visual Token Pruning (Lee et al., 1 Apr 2025) | Retention-Gated Cache (Bui et al., 3 Dec 2025) |
|---|---|---|
| Context | Cross-attention, LVLMs | Self-attention, LLMs |
| Method | Score-and-prune based on attention maps | Learnable gates, exponential decay, eviction |
| Training | No retraining required | Only gates trained (distillation + capacity loss) |
| Resource effect | ~50% memory/computation reduction | Strict $O(M)$ memory, scalable to long contexts |
| Performance | <2% accuracy drop at 40–60% tokens | Matches/exceeds full KV cache at modest budgets |

A plausible implication is that attention-guided trimming and retention-based cache bounding are converging towards hardware-efficient, highly scalable inference kernels for both text-only and multimodal foundation models, with emergent interpretability.

7. References

  • Lee et al., 1 Apr 2025 (attention-guided visual token pruning for cross-attention LVLMs).
  • Bui et al., 3 Dec 2025 (retention-gated KV cache eviction for LLMs).