TRIM-KV: Efficient Transformer Cache Techniques
- TRIM-KV denotes two distinct methods for efficient KV cache management: pruning of redundant visual tokens in LVLMs and learnable retention gates for cache eviction in LLMs.
- The visual token pruning method ranks tokens by attention weights to retain only the most informative keys and values, reducing memory usage and latency with minimal accuracy loss.
- Retention-gated cache uses exponential decay of token importance with learned gates to evict less relevant tokens, enabling scalable and robust autoregressive decoding in long-context settings.
TRIM-KV refers to two distinct, state-of-the-art methods for key-value (KV) cache reduction in modern transformer architectures: (1) pruning redundant visual tokens in cross-attention-based large vision-language models (LVLMs) to optimize visual feature caching without retraining (Lee et al., 1 Apr 2025), and (2) learnable retention-based KV eviction for memory-bounded inference in LLMs, using retention gates that score and decay each token's importance, enabling efficient and robust autoregressive decoding in long-context settings (Bui et al., 3 Dec 2025).
1. Cross-attention-based TRIM-KV in LVLMs
In cross-attention LVLMs (e.g., LLaMA-3.2-Vision-Instruct), visual tokens produced by a vision encoder are injected into the LLM's hidden states at each cross-attention layer. For $N_v$ image tokens and hidden dimension $d$, cross-attention layers compute queries from text, and keys/values from visual features via learned projections. These visual key–value (KV) pairs are cached at inference, creating a memory bottleneck as $N_v$ grows large.
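As a rough illustration of the bottleneck, the back-of-envelope estimate below computes the visual KV-cache footprint; the layer, head, and token counts are illustrative placeholders, not the exact LLaMA-3.2 configuration.

```python
# Back-of-envelope estimate of the visual KV-cache footprint in a cross-attention
# LVLM. All configuration numbers are illustrative placeholders.

def visual_kv_cache_bytes(n_visual_tokens: int,
                          n_cross_attn_layers: int = 8,
                          n_kv_heads: int = 8,
                          head_dim: int = 128,
                          bytes_per_elem: int = 2) -> int:  # fp16/bf16
    """Bytes used by cached visual keys and values across all cross-attention layers."""
    per_layer = 2 * n_kv_heads * head_dim * n_visual_tokens * bytes_per_elem  # K and V
    return n_cross_attn_layers * per_layer

full = visual_kv_cache_bytes(n_visual_tokens=4096)       # all visual tokens
half = visual_kv_cache_bytes(n_visual_tokens=4096 // 2)  # ~50% retained by pruning
print(f"full: {full / 2**20:.1f} MiB, 50% retention: {half / 2**20:.1f} MiB")
```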
Mathematical formulation
For each head $h$, scaled dot-product attention is computed as
$$A^{(h)} = \operatorname{softmax}\!\left(\frac{Q^{(h)} {K^{(h)}}^{\top}}{\sqrt{d_h}}\right),$$
where $Q^{(h)} \in \mathbb{R}^{L_q \times d_h}$ holds the text queries and $K^{(h)}, V^{(h)} \in \mathbb{R}^{N_v \times d_h}$ hold the visual keys and values. The output is $O^{(h)} = A^{(h)} V^{(h)}$. Empirical analysis reveals that $A^{(h)}$ is highly sparse: a small subset of visual tokens dominates attention (striped patterns), and the head-wise sparsity pattern is consistent across layers beyond the initial blocks.
Attention-guided token importance
The cumulative importance of visual token $j$ in head $h$ is
$$s_j^{(h)} = \sum_{i=1}^{L_q} A_{ij}^{(h)},$$
where $A_{ij}^{(h)}$ is the attention weight assigned to visual token $j$ by query $i$ in head $h$, and $L_q$ is the number of query tokens. These scores are computed from the first cross-attention layer only.
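A minimal sketch of this scoring step, assuming the first cross-attention layer's attention weights are materialized as a tensor (shapes and sizes are illustrative):

```python
import torch

# Attention weights of the first cross-attention layer:
# shape (n_heads, n_query_tokens, n_visual_tokens), each query row softmax-normalized.
attn = torch.rand(8, 32, 4096).softmax(dim=-1)

# Cumulative importance s[h, j] = sum over queries i of attn[h, i, j].
importance = attn.sum(dim=1)   # (n_heads, n_visual_tokens)
```

These per-head scores feed directly into the head-wise ranking described next.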
TRIM-KV visual token pruning algorithm
Tokens are pruned by head-wise ranking:
- For each of the $H$ heads, select the top-$k$ visual tokens by importance score $s_j^{(h)}$.
- The union $\mathcal{I} = \bigcup_{h=1}^{H} \mathcal{I}^{(h)}$ (where $\mathcal{I}^{(h)}$ are the top-$k$ indices in head $h$) identifies the tokens to retain.
- The pruned KV cache in all subsequent cross-attention layers uses only these tokens: $\tilde{K}^{(h)} = K^{(h)}_{\mathcal{I}}$, $\tilde{V}^{(h)} = V^{(h)}_{\mathcal{I}}$ (see the sketch below).
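The following sketch implements the head-wise top-$k$ union and the resulting cache gather under the notation above; function names and tensor shapes are assumptions for exposition, not the authors' implementation.

```python
import torch

def select_retained_tokens(scores: torch.Tensor, k: int) -> torch.Tensor:
    """Union of per-head top-k visual-token indices.

    scores: (n_heads, n_visual_tokens) importance scores from the first
            cross-attention layer (see the previous sketch).
    returns: sorted 1-D tensor of token indices to retain.
    """
    topk_idx = scores.topk(k, dim=-1).indices      # (n_heads, k)
    return torch.unique(topk_idx.flatten())        # union across heads, sorted

def prune_visual_kv(keys: torch.Tensor, values: torch.Tensor,
                    retained: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """Keep only retained visual tokens in one layer's cached keys/values.

    keys, values: (n_heads, n_visual_tokens, head_dim)
    """
    return keys[:, retained, :], values[:, retained, :]
```

Because the retained index set is shared across heads, the gather is computed once and reused for all subsequent cross-attention layers.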
Empirical performance
In LLaMA-3.2-11B-Vision-Instruct, TRIM-KV maintains benchmark accuracy (e.g., SEED-Bench, MME, MMVP, LLaVA-Bench) within 0.3–1.2% of baseline, even at 40% token retention. First-token inference latency decreases by 4.1% (batch=1) and up to 19.7% (batch=32) when 50% of visual features are retained; memory requirements for the visual KV cache decline proportionally (Lee et al., 1 Apr 2025).
Random pruning and fixed spatial sampling underperform attention-guided selection by 5–11% on core benchmarks, demonstrating the necessity of data-driven token importance ranking.
2. Retention-gated TRIM-KV for Self-attention in LLMs
In decoder-only transformers, each new token's key and value vectors are appended to a KV cache that grows linearly in the context length $T$, with attention computation quadratic in $T$. This presents severe scalability constraints for long-range reasoning, prolonged dialog, and generative tasks.
Retention gate mechanism
Each attention head is augmented with a lightweight retention gate that maps the hidden state $h_t$ at timestep $t$ to a retention score $r_t \in (0, 1)$. The decayed importance of token $t$ at a future position $t' > t$ is modeled as an exponential decay
$$s_t(t') = r_t^{\,t' - t},$$
with higher $r_t$ meaning slower decay of importance (tokens “stick” in the cache longer).
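A minimal sketch of a per-head retention gate and the decay it induces; the linear-plus-sigmoid parameterization, module sizes, and names are assumptions consistent with the description above, not necessarily the paper's exact design.

```python
import torch
import torch.nn as nn

class RetentionGate(nn.Module):
    """Lightweight gate mapping a hidden state to per-head retention scores in (0, 1)."""

    def __init__(self, hidden_dim: int, n_heads: int):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, n_heads)

    def forward(self, h_t: torch.Tensor) -> torch.Tensor:
        # h_t: (batch, hidden_dim) -> r_t: (batch, n_heads), one score per head
        return torch.sigmoid(self.proj(h_t))

def decayed_importance(r_t: torch.Tensor, t: int, t_future: int) -> torch.Tensor:
    """Exponential decay s_t(t') = r_t ** (t' - t); higher r_t means slower decay."""
    return r_t ** (t_future - t)
```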
Training protocol
Starting from a frozen pretrained backbone, only the retention gates are trained:
- Distillation/quality loss: the retention-gated model's output distribution is matched to the frozen full-cache model's output distribution via KL divergence, plus a cross-entropy term on next-token prediction.
- Capacity loss: for a fixed budget $M$, the sum of decayed retentions at each timestep $t$ is encouraged not to exceed $M$, i.e., $\sum_{j \le t} s_j(t) \le M$ is penalized when violated. Combined objective: $\mathcal{L} = \mathcal{L}_{\text{quality}} + \lambda\, \mathcal{L}_{\text{cap}}$ (see the sketch below).
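A hedged sketch of the two training terms under the formulation above; the hinge form of the capacity penalty and the weighting hyperparameter are assumptions consistent with the stated constraint, not necessarily the paper's exact losses.

```python
import torch
import torch.nn.functional as F

def quality_loss(student_logits, teacher_logits, targets):
    """KL to the frozen full-cache teacher plus cross-entropy on next-token prediction."""
    kl = F.kl_div(F.log_softmax(student_logits, dim=-1),
                  F.softmax(teacher_logits, dim=-1),
                  reduction="batchmean")
    ce = F.cross_entropy(student_logits.view(-1, student_logits.size(-1)),
                         targets.view(-1))
    return kl + ce

def capacity_loss(retention: torch.Tensor, budget: float) -> torch.Tensor:
    """Hinge penalty on timesteps whose summed decayed retention exceeds the budget M.

    retention: (T,) per-token retention scores r_t in (0, 1) for one head.
    """
    T = retention.size(0)
    pos = torch.arange(T)
    exponents = (pos.unsqueeze(0) - pos.unsqueeze(1)).clamp(min=0).float()  # [j, t] = t - j
    alive = (pos.unsqueeze(1) <= pos.unsqueeze(0)).float()                  # token j exists at step t
    decayed = (retention.unsqueeze(1) ** exponents) * alive                 # s_j(t)
    load = decayed.sum(dim=0)                                               # sum_j s_j(t) at each step t
    return F.relu(load - budget).mean()

# Combined objective (lambda_cap is an assumed weighting hyperparameter):
# loss = quality_loss(student_logits, teacher_logits, targets) \
#        + lambda_cap * capacity_loss(retention, budget=M)
```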
Inference and eviction
At inference, only the retention scores are used for cache eviction within a fixed-size buffer:
- For each new token $t$, compute $r_t$ via the gate.
- Append the new key $k_t$, value $v_t$, and score $r_t$ to the cache.
- When capacity $M$ is exceeded, evict the cached token $j$ with minimal decayed retention $s_j(t)$.
- Apply standard attention over the retained set.
The cache requires $O(M)$ memory and eviction reduces to an argmin over the $M$-token buffer, with the MLP-based retention-score calculation fused with the QKV projection (Bui et al., 3 Dec 2025).
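A simplified sketch of the fixed-budget eviction loop for a single head, using a flat buffer and a linear argmin scan; the class name, data layout, and gate interface are illustrative assumptions.

```python
import torch

class BoundedKVCache:
    """Fixed-budget per-head KV cache with retention-based eviction (illustrative)."""

    def __init__(self, budget: int, head_dim: int):
        self.budget = budget
        self.keys = torch.empty(0, head_dim)
        self.values = torch.empty(0, head_dim)
        self.retention = torch.empty(0)                     # r_j for each cached token
        self.positions = torch.empty(0, dtype=torch.long)   # timestep j each token entered

    def append(self, k: torch.Tensor, v: torch.Tensor, r_t: torch.Tensor, t: int) -> None:
        self.keys = torch.cat([self.keys, k.unsqueeze(0)])
        self.values = torch.cat([self.values, v.unsqueeze(0)])
        self.retention = torch.cat([self.retention, r_t.view(1)])
        self.positions = torch.cat([self.positions, torch.tensor([t])])
        if self.keys.size(0) > self.budget:
            # Decayed importance s_j(t) = r_j ** (t - j); evict its argmin.
            decayed = self.retention ** (t - self.positions).float()
            keep = torch.ones(self.keys.size(0), dtype=torch.bool)
            keep[decayed.argmin()] = False
            self.keys, self.values = self.keys[keep], self.values[keep]
            self.retention, self.positions = self.retention[keep], self.positions[keep]
```

Standard attention is then applied over the retained keys and values; a fused kernel would compute the gate scores alongside the QKV projection, as noted above.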
Quantitative results
Across mathematical reasoning (AIME24, GSM8K, MATH-500), procedural long-text generation, and long-memory benchmarks (LongMemEval, LongBench, SCBench), TRIM-KV consistently surpasses heuristic baselines (StreamingLLM, H2O, SnapKV, R-KV) and learnable retrieval (SeerAttn-R) at all budgets. Notably, at a modest token budget, TRIM-KV yields higher accuracy than full-cache inference on AIME24 (74% vs. 65.5%). On LongBench-v2, TRIM-KV exceeds full-cache accuracy by +6.5% overall, indicating that retention-based regularization can suppress noise from uninformative tokens.
Throughput matches or exceeds other efficient schemes: TRIM-KV achieves 130.5 tokens/s on a single H200 (batch=4, 1K generated tokens, 32K context), compared with 68.4 tokens/s (FullKV) and 124.7 tokens/s (SnapKV).
3. Interpretability and Emergent Behaviors
Retention scores exhibit interpretable and structured layer/head-specific patterns:
- Sliding window behavior in early layers (favoring recent tokens).
- Persistent “sink” tokens in later layers (e.g., start or paragraph markers).
- “A-shaped” retention around syntactic pivots in question answering.
- Certain heads prioritize mathematical operators, numerals, or function words, while filler tokens are rapidly evicted.
These heuristics are not hand-crafted but emerge solely from data-driven distillation and the capacity constraint.
4. Limitations and Practical Constraints
Both approaches have notable limitations:
- Retention gates in LLMs are trained post hoc, with no co-adaptation of the underlying backbone; the attention mechanism itself remains independent of the retention scores $r_t$.
- Efficient hardware support is limited: current implementations assume uniform per-head cache length, and variable-length per-head support is not native to kernel libraries such as FlashAttention.
- The exponential decay schedule is empirically effective but not necessarily optimal. Richer retention dynamics may further improve performance, especially for retrieval-heavy or highly compressible language tasks, where pure eviction-based retention may drop essential content if the capacity is set too low.
- The multimodal (vision-language) application is currently based on attention-based heuristics for token selection, not on learnable or adaptive image token retention.
5. Future Directions
Proposed research avenues include:
- End-to-end pretraining or finetuning in which the retention gates and attention weights co-adapt, enhancing the utility of the retention signal and potentially yielding further accuracy gains under strict memory budgets or in multimodal contexts.
- Adaptive per-head and per-layer memory budgeting for KV caches subject to global resource constraints.
- Generalization to richer retention functions (beyond exponential) or context-dependent re-strengthening of retention if tokens receive renewed attention.
- Application to multimodal architectures (joint image and text), tool-use, and retrieval-augmented models.
- Advancement of hardware/library support for non-uniform, dynamic-length cache management.
6. Comparative Overview
| | Visual Token Pruning (Lee et al., 1 Apr 2025) | Retention-Gated Cache (Bui et al., 3 Dec 2025) |
|---|---|---|
| Context | Cross-attention, LVLMs | Self-attention, LLMs |
| Method | Score-and-prune based on attention maps | Learnable gates, exponential decay, eviction |
| Training | No retraining required | Only gates trained (distillation + capacity loss) |
| Resource effect | ~50% memory/computation reduction | Strict O(M) memory, scalable to long contexts |
| Performance | ≤2% accuracy drop at 40–60% token retention | Matches/exceeds full KV cache at modest budgets |
A plausible implication is that attention-guided trimming and retention-based cache bounding are converging towards hardware-efficient, highly scalable inference kernels for both text-only and multimodal foundation models, with emergent interpretability.
7. References
- “Efficient LLaMA-3.2-Vision by Trimming Cross-attended Visual Features” (Lee et al., 1 Apr 2025)
- “Cache What Lasts: Token Retention for Memory-Bounded KV Cache in LLMs” (Bui et al., 3 Dec 2025)