Response-Driven Visual Token Pruning
- Response-driven token pruning uses intermediate attention signals to dynamically discard redundant visual tokens, yielding significant compute and memory savings.
- A mutual-information guided framework and cross-modal attention-based saliency enable adaptive, per-sample token retention that tailors pruning to task complexity.
- Empirical benchmarks show that methods like AutoPrune, RedVTP, and DyRate achieve over 60% FLOP reduction with minimal accuracy loss, underscoring practical efficiency gains.
Response-driven visual token pruning refers to a family of strategies for dynamically reducing the number of visual tokens in large vision-language models (VLMs) during inference, where the selection and retention of tokens is directly determined by the model's own intermediate "response" (typically its attention patterns or cross-modal information flow) rather than by fixed heuristics or purely vision-side importance metrics. The motivation is to realize significant reductions in memory and computation cost while maintaining, or sometimes even improving, task accuracy and model robustness. Methods in this lineage exploit the observation that, as reasoning unfolds, many visual patches become redundant or irrelevant for subsequent multimodal computation, and that this can be identified from the model's ongoing activations.
1. Foundational Principles and Motivation
The core motivation for response-driven pruning is the overwhelming token redundancy inherent in modern VLMs, where the input image is typically mapped to hundreds or thousands of visual patch tokens. The quadratic complexity in both self-attention and cross-attention computations means that reducing the number of visual tokens early and aggressively translates directly to reduced inference latency and memory usage (Wang et al., 28 Sep 2025, Endo et al., 17 Dec 2024, Xu et al., 16 Nov 2025, Liang et al., 24 Jan 2025). Traditional "static" or vision-only pruning, based for example on CLIP [CLS] attention or uniform sampling, cannot adapt to sample-specific complexity or question-specific cues.
"Cognitive" inspiration is drawn from human visual reasoning, which exhibits an initial phase of broad exploration followed by selective narrowing as the information required for the task becomes clear; this is mirrored in models whose early-layer attention is diffuse but becomes localized as decoding proceeds (Wang et al., 28 Sep 2025).
2. Key Methodological Approaches
2.1 Mutual-Information Guided Pruning (AutoPrune)
AutoPrune computes the mutual information $I(V;T)$ between visual tokens $V$ and textual tokens $T$ using cross-modal attention maps at an early model layer. $I(V;T)$ is interpreted as a dynamically estimated complexity score: higher $I(V;T)$ signals a well-constrained query ("pinned down" visual evidence), while lower $I(V;T)$ indicates an ambiguous, complex input needing more tokens. This scalar is then mapped to a logistic retention curve across model layers. Crucially, the shape and steepness of each curve are tailored per input by $I(V;T)$, so token retention falls steeply and early for simple queries and more gradually for hard or ambiguous ones (Wang et al., 28 Sep 2025).
At each layer, a budget-constrained renormalization ensures that total token usage matches a global compute constraint, allowing the framework to adaptively allocate capacity without hand-designed per-sample rules.
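A minimal sketch of this two-step recipe is below (assumptions flagged: the attention-to-probability mapping, the MI-to-midpoint function `1 - tanh(mi)`, and the logistic steepness are illustrative stand-ins, not AutoPrune's exact formulation):

```python
import numpy as np

def mi_complexity(attn):
    """Estimate I(V;T) from a text-to-visual cross-attention map.

    attn: (num_text_tokens, num_visual_tokens) non-negative attention
    weights taken from an early model layer.
    """
    joint = attn / attn.sum()               # treat normalized attention as p(t, v)
    p_t = joint.sum(axis=1, keepdims=True)  # text marginal
    p_v = joint.sum(axis=0, keepdims=True)  # visual marginal
    mask = joint > 0
    return float((joint[mask] * np.log(joint[mask] / (p_t @ p_v)[mask])).sum())

def retention_schedule(mi, num_layers, budget, steepness=8.0):
    """Per-layer retention fractions from a logistic curve, rescaled to a budget.

    Higher MI (well-constrained query) shifts the drop-off earlier; lower MI
    keeps tokens longer. Rescaling approximates the global compute constraint;
    a real implementation would re-balance to hit the budget exactly.
    """
    layers = np.arange(num_layers)
    midpoint = num_layers * (1.0 - np.tanh(mi))   # assumed MI -> curve-shift mapping
    keep = 1.0 / (1.0 + np.exp(steepness / num_layers * (layers - midpoint)))
    return np.clip(keep * (budget / keep.mean()), 0.0, 1.0)

attn = np.random.rand(16, 576)                    # toy text-to-visual attention
curve = retention_schedule(mi_complexity(attn), num_layers=32, budget=0.25)
print(np.round(curve[:8], 3))                     # retention fraction per early layer
```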
2.2 Cross-Modal Attention-Based Saliency
In both autoregressive and diffusion-based settings, cross-modal attention from text to visual tokens—especially from current or masked response positions—is used to quantify informativeness. For autoregressive VLMs (e.g., FEATHER (Endo et al., 17 Dec 2024), DyRate (Liang et al., 24 Jan 2025)), pruning criteria are constructed from the attention assigned to each visual token by the LLM at specific layers, thereby capturing the evolving task-relevance as decoding unfolds. In diffusion VLMs (RedVTP (Xu et al., 16 Nov 2025)), attention from still-masked response tokens to each visual token is averaged immediately after the first diffusion step, with empirical evidence showing remarkable temporal stability of these importances, justifying a single-step, once-off pruning.
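The scoring step these methods share can be sketched as below, assuming access to one decoder layer's attention tensor and known (illustrative) index ranges for the visual and response tokens:

```python
import numpy as np

def visual_token_saliency(attn, response_idx, visual_idx):
    """Score visual tokens by the attention they receive from response positions.

    attn: (num_heads, seq_len, seq_len) self-attention weights from a single
    decoder layer; the index arrays select response (query) and visual (key) tokens.
    """
    cross = attn[:, response_idx][:, :, visual_idx]  # (heads, |response|, |visual|)
    return cross.mean(axis=(0, 1))                   # average heads and responders

def prune_topk(visual_idx, scores, keep_ratio=0.25):
    k = max(1, int(len(visual_idx) * keep_ratio))
    keep = np.argsort(scores)[-k:]                   # indices of top-k saliency
    return sorted(visual_idx[i] for i in keep)       # preserve original token order

attn = np.random.rand(32, 640, 640)      # toy per-head attention for one layer
visual_idx = np.arange(5, 581)           # e.g. 576 visual tokens after a short prefix
response_idx = np.arange(600, 640)       # current (or masked) response positions
kept = prune_topk(visual_idx, visual_token_saliency(attn, response_idx, visual_idx))
```

The same routine covers both families: for autoregressive models `response_idx` points at already-generated positions, while for diffusion models it points at the still-masked response slots.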
2.3 Adaptive and Multi-Stage Pruning Schedules
Response-driven approaches frequently implement pruning in multiple stages—early or mid-layer pruning followed by later, possibly deeper, pruning as the model's information flow matures (Endo et al., 17 Dec 2024, Liang et al., 24 Jan 2025, Wang et al., 28 Sep 2025). Some methods supplement saliency-based pruning with uniform sampling to mitigate positional or spatial bias (as in FEATHER (Endo et al., 17 Dec 2024)), or schedule the compression rate dynamically via a classifier that responds to instantaneous attention allocation (DyRate (Liang et al., 24 Jan 2025)).
In diffusion and parallel decoding models (RedVTP (Xu et al., 16 Nov 2025)), importance is inferred from the initial inference step and pruned tokens are dropped for all subsequent steps, yielding both batch and sequential efficiency gains.
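A FEATHER-style two-part selection might look like the following sketch; the 50/50 saliency/uniform split and the stride-based spatial sampling are illustrative choices, not the paper's exact recipe:

```python
import numpy as np

def feather_style_select(scores, keep_ratio=0.1, uniform_frac=0.5):
    """Spend part of the token budget on uniform sampling, the rest on saliency.

    scores: per-visual-token saliency. Uniform sampling counteracts the spatial
    bias of purely attention-based selection.
    """
    n = len(scores)
    budget = max(1, int(n * keep_ratio))
    n_uniform = int(budget * uniform_frac)

    stride = max(1, n // max(1, n_uniform))
    uniform = set(range(0, n, stride)[:n_uniform])   # evenly spaced positions

    # fill the remaining budget with the highest-saliency tokens not yet chosen
    order = np.argsort(scores)[::-1]
    salient = [i for i in order if i not in uniform][: budget - len(uniform)]
    return sorted(uniform | set(salient))

kept = feather_style_select(np.random.rand(576))
print(len(kept), kept[:10])
```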
3. Mathematical Formulation
The central mathematical tool for response-driven pruning is the use of intermediate attention weights, often aggregated across model layers and/or heads, to construct sample-specific importance scores. In the mutual-information framework (AutoPrune (Wang et al., 28 Sep 2025)), this is formalized as

$$I(V;T) = \sum_{v \in V} \sum_{t \in T} p(v,t) \log \frac{p(v,t)}{p(v)\,p(t)},$$

where the probabilities are derived from normalized attention weights.
Retention curves are defined by logistic functions whose parameters are functions of $I(V;T)$, constrained such that the area under the curve matches a prescribed FLOPs or token budget.
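As a toy numerical check of the MI score (illustrative numbers, not from the paper): take a $2 \times 2$ normalized attention "joint" $p(v,t)$ with diagonal mass $0.4$ and off-diagonal mass $0.1$, so both marginals are uniform at $0.5$. Then

$$I(V;T) = 2 \cdot 0.4 \log\tfrac{0.4}{0.25} + 2 \cdot 0.1 \log\tfrac{0.1}{0.25} \approx 0.376 - 0.183 \approx 0.19 \ \text{nats},$$

while a perfectly diffuse map ($p(v,t) = 0.25$ everywhere) gives $I(V;T) = 0$: concentrated cross-modal attention signals a well-constrained query, diffuse attention an ambiguous one.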
In contrast, stepwise autoregressive approaches compute, at each generation step $t$, a pruning ratio $\rho_t = f(A_t)$, where $A_t$ is the vector of current cross-attention allocations and $f$ is typically a learned classifier combined with Gumbel–Softmax sampling for differentiability (Liang et al., 24 Jan 2025).
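A sketch of such a ratio classifier in PyTorch follows; the candidate ratio set and the MLP shape are assumptions for illustration rather than DyRate's actual architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RatioClassifier(nn.Module):
    """Predict a discrete pruning ratio rho_t from the attention allocation A_t."""

    def __init__(self, num_visual_tokens=576, ratios=(0.9, 0.75, 0.5, 0.25)):
        super().__init__()
        self.register_buffer("ratios", torch.tensor(ratios))
        self.mlp = nn.Sequential(
            nn.Linear(num_visual_tokens, 128), nn.GELU(), nn.Linear(128, len(ratios))
        )

    def forward(self, attn_alloc, tau=1.0):
        # attn_alloc: (batch, num_visual_tokens) attention mass per visual token
        logits = self.mlp(attn_alloc)
        # hard=True returns a one-hot sample but keeps gradients flowing via the
        # straight-through Gumbel-Softmax estimator, so the choice stays trainable
        onehot = F.gumbel_softmax(logits, tau=tau, hard=True)
        return (onehot * self.ratios).sum(dim=-1)   # sampled ratio per batch element

clf = RatioClassifier()
rho = clf(torch.rand(2, 576))   # e.g. tensor([0.2500, 0.7500]) on one draw
```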
For diffusion models, RedVTP formalizes the importance of visual token $j$ as

$$s_j = \frac{1}{|M|} \sum_{i \in M} \bar{A}_{i,j},$$

with $\bar{A}$ the attention averaged over layers and heads, $j$ indexing visual tokens, and $i \in M$ indexing masked response positions (Xu et al., 16 Nov 2025).
4. Empirical Benchmarks and Comparative Results
Across several large-scale VQA, multimodal QA, and autonomous driving tasks, response-driven methods demonstrate substantial gains in efficiency and competitiveness:
| Method | Tokens Retained | Efficiency Gain | Accuracy (vs. unpruned) | Dataset/Model |
|---|---|---|---|---|
| AutoPrune | 11% (64/576) | 76.8% FLOP↓ | 96.7% | LLaVA-1.5-7B (Wang et al., 28 Sep 2025) |
| RedVTP | 25% | 186% throughput↑ | -4.15 pts | LLaDA-V (Xu et al., 16 Nov 2025) |
| DyRate | dynamic | 66% FLOP↓ | ≈100% (EM) | LLaVA-1.5-7B (Liang et al., 24 Jan 2025) |
| FEATHER | ≈3.3% | 64% FLOP↓ | – | RefCOCO (Endo et al., 17 Dec 2024) |
Notably, AutoPrune outperforms PDrop by 9.1% at extreme sparsity and is consistently superior to baselines such as FastV, SparseVLM, VisionZip, and FasterVLM (Wang et al., 28 Sep 2025). RedVTP achieves up to 186% throughput improvement on diffusion models at a modest accuracy cost (Xu et al., 16 Nov 2025). DyRate's classifier-driven, stepwise pruning retains essentially full accuracy while reducing FLOPs by roughly two-thirds during captioning and VQA (Liang et al., 24 Jan 2025).
5. Architectural Integration and Implementation
Response-driven pruning mechanisms are typically applied after the initial vision encoder and inside the language-model decoder or diffusion decoder stacks:
- AutoPrune: computes attention-based MI at an early model layer, then schedules per-layer token retention using the derived logistic curve (Wang et al., 28 Sep 2025).
- RedVTP: after first diffusion step, prunes via masked-token attention, then proceeds with only retained tokens for the full remaining inference trajectory (Xu et al., 16 Nov 2025).
- DyRate: as each token is generated autoregressively, a lightweight predictor conditioned on the current distribution of cross-modal attention sets the pruning ratio dynamically (Liang et al., 24 Jan 2025).
- FEATHER: removes RoPE-induced positional bias from attention scoring and combines the resulting saliency selection with uniform subsampling in a two-stage process (Endo et al., 17 Dec 2024).
Implementations are plug-and-play and do not require training or fine-tuning of the backbone models; they only access forward-pass attention statistics.
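As an illustration of this pattern, the sketch below captures attention weights from a toy attention layer via a PyTorch forward hook and ranks visual tokens by received attention mass; the module and index ranges are stand-ins for a real VLM's decoder blocks (HuggingFace models expose the same statistics via `output_attentions=True`):

```python
import torch
import torch.nn as nn

class ToyAttention(nn.Module):
    """Stand-in for one decoder attention block of a VLM."""
    def __init__(self, dim=64):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, x):
        out, weights = self.attn(x, x, x, need_weights=True)  # weights: (B, L, L)
        self.last_weights = weights       # stash so a hook can read them
        return out

captured = {}
def grab_attention(module, inputs, output):
    captured["weights"] = module.last_weights.detach()

layer = ToyAttention()
handle = layer.register_forward_hook(grab_attention)

x = torch.randn(1, 640, 64)                       # prefix + visual + response tokens
layer(x)                                          # forward pass populates the hook
vis = slice(5, 581)                               # assumed span of 576 visual tokens
scores = captured["weights"][0, :, vis].mean(0)   # mass received by each visual token
keep = scores.topk(64).indices                    # retain the top-64 visual tokens
handle.remove()
```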
6. Theoretical and Practical Significance
Response-driven visual token pruning achieves several distinctive properties:
- Dynamic Adaptivity: Token pruning is tightly coupled to current sample/task complexity, allowing models to economize for simple queries and allocate resources for harder cases (Wang et al., 28 Sep 2025, Liang et al., 24 Jan 2025).
- Preservation of Reasoning Trajectory: By pruning based on actual contextual dependence and evolving cross-modal saliency, these methods avoid over-pruning critical evidence as can occur with static approaches. Empirically, moderate pruning sometimes increases accuracy by discarding distracting or noisy visual input (Wang et al., 28 Sep 2025, Xu et al., 16 Nov 2025).
- Broad Applicability: Applicable to both autoregressive and diffusion-based VLMs, as well as various downstream tasks (VQA, captioning, navigation planning).
- Efficiency–Accuracy Tradeoff: Consistently allows for compression rates above 80–90% while incurring less than 5% degradation in accuracy on standard multimodal benchmarks.
A plausible implication is that, as benchmarks become more fine-grained and challenging (e.g., requiring spatial localization or multi-turn reasoning), the competitive advantage of response-driven pruning over static or vision-only approaches will further widen, due to its task- and instance-level adaptivity.
7. Limitations and Open Directions
Current response-driven pruning approaches may risk over-pruning in domains requiring dense, fine-grained visual details (e.g., document QA at extreme sparsity levels), or may fragment spatial context when selections are made solely based on attention scores (Xu et al., 16 Nov 2025). Additionally, most methods operate per-query and do not account for multi-turn dialogue or evolving relevance over conversational context.
Further directions suggested in the literature include:
- Merging or clustering retained tokens to improve spatial coherence (Xu et al., 16 Nov 2025).
- Integrating lightweight learning or reinforcement learning to optimize token selection policies with explicit answer reward functions (Endo et al., 17 Dec 2024).
- Combining with static, pre-trained token reducers for layered, two-phase pruning (Xu et al., 16 Nov 2025).
The field continues to evolve rapidly as models and benchmarks increase in scale and complexity. Response-driven token pruning remains a foundational paradigm for efficient and robust multimodal reasoning at scale.