RedVTP: Efficient DVLM Inference
- RedVTP is a training-free approach that prunes unimportant visual tokens using masked-token attention to accelerate DVLM inference.
- It computes stable importance scores after the first diffusion step to retain top-scoring tokens, significantly reducing FLOPs and latency.
- Empirical results show substantial throughput gains and minimal accuracy loss, validating RedVTP as an efficient, retraining-free method.
RedVTP is a training-free approach for accelerating inference in diffusion vision-LLMs (DVLMs) by pruning unimportant visual tokens using masked response token attention. Operating on models such as LLaDA-V and LaViDa, RedVTP harnesses the early-stage diffusion dynamics to maximize computational efficiency while maintaining—sometimes improving—generation accuracy. The algorithm is notable for its single-shot pruning protocol, in which importance scores derived from masked-token attention are computed after the initial diffusion step, and only the top-scoring visual tokens are retained for subsequent processing steps. This process yields substantial reductions in floating point operations (FLOPs), latency, and memory requirements, and achieves state-of-the-art throughput improvements without the need for model retraining (Xu et al., 16 Nov 2025).
1. Diffusion Vision-LLM (DVLM) Inference Pipeline
DVLMs integrate visual and linguistic modalities using transformer architectures with a parallel token decoding process enabled by diffusion-based unmasking. The model architecture comprises:
- Vision Encoder: Splits the input image into patches, embeds each patch into a d_v-dimensional vector, then processes the sequence through transformer layers to yield N_V visual token embeddings.
- Projector: Maps the visual token embeddings into the shared language space of dimension d, producing the matrix X_V ∈ R^{N_V × d}.
- Text Encoder: Encodes an N_T-token textual prompt into X_T ∈ R^{N_T × d}.
- Diffusion LLM (DLM): An L-layer, H-head transformer with bidirectional attention, responsible for unmasking a sequence of L_R response tokens over T steps.
Inference begins with a fully masked response R^(0). At each step t = 1, …, T, the DLM predicts token distributions over the full sequence,

R^(t) = DLM(X_V, X_T, R^(t−1)),

where previously unmasked tokens remain static while masked positions are selectively unmasked based on the model's predictions. The computational complexity for a single layer on sequence length S = N_V + N_T + L_R is O(S²·d + S·d·d_ff), with d_ff as the FFN hidden dimension. Since N_V typically dominates S, reducing N_V (the visual token count) quadratically impacts overall FLOPs due to self-attention costs.
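The unmasking loop above can be sketched with a toy NumPy stand-in for the DLM forward pass (all sizes, the confidence heuristic, and the per-step unmasking budget are illustrative assumptions, not the actual model):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative only; real DVLMs are far larger).
L_R, VOCAB, T = 8, 50, 4          # response length, vocab size, diffusion steps
MASK = -1                         # sentinel for a still-masked position

response = np.full(L_R, MASK)     # R^(0): fully masked response

for t in range(1, T + 1):
    # Stand-in for the DLM forward pass: random per-position "logits".
    logits = rng.random((L_R, VOCAB))
    preds = logits.argmax(axis=1)
    conf = logits.max(axis=1)

    masked = np.flatnonzero(response == MASK)
    # Unmask the most confident share of remaining masked positions this step.
    k = max(1, int(np.ceil(len(masked) / (T - t + 1))))
    chosen = masked[np.argsort(conf[masked])[-k:]]
    response[chosen] = preds[chosen]  # previously unmasked tokens stay fixed
```

After T steps every position has been unmasked exactly once, mirroring the "unmask once, then keep static" semantics of the pipeline.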
2. Masked-Token–Guided Visual Token Importance Scoring
RedVTP's central innovation is in computing visual token importance scores based exclusively on attention from still-masked response tokens immediately after the first diffusion step. Let A_1^(l,h) be the attention matrix for head h of layer l at step 1. The averaged attention over all H heads and L layers is:

Ā_1 = (1 / (L·H)) · Σ_{l=1}^{L} Σ_{h=1}^{H} A_1^(l,h).

The importance score for visual token j at step t is defined as:

s_j^(t) = (1 / |M_t|) · Σ_{i ∈ M_t} Ā_t[i, j],

where M_t is the set of still-masked response positions (indexed by i), j indexes visual tokens, and Ā_t is defined analogously to Ā_1 at step t.
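The scoring rule can be sketched in NumPy, assuming a standard sequence layout with visual tokens first and response tokens last (all sizes and the random attention weights are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sizes: N_V visual tokens, N_T prompt tokens, L_R response
# tokens, L layers, H heads.
N_V, N_T, L_R, L, H = 16, 4, 6, 2, 3
S = N_V + N_T + L_R

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Attention matrices A_1^(l,h) from step 1, shape (L, H, S, S); each row
# sums to 1 over the full sequence, as in real attention.
attn = softmax(rng.random((L, H, S, S)), axis=-1)

# Average over all heads and layers: A_bar_1.
A_bar = attn.mean(axis=(0, 1))                     # (S, S)

# Response tokens occupy the last L_R positions; at step 1 all are masked.
masked_rows = np.arange(S - L_R, S)
# s_j: mean attention that masked response rows pay to visual column j.
scores = A_bar[masked_rows][:, :N_V].mean(axis=0)  # (N_V,)
```

Only the masked-row slice of the averaged attention matters, which is why the score can be read off from a single forward pass.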
Empirically, on datasets such as InfoVQA with LLaDA-V, the cosine similarity between s^(1) and s^(t) for t > 1 exceeds 0.95, indicating that importance rankings are stable after the first step and do not benefit from recalculation in subsequent steps.
3. RedVTP Pruning Procedure
Exploiting the stability of masked-token-based importance scores, RedVTP conducts pruning once, immediately after the first diffusion step. The protocol consists of:
- Running the initial diffusion step (t = 1) on the complete token sequence and recording all attention matrices A_1^(l,h).
- Computing the averaged attention Ā_1 and the importance vector s^(1).
- Selecting a retention ratio r ∈ (0, 1]; the top-r fraction of visual tokens by importance (⌈r·N_V⌉ tokens) are retained.
- Forming the pruned matrix X_V′ by extracting the corresponding visual token rows from X_V.
- For subsequent steps t = 2, …, T, progressing the diffusion process on the reduced sequence.
This single-shot pruning introduces only minor algorithmic overhead while remaining entirely training-free, as no additional learning or model adaptation occurs. The algorithmic pseudocode strictly follows these steps, ensuring transparent reproducibility.
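A minimal NumPy sketch of the single-shot selection step (the embeddings and scores are random stand-ins for the step-1 quantities described above):

```python
import numpy as np

rng = np.random.default_rng(2)

N_V, d = 16, 8
r = 0.25                                 # retention ratio

X_V = rng.random((N_V, d))               # projected visual token embeddings
scores = rng.random(N_V)                 # importance vector s^(1) (stand-in)

k = max(1, int(np.ceil(r * N_V)))        # number of visual tokens to keep
keep = np.sort(np.argsort(scores)[-k:])  # top-k indices, original order kept
X_V_pruned = X_V[keep]                   # X_V' used for steps t = 2, ..., T
```

Sorting the retained indices preserves the original token order, so positional information within the pruned visual sequence is unchanged.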
4. Computational Complexity and Efficiency
Prior to pruning, the per-step computational cost is

C_full = L · (S²·d + S·d·d_ff),

with S = N_V + N_T + L_R. After pruning to N_V′ = ⌈r·N_V⌉ visual tokens, the sequence length becomes S′ = N_V′ + N_T + L_R, and the remaining steps (t = 2, …, T) operate at this reduced size:

C_pruned = L · (S′²·d + S′·d·d_ff).

Because the quadratic attention term dominates in large-S regimes, a reduction in N_V induces significant computational savings, with the per-layer cost approaching the linear FFN term as N_V shrinks.
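Plugging representative (assumed, not reported) sizes into the cost model makes the savings concrete; the first step runs at full length and the remaining T − 1 steps at the pruned length:

```python
def layer_flops(S, d, d_ff):
    # Per-layer cost model: O(S^2 * d) attention + O(S * d * d_ff) FFN.
    return S * S * d + S * d * d_ff

# Hypothetical sizes for illustration only.
N_V, N_T, L_R, d, d_ff = 576, 64, 128, 4096, 11008
T, L = 32, 32
r = 0.25

S_full = N_V + N_T + L_R
S_pruned = int(r * N_V) + N_T + L_R

full = T * L * layer_flops(S_full, d, d_ff)
# Step 1 at full length; steps 2..T at the pruned length.
pruned = L * (layer_flops(S_full, d, d_ff)
              + (T - 1) * layer_flops(S_pruned, d, d_ff))
print(f"FLOPs ratio (pruned / full): {pruned / full:.2f}")
```

Under these assumed sizes the pruned schedule costs well under half the full-length FLOPs, consistent with the quadratic term dominating.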
5. Empirical Findings
RedVTP is benchmarked on standard multimodal datasets, including Ai2D, DocVQA, RealworldQA, InfoVQA, MME, and MMBench. Summarized results for the LLaDA-V and LaViDa models are as follows:
| Model | Token Retention (r) | Latency Reduction (%) | Throughput Gain (%) | Accuracy Change (%) |
|---|---|---|---|---|
| LLaDA-V | 75% | 23.11 | 32.35 | +0.16 |
| LLaDA-V | 50% | 32.04 | 52.75 | –0.26 |
| LLaDA-V | 25% | 44.57 (max 64.97) | 91.66 (max 186) | –4.15 |
| LaViDa+RedVTP | 75% | 10.70 (max 21.87) | 12.61 (max 28.05) | –2.20 |
Notably, on InfoVQA (LLaDA-V, r = 25%), latency decreases from 17.31 s to 6.06 s and throughput increases from 1.79 to 5.11 tokens/s. When both LLaDA-V+RedVTP and LaViDa retain 25% of visual tokens, RedVTP's average accuracy across six benchmarks is 23.72 points higher than LaViDa's.
6. Ablation and Comparative Evaluation
Ablation studies address alternative importance scoring and pruning strategies:
- Importance scores based solely on attention from still-masked response tokens yield the highest average accuracy at fixed retention ratios. Strategies using prompt-only, decoded-only, all-response-token, prompt+all, visual-only, or vision-encoder-based (FoPru) scoring underperform by 1.2–6.5 accuracy points.
- Random pruning inflicts substantial accuracy drops.
- Progressive pruning, which performs re-scoring and token removal at each diffusion step, attains similar accuracy as RedVTP but increases latency up to 34.4% and lowers throughput by up to 25.5%, rendering it less efficient.
These ablation results demonstrate that masked-token-guided, single-step pruning constitutes the highest-utility approach for DVLM efficiency improvements.
7. Significance and Applicability
RedVTP provides an architecture-agnostic, parameter-free, and training-free mechanism for accelerating inference in diffusion-based vision-LLMs. By leveraging the empirical consistency of masked-token attention-derived visual token importance, RedVTP achieves up to 186% throughput increase and 64.97% latency reduction on canonical DVLM benchmarks, often with negligible or positive impact on accuracy. These properties enable scalable, efficient deployment of DVLMs in resource-constrained or latency-sensitive settings without retraining or complicated post-hoc adaptation (Xu et al., 16 Nov 2025).