
RedVTP: Efficient DVLM Inference

Updated 23 November 2025
  • RedVTP is a training-free approach that prunes unimportant visual tokens using masked-token attention to accelerate DVLM inference.
  • It computes stable importance scores after the first diffusion step to retain top-scoring tokens, significantly reducing FLOPs and latency.
  • Empirical results show substantial throughput gains and minimal accuracy loss, validating RedVTP as an efficient, retraining-free method.

RedVTP is a training-free approach for accelerating inference in diffusion vision-language models (DVLMs) by pruning unimportant visual tokens using masked response token attention. Operating on models such as LLaDA-V and LaViDa, RedVTP exploits early-stage diffusion dynamics to maximize computational efficiency while maintaining, and sometimes improving, generation accuracy. The algorithm is notable for its single-shot pruning protocol: importance scores derived from masked-token attention are computed after the initial diffusion step, and only the top-scoring visual tokens are retained for subsequent processing steps. This yields substantial reductions in floating-point operations (FLOPs), latency, and memory requirements, and achieves state-of-the-art throughput improvements without model retraining (Xu et al., 16 Nov 2025).

1. Diffusion Vision-Language Model (DVLM) Inference Pipeline

DVLMs integrate visual and linguistic modalities using transformer architectures with a parallel token decoding process enabled by diffusion-based unmasking. The model architecture comprises:

  • Vision Encoder: Splits the input image into $N$ patches, embedding each into $\mathbb{R}^{d_v}$, then processing via transformer layers to yield $N$ visual token embeddings.
  • Projector: Maps visual token embeddings into a shared language space of dimension $d$, generating the matrix $V \in \mathbb{R}^{N \times d}$.
  • Text Encoder: Encodes an $m$-token textual prompt into $T \in \mathbb{R}^{m \times d}$.
  • Diffusion Language Model (DLM): An $L$-layer, $H$-head transformer with bidirectional attention, responsible for unmasking a sequence of $\tau$ response tokens over $K$ steps.

Inference begins with a fully masked response $R_1 = [M, \dots, M] \in \mathbb{R}^{\tau \times d}$. At each step $k$, the model processes the concatenated sequence

$$X_k = [V; T; R_k] \in \mathbb{R}^{(N+m+\tau)\times d},$$

where previously unmasked tokens remain static while masked positions are selectively unmasked based on the model's predictions. The computational complexity for a single layer on sequence length $n = N + m + \tau$ is $O(4nd^2 + 2n^2d + 2nd\mu)$, with $\mu$ the FFN hidden dimension. Since $N \gg m, \tau$, reducing $N$ (the visual token count) quadratically impacts overall FLOPs due to self-attention costs.
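For concreteness, the following Python sketch evaluates this per-layer cost; all dimensions (`N`, `m`, `tau`, `d`, `mu`) are hypothetical placeholders rather than values from the paper.

```python
def layer_flops(n: int, d: int, mu: int) -> int:
    """Per-layer transformer cost 4*n*d^2 + 2*n^2*d + 2*n*d*mu:
    QKV/output projections, self-attention, and the FFN."""
    return 4 * n * d**2 + 2 * n**2 * d + 2 * n * d * mu

# Hypothetical sizes: N visual tokens dominate the m prompt and tau response tokens.
N, m, tau, d, mu = 2048, 64, 128, 4096, 11008
print(layer_flops(N + m + tau, d, mu))       # full sequence
print(layer_flops(N // 4 + m + tau, d, mu))  # after pruning to r = 0.25
```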

2. Masked-Token–Guided Visual Token Importance Scoring

RedVTP's central innovation is computing visual token importance scores based exclusively on attention from still-masked response tokens immediately after the first diffusion step. Let $A_k^{(l,h)} \in \mathbb{R}^{(N+m+\tau)\times(N+m+\tau)}$ be the attention matrix for head $h$ of layer $l$ at step $k$. The averaged attention over all heads and layers is:

$$\bar A_k = \frac{1}{HL} \sum_{h=1}^H \sum_{l=1}^L A_k^{(l,h)}.$$

The importance score for visual token $i$ at step $k$ is defined as:

$$S_k[i] = \frac{1}{|\mathcal{I}_k(M)|} \sum_{j\in \mathcal{I}_k(M)} \bar A_k[j, i], \quad i \in \mathcal{I}(V),$$

where $\mathcal{I}_k(M)$ indexes still-masked response positions and $\mathcal{I}(V)$ indexes visual tokens.
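A minimal NumPy sketch of this scoring rule, assuming the per-step attention maps are available as a single array (shapes and names are illustrative, not the authors' code):

```python
import numpy as np

def visual_token_scores(attn, visual_idx, masked_idx):
    """Compute S_k: average attention from still-masked response positions
    to each visual token, pooled over layers and heads.

    attn:       (L, H, n, n) array of A_k^{(l,h)} for one diffusion step
    visual_idx: indices of the N visual tokens in the sequence
    masked_idx: indices of the still-masked response positions
    """
    a_bar = attn.mean(axis=(0, 1))                 # average over layers and heads
    block = a_bar[np.ix_(masked_idx, visual_idx)]  # rows j (masked) -> cols i (visual)
    return block.mean(axis=0)                      # mean over masked positions j
```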

Empirically, on datasets such as InfoVQA with LLaDA-V ($K = 16$), the cosine similarity between $S_1$ and $S_k$ for $k = 2, \dots, 16$ exceeds 0.95, indicating that importance rankings are stable after the first step and do not benefit from recalculation in subsequent steps.
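This stability check is straightforward to reproduce in principle: collect $S_k$ at every step and compare each against $S_1$. A sketch under the same illustrative assumptions as above:

```python
import numpy as np

def score_stability(step_scores):
    """Cosine similarity between S_1 and each later S_k; values near 1
    indicate the step-1 ranking needs no recalculation."""
    s1 = step_scores[0] / np.linalg.norm(step_scores[0])
    return [float(s1 @ (sk / np.linalg.norm(sk))) for sk in step_scores[1:]]
```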

3. RedVTP Pruning Procedure

Exploiting the stability of masked-token-based importance scores, RedVTP conducts pruning once, immediately after the first diffusion step. The protocol consists of:

  • Running the initial diffusion step ($k = 1$) on the complete sequence $[V; T; R_1]$ and recording all attention matrices $A_1^{(l,h)}$.
  • Computing the averaged attention $\bar A_1$ and importance vector $S_1$.
  • Selecting a retention ratio $r \in (0, 1]$; the top-$r$ fraction of visual tokens by importance, $\mathcal{I}^{\text{keep}} = \operatorname{Top}(\mathcal{I}(V), S_1, r)$, is retained.
  • Forming $V^{\text{keep}}$ by extracting the corresponding visual token rows.
  • Running the remaining diffusion steps $k = 2, \dots, K$ on the reduced sequence $[V^{\text{keep}}; T; R_k]$.

This single-shot pruning introduces only minor overhead and remains entirely training-free, as no additional learning or model adaptation occurs. The algorithm follows these steps directly, making it straightforward to reproduce.
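A compact sketch of the selection step in NumPy, with hypothetical shapes (the score vector would come from a helper like `visual_token_scores` above):

```python
import numpy as np

def redvtp_prune(V, s1, r):
    """Single-shot pruning: keep the top-r fraction of visual tokens
    ranked by their step-1 importance scores.

    V:  (N, d) projected visual tokens
    s1: (N,) importance vector S_1
    r:  retention ratio in (0, 1]
    """
    n_keep = max(1, int(round(r * len(s1))))
    keep = np.sort(np.argsort(s1)[-n_keep:])  # top-r indices, original order kept
    return V[keep], keep

# Steps k = 2..K then run on the reduced sequence [V_keep; T; R_k].
```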

4. Computational Complexity and Efficiency

Prior to pruning, the computational cost is

$$\text{FLOPs} = L\,K\,(4nd^2 + 2n^2d + 2nd\mu),$$

with $n = N + m + \tau$. After pruning to $N_r = rN$ visual tokens, the sequence length becomes $n_r = N_r + m + \tau$, and the remaining $K - 1$ steps operate at this reduced size:

$$\text{Total Cost} \approx L(4nd^2 + 2n^2d + 2nd\mu) + (K-1)\,L(4n_r d^2 + 2n_r^2 d + 2n_r d\mu).$$

Because the quadratic term $n^2$ dominates when $N$ is large, reducing $N$ induces significant computational savings; since $(K-1) \gg 1$, the single full-length step becomes negligible and nearly the entire cost scales with the pruned length $n_r$.
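The following sketch makes the savings concrete by comparing the pruned total against the unpruned baseline (dimensions are again hypothetical placeholders):

```python
def total_flops(N, m, tau, d, mu, L, K, r):
    """One full-length step plus K-1 steps at the pruned sequence length."""
    def per_layer(n):
        return 4 * n * d**2 + 2 * n**2 * d + 2 * n * d * mu
    n, n_r = N + m + tau, int(r * N) + m + tau
    return L * per_layer(n) + (K - 1) * L * per_layer(n_r)

args = (2048, 64, 128, 4096, 11008, 32, 16)  # N, m, tau, d, mu, L, K
print(total_flops(*args, r=0.25) / total_flops(*args, r=1.0))  # fraction of baseline
```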

5. Empirical Findings

RedVTP is benchmarked on standard multimodal datasets, including AI2D, DocVQA, RealWorldQA, InfoVQA, MME, and MMBench. Summarized results for the LLaDA-V and LaViDa models are as follows:

Model         | Token Retention (r) | Latency Reduction (%) | Throughput Gain (%) | Accuracy Change (%)
LLaDA-V       | 75%                 | 23.11                 | 32.35               | +0.16
LLaDA-V       | 50%                 | 32.04                 | 52.75               | –0.26
LLaDA-V       | 25%                 | 44.57 (max 64.97)     | 91.66 (max 186)     | –4.15
LaViDa+RedVTP | 75%                 | 10.70 (max 21.87)     | 12.61 (max 28.05)   | –2.20

Notably, on InfoVQA (LLaDA-V, $r = 25\%$), latency decreases from 17.31 s to 6.06 s and throughput increases from 1.79 to 5.11 tokens/s. When both LLaDA-V+RedVTP and LaViDa retain approximately 25% of visual tokens, RedVTP yields +23.72% higher average accuracy across the six benchmarks relative to LaViDa.

6. Ablation and Comparative Evaluation

Ablation studies address alternative importance scoring and pruning strategies:

  • Importance scores based solely on attention from still-masked response tokens yield the highest average accuracy at a fixed retention ratio ($r = 25\%$). Strategies using prompt-only, decoded-only, all-response-token, prompt+all, visual-only, or vision-encoder-based (FoPru) scoring underperform by 1.2–6.5 accuracy points.
  • Random pruning incurs substantial accuracy drops.
  • Progressive pruning, which re-scores and removes tokens at each diffusion step, attains accuracy similar to RedVTP's but increases latency by up to 34.4% and lowers throughput by up to 25.5%, making it less efficient.

These ablation results demonstrate that masked-token-guided, single-step pruning is the most effective of the tested approaches for improving DVLM efficiency.

7. Significance and Applicability

RedVTP provides an architecture-agnostic, parameter-free, and training-free mechanism for accelerating inference in diffusion-based vision-language models. By leveraging the empirical consistency of visual token importance derived from masked-token attention, RedVTP achieves up to a 186% throughput increase and a 64.97% latency reduction on canonical DVLM benchmarks, often with negligible or positive impact on accuracy. These properties enable scalable, efficient deployment of DVLMs in resource-constrained or latency-sensitive settings without retraining or complicated post-hoc adaptation (Xu et al., 16 Nov 2025).
