RedVTP: Efficient DVLM Inference
- RedVTP is a training-free approach that prunes unimportant visual tokens using masked-token attention to accelerate DVLM inference.
- It computes stable importance scores after the first diffusion step to retain top-scoring tokens, significantly reducing FLOPs and latency.
- Empirical results show substantial throughput gains and minimal accuracy loss, validating RedVTP as an efficient, retraining-free method.
RedVTP is a training-free approach for accelerating inference in diffusion vision-LLMs (DVLMs) by pruning unimportant visual tokens using masked response token attention. Operating on models such as LLaDA-V and LaViDa, RedVTP harnesses the early-stage diffusion dynamics to maximize computational efficiency while maintaining—sometimes improving—generation accuracy. The algorithm is notable for its single-shot pruning protocol, in which importance scores derived from masked-token attention are computed after the initial diffusion step, and only the top-scoring visual tokens are retained for subsequent processing steps. This process yields substantial reductions in floating point operations (FLOPs), latency, and memory requirements, and achieves state-of-the-art throughput improvements without the need for model retraining (Xu et al., 16 Nov 2025).
1. Diffusion Vision-LLM (DVLM) Inference Pipeline
DVLMs integrate visual and linguistic modalities using transformer architectures with a parallel token decoding process enabled by diffusion-based unmasking. The model architecture comprises:
- Vision Encoder: Splits the input image into patches, embeds each patch into a d_v-dimensional vector, then processes the sequence through transformer layers to yield N_V visual token embeddings.
- Projector: Maps the visual token embeddings into the shared language space of dimension d, producing the matrix X_V ∈ R^{N_V × d}.
- Text Encoder: Encodes an N_T-token textual prompt into X_T ∈ R^{N_T × d}.
- Diffusion LLM (DLM): An L-layer, H-head transformer with bidirectional attention, responsible for unmasking a sequence of L_R response tokens over T steps.
Inference begins with a fully masked response R^(0). At each step t = 1, …, T, the DLM predicts token distributions over the full sequence,

R^(t) = DLM(X_V, X_T, R^(t−1)),

where previously unmasked tokens remain static while masked positions are selectively unmasked based on the model's predictions. The computational complexity for a single layer on sequence length S = N_V + N_T + L_R is O(S²·d + S·d·d_ff), with d_ff as the FFN hidden dimension. Since N_V typically dominates S, reducing N_V (the visual token count) quadratically impacts overall FLOPs due to self-attention costs.
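The unmasking loop above can be sketched with a toy NumPy stand-in for the DLM forward pass (all sizes, the confidence heuristic, and the per-step unmasking budget are illustrative assumptions, not the actual model):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative only; real DVLMs are far larger).
L_R, VOCAB, T = 8, 50, 4          # response length, vocab size, diffusion steps
MASK = -1                         # sentinel for a still-masked position

response = np.full(L_R, MASK)     # R^(0): fully masked response

for t in range(1, T + 1):
    # Stand-in for the DLM forward pass: random per-position "logits".
    logits = rng.random((L_R, VOCAB))
    preds = logits.argmax(axis=1)
    conf = logits.max(axis=1)

    masked = np.flatnonzero(response == MASK)
    # Unmask the most confident share of remaining masked positions this step.
    k = max(1, int(np.ceil(len(masked) / (T - t + 1))))
    chosen = masked[np.argsort(conf[masked])[-k:]]
    response[chosen] = preds[chosen]  # previously unmasked tokens stay fixed
```

After T steps every position has been unmasked exactly once, mirroring the "unmask once, then keep static" semantics of the pipeline.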
2. Masked-Token–Guided Visual Token Importance Scoring
RedVTP's central innovation is in computing visual token importance scores based exclusively on attention from still-masked response tokens immediately after the first diffusion step. Let A_1^(l,h) be the attention matrix for head h of layer l at step 1. The averaged attention over all H heads and L layers is:

Ā_1 = (1 / (L·H)) · Σ_{l=1}^{L} Σ_{h=1}^{H} A_1^(l,h).

The importance score for visual token j at step t is defined as:

s_j^(t) = (1 / |M_t|) · Σ_{i ∈ M_t} Ā_t[i, j],

where M_t is the set of still-masked response positions (indexed by i), j indexes visual tokens, and Ā_t is defined analogously to Ā_1 at step t.
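The scoring rule can be sketched in NumPy, assuming a standard sequence layout with visual tokens first and response tokens last (all sizes and the random attention weights are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sizes: N_V visual tokens, N_T prompt tokens, L_R response
# tokens, L layers, H heads.
N_V, N_T, L_R, L, H = 16, 4, 6, 2, 3
S = N_V + N_T + L_R

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Attention matrices A_1^(l,h) from step 1, shape (L, H, S, S); each row
# sums to 1 over the full sequence, as in real attention.
attn = softmax(rng.random((L, H, S, S)), axis=-1)

# Average over all heads and layers: A_bar_1.
A_bar = attn.mean(axis=(0, 1))                     # (S, S)

# Response tokens occupy the last L_R positions; at step 1 all are masked.
masked_rows = np.arange(S - L_R, S)
# s_j: mean attention that masked response rows pay to visual column j.
scores = A_bar[masked_rows][:, :N_V].mean(axis=0)  # (N_V,)
```

Only the masked-row slice of the averaged attention matters, which is why the score can be read off from a single forward pass.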
Empirically, on datasets such as InfoVQA with LLaDA-V, the cosine similarity between s^(1) and s^(t) for t > 1 exceeds 0.95, indicating that importance rankings are stable after the first step and do not benefit from recalculation in subsequent steps.
3. RedVTP Pruning Procedure
Exploiting the stability of masked-token-based importance scores, RedVTP conducts pruning once, immediately after the first diffusion step. The protocol consists of:
- Running the initial diffusion step (t = 1) on the complete token sequence and recording all attention matrices A_1^(l,h).
- Computing the averaged attention Ā_1 and the importance vector s^(1).
- Selecting a retention ratio r ∈ (0, 1]; the top-r fraction of visual tokens by importance (⌈r·N_V⌉ tokens) are retained.
- Forming the pruned matrix X_V′ by extracting the corresponding visual token rows from X_V.
- For subsequent steps t = 2, …, T, progressing the diffusion process on the reduced sequence.
This single-shot pruning introduces only minor algorithmic overhead while remaining entirely training-free, as no additional learning or model adaptation occurs. The algorithmic pseudocode strictly follows these steps, ensuring transparent reproducibility.
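A minimal NumPy sketch of the single-shot selection step (the embeddings and scores are random stand-ins for the step-1 quantities described above):

```python
import numpy as np

rng = np.random.default_rng(2)

N_V, d = 16, 8
r = 0.25                                 # retention ratio

X_V = rng.random((N_V, d))               # projected visual token embeddings
scores = rng.random(N_V)                 # importance vector s^(1) (stand-in)

k = max(1, int(np.ceil(r * N_V)))        # number of visual tokens to keep
keep = np.sort(np.argsort(scores)[-k:])  # top-k indices, original order kept
X_V_pruned = X_V[keep]                   # X_V' used for steps t = 2, ..., T
```

Sorting the retained indices preserves the original token order, so positional information within the pruned visual sequence is unchanged.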
4. Computational Complexity and Efficiency
Prior to pruning, the per-step computational cost is

C_full = L · (S²·d + S·d·d_ff),

with S = N_V + N_T + L_R. After pruning to N_V′ = ⌈r·N_V⌉ visual tokens, the sequence length becomes S′ = N_V′ + N_T + L_R, and the remaining steps (t = 2, …, T) operate at this reduced size:

C_pruned = L · (S′²·d + S′·d·d_ff).

Because the quadratic attention term dominates in large-S regimes, a reduction in N_V induces significant computational savings, with the per-layer cost approaching the linear FFN term as N_V shrinks.
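Plugging representative (assumed, not reported) sizes into the cost model makes the savings concrete; the first step runs at full length and the remaining T − 1 steps at the pruned length:

```python
def layer_flops(S, d, d_ff):
    # Per-layer cost model: O(S^2 * d) attention + O(S * d * d_ff) FFN.
    return S * S * d + S * d * d_ff

# Hypothetical sizes for illustration only.
N_V, N_T, L_R, d, d_ff = 576, 64, 128, 4096, 11008
T, L = 32, 32
r = 0.25

S_full = N_V + N_T + L_R
S_pruned = int(r * N_V) + N_T + L_R

full = T * L * layer_flops(S_full, d, d_ff)
# Step 1 at full length; steps 2..T at the pruned length.
pruned = L * (layer_flops(S_full, d, d_ff)
              + (T - 1) * layer_flops(S_pruned, d, d_ff))
print(f"FLOPs ratio (pruned / full): {pruned / full:.2f}")
```

Under these assumed sizes the pruned schedule costs well under half the full-length FLOPs, consistent with the quadratic term dominating.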
5. Empirical Findings
RedVTP is benchmarked on standard multimodal datasets, including Ai2D, DocVQA, RealworldQA, InfoVQA, MME, and MMBench. Summarized results for the LLaDA-V and LaViDa models are as follows:
| Model | Token Retention (r) | Latency Reduction (%) | Throughput Gain (%) | Accuracy Change (%) |
|---|---|---|---|---|
| LLaDA-V | 75% | 23.11 | 32.35 | +0.16 |
| LLaDA-V | 50% | 32.04 | 52.75 | –0.26 |
| LLaDA-V | 25% | 44.57 (max 64.97) | 91.66 (max 186) | –4.15 |
| LaViDa+RedVTP | 75% | 10.70 (max 21.87) | 12.61 (max 28.05) | –2.20 |
Notably, on InfoVQA (LLaDA-V, r = 25%), latency decreases from 17.31 s to 6.06 s and throughput increases from 1.79 to 5.11 tokens/s. When both LLaDA-V+RedVTP and LaViDa retain 25% of visual tokens, RedVTP's average accuracy across six benchmarks is 23.72 points higher than LaViDa's.
6. Ablation and Comparative Evaluation
Ablation studies address alternative importance scoring and pruning strategies:
- Importance scores based solely on attention from still-masked response tokens yield the highest average accuracy at fixed retention ratios. Strategies using prompt-only, decoded-only, all-response-token, prompt+all, visual-only, or vision-encoder-based (FoPru) scoring underperform by 1.2–6.5 accuracy points.
- Random pruning inflicts substantial accuracy drops.
- Progressive pruning, which performs re-scoring and token removal at each diffusion step, attains similar accuracy as RedVTP but increases latency up to 34.4% and lowers throughput by up to 25.5%, rendering it less efficient.
These ablation results demonstrate that masked-token-guided, single-step pruning constitutes the highest-utility approach for DVLM efficiency improvements.
7. Significance and Applicability
RedVTP provides an architecture-agnostic, parameter-free, and training-free mechanism for accelerating inference in diffusion-based vision-LLMs. By leveraging the empirical consistency of masked-token attention-derived visual token importance, RedVTP achieves up to 186% throughput increase and 64.97% latency reduction on canonical DVLM benchmarks, often with negligible or positive impact on accuracy. These properties enable scalable, efficient deployment of DVLMs in resource-constrained or latency-sensitive settings without retraining or complicated post-hoc adaptation (Xu et al., 16 Nov 2025).