LaViDa: A Large Diffusion Language Model for Multimodal Understanding (2505.16839v3)

Published 22 May 2025 in cs.CV

Abstract: Modern Vision-Language Models (VLMs) can solve a wide range of tasks requiring visual reasoning. In real-world scenarios, desirable properties for VLMs include fast inference and controllable generation (e.g., constraining outputs to adhere to a desired format). However, existing autoregressive (AR) VLMs like LLaVA struggle in these aspects. Discrete diffusion models (DMs) offer a promising alternative, enabling parallel decoding for faster inference and bidirectional context for controllable generation through text-infilling. While effective in language-only settings, DMs' potential for multimodal tasks is underexplored. We introduce LaViDa, a family of VLMs built on DMs. We build LaViDa by equipping DMs with a vision encoder and jointly fine-tune the combined parts for multimodal instruction following. To address challenges encountered, LaViDa incorporates novel techniques such as complementary masking for effective training, prefix KV cache for efficient inference, and timestep shifting for high-quality sampling. Experiments show that LaViDa achieves competitive or superior performance to AR VLMs on multi-modal benchmarks such as MMMU, while offering unique advantages of DMs, including flexible speed-quality tradeoff, controllability, and bidirectional reasoning. On COCO captioning, LaViDa surpasses Open-LLaVa-Next-8B by +4.1 CIDEr with 1.92x speedup. On bidirectional tasks, it achieves +59% improvement on Constrained Poem Completion. These results demonstrate LaViDa as a strong alternative to AR VLMs. Code and models will be released in the camera-ready version.

Summary

LaViDa: A Large Diffusion Language Model for Multimodal Understanding

The paper introduces LaViDa, a family of Vision-Language Models (VLMs) built on discrete diffusion models, presented as an alternative to the autoregressive VLMs that dominate the field. VLMs have demonstrated significant potential across applications thanks to their ability to process and integrate both visual and textual information, but autoregressive VLMs such as LLaVA often suffer from slow inference because they generate tokens one at a time. Diffusion models offer a promising substitute: they model bidirectional context and their inference is highly parallelizable. The paper aims to close the gap in applying diffusion models to multimodal tasks by exploiting these inherent advantages.
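
To make the parallel-decoding idea concrete, the sketch below runs a fully masked sequence through a fixed number of denoising steps and commits the most confident predictions at each step. It is a toy illustration only: `score_tokens` is a random stand-in for the diffusion language model, and the vocabulary, step count, and confidence rule are illustrative assumptions rather than LaViDa's actual decoding procedure.

```python
import random

MASK = "<mask>"
VOCAB = ["a", "photo", "of", "two", "cats", "on", "one", "sofa"]

def score_tokens(seq):
    """Toy stand-in for the diffusion LM: propose a (token, confidence)
    pair for every masked position. A real model would obtain these from
    a single bidirectional transformer forward pass over the sequence."""
    return {i: (random.choice(VOCAB), random.random())
            for i, tok in enumerate(seq) if tok == MASK}

def parallel_decode(length=8, num_steps=4):
    """Start fully masked and commit several tokens per step.
    Fewer steps means faster decoding; more steps means higher quality."""
    seq = [MASK] * length
    per_step = length // num_steps
    for _ in range(num_steps):
        guesses = score_tokens(seq)
        best = sorted(guesses.items(), key=lambda kv: kv[1][1], reverse=True)
        for pos, (tok, _conf) in best[:per_step]:
            seq[pos] = tok  # unmask the most confident positions this step
    return seq

print(parallel_decode())
```

The same loop exposes the speed-quality knob mentioned throughout the paper: halving `num_steps` roughly doubles the number of tokens committed per forward pass.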

LaViDa couples a diffusion language model with a vision encoder and introduces techniques such as complementary masking, prefix KV caching, and timestep shifting. Together these allow LaViDa to match or exceed autoregressive VLMs on multimodal benchmarks while retaining the advantages of diffusion models, namely faster inference and controllable generation. Notably, LaViDa surpasses Open-LLaVa-Next-Llama3-8B by +4.1 CIDEr on COCO captioning with a 1.92× speedup, and it achieves a +59% improvement on constrained poem completion, underscoring the strength of diffusion-based models on bidirectional reasoning tasks that are a noted weakness of AR models.

Technical Contributions

  1. Complementary Masking: To improve data efficiency during training, LaViDa masks each training answer with two complementary masks so that every token is supervised in exactly one of the two views. This prevents informative tokens from being dropped from the training signal, which is particularly important for multimodal instruction-following tasks (a sketch of the idea follows this list).
  2. Prefix-DLM Caching: LaViDa caches the keys and values of the image and prompt prefix, which never changes during decoding, and reuses them across denoising steps. This removes a major inference bottleneck of prior diffusion language models and yields substantial speedups (also sketched after this list).
  3. Timestep Shifting: This technique reshapes the sampling schedule so that generation quality holds up even with few diffusion steps. By controlling how many tokens are unmasked at each step, it provides a tunable speed-quality tradeoff that sequential autoregressive decoding does not offer (also sketched after this list).
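
Complementary masking can be pictured as two training views whose masks partition the answer positions, so every token is hidden, and therefore supervised, in exactly one view. The sketch below assumes a 50% split over plain token-id lists; the function name and mask id are hypothetical, not LaViDa's implementation.

```python
import random

MASK_ID = -1  # illustrative mask token id

def make_complementary_views(answer_ids, mask_ratio=0.5):
    """Split the answer positions into two complementary sets and build two
    training views, so every answer token is masked (and therefore appears
    in the loss) in exactly one of the two views."""
    positions = list(range(len(answer_ids)))
    random.shuffle(positions)
    cut = int(len(positions) * mask_ratio)
    masked_in_a = set(positions[:cut])

    view_a = [MASK_ID if i in masked_in_a else t for i, t in enumerate(answer_ids)]
    view_b = [t if i in masked_in_a else MASK_ID for i, t in enumerate(answer_ids)]
    return view_a, view_b

answer = [101, 7592, 2088, 102]  # toy token ids
print(make_complementary_views(answer))
```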
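
Prefix caching exploits the fact that the image and prompt tokens are never masked, so their attention keys and values do not change between denoising steps and can be computed once. The schematic below reuses a prefix cache across steps; `encode_prefix` and `denoise_step` are stand-ins for the real model calls, not LaViDa's API.

```python
MASK = "<mask>"

def encode_prefix(image_tokens, prompt_tokens):
    """Stand-in for one transformer pass over the clean prefix. Returns the
    per-layer keys/values, which never change while the answer is decoded."""
    return {"kv": (tuple(image_tokens), tuple(prompt_tokens))}

def denoise_step(prefix_cache, answer, k=2):
    """Stand-in for one diffusion step: only the answer positions are
    recomputed, attending to the cached prefix, and k of them are unmasked."""
    out, filled = list(answer), 0
    for i, tok in enumerate(out):
        if tok == MASK and filled < k:
            out[i] = f"tok{i}"
            filled += 1
    return out

def decode(image_tokens, prompt_tokens, answer_len=6, num_steps=3):
    cache = encode_prefix(image_tokens, prompt_tokens)   # computed once
    answer = [MASK] * answer_len
    for _ in range(num_steps):
        answer = denoise_step(cache, answer)             # cache reused each step
    return answer

print(decode(["<img0>", "<img1>"], ["Describe", "the", "image"]))
```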
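
Timestep shifting bends the mask-ratio schedule so that the unmasking budget is spread unevenly across steps, which matters most when only a few steps are used. The sketch below uses one common shifting form with an illustrative shift factor `s`; the exact function and values are assumptions, not LaViDa's published schedule.

```python
def shift(t, s=3.0):
    """Map a uniform timestep t in [0, 1] to a shifted timestep.
    With s > 1 the shifted timesteps run ahead of uniform ones, so this toy
    schedule commits more tokens in early steps; the direction and strength
    of the shift are tunable choices."""
    return s * t / (1.0 + (s - 1.0) * t)

def tokens_per_step(length=32, num_steps=8, s=3.0):
    """Tokens newly unmasked at each step under a linear mask-ratio schedule
    evaluated at shifted timesteps."""
    ts = [shift(i / num_steps, s) for i in range(num_steps + 1)]
    mask_ratio = [1.0 - t for t in ts]  # fraction of positions still masked
    return [round(length * (mask_ratio[i] - mask_ratio[i + 1]))
            for i in range(num_steps)]

print(tokens_per_step())  # uneven unmasking budget across the 8 steps
```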

Implications and Future Work

The results from LaViDa underscore its viability as a robust alternative to autoregressive VLMs, offering improved adaptability and speed-quality tradeoffs. The bidirectional nature of diffusion models enhances their handling of text infilling and constraint-based generation tasks, suggesting future developments in multimodal reasoning and structured output formatting.
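
Constraint-based generation falls out of the same bidirectional decoding: positions whose tokens are fixed in advance are simply never masked, so the model fills only the remaining slots and the output matches the template by construction. The snippet below is a minimal illustration of that pattern; the schema format and function name are hypothetical.

```python
MASK = "<mask>"

def constrained_template(schema):
    """Pin the fixed tokens and mask everything else; a bidirectional decoder
    then fills only the masked slots, so fixed structure (line breaks, rhyme
    words, field names) is preserved in the output."""
    return [tok if tok is not None else MASK for tok in schema]

# Force a two-line completion whose lines must end on given rhyme words.
schema = [None, None, None, "light", "\n", None, None, None, "night"]
print(constrained_template(schema))
```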

This paper's contributions open a pathway for further research into leveraging discrete diffusion models for broader multimodal applications, including interactive systems needing real-time processing. The advantages of controllable generation in LaViDa could be further explored for applications in content creation, automated captioning, and interactive AI systems.

In conclusion, LaViDa demonstrates that diffusion models, equipped with strategically designed techniques, can effectively meet the demands of complex vision-language tasks, suggesting a promising direction for future AI developments across multimodal domains.
