Dimple: Discrete Diffusion Multimodal Large Language Model with Parallel Decoding

Published 22 May 2025 in cs.CV | (2505.16990v2)

Abstract: In this work, we propose Dimple, the first Discrete Diffusion Multimodal LLM (DMLLM). We observe that training with a purely discrete diffusion approach leads to significant training instability, suboptimal performance, and severe length bias issues. To address these challenges, we design a novel training paradigm that combines an initial autoregressive phase with a subsequent diffusion phase. This approach yields the Dimple-7B model, trained on the same dataset and using a similar training pipeline as LLaVA-NEXT. Dimple-7B ultimately surpasses LLaVA-NEXT in performance by 3.9%, demonstrating that DMLLM can achieve performance comparable to that of autoregressive models. To improve inference efficiency, we propose a decoding strategy termed confident decoding, which dynamically adjusts the number of tokens generated at each step, significantly reducing the number of generation iterations. In autoregressive models, the number of forward iterations during generation equals the response length. With confident decoding, however, the number of iterations needed by Dimple can be as low as $\frac{\text{response length}}{3}$. We also re-implement the prefilling technique in autoregressive models and demonstrate that it does not significantly impact performance on most benchmark evaluations, while offering a speedup of 1.5x to 7x. Additionally, we explore Dimple's capability to precisely control its response using structure priors. These priors enable structured responses in a manner distinct from instruction-based or chain-of-thought prompting, and allow fine-grained control over response format and length, which is difficult to achieve in autoregressive models. Overall, this work validates the feasibility and advantages of DMLLM and enhances its inference efficiency and controllability. Code and models are available at https://github.com/yu-rp/Dimple.

Authors (3)

Summary

The paper introduces a hybrid approach that leverages autoregressive training for vision-language alignment and diffusion training for parallel decoding.
It employs a novel 'Confident Decoding' strategy that dynamically adjusts token generation based on confidence scores to streamline inference.
Benchmarking shows that Dimple-7B outperforms LLaVA-NEXT by 3.9% while being trained on the same dataset with a similar training pipeline.

Dimple: Discrete Diffusion Multimodal LLM with Parallel Decoding

Introduction

The development of LLMs has traditionally been dominated by autoregressive models, which predict each token sequentially from the preceding tokens. The paper introduces Dimple, a Discrete Diffusion Multimodal LLM (DMLLM) that uses a discrete diffusion process for sequence generation, with the aim of improving inference efficiency and response controllability. Unlike conventional autoregressive models, Dimple's diffusion-based approach allows direct control over token positions, semantic structures, and output formats, offering potential advantages in multimodal tasks.

Hybrid Training Paradigm

Dimple combines an autoregressive training phase with a diffusion training phase. This addresses the problems the authors observed with purely diffusion-based training: training instability, suboptimal performance, and severe length bias. Dimple first undergoes autoregressive training for vision-language alignment and instruction tuning, which improves its ability to follow instructions and to align the two modalities. The subsequent diffusion training phase restores parallel decoding capabilities, allowing for more flexible and efficient generation.
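The difference between the two phases can be illustrated by how training examples are built from the same response. The sketch below is a minimal illustration, assuming a masked (absorbing-state) form of discrete diffusion; the function names, the `<mask>` token, and the fixed mask ratio are hypothetical and are not taken from the paper's code.

```python
import random
from typing import Dict, List, Tuple

MASK = "<mask>"  # hypothetical mask token, for illustration only

def autoregressive_examples(response: List[str]) -> List[Tuple[List[str], str]]:
    """Autoregressive phase: predict each token from the tokens before it."""
    return [(response[:i], response[i]) for i in range(len(response))]

def diffusion_example(
    response: List[str], mask_ratio: float, rng: random.Random
) -> Tuple[List[str], Dict[int, str]]:
    """Diffusion phase (assumed masked form): hide a random subset of
    positions and predict all hidden tokens at once."""
    n_mask = max(1, round(mask_ratio * len(response)))
    masked_idx = set(rng.sample(range(len(response)), n_mask))
    noisy = [MASK if i in masked_idx else tok for i, tok in enumerate(response)]
    targets = {i: response[i] for i in masked_idx}
    return noisy, targets

response = ["two", "cats", "sit", "here"]
ar_pairs = autoregressive_examples(response)
noisy, targets = diffusion_example(response, mask_ratio=0.5, rng=random.Random(0))
```

In the autoregressive examples every target sees only its left context, while in the diffusion example the masked positions are predicted jointly from context on both sides, which is what later permits decoding several tokens in one step.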

Figure 1: Performance after Alignment.

Inference Techniques

The paper outlines inference techniques that improve decoding efficiency in diffusion models, which traditionally use full bidirectional attention. The "Confident Decoding" strategy dynamically adjusts the number of tokens decoded per iteration based on confidence scores, so the model decodes several tokens at once when it is confident. An autoregressive model needs as many forward iterations as there are response tokens; with confident decoding, the authors report that Dimple can need as few as one third of the response length, while maintaining response quality.
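The idea of confidence-driven decoding can be sketched as a loop that commits every masked position whose confidence clears a threshold. This is a minimal sketch under stated assumptions: the threshold value, the fallback of committing the single most confident position when none clears the threshold, and the names `confident_decode` and `toy_predict` are all hypothetical and do not reproduce the paper's implementation.

```python
from typing import Callable, List, Optional, Tuple

MASK = None  # hypothetical marker for a position that is still masked
Prediction = Tuple[str, float]  # (token, confidence in [0, 1])

def confident_decode(
    length: int,
    predict: Callable[[List[Optional[str]]], List[Prediction]],
    threshold: float = 0.9,
) -> Tuple[List[Optional[str]], int]:
    """Fill `length` masked slots; return (tokens, number of iterations)."""
    seq: List[Optional[str]] = [MASK] * length
    steps = 0
    while any(tok is MASK for tok in seq):
        preds = predict(seq)  # one forward pass over the whole sequence
        masked = [i for i, tok in enumerate(seq) if tok is MASK]
        chosen = [i for i in masked if preds[i][1] >= threshold]
        if not chosen:
            # Assumed fallback: commit the single most confident position
            # so that every iteration makes progress.
            chosen = [max(masked, key=lambda i: preds[i][1])]
        for i in chosen:
            seq[i] = preds[i][0]
        steps += 1
    return seq, steps

TARGET = ["the", "chart", "shows", "a", "rising", "trend"]

def toy_predict(seq: List[Optional[str]]) -> List[Prediction]:
    """Toy stand-in for a model: a position is 'confident' if its index is a
    multiple of 3 or its left neighbour has already been decoded."""
    preds = []
    for i in range(len(seq)):
        anchored = i % 3 == 0 or (i > 0 and seq[i - 1] is not MASK)
        preds.append((TARGET[i], 0.95 if anchored else 0.5))
    return preds

tokens, steps = confident_decode(len(TARGET), toy_predict)
```

In this toy run the six tokens are committed in three iterations, compared with six iterations for token-by-token decoding. The ratio depends entirely on the toy confidence rule and says nothing about the numbers reported for Dimple.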

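Additionally, the authors re-implement the "Prefilling" technique from autoregressive models, which reduces inference cost by caching certain computational states instead of recomputing them at every iteration. They report that it does not significantly affect performance on most benchmarks while offering a speedup of 1.5x to 7x.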
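A rough way to see where the saving comes from is to count how many tokens pass through the model with and without a cached prompt. The sketch below is only a counting model, assuming that the cached states belong to the prompt (image and question tokens); `CountingEncoder`, `run`, and the sizes used are hypothetical, and token counts are not a measure of wall-clock speedup.

```python
from typing import List, Tuple

class CountingEncoder:
    """Stand-in for a forward pass; it only counts the tokens it processes."""

    def __init__(self) -> None:
        self.tokens_processed = 0

    def encode(self, tokens: List[str]) -> List[Tuple[str, int]]:
        self.tokens_processed += len(tokens)
        return [(tok, len(tok)) for tok in tokens]  # fake per-token states

def run(prompt: List[str], response_len: int, iterations: int, prefilling: bool) -> int:
    """Return the total number of tokens encoded over all decoding iterations."""
    enc = CountingEncoder()
    # With prefilling, the prompt is encoded once and its states are reused.
    cached = enc.encode(prompt) if prefilling else None
    for _ in range(iterations):
        response = ["<mask>"] * response_len
        if cached is not None:
            states = cached + enc.encode(response)
        else:
            states = enc.encode(prompt + response)
        assert len(states) == len(prompt) + response_len
    return enc.tokens_processed

prompt = ["tok"] * 600  # hypothetical prompt length
without_cache = run(prompt, response_len=64, iterations=16, prefilling=False)
with_cache = run(prompt, response_len=64, iterations=16, prefilling=True)
```

Note that under full bidirectional attention the prompt states would normally depend on the response tokens, so reusing cached prompt states is an approximation rather than an exact optimization. That is consistent with the authors checking its effect on benchmark performance.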

Benchmarking and Results

Dimple performs competitively across various benchmarks against models trained at a similar data scale, while using less training data than state-of-the-art autoregressive models. Notably, Dimple-7B surpasses LLaVA-NEXT, which was trained on the same dataset with a similar pipeline, by 3.9% in overall performance, illustrating the potential of combining autoregressive and diffusion strategies.

Advantages and Limitations

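Using structure priors, Dimple can produce structured responses and precisely control response format and length. This works differently from instruction-based or chain-of-thought prompting, and such fine-grained control is difficult to achieve in autoregressive models.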
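One way to picture a structure prior is as a response template in which some positions are fixed in advance and the rest are left masked for the model to fill. The sketch below is a hypothetical illustration of that idea only: `build_template`, `fill`, the `<mask>` token, and the placeholder predictor are assumptions, and the sequential fill loop stands in for whatever decoding procedure the model actually uses.

```python
from typing import Callable, List, Tuple

MASK = "<mask>"  # hypothetical mask token

def build_template(segments: List[Tuple[List[str], int]]) -> List[str]:
    """Each segment is (fixed tokens, number of masked slots after them)."""
    template: List[str] = []
    for fixed, n_slots in segments:
        template.extend(fixed)
        template.extend([MASK] * n_slots)
    return template

def fill(template: List[str], predict: Callable[[List[str], int], str]) -> List[str]:
    """Replace only the masked slots; fixed tokens and total length are untouched."""
    out = list(template)
    for i, tok in enumerate(out):
        if tok == MASK:
            out[i] = predict(out, i)
    return out

# Fix the format ("Reasoning: ... Answer: ...") and the length (9 tokens) up front.
template = build_template([(["Reasoning", ":"], 4), (["Answer", ":"], 1)])
filled = fill(template, lambda seq, i: f"w{i}")  # placeholder predictor
```

Because the fixed tokens and the number of masked slots are set before decoding starts, both the layout and the length of the response are determined by the template rather than by a prompt instruction.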

Figure 2: An example of Length Bias Phenomenon Collected from ChartQA. The response is generated by the model after 2-epoch diffusion tuning. The red and blue boxes on the left indicate the source locations of the numbers in the response.

However, Dimple still lags behind models trained on much larger datasets. Scaling up the training data and the parameter count are future directions for fully harnessing the potential of DMLLM architectures.

Conclusion

Dimple validates the feasibility of discrete diffusion models in multimodal applications, offering parallel decoding and finer control over response format and length than traditional autoregressive methods. The research opens up pathways for more efficient and controllable LLM development, and future work could involve scaling the datasets and refining inference strategies to maximize the model's potential.
