Muddit: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model

Published 29 May 2025 in cs.LG and cs.CV | (2505.23606v2)

Abstract: Unified generation models aim to handle diverse tasks across modalities -- such as text generation, image generation, and vision-language reasoning -- within a single architecture and decoding paradigm. Autoregressive unified models suffer from slow inference due to sequential decoding, and non-autoregressive unified models suffer from weak generalization due to limited pretrained backbones. We introduce Muddit, a unified discrete diffusion transformer that enables fast and parallel generation across both text and image modalities. Unlike prior unified diffusion models trained from scratch, Muddit integrates strong visual priors from a pretrained text-to-image backbone with a lightweight text decoder, enabling flexible and high-quality multimodal generation under a unified architecture. Empirical results show that Muddit achieves competitive or superior performance compared to significantly larger autoregressive models in both quality and efficiency. The work highlights the potential of purely discrete diffusion, when equipped with strong visual priors, as a scalable and effective backbone for unified generation.

Abstract PDF Upgrade to Chat

Authors (11)

Summary

The paper introduces a unified discrete diffusion model that efficiently generates multimodal outputs while overcoming limitations of autoregressive approaches.
It employs a MaskGIT-style architecture with continuous-time Markov chains to jointly optimize text and image token reconstruction, improving generation quality.
The model achieves competitive benchmarks in text-to-image, image-to-text, and visual question answering, offering an order of magnitude faster inference.

Muddit: Liberating Multimodal Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model

Muddit is introduced as a unified discrete diffusion transformer for multimodal generation tasks, including text-to-image synthesis, image-to-text transformation, and visual question answering. The model capitalizes on pretrained text-to-image diffusion backbones, fostering alignment between text and image modalities without the efficiency constraints of conventional autoregressive models.

Unified Generative Models

Muddit addresses the dichotomy between autoregressive (AR) and diffusion-based generative models, giving rise to a unified architecture capable of parallel processing across modalities. Traditional AR models are limited by sequential decoding inefficiencies, whereas Muddit leverages a discrete diffusion framework, allowing simultaneous generation of text and image tokens in a more scalable fashion.

Figure 1: Four types of unified generative models. More details can be found in Sec.~\ref{sec:related_work}.

Muddit builds upon discrete diffusion models, which override the inefficiencies of token-by-token generation by initializing its backbone with pretrained models such as Meissonic. By embedding strong visual priors, Muddit enhances the generation quality and maintains competitive performance across varied tasks.

Architecture and Training

Muddit harnesses a MaskGIT-style architecture, utilizing discrete diffusion to obscure tokens progressively, employing continuous-time Markov chains for both text and image modalities. During training, Muddit optimizes via a masked token predictor aiming to reconstruct original token sequences by integrating visual and textual priors into a shared generation space.

Figure 2: The training and inference architecture of Muddit.

The architecture supports integrated multimodal generation under a unified loss function, ensuring robust training across vision tasks by refining text and image representations simultaneously. The model's adaptability is demonstrated through flexible conditionality in multimodal contexts, from high-resolution image synthesis to comprehensive textual analyses.

Experimental Results

Text-to-Image Generation

Muddit showcases its efficacy on the GenEval benchmark by matching the performance of larger diffusion models despite a smaller parameter footprint. The model's ability to capture intricate compositional details signifies its potential as a scalable solution, unifying multimodal generation within a discrete framework.

Figure 3: Samples of Text-to-Image Generation by Muddit.

Image-to-Text Generation

Muddit's efficiency extends to image-to-text generation, outperforming multimodal models across standard benchmarks like MS-COCO and VQAv2. It effectively utilizes pretrained components to yield coherent captions while maintaining the fidelity of visual information.

Figure 4: Samples of Image-to-Text Generation by Muddit.

Visual Question Answering

In visual question answering, Muddit demonstrates its capability to synthesize answers based on complex image contexts, employing a unified generative strategy to address both visual understanding and textual interpretation.

Figure 5: Samples of Visual Question Answering by Muddit.

Inference Efficiency

Muddit's discrete diffusion mechanism provides substantial speed benefits in generation tasks by enabling parallel processing. Compared to autoregressive models, Muddit achieves an order of magnitude faster inference without sacrificing generative performance.

Figure 6: Inference speed comparison.

Although discrete diffusion necessitates innovative strategies like KV-cache development for enhanced efficiency, Muddit's performance evinces a promising direction for real-time multimodal interactions.

Conclusion

Muddit advances the discrete diffusion paradigm by aligning visual and textual inputs in multimodal generation tasks, offering a versatile model capable of leveraging pretrained visual backbones for superior performance. Its unified approach proves effective across varied synthesis tasks, endorsing discrete diffusion as a scalable multimodal strategy. Future developments in discrete diffusion caching could further optimize Muddit’s generative capabilities, promoting its application in complex multimodal environments.

Markdown Report Issue