Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion

Published 6 Mar 2026 in cs.CV | (2603.06577v1)

Abstract: While recent multimodal LLMs (MLLMs) have made impressive strides, they predominantly employ a conventional autoregressive architecture as their backbone, leaving significant room to explore effective and efficient alternatives in architectural design. Concurrently, recent studies have successfully applied discrete diffusion models to various domains, such as visual understanding and image generation, revealing their considerable potential as a promising backbone for multimodal systems. Drawing inspiration from this pioneering research, we introduce Omni-Diffusion, the first any-to-any multimodal LLM built entirely on mask-based discrete diffusion models, which unifies understanding and generation across text, speech, and images. Omni-Diffusion employs a unified mask-based discrete diffusion model to directly capture the joint distribution over discrete multimodal tokens. This approach supports not only bimodal tasks but also more complex scenarios involving multiple modalities. On a diverse set of benchmarks, our method outperforms or performs on par with existing multimodal systems that process two or more modalities, highlighting the significant promise of diffusion models in powering the next generation of multimodal foundation models. Project webpage: https://omni-diffusion.github.io.


Authors (9)

Summary

The paper presents a novel mask-based discrete diffusion method that unifies multimodal processing by modeling joint discrete token distributions.
It leverages a progressive training pipeline and adaptive decoding strategies to efficiently align and generate text, image, and speech data.
Empirical evaluations show competitive performance in speech recognition, visual tasks, and cross-modal generation with reduced inference steps.


Motivation and Context

Multimodal research has increasingly prioritized unified models that process and generate data across modalities such as text, images, and speech. Most multimodal LLMs (MLLMs) use autoregressive backbones, but discrete diffusion models have recently proven effective in domains such as visual understanding and image generation. Omni-Diffusion presents the first mask-based discrete diffusion approach for any-to-any multimodal comprehension and generation. It directly models the joint distribution over discrete multimodal tokens, so a single backbone handles every modality and semantic representations are aligned within one model.

Figure 1: Overview of Omni-Diffusion, showing unified understanding and generation for arbitrary multimodal inputs and outputs, including tasks that combine several modalities.

Model Architecture and Unified Probabilistic Formulation

Omni-Diffusion builds upon a pre-trained diffusion LLM backbone and uses modality-specific tokenizers (MAGVIT-v2 for images, SenseVoiceSmall for speech, and the Dream-7B tokenizer for text). Each modality is tokenized into a discrete sequence, and the sequences are concatenated with modality start/end delimiters to form a unified sequence $x_0$ of $L$ discrete tokens. The mask-based diffusion process replaces a fraction of the tokens in $x_0$ with [MASK] tokens, and the model is trained with a cross-entropy loss to recover the original tokens, providing intrinsic multimodal alignment without modality-specific optimization.
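The corruption-and-recovery objective described above can be illustrated with a minimal sketch. It is not the authors' implementation: the token ids, `MASK_ID`, the vocabulary size, and the uniform stand-in for the network are all assumptions made for illustration, and the paper may apply loss weighting that is not shown here.

```python
import math
import random

MASK_ID = 0  # hypothetical id reserved for the [MASK] token


def mask_tokens(x0, mask_ratio, rng):
    """Replace a random fraction of x0 with MASK_ID; return (x_t, masked positions)."""
    n_mask = max(1, round(mask_ratio * len(x0)))
    positions = sorted(rng.sample(range(len(x0)), n_mask))
    xt = list(x0)
    for i in positions:
        xt[i] = MASK_ID
    return xt, positions


def masked_cross_entropy(probs, x0, positions):
    """Mean negative log-likelihood of the original tokens, at masked positions only.

    probs[i][v] is the predicted probability of token v at position i.
    """
    losses = [-math.log(probs[i][x0[i]]) for i in positions]
    return sum(losses) / len(losses)


# Toy unified sequence: text, image and speech ids share one vocabulary (assumption).
x0 = [5, 7, 9, 12, 3, 8, 11, 4]
xt, positions = mask_tokens(x0, mask_ratio=0.5, rng=random.Random(0))

# Stand-in for the network: a uniform distribution over a 16-token vocabulary.
vocab_size = 16
probs = [[1.0 / vocab_size] * vocab_size for _ in x0]
loss = masked_cross_entropy(probs, x0, positions)  # equals log(16) for a uniform model
```

Only masked positions contribute to the loss, so unmasked tokens of any modality act purely as context for the prediction.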

Figure 2: Omni-Diffusion’s architecture, illustrating unified token prediction for text, image, and speech using mask-based diffusion.

Progressive Multimodal Alignment and Training Techniques

Achieving robust alignment across modalities requires progressive expansion of modalities during training. Omni-Diffusion employs a three-stage pipeline:

Stage 1: Visual-language pre-alignment for text-to-image and image captioning.
Stage 2: Joint language-speech-visual training on ASR and TTS data, with continued use of the Stage 1 datasets.
Stage 3: Fine-tuning on the constructed Speech-Driven Visual Interaction (SDVI) dataset for tasks like speech-driven visual QA and speech-to-image, further enhancing cross-modal fusion.

An attenuated tail-pad masking strategy mitigates overfitting to pad tokens, enabling variable-length generation. The SDVI dataset covers diverse cross-modal tasks and is processed carefully to avoid undesirable speech synthesis scenarios.
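The summary does not spell out how the tail-pad attenuation works. One plausible reading, sketched below purely as an assumption, is that the run of pad tokens at the end of a sequence is masked with a reduced probability, so that predicting pads contributes less to training. `PAD_ID`, `MASK_ID`, and the `attenuation` factor are hypothetical names and values.

```python
import random

PAD_ID = 1   # hypothetical pad token id
MASK_ID = 0  # hypothetical [MASK] token id


def attenuated_mask_probs(tokens, mask_ratio, attenuation):
    """Per-position masking probability, scaled down for the trailing run of pads."""
    # Walk back from the end to find where the trailing pad run starts.
    tail_start = len(tokens)
    while tail_start > 0 and tokens[tail_start - 1] == PAD_ID:
        tail_start -= 1
    return [
        mask_ratio if i < tail_start else mask_ratio * attenuation
        for i in range(len(tokens))
    ]


def apply_mask(tokens, probs, rng):
    """Mask each position independently with its own probability."""
    return [MASK_ID if rng.random() < p else tok for tok, p in zip(tokens, probs)]


tokens = [5, 7, 9, 12, PAD_ID, PAD_ID, PAD_ID, PAD_ID]
probs = attenuated_mask_probs(tokens, mask_ratio=0.6, attenuation=0.25)
noisy = apply_mask(tokens, probs, random.Random(0))
```

With these illustrative values, content tokens are masked with probability 0.6 and tail pads with probability 0.15.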

Figure 3: Three-stage progressive training pipeline for multimodal alignment in Omni-Diffusion.

Inference and Decoding Strategies

Omni-Diffusion’s inference combines entropy-based decoding (which uses the entropy of each predicted token distribution as a confidence signal), a repetition penalty, classifier-free guidance, and three specialized mechanisms:

Position Penalty: Decodes image tokens preferentially from central positions, suppressing edge repetition.
Special Token Pre-Infilling: Handles dialogue tasks by guiding segment-wise modality generation.
Adaptive Token-Length Assignment: Establishes optimal sequence length for speech/text conversion, improving efficiency and quality.
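As a rough illustration of how entropy-based selection and a position penalty could interact, the sketch below picks which masked positions to commit at one decoding step. It is an assumption-laden toy, not the paper's algorithm: the scoring rule (entropy plus a penalty proportional to distance from the grid center), the function names, and the `penalty` weight are all hypothetical.

```python
import math


def entropy(dist):
    """Shannon entropy of a probability distribution (lower means more confident)."""
    return -sum(p * math.log(p) for p in dist if p > 0.0)


def center_distance(index, width, height):
    """Distance of a raster-order grid index from the grid center, scaled to [0, 1]."""
    row, col = divmod(index, width)
    cy, cx = (height - 1) / 2, (width - 1) / 2
    max_d = math.hypot(cy, cx) or 1.0
    return math.hypot(row - cy, col - cx) / max_d


def select_positions(dists, masked, k, width=None, height=None, penalty=0.0):
    """Return the k masked positions with the lowest entropy-plus-penalty score."""
    def score(i):
        s = entropy(dists[i])
        if width is not None and height is not None:
            s += penalty * center_distance(i, width, height)
        return s
    return sorted(sorted(masked, key=score)[:k])


# Text-like case: no grid, so the two most confident positions are chosen.
text_dists = [[0.9, 0.1], [0.5, 0.5], [0.99, 0.01]]
text_pick = select_positions(text_dists, masked=[0, 1, 2], k=2)

# Image-like case: equal confidence everywhere on a 3x3 grid, so the
# position penalty alone decides and the center token (index 4) goes first.
image_dists = [[0.5, 0.5] for _ in range(9)]
image_pick = select_positions(image_dists, range(9), k=1, width=3, height=3, penalty=1.0)
```

In a full decoder this selection would be repeated each step, with the chosen positions filled in and the remaining masked positions re-scored.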

Empirical Evaluation and Numerical Results

Speech Tasks

Omni-Diffusion achieves a Word Error Rate (WER) of 7.05 on LibriSpeech (speech recognition) and 3.07 on LibriTTS (speech synthesis). It surpasses AnyGPT (WER 8.50) and closely approaches the specialized CosyVoice (WER 2.89), indicating that a single general-purpose model can remain competitive in both recognition and synthesis.
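For readers unfamiliar with the metric, WER is the word-level edit distance between a hypothesis and a reference transcript, divided by the number of reference words. The sketch below shows the standard computation on a made-up sentence pair; the reported scores are presumably percentages, and the paper's exact text normalization is not described here.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the"): 2 edits / 6 words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```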

Visual Tasks

On VQA benchmarks (POPE, MME, Seed-2-Plus) and text-to-image generation (MSCOCO), Omni-Diffusion achieves:

POPE: 76.6, MME-Perception: 1216.7, Seed-2-Plus: 34.5
CLIP-T: 0.235, CLIP-I: 0.667

These scores are comparable to those of specialized visual LLMs and diffusion-based generation models, particularly in multimodal settings.
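The summary does not define CLIP-T and CLIP-I. Under the common convention, assumed here, CLIP-T is the cosine similarity between the CLIP embedding of a generated image and that of its text prompt, and CLIP-I is the cosine similarity between the generated image and a reference image. The sketch below shows only the similarity step on toy vectors; real embeddings would come from a CLIP encoder, which is not included.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_similarity(pairs):
    """Average cosine similarity over (embedding, embedding) pairs."""
    return sum(cosine_similarity(a, b) for a, b in pairs) / len(pairs)


# Toy 2-d "embeddings" (illustrative only).
score = cosine_similarity([3.0, 4.0], [4.0, 3.0])  # 24 / 25
avg = mean_similarity([([1.0, 0.0], [1.0, 0.0]), ([1.0, 0.0], [0.0, 1.0])])
```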

Image generation conditioned on speech yields CLIP scores similar to those of text-conditioned generation, confirming robust cross-modal alignment. Generated samples from spoken interactions and image synthesis tasks visually validate semantic consistency and detail quality.

Figure 4: Omni-Diffusion samples for spoken interaction with visual content, showcasing robust multimodal reasoning.

Figure 5: Omni-Diffusion generated samples for text-to-image and speech-to-image tasks, demonstrating cross-modality fidelity.

Inpainting and Parallel Decoding

Without task-specific fine-tuning, Omni-Diffusion performs image inpainting simply by masking the unknown region, producing harmonious completions aligned to text prompts.
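Because the model is trained to fill in [MASK] tokens, setting up an inpainting query amounts to masking the image tokens that cover the unknown region. The sketch below assumes image tokens are stored in raster order on a rectangular grid; `MASK_ID`, the grid size, and the function name are hypothetical.

```python
MASK_ID = 0  # hypothetical [MASK] token id


def mask_region(image_tokens, grid_w, top, left, height, width):
    """Return a copy of raster-order image tokens with a rectangle replaced by MASK_ID."""
    grid_h = len(image_tokens) // grid_w
    if top < 0 or left < 0 or top + height > grid_h or left + width > grid_w:
        raise ValueError("region falls outside the token grid")
    out = list(image_tokens)
    for r in range(top, top + height):
        for c in range(left, left + width):
            out[r * grid_w + c] = MASK_ID
    return out


# 4x4 token grid with ids 1..16; mask the central 2x2 block.
tokens = list(range(1, 17))
query = mask_region(tokens, grid_w=4, top=1, left=1, height=2, width=2)
```

The masked sequence, together with the text prompt tokens, would then be passed to the usual decoding loop, which fills only the masked positions.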

Figure 6: Output samples from inpainting, highlighting the model's native mask-based generation capability.

Discrete diffusion models support parallel decoding: several tokens can be committed in a single inference step. Omni-Diffusion maintains strong generation quality (as measured by CLIP-T/CLIP-I and WER) even when the number of inference steps is reduced to as few as 10, substantially improving sampling efficiency over autoregressive baselines.
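A small arithmetic sketch makes the step count concrete. It assumes a uniform schedule that commits roughly the same number of tokens at every step, and a hypothetical 1024-token output; the actual schedule and sequence lengths used by Omni-Diffusion may differ.

```python
def unmask_schedule(num_tokens, num_steps):
    """Split num_tokens across num_steps as evenly as possible."""
    base, extra = divmod(num_tokens, num_steps)
    return [base + (1 if step < extra else 0) for step in range(num_steps)]


num_tokens = 1024  # hypothetical output length
schedule = unmask_schedule(num_tokens, num_steps=10)

# An autoregressive decoder needs one forward pass per generated token.
autoregressive_passes = num_tokens
diffusion_passes = len(schedule)
```

Here 10 passes replace 1024, although wall-clock gains depend on the cost of each pass, since every diffusion step processes the full sequence.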

Figure 7: Example images generated under varying time steps for the same prompt, illustrating sampling efficiency.

Qualitative Generation and Modality Consistency

Additional qualitative samples demonstrate vivid generation for both text-to-image and speech-to-image scenarios, indicating effective semantic transfer and modality fusion. The model reliably produces consistent outputs given equivalent text or speech prompts.

Figure 8: Text-to-image samples displaying high content alignment and detail.

Figure 9: Speech-conditioned image generation, retaining semantic coherence across modalities.

Theoretical and Practical Implications

Omni-Diffusion’s mask-based discrete diffusion paradigm offers:

Intrinsic multimodal semantic alignment by jointly modeling discrete token distributions, avoiding external output adapters.
Efficient token generation via parallel decoding, benefiting latency-constrained and real-time applications.
Generalization across comprehension and generation tasks, enabling seamless any-to-any input/output modality transfer.
Task-agnostic inpainting and editing, facilitating flexible downstream multimodal manipulation.

The approach demonstrates that discrete diffusion models can serve as competitive, scalable foundation models for multimodal intelligence, challenging the dominance of autoregressive architectures and providing a path towards universal, modality-agnostic AI systems.

Future Directions

This work suggests several promising avenues:

Extension to additional modalities (video, structured data, haptics) using generalized tokenization frameworks.
Algorithmic refinement for adaptive masking and sampling strategies, further reducing inference steps.
Investigation into large-scale in-context learning and integration with reinforcement learning from human feedback (RLHF) for robustness and controllability.
Exploration of real-time deployment and streaming scenarios enabled by parallel decoding.

Conclusion

Omni-Diffusion defines a unified framework for multimodal understanding and generation via mask-based discrete diffusion, directly modeling joint distributions over multimodal tokens. Results indicate performance competitive with or superior to autoregressive and modality-specialized baselines, efficient parallel decoding, and robust alignment across tasks and modalities. This diffusion-centric approach establishes a solid foundation for future multimodal AI development, from both architectural and practical perspectives (2603.06577).
