
MMaDA: Unified Multimodal Diffusion Model

Updated 29 October 2025
  • MMaDA is a unified multimodal diffusion language model that processes text and images through a shared token space to support integrated reasoning and generation tasks.
  • It utilizes a dense 8B-parameter transformer over discrete tokens, trained to reconstruct masked tokens via a diffusion-based objective.
  • The model employs mixed chain-of-thought fine-tuning and UniGRPO reinforcement learning, delivering near state-of-the-art performance on diverse benchmarks.

Multimodal Large Diffusion LLMs (MMaDA) represent a unified class of foundation models pioneering the use of discrete diffusion architectures for large-scale multimodal AI tasks. Designed to address the fragmentation of architectures between textual and vision-language applications, MMaDA provides a single, modality-agnostic backbone for language, vision, and compositional reasoning and generation. Key technical innovations include unified diffusion-based masked-token prediction, mixed chain-of-thought (CoT) fine-tuning, and a policy-gradient reinforcement learning algorithm (UniGRPO) adapted to non-autoregressive, parallel decoding. The result is a single model that achieves near state-of-the-art performance across textual reasoning, multimodal understanding, and text-to-image generation benchmarks, while establishing a practical framework for scaling such unified diffusion architectures.

1. Modality-Agnostic Diffusion Architecture

MMaDA adopts a single dense transformer with 8B parameters that processes text and images in a fully shared token space. Both text and images are tokenized as sequences of discrete codes:

  • Text tokens use the LLaDA tokenizer.
  • Images are quantized using a pretrained image tokenizer (from Show-o/MAGVIT-v2), which encodes a 512×512 image into 1024 discrete tokens from an 8192-entry codebook (spatial downsampling stride 16); see the sketch below.
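For concreteness, the following sketch (PyTorch) shows how a 512×512 image at stride 16 yields 32×32 = 1024 discrete codes that share one id space with text tokens. The vocabulary size, offset scheme, and function names are illustrative assumptions, not MMaDA's actual implementation.

```python
# Minimal sketch of the shared token space (layout and offsets are illustrative
# assumptions, not the exact MMaDA implementation).
import torch

TEXT_VOCAB_SIZE = 126_000          # assumed size of the text vocabulary
IMAGE_CODEBOOK_SIZE = 8_192        # image tokenizer codebook size (from the paper)
IMAGE_SIDE, STRIDE = 512, 16       # 512x512 image, spatial downsampling stride 16

def image_codes_to_unified_ids(image_codes: torch.Tensor) -> torch.Tensor:
    """Map image codebook indices into the shared id range after the text vocabulary."""
    return image_codes + TEXT_VOCAB_SIZE

def build_unified_sequence(text_ids: torch.Tensor, image_codes: torch.Tensor) -> torch.Tensor:
    """Concatenate text ids and offset image codes into one discrete sequence."""
    return torch.cat([text_ids, image_codes_to_unified_ids(image_codes)], dim=-1)

tokens_per_side = IMAGE_SIDE // STRIDE          # 512 / 16 = 32
num_image_tokens = tokens_per_side ** 2         # 32 * 32 = 1024 tokens per image

text_ids = torch.randint(0, TEXT_VOCAB_SIZE, (12,))              # stand-in caption
image_codes = torch.randint(0, IMAGE_CODEBOOK_SIZE, (num_image_tokens,))
sequence = build_unified_sequence(text_ids, image_codes)
print(sequence.shape)   # torch.Size([1036]) -- 12 text + 1024 image tokens
```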

The learning objective is to reconstruct randomly masked tokens (across either modality) as in discrete Masked Diffusion Models (MDM). At each denoising timestep $t$, a subset of tokens in $x_t$ are masked: $\mathcal{L}_{\text{unify}}(\theta) = - \mathbb{E}_{t, x_0, x_t} \left[ \frac{1}{t} \sum_{i=1}^L I[x_t^i = [\text{MASK}]] \log p_\theta(x_0^i \mid x_t) \right]$ where $x_0$ is the ground-truth token sequence and $x_t$ is the sequence with tokens masked according to the diffusion posterior.
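A minimal sketch of this objective, assuming a model that returns per-position logits over the shared vocabulary; the mask id, the uniform timestep sampling, and all names are illustrative assumptions.

```python
# Sketch of the unified masked-diffusion loss L_unify (illustrative, not the
# released training code).
import torch
import torch.nn.functional as F

MASK_ID = 0  # assumed id of the [MASK] token

def mask_tokens(x0: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """Independently mask each token with probability t (the diffusion timestep)."""
    mask = torch.rand_like(x0, dtype=torch.float) < t.unsqueeze(-1)
    return torch.where(mask, torch.full_like(x0, MASK_ID), x0)

def unified_diffusion_loss(model, x0: torch.Tensor) -> torch.Tensor:
    """Cross-entropy on masked positions only, weighted by 1/t."""
    batch = x0.size(0)
    t = torch.rand(batch).clamp(min=1e-3)           # t ~ U(0, 1]
    xt = mask_tokens(x0, t)
    logits = model(xt)                              # (batch, L, vocab)
    masked = (xt == MASK_ID)                        # indicator I[x_t^i = MASK]
    token_nll = F.cross_entropy(
        logits.transpose(1, 2), x0, reduction="none")   # (batch, L)
    per_seq = (token_nll * masked).sum(dim=1) / t   # 1/t weighting
    return per_seq.mean()
```

Because the sequence mixes text ids and offset image codes, the same loss covers text, VQA, and image-generation training examples.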

This modality-agnostic, unified objective enables direct joint training and inference for text, VQA, and image generation without the need for specialized branches or heads for each data type.

2. Mixed Long Chain-of-Thought (CoT) Fine-Tuning

To bridge complex multimodal reasoning (spanning text, vision, and generation), MMaDA employs a mixed long chain-of-thought (CoT) fine-tuning protocol. The central concept is to curate and align stepwise reasoning traces (“chains of thought”) across all training tasks using a unified instruction format:

<special_token> <reasoning_process> <special_token> <result>
The training corpus contains both language-only reasoning traces and comparable multimodal and text-to-image reasoning traces, filtered for logical rigor and diversity. During fine-tuning, masking is applied to both reasoning and result segments, forcing the model to reconstruct missing steps from the available context: $\mathcal{L}_{\text{Mixed-SFT}}(\theta) = -\mathbb{E}_{t, p_0, r_0, r_t} \left[ \frac{1}{t} \sum_{i=1}^{L'} I[r_t^i = [\text{MASK}]] \log p_\theta(r_0^i \mid p_0, r_t) \right]$ where $p_0$ is the (unmasked) prompt and $r_0$, $r_t$ are the clean and partially masked response (reasoning plus result) sequences. This strategy ensures consistent cross-modal alignment of reasoning abilities and supports stable RL cold starts.
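The sketch below illustrates this data preparation under stated assumptions: the exact special-token strings, the prompt/response split, and the masking policy are placeholders inferred from the description above, not the released training code.

```python
# Illustrative sketch of mixed long-CoT fine-tuning data preparation.
import torch

def format_cot(prompt: str, reasoning: str, result: str) -> str:
    """Unified instruction format: prompt, then reasoning and result delimited by special tokens."""
    return f"{prompt}<special_token>{reasoning}<special_token>{result}"

def mask_response(token_ids: torch.Tensor, prompt_len: int, t: float,
                  mask_id: int = 0) -> torch.Tensor:
    """Mask response tokens (reasoning + result) with probability t; keep the prompt p_0 intact."""
    masked = token_ids.clone()
    response = torch.arange(token_ids.numel()) >= prompt_len
    drop = (torch.rand(token_ids.numel()) < t) & response
    masked[drop] = mask_id
    return masked

example = format_cot(
    prompt="Q: If a train travels 60 km in 40 minutes, what is its speed in km/h?",
    reasoning="40 minutes is 2/3 of an hour, so speed = 60 / (2/3) = 90 km/h.",
    result="90 km/h",
)

ids = torch.randint(5, 1000, (64,))                   # stand-in tokenized sequence
partially_masked = mask_response(ids, prompt_len=20, t=0.6)
# In training, `example` is tokenized, the response segment is partially masked as
# above, and the model reconstructs the missing reasoning/result tokens conditioned
# on the unmasked prompt (the L_Mixed-SFT objective).
```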

3. UniGRPO: Unified Policy Gradient Reinforcement Learning

Traditional reinforcement learning algorithms for language models (e.g., PPO) assume an autoregressive factorization of sequence likelihoods, which is incompatible with the parallel decoding of diffusion models. MMaDA introduces UniGRPO, a unified policy-gradient RL algorithm that operates directly on the masked diffusion objective.

At every RL update:

  • For a question/answer pair, answer tokens are randomly masked, and the policy is evaluated only on masked positions.
  • The surrogate loss aggregates token log-likelihoods in masked regions, with group-normalized advantages and clipped importance weights (inspired by PPO/GRPO).
  • KL regularization penalizes deviations from a reference policy (pretrained or prior checkpoint).

The general UniGRPO objective is: $\mathcal{J}_{\text{UniGRPO}}(\theta) = \mathbb{E} \left[ \frac{1}{G} \sum_{i=1}^G \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \left( \min \big( r_{i,t}'(\theta) \hat{A}_{i,t}, \, \text{clip}(r_{i,t}'(\theta), 1-\varepsilon, 1+\varepsilon)\hat{A}_{i,t} \big) - \beta\, D_{\text{KL}}(\pi'_\theta \,\|\, \pi'_{\text{ref}}) \right) \right]$ with diversified reward modeling for each modality: correctness for reasoning, CLIP/ImageReward for image generation, and joint rewards for multimodal tasks.
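The following sketch shows how such a surrogate loss could be assembled once per-token log-probabilities on masked answer positions have been gathered. The KL estimator, hyperparameters, and tensor layout are illustrative assumptions rather than the exact UniGRPO implementation.

```python
# Schematic sketch of a UniGRPO-style surrogate loss (illustrative assumptions).
import torch

def unigrpo_loss(logp_new, logp_old, logp_ref, rewards, mask,
                 clip_eps: float = 0.2, beta: float = 0.01) -> torch.Tensor:
    """
    logp_new/logp_old/logp_ref: (G, L) per-token log-probs under the current,
    sampling, and reference policies; rewards: (G,) scalar reward per group member;
    mask: (G, L) 1 on masked answer positions scored by the policy.
    """
    # Group-normalized advantages (GRPO-style), broadcast to token level.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
    adv = adv.unsqueeze(-1)                                   # (G, 1)

    ratio = torch.exp(logp_new - logp_old)                    # importance weights
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    surrogate = torch.minimum(ratio * adv, clipped * adv)

    kl = logp_new - logp_ref                                  # crude per-token KL estimate
    per_token = surrogate - beta * kl

    per_seq = (per_token * mask).sum(dim=-1) / mask.sum(dim=-1).clamp(min=1)
    return -per_seq.mean()    # negate: gradient ascent on the UniGRPO objective
```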

This approach unifies RL objectives across all domains and enables efficient post-training alignment of capabilities.

4. Unified Training, Sampling, and Inference

The model undergoes multi-phase training:

  • Pretraining: Diffusion objective on large-scale web text and image datasets.
  • Mixed Long-CoT SFT: Instruction fine-tuning on reasoning, multimodal, and generative CoT datasets.
  • UniGRPO RL: Policy-gradient post-training using diversified rewards and random masking schedules.

For inference:

  • Text generation uses semi-autoregressive remasking: the sequence is partitioned into blocks generated left to right, and within each block tokens are denoised in parallel, with low-confidence predictions remasked at each step.
  • Image generation employs fully parallel blockwise masked denoising, optimal for high-throughput compositional synthesis.

Sampling efficiency emerges as a key benefit: strong outputs are achievable with substantially fewer decoding steps than continuous image diffusion models or autoregressive LLMs require.
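As a rough illustration of the confidence-based blockwise unmasking described above, the sketch below reveals the most confident masked tokens at each step; the block size, step budget, and scheduling are illustrative choices, not MMaDA's exact sampler.

```python
# Sketch of confidence-based blockwise denoising, assuming a model that returns
# per-position logits over the shared vocabulary (illustrative assumptions).
import torch

@torch.no_grad()
def denoise_block(model, seq, block_slice, num_steps: int, mask_id: int = 0):
    """Iteratively unmask the tokens in `block_slice`, keeping high-confidence
    predictions and leaving the rest masked for later steps."""
    block_len = block_slice.stop - block_slice.start
    for _ in range(num_steps):
        logits = model(seq.unsqueeze(0)).squeeze(0)          # (L, vocab)
        probs = logits[block_slice].softmax(-1)
        conf, pred = probs.max(-1)                           # per-token confidence
        still_masked = seq[block_slice] == mask_id
        # Reveal the k most confident currently-masked tokens this step.
        k = max(1, block_len // num_steps)
        conf = torch.where(still_masked, conf, torch.full_like(conf, -1.0))
        reveal = conf.topk(min(k, int(still_masked.sum()))).indices
        block = seq[block_slice].clone()
        block[reveal] = pred[reveal]
        seq[block_slice] = block
        if not (seq[block_slice] == mask_id).any():
            break
    return seq

# Text decoding applies this block by block (semi-autoregressive); image generation
# applies the same parallel unmasking over the whole 1024-token image block.
```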

5. Empirical Results and Benchmark Performance

MMaDA-8B achieves highly competitive results across textual, multimodal, and generative tasks:

| Task | Metric | MMaDA-8B | Competing SOTA (range) |
|---|---|---|---|
| POPE (VQA) | % | 86.1 | 80.0–85.9 |
| Flickr30K (captioning) | CIDEr | 67.6 | 52.3–62.5 |
| MMMU / MMB (understanding) | % | 30.2 / 68.5 | 26.7–35.6 / 60.6–64.3 |
| Text2Img (GenEval overall) | Score | 0.63 | 0.49–0.61 |
| Image generation | CLIP score | 32.46 | 23.15–32.12 |
| MMLU (textual reasoning) | % | 68.4 | 64.5–70.3 |

Ablation studies confirm that both the mixed CoT fine-tuning and UniGRPO RL post-training provide incremental improvements across all test domains. Compared to AR LLM baselines and previous diffusion-based LLMs (LLaDA-8B), MMaDA exhibits superior or comparable results in all metrics, while handling both reasoning and generation in a single architecture.

6. Architectural Limitations and Successors

Several limitations are intrinsic to the primary MMaDA design:

  • Task coverage is limited to image-level understanding and low-resolution (512×512) image generation; object-level grounding, high-resolution generation, image editing, and interleaved multimodal generation (e.g., reasoning with feedback loops) are not supported.
  • Parameter usage is inefficient: all 8B parameters are always loaded, with no explicit branching or dynamic freezing, increasing inference cost for lighter generation tasks.
  • Sampling employs basic parallel blockwise decoding with confidence-based or greedy token unmasking. This can cause clusters of neighboring tokens to be revealed together, reducing visual quality and fidelity (notably an FID of 32.85, substantially worse than later stratified-sampling approaches).
  • Training cost is high, as joint multimodal training does not exploit parameter reuse or modular branching.

Subsequent architectures such as Lavida-O address these constraints via Elastic Mixture-of-Transformers branching, token compression, stratified sampling, and support for a broader range of multimodal and interleaved tasks (Li et al., 23 Sep 2025).

7. Research Impact and Open Source Availability

MMaDA sets a new paradigm for unified diffusion-based multimodal foundation modeling, closing the gap between autoregressive and diffusion approaches for cross-domain reasoning and generation. It demonstrates that a single model can offer competitive performance on text, vision, and compositional tasks without sacrificing efficiency or extensibility.

The open-source release of code, pretrained models, and full evaluation pipelines at https://github.com/Gen-Verse/MMaDA provides an accessible foundation for future innovation in scaling, extending, or specializing diffusion-based multimodal architectures across diverse domains.
