MMaDA: Unified Multimodal Diffusion Model
- MMaDA is a unified multimodal diffusion language model that processes text and images through a shared token space to support integrated reasoning and generation tasks.
- It utilizes a dense 8B-parameter transformer with discrete tokenization, trained to reconstruct masked tokens via a diffusion-based objective.
- The model employs mixed chain-of-thought fine-tuning and UniGRPO reinforcement learning, delivering near state-of-the-art performance on diverse benchmarks.
Multimodal Large Diffusion LLMs (MMaDA) represent a unified class of foundation models pioneering the use of discrete diffusion architectures for large-scale multimodal AI tasks. Designed to address the fragmentation of architectures between textual and vision-language applications, MMaDA provides a single, modality-agnostic backbone for language, vision, and compositional reasoning and generation tasks. Key technical innovations include unified diffusion-based masked-token prediction, mixed chain-of-thought (CoT) fine-tuning, and a policy-gradient-based reinforcement learning algorithm (UniGRPO) adapted to non-autoregressive, parallel decoding. The result is a single model that achieves near state-of-the-art performance across textual reasoning, multimodal understanding, and text-to-image generation benchmarks, while establishing a practical framework for scaling such unified diffusion architectures.
1. Modality-Agnostic Diffusion Architecture
MMaDA adopts a single dense 8B-parameter transformer that processes text and images in a completely shared token space. Both text and images are tokenized as sequences of discrete codes (a brief sketch of the shared token space follows this list):
- Text tokens use the LLaDA tokenizer.
- Images are quantized using a pretrained image tokenizer (from Show-o/MAGVIT-v2), encoding a 512×512 image into 1024 discrete tokens drawn from an 8192-entry codebook (downsampling stride 16).
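As a purely illustrative sketch of this shared token space, the snippet below offsets image codes past the text vocabulary so that one embedding table and one output head cover both modalities. The text-vocabulary size and the helper name `to_unified_sequence` are assumptions for illustration; the actual LLaDA and MAGVIT-v2 tokenizers expose their own interfaces.

```python
from typing import List

TEXT_VOCAB_SIZE = 126_000       # assumed LLaDA text-vocabulary size (placeholder)
IMAGE_CODEBOOK_SIZE = 8192      # image codebook size described above
IMAGE_TOKENS_PER_IMAGE = 1024   # 32 x 32 grid at stride 16

def to_unified_sequence(text_ids: List[int], image_codes: List[int]) -> List[int]:
    """Map text ids and image codes into one flat, shared vocabulary.

    Image codes are offset past the text vocabulary so a single embedding
    table and a single output head can serve both modalities.
    """
    assert len(image_codes) == IMAGE_TOKENS_PER_IMAGE
    assert all(0 <= c < IMAGE_CODEBOOK_SIZE for c in image_codes)
    return text_ids + [TEXT_VOCAB_SIZE + c for c in image_codes]
```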
The learning objective is to reconstruct randomly masked tokens (across either modality), as in discrete Masked Diffusion Models (MDM). At each denoising timestep $t$, a subset of tokens in $x_0$ is masked to produce $x_t$:

$$\mathcal{L}_{\text{unify}}(\theta) = - \mathbb{E}_{t,\, x_0,\, x_t} \left[ \frac{1}{t} \sum_{i=1}^{L} \mathbb{I}\big[x_t^i = [\text{MASK}]\big] \log p_\theta(x_0^i \mid x_t) \right]$$

where $x_0$ is the ground-truth token sequence and $x_t$ is the sequence with tokens masked according to a diffusion posterior.
This modality-agnostic, unified objective enables direct joint training and inference for text, VQA, and image generation without the need for specialized branches or heads for each data type.
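As a rough illustration of the objective above, the PyTorch sketch below computes the masked-token cross-entropy weighted by $1/t$. It assumes a model that returns per-position logits over the shared vocabulary and a placeholder `MASK_ID`; it is a simplified reading of $\mathcal{L}_{\text{unify}}$, not MMaDA's training code.

```python
import torch
import torch.nn.functional as F

MASK_ID = 0  # assumed id of the [MASK] token (placeholder)

def unified_masked_diffusion_loss(model, x0, t):
    """Sketch of L_unify under the masking scheme described above.

    x0: (B, L) ground-truth token ids across modalities.
    t:  (B,) masking ratios sampled uniformly in (0, 1].
    `model(x_t)` is assumed to return logits of shape (B, L, V).
    """
    B, L = x0.shape
    # Forward process: mask each token independently with probability t.
    mask = torch.rand(B, L, device=x0.device) < t.unsqueeze(1)
    x_t = torch.where(mask, torch.full_like(x0, MASK_ID), x0)

    logits = model(x_t)                                          # (B, L, V)
    logp = F.log_softmax(logits, dim=-1)
    token_logp = logp.gather(-1, x0.unsqueeze(-1)).squeeze(-1)   # (B, L)

    # Sum log-likelihoods over masked positions only, weighted by 1/t.
    per_seq = (token_logp * mask.float()).sum(dim=1) / t
    return -per_seq.mean()
```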
2. Mixed Long Chain-of-Thought (CoT) Fine-Tuning
To bridge complex multimodal reasoning (spanning text, vision, and generation), MMaDA employs a mixed long chain-of-thought (CoT) fine-tuning protocol. The central concept is to curate and align stepwise reasoning traces (“chains of thought”) across all training tasks using a unified instruction format:
`<special_token> <reasoning_process> <special_token> <result>`
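As a purely illustrative sketch, a helper like the hypothetical `format_cot_sample` below wraps any task's reasoning trace and final answer in this shared template; the exact delimiter tokens used by MMaDA may differ.

```python
def format_cot_sample(reasoning_process: str, result: str,
                      special_token: str = "<special_token>") -> str:
    """Wrap a task's reasoning trace and final answer in the shared template."""
    return f"{special_token} {reasoning_process} {special_token} {result}"

# The same template serves textual reasoning, VQA, and text-to-image prompting,
# so every task presents an identically structured target to the diffusion objective.
example = format_cot_sample(
    reasoning_process="The question asks for 3 * (2 + 5); 2 + 5 = 7, and 3 * 7 = 21.",
    result="21",
)
```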
3. UniGRPO: Unified Policy Gradient Reinforcement Learning
Traditional reinforcement learning algorithms for language models (e.g., PPO) assume an autoregressive factorization of sequence likelihoods, which is incompatible with the parallel decoding of diffusion models. MMaDA introduces UniGRPO, a unified policy-gradient RL algorithm that operates over masked diffusion objectives.
At every RL update:
- For a question/answer pair, answer tokens are randomly masked, and the policy is evaluated only on masked positions.
- The surrogate loss aggregates token log-likelihoods in masked regions, with group-normalized advantages and clipped importance weights (inspired by PPO/GRPO).
- KL regularization penalizes deviations from a reference policy (pretrained or prior checkpoint).
The general UniGRPO objective is:

$$\mathcal{J}_{\text{UniGRPO}}(\theta) = \mathbb{E} \left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \left( \min \big( r_{i,t}'(\theta)\, \hat{A}_{i,t},\ \text{clip}\big(r_{i,t}'(\theta),\, 1-\varepsilon,\, 1+\varepsilon\big)\, \hat{A}_{i,t} \big) - \beta\, D_{\text{KL}}\big(\pi'_\theta \,\|\, \pi'_{\text{ref}}\big) \right) \right]$$

with diversified reward modeling for each modality: correctness rewards for reasoning, CLIP/ImageReward for image generation, and joint rewards for multimodal tasks.
This approach unifies RL objectives across all domains and enables efficient post-training alignment of capabilities.
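The sketch below illustrates how such a clipped surrogate could be computed over masked answer tokens, assuming precomputed per-token log-likelihoods and group-normalized advantages. The `eps` and `beta` values are placeholders, and the per-token KL term is a simple estimate; this is not MMaDA's exact implementation.

```python
import torch

def unigrpo_loss(logp_new, logp_old, logp_ref, advantages, mask,
                 eps=0.2, beta=0.01):
    """Sketch of the UniGRPO surrogate on masked answer tokens.

    logp_new / logp_old / logp_ref: (G, T) per-token log-likelihoods under the
    current, sampling-time, and reference policies.
    advantages: (G,) group-normalized rewards, broadcast over tokens.
    mask: (G, T) float tensor, 1.0 where the answer token was masked this update.
    eps, beta: clip range and KL weight (placeholder values).
    """
    ratio = torch.exp(logp_new - logp_old)                      # r'_{i,t}
    adv = advantages.unsqueeze(1)                               # (G, 1)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * adv
    surrogate = torch.minimum(unclipped, clipped)

    # Simple per-token estimate of the KL penalty against the reference policy.
    kl = logp_new - logp_ref

    per_token = surrogate - beta * kl
    per_seq = (per_token * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
    return -per_seq.mean()   # maximizing J == minimizing -J
```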
4. Unified Training, Sampling, and Inference
The model undergoes multi-phase training:
- Pretraining: Diffusion objective on large-scale web text and image datasets.
- Mixed Long-CoT SFT: Instruction fine-tuning on reasoning, multimodal, and generative CoT datasets.
- UniGRPO RL: Policy-gradient post-training using diversified rewards and random masking schedules.
For inference:
- Text generation uses semi-autoregressive remasking: sequences are partitioned into blocks generated left to right, and within each block tokens are denoised in parallel, with low-confidence predictions re-masked at each step (see the sketch below).
- Image generation employs fully parallel blockwise masked denoising, well suited to high-throughput compositional synthesis.

Sampling efficiency emerges as a key benefit: strong outputs are achievable with substantially fewer sampling steps than continuous image diffusion models or autoregressive LLMs require.
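A minimal sketch of confidence-based blockwise remasking follows, assuming a model that returns per-position logits. The linear unmasking schedule and the helper name `decode_block` are illustrative rather than MMaDA's exact sampler; a caller would invoke it on successive blocks left to right for text, or on the whole image token grid for generation.

```python
import torch

@torch.no_grad()
def decode_block(model, seq, block_slice, mask_id, steps=8):
    """Confidence-based remasking for one block of a (1, L) token sequence.

    Positions in `block_slice` start as [MASK]; at each step the model predicts
    every position, the most confident masked positions are committed, and the
    rest stay masked for the next step.
    """
    block_len = block_slice.stop - block_slice.start
    for step in range(steps):
        logits = model(seq)                                   # (1, L, V), assumed
        probs = logits[:, block_slice].softmax(dim=-1)
        conf, pred = probs.max(dim=-1)                        # both (1, block_len)

        still_masked = seq[:, block_slice] == mask_id
        # Exclude already-committed positions from selection.
        conf = torch.where(still_masked, conf, torch.full_like(conf, -1.0))

        # Commit a growing fraction of the block (linear schedule here).
        target = max(1, round(block_len * (step + 1) / steps))
        k = target - int((~still_masked).sum())
        if k > 0:
            top = conf.topk(k, dim=-1).indices                # (1, k)
            block = seq[:, block_slice].clone()
            block.scatter_(1, top, pred.gather(1, top))
            seq[:, block_slice] = block
    return seq
```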
5. Empirical Results and Benchmark Performance
MMaDA-8B achieves highly competitive results across textual, multimodal, and generative tasks:
| Task | Metric | MMaDA-8B | Competing SOTA (range) |
|---|---|---|---|
| POPE (VQA) | (%) | 86.1 | 80.0–85.9 |
| Flickr30K (captioning) | (CIDEr) | 67.6 | 52.3–62.5 |
| MMMU / MMB (understanding) | (%) | 30.2 / 68.5 | 26.7–35.6 / 60.6–64.3 |
| Text2Img (GenEval overall) | (score) | 0.63 | 0.49–0.61 |
| CLIP Score (image gen) | (score) | 32.46 | 23.15–32.12 |
| MMLU (textual reasoning) | (%) | 68.4 | 64.5–70.3 |
Ablation studies confirm that both the mixed CoT fine-tuning and UniGRPO RL post-training provide incremental improvements across all test domains. Compared to AR LLM baselines and prior diffusion-based LLMs such as LLaDA-8B, MMaDA exhibits superior or comparable results across all reported metrics while handling both reasoning and generation in a single architecture.
6. Architectural Limitations and Successors
Several limitations are intrinsic to the primary MMaDA design:
- Task coverage is limited to image-level understanding and low-resolution (512×512) image generation; object-level grounding, high-resolution generation, image editing, and interleaved multimodal generation (e.g., reasoning with feedback loops) are not supported.
- Parameter usage is inefficient: all 8B parameters are always loaded, with no explicit branching or dynamic freezing, increasing inference cost for lighter generation tasks.
- Sampling employs basic parallel blockwise decoding with confidence-based or greedy token unmasking. This can cause clusters of neighboring tokens to be revealed together, reducing visual quality and fidelity (notably, an FID of 32.85, markedly worse than results reported with later stratified-sampling schemes).
- Training cost is high, as joint multimodal training does not exploit parameter reuse or modular branching.
Subsequent architectures such as Lavida-O address these constraints via Elastic Mixture-of-Transformers branching, token compression, stratified sampling, and support for a broader range of multimodal and interleaved tasks (Li et al., 23 Sep 2025).
7. Research Impact and Open Source Availability
MMaDA sets a new paradigm for unified diffusion-based multimodal foundation modeling, closing the gap between autoregressive and diffusion approaches for cross-domain reasoning and generation. It demonstrates that a single model can offer competitive performance on text, vision, and compositional tasks without sacrificing efficiency or extensibility.
The open-source release of code, pretrained models, and full evaluation pipelines at https://github.com/Gen-Verse/MMaDA provides an accessible foundation for future innovation in scaling, extending, or specializing diffusion-based multimodal architectures across diverse domains.