Masked Diffusion Modeling (MDM)
- Masked Diffusion Modeling (MDM) is a discrete diffusion framework that uses an absorbing-mask process to progressively replace tokens with a mask symbol and iteratively denoise them under bidirectional context.
- It supports parallel token recovery and non-autoregressive generation with various scheduling strategies, effectively handling modalities like text, images, and molecules.
- Recent advances address optimization challenges through learned masking schedules, variance reduction techniques, and extended state representations, enhancing training efficiency and output quality.
Masked diffusion modeling (MDM) is a discrete diffusion framework in which a forward absorbing-mask process progressively replaces tokens with a special symbol, while generation starts from a fully masked sequence and iteratively denoises it back into data. In its now-standard form, MDM operates on discrete sequences rather than continuous latents, predicts masked positions under bidirectional context, and supports parallel token updates, making it a non-autoregressive alternative to left-to-right factorization for text, images, molecules, genomics, and speech (Cardei et al., 28 Apr 2026, Nie et al., 2024).
1. Formal definition and core probabilistic structure
The canonical MDM forward process is an absorbing masking process over a discrete vocabulary augmented with a mask token $\mathbf{m}$. For a token $x_0$ (with $x_0$ and $\mathbf{m}$ read as one-hot vectors), one common formulation is
$$q(x_t \mid x_0) = \mathrm{Cat}\big(x_t;\ \alpha_t\, x_0 + (1-\alpha_t)\,\mathbf{m}\big),$$
with $\alpha_t$ decreasing from approximately $1$ to $0$; in several text formulations, $\alpha_t = 1 - t$ is used (Nie et al., 2024, Chao et al., 24 May 2025). Once a token is masked, it remains masked for the rest of the forward trajectory, so the latent state is partially observed rather than continuously perturbed.
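As a minimal sketch of the forward corruption, assuming the linear schedule $\alpha_t = 1 - t$ and independent per-token masking, the marginal sample at time $t$ can be drawn as follows. The names `MASK`, `alpha`, and `forward_mask` are illustrative and do not come from any cited implementation.

```python
import random

MASK = "[MASK]"  # hypothetical mask symbol


def alpha(t: float) -> float:
    """Linear schedule: probability that a token is still visible at time t."""
    return 1.0 - t


def forward_mask(tokens, t, rng):
    """Draw x_t given x_0: each token is kept with prob alpha(t), else masked."""
    keep_prob = alpha(t)
    return [tok if rng.random() < keep_prob else MASK for tok in tokens]


rng = random.Random(0)
x0 = ["the", "cat", "sat", "on", "the", "mat"]
for t in (0.0, 0.5, 1.0):
    print(t, forward_mask(x0, t, rng))
```

This draws each time point independently from the marginal; simulating a whole trajectory would additionally require that a position masked at one time stays masked at all later times.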
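Generation reverses this corruption. Starting from the fully masked sequence, the denoiser predicts clean-token distributions for masked positions, typically factorized positionwise as $p_\theta(x_0 \mid x_t) = \prod_i p_\theta(x_0^i \mid x_t)$, while already unmasked positions are carried forward unchanged (Hong et al., 13 May 2026). A representative continuous-time training objective is
$$\mathcal{L}(\theta) = \mathbb{E}_{t,\,x_0,\,x_t}\left[\frac{\alpha_t'}{1-\alpha_t} \sum_{i:\, x_t^i = \mathbf{m}} \log p_\theta(x_0^i \mid x_t)\right],$$
where $\alpha_t$ is the probability that a token is still unmasked at time $t$, $\alpha_t' = d\alpha_t/dt \le 0$, and $\mathbf{m}$ is the mask token. This is a denoising cross-entropy over masked positions with the scheduler-dependent weight $-\alpha_t'/(1-\alpha_t)$, which reduces to $1/t$ for $\alpha_t = 1 - t$ (Hong et al., 13 May 2026).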
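The weighted cross-entropy can be illustrated for a single sample. The sketch below assumes the linear schedule $\alpha_t = 1 - t$, for which the scheduler-dependent weight is $1/t$, and uses hand-written probability tables in place of a trained denoiser; `mdm_loss` and `probs` are hypothetical names.

```python
import math

MASK = "[MASK]"  # hypothetical mask symbol


def mdm_loss(x0, xt, probs, t):
    """Weighted denoising cross-entropy for one (x0, xt) pair.

    probs[i] maps candidate tokens to the predicted probability at position i.
    Assumes alpha_t = 1 - t, so the weight -alpha'_t / (1 - alpha_t) is 1 / t.
    """
    weight = 1.0 / t
    nll = 0.0
    for i, (clean, noisy) in enumerate(zip(x0, xt)):
        if noisy == MASK:  # only masked positions contribute
            nll -= math.log(probs[i][clean])
    return weight * nll


x0 = ["a", "b", "c"]
xt = ["a", MASK, MASK]
probs = [None, {"b": 0.5, "c": 0.5}, {"b": 0.25, "c": 0.75}]
loss = mdm_loss(x0, xt, probs, t=0.5)
print(loss)
```

In training, this quantity would be averaged over sampled times, data, and mask patterns.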
Two structural properties follow from this construction. First, MDMs expose explicit masked versus visible states, which makes the reverse process a mask-and-fill procedure rather than a continuous denoising trajectory (Yoo et al., 11 May 2026). Second, because all currently masked positions can be predicted simultaneously under bidirectional context, MDMs admit parallel token recovery rather than strictly serial next-token decoding (Nie et al., 2024). This parallelism is one of the framework’s defining attractions, but it also introduces factorization and scheduling issues that recur across later variants.
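The mask-and-fill reverse process can be sketched as an ancestral sampler. This assumes the linear schedule $\alpha_t = 1 - t$, under which a masked position is revealed with probability $(t - s)/t$ when moving from time $t$ to $s < t$. The `toy_denoiser` is a hypothetical stand-in for a trained network and returns uniform predictions.

```python
import random

MASK = "[MASK]"          # hypothetical mask symbol
VOCAB = ["a", "b", "c"]  # toy vocabulary


def toy_denoiser(xt):
    """Stand-in for a trained model: uniform prediction at each masked position."""
    return {i: {v: 1.0 / len(VOCAB) for v in VOCAB}
            for i, tok in enumerate(xt) if tok == MASK}


def sample(length, num_steps, rng, denoiser=toy_denoiser):
    """Start fully masked and reveal tokens in parallel over num_steps steps."""
    xt = [MASK] * length
    for k in range(num_steps, 0, -1):
        t, s = k / num_steps, (k - 1) / num_steps
        preds = denoiser(xt)  # one call predicts every masked position
        for i, dist in preds.items():
            if rng.random() < (t - s) / t:  # reveal probability under alpha_t = 1 - t
                toks, weights = zip(*dist.items())
                xt[i] = rng.choices(toks, weights=weights)[0]
    return xt


print(sample(8, 4, random.Random(0)))
```

Positions revealed in the same step are sampled independently given the current state, which is the source of the factorization issue mentioned above.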
2. Competing theoretical interpretations
A major theoretical line argues that standard MDMs are analytically much closer to masked models than to conventional time-dependent diffusion. Under the absorbing-mask formulation, both training and sampling can be reorganized around the masked pattern rather than the diffusion time, and in the time-agnostic setting the optimal predictor depends only on the masked sequence. On this view, MDMs are “secretly time-agnostic masked models,” and the paper derives a first-hitting sampler that is theoretically equivalent to the original continuous-time generation process while reporting up to a $20\times$ speedup (Zheng et al., 2024). The same work also identifies a numerical issue in low-precision Gumbel-max categorical sampling, arguing that float32 truncation lowers effective temperature and can make generative perplexity comparisons with autoregressive models unfair (Zheng et al., 2024).
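The idea behind a first-hitting sampler can be illustrated under simplifying assumptions. With independent per-token masking and the linear schedule $\alpha_t = 1 - t$, a token masked at time $t$ is still masked at $s < t$ with probability $s/t$, so with $n$ masked tokens the next unmasking time is $t \cdot u^{1/n}$ for uniform $u$. The sketch below follows from that derivation and is not claimed to match the cited paper's exact algorithm.

```python
import random


def first_hitting_times(num_masked, rng, t_start=1.0):
    """Times at which successive tokens are revealed, assuming alpha_t = 1 - t.

    With n tokens masked at time t, P(no reveal before s) = (s / t) ** n,
    so the next reveal time is t * u ** (1 / n) with u ~ Uniform(0, 1].
    """
    times, t = [], t_start
    for n in range(num_masked, 0, -1):
        u = 1.0 - rng.random()  # lies in (0, 1], avoids u == 0
        t = t * u ** (1.0 / n)
        times.append(t)
    return times


times = first_hitting_times(16, random.Random(0))
print(times[:4])
```

Jumping directly between reveal events means one denoiser call per revealed token, with no steps spent on intervals where nothing changes.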
A second line retains the diffusion formalism but reinterprets MDMs as autoregressive models over latent orders. With multivariate noise schedules, the continuous-time objective decomposes into weighted autoregressive losses over decoding permutations, so masking schedules induce a non-uniform distribution over orders. In that setting, MDMs are characterized as learned-order autoregressive models rather than schedule-invariant any-order denoisers (Garg et al., 24 Nov 2025). A related framework, order-expressive MDM (OeMDM), makes the scheduler explicit enough to subsume standard MDM, block diffusion, and autoregressive generation in a single formalism, while learnable-order MDM (LoMDM) jointly learns the backbone and generation order from scratch through one objective (Hong et al., 2 Feb 2026).
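How position-dependent schedules induce a non-uniform distribution over decoding orders can be seen in a small simulation. The sketch assumes hypothetical per-position schedules $1 - \alpha_t^i = t^{w_i}$; this parametric family is chosen for illustration only and is not taken from the cited papers.

```python
import random


def sample_order(exponents, rng):
    """Sample a decoding order under per-position schedules 1 - alpha_t^i = t ** w_i.

    Position i is masked at forward time T_i = U ** (1 / w_i); the reverse process
    runs from t = 1 to t = 0 and therefore reveals positions in decreasing T_i.
    """
    times = [(1.0 - rng.random()) ** (1.0 / w) for w in exponents]
    return sorted(range(len(exponents)), key=lambda i: -times[i])


def first_position_freq(exponents, trials, rng):
    """Empirical frequency with which each position is revealed first."""
    counts = [0] * len(exponents)
    for _ in range(trials):
        counts[sample_order(exponents, rng)[0]] += 1
    return [c / trials for c in counts]


print(first_position_freq([1.0, 1.0, 1.0], 5000, random.Random(0)))  # close to uniform
print(first_position_freq([1.0, 1.0, 8.0], 5000, random.Random(0)))  # skewed toward position 2
```

With equal exponents every order is equally likely, which corresponds to the schedule-invariant any-order case; unequal exponents bias which positions are decoded early.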
A third interpretation places MDM sampling inside discrete optimal transport. In this view, marginal kinetic energy, conditional kinetic energy, and geodesic energy are equivalent under MDM structure, and an optimality condition on the schedule follows.
This produces a closed-form coupling between the mask schedule and a geometric interpolation schedule, and motivates post-training schedule tuning through a two-parameter Beta-CDF family (Chen et al., 17 Sep 2025). The practical claim is narrower than the time-agnostic critique: schedule design can matter substantially for low-step sampling even if the underlying denoiser is unchanged (Chen et al., 17 Sep 2025).
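A two-parameter Beta-CDF schedule can be sketched as follows. The sketch assumes that the fraction of tokens revealed after a fraction $u$ of the sampling steps equals the Beta$(a, b)$ CDF at $u$; this mapping is an illustrative assumption rather than the cited paper's definition. The CDF is computed by midpoint integration, which is adequate for $a, b \ge 1$.

```python
def beta_cdf_grid(a, b, num_steps, cells_per_step=500):
    """Beta(a, b) CDF at k / num_steps for k = 0..num_steps (midpoint rule)."""
    n = num_steps * cells_per_step
    dens = []
    for j in range(n):
        x = (j + 0.5) / n
        dens.append(x ** (a - 1.0) * (1.0 - x) ** (b - 1.0))
    total = sum(dens)
    grid, acc = [0.0], 0.0
    for k in range(num_steps):
        acc += sum(dens[k * cells_per_step:(k + 1) * cells_per_step])
        grid.append(acc / total)
    return grid


def unmask_counts(length, num_steps, a, b):
    """Tokens to reveal at each step so the revealed fraction tracks the Beta CDF."""
    grid = beta_cdf_grid(a, b, num_steps)
    targets = [round(length * f) for f in grid]
    return [targets[k + 1] - targets[k] for k in range(num_steps)]


print(unmask_counts(64, 8, 1.0, 1.0))  # a = b = 1 gives a uniform schedule
print(unmask_counts(64, 8, 2.0, 5.0))  # front-loaded reveals
```

Because only the per-step counts change, such a schedule can be tuned after training without touching the denoiser.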
These perspectives are not identical. One body of work emphasizes schedule invariance in standard univariate, time-agnostic formulations, while another shows that multivariate or expressive schedulers break that invariance and encode order directly (Zheng et al., 2024, Garg et al., 24 Nov 2025). This suggests that the theoretical status of “diffusion time” in MDM depends strongly on the scheduler class and parameterization under consideration.
3. Optimization, scaling, and training pathologies
The first text-scaling study of MDMs reports a scaling law comparable to autoregressive models (ARMs), but with a substantial constant compute gap. Under IsoFLOP analysis, the validation loss follows a power law with a scaling rate comparable to ARMs, while requiring roughly $16\times$ more pre-training compute to reach similar loss. The same study trains text MDMs up to 1.1B parameters, introduces unsupervised classifier-free guidance for conditional inference, and reports that with 16 times more pre-training time MDMs can match ARMs in performance while being 1.4 times faster during sampling under an accelerated inference setting (Nie et al., 2024).
A distinct training-efficiency analysis attributes slow MDM optimization primarily to language locality bias rather than to any-order generation itself. The argument is that low-context masked examples rapidly saturate and become wasted compute, while high-context examples contain many redundant visible tokens because predictive information is concentrated locally. The proposed remedy is bell-shaped time sampling, especially a Gaussian distribution over the diffusion time; the resulting training recipe reaches the same validation NLL in less training on LM1B, both with sentence packing and without it, and on OpenWebText (Hong et al., 13 May 2026).
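Bell-shaped time sampling can be sketched as drawing the training time from a truncated Gaussian instead of a uniform distribution. The `mean`, `std`, and truncation bounds below are placeholder values chosen for illustration, not the settings reported in the cited work.

```python
import random


def sample_time_gaussian(rng, mean=0.5, std=0.15, lo=1e-3, hi=1.0):
    """Draw t from a Gaussian truncated to [lo, hi] by rejection sampling.

    mean and std are hypothetical defaults; they control which masking
    levels receive most of the training signal.
    """
    while True:
        t = rng.gauss(mean, std)
        if lo <= t <= hi:
            return t


rng = random.Random(0)
samples = [sample_time_gaussian(rng) for _ in range(2000)]
print(min(samples), sum(samples) / len(samples), max(samples))
```

Compared with uniform sampling, this concentrates training examples on intermediate masking levels and spends fewer on the nearly empty and nearly complete extremes.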
Training instability has also been analyzed through variance decomposition. One study shows that MDM training variance splits into masking pattern noise, masking rate noise, and data noise, whereas ARMs only incur the data term. On that basis it proposes six variance-reduction methods, with P-POTS as a Pareto-optimal timestep sampler and MIRROR as a negatively correlated masking scheme. Across OpenScience, GSM8K, and HiTab, P-POTS+MIRROR improves mean accuracy on each benchmark, while also reducing run-to-run variability to near ARM levels (Jia et al., 22 Nov 2025).
Taken together, these results locate the main optimization bottlenecks in timestep allocation, masking-induced variance, and the compute cost of repeated denoising. They do not imply a single failure mode. Rather, they show that MDM training quality depends on how probability mass is distributed over corruption levels and mask patterns, not only on model size.
4. Sampling order, inference policies, and test-time control
Generation quality in MDMs is highly sensitive to the order in which masked positions are revealed. One approach treats unmasking as a KL-regularized Markov decision process with an explicit heuristic reference policy such as max-confidence. The learned scheduler controls which position is revealed next while the frozen MDM supplies the token distribution. On LLaDA-8B-Instruct, this learned policy improves over heuristic baselines across Sudoku, Zebra, GSM8K, and Math500; on Sudoku in particular it outperforms both max-confidence and random unmasking (Hong et al., 7 Oct 2025).
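The max-confidence heuristic used as a reference policy can be sketched in a few lines. The greedy variant below reveals the one masked position whose top predicted probability is highest and commits to its most likely token; `preds` is a hypothetical stand-in for the denoiser output.

```python
MASK = "[MASK]"  # hypothetical mask symbol


def max_confidence_step(xt, preds):
    """Reveal the masked position with the highest top-1 probability.

    preds maps each masked position to a {token: probability} table.
    """
    best_pos = max(preds, key=lambda i: max(preds[i].values()))
    best_tok = max(preds[best_pos], key=preds[best_pos].get)
    out = list(xt)
    out[best_pos] = best_tok
    return out


xt = [MASK, "b", MASK]
preds = {0: {"a": 0.6, "c": 0.4}, 2: {"a": 0.1, "c": 0.9}}
print(max_confidence_step(xt, preds))
```

A learned scheduler replaces the `max` over confidences with a trained policy over positions while leaving the token distribution to the frozen model.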
A complementary mechanistic analysis studies MDM samplers on random walks over graphs, where validity can be checked exactly. That work proves that lowest-entropy parallel unmasking is not uniformly better than random parallel sampling; the relative ranking depends on graph structure. It also introduces a bisection sampler that is exact under perfect training, comes with a parallel-depth bound for walks of a given order, and shows improved speed-quality tradeoffs in preliminary OpenWebText experiments (Bansal et al., 22 Jun 2026). The core claim is not merely that some sampler is better, but that parallel updates are governed by conditional-independence structure rather than local uncertainty alone.
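Lowest-entropy parallel unmasking, the heuristic analyzed above, selects the $k$ masked positions whose predictive distributions have the smallest entropy. A minimal sketch, with `preds` again standing in for denoiser output:

```python
import math


def entropy(dist):
    """Shannon entropy (in nats) of a {token: probability} table."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)


def lowest_entropy_positions(preds, k):
    """Return the k masked positions with the lowest predictive entropy."""
    return sorted(preds, key=lambda i: entropy(preds[i]))[:k]


preds = {
    0: {"a": 0.5, "b": 0.5},
    1: {"a": 0.99, "b": 0.01},
    2: {"a": 0.8, "b": 0.2},
}
print(lowest_entropy_positions(preds, 2))
```

The selection uses only per-position marginals, so it cannot detect whether the chosen positions are conditionally independent of each other, which is the structural property the analysis identifies as decisive.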
Speculative and backtracking samplers extend test-time control further. DualDiffusion combines a fast approximate drafter with a slower verifier, amortizing verifier cost over multiple draft steps. On MMLU, it preserves near-verifier accuracy at substantially lower latency than full LLaDA verification; on GSM8K, the speedup remains but accuracy degradation is much larger, indicating that the verification rule is not yet sufficient for strongly sequential reasoning (Goyal et al., 6 Apr 2026). MDM-VGB instead turns inference into a reward-guided walk on a masked-state graph in which arbitrary positions may be unmasked or remasked. The method is proved robust to verifier noise, achieves quadratic complexity, and reports strong gains on structured benchmarks such as Sudoku and QM9, especially in its momentum variant (Jeon et al., 26 Jun 2026).
These results collectively establish that sampling policy is a first-class component of MDM performance. The denoiser alone does not determine generation quality; order selection, remasking, speculative verification, and separator-based parallelization can all change the effective inference regime.
5. Structural variants and extensions of the masking state space
A central criticism of vanilla absorbing-mask decoding is that intermediate clean-state predictions are discarded. Self-Conditioned Masked Diffusion Models (SCMDM) address this by conditioning each denoising step on the previous clean-state estimate. SCMDM is explicitly presented as a post-training retrofit rather than a new model class: it uses a two-pass training approximation, requires minimal architectural change, introduces no recurrent latent-state pathway, uses no auxiliary reference model, and adds no extra denoiser evaluations at sampling time. It reduces GPT-2-large generative perplexity for OpenWebText-trained models, improves CIFAR-10 FID, and also improves small-molecule generation and genomic distribution fidelity (Cardei et al., 28 Apr 2026).
Other variants enrich the latent state itself. MDM-Prime replaces binary masked/unmasked token states with partial masking over sub-tokens, reducing redundant idle steps: in the reported OpenWebText setting, many standard MDM sampling steps leave the sequence unchanged, whereas Prime introduces intermediate, partially masked token states. It reports OpenWebText perplexity $15.36$ and CIFAR-10 FID $3.26$ (Chao et al., 24 May 2025). Infinite Mask Diffusion Model (IMDM) replaces the single deterministic mask token with a stochastic infinite-state mask, motivated by a theoretical lower bound on factorization error in few-step generation. In a synthetic task, IMDM reaches higher validity than standard MDM, and with appropriate distillation it improves few-step results on LM1B and OpenWebText (Yoo et al., 11 May 2026). Di[M]O pushes this logic to one-step generation by distilling an MDM teacher into a single-step student through on-policy token-level distribution matching and entropy-aware token initialization, reporting one-step FID on ImageNet-256 and one-step HPSv2 for text-to-image generation (Zhu et al., 19 Mar 2025).
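Partial masking over sub-tokens can be illustrated with a base-$b$ digit encoding, in which each token id is written as several digits that can be masked independently. The encoding below is an assumed example of an invertible sub-token mapping and is not claimed to be the one used by MDM-Prime.

```python
MASK = None  # hypothetical marker for a masked sub-token


def to_subtokens(token_id, base, length):
    """Write token_id as `length` base-`base` digits, most significant first."""
    digits = []
    for _ in range(length):
        digits.append(token_id % base)
        token_id //= base
    return digits[::-1]


def from_subtokens(digits, base):
    """Inverse of to_subtokens for fully unmasked digit sequences."""
    value = 0
    for d in digits:
        value = value * base + d
    return value


def candidates(partial, base):
    """All token ids consistent with a partially masked digit sequence."""
    length = len(partial)
    return [v for v in range(base ** length)
            if all(p is MASK or p == d
                   for p, d in zip(partial, to_subtokens(v, base, length)))]


print(to_subtokens(11, 4, 2))      # digits of token 11 in base 4
print(candidates([2, MASK], 4))    # tokens still possible once the first digit is known
```

A partially masked token narrows the set of candidate tokens without fixing one, which gives the sampler useful intermediate states between fully masked and fully revealed.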
Length flexibility and editability motivate another family of extensions. FlexMDM starts from the empty string rather than a fixed fully masked canvas and generates by inserting masks and then unmasking them, while preserving exact any-order inference under the stochastic-interpolant formulation. The paper reports much better length modeling than fixed-length MDMs and shows that retrofitting LLaDA-8B into FlexMDM improves both GSM8K accuracy and code infilling (Kim et al., 31 Aug 2025). For high-resolution text-to-image synthesis, Nemotron-Labs-Diffusion-Image adds token editing so already-unmasked tokens can be revised, and introduces Grouped Cross-Entropy for large vocabularies; it is evaluated on GenEval, DPG, and HPSv3 (Li et al., 29 Jun 2026). In molecular graphs, MELD learns per-element corruption trajectories for atoms and bonds to avoid “state-clashing,” and reports increased chemical validity on ZINC250K (Seo et al., 22 May 2025). In streaming speech, VocalNet-MDM adapts MDM through hierarchical block-wise masking and iterative self-distillation, reporting faster decoding and lower first-chunk latency than autoregressive baselines (Cheng et al., 9 Feb 2026).
6. Empirical footprint across domains
The empirical record shows that MDM is not confined to one modality or one architectural recipe. Reported results span unconditional and conditional language generation, image synthesis, molecular and genomic modeling, streaming speech, reasoning-oriented constrained decoding, and even self-supervised representation learning.
| Domain | Representative reported result | Source |
|---|---|---|
| Text language modeling | OpenWebText perplexity 15.36 with MDM-Prime | (Chao et al., 24 May 2025) |
| Post-training text refinement | Lower GPT-2-large generative perplexity with SCMDM | (Cardei et al., 28 Apr 2026) |
| Streaming speech | Faster decoding and lower first-chunk latency than autoregressive baselines | (Cheng et al., 9 Feb 2026) |
| High-resolution text-to-image | Evaluated on GenEval, DPG, and HPSv3 | (Li et al., 29 Jun 2026) |
| Molecular graphs | Increased ZINC250K validity with MELD | (Seo et al., 22 May 2025) |
| Semantic segmentation pretraining | Dice and IoU reported on GlaS with full labels | (Pan et al., 2023) |
In text, MDMs have been positioned as scalable bidirectional alternatives to ARMs. A 1.1B text MDM trained at scale outperforms same-data TinyLlama on four of eight zero-shot benchmarks, achieves competitive math reasoning with 7B Llama-2 on GSM8K, and reportedly breaks the reverse curse that affects much larger autoregressive models on synthetic relational reversal tasks (Nie et al., 2024). At the same time, the time-agnostic critique argues that once categorical sampling is corrected, generative perplexity remains far behind strong ARMs and prior claims of superiority are not apples-to-apples (Zheng et al., 2024). The controversy is therefore not whether MDMs can generate text at all, but whether their empirical advantage survives corrected samplers and matched compute.
Outside text generation, the masking formalism has been repurposed in a more explicitly self-supervised direction. “Masked Diffusion as Self-supervised Representation Learner” replaces Gaussian corruption with masking and uses SSIM as the pretraining objective for semantic segmentation. On GlaS, it reports Dice and IoU results both with full labels and in a reduced-label setting, while also outperforming DDPM and MAE baselines on multiple datasets (Pan et al., 2023). This use of the term “MDM” is conceptually adjacent but not identical to the generative discrete-sequence formulation dominant in text and multimodal generation.
Across these domains, several recurring limitations remain visible: fixed-length canvases, inability to revise unmasked tokens, factorization error in few-step decoding, order sensitivity, high training variance, and expensive iterative inference. Much of the recent literature can be read as a sequence of targeted responses to those bottlenecks: richer mask state spaces, learned schedules, remasking and editing, variance-aware training, and post-training or distilled acceleration.