Masked Diffusion Models

Updated 3 July 2026

Masked diffusion is a generative modeling approach that progressively masks and reconstructs data using iterative denoising.
It integrates parallel sampling with variational and energy-based training to efficiently generate both discrete and continuous outputs.
Its flexible framework extends to diverse domains such as language, images, graphs, and molecules, serving as a competitive alternative to autoregressive models.

Masked diffusion refers to a family of generative modeling techniques in which data is progressively corrupted by masking components (such as tokens, pixels, or graph elements), and a neural network is trained to reverse this corruption through iterative denoising. It is broadly applicable to both discrete data (text, sequences, graphs) and continuous domains (images) and unifies reconstruction-based learning, parallel generation, and flexible factorization of dependencies. Masked diffusion models (MDMs) have established themselves as competitive or superior alternatives to autoregressive models across a range of domains, with clear mathematical foundations, variational objectives, and efficient parallel sampling mechanisms.

1. Mathematical Foundation and Process

Masked diffusion is defined by an absorbing Markov process in which a clean sample $x_0$ is corrupted over discrete or continuous steps by replacing portions of the data (tokens, patches, atoms, bonds) with a special mask symbol $m$. For discrete data, the forward noising process is typically

$q(x_t^i | x_0^i) = (1 - \alpha_t) \cdot \delta_{x_0^i}(x_t^i) + \alpha_t \cdot \delta_{m}(x_t^i),$

where $\alpha_t$ is a monotonically increasing masking schedule and $\delta$ denotes the Kronecker delta. This process can be implemented in continuous or discretized timesteps, with increasing rates of corruption as $t$ increases. In images, similar mechanisms apply at the level of patches or pixels (Zheng et al., 2023, Lei et al., 2023, Pan et al., 2023).
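The forward process above can be illustrated with a few lines of Python. This is a minimal sketch, not a reference implementation: the `MASK` string, the function name `forward_mask`, and the example sentence are all assumptions made for illustration.

```python
import random

MASK = "<mask>"  # hypothetical stand-in for the mask symbol m


def forward_mask(x0, alpha_t, rng):
    """Sample x_t ~ q(x_t | x_0).

    Each position is independently replaced by MASK with probability
    alpha_t and kept equal to x_0 with probability 1 - alpha_t.
    """
    return [MASK if rng.random() < alpha_t else tok for tok in x0]


rng = random.Random(0)
x0 = ["the", "cat", "sat", "on", "the", "mat"]
for alpha_t in (0.0, 0.5, 1.0):
    # alpha_t = 0 leaves the sample clean, alpha_t = 1 masks everything.
    print(alpha_t, forward_mask(x0, alpha_t, rng))
```

Because positions are corrupted independently, the same function covers tokens, patches, or graph elements once they are flattened into a sequence.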

The reverse process is constructed either as a learned neural approximation to the exact posterior $q(x_{t-1}|x_t, x_0)$, or as a parametric family with explicit variational bounds. In the simplest case, encoder-only Transformers or denoising networks predict the original components at masked positions, with each denoising step applied in parallel across the masked slots (Sahoo et al., 2024).

A central feature is the element-wise absorbing mask: in the forward process, a position that has been masked stays masked at all later times, and in the standard reverse process a position that has been unmasked is not masked again. This process can be extended to support fine-grained control over which elements are masked, variable-length generation, and partial or sub-token masking (see Section 4).
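As a worked instance of the exact posterior mentioned above, the following is a standard consequence of the absorbing forward process, written in this article's convention where $\alpha_t$ is the probability of being masked at time $t$. The symbol $s$, denoting an earlier time with $s < t$, is introduced here for illustration. For a position that is masked at time $t$,

$q(x_s^i = m \mid x_t^i = m, x_0^i) = \frac{\alpha_s}{\alpha_t}, \qquad q(x_s^i = x_0^i \mid x_t^i = m, x_0^i) = \frac{\alpha_t - \alpha_s}{\alpha_t},$

while a position that is already unmasked at time $t$ is left unchanged. The first identity follows because a position masked at time $s$ is necessarily masked at time $t$, so the joint probability of both events is $\alpha_s$ and the conditional probability is $\alpha_s / \alpha_t$.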

2. Variational Objectives, Energy Perspectives, and Theoretical Guarantees

Training of masked diffusion models is grounded in a variational bound (ELBO) on the data likelihood. For fixed schedules, the loss reduces to a weighted mixture of cross-entropy losses on the masked positions:

$\mathcal{L} = \int_0^1 \frac{\alpha_t'}{\alpha_t} \;\mathbb{E}_{q(x_t|x_0)} \left[ \sum_{i:\, x_t^i = m} -\log p_\theta(x_0^i | x_t) \right] dt,$

where $p_\theta$ is the learned denoising model, $\alpha_t' = d\alpha_t/dt$, and $\alpha_t$ is the masking probability of Section 1 (Sahoo et al., 2024, Zhang et al., 27 Oct 2025). For the linear schedule $\alpha_t = t$, the weight reduces to $1/t$. This loss is equivalent to a mixture of masked language modeling (MLM) objectives with varying mask rates, which aligns MDM training with widespread semi-supervised and self-supervised reconstruction approaches (Sahoo et al., 2024).
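To make the "mixture of MLM objectives" reading concrete, the sketch below estimates the objective by Monte Carlo. It assumes the convention of Section 1, in which $\alpha_t$ is the probability of being masked, and the linear schedule $\alpha_t = t$, so that the per-timestep weight $\alpha_t'/\alpha_t$ is $1/t$. The uniform `uniform_denoiser`, the integer mask id, and the clipping constant `t_min` are hypothetical choices made only for illustration.

```python
import math
import random

MASK = -1  # hypothetical integer id for the mask symbol m


def uniform_denoiser(xt, vocab_size):
    """Stand-in for p_theta: a uniform distribution at every position."""
    return [[1.0 / vocab_size] * vocab_size for _ in xt]


def weighted_masked_ce(x0, xt, probs, t):
    """Cross-entropy summed over masked positions, weighted by 1 / t."""
    ce = sum(-math.log(probs[i][x0[i]])
             for i, tok in enumerate(xt) if tok == MASK)
    return ce / t


def mdm_loss_estimate(x0, vocab_size, denoiser, n_samples, rng, t_min=1e-3):
    """Average the weighted loss over random mask rates t in [t_min, 1)."""
    total = 0.0
    for _ in range(n_samples):
        # Clipping at t_min keeps the 1 / t weight bounded (an assumption).
        t = t_min + (1.0 - t_min) * rng.random()
        xt = [MASK if rng.random() < t else tok for tok in x0]
        total += weighted_masked_ce(x0, xt, denoiser(xt, vocab_size), t)
    return total / n_samples


rng = random.Random(0)
x0 = [3, 1, 4, 1, 5]
estimate = mdm_loss_estimate(x0, 8, uniform_denoiser, 2000, rng)
# For a uniform model the expected value is len(x0) * log(vocab_size).
print(estimate, len(x0) * math.log(8))
```

Each draw of `t` corresponds to one MLM-style objective at mask rate `t`; the `1 / t` weight compensates for the fact that higher mask rates contribute more masked positions per sample.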

From an energy minimization perspective, MDMs are shown to solve discrete optimal transport (OT) problems, with kinetic, conditional kinetic, and geodesic energies provably equivalent and minimized if the masking schedule obeys a trigonometric law $\alpha_t^* = \sin^2(\frac{\pi}{2} \gamma_t)$, where $\gamma_t$ is a suitably parameterized schedule (e.g., Beta CDF) (Chen et al., 17 Sep 2025).
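The trigonometric law can be evaluated directly. The sketch below assumes the simplest inner schedule $\gamma_t = t$, which coincides with the Beta(1, 1) CDF; other Beta parameters would need a numerical CDF and are not shown. The function name `sin2_schedule` is an illustrative choice.

```python
import math


def sin2_schedule(gamma_t):
    """Masking probability alpha_t = sin^2(pi/2 * gamma_t)."""
    return math.sin(0.5 * math.pi * gamma_t) ** 2


# With gamma_t = t the schedule rises from 0 (clean) to 1 (fully masked),
# slowly near both endpoints and fastest around t = 0.5.
for t in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(t, round(sin2_schedule(t), 4))
```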

These formulations provide both path-space KL guarantees and provably optimal schedules for denoising and sampling. They also allow efficient search for schedule parameters to adapt MDMs for low-step sampling or task-specific unmasking distributions.

3. Sampling, Parallelism, and Efficient Generation

The absorbing-mask structure of masked diffusion models allows sampling (generation) of data in highly parallel or any-order regimes. Since every masked position can, in principle, be denoised independently conditioned on the observed (unmasked) context, MDMs admit:

Fully parallel sampling, where all masks are revealed in one or a small number of steps, akin to non-autoregressive decoding (Sahoo et al., 2024, Shah et al., 28 Nov 2025);
Any-order or blockwise sampling, where a model can reveal tokens in arbitrary order or in multiple blocks, balancing global structure and local context (Kim et al., 31 Aug 2025, Hong et al., 2 Feb 2026);
Semi-autoregressive and adaptive orderings, including learned unmasking policies that dictate context-dependent generation orders for improved quality (Hong et al., 2 Feb 2026).
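The regimes listed above differ mainly in how many positions are revealed per step and in which order. The following sketch uses one loop for both extremes. It is illustrative only: `toy_denoiser` is a hypothetical stand-in for a trained network, the vocabulary is made up, and positions are revealed in uniformly random order rather than by a learned policy.

```python
import math
import random

MASK = None
VOCAB = ["a", "b", "c", "d"]  # hypothetical vocabulary


def toy_denoiser(x, rng):
    """Stand-in for p_theta(x_0^i | x_t): proposes a token per position."""
    return [rng.choice(VOCAB) for _ in x]


def sample(length, num_steps, rng):
    """Start fully masked and reveal positions over at most num_steps steps."""
    x = [MASK] * length
    for step in range(num_steps):
        masked = [i for i, tok in enumerate(x) if tok is MASK]
        if not masked:
            break
        # Reveal an equal share of the remaining masks; the last step
        # reveals everything that is still masked.
        k = math.ceil(len(masked) / (num_steps - step))
        proposal = toy_denoiser(x, rng)  # one parallel network call per step
        for i in rng.sample(masked, k):
            x[i] = proposal[i]
    return x


rng = random.Random(0)
print(sample(8, 1, rng))  # fully parallel: one step
print(sample(8, 8, rng))  # one position per step, in random order
```

Setting `num_steps` between 1 and the sequence length interpolates between the two regimes, trading network calls against the amount of context available to each prediction.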

Recent advances introduce variance-reduction methods for training, such as Pareto-optimal $t$-sampling (P-POTS), MiRROred mask sampling (MIRROR), stratified $t$-sampling, and importance correction, to mitigate the intrinsic variance introduced by mask-pattern and mask-rate randomness, bringing training stability on par with autoregressive baselines (Jia et al., 22 Nov 2025).

Sampling efficiency has also been improved with the First-Hitting Sampler (FHS), which interprets discrete masking as a continuous-time Markov process and allows token-by-token unmasking at the (theoretically) optimal stopping times, yielding an approximately $20\times$ wall-clock acceleration (Zheng et al., 2024). Hybrid speculative decoding architectures further reduce function evaluations per sample by combining non-causal "draft" and causal "target" transformer heads (Campbell et al., 4 Oct 2025).
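The first-hitting idea can be sketched for one special case. Assume the linear schedule $\alpha_t = t$ (an assumption made here, not a statement about the cited implementation). A masked position then stays masked from time $t$ down to an earlier time $s$ with probability $s/t$, so all $n$ masked positions stay masked with probability $(s/t)^n$, and the next unmasking time can be drawn by inverse-CDF sampling as $t \cdot u^{1/n}$ with $u$ uniform on $[0, 1)$. The function name below is hypothetical.

```python
import random


def first_hitting_times(num_masked, rng, t_start=1.0):
    """Draw the times at which successive positions are unmasked.

    Assumes the linear schedule alpha_t = t. With n positions still
    masked at time t, the next event time is t * u ** (1 / n).
    """
    times = []
    t = t_start
    for n in range(num_masked, 0, -1):
        u = rng.random()
        t = t * u ** (1.0 / n)
        times.append(t)
    return times


rng = random.Random(0)
# One event per masked position, moving from t = 1 toward t = 0.
print(first_hitting_times(5, rng))
```

A sampler built on these times calls the network only at the moments where something actually changes, instead of stepping through a fixed time grid on which many steps may unmask nothing.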

4. Extensions and Domains: Graphs, Images, Language, Length Variability

Masked diffusion has been productively specialized and extended for diverse domains:

Molecular and graph generation: Standard MDMs suffer state-clashing, where multiple molecules' masked states collapse, producing multimodal posteriors unresolvable by unimodal reverse models. Element-wise learnable noise scheduling (as in MELD) assigns distinct masking rates to each atom and bond, dramatically improving chemical validity and conditional property alignment in molecular benchmarks (Seo et al., 22 May 2025).
Partial masking and sub-token granularity: By decomposing each token into sub-tokens (e.g., base-b mapping), partial masking enables finer-grained denoising, reducing idle steps and achieving both superior likelihood and sample quality (Prime scheme) (Chao et al., 24 May 2025).
Flexible-length generation: Flexible Masked Diffusion Models (FlexMDMs) extend MDMs to support variable-length sequences by introducing discrete-time insertion events in addition to unmasking, with theoretical guarantees for exact transport and strong improvements in length fidelity and downstream performance (Kim et al., 31 Aug 2025).
Image generation and representation: Masked diffusion matches or exceeds purely continuous diffusion in efficiency; training vision transformers with masking accelerates convergence, reduces compute by $2\times$ to $4\times$, matches FID scores of state-of-the-art models, and enhances representation learning in self-supervised schemes (Zheng et al., 2023, Lei et al., 2023, Hansen-Estruch et al., 2024, Pan et al., 2023).
Language and sequence modeling: In natural language, context-dependent generation order, multi-domain curriculum learning, and self-conditioning adaptations yield state-of-the-art perplexity, outperforming prior non-AR LMs and approaching or matching autoregressive quality (Hong et al., 2 Feb 2026, Cardei et al., 28 Apr 2026, Sahoo et al., 2024, Kocabay et al., 20 Mar 2026).
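The base-b decomposition mentioned under partial masking in the list above can be sketched as follows. This illustrates only the sub-token mapping, not the cited scheme's training procedure; the function names and the use of `None` as the sub-token mask are assumptions.

```python
def to_subtokens(token_id, base, num_digits):
    """Write token_id as num_digits base-`base` digits, most significant first."""
    digits = []
    for _ in range(num_digits):
        digits.append(token_id % base)
        token_id //= base
    if token_id:
        raise ValueError("token_id does not fit in num_digits digits")
    return digits[::-1]


def from_subtokens(digits, base):
    """Inverse of to_subtokens."""
    value = 0
    for d in digits:
        value = value * base + d
    return value


def compatible_tokens(partial, base):
    """Token ids consistent with a partially masked digit list (None = masked)."""
    n = len(partial)
    return [tid for tid in range(base ** n)
            if all(p is None or p == d
                   for p, d in zip(partial, to_subtokens(tid, base, n)))]


print(to_subtokens(11, 4, 2))           # 11 = 2 * 4 + 3
print(compatible_tokens([2, None], 4))  # ids whose leading digit is 2
```

A partially masked token narrows the candidate set without fixing it, which gives the intermediate states that a binary masked/unmasked scheme lacks.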

Domain generalization is robust; masked diffusion is now applied to recommendation, vision-language captioning, genomics, and protein design, with explicit architectures for domain-specific schedules and losses (Shah et al., 28 Nov 2025, Feng et al., 30 Oct 2025, Seo et al., 22 May 2025, Pan et al., 2023).

5. Order Dependence, Dependency Modeling, and Limitations

A critical axis along which masked diffusion models operate is the generation order and the dependency structure between predicted positions.

Order-expressive frameworks allow the masking/unmasking schedule to depend on context, sequence difficulty, or explicit learned policies, unifying ARMs, block diffusion, and masking within a single variational objective (OeMDM framework) (Hong et al., 2 Feb 2026).
Variational extensions with latent variables (VMD) address inherent limitations of factorized masked diffusion: standard MDMs struggle with joint consistency when predicting multiple interdependent tokens at once. Introducing a global latent variable $z$ or block-wise latents enables learning of joint posteriors, significantly improving global consistency for complex dependencies (e.g., Sudoku, reasoning) (Zhang et al., 27 Oct 2025).
Parallelism versus dependency tradeoff: While full parallelism benefits efficiency, it can harm the modeling of strong inter-token dependencies in settings where constraints or structure require joint sampling. This motivates adaptively choosing the number of tokens unmasked per step or explicit augmentation with dependency-aware architectures (Zhang et al., 27 Oct 2025, Chao et al., 24 May 2025).
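The joint-consistency problem described above can be seen in a two-token toy example. Suppose the only valid sequences are "AA" and "BB", each with probability 0.5 (a made-up distribution for illustration). A model that predicts both masked positions in one parallel step can at best output the per-position marginals, and sampling from their product puts mass on invalid sequences.

```python
from itertools import product

# Hypothetical target distribution: the two tokens must agree.
joint = {("A", "A"): 0.5, ("B", "B"): 0.5}


def marginals(joint):
    """Per-position marginal distributions of a joint over sequences."""
    n = len(next(iter(joint)))
    margs = [dict() for _ in range(n)]
    for seq, p in joint.items():
        for i, tok in enumerate(seq):
            margs[i][tok] = margs[i].get(tok, 0.0) + p
    return margs


def factorized(margs):
    """Distribution obtained by sampling every position independently."""
    out = {}
    for combo in product(*[m.items() for m in margs]):
        seq = tuple(tok for tok, _ in combo)
        prob = 1.0
        for _, q in combo:
            prob *= q
        out[seq] = prob
    return out


parallel = factorized(marginals(joint))
invalid_mass = sum(p for seq, p in parallel.items() if seq not in joint)
print(parallel)
print(invalid_mass)  # probability of sampling "AB" or "BA"
```

Unmasking one token first and conditioning the second prediction on it removes the invalid mass in this example, which is the tradeoff between parallelism and dependency modeling in miniature.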

Several empirical studies highlight potential drawbacks and open questions, including late-step multimodality (residual ambiguity in heavily masked states), entropy collapse from numerical truncations in categorical sampling, and the need for explicit positional or order-awareness in highly structured domains (Zheng et al., 2024, Seo et al., 22 May 2025).
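Categorical sampling in these models is often carried out with the Gumbel-max trick, which is where numerical truncation can enter. The sketch below is a plain-Python version. Python floats are double precision, so it corresponds to the float64 case; the suggestion that lower-precision uniform draws truncate the Gumbel noise and thereby lower the entropy of the samples is a reading of the cited concern, not something this code demonstrates. The function name is illustrative.

```python
import math
import random


def gumbel_max_sample(log_probs, rng):
    """Draw an index i with probability exp(log_probs[i]).

    Adds independent Gumbel(0, 1) noise g = -log(-log(u)) to each
    log-probability and returns the argmax.
    """
    best_i, best_v = -1, -math.inf
    for i, lp in enumerate(log_probs):
        u = rng.random()
        while u == 0.0:  # avoid log(0)
            u = rng.random()
        g = -math.log(-math.log(u))
        if lp + g > best_v:
            best_i, best_v = i, lp + g
    return best_i


rng = random.Random(0)
probs = [0.7, 0.2, 0.1]
log_probs = [math.log(p) for p in probs]
counts = [0, 0, 0]
for _ in range(5000):
    counts[gumbel_max_sample(log_probs, rng)] += 1
# Empirical frequencies should be close to probs.
print([c / 5000 for c in counts])
```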

6. Engineering, Training, and Practical Considerations

Masked diffusion models are efficiently implemented using encoder-only Transformers for text or ViT/UNet hybrids for images. Modern engineering practices—large batch sizes, AdamW, cosine learning rate schedules, T5-style relative positions, and gradient checkpointing—enable scaling to billion-parameter regimes and stable optimization (Sahoo et al., 2024, Lei et al., 2023).

Key engineering improvements include:

Two-stage pretraining (random masking) and fine-tuning (no masking) for rapid convergence and sample efficiency (Lei et al., 2023);
Masked transformer architectures leveraging asymmetric encoder-decoder splits, lightweight decoders, and MAE-style auxiliary loss terms to boost efficiency and generation quality (Zheng et al., 2023, Hansen-Estruch et al., 2024);
Post-training self-conditioning, soft-masking, and variance-reduction techniques for stability, faster adaptation, and resilience to limited data or domain shifts (Cardei et al., 28 Apr 2026, Jia et al., 22 Nov 2025, Hersche et al., 20 Oct 2025);
Parallel blockwise decoding, speculative sampling, and first-hitting time strategies for high-throughput inference (Campbell et al., 4 Oct 2025, Zheng et al., 2024);
Element-wise and schedule-learned masking rates for graphs and molecules, increasing generation fidelity and property alignment (Seo et al., 22 May 2025).

Empirically, MDMs recover or surpass performance of strong autoregressive baselines as measured by perplexity, FID, accuracy, or task-specific metrics in language, vision, molecular, and recommendation benchmarks (Sahoo et al., 2024, Chen et al., 17 Sep 2025, Shah et al., 28 Nov 2025, Kim et al., 31 Aug 2025, Zhang et al., 27 Oct 2025).

7. Future Directions and Open Challenges

Open research avenues in masked diffusion center on:

Further order adaptivity: Joint optimization of unmasking order and model parameters; leveraging content or context-aware schedule networks (Hong et al., 2 Feb 2026);
Advanced dependency modeling: Integration of hierarchical or multi-scale latents, attention over learned posets, or mixture posterior parameterizations for highly structured domains (Zhang et al., 27 Oct 2025, Seo et al., 22 May 2025);
Numerical stability: Broader adoption of float64 Gumbel-max sampling and quantification of sampling-induced entropy shifts (Zheng et al., 2024);
Task-adaptive scheduling: Automated or energy-inspired selection of mask and interpolant schedules for resource-constrained or domain-specific sampling (Chen et al., 17 Sep 2025);
Application to variable and complex data types: Extension to variable-length, multimodal, and hybrid continuous–discrete domains; cross-domain curriculum learning and joint pretraining (Kim et al., 31 Aug 2025, Feng et al., 30 Oct 2025);
Refinement of theoretical connections: Deepening links with optimal transport, path-space variational inference, and discrete stochastic process theory (Chen et al., 17 Sep 2025).

Masked diffusion has thus matured into a broad, theoretically grounded, and computationally versatile paradigm for generative modeling and representation learning across discrete and continuous domains, with substantial ongoing developments in order expressivity, dependency modeling, and efficiency.