Flexible Masked Diffusion Models (FlexMDM)

Updated 1 June 2026

FlexMDM is a unified generative modeling framework that uses learnable, context-dependent masking schedules to enhance performance on images, sequences, and graphs.
It decouples corruption and denoising orders via parameterized schedulers, enabling adaptive inference and results that improve on the prior state of the art.
Empirical studies show FlexMDM boosts sample quality, reduces computational costs, and achieves robust generalization on benchmarks like CelebA-HQ and GSM8K.

Flexible Masked Diffusion Models (FlexMDM) are a class of generative models that generalize and unify masked diffusion processes via learnable, context-dependent generation orders, adaptive masking schedules, and support for variable-length structured outputs across modalities including images, sequences, and graphs. FlexMDM encompasses multiple methodological advances, including per-dimension or element-wise learnable noise schedules, uncertainty-guided generation, autoregressive reductions, and joint optimization of scheduling and denoising within a unified continuous-time variational framework. By decoupling the corruption and denoising order via parameterized schedulers, FlexMDM achieves state-of-the-art data modeling, robust generalization, and efficient parallel or adaptive inference.

1. Foundational Formulations and Unification

Masked diffusion models (MDMs) operate by iteratively corrupting an input (e.g., an image, sequence, or graph) through masking, then learning to reverse this corruption via neural denoising. In standard MDMs, all positions follow identical mask schedules, leading to uniform random orderings. FlexMDM extends this by allowing order-expressive, position- and context-dependent masking schedules $\alpha^{(i)}(u, t)$; these schedules encode either static or learnable, input-adaptive generation orders (Hong et al., 2 Feb 2026, Garg et al., 24 Nov 2025).

The forward process is parameterized as: $q_\alpha(z_t^{(i)} \mid x) = \mathrm{Cat}\bigl(\alpha^{(i)}(u, t)\,x^{(i)} + (1-\alpha^{(i)}(u, t))\,m\bigr)$ with $m$ a special mask token, and $\alpha^{(i)}$ controlling the masking rate for each position $i$ (tokens, pixels, graph elements). The corresponding reverse process $p_{\theta, \hat\alpha}$ is conditioned on the current masked state $z_t$ and context-dependent reverse schedule $\hat\alpha^{(i)}(\hat{u}, t)$.
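The forward corruption step can be illustrated with a minimal Python sketch. It assumes that $\alpha^{(i)}$ is the probability that position $i$ is still unmasked at time $t$ (so $\alpha = 1$ at $t = 0$ and $\alpha = 0$ at $t = 1$) and it drops the context argument $u$ for brevity. The names `forward_mask`, `linear_alpha`, and `MASK` are hypothetical and not taken from the cited works.

```python
import random

MASK = "<mask>"  # stand-in for the special mask token m


def linear_alpha(i, t):
    """Hypothetical schedule: probability that position i is unmasked at time t.

    It ignores i, so every position shares one schedule (the standard-MDM case).
    """
    return 1.0 - t


def forward_mask(x, t, alpha=linear_alpha, rng=random):
    """Sample z_t: keep x[i] with probability alpha(i, t), otherwise emit MASK."""
    return [tok if rng.random() < alpha(i, t) else MASK for i, tok in enumerate(x)]


x = ["the", "cat", "sat", "down"]
print(forward_mask(x, 0.0))  # alpha = 1 everywhere, so nothing is masked
print(forward_mask(x, 0.5))  # each token is masked independently with probability 0.5
print(forward_mask(x, 1.0))  # alpha = 0 everywhere, so everything is masked
```

Passing a different `alpha` callable that depends on `i` is all that is needed to move from a shared schedule to a position-dependent one in this sketch.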

This order-expressive framework (OeMDM) subsumes conventional MDMs (uniform schedule), autoregressive models (ARMs; sharp, strictly monotonic schedule), block-diffusion (piecewise constant schedule), and FlexMDM (arbitrary learnable schedules) (Hong et al., 2 Feb 2026, Garg et al., 24 Nov 2025, Karami et al., 23 Jan 2026).
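As a rough illustration of how one schedule family can cover these cases, the sketch below assigns each block of positions a time window over which its keep-probability falls from 1 to 0. This windowed form and the helper `make_schedule` are assumptions made for illustration. The AR-like case uses a narrow ramp per position as a stand-in for the sharp schedule described above.

```python
def windowed_alpha(t, lo, hi):
    """Keep-probability that falls linearly from 1 to 0 as t crosses [lo, hi]."""
    if t <= lo:
        return 1.0
    if t >= hi:
        return 0.0
    return (hi - t) / (hi - lo)


def make_schedule(L, block_size):
    """Hypothetical helper: one masking window per block of positions.

    block_size == L -> a single shared window (uniform, standard MDM)
    block_size == 1 -> one narrow window per position (AR-like, left to right)
    otherwise       -> block-diffusion-like
    """
    n_blocks = -(-L // block_size)  # ceiling division

    def alpha(i, t):
        b = i // block_size
        # Later blocks are masked first in the forward process,
        # so the reverse process fills block 0 first.
        lo = (n_blocks - 1 - b) / n_blocks
        hi = (n_blocks - b) / n_blocks
        return windowed_alpha(t, lo, hi)

    return alpha


L = 4
for name, block_size in [("uniform", 4), ("block", 2), ("AR-like", 1)]:
    alpha = make_schedule(L, block_size)
    print(name, [round(alpha(i, 0.6), 2) for i in range(L)])
```

At an intermediate time, the uniform schedule gives every position the same keep-probability, while the block and AR-like schedules have already fully masked the later positions and keep the earlier ones mostly intact.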

2. Learning and Parameterizing Generation Order

The key methodological advance of FlexMDM is the introduction of learnable, potentially context-adaptive noise schedules. For a sequence of $L$ positions, FlexMDM parameterizes the masking rate for each position via $\alpha_\phi^{(i)}(x, t)$, with learnable parameters $\phi$. This enables the model to discover orderings that are optimal for the target data distribution and task (Garg et al., 24 Nov 2025, Hong et al., 2 Feb 2026).

A typical parameterization computes the schedule with a small neural head atop a "feature" trunk, together with terms that control monotonicity and a term that ensures suitable normalization.

The model's joint continuous-time loss depends on both the denoiser and the forward schedule "velocity". Backpropagation occurs through both schedule and denoiser, and the inference order is dynamically adapted using the learned schedule at sampling (Hong et al., 2 Feb 2026).
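For orientation, the standard continuous-time masked-diffusion objective with a single scalar schedule is a cross-entropy over masked positions, weighted by the schedule's time derivative. A per-position analogue is sketched below. It is an assumption made by analogy with that standard form, not the exact objective of the cited work, and it ignores any extra terms that arise when the reverse schedule differs from the forward one. It reads the "velocity" as $\partial_t \alpha^{(i)}$.

```latex
\mathcal{L}(\theta, \phi)
  = \mathbb{E}_{x} \int_0^1
    \mathbb{E}_{z_t \sim q_\alpha(\cdot \mid x)}
    \Biggl[ \sum_{i=1}^{L}
      \frac{\partial_t \alpha_\phi^{(i)}(x, t)}{1 - \alpha_\phi^{(i)}(x, t)}\,
      \mathbf{1}\bigl[z_t^{(i)} = m\bigr]\,
      \log p_\theta\bigl(x^{(i)} \mid z_t\bigr)
    \Biggr] \, dt
```

Because the schedule is non-increasing in $t$ and the log-probability is non-positive, each summand is non-negative, so minimizing the expression raises the likelihood of the true tokens at masked positions.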

The FlexMDM objective decomposes into an expectation over autoregressive log-losses with respect to the implicitly defined distribution over permutations (generation orders) that arises from the schedule (Garg et al., 24 Nov 2025, Karami et al., 23 Jan 2026).
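One way to see how a schedule induces a distribution over generation orders is to sample, for each position, the time at which it becomes masked in the forward process, then sort positions by that time. The sketch below assumes independent per-position schedules that are non-increasing in $t$, with value 1 at $t = 0$ and 0 at $t = 1$. The function names are hypothetical.

```python
import random


def sample_mask_time(alpha_i, rng, iters=40):
    """Draw T with P(T > t) = alpha_i(t) by bisection on a non-increasing schedule."""
    u = rng.random()
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if alpha_i(mid) > u:
            lo = mid
        else:
            hi = mid
    return hi


def sample_generation_order(alphas, rng):
    """Positions masked latest in the forward process are generated first in reverse."""
    times = [sample_mask_time(a, rng) for a in alphas]
    return sorted(range(len(alphas)), key=lambda i: times[i], reverse=True)


def window(lo, hi):
    """Schedule that drops linearly from 1 to 0 over the time window [lo, hi]."""
    return lambda t: min(1.0, max(0.0, (hi - t) / (hi - lo)))


uniform = [window(0.0, 1.0)] * 4
ar_like = [window(0.75, 1.0), window(0.5, 0.75), window(0.25, 0.5), window(0.0, 0.25)]

print(sample_generation_order(uniform, random.Random(0)))  # some random permutation
print(sample_generation_order(ar_like, random.Random(0)))  # always [0, 1, 2, 3]
```

With a shared schedule the masking times are independent and identically distributed, so every permutation is equally likely. With non-overlapping windows the order is fixed, which matches the autoregressive limit.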

3. Uncertainty-Guided Adaptive Inference

Uncertainty in the denoising process, quantified as denoising entropy, plays a central role in adaptive decoding and sample quality. Denoising entropy is defined as the average Shannon entropy over the masked positions' predictive distributions at each intermediate state: $H_{\mathrm{DE}}(z_t) = \frac{1}{|\mathcal{M}_t|} \sum_{i \in \mathcal{M}_t} \mathcal{H}\bigl[p_\theta(x^{(i)} \mid z_t)\bigr]$ where $\mathcal{M}_t$ denotes currently masked positions. The cumulative entropy along the generative path, $\sum_t H_{\mathrm{DE}}(z_t)$, is used to select or steer decoding paths away from high-uncertainty states (Chen et al., 24 Dec 2025).
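The definition above translates directly into code. The following sketch assumes the denoiser's predictions are available as one probability list per position and that the path entropy is a plain sum over visited states. The function names and data layout are illustrative only.

```python
import math


def shannon_entropy(p):
    """Entropy in nats of one categorical distribution given as a list of probabilities."""
    return -sum(q * math.log(q) for q in p if q > 0.0)


def denoising_entropy(probs, masked):
    """Average entropy of the predictive distributions at the masked positions.

    probs  : mapping (or list) from position to its predicted distribution
    masked : iterable of currently masked positions
    """
    masked = list(masked)
    if not masked:
        return 0.0
    return sum(shannon_entropy(probs[i]) for i in masked) / len(masked)


def path_entropy(states):
    """Cumulative entropy: sum of denoising entropies over (probs, masked) states."""
    return sum(denoising_entropy(p, m) for p, m in states)


probs = {0: [0.5, 0.5], 1: [1.0, 0.0], 2: [0.25, 0.75]}
print(denoising_entropy(probs, [0, 1]))  # mean of ln(2) and 0
print(path_entropy([(probs, [0, 1]), (probs, [0])]))
```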

Two algorithms leverage this:

E-BoN (Entropy-based Best-of-N): Selects the lowest path-entropy among $N$ full reverse samples.
E-SMC (Entropy-guided Sequential Monte Carlo): Online population-based decoding with periodic resampling focused on low-entropy trajectories.
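Both procedures can be sketched in a few lines. In the code below, `toy_sample_path` is a hypothetical stand-in for a full reverse pass that returns a sample with its path entropy. The exponential weighting `exp(-lam * entropy)` used in the resampling step is an assumption made for illustration, not necessarily the weighting used in the cited work.

```python
import math
import random


def toy_sample_path(rng):
    """Hypothetical stand-in for one full reverse pass: returns (sample, path_entropy)."""
    h = rng.uniform(0.0, 5.0)
    return (f"sample-with-H={h:.2f}", h)


def entropy_best_of_n(sample_path, n, rng):
    """E-BoN sketch: draw n full trajectories and keep the one with lowest path entropy."""
    candidates = [sample_path(rng) for _ in range(n)]
    return min(candidates, key=lambda c: c[1])


def resample_low_entropy(particles, entropies, lam, rng):
    """One E-SMC-style resampling step favouring low-entropy partial trajectories.

    Assumed weighting: w_k proportional to exp(-lam * entropy_k).
    """
    weights = [math.exp(-lam * h) for h in entropies]
    return rng.choices(particles, weights=weights, k=len(particles))


rng = random.Random(0)
print(entropy_best_of_n(toy_sample_path, n=8, rng=rng))
print(resample_low_entropy(["a", "b", "c"], [0.1, 2.0, 4.0], lam=1.0, rng=rng))
```

Best-of-N pays for N complete reverse passes and selects afterwards, whereas the population-based variant reallocates compute during decoding by duplicating promising partial trajectories and dropping uncertain ones.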

Minimizing denoising entropy strongly correlates with reduced per-token loss, lower perplexity, and enhanced performance on reasoning/code tasks. Entropy-guided policies can be combined with learned policies, and path-level entropy acts as both a metric and a reward in potential reinforcement learning setups for decoder optimization (Chen et al., 24 Dec 2025).

4. Extensions: Variable-Length, Blockwise, and Structured Data

FlexMDM generalizes to variable-length data and discrete structured domains:

Variable-length Sequences: FlexMDM introduces both insertion and unmasking operations in the forward interpolant, governed by two independent schedules, one for insertions and one for unmasking. This enables modeling open-ended sequences and infilling. The reverse process couples a learned unmasking posterior with a learned insertion model; sampling remains exact under any unmasking order provided transitions are sampled from these posteriors. Inference can leverage adaptive orderings based on per-position confidence or other heuristics (Kim et al., 31 Aug 2025).

Molecular Graphs: Element-wise learnable schedules are critical for avoiding the state-clashing failure mode in standard MDMs, where masking collisions induce multimodal posteriors irreconcilable with element-wise denoisers. FlexMDM employs a noise-scheduling MLP that assigns distinct mask probabilities to each atom and bond, preventing collapse and enabling near-perfect validity (e.g., ZINC250K: 93.2% for FlexMDM vs. 27.8% for polynomial-schedule MDM). Straight-through Gumbel-softmax enables end-to-end training (Seo et al., 22 May 2025).

Blockwise and Causal Reductions: Reinterpreting MDMs as block-wise causal models enables a permutation-equivariant, strictly-causal attention architecture. FlexMDM supports progressive permutation curricula and strided block-parallel decoding, achieving higher throughput with minimal degradation and rapid fine-tuning recovery. The training objective reduces to an autoregressive sum over blocks, making the full process analogous to a mixture over ARMs with learnable, possibly dynamic, block structures (Karami et al., 23 Jan 2026, Hong et al., 2 Feb 2026).
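The confidence-based adaptive ordering mentioned for variable-length sequences can be sketched for the simpler fixed-length case, leaving out insertions. The code assumes a hypothetical `predict` interface that returns a token distribution for each masked position. The order of positions is chosen by confidence, while each token is still sampled from its predicted distribution rather than taken greedily.

```python
import random

MASK = None  # stand-in for the mask token


def toy_predict(z):
    """Hypothetical denoiser: a token distribution for every masked position."""
    table = {0: {"a": 0.6, "b": 0.4}, 1: {"c": 1.0}, 2: {"d": 0.9, "e": 0.1}}
    return {i: table[i] for i, tok in enumerate(z) if tok is MASK}


def confidence_order_decode(z, predict, rng, k=1):
    """Unmask, k at a time, the masked positions whose top probability is highest.

    Returns the completed sequence and the order in which positions were filled.
    """
    z, order = list(z), []
    while any(tok is MASK for tok in z):
        preds = predict(z)
        ranked = sorted(preds, key=lambda i: max(preds[i].values()), reverse=True)
        for i in ranked[:k]:
            tokens, weights = zip(*preds[i].items())
            z[i] = rng.choices(tokens, weights=weights)[0]  # sample, not argmax
            order.append(i)
    return z, order


seq, order = confidence_order_decode([MASK, MASK, MASK], toy_predict, random.Random(0))
print(seq, order)  # positions are filled in the order [1, 2, 0]
```

Raising `k` fills several positions per denoiser call, trading some adaptivity for fewer forward passes.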

5. Empirical Performance and Generalization

Experimental validation across domains demonstrates the empirical strengths of FlexMDM:

Image generation: FlexMDM exhibits substantial FID improvements on CelebA-HQ 256×256 and a 25–33% training cost reduction against U-ViT baselines. Generalizability is evidenced by significant FID gains (up to 46%) under data scarcity (Lei et al., 2023).
Text and code: FlexMDM reduces perplexity over fixed-order diffusion models (e.g., LoMDM vs. MDLM on OWT) and achieves pronounced accuracy increases on math and code infilling (GSM8K and HumanEval after FlexMDM fine-tuning) (Kim et al., 31 Aug 2025, Hong et al., 2 Feb 2026).
Tabular and molecular data: Learnable schedules match or improve over state-of-the-art (TabDiff and others) with far fewer parameters, and on molecular generation, validity improves substantially over fixed-schedule MDMs (Seo et al., 22 May 2025, Garg et al., 24 Nov 2025).
Inference flexibility: Strided and adaptive decoding achieve substantial speedup with limited perplexity loss; entropy-guided sampling consistently boosts generation quality on benchmarks spanning text, code, and planning (Karami et al., 23 Jan 2026, Chen et al., 24 Dec 2025).

6. Applications, Design Trade-offs, and Limitations

Practical deployment of FlexMDM requires consideration of masking granularity, schedule parameterization, and computational trade-offs:

Mask Ratio: Optimal performance typically arises at 50–70% masking in image models; very high mask rates can be stabilized by appropriate cosine schedules. In sequences, learned schedules address the order-selection sensitivity inherent to random or blockwise orders (Lei et al., 2023, Hong et al., 2 Feb 2026).
Adaptive Policies: Order selection can be fixed, learned, or adaptive at inference using entropy or other heuristics. Learnable-order models converge faster and yield improved sample quality at lower training cost (Hong et al., 2 Feb 2026, Chen et al., 24 Dec 2025).
Generalization: Element-wise and position-adaptive schedules prevent mode collapse in graphs and better capture the statistical dependencies in structured data (Seo et al., 22 May 2025).
Limitations: At very late diffusion steps, even element-wise schedules may retain some irreducible multimodality for complex structures. Scaling FlexMDM to very large graphs or integrating additional edit operations (beyond insertion/unmask) are open research directions (Seo et al., 22 May 2025, Kim et al., 31 Aug 2025).

7. Future Directions

Research on FlexMDM is advancing several axes:

Joint entropy modeling: Improved uncertainty quantification via joint or mutual information metrics, moving beyond independent per-element entropy (Chen et al., 24 Dec 2025).
Reinforcement learning for ordering: Leveraging denoising entropy as a reward in training explicit decoder policies or integrating RL-based planners (Chen et al., 24 Dec 2025).
Multimodal and 3D structures: Extensions to 3D molecular graphs, multimodal data, and continuous-valued domains (Seo et al., 22 May 2025).
Hybrid AR–MDM models: Combining autoregressive scaffolding for "easy" tokens/regions with diffusion-based completion for complex/flexible regions (Chen et al., 24 Dec 2025, Hong et al., 2 Feb 2026).
Exactness and editability: Further development of theoretically grounded frameworks for any-order inference, edit-based generative modeling, and exact sample matching to target distributions (Kim et al., 31 Aug 2025, Hong et al., 2 Feb 2026).

In summary, Flexible Masked Diffusion Models provide a unified, theoretically principled, and empirically validated framework for masked diffusion-based generative modeling with explicit and learnable generation order. This flexibility enables state-of-the-art performance and efficient inference across a wide spectrum of data modalities and tasks (Lei et al., 2023, Garg et al., 24 Nov 2025, Chen et al., 24 Dec 2025, Kim et al., 31 Aug 2025, Karami et al., 23 Jan 2026, Seo et al., 22 May 2025, Hong et al., 2 Feb 2026).