FlexMDMs: Flexible Masked Diffusion Models
- FlexMDMs are a family of unified generative models that extend masked and diffusion paradigms to enable any-order and variable-length generation across diverse domains.
- They employ flexible forward processes with learnable masking and unmasking schedules in both continuous and discrete settings, enhancing training efficiency and model performance.
- The framework integrates efficient transformer architectures and auxiliary objectives to achieve state-of-the-art results in image, language, molecular, and multimodal applications.
Flexible Masked Diffusion Models (FlexMDMs) are a family of generative models that unify and extend masked and diffusion paradigms for modeling complex data. FlexMDMs emphasize forward process flexibility, any-order and variable-length generation, masking and unmasking as core operations, and efficient, modular training, enabling broad applications across vision, language, molecules, and multimodal domains.
1. Conceptual Foundations and Unified Framework
FlexMDMs generalize the classical diffusion model framework by introducing flexible masking and unmasking processes within both continuous and discrete domains. In the continuous domain, flexibility is often achieved by parameterizing the spatial dynamics of the forward stochastic differential equation (SDE), for example, through learnable Riemannian metrics or symplectic forms that guarantee convergence toward a target (usually Gaussian) distribution (Du et al., 2022).
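Du et al.'s exact parameterization is not reproduced here, but the underlying principle can be illustrated with a minimal NumPy sketch: if the drift is $-(D+Q)x$ for a symmetric positive semi-definite $D$ and an antisymmetric $Q$, and the diffusion coefficient is $\sqrt{2D}$, then the standard Gaussian remains the stationary distribution no matter how $D$ and $Q$ are chosen (or learned). The matrices and function names below are illustrative assumptions, not the authors' notation.

```python
import numpy as np

def flexible_forward_sde(x0, n_steps=2000, dt=1e-2, seed=0):
    """Euler-Maruyama simulation of a forward SDE with drift -(D + Q) x and
    diffusion sqrt(2 D): D (symmetric PSD) and Q (antisymmetric) can be chosen
    freely -- e.g. learned -- while the stationary law stays N(0, I)."""
    rng = np.random.default_rng(seed)
    d = x0.shape[0]

    # Illustrative "learnable" components (random here, a network in practice).
    A = rng.normal(size=(d, d))
    D = A @ A.T / d + 0.1 * np.eye(d)   # spatially adaptive metric (symmetric PSD)
    B = rng.normal(size=(d, d))
    Q = B - B.T                         # antisymmetric mixing term

    L = np.linalg.cholesky(2.0 * D)     # diffusion coefficient sqrt(2 D)
    x = x0.copy()
    for _ in range(n_steps):
        x = x - (D + Q) @ x * dt + np.sqrt(dt) * (L @ rng.normal(size=d))
    return x                            # approximately N(0, I) for large n_steps

x_T = flexible_forward_sde(3.0 * np.ones(8))  # corrupt a data vector toward the prior
```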
In discrete settings, as in masked image or language modeling, the forward process is formulated as a Markov chain that stochastically masks observed tokens (or patches), followed by a reverse process that iteratively reconstructs or “unmasks” the original sequence. This masking process may be independent across tokens or element-specific, and the training objective is generalized as

$$
\mathcal{L} \;=\; \mathbb{E}_{x_0,\, t,\; x_t \sim q_t(\cdot \mid x_0)}\Big[\, w(t) \sum_{i:\, x_t^{i} = \mathbf{m}} -\log p_\theta\big(x_0^{i} \mid x_t\big) \Big],
$$

where $q_t(\cdot \mid x_0)$ is the masking distribution governed by a schedule $\alpha_t$, $w(t)$ is a weighting function encapsulating the mask dynamics, and $\mathbf{m}$ denotes the mask token (You et al., 10 Mar 2025).

This formulation encompasses MaskGIT and MAR (with fixed-ratio masking and uniform weighting $w(t) = 1$), as well as Masked Diffusion Models (MDMs) employing independent Bernoulli masking and the ELBO-derived weighting $w(t) = \tfrac{-\alpha_t'}{1-\alpha_t}$. By varying the masking schedule, weighting, and prediction parameterization, the framework unifies prior discrete diffusion approaches with modern masked generation paradigms.
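As a concrete illustration of this objective, the following is a minimal PyTorch sketch, assuming a token-level model that returns logits of shape (batch, length, vocab); `alpha`, `weight`, and `MASK_ID` are stand-ins for the schedule, the weighting function, and the mask token in the formula above, not any particular paper's API.

```python
import torch
import torch.nn.functional as F

MASK_ID = 0  # stand-in id for the mask token m

def generalized_masked_loss(model, x0, alpha, weight):
    """One Monte Carlo estimate of the unified objective: sample t, mask each
    token independently with probability 1 - alpha(t), then take a
    w(t)-weighted cross-entropy over the masked positions only."""
    B, L = x0.shape
    t = torch.rand(B, 1)                               # t ~ U(0, 1)
    keep = torch.rand(B, L) < alpha(t)                 # token survives with prob alpha(t)
    x_t = torch.where(keep, x0, torch.full_like(x0, MASK_ID))

    logits = model(x_t)                                # (B, L, vocab)
    ce = F.cross_entropy(logits.transpose(1, 2), x0, reduction="none")  # (B, L)
    loss_per_seq = (ce * (~keep).float()).sum(dim=1)   # sum over masked positions
    return (weight(t).squeeze(1) * loss_per_seq).mean()

# weight(t) = 1 recovers MaskGIT/MAR-style training; an ELBO-derived weight
# such as -alpha'(t) / (1 - alpha(t)) recovers MDM training.
```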
2. Flexible Forward Processes and Learnable Schedules
A defining feature of FlexMDMs is their capacity to learn or adapt the forward (corruption/masking) process:
- Continuous Domain: By parameterizing the spatial component of the forward SDE and optionally introducing an antisymmetric mixing term, one can tailor the noise injection to better align with the data manifold. For instance, the FP-Diffusion model specifies a drift ensuring the stationary distribution remains Gaussian while allowing for spatially adaptive and even degenerate diffusion (Du et al., 2022).
- Discrete Domain & State-Dependent Schedules: FlexMDMs may deploy state-dependent masking, where each token or element (e.g., atom, bond) follows its own learnable corruption curve. In molecular generation, this element-wise learnability prevents “state-clashing,” where semantically distinct structures would otherwise collapse into indistinguishable corrupted states. MELD (Masked Element-wise Learnable Diffusion) parameterizes the per-element mask with a scheduling network, thereby separating forward trajectories and drastically improving chemical validity and property alignment (e.g., ZINC250K validity: 15%→93%) (Seo et al., 22 May 2025).
- Variable-Length and Insertion Modeling: FlexMDMs extend beyond fixed-length generation by allowing for dynamic token insertions. The extended stochastic interpolant framework governs both insertion and unmasking, each with a learnable schedule, enabling the generation of sequences whose length matches the data distribution, as opposed to legacy MDMs, which calibrate poorly to real-world length statistics (Kim et al., 31 Aug 2025).
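MELD's exact scheduling network is not reproduced here; the sketch below (PyTorch) only illustrates the general mechanism of a state-dependent, learnable masking schedule in which each element gets its own corruption rate. The module name, rate parameterization, and shapes are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ElementwiseSchedule(nn.Module):
    """Maps each element's embedding to a positive corruption rate r_i, giving a
    per-element survival probability alpha_i(t) that starts at 1 and decays to 0."""
    def __init__(self, embed_dim):
        super().__init__()
        self.rate = nn.Sequential(nn.Linear(embed_dim, 64), nn.SiLU(),
                                  nn.Linear(64, 1), nn.Softplus())

    def forward(self, elem_emb, t):
        # elem_emb: (B, N, D) per-element features; t: (B, 1) in [0, 1)
        r = self.rate(elem_emb).squeeze(-1)                        # (B, N) rates
        # alpha_i(t) = exp(-r_i * t / (1 - t)): 1 at t=0, -> 0 as t -> 1, so every
        # element is eventually masked, but along its own learnable curve.
        return torch.exp(-r * t / torch.clamp(1.0 - t, min=1e-3))

def elementwise_mask(x0, alpha, mask_id):
    """Mask element i independently with probability 1 - alpha_i(t), so
    semantically distinct elements follow distinct forward trajectories."""
    keep = torch.rand_like(alpha) < alpha
    return torch.where(keep, x0, torch.full_like(x0, mask_id))
```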
3. Training Methodologies and Architectural Choices
FlexMDMs benefit from both efficient training regimes and architectural innovations:
- Masked and Asymmetric Transformers: Leveraging transformers that process only unmasked patches/tokens (with lightweight decoders for reconstruction) reduces memory and computation, as shown by training cost reductions of 60–80% without quality loss (Zheng et al., 2023, Wei et al., 2023); a schematic encoder–decoder split is sketched after this list.
- Auxiliary Objectives: Joint score-matching and masked patch reconstruction objectives help models maintain long-range coherence, even when only partial data is visible (Zheng et al., 2023).
- Segmented and Dynamically Masked Inference: Training-free NAS (e.g., Flexiffusion) discovers optimal generation schedules and architectural routes (full, partial, null steps), facilitating segment-wise dynamic masking and maximizing efficiency. Notably, Flexiffusion can accelerate inference by 2–5× with negligible FID degradation across large image models (Huang et al., 3 Jun 2025).
- One-Step Distillation: Distilling a multi-step masked diffusion teacher into a one-step generator (Di[M]O) uses token-level distribution matching and noise-injected initialization. This achieves near-teacher performance with a single pass, dramatically reducing inference time in both class- and text-conditional generation (Zhu et al., 19 Mar 2025).
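The asymmetric design mentioned in the first bullet can be sketched as follows (PyTorch, MAE-style). This is a simplified stand-in, not the exact architecture of Zheng et al. (2023) or Wei et al. (2023): the heavy encoder runs only on visible tokens, and a shallow decoder reconstructs the full sequence; the fixed-visible-count assumption keeps the indexing simple.

```python
import torch
import torch.nn as nn

class AsymmetricMaskedBackbone(nn.Module):
    """Heavy encoder over visible tokens only; lightweight decoder re-inserts a
    learned [MASK] embedding at masked positions and predicts the full sequence,
    so most compute scales with the visible subset rather than the full length."""
    def __init__(self, dim=256, n_enc=8, n_dec=2, nhead=4, vocab=1024, max_len=256):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.pos = nn.Parameter(torch.zeros(1, max_len, dim))
        self.mask_emb = nn.Parameter(torch.zeros(1, 1, dim))
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead, batch_first=True), n_enc)
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead, batch_first=True), n_dec)
        self.head = nn.Linear(dim, vocab)

    def forward(self, tokens, visible):      # tokens: (B, L) ids, visible: (B, L) bool
        B, L = tokens.shape
        h = self.embed(tokens) + self.pos[:, :L]
        # Encode only the visible tokens (assumes the same visible count per sequence).
        vis = self.encoder(h[visible].view(B, -1, h.shape[-1]))
        # Scatter encoded tokens back; fill masked slots with the mask embedding.
        full = self.mask_emb.expand(B, L, -1).clone()
        full[visible] = vis.reshape(-1, vis.shape[-1])
        return self.head(self.decoder(full + self.pos[:, :L]))  # (B, L, vocab)
```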
4. Performance Analysis and Empirical Results
FlexMDMs demonstrate competitive or superior performance across a range of benchmarks:
| Model / Domain | Metric | Baseline | FlexMDM Variant | Outcome |
|---|---|---|---|---|
| CelebA-HQ 256×256 | FID | U-ViT: 24.83 | MaskDM-B: 6.27 (Lei et al., 2023) | Record FID, ~80% less training time |
| ImageNet 256×256 | FID (low NFE) | VAR: >2.02 | eMIGM-H: 2.02 (You et al., 10 Mar 2025) | Outperforms VAR |
| ImageNet 512×512 | FID | EDM2 (SOTA) | eMIGM-L: better FID at ~60% of the NFE | Lower computational cost |
| ZINC250K | Chemical validity | MDM: 15% | MELD: 93% (Seo et al., 22 May 2025) | Improved valid-molecule generation |
| CIFAR-10 / ImageNet 64×64 | Bits per dimension | Prior DDMs | MD4: 2.75 / 3.40 (Shi et al., 6 Jun 2024) | Surpasses ARMs of similar size |
| GSM8K (math) | Accuracy | MDM: 58% | FlexMDM: 67% (Kim et al., 31 Aug 2025) | Retrofitting yields a 9-point gain |
These results validate that careful design of the forward process, masking, and inference can yield significant gains not only in efficiency (lower NFE, less compute) but also in sample quality, calibration to variable-length tasks, and property alignment in molecular and planning domains.
5. Extensions: Multimodal, Editing, and Scientific Applications
FlexMDMs have been extended to enable:
- Multimodal Generative Modeling: By integrating modality-specific encoders and decoder heads, a unified (multi-modal) diffusion backbone simultaneously synthesizes and reconstructs multiple data types—images, labels, masked images, and auxiliary representations—within one shared latent space with multi-task objectives (Chen et al., 24 Jul 2024).
- Fine-Grained Editable Generation: DICE (Discrete Inversion for Controllable Editing) enhances discrete diffusion and masked generative models by tracking the residual “noise” during inversion, enabling precise, local content editing without pre-defined masks or attention patching—applicable to both images (e.g., VQ-Diffusion, Paella) and LLMs (e.g., RoBERTa) (He et al., 10 Oct 2024).
- Spatiotemporal Scientific Forecasting: FLEX introduces a backbone for physical system modeling (e.g., turbulence), operating in residual space with hybrid U-Net/Transformer architectures and weak/strong hierarchical conditioning, achieving accurate super-resolution and forecasting even under out-of-distribution physical regimes and boundary conditions (2505.17351).
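FLEX's full architecture is not reproduced here; the following minimal sketch only illustrates the residual-space idea, collapsed to a one-step regressor for brevity (FLEX itself applies a diffusion model in this residual space). The upsampling factor, shapes, and `denoiser` interface are illustrative assumptions.

```python
import torch.nn.functional as F

def residual_space_loss(denoiser, coarse, target):
    """Fields are (B, C, H, W). Rather than modeling the fine field directly,
    model only the residual between it and a cheap bilinear upsample of the
    coarse field, which normalizes scales across flow regimes."""
    base = F.interpolate(coarse, scale_factor=4, mode="bilinear", align_corners=False)
    residual = target - base
    return F.mse_loss(denoiser(base), residual)

def residual_space_predict(denoiser, coarse):
    """Inference: add the predicted residual back onto the upsampled base field."""
    base = F.interpolate(coarse, scale_factor=4, mode="bilinear", align_corners=False)
    return base + denoiser(base)
```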
6. Theoretical Insights and Time-Agnosticism
Recent theory has uncovered several properties key to FlexMDMs:
- Time-Agnostic Training and Sampling: MDMs can be formulated such that explicit time-conditioning disappears, replaced by the masked token count. The first-hitting sampler (FHS) exploits this property, providing a parallel and highly efficient sampling mechanism with up to 20× speedup compared to classic diffusion sampling. This also highlights connections to order-agnostic masked and autoregressive models (Zheng et al., 4 Sep 2024); a mask-count-conditioned sampler is sketched after this list.
- Non-Normal Diffusion Processes: FlexMDMs benefit from generalized diffusion step distributions that relax the normality assumption, allowing use of Laplace or Uniform increments with corresponding alternative loss functions (e.g., L1/L2), potentially trading off sample sharpness, regularization, and density estimation (Li, 10 Dec 2024).
- Generalized Losses and Schedules: The loss landscape for masked diffusion can be simplified to weighted cross-entropy integrals, and state-dependent schedules can be learned for flexible, data-aligned masking. For language and image modeling, this has resulted in improved perplexity and bits-per-dimension compared to prior discrete diffusion or even autoregressive baselines (Shi et al., 6 Jun 2024).
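To make the time-agnostic property concrete, here is a minimal PyTorch sketch of an ancestral sampler that conditions on the number of still-masked tokens instead of a continuous time variable. It is not the first-hitting sampler itself (FHS unmasks tokens in parallel by sampling hitting times); the `model(x, n_masked)` signature and single-token updates are illustrative assumptions.

```python
import torch

@torch.no_grad()
def mask_count_sampler(model, length, mask_id, device="cpu"):
    """Start fully masked and unmask one randomly chosen position per step.
    The only 'clock' the model ever sees is the current count of masked tokens."""
    x = torch.full((1, length), mask_id, dtype=torch.long, device=device)
    for n_masked in range(length, 0, -1):
        logits = model(x, n_masked)                  # conditioned on mask count, not t
        probs = torch.softmax(logits, dim=-1)        # (1, length, vocab)
        masked_pos = (x == mask_id).nonzero(as_tuple=False)[:, 1]
        pos = masked_pos[torch.randint(len(masked_pos), (1,), device=device)]
        x[0, pos] = torch.multinomial(probs[0, pos], 1).squeeze(-1)
    return x
```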
7. Future Directions and Open Challenges
Key ongoing research directions include:
- Variable-Length and Adaptive Generation: Continued development of insertion/unmasking frameworks to more naturally handle open-ended or human-like editing and planning tasks (Kim et al., 31 Aug 2025).
- Lessons from NAS and Efficient Inference: Leveraging segment-wise, dynamic-masking NAS for both model search and runtime adaptation promises further speedups and resource efficiency while maintaining generative quality (Huang et al., 3 Jun 2025).
- Scalability to Large Domains: Scaling FlexMDMs to multi-billion parameter regimes has shown scaling laws comparable to ARMs, with only a 16× compute gap (versus 64× for continuous diffusion), suggesting continued progress as system-level optimizations improve (Nie et al., 24 Oct 2024).
- Modalities Beyond Images and Text: Flexible forward trajectories and masking architectures are being successfully extended to graphs/molecules, audio-video, and spatiotemporal systems, providing a unifying foundation across domains (Seo et al., 22 May 2025, Nunez et al., 2023, 2505.17351).
- Theoretical and Practical Evaluation: Addressing numerical artifacts in categorical sampling (32-bit vs. 64-bit precision), properly benchmarking sample diversity, and clarifying when FlexMDMs match or surpass ARMs in generation and reasoning.
FlexMDMs synthesize score-based, masked, and autoencoding paradigms, giving models principled flexibility in forward-process design, masking strategy, sequence length, and task-specific adaptation. Through these innovations, they are well positioned for diverse generative tasks, efficient large-scale deployment, and continued progress in multi-domain, editable, and interactive AI systems.