- The paper presents FlexMDM, a masked diffusion model that supports variable-length sequences and any-order token insertion, trained with a dual-objective loss.
- It builds on continuous-time Markov chains and the stochastic interpolant framework, predicting token unmasking and insertion for efficient inference.
- Experiments show a nearly 60% higher success rate on maze planning and improved performance on text generation and code infilling tasks.
Any-Order Flexible Length Masked Diffusion
Introduction
The paper introduces Flexible Masked Diffusion Models (FlexMDM), which extend Masked Diffusion Models (MDMs) with variable-length sequences and token insertion while preserving any-order generation. This addresses a key limitation of current MDMs, which generate fixed-length sequences and do not support token insertion.
In FlexMDM, sequences are generated by progressively inserting mask tokens and then unmasking them, driven by two predictions: the expected number of mask tokens to insert at each position and the posterior over clean tokens at masked positions.
Figure 1: Flexible Masked Diffusion Model (FlexMDM) addresses MDMs' inability to handle variable-length sequences and token insertion while preserving any-order generation power.
Preliminaries: Continuous-Time Markov Chains and Masked Diffusions
FlexMDM is grounded in the stochastic interpolant framework, extended to variable-length sequences via continuous-time Markov chains (CTMCs). The paper relates CTMCs to discrete diffusion and uses them to generate sequences by inserting and unmasking tokens. An insertion schedule and an unmasking schedule together define the stochastic interpolant, which specifies the path of distributions between the base and target distributions.
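To make the two schedules concrete, here is a minimal sketch of sampling the time-t state of a clean sequence under such an interpolant. The schedules `alpha` (probability a token has been inserted by time t) and `beta` (probability an inserted token has been unmasked by time t), as well as the per-token independence, are illustrative simplifications rather than the paper's exact construction.

```python
import random

MASK = "<mask>"

def sample_interpolant(x, t, alpha, beta):
    """Sample the time-t state of a clean sequence x under a FlexMDM-style
    interpolant (a sketch; alpha/beta are hypothetical schedules).

    alpha(t): probability a token has been inserted by time t.
    beta(t):  probability an inserted token has been unmasked by time t.
    Tokens not yet inserted are dropped; inserted-but-masked tokens
    appear as MASK; unmasked tokens appear as themselves.
    """
    out = []
    for tok in x:
        if random.random() < alpha(t):  # token already inserted at time t
            out.append(tok if random.random() < beta(t) else MASK)
    return out

# With linear schedules, the sequence is empty at t=0 and fully
# inserted and unmasked at t=1.
seq = ["the", "cat", "sat"]
print(sample_interpolant(seq, 0.5, lambda t: t, lambda t: t))
```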
Figure 2: MDM interpolant process diagram.
FlexMDM Training Methodology
Training FlexMDM involves learning two components: the unmasking posterior, as in standard MDMs, and the expected number of tokens to insert. The loss combines both terms so that the model learns the correct rate matrix for sequence generation:
- Unmasking Posterior: models the distribution over clean tokens at masked positions.
- Insertion Expectation: predicts the expected number of tokens to insert between adjacent tokens in the sequence.
This dual objective is what lets FlexMDM handle variable-length sequences efficiently at scale.
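A schematic PyTorch version of such a two-part objective is sketched below. The tensor names, the equal weighting of the two terms, and the L2 surrogate for the insertion term are assumptions for illustration; the paper derives the principled objective.

```python
import torch
import torch.nn.functional as F

def flexmdm_loss(token_logits, insert_pred, xt_mask, targets, insert_counts):
    """Schematic two-part FlexMDM-style training loss (a sketch, not the
    paper's exact weighting). All tensor names are illustrative.

    token_logits:  (B, L, V) posterior logits over clean tokens.
    insert_pred:   (B, L+1)  predicted expected #masks to insert per gap.
    xt_mask:       (B, L)    bool, True where x_t carries a mask token.
    targets:       (B, L)    clean token ids (read only at masked positions).
    insert_counts: (B, L+1)  ground-truth #deleted tokens per gap in x_t.
    """
    # Unmasking term: cross-entropy over clean tokens at masked positions.
    ce = F.cross_entropy(
        token_logits[xt_mask], targets[xt_mask], reduction="mean"
    )
    # Insertion term: regress the expected number of insertions per gap
    # (an L2 surrogate standing in for the principled objective).
    ins = F.mse_loss(insert_pred, insert_counts.float())
    return ce + ins
```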
Inference with FlexMDM
FlexMDM supports adaptive inference in which tokens can be unmasked in an arbitrary order, preserving the theoretical any-order guarantees of MDMs. It uses tau-leaping to carry out insertion and unmasking accurately and efficiently during inference.
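One tau-leaping step can be pictured as follows: sample a Poisson number of new masks for each gap, and unmask each existing mask with some probability, drawing the revealed token from the predicted posterior. In the sketch below, the per-step quantities (`unmask_prob`, `token_probs`, `gap_means`) would come from the trained network and the schedules; here they are taken as given, so this is an illustrative sketch rather than the paper's exact sampler.

```python
import numpy as np

MASK = -1  # sentinel id for the mask token

def tau_leap_step(seq, unmask_prob, token_probs, gap_means, rng):
    """One tau-leaping step of FlexMDM-style inference (a minimal sketch).

    seq:         list of token ids, MASK where still masked.
    unmask_prob: probability each mask unmasks during this step.
    token_probs: (len(seq), V) posterior over clean tokens per position.
    gap_means:   (len(seq)+1,) expected #masks to insert per gap this step.
    """
    out = []
    for i, tok in enumerate(seq):
        # Insert a Poisson number of fresh masks in the gap before position i.
        out.extend([MASK] * rng.poisson(gap_means[i]))
        if tok == MASK and rng.random() < unmask_prob:
            # Unmask by sampling from the predicted posterior.
            tok = int(rng.choice(len(token_probs[i]), p=token_probs[i]))
        out.append(tok)
    out.extend([MASK] * rng.poisson(gap_means[-1]))  # final gap
    return out

# Toy usage: a partially masked sequence over a 10-token vocabulary.
rng = np.random.default_rng(0)
probs = np.full((3, 10), 0.1)  # uniform posterior, for illustration
print(tau_leap_step([MASK, 7, MASK], 0.5, probs,
                    np.array([0.3, 0.0, 0.3, 0.3]), rng))
```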

Figure 3: Maze task illustration. The model is given subgoals and is required to connect them.
FlexMDM's adaptive inference is crucial for tasks like maze planning, where token positions cannot be sensibly preassigned without accurate a priori knowledge of the solution's length and structure.
Experimental Validation
FlexMDM performs strongly across varied tasks, outperforming baseline MDMs by nearly 60% in success rate on synthetic maze planning and demonstrating its effectiveness in subgoal-style planning. It also models length distributions in text generation more faithfully without sacrificing sequence perplexity.
For scalability, FlexMDM can be retrofitted onto pretrained MDM weights, enabling rapid transfer and improved performance on benchmarks such as GSM8K and code infilling.
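Retrofitting can be pictured as wrapping the pretrained backbone with an extra insertion head alongside the existing token head. The module below is a hypothetical sketch: the `backbone` interface, head shapes, and per-position (rather than per-gap) insertion output are simplifying assumptions, not the paper's architecture.

```python
import torch.nn as nn

class FlexMDMRetrofit(nn.Module):
    """Pretrained MDM backbone plus a new insertion head (illustrative)."""

    def __init__(self, backbone, hidden_dim, vocab_size):
        super().__init__()
        self.backbone = backbone  # pretrained MDM transformer (assumed)
        self.token_head = nn.Linear(hidden_dim, vocab_size)  # unmasking posterior
        self.insert_head = nn.Linear(hidden_dim, 1)  # expected #insertions

    def forward(self, x, t):
        h = self.backbone(x, t)               # (B, L, hidden_dim)
        logits = self.token_head(h)           # (B, L, vocab_size)
        inserts = self.insert_head(h).squeeze(-1)  # (B, L)
        return logits, inserts
```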
Figure 4: FlexMDM scales more favorably as more sampling steps are allocated.
Conclusion
FlexMDM provides a robust framework for discrete diffusion over variable-length sequences, significantly extending the capabilities of traditional MDMs. It preserves any-order generation while enabling efficient training and inference at scale.
By supporting token insertion, FlexMDM brings generative modeling closer to the way sequences are naturally composed in human language and beyond, with practical applications across diverse domains.