Dilated Scheduled Unmasking
- Dilated Scheduled Unmasking (DUS) is a deterministic scheduling method that leverages geometric dilation and fast-mixing Markov chains for efficient, parallel token unmasking in MDLMs.
- It reduces the number of denoiser calls per block from O(B) to O(log B) by revealing non-adjacent tokens in parallel, ensuring speed gains without architectural changes.
- Empirical evaluations on benchmarks like GSM8K, HumanEval, and MBPP show that DUS achieves higher accuracy and robust generalization compared to confidence-based planners.
Dilated Scheduled Unmasking (DUS) is a deterministic scheduling method for inference in masked diffusion LLMs (MDLMs) that enables efficient, parallel unmasking of non-adjacent tokens. DUS leverages the fast-mixing property of first-order Markov chains to partition masked token positions into dilation-based, well-spaced groups, which can then be revealed in parallel without requiring any changes to model architecture or additional training. This approach delivers a substantial inference speedup, reducing the number of model calls from $O(B)$ to $O(\log B)$ per block of $B$ tokens, while maintaining or exceeding output quality compared to prior parallel planners.
1. Motivation for Dilated Scheduled Unmasking
MDLMs provide the theoretical foundation for parallel, any-order text generation, circumventing the sequential dependencies required by autoregressive (AR) models. However, typical samplers for MDLMs, such as those based on per-token denoiser confidence or entropy, act as implicit planners that disregard pairwise token dependencies. This shortcoming results in two critical issues:
- Ignored dependencies: Parallel planners using confidence scores fail to account for the conditional dependencies between simultaneously unmasked tokens, leading to degradation in generation quality—especially in highly-interconnected domains like code or math.
- Inefficiency: Despite their parallel agenda, these approaches typically require $O(B)$ denoiser calls to resolve a block of $B$ tokens, offering speed comparable to AR models and negating the core advantage of MDLMs.
DUS addresses both issues by providing a planner-free, model-agnostic schedule that exploits Markov chain properties for maximal safe parallelism.
2. Algorithmic Structure of DUS
DUS defines a schedule that performs token unmasking in geometric, dilation-based patterns across a block of masked tokens. The scheduling is as follows:
- For a block of $B$ masked tokens, set an exponent base $b$ (typically $b = 2$).
- Iterate $\lceil \log_b B \rceil$ times, where in each iteration $t = 1, \dots, \lceil \log_b B \rceil$:
- Compute stride $s_t = \lceil B / b^t \rceil$.
- Unmask all positions $i$ (with $i$ not yet unmasked) such that $(i - 1) \bmod s_t = 0$.
This approach ensures that the tokens revealed at every iteration are maximally spaced, sharply reducing their interdependence under the Markov chain assumption.
| Iteration $t$ | Stride $s_t$ | Unmasked Indices (for $B = 8$) |
|---|---|---|
| 1 | 4 | 1, 5 |
| 2 | 2 | 3, 7 |
| 3 | 1 | 2, 4, 6, 8 |
Mathematically, for a general step $t$:
$$U_t \;=\; \{\, i \in \{1, \dots, B\} : (i - 1) \bmod s_t = 0 \,\} \setminus A_t,$$
where $A_t$ is the set of already unmasked positions prior to step $t$.
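A minimal Python sketch of this schedule, assuming 1-indexed positions within the block and the stride rule reconstructed above (the function name `dus_schedule` is illustrative, not from the paper):

```python
import math


def dus_schedule(block_size: int, base: int = 2) -> list[list[int]]:
    """Return the DUS unmasking groups for one block of masked tokens.

    Positions are 1-indexed within the block. At iteration t the stride is
    ceil(block_size / base**t); every still-masked position i with
    (i - 1) % stride == 0 is revealed in that iteration.
    """
    revealed: set[int] = set()
    groups: list[list[int]] = []
    t = 1
    while len(revealed) < block_size:
        stride = max(1, math.ceil(block_size / base ** t))
        group = [i for i in range(1, block_size + 1)
                 if (i - 1) % stride == 0 and i not in revealed]
        if group:
            revealed.update(group)
            groups.append(group)
        t += 1
    return groups


# Reproduces the B = 8 example above: [[1, 5], [3, 7], [2, 4, 6, 8]]
print(dus_schedule(8))
```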
3. Theoretical Underpinnings
DUS is predicated on a first-order Markov assumption with fast mixing, under which tokens that are far apart are nearly conditionally independent given the current unmasked context. This enables the following:
- Negligible Mutual Information (MI): Dilation ensures the conditional mutual information between simultaneously unmasked positions is upper-bounded and decays exponentially with their spacing $d$:
  $$I(x_i; x_j \mid x_A) \;\lesssim\; \rho^{2d},$$
  where $\rho < 1$ is the maximal correlation coefficient of the chain.
- Efficient Entropy Reduction: Instead of minimizing the sum of individual entropies (as in confidence-based planners), DUS minimizes the joint conditional entropy
  $$H\!\left(x_{U_t} \mid x_A\right) \;\approx\; \sum_{i \in U_t} H\!\left(x_i \mid x_A\right),$$
  where the approximation holds due to the negligible MI within each group and $A$ is the current set of already unmasked tokens.
By contrast, confidence-based planners may cluster unmasking positions, leading to inflated joint entropy and increased error propagation.
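To make the fast-mixing intuition concrete, the toy computation below evaluates $I(X_0; X_d)$ exactly for a small synthetic two-state Markov chain. The transition matrix is an arbitrary illustrative choice, not from the paper, and the computation is only meant to show the qualitative effect the dilated groups exploit:

```python
import numpy as np

# Illustrative two-state transition matrix (eigenvalues 1.0 and 0.7);
# the second eigenvalue governs how quickly the chain mixes.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])

# Stationary distribution pi, i.e. the left eigenvector of P for eigenvalue 1.
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi = pi / pi.sum()


def mutual_information(d: int) -> float:
    """Exact I(X_0; X_d) in nats for the stationary chain defined by P."""
    Pd = np.linalg.matrix_power(P, d)      # d-step transition probabilities
    joint = pi[:, None] * Pd               # p(x_0 = a, x_d = b)
    marg = joint.sum(axis=0)               # p(x_d = b)
    ratio = joint / (pi[:, None] * marg[None, :])
    return float(np.sum(joint * np.log(ratio)))


for d in (1, 2, 4, 8):
    print(d, mutual_information(d))  # decays geometrically in the spacing d
```

In this toy chain the printed values fall off roughly as $0.7^{2d}$, so positions revealed even a few steps apart carry negligible residual dependence, which is exactly the effect the dilation pattern is designed to exploit.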
4. Computational Complexity and Comparative Analysis
DUS reduces the per-block call count from linear to logarithmic in the block size, an exponential gap relative to prior approaches:
| Method | Denoiser Calls (per block of size $B$) | Extra Planner Model / Training | Scheduling Principle |
|---|---|---|---|
| DUS | $O(\log B)$ | No | Deterministic, dilation-based |
| LLaDA, Dream | $O(B)$ | No | Blockwise / planner-based |
| Confidence/Entropy | $O(B)$ | No | Heuristic (per-token) |
For a target generation length $L$ decoded in blocks of size $B$, DUS reduces the NFE (model calls) to $O\!\left(\frac{L}{B}\log B\right)$.
This deep reduction enables practical exploitation of MDLM parallelism, especially in long-sequence and high-throughput inference settings.
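A small arithmetic sketch of this count, assuming one denoiser call per DUS iteration, $\lceil \log_b B \rceil$ iterations per block, and an illustrative target length of 256 tokens (the helper names are assumptions, not from the paper):

```python
import math


def iters_per_block(block_size: int, base: int = 2) -> int:
    """DUS iterations needed for one block, i.e. ceil(log_base(block_size))."""
    t, covered = 0, 1
    while covered < block_size:
        covered *= base
        t += 1
    return t


def nfe_dus(length: int, block_size: int, base: int = 2) -> int:
    """Total denoiser calls: ceil(L / B) blocks times iterations per block."""
    return math.ceil(length / block_size) * iters_per_block(block_size, base)


L = 256  # illustrative target length; a per-token planner would need ~L calls
for B in (8, 16, 32, 64):
    print(f"B={B:>2}  per-token NFE={L}  DUS NFE={nfe_dus(L, B)}")
```

Since the per-block ratio $B / \lceil \log_2 B \rceil$ is independent of $L$, these counts correspond to the 2.7×, 4.0×, 6.4×, and 10.7× speedups listed in the evaluation table of Section 5.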
5. Empirical Evaluation
Experiments on GSM8K (grade-school math), HumanEval (Python code), and MBPP (Mostly Basic Python Problems) benchmarks demonstrate the effectiveness of DUS across multiple LLMs (LLaDA-Base-8B, LLaDA-Instruct-8B, Dream-Instruct-7B):
- Accuracy and Speed: DUS consistently delivers higher accuracy at every tested inference speedup compared to confidence-based planners. On HumanEval and MBPP, DUS recovers up to 27 percentage points of score at high parallelism, without retraining or modifying the denoiser.
- Robust Generalization: Even as the total number of denoiser iterations decreases (i.e., as the block size $B$ increases), DUS retains higher task performance. For $B = 8$ on GSM8K, DUS achieves 63.08% vs. 59.29% for the self-confidence baseline.
| Block Size $B$ | Self-Conf. Score (%) | DUS Score (%) | Avg. NFE per Block | Inference Speedup |
|---|---|---|---|---|
| 8 | 59.29 | 63.08 | 3 | 2.7× |
| 16 | 51.23 | 59.51 | 4 | 4.0× |
| 32 | 29.04 | 49.36 | 5 | 6.4× |
| 64 | 8.04 | 35.18 | 6 | 10.7× |
Graphical results (see the paper’s Figure 1) consistently place DUS above baseline planners in quality/speed space.
6. Implications for Non-Ordinal Generation and Broader Applications
DUS is particularly suited for non-ordinal tasks—domains where left-to-right token generation is suboptimal and long-range dependencies dominate, as in code completion and mathematical reasoning. The ability to parallelize unmasking without sacrificing generation quality directly addresses the "information bottleneck" imposed by AR and heuristic planners, enabling richer context integration.
DUS is "budget-aware": it adapts to arbitrary block sizes and numbers of function evaluations (NFE), making it suitable for variable resource constraints and long-sequence applications. DUS is agnostic to underlying model architectures and training setups, serving as a drop-in replacement for any block-based MDLM schedule (including LLADA, Dream, and others).
7. Summary and Prospective Directions
DUS delivers an efficient, theoretically motivated inference schedule for MDLM text generation, achieving:
- Exponentially reduced computation: $O(\log B)$ denoiser calls per block.
- Superior or matched output quality versus state-of-the-art non-AR approaches, validated on code and math benchmarks.
- Broad applicability: No model retraining or architectural change required.
A plausible implication is that as sequence lengths and the complexity of generation tasks increase, DUS may represent a critical advance for scaling masked diffusion inference. Continued analysis of its theoretical guarantees under alternative sequence distributions, and potential integration with domain-specific prior knowledge, remain areas for further research.