
Dilated Scheduled Unmasking

Updated 2 November 2025
  • Dilated Scheduled Unmasking (DUS) is a deterministic scheduling method that leverages geometric dilation and fast-mixing Markov chains for efficient, parallel token unmasking in MDLMs.
  • It reduces the number of denoiser calls per block from O(B) to O(log B) by revealing non-adjacent tokens in parallel, ensuring speed gains without architectural changes.
  • Empirical evaluations on benchmarks like GSM8K, HumanEval, and MBPP show that DUS achieves higher accuracy and robust generalization compared to confidence-based planners.

Dilated Scheduled Unmasking (DUS) is a deterministic scheduling method for inference in masked diffusion LLMs (MDLMs) that enables efficient, parallel unmasking of non-adjacent tokens. DUS leverages the fast-mixing property of first-order Markov chains to partition masked token positions into dilation-based, well-spaced groups, which can then be revealed in parallel without requiring any changes to model architecture or additional training. This approach delivers substantial inference speedup, reducing the number of model calls from $\mathcal{O}(B)$ to $\mathcal{O}(\log B)$ per block of $B$ tokens, while maintaining or exceeding output quality compared to prior parallel planners.

1. Motivation for Dilated Scheduled Unmasking

MDLMs provide the theoretical foundation for parallel, any-order text generation, circumventing the sequential dependencies required by autoregressive (AR) models. However, typical samplers for MDLMs, such as those based on per-token denoiser confidence or entropy, act as implicit planners that disregard pairwise token dependencies. This shortcoming results in two critical issues:

  • Ignored dependencies: Parallel planners using confidence scores fail to account for the conditional dependencies between simultaneously unmasked tokens, leading to degradation in generation quality—especially in highly-interconnected domains like code or math.
  • Inefficiency: Despite their parallel agenda, these approaches typically require $\mathcal{O}(B)$ denoiser calls to resolve a block of $B$ tokens, offering speed comparable to AR models and negating the core advantage of MDLMs.

DUS addresses both issues by providing a planner-free, model-free schedule that exploits Markov chain properties for maximal safe parallelism.

2. Algorithmic Structure of DUS

DUS defines a schedule that performs token unmasking in geometric, dilation-based patterns across a block of masked tokens. The scheduling is as follows:

  1. For a block of $B$ masked tokens, choose an exponent base $a$ (typically $a = 2$).
  2. Iterate $R = \lceil \log_a B \rceil$ times, where in each iteration $t$:
    • Compute the stride $s_t = \left\lfloor \frac{B}{a^t} \right\rfloor$.
    • Unmask, in parallel, all not-yet-unmasked positions $k$ such that $(k - 1) \bmod s_t = 0$.

This approach ensures that, at every iteration, the tokens revealed are maximally spaced, which sharply reduces their interdependence given the Markov-chain context.

| Iteration $t$ | Stride $s_t$ | Unmasked indices (for $B = 8$, $a = 2$) |
|---|---|---|
| 1 | 4 | 1, 5 |
| 2 | 2 | 3, 7 |
| 3 | 1 | 2, 4, 6, 8 |

Mathematically, for general $B$:

$$R = \lceil \log_a B \rceil, \qquad s_t = \left\lfloor \frac{B}{a^t} \right\rfloor$$

$$\mathcal{P}_t = \left\{\, k \in \{1,\dots,B\} \setminus \mathcal{U}_{t-1} : (k-1) \bmod s_t = 0 \,\right\}$$

where $\mathcal{U}_{t-1}$ is the set of positions already unmasked prior to step $t$.
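
Because the schedule depends only on $B$ and $a$, it can be precomputed. A minimal Python sketch following the definitions above (the stride floor of 1 is a small guard for blocks whose size is not an exact power of $a$, an implementation detail not spelled out here):

```python
def dus_schedule(B: int, a: int = 2) -> list[list[int]]:
    """Dilated unmasking groups for a block of B masked tokens (1-indexed).

    At iteration t (1 <= t <= R = ceil(log_a B)) the stride is
    s_t = floor(B / a**t), and every not-yet-unmasked position k with
    (k - 1) % s_t == 0 is revealed in parallel.
    """
    R = 0
    while a ** R < B:              # R = ceil(log_a B), computed without floats
        R += 1
    unmasked: set[int] = set()
    groups: list[list[int]] = []
    for t in range(1, R + 1):
        s_t = max(B // a ** t, 1)  # guard: stride of at least 1 on the last step
        group = [k for k in range(1, B + 1)
                 if k not in unmasked and (k - 1) % s_t == 0]
        unmasked.update(group)
        groups.append(group)
    return groups

# Reproduces the table above for B = 8, a = 2:
print(dus_schedule(8))   # [[1, 5], [3, 7], [2, 4, 6, 8]]
```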

3. Theoretical Underpinnings

DUS is predicated on the first-order Markov assumption with fast mixing, modeling the sequence $\mathcal{X} = \{X_1, \dots, X_K\}$ such that tokens far apart are nearly conditionally independent given the current unmasked context. This enables the following:

  • Negligible Mutual Information (MI): Dilation ensures the conditional mutual information $I(X_{i_m}; X_{i_n})$ between unmasked pairs $(i_m, i_n)$ is upper-bounded and decays exponentially with spacing $d = |i_m - i_n|$:

$$I(X_{i_m}; X_{i_n}) \leq -\tfrac{1}{2}\log\left(1 - \rho^{2d}\right) \leq \delta_d < \varepsilon$$

where $\rho$ is the maximal correlation coefficient.

  • Efficient Entropy Reduction: Rather than minimizing the sum of individual token entropies (as in confidence-based planners), DUS selects groups whose joint conditional entropy approximately factorizes,

$$H(X_{i_1}, \dots, X_{i_k} \mid \mathcal{S}_t) \approx \sum_{j=1}^{k} H(X_{i_j} \mid \mathcal{S}_t),$$

owing to the negligible MI within each group, where $\mathcal{S}_t$ denotes the current set of unmasked tokens.

By contrast, confidence-based planners may cluster unmasking positions, leading to inflated joint entropy and increased error propagation.
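
For intuition about how quickly the MI bound above shrinks with spacing, a small numerical check is sketched below; the correlation value $\rho = 0.9$ is purely illustrative, not a figure from the paper:

```python
import math

rho = 0.9  # illustrative maximal correlation coefficient; not a value from the paper
for d in (1, 2, 4, 8):
    bound = -0.5 * math.log(1 - rho ** (2 * d))  # -1/2 * log(1 - rho^(2d)), in nats
    print(f"spacing d = {d}: MI bound <= {bound:.3f} nats")
# Prints roughly 0.830, 0.534, 0.281, 0.102: the bound falls off quickly as spacing grows.
```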

4. Computational Complexity and Comparative Analysis

DUS exhibits exponential efficiency gains compared to prior approaches:

| Method | Denoiser calls (per block of size $B$) | Planner model / extra training | Scheduling principle |
|---|---|---|---|
| DUS | $\mathcal{O}(\log B)$ | No | Deterministic, dilation-based |
| LLADA, Dream | $\mathcal{O}(B)$ | No | Blockwise / planner-based |
| Confidence/Entropy | $\mathcal{O}(B)$ | No | Heuristic (per-token) |

For a target generation length $G$, DUS reduces the total NFE (number of model calls) to

$$\text{NFE}_{\text{total}} = \frac{G}{B}\log_2 B$$

This deep reduction enables practical exploitation of MDLM parallelism, especially in long-sequence and high-throughput inference settings.
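
As a concrete illustration, generating $G = 256$ tokens with block size $B = 16$ costs $\frac{256}{16}\log_2 16 = 16 \times 4 = 64$ denoiser calls, versus 256 calls for a schedule that reveals one token per step, consistent with the $4.0\times$ speedup listed for $B = 16$ in the table in Section 5.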

5. Empirical Evaluation

Experiments on GSM8K (grade-school math), HumanEval (Python code generation), and MBPP (Python programming problems) benchmarks demonstrate the effectiveness of DUS across multiple LLMs (LLADA-Base-8B, LLADA-Instruct-8B, Dream-Instruct-7B):

  • Accuracy and Speed: DUS consistently delivers higher accuracy at every tested inference speedup compared to confidence-based planners. In HumanEval and MBPP, up to 27 percentage points in score are recovered at high parallelism without retraining or modifying the denoiser.
  • Robust Generalization: Even as the number of denoiser calls per generated token decreases (i.e., as $B$ increases), DUS retains higher task performance. For $B = 8$ on GSM8K, DUS achieves 63.08% vs. 59.29% for the self-confidence baseline.

| Block size $B$ | Self-conf. score (%) | DUS score (%) | Avg. NFE per block | Inference speedup |
|---|---|---|---|---|
| 8 | 59.29 | 63.08 | 3 | 2.7× |
| 16 | 51.23 | 59.51 | 4 | 4.0× |
| 32 | 29.04 | 49.36 | 5 | 6.4× |
| 64 | 8.04 | 35.18 | 6 | 10.7× |

Graphical results (see the paper’s Figure 1) consistently place DUS above baseline planners in quality/speed space.

6. Implications for Non-Ordinal Generation and Broader Applications

DUS is particularly suited for non-ordinal tasks—domains where left-to-right token generation is suboptimal and long-range dependencies dominate, as in code completion and mathematical reasoning. The ability to parallelize unmasking without sacrificing generation quality directly addresses the "information bottleneck" imposed by AR and heuristic planners, enabling richer context integration.

DUS is "budget-aware": it adapts to arbitrary block sizes and numbers of function evaluations (NFE), making it suitable for variable resource constraints and long-sequence applications. DUS is agnostic to underlying model architectures and training setups, serving as a drop-in replacement for any block-based MDLM schedule (including LLADA, Dream, and others).

7. Summary and Prospective Directions

DUS delivers an efficient, theoretically motivated inference schedule for MDLM text generation, achieving:

  • Exponentially reduced computation: $\mathcal{O}(B) \to \mathcal{O}(\log B)$ denoiser calls per block.
  • Superior or matched output quality versus state-of-the-art non-AR approaches, validated on code and math benchmarks.
  • Broad applicability: No model retraining or architectural change required.

A plausible implication is that as sequence lengths and the complexity of generation tasks increase, DUS may represent a critical advance for scaling masked diffusion inference. Continued analysis of its theoretical guarantees under alternative sequence distributions, and potential integration with domain-specific prior knowledge, remain areas for further research.
