Infinite Mask Diffusion for Few-Step Distillation

Published 11 May 2026 in cs.CL and cs.AI | (2605.10518v1)

Abstract: Masked Diffusion Models (MDMs) have emerged as a promising alternative to autoregressive models in language modeling, offering the advantages of parallel decoding and bidirectional context processing within a simple yet effective framework. Specifically, their explicit distinction between masked tokens and data underlies their simple framework and effective conditional generation. However, MDMs typically require many sampling iterations due to factorization errors stemming from simultaneous token updates. We observe that a theoretical lower bound of the factorization error exists, which standard MDMs cannot reduce due to their use of a deterministic single-state mask. In this paper, we propose the Infinite Mask Diffusion Model (IMDM), which introduces a stochastic infinite-state mask to mitigate the theoretical bound while directly inheriting the benefits of MDMs, including the compatibility with pre-trained weights. We empirically demonstrate that MDM fails to perform few-step generation even in a simple synthetic task due to the factorization error bound, whereas IMDM can find an efficient solution for the same task. Finally, when equipped with appropriate distillation methods, IMDM surpasses existing few-step distillation methods at small step counts on LM1B and OpenWebText. Code is available at https://Ugness.github.io/official_imdm.

Authors (4)

Summary

The paper introduces IMDM, a novel approach that replaces deterministic masks with an infinite set of stochastic masks to remove structural factorization errors.
It leverages a partition-and-map mechanism, ensuring near-zero error in few-step generation by accurately modeling token dependencies.
Empirical results on synthetic tasks and on the LM1B and OpenWebText benchmarks indicate that IMDM outperforms existing MDM distillation methods in generative quality at small step counts.

Infinite Mask Diffusion for Few-Step Distillation: Analysis and Implications

Introduction

The paper "Infinite Mask Diffusion for Few-Step Distillation" (2605.10518) examines a structural bottleneck of Masked Diffusion Models (MDMs) in language generation: the irreducible factorization error that comes from using a single deterministic mask token. By introducing the Infinite Mask Diffusion Model (IMDM), the work aims to eliminate this source of error, enabling efficient few-step generation and making diffusion-based models with parallel, bidirectional decoding more practical for large-scale language modeling.

Theoretical Foundation

MDMs support efficient, parallel, bidirectional decoding by using a mask token that unambiguously distinguishes masked positions from data tokens. However, because every masked position carries the same deterministic mask token, tokens unmasked in the same step are each sampled from their own conditional distribution, independently of one another given the current context. Simultaneous prediction of correlated tokens therefore incurs a persistent factorization error. This error is formally lower-bounded by the conditional mutual information between simultaneously unmasked token pairs, irrespective of the distillation or model optimization strategy. As a result, even a model that is optimal with respect to the ELBO or other training objectives cannot match the joint data distribution in the few-step (large-stride) regime.

The paper rigorously formulates this bottleneck and provides a lower bound on the conditional total correlation for any MDM, formalized in Theorem 4.1. The bound is shown to grow with step size and token dependency, becoming especially problematic in domains like natural language where token correlations are strong and broad in context.
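
To make the bound concrete, the following minimal sketch computes the factorization floor for a two-token toy distribution supported on {00, 11}. It relies only on a standard fact: among all factorized distributions q, KL(p || q) is minimized by the product of the marginals of p, and the minimum equals the total correlation (for two tokens, the mutual information). The toy distribution and all function names are illustrative assumptions, not taken from the paper's code.

```python
import math

# Toy data distribution over two binary tokens: only "00" and "11" occur.
joint = {(0, 0): 0.5, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 0.5}

def marginal(joint, position):
    """Marginal distribution of the token at `position`."""
    probs = {0: 0.0, 1: 0.0}
    for tokens, p in joint.items():
        probs[tokens[position]] += p
    return probs

def total_correlation(joint):
    """KL(joint || product of marginals), in nats."""
    m0, m1 = marginal(joint, 0), marginal(joint, 1)
    tc = 0.0
    for (a, b), p in joint.items():
        if p > 0.0:  # zero-probability outcomes contribute nothing
            tc += p * math.log(p / (m0[a] * m1[b]))
    return tc

def valid_mass_under_product(joint):
    """Probability that independent per-token sampling lands in the data support."""
    m0, m1 = marginal(joint, 0), marginal(joint, 1)
    return sum(m0[a] * m1[b] for (a, b), p in joint.items() if p > 0.0)

floor_nats = total_correlation(joint)
valid_mass = valid_mass_under_product(joint)
print(f"factorization floor: {floor_nats:.4f} nats")
print(f"mass on valid sequences: {valid_mass:.2f}")
```

For this distribution the floor is log 2 nats, and a single-step factorized sampler places only half of its probability mass on valid sequences, however well it is trained.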

Infinite Mask Diffusion Model: Formulation and Properties

IMDM generalizes MDMs by replacing the unitary deterministic mask with an infinite set of distinguishable stochastic mask tokens, sampled via injection of continuous noise and linear layers. This preserves the disjoint property between mask and data tokens—a key enabler for efficient conditional generation and transfer learning from pre-trained MDM weights—but also admits a partition-and-map mechanism across the infinite mask embedding space. In the infinite mask limit, this allows IMDM to construct deterministic maps from distinct mask noise realizations to particular joint outcomes over multiple masked tokens, effectively modeling the complete conditional distribution and removing structural factorization error.
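
As a rough illustration of "continuous noise and linear layers", the sketch below builds one stochastic mask embedding per masked position by adding a linearly projected noise vector to a base mask embedding. The additive form, the Gaussian noise, the `scale` parameter, and every name here are assumptions made for illustration; the summary does not specify the paper's exact parameterization.

```python
import numpy as np

def sample_mask_embeddings(num_masked, base_mask, proj, rng, scale=1.0):
    """Return one stochastic mask embedding per masked position.

    base_mask: (d_model,) embedding of a single [MASK] token (hypothetical).
    proj:      (d_noise, d_model) weights of a linear layer applied to the noise.
    """
    noise = rng.standard_normal((num_masked, proj.shape[0]))
    # Each row is a distinct point in the continuous mask embedding space.
    return base_mask + scale * noise @ proj

rng = np.random.default_rng(0)
d_model, d_noise = 16, 8
base_mask = rng.standard_normal(d_model)
proj = rng.standard_normal((d_noise, d_model)) / np.sqrt(d_noise)

masks = sample_mask_embeddings(4, base_mask, proj, rng)
# With scale=0 every position gets the same deterministic mask embedding.
collapsed = sample_mask_embeddings(4, base_mask, proj, rng, scale=0.0)
print(masks.shape, np.allclose(collapsed, base_mask))
```

In this sketch, setting `scale=0.0` recovers the single-state mask, which hints at why such a design could start from pre-trained MDM weights; whether the paper uses this mechanism is not stated in the summary.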

Concretely, the forward process for IMDM is defined as a uniform discrete diffusion over an augmented token space, with both data tokens and a countably infinite mask category. The model thereby inherits the advantages of MDMs, including compatibility with pre-trained weights, trivial separation of masked and observed positions, and highly parallel, context-rich decoding.

Theoretically, Theorem 4.2 proves the existence of parameters for IMDM that can achieve zero factorization error for any sequence length and step configuration, contingent on the model's capacity to partition the mask space at sufficient resolution.
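
The partition-and-map idea can be sketched on the same {00, 11} toy task. The example below is a hand-built construction under stated assumptions, not the paper's model: it assumes both positions can read the noise attached to the first mask, partitions that noise into two cells, and maps each cell deterministically to a joint outcome. Given the noise, each token's conditional is a point mass, so predicting the tokens independently loses nothing; averaging over the noise recovers the data distribution.

```python
import numpy as np

DATA_SUPPORT = {(0, 0), (1, 1)}

def factorized_mdm_step(rng):
    """Single-state mask: each token is drawn independently from its marginal."""
    return tuple(int(t) for t in rng.integers(0, 2, size=2))

def partition_and_map_step(rng):
    """Stochastic masks: a deterministic map from mask noise to a joint outcome.

    Both positions read the noise at position 0, so both see the same
    partition cell (noise < 0 -> "00", noise >= 0 -> "11").
    """
    mask_noise = rng.standard_normal(2)  # one noise draw per masked position
    cell = int(mask_noise[0] >= 0.0)
    return (cell, cell)

def valid_fraction(step_fn, num_samples, seed=0):
    """Fraction of one-step samples that fall inside the data support."""
    rng = np.random.default_rng(seed)
    samples = [step_fn(rng) for _ in range(num_samples)]
    valid = sum(s in DATA_SUPPORT for s in samples) / num_samples
    return valid, samples

mdm_valid, _ = valid_fraction(factorized_mdm_step, 10_000)
imdm_valid, imdm_samples = valid_fraction(partition_and_map_step, 10_000)
frac_ones = sum(s == (1, 1) for s in imdm_samples) / len(imdm_samples)
print(f"factorized one-step validity: {mdm_valid:.3f}")
print(f"partition-and-map validity:   {imdm_valid:.3f}")
print(f"share of '11' samples:        {frac_ones:.3f}")
```

The partition-and-map sampler is valid on every draw by construction, while the factorized sampler is valid only about half the time. A trained IMDM would have to learn such a partition rather than have it written by hand, which is the capacity condition the theorem refers to.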

Empirical Validation

Synthetic Tasks: On synthetic datasets engineered to expose maximal dependence between token pairs (e.g., {00,11}), MDMs, even with extensive distillation, saturate at the predicted factorization error floor and fail to generate valid data in the few-step regime. In contrast, IMDM achieves near-perfect sample validity and drives the measured factorization error close to zero, consistent with the theoretical analysis.

Standard Benchmarks: Experiments on LM1B and OpenWebText validate the practical benefits of IMDM in large-scale language modeling. For both unconditional generation and conditional tasks (span infilling, continuation), IMDM surpasses state-of-the-art MDM distillation methods—including SDTT, ReDi, and Di4C—by a substantial margin in generative perplexity and MAUVE, particularly as the number of sampling steps is reduced below 8. As step count increases, the gap narrows (as predicted), confirming that IMDM's core advantage is in regimes where MDMs' structural bottleneck binds.

Noise Ablation: The analysis of injected mask noise shows that IMDM is robust to the choice of noise distribution and scale, and that performance depends mainly on the mask embeddings being high-dimensional and distinct enough to approximate an infinite mask set.

Compatibility and Scaling: IMDM is shown to be structurally compatible with MDM checkpoints, enabling efficient distillation or finetuning from large pre-trained models. Scaling to larger architectures (up to 860M parameters) yields consistent trends in favor of IMDM.

Practical and Theoretical Implications

IMDM directly addresses a major limitation of masked discrete diffusion for generative modeling: the inability to generate coherent outputs in few steps when strong token dependencies are present. By enabling partition-and-map mechanisms via stochastic masking, the IMDM framework keeps the bidirectional context modeling and parallel decoding of MDMs while avoiding both the sequential decoding of autoregressive (AR) models and the factorization error floor of low-step MDM sampling. This supports high-throughput sequence generation and rapid conditional sampling, with empirical validation across language modeling, conditional generation, and infilling tasks.

At the theoretical level, the work provides a detailed information-theoretic analysis of the mask design space, drawing precise connections between mask structure, factorization error, and achievable expressivity under the discrete diffusion paradigm. The extension provides a practical path for future models to simultaneously leverage mask-based objectives, pre-trained knowledge, and scalable, parallel sampling regimes.

Future Directions

Immediate research avenues include:

Improved Partitioning Algorithms: Though IMDM admits a zero-error construction in principle, optimizing practical training procedures to achieve this mapping in complex, high-dimensional sequence data remains challenging.
Broader Generative Domains: Extension to multimodal, non-textual, or highly structured output spaces where mask expressivity and stochasticity may be even more critical.
Integration with Advanced Distillation Schemes: Coupling IMDM with emerging teacher-student, energy-based, or optimal-transport distillation procedures to further minimize practical sample complexity and inference latency.
Compression and Memory: Exploring the trade-offs between noise dimensionality in mask embeddings and hardware/memory requirements, especially as models scale.

Conclusion

This work presents IMDM as a well-founded, more expressive alternative to traditional MDMs for few-step discrete diffusion. It characterizes a core source of generation error, shows how stochastic masks can remove it in principle, and maintains the practical advantages of the standard masked diffusion formulation. The theoretical analysis and the empirical results on synthetic tasks, LM1B, and OpenWebText make IMDM a strong candidate for scalable generative modeling in sequence domains characterized by strong token dependencies (2605.10518).
