
Homogenization of distant [MASK] prediction distributions

Determine whether, in mask diffusion language models, the model's predicted marginal distributions over the token vocabulary at [MASK] positions become almost identical once those positions are sufficiently far from any observed (unmasked) token, in the limit of an infinite sequence of [MASK] tokens, and whether, for a fixed sequence length, such near-identical distributions appear at the middle positions of the sequence.


Background

The paper analyzes why mask diffusion LLMs output marginal (rather than joint) distributions for [MASK] positions and shows that these marginals tend to become smoother as the distance from observed (unmasked) tokens increases. Empirically, the authors observe that distant [MASK] positions often exhibit homogenized predictions dominated by high-frequency function words or end-of-text tokens.
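Below is a minimal probe sketch of this observation, assuming a generic mask diffusion model whose forward pass returns per-position logits over the vocabulary; the `model_logits` stand-in, the token ids, and all constants are hypothetical placeholders rather than details from the paper. The idea is to mask everything past a short observed prefix and track the total variation distance between each masked position's predicted marginal and a reference marginal far from the prefix.

```python
import numpy as np

MASK_ID = 0          # hypothetical id of the [MASK] token
VOCAB_SIZE = 1000    # hypothetical vocabulary size (kept small for the toy run)
SEQ_LEN = 256        # fixed sequence length used for the probe


def model_logits(token_ids: np.ndarray) -> np.ndarray:
    """Stand-in for the mask diffusion model's forward pass.

    Replace this with the real model; it should return per-position logits of
    shape (len(token_ids), VOCAB_SIZE). Random logits are used here only so the
    sketch runs end to end.
    """
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(token_ids), VOCAB_SIZE))


def softmax(z: np.ndarray, axis: int = -1) -> np.ndarray:
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def total_variation(p: np.ndarray, q: np.ndarray) -> float:
    return 0.5 * float(np.abs(p - q).sum())


def homogenization_profile(prompt_ids: np.ndarray) -> list[float]:
    """Keep a short observed prefix, mask everything else, and measure how the
    predicted marginal at each masked position differs from the marginal at the
    last (most distant) masked position."""
    seq = np.full(SEQ_LEN, MASK_ID, dtype=np.int64)
    seq[: len(prompt_ids)] = prompt_ids          # observed (unmasked) prefix
    probs = softmax(model_logits(seq))           # (SEQ_LEN, VOCAB_SIZE) marginals
    reference = probs[-1]                        # marginal farthest from the prefix
    return [total_variation(probs[i], reference) for i in range(len(prompt_ids), SEQ_LEN)]


if __name__ == "__main__":
    prompt = np.array([5, 17, 42, 8])            # hypothetical observed token ids
    profile = homogenization_profile(prompt)
    print(profile[:5], "...", profile[-5:])
```

If distant marginals indeed homogenize, the profile should decay toward zero with distance; with a real model one would also expect the far positions to concentrate on high-frequency function words or end-of-text tokens, as the authors report.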

Motivated by these observations, the authors explicitly state a conjecture that, at sufficiently large distances (with infinitely many [MASK] tokens), the predictive distributions become almost identical, and for fixed-length sequences, similar near-identical behavior emerges in the middle of the sequence. This conjecture formalizes a hypothesized limiting behavior of mask diffusion that would further constrain effective parallel generation and the use of bidirectional context.
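A hedged way to write the conjectured limit (notation ours, not the paper's): let $p_\theta(\cdot \mid \mathbf{x}_O, i)$ denote the model's predicted marginal over the vocabulary at masked position $i$ given the observed tokens $\mathbf{x}_O$, and let $d(i)$ be the distance from $i$ to the nearest observed token.

```latex
% Hedged formalization; p_\theta, x_O, and d(i) are our notation, not the paper's.
% Infinite-mask case: sufficiently distant marginals become almost identical.
\lim_{\min(d(i),\, d(j)) \to \infty}
  d_{\mathrm{TV}}\!\big( p_\theta(\cdot \mid \mathbf{x}_O, i),\; p_\theta(\cdot \mid \mathbf{x}_O, j) \big) = 0
% Fixed-length case: for sequence length L, the same total-variation distance is
% conjectured to be small when i and j both lie near the middle of the sequence.
```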

References

Conjecture. At sufficiently large distances with an infinite length of [MASK] tokens, the distributions become almost identical. With a fixed given length, this near-identical behavior appears in the middle parts of the sequence.

Why mask diffusion does not work (arXiv:2510.03289, Sun et al., 29 Sep 2025), Conjecture: Homogenization of Distant Mask Predictions, in Section 3.2.3 (Marginal Distributions as a Function of Distance), following the table.