Absorbing State Discrete Diffusion Models

Updated 7 November 2025
  • Absorbing state discrete diffusion models are generative models that use a forward Markov process to stochastically mask data with an irreversible [MASK] state.
  • They enable efficient, parallel generation through a reverse denoising procedure that progressively reconstructs original data from masked positions.
  • Their design supports controlled generation in language, vision, and multimodal tasks while addressing challenges like non-revisability and sparse gradient updates.

Absorbing state discrete diffusion models are a class of generative models for discrete data (such as language, vision, and multimodal signals) that leverage a forward Markov process designed to stochastically corrupt data by replacing variables with a distinguished absorbing state, typically referred to as a “[MASK]” token or equivalent. Once a variable is absorbed, it remains in the absorbing state for all subsequent forward steps. The generative (reverse) process attempts to reconstruct the original data in a denoising fashion, allowing for parallel, controlled, and highly efficient generation. This approach has become central to large-scale discrete diffusion modeling in language, vision-language, and related domains.

1. Mathematical Foundations and Absorbing State Construction

Absorbing state discrete diffusion models are defined by a forward (noising) process and a reverse (denoising) process, both typically structured as Markov chains on the discrete state space.

Forward (Corruption) Process.

Given an initial data sequence $x_0$ (tokens, pixels), the forward process applies a sequence of categorical transition matrices $Q_t$ that iteratively replace variables with the absorbing state:

$$q(x_{1:T} \mid x_0) = \prod_{t=1}^T q(x_t \mid x_{t-1}),$$

where for each variable,

$$q(x_t \mid x_0) = \mathrm{Cat}\big(x_t;\ \alpha_t x_0 + (1-\alpha_t)\, m\big),$$

where $m$ is the one-hot vector for the absorbing state ([MASK]). The schedule $\alpha_t$ is decreasing in $t$ and is given by the cumulative product of per-step keep probabilities, $\alpha_t = \prod_{s=1}^{t} (1-\beta_s)$.

The corresponding absorbing transition matrix for a $(K+1)$-state (mask-augmented) variable is:

$$(Q_t^{\mathrm{absorb}})_{ij} = \begin{cases} 1-\beta_t & \text{if } i = j \neq m \\ \beta_t & \text{if } i \neq m,\ j = m \\ 1 & \text{if } i = j = m \\ 0 & \text{otherwise} \end{cases}$$
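As a concrete illustration, the following NumPy sketch samples from the marginal $q(x_t \mid x_0)$ above under an assumed linear schedule; the names `forward_mask`, `alpha`, and the sentinel `MASK_ID` are illustrative choices, not taken from any particular implementation.

```python
import numpy as np

MASK_ID = -1  # illustrative integer id for the absorbing [MASK] state

def alpha(t, T):
    # Assumed linear schedule: probability that a token is still unmasked at step t.
    return 1.0 - t / T

def forward_mask(x0, t, T, rng=None):
    """Draw x_t ~ q(x_t | x_0) = Cat(x_t; alpha_t * x_0 + (1 - alpha_t) * m).

    Each position independently keeps its original token with probability alpha_t
    and is replaced by [MASK] otherwise; this marginal is what training corruptions
    are sampled from.
    """
    rng = np.random.default_rng() if rng is None else rng
    keep = rng.random(x0.shape) < alpha(t, T)
    return np.where(keep, x0, MASK_ID)

# Example: a toy 6-token sequence corrupted at t = 6 of T = 10 (alpha_t = 0.4).
x0 = np.array([5, 2, 9, 9, 3, 7])
print(forward_mask(x0, t=6, T=10, rng=np.random.default_rng(0)))
```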

Reverse (Denoising) Process.

Generative modeling inverts the forward process by parameterizing $p_\theta(x_{0:T})$. The denoising process, typically parameterized by a neural network, reconstructs the original symbols at masked positions:

$$p_\theta(x_{0:T}) = p(x_T)\prod_{t=1}^T p_\theta(x_{t-1} \mid x_t).$$

The posterior for masked diffusion is

$$q(x_s \mid x_t, x_0) = \begin{cases} \mathrm{Cat}(x_s;\ x_t), & x_t \neq m \\ \mathrm{Cat}\!\left(x_s;\ \dfrac{(1-\alpha_s)\, m + (\alpha_s - \alpha_t)\, x_0}{1-\alpha_t}\right), & x_t = m \end{cases}$$

for an earlier time step $s < t$ (so $\alpha_s > \alpha_t$).

The absorbing state is critical: once a variable is masked, it remains so, providing computational and theoretical tractability in both process definition and reverse modeling.
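A minimal sketch of the two-case posterior above for a single position, assuming one-hot encodings and $s < t$ (so $\alpha_s > \alpha_t$); the function name and toy vocabulary are illustrative.

```python
import numpy as np

def reverse_posterior(xt_is_masked, x0_onehot, mask_onehot, alpha_s, alpha_t):
    """q(x_s | x_t, x_0) for one position, with s < t (so alpha_s > alpha_t).

    If x_t is already a real token, the forward process never altered it, so the
    posterior is a point mass on that token (equal to x_0 here). If x_t is [MASK],
    mass is split between staying masked and reverting to the clean token x_0.
    """
    if not xt_is_masked:
        return x0_onehot  # Cat(x_s; x_t), and x_t = x_0 at unmasked positions
    return ((1.0 - alpha_s) * mask_onehot + (alpha_s - alpha_t) * x0_onehot) / (1.0 - alpha_t)

# Example with a 3-word vocabulary plus [MASK] as the last index.
x0_onehot = np.array([0.0, 1.0, 0.0, 0.0])
mask_onehot = np.array([0.0, 0.0, 0.0, 1.0])
print(reverse_posterior(True, x0_onehot, mask_onehot, alpha_s=0.6, alpha_t=0.4))
# -> [0.  0.333  0.  0.667]: 1/3 chance of reverting to x_0, 2/3 chance of staying masked
```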

2. Training Regimes and Loss Formulation

Training absorbing state discrete diffusion models typically involves minimizing a variational lower bound, which simplifies under this setup to a weighted cross-entropy over masked positions:

$$\mathcal{L} = \sum_{t=2}^T \mathbb{E}_{x_0, x_{1:T}}\left[ -\frac{\alpha_{t-1}-\alpha_t}{1-\alpha_t} \sum_{n=1}^N \delta_m(x_{t,n})\, x_{0,n}\log[f_\theta(x_t)]_n \right],$$

with $\delta_m(x_{t,n}) = 1$ if the $n$-th token is [MASK], 0 otherwise.

This loss unifies discrete diffusion training with masked language modeling (MLM): only masked positions at each sampled time step contribute to the objective, and the masking schedule $(\beta_t)$ controls corruption severity. For state-dependent schedules, the objective extends naturally, allowing different masking rates per symbol.

Because the forward process leaves unmasked tokens unchanged, supervision is restricted to masked positions, resulting in sparse gradient updates; this nonetheless permits efficient parallelization and large-batch training.
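The sketch below evaluates one Monte Carlo term of this objective for a single sampled time step in NumPy; the shapes, the helper name `masked_elbo_term`, and the toy inputs are assumptions made for illustration rather than a reference implementation.

```python
import numpy as np

def masked_elbo_term(logits, x0_ids, xt_is_mask, alpha_prev, alpha_t):
    """One Monte Carlo term of the loss: weighted cross-entropy over masked positions.

    logits:      (N, K) model outputs f_theta(x_t) over the clean vocabulary
    x0_ids:      (N,)   clean token ids
    xt_is_mask:  (N,)   bool, True where x_t is the absorbing state (delta_m)
    The weight (alpha_{t-1} - alpha_t) / (1 - alpha_t) comes from the variational bound.
    """
    # numerically stable log-softmax over the vocabulary axis
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    nll = -log_probs[np.arange(len(x0_ids)), x0_ids]   # per-token cross-entropy against x_0
    weight = (alpha_prev - alpha_t) / (1.0 - alpha_t)
    return weight * np.sum(nll * xt_is_mask)            # only masked positions contribute

# Toy usage: 4 positions, vocabulary of size 5, two positions currently masked.
rng = np.random.default_rng(0)
loss = masked_elbo_term(rng.normal(size=(4, 5)),
                        np.array([1, 3, 0, 2]),
                        np.array([True, False, True, False]),
                        alpha_prev=0.5, alpha_t=0.3)
print(loss)
```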

3. Inference, Parallel Generation, and Decoding Control

Denoising as Iterative Unmasking.

During inference, tokens at masked positions are iteratively predicted and, once restored, are fixed—reflecting the absorbing nature of the process. This enables high parallelism: all unfilled positions can be updated in a single denoising pass per step.

Controllability and Bidirectionality.

The process supports arbitrary mask patterns, so infilling, controlled generation lengths, and constraint-aware decoding arise naturally. For example, providing user-specified tokens as observed context and keeping only the remaining positions masked yields infilling or conditional generation over arbitrary subsets.

Tradeoff: Non-Revisability.

A limitation is non-revisability: once a variable is unmasked (restored by the model), it cannot be further revised by downstream steps. This is a direct consequence of the absorbing transition structure and ensures process irreversibility. Non-revisability can freeze early errors, limiting self-correction—a central point for research into hybrid and remasking strategies.

Parallel Decoding and Acceleration.

Because all masked variables are predicted in parallel from the available unmasked context at each step, inference is accelerated by up to $10\times$ over autoregressive models, which can only predict or infill tokens sequentially.
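The following schematic sampler illustrates iterative unmasking, parallel prediction, and infilling in one place; `model` stands in for any network returning per-position probabilities over the clean vocabulary, and the confidence-ranked commit rule is just one common heuristic, not a prescribed algorithm.

```python
import numpy as np

MASK_ID = -1  # illustrative absorbing-state id

def unmask_sample(model, length, steps, prompt=None):
    """Parallel iterative-unmasking sampler sketch.

    `model(x)` is assumed to return a (length, K) array of probabilities over clean
    tokens for every position. At each step a fraction of the remaining masked
    positions is committed; committed tokens are never revised (non-revisability).
    Passing a partially specified `prompt` (with MASK_ID holes) performs infilling.
    """
    x = np.full(length, MASK_ID) if prompt is None else prompt.copy()
    for step in range(steps):
        masked = np.flatnonzero(x == MASK_ID)
        if masked.size == 0:
            break
        probs = model(x)                                  # one parallel forward pass
        n_commit = int(np.ceil(masked.size / (steps - step)))
        confidence = probs[masked].max(axis=-1)           # commit most confident positions first
        chosen = masked[np.argsort(-confidence)[:n_commit]]
        x[chosen] = probs[chosen].argmax(axis=-1)         # greedy decode; sampling also works
    return x

# Toy usage with a uniform "model" over a 5-token vocabulary.
toy_model = lambda x: np.full((len(x), 5), 0.2)
print(unmask_sample(toy_model, length=8, steps=4))
```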

4. Model Variants, Extensions, and Generalizations

Absorbing state models have motivated numerous extensions:

  • Hybrid transitions: These linearly combine absorbing and uniform matrices; tokens can occasionally be re-absorbed, supporting self-correction at the expense of reversibility (see the sketch below).
  • Remasking / Test-Time Refinement: Some inference procedures periodically re-mask already predicted tokens, letting the model refine outputs even after initial unmasking.
  • Continuous-Time and Concrete Score Frameworks: By moving to continuous time via CTMCs, the absorbing model unifies with score-based formulations and can be parameterized in terms of time-dependent or time-independent conditional probabilities.
  • Block Diffusion: Denoising operates on segments (blocks) of the sequence, supporting hierarchical or structured generation and fine-grained controllability.

These variants address specific limitations (notably non-revisability and sample diversity), while often building on the same absorbing state foundation for computational stability.
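As a sketch of the hybrid-transition idea from the first bullet above, the snippet below builds a convex combination of the absorbing and uniform kernels for a toy mask-augmented vocabulary; the mixing weight `lam` and the convention that the uniform kernel ranges over all $K+1$ states are assumptions for illustration.

```python
import numpy as np

def absorbing_matrix(K, beta, mask_idx):
    """(K+1)x(K+1) absorbing kernel: keep with prob 1-beta, else jump to [MASK]."""
    Q = np.eye(K + 1) * (1.0 - beta)
    Q[:, mask_idx] += beta
    Q[mask_idx] = 0.0
    Q[mask_idx, mask_idx] = 1.0   # [MASK] is absorbing: it never transitions away
    return Q

def uniform_matrix(K, beta):
    """(K+1)x(K+1) uniform kernel: keep with prob 1-beta, else resample uniformly
    (one common convention; some formulations exclude the mask state)."""
    return np.eye(K + 1) * (1.0 - beta) + beta / (K + 1)

def hybrid_matrix(K, beta, mask_idx, lam):
    """Convex combination of the two kernels; rows remain valid distributions."""
    return lam * absorbing_matrix(K, beta, mask_idx) + (1.0 - lam) * uniform_matrix(K, beta)

Q = hybrid_matrix(K=4, beta=0.1, mask_idx=4, lam=0.7)
assert np.allclose(Q.sum(axis=1), 1.0)  # each row sums to one
```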

5. Theoretical Properties and Convergence

Theoretical analyses of absorbing state discrete diffusion models emphasize their unique Markov structure:

  • Convergence Rates: The forward process rapidly concentrates on the absorbing singleton, but special care is needed: the stationary distribution is a delta over [MASK], making the forward KL divergence ill-defined. Surrogate initialization distributions, as introduced in recent work, control forward process convergence and enable rigorous error analysis (Liang et al., 2 Jun 2025).
  • Sharper Bounds: Under absorbing matrices, KL error, sampling step complexity, and score function behavior can be bounded more tightly than in uniform-rate models, supporting improved theoretical and practical convergence rates.
  • No Early Stopping Required: Under mild regularity—namely, if the masked state occurs with minimum probability in the prior—absorbing models can be sampled to arbitrary accuracy without early stopping, unlike uniform-rate models.

Key finite-time error results show that, for absorbing rate matrices, sampling complexity for KL error $\epsilon$ scales nearly linearly in model dimension and logarithmically in the error threshold, outperforming uniform-rate diffusion models. Recent analyses employ Jensen-type inequalities and process-specific score bounds for tight control of error propagation (Liang et al., 2 Jun 2025).

6. Applications and Impact

Absorbing state discrete diffusion models form the backbone of contemporary large-scale non-autoregressive generation in language, vision, and multimodal domains:

  • Language Modeling: Multimodal and LLM-scale models employing absorbing state diffusion achieve competitive or superior accuracy to autoregressive analogues on benchmarks such as OpenWebText, WikiText, and billion-word corpora, with much faster inference (Shi et al., 6 Jun 2024, Yu et al., 16 Jun 2025).
  • Vision and Multimodal Generation: Vector-quantized image models and vision-language architectures utilize absorbing state masking for high-resolution, globally coherent image synthesis with parallel decoding (Bond-Taylor et al., 2021).
  • Fine-grained Control: The masking structure enables control over sample length, infilling, and local constraint satisfaction, which is difficult to achieve with AR architectures.

Limitations and Active Research Topics:

  • Lower corpus utilization and length bias: Only masked tokens contribute to gradients, which can lead to uneven training.
  • Non-revisability: Inflexibility in revising outputs is being addressed via hybrid or remasking strategies.
  • Training-inference divergence: The difference between teacher-forced training and self-sampled inference can introduce subtle distributional gaps.

7. Summary of Key Equations and Formal Properties

| Process | Equation |
| --- | --- |
| Forward (corruption) | $q(x_t \mid x_0) = \mathrm{Cat}(x_t;\ \alpha_t x_0 + (1-\alpha_t)\, m)$ |
| Absorbing matrix | See the explicit $Q_t^{\mathrm{absorb}}$ in Section 1 |
| Reverse (posterior) | See the case distinction in Section 1 above |
| Loss | $\mathcal{L} = \sum_{t=2}^T \mathbb{E}_{x_0,x_{1:T}}\big[ -\frac{\alpha_{t-1}-\alpha_t}{1-\alpha_t} \sum_{n=1}^N \delta_m(x_{t,n})\, x_{0,n} \log [f_\theta(x_t)]_n \big]$ |

These equations underpin the efficient, scalable, and highly controllable character of absorbing state discrete diffusion models.


Absorbing state discrete diffusion models have established themselves as the standard for scalable, parallel, and controllable generative modeling of discrete data across language and multimodal domains, supported by both compelling empirical success and increasingly refined theoretical guarantees (Yu et al., 16 Jun 2025, Liang et al., 2 Jun 2025, Shi et al., 6 Jun 2024). While research continues to address their unique limitations, the clarity of their mathematical structure, efficiency of their training and inference, and flexibility for real-world applications ensure their pivotal role in the current landscape of discrete generative modeling.
