
SEDD-Absorb: Discrete Absorbing Diffusion

Updated 11 October 2025
  • SEDD-Absorb is a discrete diffusion framework that uses an absorbing rate matrix to force tokens into a mask state, ensuring rigorous convergence in high-dimensional discrete spaces.
  • It employs continuous-time Markov chain dynamics with surrogate initialization to handle KL divergence challenges and supports efficient reverse sampling via τ-leaping and uniformization.
  • The approach removes early-stopping constraints by maintaining score upper bounds, achieving provable recovery guarantees and enhanced generation quality for discrete data.

SEDD-Absorb (Score Entropy Discrete Diffusion with Absorbing Rate Matrices) denotes a class of discrete diffusion models in which the corruption ("noising") process is governed by an absorbing rate matrix, typically forcing each token in a sequence to eventually transition to a special “mask” state. This approach, now central in several state-of-the-art discrete generative frameworks, provides both principled learning objectives—via score entropy loss—and rigorous convergence guarantees in high-dimensional discrete-state spaces relevant for tasks in language modeling and other discrete data domains.

1. Absorbing Rate Matrix and Forward Dynamics

The defining element in SEDD-Absorb is the absorbing rate matrix $Q^{\text{tok}}$, used in the forward (corruption) process to drive a continuous-time Markov chain (CTMC) over a discrete state space. For a vocabulary of $S$ symbols with a designated absorbing state $\mathtt{mask}$, the rate matrix for each token position takes the form

$$Q^{\text{tok}} = \mathbb{1}_S\, e_{\mathtt{mask}}^{T} - I_S,$$

where $\mathbb{1}_S$ is the column vector of all ones, $e_{\mathtt{mask}}$ is the standard basis vector with a 1 in the mask coordinate, and $I_S$ is the identity matrix.
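
To make the construction concrete, here is a minimal NumPy sketch; the vocabulary size $S=5$ and the placement of the mask index are illustrative choices, not taken from the source:

```python
import numpy as np

S = 5            # toy vocabulary size, including the mask symbol (illustrative)
MASK = S - 1     # index chosen here for the absorbing mask state

# Q^tok = 1_S e_mask^T - I_S: entry (i, j) = 1{j = mask} - 1{i = j}.
Q = np.zeros((S, S))
Q[:, MASK] += 1.0        # unit rate into the mask column from every state
Q -= np.eye(S)           # diagonal shift; note Q[MASK, MASK] = 1 - 1 = 0

assert np.allclose(Q.sum(axis=1), 0.0)  # rows sum to zero: a valid rate matrix
assert np.allclose(Q[MASK], 0.0)        # the mask row is zero: mask is absorbing
```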

The resulting transition mechanism:

  • Leaves each non-mask token unchanged with probability $e^{-t}$ at time $t$, and with probability $1-e^{-t}$ transitions it to the mask state.
  • The mask state is absorbing: once entered, the token remains masked for all subsequent steps.

This contrasts sharply with uniform or symmetric noising processes that randomly replace tokens with arbitrary alternatives. The forward process for all tokens converges (as $t \to \infty$) to the unique all-mask configuration, an asymptotic point mass on a single state.
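
Exponentiating $t\,Q^{\text{tok}}$ recovers exactly the stay/mask probabilities stated above. The following sketch (same toy setup as before, using SciPy's matrix exponential) checks this and simulates one forward corruption of a sequence:

```python
import numpy as np
from scipy.linalg import expm

S, MASK, t = 5, 4, 0.7                    # toy vocabulary, mask index, time (illustrative)
Q = np.outer(np.ones(S), np.eye(S)[MASK]) - np.eye(S)

P_t = expm(t * Q)                         # transition kernel of the forward CTMC
i = 2                                     # any non-mask state
assert np.isclose(P_t[i, i], np.exp(-t))          # survives with probability e^{-t}
assert np.isclose(P_t[i, MASK], 1 - np.exp(-t))   # otherwise jumps to mask
assert np.isclose(P_t[MASK, MASK], 1.0)           # mask never leaves

# Forward corruption of a token sequence: each position masks independently.
rng = np.random.default_rng(0)
x0 = rng.integers(0, MASK, size=16)
x_t = np.where(rng.random(16) < 1 - np.exp(-t), MASK, x0)
```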

2. Addressing KL Divergence and Initialization

A major technical challenge stems from the stationary (all-mask) distribution: it is singular, so KL divergence computations between the process endpoint and the data distribution are formally ill-defined (they may involve $\log 0$). To resolve this, the SEDD-Absorb analysis introduces a surrogate initialization distribution

$$p_{\text{init}} = \left[(1-\epsilon_T)\, \delta_{\mathtt{mask}} + \frac{\epsilon_T}{S-1} \sum_{j\neq \mathtt{mask}} \delta_j\right]^{\otimes d},$$

with small $\epsilon_T \approx e^{-T}$. As $T$ increases, the KL divergence between the process at time $T$ and $p_{\text{init}}$ is controlled, yielding the bound

$$\mathrm{KL}(q_T \,\|\, p_{\text{init}}) \lesssim d\, e^{-T}.$$

This validates the practical approach of initializing the denoising process slightly away from the true (singular) stationary point, for both analysis and sampling.
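
As a back-of-the-envelope check of the $d\,e^{-T}$ rate, one can compute the per-token KL in closed form, conditioning on a single clean (non-mask) starting token; the values of $S$, $T$, and $d$ below are arbitrary:

```python
import numpy as np

S, T, d = 50, 8.0, 128
eps = np.exp(-T)                     # epsilon_T ≈ e^{-T}, as in p_init

# Per-token marginal at time T given a clean token v:
#   q_T = e^{-T} δ_v + (1 - e^{-T}) δ_mask.
q_v, q_mask = np.exp(-T), 1.0 - np.exp(-T)
p_v, p_mask = eps / (S - 1), 1.0 - eps        # matching coordinates of p_init

kl_per_token = q_v * np.log(q_v / p_v) + q_mask * np.log(q_mask / p_mask)
print(d * kl_per_token)              # equals d * e^{-T} * log(S-1) here,
print(d * np.exp(-T))                # i.e. ~ d e^{-T} up to a log S factor
```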

3. Reverse Process, Sampling, and Convergence Guarantees

The reverse denoising process follows the time-reversed CTMC dynamics, with the transition rates adjusted according to Girsanov’s theorem. Two main sampling strategies are analyzed:

  • $\tau$-leaping: The interval $[0, T-\delta]$ is discretized into $N$ steps; within each step, simultaneous stochastic transitions (possibly multiple token updates per step) are applied using Poisson increments; a runnable sketch follows this list.
  • Uniformization: The process is “uniformized,” enabling exact simulation of the inhomogeneous CTMC by randomizing both transition timing and target states.
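
The following is a hedged, self-contained sketch of $\tau$-leaping for the reverse process. The `score_model` stub is hypothetical: it stands in for the learned score and returns, for each position $j$ and candidate token $v$, the ratio the network would supply; here it is filled in with the exact score of a uniform single-token data distribution so the sketch runs end to end:

```python
import numpy as np

rng = np.random.default_rng(0)
S, MASK, d = 50, 49, 32          # toy vocabulary, mask index, sequence length

def score_model(x, t):
    """Hypothetical stand-in for the learned score s_t(y, x): for each position j
    and token v != MASK, the ratio p_t(x with position j set to v) / p_t(x).
    Filled in with the exact score for uniform data, so the sketch is runnable."""
    ratio = np.exp(-t) / ((1.0 - np.exp(-t)) * (S - 1))
    return np.full((d, S - 1), ratio)

def tau_leaping_sample(T=8.0, delta=1e-3, N=200):
    x = np.full(d, MASK)                        # start near the all-mask stationary point
    ts = np.linspace(T, delta, N + 1)           # discretize [delta, T] into N steps
    for t, t_next in zip(ts[:-1], ts[1:]):
        h = t - t_next
        # Reverse rate of the unmasking move at position j is Q(y, x) * s_t(y, x),
        # and Q(y, x) = 1 for unmasking moves under the absorbing rate matrix.
        rates = score_model(x, t)
        fires = rng.poisson(rates * h)          # tau-leaping: Poisson transition counts
        for j in np.nonzero((x == MASK) & (fires.sum(axis=1) > 0))[0]:
            x[j] = int(np.argmax(fires[j]))     # apply at most one unmasking per position
    return x

print(tau_leaping_sample())
```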

Analytical results show that SEDD-Absorb achieves improved convergence rates relative to uniform/symmetric noising:

  • For $\tau$-leaping, achieving KL error $\epsilon$ requires $O(d/\epsilon)$ steps, significantly better than uniform noising, which typically incurs extra logarithmic factors in $d$.
  • For uniformization, the expected number of transitions scales as $O\!\left(d\,[\log\log(d/\epsilon) + \log \delta^{-1}]\right)$.

Error decomposition for the overall convergence guarantee comprises:

  1. Initialization error (exponentially small in $T$),
  2. Score estimation error (dependent on the quality of the learned/approximated score function via the score entropy objective), and
  3. Discretization/sampler error (controlled via step size and Lipschitz continuity of the score).

Overall, the total KL divergence between the generated and true data distributions can be bounded as

$$\mathrm{KL}(q_\delta \,\|\, p_{T-\delta}) \;\lesssim\; d\, e^{-T} \;+\; \epsilon_{\text{score}} \;+\; \frac{d\,\bigl(T + \log(M \delta^{-1})\bigr)\bigl(T + \log \delta^{-1}\bigr)^{2}}{N}.$$
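
As a rough, illustrative calculation (loose in constants and logarithmic factors, and not taken verbatim from the source), the bound suggests how to choose $T$ and $N$ so the total error is of order $\epsilon$ plus the score error:

```latex
T \asymp \log\frac{d}{\epsilon}
  \;\Longrightarrow\; d\,e^{-T} \lesssim \epsilon,
\qquad
N \asymp \frac{d\,\bigl(T + \log(M\delta^{-1})\bigr)\bigl(T + \log\delta^{-1}\bigr)^{2}}{\epsilon}
  \;\Longrightarrow\; \text{discretization term} \lesssim \epsilon.
```

Up to polylogarithmic factors this gives $N = \widetilde{O}(d/\epsilon)$ steps, consistent with the $\tau$-leaping rate quoted above.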

4. Removal of Early-Stopping via Score Upper Bounds

A notable distinction of SEDD-Absorb (relative to uniform discrete diffusion) is the capacity to remove the early-stopping requirement. In traditional approaches, one must halt the denoising process before $t=0$ to avoid singularities (score functions diverging as $t \to 0$). SEDD-Absorb establishes that, under the assumption that the mask token is present with a minimal relative probability in the data (Assumption "maskq0"), a time-uniform upper bound on the discrete score can be maintained even as $t \to 0$:

$$s_t(y, x) \;\leq\; \frac{1}{t} \qquad \text{whenever } x^j = \mathtt{mask} \text{ and } y^j \neq \mathtt{mask}.$$

This technical result enables full-length denoising, with strong total variation and KL error control relative to the actual data distribution, positioning SEDD-Absorb as the first discrete diffusion framework with such provable recovery guarantees.
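
The elementary inequality behind this bound can be seen in the single-token case, where the unmasking score ratio reduces to $e^{-t}/(1-e^{-t})$ (a simplified illustration, not the general multi-token argument):

```latex
1 - e^{-t} \;=\; \int_0^t e^{-s}\,ds \;\ge\; t\,e^{-t}
\quad\Longrightarrow\quad
\frac{e^{-t}}{1 - e^{-t}} \;\le\; \frac{e^{-t}}{t\,e^{-t}} \;=\; \frac{1}{t}.
```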

5. Technical Tools and Mathematical Framework

SEDD-Absorb analysis develops several new mathematical tools:

  • A Jensen-type entropy argument to bound convergence of the forward absorbing process.
  • Precise upper and lower bounds on the score function $s_t(y,x)$ wherever $Q(y,x) > 0$.
  • An integral decomposition for KL error leveraging a change-of-measure approach (akin to the Girsanov theorem) to sum the initialization, score estimation, and discretization components.
  • Use of Taylor expansions for terms like $(1-x)\log(1-x) \approx -x$ in the regime of small $x$, to capture sharp rates of exponential decay due to absorption.
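
For reference, the expansion in the last bullet reads, to the next order,

```latex
(1-x)\log(1-x) \;=\; -x + \frac{x^{2}}{2} + \frac{x^{3}}{6} + O(x^{4}),
\qquad x \to 0^{+},
```

so the leading $-x$ term drives the exponential-decay rates attributed to absorption.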

The following table summarizes some key error bounds established:

| Sampler | Error Bound Dependence | Early-Stopping Removable |
|---|---|---|
| $\tau$-leaping | $O(d/\epsilon)$ steps | Yes, under maskq0 |
| Uniformization | $O(d\,\log\log(d/\epsilon))$ expected transitions | Yes, under maskq0 |

6. Implications and Applications

The SEDD-Absorb paradigm advances both theoretical and practical aspects of discrete generative modeling:

  • Improved High-Dimensional Convergence: The linear-in-dimension dependence and reduced sampling overhead position absorbing diffusion as a scalable, provably convergent approach for large-scale discrete data (e.g., sequences, images, music, graphs).
  • Enhanced Generation Quality: Empirical results and the derived bounds indicate that absorbing processes lead to improved sample quality relative to uniform noising, as observed in metrics such as likelihood bounds and generation fidelity.
  • Theoretical Justification: The analysis supports the choice of the absorbing process both for practical model design and for advancing the mathematical understanding of discrete diffusion dynamics.

Direct applications include natural language generation, LLM pre-training, and any task where the data lies in a finite discrete space and benefits from iterative denoising.

7. Summary and Outlook

SEDD-Absorb models, defined by absorbing continuous-time Markov chain corruption and learned score-based denoising, represent a principled, theoretically grounded approach to discrete generative modeling with sharply improved convergence guarantees. The use of absorbing rate matrices enables rigorous error control in high dimensions, supports efficient sampling strategies (e.g., $\tau$-leaping, uniformization), and removes standard limitations associated with early stopping in the reverse process. The technical innovations described lay a foundation for further advances in structure-aware, theoretically guaranteed learning over discrete spaces, and suggest future work on hybrid schemes, iterative correction strategies, and sampling algorithms optimized for absorbing diffusion dynamics (Liang et al., 2 Jun 2025).
