Discrete Diffusion Models (DDMs)

Updated 23 June 2025

Discrete diffusion models (DDMs) are stochastic generative frameworks that extend the denoising diffusion probabilistic modeling paradigm to discrete data spaces, such as text, categorical images, or structured sequences. Grounded in Markovian or non-Markovian stochastic processes, DDMs systematically corrupt and then reconstruct discrete data via learned inference procedures. Their development has yielded new theoretical bridges between denoising, masking, autoregressive, and score-based generative models for categorical domains. The growing methodological sophistication in DDMs—ranging from structured transition matrix design to advanced sampling, guidance, and convergence analysis—has enabled high-fidelity, flexible, and controllable generation for applications spanning natural language, vision, biology, and materials science.

1. Transition Matrix Design and Theoretical Foundations

The forward process of a discrete diffusion model is governed by a sequence of transition matrices $Q_t$ (of shape $K \times K$ for $K$-ary variables), where each entry $[Q_t]_{ij}$ gives the probability of transitioning from state $i$ to state $j$ at step $t$:

$$q(x_t \mid x_{t-1}) = \mathrm{Cat}(x_t;\ p = x_{t-1} Q_t)$$

Key transition matrix types include (a code sketch follows the list):

  • Uniform: Applies symmetric corruption to all states; stationary distribution is uniform.
  • Absorbing ([MASK]): Each state has probability $\beta_t$ of being replaced by a special absorbing token (e.g., [MASK]), connecting the framework to the masked language modeling objective of BERT and of CMLMs. The stationary distribution is a point mass on the absorbing state.
  • Discretized Gaussian: Favors transitions to "nearby" states (by ordinal distance); used for ordered data such as pixels.
  • Nearest-Neighbor/Embedding-based: Leverages semantic similarity in embeddings; transition probabilities reflect functional or semantic proximity.
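
A minimal NumPy sketch of the uniform and absorbing matrices above, together with a few forward corruption steps, is given below. The vocabulary size, the constant schedule $\beta_t = 0.15$, and the number of steps are illustrative choices rather than values from any particular paper.

```python
import numpy as np

def uniform_Q(K, beta):
    """Uniform transition matrix: keep the state with prob. 1 - beta,
    otherwise jump to a uniformly random state."""
    return (1.0 - beta) * np.eye(K) + beta * np.ones((K, K)) / K

def absorbing_Q(K, beta, mask_id):
    """Absorbing ([MASK]) transition matrix: move to mask_id with prob. beta;
    the mask state itself never leaves (point-mass stationary distribution)."""
    Q = (1.0 - beta) * np.eye(K)
    Q[:, mask_id] += beta
    return Q

def forward_step(x_prev, Q, rng):
    """One forward step, x_t ~ Cat(p = x_{t-1} Q), applied independently per position."""
    probs = Q[x_prev]                          # (L, K): row of Q for each current token
    u = rng.random(size=x_prev.shape + (1,))
    return (u < probs.cumsum(axis=-1)).argmax(axis=-1)   # inverse-CDF sampling

rng = np.random.default_rng(0)
K, mask_id = 6, 5                              # toy vocabulary of 5 tokens plus [MASK]
x0 = rng.integers(0, 5, size=16)
xt = x0
for t in range(10):
    xt = forward_step(xt, absorbing_Q(K, beta=0.15, mask_id=mask_id), rng)
print(x0)
print(xt)                                      # most positions now equal mask_id
```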

These design choices directly affect noise accumulation, mixing behavior, and the inference challenge for the learned reverse process. Theoretical connections are drawn between DDMs and variational inference, Markov processes, and continuous SDE-based diffusion models.

2. Connections to Related Probabilistic Models

Absorbing-state DDMs provide a framework that encompasses and unifies:

  • Autoregressive Models: A deterministic forward schedule that masks one token per step, with the reverse process unmasking tokens in the opposite order, reproduces the autoregressive training loss and sampling scheme.
  • Masked Language Models (MLMs): One-step absorbing diffusion with a fraction of tokens masked has a negative ELBO equivalent to the BERT objective (see the sketch after this list).
  • Conditional Masked Language Models (CMLMs): Scheduling the absorbing transitions interpolates between autoregressive language models and MLMs.
  • Continuous Diffusion Models: When the corruption is discretized Gaussian and the state space is ordinal, the forward process and noise properties approach those of continuous diffusion as $K \to \infty$.
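
To make the MLM correspondence concrete, the sketch below implements the single absorbing step ($T = 1$): a fraction of tokens is replaced by [MASK] and the model is scored only on reconstructing the masked positions. The `predict_logits` callable is a hypothetical stand-in for a trained denoiser, and the 15% mask fraction is purely illustrative.

```python
import numpy as np

def mlm_loss_one_step(predict_logits, x0, mask_id, rng, mask_frac=0.15):
    """Single absorbing step (T = 1): mask a fraction of tokens, then take the
    cross-entropy of the model's clean-token prediction on masked positions only,
    i.e. a BERT-style masked-LM objective."""
    mask = rng.random(x0.shape) < mask_frac              # positions replaced by [MASK]
    xt = np.where(mask, mask_id, x0)                     # corrupted sequence x_1
    logits = predict_logits(xt)                          # (L, K) logits over clean tokens
    logp = logits - np.log(np.exp(logits).sum(-1, keepdims=True))   # log-softmax
    return -logp[mask, x0[mask]].mean()                  # average CE over masked positions
```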

The ability to interpolate between, or exactly recover, these established generative paradigms within the DDM formalism underlines the expressive and practical power of diffusion approaches for discrete data.

3. Variational and Auxiliary Training Objectives

The primary training objective in DDMs is a variational lower bound (negative ELBO) on the likelihood:

$$L_{\mathrm{vb}} = \mathbb{E}_{q(x_0)}\Big[ D_{\mathrm{KL}}[q(x_T \mid x_0)\,\|\,p(x_T)] + \sum_{t=2}^{T} \mathbb{E}_{q(x_t \mid x_0)}\big[ D_{\mathrm{KL}}[q(x_{t-1} \mid x_t, x_0)\,\|\,p_\theta(x_{t-1} \mid x_t)]\big] - \mathbb{E}_{q(x_1 \mid x_0)}[\log p_\theta(x_0 \mid x_1)] \Big]$$

An auxiliary cross-entropy loss is found to improve sample quality and training stability:

$$L_{\lambda} = L_{\mathrm{vb}} + \lambda\,\mathbb{E}_{q(x_0)}\mathbb{E}_{q(x_t \mid x_0)}\big[-\log \widetilde{p}_\theta(x_0 \mid x_t)\big]$$

where $\widetilde{p}_\theta(x_0 \mid x_t)$ is the predicted distribution over clean data given the noised input. This loss ensures robust denoising at all time steps, assists gradient propagation, and mitigates the imbalance that can arise from the per-timestep weighting of the ELBO.
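
A per-timestep slice of this hybrid objective follows directly from the two formulas. The sketch below covers only the KL term for one intermediate timestep plus the auxiliary cross-entropy; the prior and reconstruction terms of $L_{\mathrm{vb}}$ are omitted. It assumes the caller has already computed the forward-process posterior, the model's reverse distribution, and the model's clean-data prediction, and the value $\lambda = 0.01$ is illustrative only.

```python
import numpy as np

def kl_cat(p, q, eps=1e-12):
    """KL divergence between two categorical distributions along the last axis."""
    return (p * (np.log(p + eps) - np.log(q + eps))).sum(-1)

def hybrid_loss_term(q_post, p_theta, p_x0_given_xt, x0_onehot, lam=0.01):
    """One per-timestep contribution to L_lambda = L_vb + lambda * CE.

    q_post        : q(x_{t-1} | x_t, x_0),  shape (L, K)
    p_theta       : p_theta(x_{t-1} | x_t), shape (L, K)
    p_x0_given_xt : p~_theta(x_0 | x_t),    shape (L, K)
    x0_onehot     : one-hot clean tokens,   shape (L, K)
    """
    kl_term = kl_cat(q_post, p_theta).mean()                               # ELBO KL term
    ce_term = -(x0_onehot * np.log(p_x0_given_xt + 1e-12)).sum(-1).mean()  # auxiliary CE
    return kl_term + lam * ce_term
```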

4. Performance, Benchmarks, and Empirical Results

Structured DDMs demonstrate strong sample quality and likelihood competitiveness across diverse discrete domains:

  • Text8 (character-level): D3PM with absorbing transitions achieves bits-per-character NLL of 1.45 (at 1000 steps), outperforming uniform and embedding-based diffusions and approaching Transformer XL (1.08).
  • LM1B (subword, large vocab): D3PM with absorbing transitions attains perplexity of 76.9 (1000 steps), notably improving over non-absorbing schedules.
  • CIFAR-10 (image, 8-bit): Gaussian D3PMs with auxiliary loss reach FID 7.34 and NLL 3.43 (comparable or superior to continuous DDPMs).

Inference times are competitive, and D3PMs remain tractable even when the number of sampling steps is reduced by an order of magnitude or more (e.g., to 20 steps), providing a smooth speed-quality trade-off.

| Dataset | Model | Steps | NLL / PPL | FID / IS |
|---|---|---|---|---|
| text8 (char-level) | D3PM absorbing | 1000 | 1.45 bits/char | – |
| LM1B (subword) | D3PM absorbing | 1000 | 76.9 PPL | – |
| CIFAR-10 | D3PM Gaussian + logistic | 1000 | 3.43 bits/dim | 7.34 / 8.6 |

Structured transitions (absorbing, embedding-based) yield statistically significant improvements in both text and image generation tasks.

5. Mathematical and Algorithmic Formalism

The discrete diffusion model is formalized as follows:

  • Forward process:

$$q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1}), \qquad q(x_t \mid x_{t-1}) = \mathrm{Cat}(x_t;\ p = x_{t-1} Q_t)$$

  • Conditional distribution across $t$ steps:

$$q(x_t \mid x_0) = \mathrm{Cat}(x_t;\ p = x_0 \overline{Q}_t), \qquad \overline{Q}_t = Q_1 Q_2 \cdots Q_t$$

  • Posterior (reverse):

$$q(x_{t-1} \mid x_t, x_0) = \mathrm{Cat}\!\left( x_{t-1};\ p = \frac{x_t Q_t^\top \odot x_0 \overline{Q}_{t-1}}{x_0 \overline{Q}_t x_t^\top} \right)$$

The reverse process is parameterized as an $x_0$-predictor:

$$p_\theta(x_{t-1} \mid x_t) \propto \sum_{\widetilde{x}_0} q(x_{t-1}, x_t \mid \widetilde{x}_0)\, \widetilde{p}_\theta(\widetilde{x}_0 \mid x_t)$$
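
The posterior and the $x_0$-parameterized reverse distribution translate almost line for line into code. The sketch below works on per-token one-hot vectors, assumes the $K \times K$ matrices $Q_t$, $\overline{Q}_{t-1}$, and $\overline{Q}_t$ are precomputed, and loops over candidate clean tokens for readability rather than speed.

```python
import numpy as np

def posterior_q(xt_onehot, x0_onehot, Q_t, Qbar_tm1, eps=1e-12):
    """q(x_{t-1} | x_t, x_0) proportional to (x_t Q_t^T) * (x_0 Qbar_{t-1}).
    Normalizing by the row sum is the same as dividing by x_0 Qbar_t x_t^T."""
    numer = (xt_onehot @ Q_t.T) * (x0_onehot @ Qbar_tm1)   # (L, K)
    return numer / (numer.sum(-1, keepdims=True) + eps)

def reverse_p(xt_onehot, p_x0, Q_t, Qbar_tm1, Qbar_t):
    """x_0-parameterization: p_theta(x_{t-1} | x_t) proportional to
    sum over x0~ of q(x_{t-1} | x_t, x0~) q(x_t | x0~) p~_theta(x0~ | x_t)."""
    L, K = p_x0.shape
    out = np.zeros((L, K))
    for k in range(K):                                     # marginalize over clean tokens
        x0_k = np.zeros(K)
        x0_k[k] = 1.0
        q_xt_given_x0 = xt_onehot @ (x0_k @ Qbar_t)        # (L,): q(x_t | x_0 = k)
        post = posterior_q(xt_onehot, np.tile(x0_k, (L, 1)), Q_t, Qbar_tm1)
        out += (p_x0[:, k] * q_xt_given_x0)[:, None] * post
    return out / out.sum(-1, keepdims=True)
```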

Transition and stationary behavior are tightly controlled by the choice of $Q_t$.

6. Inductive Bias and Structured Transitions

By selecting $Q_t$ that encode domain knowledge (e.g., Gaussian for ordinal variables, nearest-neighbor or embedding-based for semantic similarity, absorbing for masking-based tasks), DDMs can exploit inductive biases:

  • Image Modeling: Local transitions (Gaussian) enforce continuity and reflect the true structure of pixel spaces.
  • Language Modeling: Absorbing states mimic the partial-observation mechanism of MLMs, and embedding-based transitions leverage lexical closeness.

In all cases, careful design of $Q_t$ is observed to significantly improve generative performance.
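
As an illustration of the ordinal case, a discretized-Gaussian transition matrix can be built by placing a squared-distance kernel over the state indices and normalizing each row, as sketched below; the bandwidth $\sigma$ (and its schedule over $t$) is a free design choice here, and the exact normalization used in published D3PM variants differs in its details.

```python
import numpy as np

def gaussian_Q(K, sigma):
    """Discretized-Gaussian transition matrix for ordinal states (e.g., pixel values):
    probability mass concentrates on states close to the current one, so corruption
    respects the ordinal structure of the space."""
    idx = np.arange(K)
    logits = -((idx[None, :] - idx[:, None]) ** 2) / (2.0 * sigma ** 2)
    Q = np.exp(logits)
    return Q / Q.sum(axis=1, keepdims=True)    # each row is a valid distribution

Q = gaussian_Q(K=256, sigma=2.0)
print(Q[128, 126:131])                         # mass concentrated near state 128
```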

7. Impact, Extensions, and Research Directions

The introduction of D3PMs established the foundation for a growing landscape of discrete diffusion research, with subsequent advances exploring:

  • Efficient sampling and acceleration, including non-Markovian, continuous-time, and hybrid designs;
  • Score-based estimation and generalizations to categorical variables;
  • Error analysis, contractivity, and convergence rate theory;
  • Scaling to high-dimensional, large-vocabulary, and multimodal applications;
  • Integration with preference alignment, guidance, and explicit structured control.

This development has catalyzed new paradigms for flexible, interpretable, and domain-adaptive generation across fields where discrete data is fundamental.


| Key Aspect | D3PM Choices / Characteristics |
|---|---|
| Transition matrix | Uniform, absorbing, discretized Gaussian, nearest-neighbor |
| Stationary distribution | Uniform, point mass (absorbing state) |
| Inductive bias | Embedding similarity, ordinal smoothing |
| Objective | Variational bound + auxiliary cross-entropy |
| Text results | Absorbing D3PM exceeds non-AR models and matches AR models |
| Image results | Gaussian D3PM matches or surpasses continuous DDPMs |

Discrete denoising diffusion models now constitute a standard, extensible methodology for structured discrete data generation, supporting a broadening suite of architectures, training objectives, and practical applications.