Discrete Diffusion Models (DDMs)

Updated 23 June 2025

Discrete diffusion models (DDMs) are stochastic generative frameworks that extend the denoising diffusion probabilistic modeling paradigm to discrete data spaces, such as text, categorical images, or structured sequences. Grounded in Markovian or non-Markovian stochastic processes, DDMs systematically corrupt and then reconstruct discrete data via learned inference procedures. Their development has yielded new theoretical bridges between denoising, masking, autoregressive, and score-based generative models for categorical domains. The growing methodological sophistication in DDMs—ranging from structured transition matrix design to advanced sampling, guidance, and convergence analysis—has enabled high-fidelity, flexible, and controllable generation for applications spanning natural language, vision, biology, and materials science.

1. Transition Matrix Design and Theoretical Foundations

The forward process of a discrete diffusion model is governed by a sequence of transition matrices $Q_t$ (of shape $K \times K$ for $K$-ary variables), where each entry $[Q_t]_{ij}$ gives the probability of transitioning from state $i$ to state $j$ at step $t$:

$$q(x_t \mid x_{t-1}) = \mathrm{Cat}(x_t;\ p = x_{t-1} Q_t)$$

Key transition matrix types include (a code sketch follows the list):

  • Uniform: Applies symmetric corruption to all states; stationary distribution is uniform.
  • Absorbing ([MASK]): Each state has probability $\beta_t$ of being replaced by a special absorbing token (e.g., [MASK]), connecting the framework to the masked language modeling objective of BERT and of CMLMs. The stationary distribution is a point mass on the absorbing state.
  • Discretized Gaussian: Favors transitions to "nearby" states (by ordinal distance); used for ordered data such as pixels.
  • Nearest-Neighbor/Embedding-based: Leverages semantic similarity in embeddings; transition probabilities reflect functional or semantic proximity.
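
A minimal NumPy sketch of the uniform and absorbing matrices above, together with a few forward corruption steps, is given below. The vocabulary size, the constant schedule $\beta_t = 0.15$, and the number of steps are illustrative choices rather than values from any particular paper.

```python
import numpy as np

def uniform_Q(K, beta):
    """Uniform transition matrix: keep the state with prob. 1 - beta,
    otherwise jump to a uniformly random state."""
    return (1.0 - beta) * np.eye(K) + beta * np.ones((K, K)) / K

def absorbing_Q(K, beta, mask_id):
    """Absorbing ([MASK]) transition matrix: move to mask_id with prob. beta;
    the mask state itself never leaves (point-mass stationary distribution)."""
    Q = (1.0 - beta) * np.eye(K)
    Q[:, mask_id] += beta
    return Q

def forward_step(x_prev, Q, rng):
    """One forward step, x_t ~ Cat(p = x_{t-1} Q), applied independently per position."""
    probs = Q[x_prev]                          # (L, K): row of Q for each current token
    u = rng.random(size=x_prev.shape + (1,))
    return (u < probs.cumsum(axis=-1)).argmax(axis=-1)   # inverse-CDF sampling

rng = np.random.default_rng(0)
K, mask_id = 6, 5                              # toy vocabulary of 5 tokens plus [MASK]
x0 = rng.integers(0, 5, size=16)
xt = x0
for t in range(10):
    xt = forward_step(xt, absorbing_Q(K, beta=0.15, mask_id=mask_id), rng)
print(x0)
print(xt)                                      # most positions now equal mask_id
```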

These design choices directly affect noise accumulation, mixing behavior, and the inference challenge for the learned reverse process. Theoretical connections are drawn between DDMs and variational inference, Markov processes, and continuous SDE-based diffusion models.

2. Connections to Related Probabilistic Models

Absorbing-state DDMs provide a framework that encompasses and unifies:

  • Autoregressive Models: A deterministic forward schedule that masks one token per step, with the reverse process unmasking tokens in the opposite order, reproduces the autoregressive training loss and sampling scheme.
  • Masked Language Models (MLMs): One-step absorbing diffusion with a fraction of tokens masked has a negative ELBO equivalent to the BERT objective (see the sketch after this list).
  • Conditional Masked Language Models (CMLMs): Scheduling the absorbing transitions interpolates between autoregressive language models and MLMs.
  • Continuous Diffusion Models: When the corruption is discretized Gaussian and the state space is ordinal, the forward process and noise properties approach those of continuous diffusion as $K \to \infty$.
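
To make the MLM correspondence concrete, the sketch below implements the single absorbing step ($T = 1$): a fraction of tokens is replaced by [MASK] and the model is scored only on reconstructing the masked positions. The `predict_logits` callable is a hypothetical stand-in for a trained denoiser, and the 15% mask fraction is purely illustrative.

```python
import numpy as np

def mlm_loss_one_step(predict_logits, x0, mask_id, rng, mask_frac=0.15):
    """Single absorbing step (T = 1): mask a fraction of tokens, then take the
    cross-entropy of the model's clean-token prediction on masked positions only,
    i.e. a BERT-style masked-LM objective."""
    mask = rng.random(x0.shape) < mask_frac              # positions replaced by [MASK]
    xt = np.where(mask, mask_id, x0)                     # corrupted sequence x_1
    logits = predict_logits(xt)                          # (L, K) logits over clean tokens
    logp = logits - np.log(np.exp(logits).sum(-1, keepdims=True))   # log-softmax
    return -logp[mask, x0[mask]].mean()                  # average CE over masked positions
```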

The ability to interpolate between, or exactly recover, these established generative paradigms within the DDM formalism underlines the expressive and practical power of diffusion approaches for discrete data.

3. Variational and Auxiliary Training Objectives

The primary training objective in DDMs is a variational lower bound (negative ELBO) on the likelihood:

$$L_{\mathrm{vb}} = \mathbb{E}_{q(x_0)}\Big[ D_{\mathrm{KL}}[q(x_T \mid x_0)\,\|\,p(x_T)] + \sum_{t=2}^{T} \mathbb{E}_{q(x_t \mid x_0)}\big[ D_{\mathrm{KL}}[q(x_{t-1} \mid x_t, x_0)\,\|\,p_\theta(x_{t-1} \mid x_t)]\big] - \mathbb{E}_{q(x_1 \mid x_0)}[\log p_\theta(x_0 \mid x_1)] \Big]$$

An auxiliary cross-entropy loss is found to improve sample quality and training stability:

$$L_{\lambda} = L_{\mathrm{vb}} + \lambda\,\mathbb{E}_{q(x_0)}\mathbb{E}_{q(x_t \mid x_0)}\big[-\log \widetilde{p}_\theta(x_0 \mid x_t)\big]$$

where $\widetilde{p}_\theta(x_0 \mid x_t)$ is the predicted distribution over clean data given the noised input. This loss ensures robust denoising at all time steps, assists gradient propagation, and mitigates the imbalance that can arise from the per-timestep weighting of the ELBO.
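
A per-timestep slice of this hybrid objective follows directly from the two formulas. The sketch below covers only the KL term for one intermediate timestep plus the auxiliary cross-entropy; the prior and reconstruction terms of $L_{\mathrm{vb}}$ are omitted. It assumes the caller has already computed the forward-process posterior, the model's reverse distribution, and the model's clean-data prediction, and the value $\lambda = 0.01$ is illustrative only.

```python
import numpy as np

def kl_cat(p, q, eps=1e-12):
    """KL divergence between two categorical distributions along the last axis."""
    return (p * (np.log(p + eps) - np.log(q + eps))).sum(-1)

def hybrid_loss_term(q_post, p_theta, p_x0_given_xt, x0_onehot, lam=0.01):
    """One per-timestep contribution to L_lambda = L_vb + lambda * CE.

    q_post        : q(x_{t-1} | x_t, x_0),  shape (L, K)
    p_theta       : p_theta(x_{t-1} | x_t), shape (L, K)
    p_x0_given_xt : p~_theta(x_0 | x_t),    shape (L, K)
    x0_onehot     : one-hot clean tokens,   shape (L, K)
    """
    kl_term = kl_cat(q_post, p_theta).mean()                               # ELBO KL term
    ce_term = -(x0_onehot * np.log(p_x0_given_xt + 1e-12)).sum(-1).mean()  # auxiliary CE
    return kl_term + lam * ce_term
```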

4. Performance, Benchmarks, and Empirical Results

Structured DDMs demonstrate strong sample quality and likelihood competitiveness across diverse discrete domains:

  • Text8 (character-level): D3PM with absorbing transitions achieves bits-per-character NLL of 1.45 (at 1000 steps), outperforming uniform and embedding-based diffusions and approaching Transformer XL (1.08).
  • LM1B (subword, large vocab): D3PM with absorbing transitions attains perplexity of 76.9 (1000 steps), notably improving over non-absorbing schedules.
  • CIFAR-10 (image, 8-bit): Gaussian D3PMs with auxiliary loss reach FID 7.34 and NLL 3.43 (comparable or superior to continuous DDPMs).

Inference times are competitive, and D3PMs remain tractable even when the number of sampling steps is reduced by an order of magnitude or more (e.g., to 20 steps), providing a smooth speed-quality trade-off.

| Dataset | Model | Steps | NLL / PPL | FID / IS |
|---|---|---|---|---|
| text8 (char-level) | D3PM absorbing | 1000 | 1.45 bits/char | – |
| LM1B (subword) | D3PM absorbing | 1000 | 76.9 PPL | – |
| CIFAR-10 | D3PM Gaussian + logistic | 1000 | 3.43 bits/dim | 7.34 / 8.6 |

Structured transitions (absorbing, embedding-based) yield statistically significant improvements in both text and image generation tasks.

5. Mathematical and Algorithmic Formalism

The discrete diffusion model is formalized as follows:

  • Forward process:

$$q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1}), \qquad q(x_t \mid x_{t-1}) = \mathrm{Cat}(x_t;\ p = x_{t-1} Q_t)$$

  • Conditional distribution across $t$ steps:

$$q(x_t \mid x_0) = \mathrm{Cat}(x_t;\ p = x_0 \overline{Q}_t), \qquad \overline{Q}_t = Q_1 Q_2 \cdots Q_t$$

  • Posterior (reverse):

$$q(x_{t-1} \mid x_t, x_0) = \mathrm{Cat}\!\left( x_{t-1};\ p = \frac{x_t Q_t^\top \odot x_0 \overline{Q}_{t-1}}{x_0 \overline{Q}_t x_t^\top} \right)$$

The reverse process is parameterized as an $x_0$-predictor:

$$p_\theta(x_{t-1} \mid x_t) \propto \sum_{\widetilde{x}_0} q(x_{t-1}, x_t \mid \widetilde{x}_0)\, \widetilde{p}_\theta(\widetilde{x}_0 \mid x_t)$$
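
The posterior and the $x_0$-parameterized reverse distribution translate almost line for line into code. The sketch below works on per-token one-hot vectors, assumes the $K \times K$ matrices $Q_t$, $\overline{Q}_{t-1}$, and $\overline{Q}_t$ are precomputed, and loops over candidate clean tokens for readability rather than speed.

```python
import numpy as np

def posterior_q(xt_onehot, x0_onehot, Q_t, Qbar_tm1, eps=1e-12):
    """q(x_{t-1} | x_t, x_0) proportional to (x_t Q_t^T) * (x_0 Qbar_{t-1}).
    Normalizing by the row sum is the same as dividing by x_0 Qbar_t x_t^T."""
    numer = (xt_onehot @ Q_t.T) * (x0_onehot @ Qbar_tm1)   # (L, K)
    return numer / (numer.sum(-1, keepdims=True) + eps)

def reverse_p(xt_onehot, p_x0, Q_t, Qbar_tm1, Qbar_t):
    """x_0-parameterization: p_theta(x_{t-1} | x_t) proportional to
    sum over x0~ of q(x_{t-1} | x_t, x0~) q(x_t | x0~) p~_theta(x0~ | x_t)."""
    L, K = p_x0.shape
    out = np.zeros((L, K))
    for k in range(K):                                     # marginalize over clean tokens
        x0_k = np.zeros(K)
        x0_k[k] = 1.0
        q_xt_given_x0 = xt_onehot @ (x0_k @ Qbar_t)        # (L,): q(x_t | x_0 = k)
        post = posterior_q(xt_onehot, np.tile(x0_k, (L, 1)), Q_t, Qbar_tm1)
        out += (p_x0[:, k] * q_xt_given_x0)[:, None] * post
    return out / out.sum(-1, keepdims=True)
```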

Transition and stationary behavior are tightly controlled by the choice of $Q_t$.

6. Inductive Bias and Structured Transitions

By selecting $Q_t$ that encode domain knowledge (e.g., Gaussian for ordinal variables, nearest-neighbor or embedding-based for semantic similarity, absorbing for masking-based tasks), DDMs can exploit inductive biases:

  • Image Modeling: Local transitions (Gaussian) enforce continuity and reflect the true structure of pixel spaces.
  • Language Modeling: Absorbing states mimic the partial-observation mechanism of MLMs, and embedding-based transitions leverage lexical closeness.

In all cases, careful design of $Q_t$ is observed to significantly improve generative performance.
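
As an illustration of the ordinal case, a discretized-Gaussian transition matrix can be built by placing a squared-distance kernel over the state indices and normalizing each row, as sketched below; the bandwidth $\sigma$ (and its schedule over $t$) is a free design choice here, and the exact normalization used in published D3PM variants differs in its details.

```python
import numpy as np

def gaussian_Q(K, sigma):
    """Discretized-Gaussian transition matrix for ordinal states (e.g., pixel values):
    probability mass concentrates on states close to the current one, so corruption
    respects the ordinal structure of the space."""
    idx = np.arange(K)
    logits = -((idx[None, :] - idx[:, None]) ** 2) / (2.0 * sigma ** 2)
    Q = np.exp(logits)
    return Q / Q.sum(axis=1, keepdims=True)    # each row is a valid distribution

Q = gaussian_Q(K=256, sigma=2.0)
print(Q[128, 126:131])                         # mass concentrated near state 128
```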

7. Impact, Extensions, and Research Directions

The introduction of D3PMs established the foundation for a growing landscape of discrete diffusion research, with subsequent advances exploring:

  • Efficient sampling and acceleration, including non-Markovian, continuous-time, and hybrid designs;
  • Score-based estimation and generalizations to categorical variables;
  • Error analysis, contractivity, and convergence rate theory;
  • Scaling to high-dimensional, large-vocabulary, and multimodal applications;
  • Integration with preference alignment, guidance, and explicit structured control.

This development has catalyzed new paradigms for flexible, interpretable, and domain-adaptive generation across fields where discrete data is fundamental.


| Key Aspect | D3PM Choices / Characteristics |
|---|---|
| Transition matrix | Uniform, absorbing, discretized Gaussian, nearest-neighbor |
| Stationary distribution | Uniform, point mass (absorbing state) |
| Inductive bias | Embedding similarity, ordinal smoothing |
| Objective | Variational bound + auxiliary cross-entropy |
| Text results | Absorbing D3PM exceeds non-AR models and matches AR models |
| Image results | Gaussian D3PM matches or surpasses continuous DDPMs |

Discrete denoising diffusion models now constitute a standard, extensible methodology for structured discrete data generation, supporting a broadening suite of architectures, training objectives, and practical applications.