Discrete Denoising Diffusion Models (D3PMs)
- Discrete Denoising Diffusion Probabilistic Models (D3PMs) are generative models for discrete data that use iterative Markovian corruption and denoising via structured transition matrices.
- They achieve state-of-the-art performance in domains such as text, images, and symbolic music by employing varied corruption strategies like uniform, absorbing, Gaussian, and nearest neighbor kernels.
- D3PMs support flexible conditional generation (e.g., infilling) and accelerated sampling, aided by auxiliary cross-entropy losses during training and post-hoc classifier guidance.
Discrete Denoising Diffusion Probabilistic Models (D3PMs) are a class of generative models for discrete data that extend the Denoising Diffusion Probabilistic Model (DDPM) framework to categorical spaces, including text, images, and symbolic music. D3PMs formalize the forward and reverse stochastic processes using carefully designed transition matrices, enabling flexible Markovian corruption and iterative denoising over discrete sequences. Design choices in the corruption kernel, model parameterization, and training objective produce notable improvements over previous discrete and continuous approaches, with state-of-the-art results in multiple domains (Austin et al., 2021; Plasser et al., 2023).
1. Forward Diffusion Process
The forward process in D3PMs is a discrete-time Markov chain acting on sequences of categorical variables. Each variable takes values in $\{1, \dots, K\}$, often represented as a one-hot row vector $x \in \{0,1\}^K$. The corruption process is parameterized by a sequence of transition matrices $Q_t$, with $[Q_t]_{ij} = q(x_t = j \mid x_{t-1} = i)$:

$$q(x_t \mid x_{t-1}) = \mathrm{Cat}\big(x_t;\, p = x_{t-1} Q_t\big)$$

Several structured choices for $Q_t$ enable nontrivial forms of corruption:
- Multinomial/Uniform (D3PM-uniform): each token stays in place with probability $1-\beta_t$ and otherwise resamples uniformly over all categories; the stationary distribution is uniform.
- Absorbing-state (D3PM-absorbing): designates one special “mask” category $m$, to which all variables eventually transition. The transition matrix is $Q_t = (1-\beta_t) I + \beta_t \mathbb{1} e_m^\top$, where $e_m$ is the one-hot vector of the mask token.
- Discretized Gaussian (D3PM-Gauss): blurs ordinal categories, with transition probability decaying in the squared difference $(i-j)^2$ between category indices.
- Nearest Neighbor/Embedding-based (D3PM-NN): corruption is restricted to semantically similar tokens via a $k$-NN adjacency matrix over token embeddings.
The closed-form $t$-step marginal is

$$q(x_t \mid x_0) = \mathrm{Cat}\big(x_t;\, p = x_0 \bar{Q}_t\big), \qquad \bar{Q}_t = Q_1 Q_2 \cdots Q_t,$$

where $\bar{Q}_t$ encodes the cumulative effect of the transition kernels.
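As a concrete illustration, here is a minimal NumPy sketch of the uniform and absorbing kernels and the closed-form marginal; the vocabulary size, $\beta$ schedule, and mask index below are illustrative choices, not values from the papers.

```python
import numpy as np

def uniform_kernel(K: int, beta: float) -> np.ndarray:
    """D3PM-uniform: stay put w.p. 1 - beta, else resample uniformly over K categories."""
    return (1.0 - beta) * np.eye(K) + beta * np.ones((K, K)) / K

def absorbing_kernel(K: int, beta: float, mask_id: int) -> np.ndarray:
    """D3PM-absorbing: non-mask tokens jump to the mask state w.p. beta; mask absorbs."""
    Q = (1.0 - beta) * np.eye(K)
    Q[:, mask_id] += beta   # the mask row ends up with probability 1 on itself
    return Q

# Cumulative kernel Qbar_t = Q_1 Q_2 ... Q_t gives the t-step marginal q(x_t | x_0).
K, T, mask_id = 6, 10, 5
betas = np.linspace(0.02, 0.3, T)                 # illustrative schedule
Qbar = np.linalg.multi_dot([absorbing_kernel(K, b, mask_id) for b in betas])
x0 = np.eye(K)[2]                                 # one-hot x_0 = category 2
print(x0 @ Qbar)                                  # q(x_T | x_0): mass shifts to the mask
```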
In continuous time, the forward process can be generalized using a time-inhomogeneous Continuous-Time Markov Chain (CTMC) with generator $R_t$, whose transition kernel $P_{s,t}$ satisfies Kolmogorov's forward equation (Campbell et al., 2022):

$$\frac{\partial}{\partial t} P_{s,t} = P_{s,t} R_t$$
Commuting generators ($R_t R_s = R_s R_t$ for all $s, t$) further enable closed-form marginals via matrix exponentiation, $P_{0,t} = \exp\!\big(\int_0^t R_s \, ds\big)$.
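In the time-homogeneous case the generator trivially commutes with itself, so the marginal kernel is a single matrix exponential. A small SciPy sketch follows; the constant absorbing rate `r` is an illustrative assumption.

```python
import numpy as np
from scipy.linalg import expm

K, mask_id, r = 6, 5, 1.0
R = np.zeros((K, K))
R[:, mask_id] = r              # jump to the mask state at rate r
np.fill_diagonal(R, 0.0)
R -= np.diag(R.sum(axis=1))    # generator rows must sum to zero

t = 0.7
P_t = expm(t * R)              # [exp(tR)]_{ij} = q(x_t = j | x_0 = i)
print(P_t[2])                  # marginal of x_t given x_0 = category 2
```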
2. Reverse Denoising and Training Objective
The reverse process aims to reconstruct $x_0$ by inverting the forward corruption dynamics. The denoising model is constructed via the $x_0$-parameterization:

$$p_\theta(x_{t-1} \mid x_t) \;\propto\; \sum_{\tilde{x}_0} q(x_{t-1}, x_t \mid \tilde{x}_0)\, \tilde{p}_\theta(\tilde{x}_0 \mid x_t)$$

This parameterization mixes the model's prediction for $\tilde{x}_0$ with the analytic forward transition probabilities, ensuring the reverse kernel inherits the sparsity pattern of the forward process while retaining denoising capacity.
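For a single variable this sum has a closed matrix form, since $q(x_{t-1}, x_t \mid x_0) = q(x_t \mid x_{t-1})\, q(x_{t-1} \mid x_0)$ by the Markov property. A NumPy sketch (function and argument names are ours):

```python
import numpy as np

def reverse_step_probs(p_x0, Q_t, Qbar_tm1, x_t):
    """p_theta(x_{t-1} | x_t) ∝ sum_{x0} q(x_{t-1} | x0) q(x_t | x_{t-1}) p~(x0 | x_t).

    p_x0     : (K,) predicted distribution over x_0 given the current x_t
    Q_t      : (K, K) one-step kernel,   Q_t[k, j]      = q(x_t = j | x_{t-1} = k)
    Qbar_tm1 : (K, K) cumulative kernel, Qbar_tm1[i, k] = q(x_{t-1} = k | x_0 = i)
    x_t      : index of the observed category at step t
    """
    unnorm = (p_x0 @ Qbar_tm1) * Q_t[:, x_t]   # (K,) unnormalized over x_{t-1}
    return unnorm / unnorm.sum()
```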
Training minimizes a variational lower bound (ELBO):

$$L_{\mathrm{vb}} = \mathbb{E}_{q(x_0)}\Big[ D_{\mathrm{KL}}\big(q(x_T \mid x_0) \,\|\, p(x_T)\big) + \sum_{t=2}^{T} \mathbb{E}_{q(x_t \mid x_0)} D_{\mathrm{KL}}\big(q(x_{t-1} \mid x_t, x_0) \,\|\, p_\theta(x_{t-1} \mid x_t)\big) - \mathbb{E}_{q(x_1 \mid x_0)} \log p_\theta(x_0 \mid x_1) \Big]$$
An auxiliary cross-entropy loss on the $x_0$ prediction improves stability:

$$L_\lambda = L_{\mathrm{vb}} + \lambda\, \mathbb{E}_{q(x_0)} \mathbb{E}_{q(x_t \mid x_0)}\big[-\log \tilde{p}_\theta(x_0 \mid x_t)\big]$$
For absorbing-state D3PMs (ASD3PMs), the objective further simplifies to a weighted cross-entropy over masked positions with time-dependent weights (Plasser et al., 2023). The continuous-time counterpart (CT-ELBO) upper-bounds the negative log-likelihood via expectations over CTMC jumps and reverse rates (Campbell et al., 2022).
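A PyTorch sketch of an ASD3PM-style objective of this form, assuming a schedule in which a position is masked at step $t$ with probability $t/T$; the `model` interface and the exact constant in the $1/t$ reweighting are our assumptions.

```python
import torch
import torch.nn.functional as F

def asd3pm_loss(model, x0, mask_id, T):
    """Weighted masked cross-entropy for absorbing-state D3PMs (sketch).

    x0: (B, L) clean token ids; model(x_t, t) -> (B, L, K) logits over x_0.
    """
    B, L = x0.shape
    t = torch.randint(1, T + 1, (B,), device=x0.device)                 # uniform timestep
    mask = torch.rand(B, L, device=x0.device) < (t.float() / T).unsqueeze(1)
    x_t = torch.where(mask, torch.full_like(x0, mask_id), x0)           # sample q(x_t | x_0)
    logits = model(x_t, t)                                              # predict x_0
    ce = F.cross_entropy(logits.transpose(1, 2), x0, reduction="none")  # (B, L)
    ce = (ce * mask).sum(dim=1)                # only masked positions contribute
    return ((1.0 / t.float()) * ce).mean()     # time-dependent reweighting
```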
3. Architecture and Implementation
D3PM architectures are tailored to the sequence modality and corruption structure. For symbolic music (Plasser et al., 2023), the design is as follows (a code sketch follows the list):
- Input: a sequence of categorical tokens, one per time-step.
- Local compression: each token is embedded into a continuous vector, and a 1D convolution (kernel size $4$, stride $4$) summarizes four consecutive time-steps into one “quarter-note” segment.
- Global modeling: Deep stack of 24 Transformer blocks (self-attention, layer norm, FFN), hidden size $512$, $8$ heads.
- Decoding: shared transposed convolution to upsample the sequence back to the input resolution, with a linear prediction “head” per symbol.
- Multi-track support: Independent local summarization for each track, shared global Transformer.
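A PyTorch sketch of this layout; the stated dimensions (24 blocks, hidden size $512$, $8$ heads, kernel/stride $4$) follow the list above, while the vocabulary size and the omission of time-step and positional embeddings are simplifications of ours.

```python
import torch
import torch.nn as nn

class MusicD3PMBackbone(nn.Module):
    """Conv-compress / Transformer / conv-decode backbone (sketch)."""

    def __init__(self, K: int, d_model: int = 512, n_layers: int = 24, n_heads: int = 8):
        super().__init__()
        self.embed = nn.Embedding(K, d_model)
        # Local compression: 4 time-steps -> one "quarter-note" segment.
        self.down = nn.Conv1d(d_model, d_model, kernel_size=4, stride=4)
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True, norm_first=True)
        self.transformer = nn.TransformerEncoder(layer, n_layers)
        # Decoding: transposed conv restores the original time resolution.
        self.up = nn.ConvTranspose1d(d_model, d_model, kernel_size=4, stride=4)
        self.head = nn.Linear(d_model, K)   # per-symbol prediction head

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        h = self.embed(tokens).transpose(1, 2)           # (B, d_model, L)
        h = self.down(h).transpose(1, 2)                 # (B, L/4, d_model)
        h = self.transformer(h)
        h = self.up(h.transpose(1, 2)).transpose(1, 2)   # (B, L, d_model)
        return self.head(h)                              # (B, L, K) logits
```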
For images and text, architectures consist of U-Nets or Transformer decoders, with time-step and positional embeddings to represent discrete states and diffusion steps (Austin et al., 2021).
4. Inference and Sampling Strategies
Sampling from D3PMs follows the reverse Markov process, typically initializing at the absorbing or uniform stationary distribution and iteratively refining. For ASD3PMs (Plasser et al., 2023), each step proceeds as follows (a code sketch follows the list):
- Initialization: $x_T = (m, \dots, m)$, with every position set to the mask token.
- Iterative sampling: at each step $t = T, \dots, 1$:
  - Compute logits $\ell_t = f_\theta(x_t, t)$ and probabilities $\tilde{p}_\theta(\tilde{x}_0 \mid x_t) = \operatorname{softmax}(\ell_t)$.
  - Sample $\tilde{x}_0 \sim \tilde{p}_\theta(\cdot \mid x_t)$.
  - With probability $1/t$, replace each remaining mask in $x_t$ with the corresponding entry of $\tilde{x}_0$; otherwise, retain the mask, yielding $x_{t-1}$.
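A sketch of this loop in PyTorch; the `model(x, t)` interface returning per-position logits over $x_0$ is an assumed signature.

```python
import torch

@torch.no_grad()
def asd3pm_sample(model, L, mask_id, T, device="cpu"):
    """Reverse loop following the steps above (batch of one sequence)."""
    x = torch.full((1, L), mask_id, dtype=torch.long, device=device)  # x_T: all masks
    for t in range(T, 0, -1):
        logits = model(x, torch.tensor([t], device=device))           # (1, L, K)
        x0 = torch.distributions.Categorical(logits=logits).sample()  # draw x~_0
        still_masked = x == mask_id
        unmask = still_masked & (torch.rand(1, L, device=device) < 1.0 / t)
        x = torch.where(unmask, x0, x)   # reveal each masked position w.p. 1/t
    return x
```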
Variable step counts and accelerated sampling (e.g., skipping steps with correspondingly adjusted unmasking probabilities) reduce inference time (Plasser et al., 2023). For continuous-time models, high-performance CTMC sampling methods are employed (a sketch of tau-leaping follows the list):
- Uniformization/Jensen’s method: a constant-rate Poisson clock with thinning to accept or reject candidate jumps.
- Tau-leaping: simultaneous state updates over time chunks of width $\tau$, with jump counts drawn as $\mathrm{Poisson}\big(\tau\, R_t(x, \tilde{x})\big)$ for each candidate transition.
- Predictor-corrector: Alternating between generative and mixing steps for improved sample quality (Campbell et al., 2022).
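A NumPy sketch of a single tau-leaping step for a factorized state; resolving simultaneous jumps within one dimension by choosing uniformly among jump targets is our simplification, not the papers' exact treatment.

```python
import numpy as np

def tau_leap_step(x, rate_fn, t, tau, rng):
    """One tau-leaping step for a factorized CTMC (sketch).

    x       : (D,) current categorical state
    rate_fn : rate_fn(x, t) -> (D, K) rates R_t(x, y) per dimension, with zeros
              at the current values (no self-jumps)
    """
    rates = rate_fn(x, t)                  # (D, K)
    jumps = rng.poisson(tau * rates)       # Poisson-distributed jump counts
    x_new = x.copy()
    for d in np.flatnonzero(jumps.sum(axis=1)):
        targets = np.flatnonzero(jumps[d])     # states hit at least once
        x_new[d] = rng.choice(targets)         # simplification: land on one of them
    return x_new
```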
5. Extensions: Classifier Guidance and Conditional Generation
D3PMs support flexible conditioning via post-hoc classifier guidance. A classifier $p_\phi(y \mid x)$ (e.g., predicting note density in music) influences the sampling trajectory:
- At each step $t$, the model's predicted $\tilde{p}_\theta(\tilde{x}_0 \mid x_t)$ is adjusted by the gradient of the classifier loss $\mathcal{L}_\phi$, schematically
$$\log \tilde{p}'_\theta(\tilde{x}_0 \mid x_t) \;\propto\; \log \tilde{p}_\theta(\tilde{x}_0 \mid x_t) - s\, \nabla \mathcal{L}_\phi(y, \tilde{x}_0),$$
where $s$ is a guidance scale. This method enables targeted generation without retraining the diffusion model for each new condition (Plasser et al., 2023).
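A sketch of guidance in logit space; taking the gradient through a softmax relaxation of the one-hot input is one concrete choice of ours, not necessarily the papers' exact procedure.

```python
import torch

def guided_logits(logits, classifier_loss_fn, target, scale):
    """Shift the denoiser's x_0 logits down the classifier-loss gradient (sketch).

    classifier_loss_fn(probs, target) -> scalar loss, where probs = softmax(logits)
    is a relaxed one-hot input to the classifier (an assumption of this sketch).
    """
    logits = logits.detach().requires_grad_(True)
    loss = classifier_loss_fn(logits.softmax(dim=-1), target)
    grad, = torch.autograd.grad(loss, logits)
    return (logits - scale * grad).detach()   # scale plays the role of s above
```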
Note-level infilling leverages the absorbing-state framework: arbitrary positions in are pre-masked, and the reverse process fills “holes” with musically plausible content. This generalizes masked language modeling to arbitrary discrete structured data.
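Infilling reuses the reverse loop with the holes pre-masked and the given positions never resampled; a short sketch under the same assumed `model` interface:

```python
import torch

@torch.no_grad()
def infill(model, x_given, hole_mask, mask_id, T):
    """Pre-mask the holes, then run the reverse loop; given positions stay fixed."""
    x = torch.where(hole_mask, torch.full_like(x_given, mask_id), x_given)
    for t in range(T, 0, -1):
        logits = model(x, torch.tensor([t], device=x.device))
        x0 = torch.distributions.Categorical(logits=logits).sample()
        reveal = (x == mask_id) & (torch.rand(x.shape, device=x.device) < 1.0 / t)
        x = torch.where(reveal & hole_mask, x0, x)   # only holes are ever filled
    return x
```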
6. Evaluation Metrics and Adversarial Critique
Standard quantitative metrics, such as negative log-likelihood (in bits/dim), Inception Score (IS), Fréchet Inception Distance (FID), and domain-specific measures (e.g., frame-wise self-similarity in music), evaluate sample quality; the table below lists CIFAR-10 image results (Austin et al., 2021):
| Model | IS ↑ | FID ↓ | NLL (bits/dim) ↓ |
|---|---|---|---|
| D3PM-absorbing | 6.78 | 30.97 | ≤4.40 |
| D3PM-Gauss | 8.56 | 7.34 | ≤3.44 |
| Continuous DDPM | 9.46 | 3.17 | ≤3.75 |
Empirically, D3PMs with structured kernels and hybrid objectives are competitive or superior to continuous DDPMs in both likelihood and discriminative metrics (Austin et al., 2021).
Frame-wise self-similarity metrics for symbolic music partition pieces into overlapping windows, fit Gaussian approximations to pitch and duration statistics within each window, and aggregate the overlap areas of adjacent windows into consistency and variance scores (Plasser et al., 2023). However, these metrics can be confounded: simulated annealing applied to arbitrary binary images produces piano-rolls that match the self-similarity statistics of real music while lacking genuine musicality, an Anscombe's-quartet-style confounder. This critique demonstrates the need for evaluation measures sensitive to true semantic fidelity rather than statistical resemblance alone.
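For concreteness, a sketch of the overlapping-area computation behind such metrics; window extraction and the exact consistency/variance definitions are simplified here.

```python
import numpy as np
from scipy.stats import norm

def gaussian_overlap(mu1, s1, mu2, s2, lo=0.0, hi=128.0, n=4096):
    """Numerical overlap area of two 1-D Gaussian densities."""
    grid = np.linspace(lo, hi, n)
    pdf = np.minimum(norm.pdf(grid, mu1, s1), norm.pdf(grid, mu2, s2))
    return pdf.sum() * (grid[1] - grid[0])   # Riemann-sum approximation

def self_similarity(values_per_window):
    """Consistency = mean adjacent-window overlap; variability = variance of overlaps."""
    stats = [(np.mean(v), np.std(v) + 1e-6) for v in values_per_window]
    overlaps = [gaussian_overlap(*stats[i], *stats[i + 1])
                for i in range(len(stats) - 1)]
    return float(np.mean(overlaps)), float(np.var(overlaps))
```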
7. Connections, Comparisons, and Theoretical Guarantees
D3PMs establish principled links with autoregressive models and masked language models. Absorbing-state D3PMs with appropriate schedules reproduce the CMLM and BERT objectives, while deterministic absorbing diffusion corresponds to an autoregressive cross-entropy (Austin et al., 2021). Discretized Gaussian kernels approximate the locality-induced bias of continuous DDPMs on ordinal data.
Continuous-time D3PMs provide tighter theoretical control over approximation error. Under bounded reverse-rate and mixing assumptions, the total variation between the sampled and data distributions is quantitatively bounded by the simulation step size, the reverse-rate approximation error, and the forward chain's mixing time (Campbell et al., 2022); schematically,

$$\lVert \hat{p} - p_{\mathrm{data}} \rVert_{TV} \;\le\; C_1\, \epsilon_{\mathrm{rate}} + C_2\, \tau + C_3\, e^{-\gamma T},$$

where $\epsilon_{\mathrm{rate}}$ bounds the reverse-rate error, $\tau$ is the simulation step size, and the final term decays with the time horizon $T$ as the forward chain mixes.
A plausible implication is that systematic selection of $Q_t$ or $R_t$ and efficient parameterization of $\tilde{p}_\theta(\tilde{x}_0 \mid x_t)$ are critical for both sampling fidelity and computational tractability in high-dimensional categorical spaces.
8. Domain-Specific Implementations and Impact
D3PMs have demonstrated state-of-the-art results in symbolic music generation, images, and text (Plasser et al., 2023; Austin et al., 2021; Campbell et al., 2022). Flexible infilling, accelerated sampling, and classifier guidance broaden the applicability across generative tasks. Crucially, metric confounding highlights the imperative for evaluating fidelity beyond statistical resemblance.
Design choices—such as structured corruption kernels (absorbing, Gaussian, nearest neighbor), auxiliary hybrid losses, and model architecture—directly influence D3PMs' ability to match or exceed continuous DDPMs in both sample quality and likelihood. This suggests ongoing research will further refine D3PM methodology for large-scale discrete domains and deepen their connections to established sequence modeling paradigms.