
Discrete Denoising Diffusion Probabilistic Models (D3PMs)

Updated 3 January 2026
  • Discrete Denoising Diffusion Probabilistic Models (D3PMs) are generative models for discrete data that use iterative Markovian corruption and denoising via structured transition matrices.
  • They achieve state-of-the-art performance in domains such as text, images, and symbolic music by employing varied corruption strategies like uniform, absorbing, Gaussian, and nearest neighbor kernels.
  • D3PMs support flexible conditional generation (e.g., infilling and classifier guidance) and accelerated sampling, and training is stabilized by an auxiliary cross-entropy loss on the clean data.

Discrete Denoising Diffusion Probabilistic Models (D3PMs) are a class of generative models for discrete data that extend the Denoising Diffusion Probabilistic Model (DDPM) framework to categorical spaces, including text, images, and symbolic music. D3PMs formalize the forward and reverse stochastic processes using carefully designed transition matrices, enabling flexible Markovian corruption and iterative denoising over discrete sequences. Design choices in the corruption kernel, model parameterization, and training objective produce notable improvements over previous discrete and continuous approaches, with state-of-the-art results in multiple domains (Austin et al., 2021, Plasser et al., 2023).

1. Forward Diffusion Process

The forward process in D3PMs is a discrete-time Markov chain acting on sequences of categorical variables. Each variable $x_0^{(i)}$ takes values in $\{1, \ldots, K\}$, often represented as one-hot vectors $x_t \in \{e_1, \ldots, e_K\}$. The corruption process is parameterized by a sequence of $K \times K$ transition matrices $Q_t$:

$$q(x_t \mid x_{t-1}) = \mathrm{Cat}(x_t;\, p = x_{t-1} Q_t)$$

Several structured choices for $Q_t$ enable nontrivial forms of corruption:

  • Multinomial/Uniform (D3PM-uniform): Each token is resampled uniformly over all categories with probability $\beta_t$ at each step; the stationary distribution is uniform.
  • Absorbing-state (D3PM-absorbing): Designates one special “mask” category $m$, to which all variables eventually transition. The transition matrix is:

$$[Q_t]_{ij} = \begin{cases} 1 & i = j = m \\ 1-\beta_t & i = j \neq m \\ \beta_t & i \neq m,\; j = m \\ 0 & \text{otherwise} \end{cases}$$

  • Discretized Gaussian (D3PM-Gauss): Blurs ordinal categories, with probability decaying with squared difference in values.
  • Nearest Neighbor/Embedding-based (D3PM-NN): Corruption is restricted to semantically similar tokens via a $k$-NN adjacency matrix.

The closed-form $t$-step marginal is

$$q(x_t \mid x_0) = \mathrm{Cat}(x_t;\, p = x_0\, \overline{Q}_t), \qquad \overline{Q}_t = Q_1 Q_2 \cdots Q_t$$

where $\overline{Q}_t$ encodes the cumulative effect of the transition kernels.
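
A minimal NumPy sketch of this forward corruption, using illustrative values for the vocabulary size $K$, the schedule $\beta_t$, and the mask index (none of these are taken from the cited papers):

```python
# Forward corruption with uniform and absorbing kernels, plus cumulative
# marginals Q_bar_t = Q_1 Q_2 ... Q_t as in the closed-form expression above.
import numpy as np

K = 6                       # vocabulary size; the last index serves as the mask
T = 10                      # number of diffusion steps
mask = K - 1                # absorbing "mask" category
betas = np.linspace(0.02, 0.5, T)

def uniform_Q(beta):
    """Q_t for D3PM-uniform: stay put, or resample uniformly with prob. beta."""
    return (1.0 - beta) * np.eye(K) + beta * np.ones((K, K)) / K

def absorbing_Q(beta):
    """Q_t for D3PM-absorbing: jump to the mask state with prob. beta."""
    Q = (1.0 - beta) * np.eye(K)
    Q[:, mask] += beta
    Q[mask] = np.eye(K)[mask]          # the mask state is absorbing
    return Q

Qs = [absorbing_Q(b) for b in betas]
Q_bars = [Qs[0]]
for Q in Qs[1:]:
    Q_bars.append(Q_bars[-1] @ Q)      # cumulative products give q(x_t | x_0)

def sample_xt_given_x0(x0, t, rng=np.random.default_rng(0)):
    """Draw x_t ~ Cat(p = x_0 Q_bar_t) for a sequence of category indices x0."""
    probs = Q_bars[t - 1][x0]          # rows of Q_bar_t selected by x_0
    return np.array([rng.choice(K, p=p) for p in probs])

x0 = np.array([0, 1, 2, 3, 4])
print(sample_xt_given_x0(x0, t=T))     # mostly mask tokens by the final step
```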

In continuous time, the forward process can be generalized using a time-inhomogeneous Continuous-Time Markov Chain (CTMC) with generator $R_t$, satisfying Kolmogorov's forward equation (Campbell et al., 2022):

$$\frac{d}{dt} q_t(x) = \sum_{y} q_t(y)\, R_t(y, x)$$

Commuting generators ($R_t = \beta(t) R_b$) further enable closed-form marginals via matrix exponentiation.
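
Under the commuting-generator assumption, the marginal reduces to a single matrix exponential. A hedged sketch follows; the base rate matrix $R_b$ and schedule $\beta(t)$ here are illustrative choices, not values from Campbell et al. (2022):

```python
# q_t = q_0 expm(B(t) * R_b) with B(t) = integral of beta(s) from 0 to t.
import numpy as np
from scipy.linalg import expm

K = 6
R_b = np.ones((K, K)) / K - np.eye(K)   # uniform-jump generator; rows sum to 0

def beta(s):
    return 0.1 + s                      # any positive rate schedule

def marginal(q0, t, n_quad=1000):
    s = np.linspace(0.0, t, n_quad)
    B = np.trapz(beta(s), s)            # numerical integral of the schedule
    return q0 @ expm(B * R_b)           # Kolmogorov forward solution

q0 = np.eye(K)[2]                       # start deterministically in category 2
print(marginal(q0, t=5.0))              # approaches the uniform stationary law
```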

2. Reverse Denoising and Training Objective

The reverse process aims to reconstruct $x_0$ by inverting the forward corruption dynamics. The reverse model $p_\theta(x_{t-1} \mid x_t)$ is constructed via an $x_0$-parameterization:

$$p_\theta(x_{t-1} \mid x_t) \propto \sum_{\widetilde{x}_0} q(x_{t-1}, x_t \mid \widetilde{x}_0)\, \widetilde{p}_\theta(\widetilde{x}_0 \mid x_t)$$

This parameterization mixes the model's prediction for x0x_0 with the analytic forward transition probabilities, ensuring correct sparsity and denoising capacity.
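
A small NumPy sketch of this parameterization for a single position, reusing the transition matrices from the forward sketch above; `p_x0` stands in for the network's predicted distribution $\widetilde{p}_\theta(\widetilde{x}_0 \mid x_t)$ and is an assumed input, not a library API:

```python
# p_theta(x_{t-1} | x_t) obtained by marginalizing the analytic forward
# posterior over the model's prediction for x_0.
import numpy as np

def reverse_posterior(xt, t, p_x0, Qs, Q_bars):
    """Return p_theta(x_{t-1} | x_t) for one position, as a length-K vector.

    xt   : observed category index at step t
    p_x0 : length-K vector, predicted distribution over x_0
    """
    K = len(p_x0)
    fwd = Qs[t - 1][:, xt]                          # q(x_t | x_{t-1}) as a function of x_{t-1}
    prior = Q_bars[t - 2] if t > 1 else np.eye(K)   # q(x_{t-1} | x_0), rows indexed by x_0
    unnorm = fwd * (p_x0 @ prior)                   # sum over x_0, weighted by the forward term
    return unnorm / unnorm.sum()
```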

Training minimizes a variational upper bound on the negative log-likelihood:

$$L_{\mathrm{vb}} = \mathbb{E}_{q(x_0)} \Big[ D_{KL}\big(q(x_T \mid x_0) \,\Vert\, p(x_T)\big) + \sum_{t=2}^T \mathbb{E}_{q(x_t \mid x_0)}\, D_{KL}\big(q(x_{t-1} \mid x_t, x_0) \,\Vert\, p_\theta(x_{t-1} \mid x_t)\big) - \mathbb{E}_{q(x_1 \mid x_0)} \log p_\theta(x_0 \mid x_1) \Big]$$

An auxiliary cross-entropy loss on $x_0$ improves stability:

$$L_\lambda = L_{\mathrm{vb}} + \lambda\, \mathbb{E}_{q(x_0)} \mathbb{E}_{q(x_t \mid x_0)} \big[ -\log \widetilde{p}_\theta(x_0 \mid x_t) \big]$$
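
A hedged PyTorch-style sketch of the hybrid objective; `model`, the analytic-posterior helper `q_posterior` (computing $q(x_{t-1} \mid x_t, x_0)$ as in the formulas above), and the weight `lambda_aux` are illustrative assumptions, not the reference implementation:

```python
# Per-step KL term of L_vb plus a lambda-weighted cross-entropy on x_0.
import torch
import torch.nn.functional as F

def hybrid_loss_step(model, x0, xt, t, q_posterior, lambda_aux=0.01):
    logits = model(xt, t)                            # (B, L, K) prediction of x_0
    log_p_x0 = F.log_softmax(logits, dim=-1)

    # KL( q(x_{t-1} | x_t, x_0) || p_theta(x_{t-1} | x_t) ), per position
    x0_onehot = F.one_hot(x0, num_classes=logits.size(-1)).float()
    q_true = q_posterior(xt, x0_onehot, t)           # posterior under the true x_0
    p_model = q_posterior(xt, log_p_x0.exp(), t)     # x_0-parameterized model posterior
    kl = (q_true * (q_true.clamp_min(1e-12).log()
                    - p_model.clamp_min(1e-12).log())).sum(-1).mean()

    # auxiliary cross-entropy term on x_0
    ce = F.nll_loss(log_p_x0.reshape(-1, log_p_x0.size(-1)), x0.reshape(-1))
    return kl + lambda_aux * ce
```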

For absorbing-state D3PMs (ASD3PMs), the objective further simplifies to a weighted cross-entropy over masked positions with time-dependent weights (Plasser et al., 2023). The continuous-time counterpart (CT-ELBO) upper-bounds the negative log-likelihood via expectations over CTMC jumps and reverse rates (Campbell et al., 2022).

3. Architecture and Implementation

D3PM architectures are tailored to the sequence modality and corruption structure. For symbolic music (Plasser et al., 2023), the main components are as follows (a rough code sketch appears after the list):

  • Input: Sequence of length $L$ (e.g., $L = 1024$ time-steps).
  • Local compression: Embedding each token into $\mathbb{R}^{128}$, followed by a 1D convolution (kernel $4$, stride $4$) to summarize into “quarter-note” segments in $\mathbb{R}^{512}$.
  • Global modeling: Deep stack of 24 Transformer blocks (self-attention, layer norm, FFN), hidden size $512$, $8$ heads.
  • Decoding: Shared transposed convolution to upsample sequence, linear prediction “head” per symbol.
  • Multi-track support: Independent local summarization for each track, shared global Transformer.
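
A rough PyTorch sketch of this layout; the hidden sizes, kernel/stride, depth, and head count follow the list above, while the vocabulary size, module structure, and the omission of time-step conditioning are illustrative assumptions:

```python
import torch
import torch.nn as nn

class D3PMMusicBackbone(nn.Module):
    def __init__(self, vocab=389, d_local=128, d_model=512, n_layers=24, n_heads=8):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_local)
        self.down = nn.Conv1d(d_local, d_model, kernel_size=4, stride=4)   # local compression
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.trunk = nn.TransformerEncoder(layer, n_layers)                # global modeling
        self.up = nn.ConvTranspose1d(d_model, d_local, kernel_size=4, stride=4)
        self.head = nn.Linear(d_local, vocab)                              # per-symbol prediction head
        # NOTE: diffusion time-step conditioning is omitted here for brevity.

    def forward(self, x):                               # x: (B, L) token indices
        h = self.embed(x).transpose(1, 2)               # (B, d_local, L)
        h = self.down(h).transpose(1, 2)                # (B, L/4, d_model)
        h = self.trunk(h)                               # self-attention over segments
        h = self.up(h.transpose(1, 2)).transpose(1, 2)  # (B, L, d_local)
        return self.head(h)                             # (B, L, vocab) logits for x_0

logits = D3PMMusicBackbone(n_layers=2)(torch.randint(0, 389, (2, 1024)))   # shallow demo
```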

For images and text, architectures consist of U-Nets or Transformer decoders, with time-step and positional embeddings to represent discrete states and diffusion steps (Austin et al., 2021).

4. Inference and Sampling Strategies

Sampling from D3PMs follows the reverse Markov process, typically initializing $x_T$ to absorbing or uniform states and iteratively refining. For ASD3PMs (Plasser et al., 2023), the procedure is as follows (a minimal code sketch follows the steps):

  • Initialization: $x_T = [\text{mask}, \ldots, \text{mask}]$.
  • Iterative sampling: At each step $t = T, \ldots, 1$:

    1. Compute logits $\ell = f_\theta(x_t)$ and probabilities $\hat{y} = \mathrm{softmax}(\ell)$.
    2. Sample $x_0^{\mathrm{cand}} \sim \mathrm{Cat}(\hat{y})$.
    3. Independently for each still-masked position, with probability $1/t$ replace the mask in $x_t$ with the corresponding entry of $x_0^{\mathrm{cand}}$; otherwise, retain the mask.
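
A minimal NumPy sketch of this loop; `model_probs` stands in for $\mathrm{softmax}(f_\theta(x_t))$ and is an assumed callable, not an API from the cited work:

```python
import numpy as np

def sample_asd3pm(model_probs, L, K, mask, T, rng=np.random.default_rng(0)):
    x = np.full(L, mask)                             # x_T = [mask, ..., mask]
    for t in range(T, 0, -1):
        y_hat = model_probs(x, t)                    # (L, K) predicted p(x_0 | x_t)
        x0_cand = np.array([rng.choice(K, p=p) for p in y_hat])
        unmask = (x == mask) & (rng.random(L) < 1.0 / t)
        x[unmask] = x0_cand[unmask]                  # reveal roughly 1/t of the remaining masks
    return x
```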

Variable step-count and accelerated sampling (e.g., skipping steps $t$ with adjusted probabilities) reduce inference time (Plasser et al., 2023). For continuous-time models, high-performance CTMC sampling methods are employed:

  • Uniformization/Jensen’s method: Constant-rate Poisson clock and thinning.
  • Tau-leaping: Simultaneous state updates in time chunks of size $\tau$, with Poisson-distributed jump counts $N_{xy} \sim \mathrm{Pois}\big(\tau\, \hat{R}_t^\theta(x, y)\big)$; see the sketch after this list.
  • Predictor-corrector: Alternating between generative and mixing steps for improved sample quality (Campbell et al., 2022).
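
A hedged sketch of one tau-leaping update for a single position; the learned reverse rates are passed in as a row vector, and the rule for resolving multiple simultaneous jumps is a simplifying assumption rather than the paper's exact procedure:

```python
import numpy as np

def tau_leap_step(x, rates_row, tau, rng=np.random.default_rng(0)):
    """x: current category index; rates_row[y] approximates R_hat_t(x, y)."""
    rates = rates_row.copy()
    rates[x] = 0.0                                   # no self-jumps
    n_jumps = rng.poisson(tau * rates)               # N_xy ~ Pois(tau * R_hat_t(x, y))
    targets = np.flatnonzero(n_jumps)
    if len(targets) == 0:
        return x                                     # no event in this time chunk
    return int(rng.choice(targets))                  # resolve collisions arbitrarily
```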

5. Extensions: Classifier Guidance and Conditional Generation

D3PMs support flexible conditioning via post-hoc classifier guidance. A classifier $\ell_\phi(x_0)$ (e.g., for note density in music) influences the sampling trajectory:

  • At each step $t$, the model's predicted $p_\theta(x_0 \mid x_t)$ is adjusted by the gradient of the classifier loss:

$$\hat{y}_{\text{guided}} \propto \hat{y} \cdot \exp\big(-s\, \nabla_{\hat{y}} L\big)$$

where $s > 0$ is a guidance scale. This method enables targeted generation without retraining the diffusion model for new conditions (Plasser et al., 2023).
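
A hedged PyTorch sketch of this reweighting; `classifier_loss` (e.g., the squared error of a note-density regressor against a target) is an assumed callable, not an API from the cited work:

```python
import torch

def guide(y_hat, classifier_loss, scale):
    y = y_hat.detach().requires_grad_(True)           # (L, K) predicted p(x_0 | x_t)
    (grad,) = torch.autograd.grad(classifier_loss(y), y)
    guided = y.detach() * torch.exp(-scale * grad)    # y_hat * exp(-s * dL/dy_hat)
    return guided / guided.sum(dim=-1, keepdim=True)  # renormalize per position
```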

Note-level infilling leverages the absorbing-state framework: arbitrary positions in $x_0$ are pre-masked, and the reverse process fills the “holes” with musically plausible content. This generalizes masked language modeling to arbitrary discrete structured data.
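
Infilling reuses the absorbing-state sampler sketched above, with context positions held fixed; everything here is schematic:

```python
import numpy as np

def infill(x_context, holes, model_probs, K, mask, T, rng=np.random.default_rng(0)):
    x = x_context.copy()
    x[holes] = mask                                  # pre-mask only the positions to fill
    for t in range(T, 0, -1):
        y_hat = model_probs(x, t)
        cand = np.array([rng.choice(K, p=p) for p in y_hat])
        unmask = (x == mask) & (rng.random(len(x)) < 1.0 / t)
        x[unmask] = cand[unmask]                     # fixed context tokens never change
    return x
```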

6. Evaluation Metrics and Adversarial Critique

Standard quantitative metrics, such as bits/dim, negative log-likelihood, Inception Score (IS), Fréchet Inception Distance (FID), and domain-specific measures (e.g., frame-wise self-similarity in music), are used to evaluate sample quality. Representative CIFAR-10 image results (Austin et al., 2021):

| Model | IS ↑ | FID ↓ | NLL (bits/dim) ↓ |
| --- | --- | --- | --- |
| D3PM-absorbing | 6.78 | 30.97 | ≤ 4.40 |
| D3PM-Gauss | 8.56 | 7.34 | ≤ 3.44 |
| Continuous DDPM | 9.46 | 3.17 | ≤ 3.75 |

Empirically, D3PMs with structured kernels and hybrid objectives are competitive or superior to continuous DDPMs in both likelihood and discriminative metrics (Austin et al., 2021).

Frame-wise self-similarity metrics for symbolic music partition pieces into overlapping windows, compute Gaussian approximations of pitch and duration statistics within each window, and aggregate the overlap areas of adjacent windows into consistency and variance scores (Plasser et al., 2023). However, these metrics can be confounded: simulated annealing applied to arbitrary binary images produces piano-rolls that match the self-similarity statistics of real data while lacking genuine musicality (the “Anscombe” confounder). This critique demonstrates the need for evaluation measures sensitive to true semantic fidelity.
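
For concreteness, a rough numerical sketch of the window-overlap statistic these metrics build on, assuming the overlap area is the integral of the pointwise minimum of two fitted Gaussian densities; the window settings are illustrative, and the aggregation into consistency/variance scores is not reproduced here:

```python
import numpy as np
from scipy.stats import norm

def overlap_area(a, b, grid=np.linspace(0, 127, 2048)):
    """Area under the minimum of two Gaussian pdfs fitted to feature arrays a and b."""
    pa = norm.pdf(grid, a.mean(), a.std() + 1e-6)
    pb = norm.pdf(grid, b.mean(), b.std() + 1e-6)
    return np.trapz(np.minimum(pa, pb), grid)

def windowed_overlaps(pitches, win=64, hop=32):
    """pitches: 1D NumPy array of a per-frame feature (e.g., MIDI pitch numbers)."""
    windows = [pitches[i:i + win] for i in range(0, len(pitches) - win + 1, hop)]
    return np.array([overlap_area(windows[i], windows[i + 1])
                     for i in range(len(windows) - 1)])
```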

7. Connections, Comparisons, and Theoretical Guarantees

D3PMs establish principled links with autoregressive models and masked language models. Absorbing-state D3PMs with appropriate schedules reproduce the CMLM and BERT objectives, while deterministic absorbing diffusion corresponds to the autoregressive cross-entropy objective (Austin et al., 2021). Discretized Gaussian kernels approximate the locality-induced bias of continuous DDPMs on ordinal data.

Continuous-time D3PMs provide tighter theoretical control over approximation error. Under bounded reverse-rate and mixing assumptions, the total variation between the sampled and data distribution is quantitatively bounded by step size, rate error, and mixing time (Campbell et al., 2022):

$$\|\mathrm{Law}(y_0) - p_{\mathrm{data}}\|_{\mathrm{TV}} \leq 3MT + \Big\{ (|R| S D C_1)^2 + \tfrac{1}{2} C_2 \big(M + C_1 S D |R|\big) \Big\}\, \tau T + 2 \exp\!\left( -\frac{T \log^2 2}{t_{\mathrm{mix}} \log(4D)} \right)$$

A plausible implication is that systematic selection of $Q_t$ or $R_t$ and efficient parameterization of $p_\theta(x_0 \mid x_t)$ are critical for both sampling fidelity and computational tractability in high-dimensional categorical spaces.

8. Domain-Specific Implementations and Impact

D3PMs have demonstrated state-of-the-art results in symbolic music generation, images, and text (Plasser et al., 2023, Austin et al., 2021, Campbell et al., 2022). Flexible infilling, accelerated sampling, and classifier guidance broaden the applicability across generative tasks. Crucially, metric confounding highlights the imperative for evaluating fidelity beyond statistical resemblance.

Design choices—such as structured corruption kernels (absorbing, Gaussian, nearest neighbor), auxiliary hybrid losses, and model architecture—directly influence D3PMs' ability to match or exceed continuous DDPMs in both sample quality and likelihood. This suggests ongoing research will further refine D3PM methodology for large-scale discrete domains and deepen their connections to established sequence modeling paradigms.
