Discrete Denoising Diffusion Models (D3PMs)
- Discrete Denoising Diffusion Probabilistic Models (D3PMs) are generative models for discrete data that use iterative Markovian corruption and denoising via structured transition matrices.
- They achieve state-of-the-art performance in domains such as text, images, and symbolic music by employing varied corruption strategies like uniform, absorbing, Gaussian, and nearest neighbor kernels.
- D3PMs support flexible conditional generation (e.g., infilling) and accelerated sampling, aided by auxiliary cross-entropy losses during training and post-hoc classifier guidance.
Discrete Denoising Diffusion Probabilistic Models (D3PMs) are a class of generative models for discrete data that extend the Denoising Diffusion Probabilistic Model (DDPM) framework to categorical spaces, including text, images, and symbolic music. D3PMs formalize the forward and reverse stochastic processes using carefully designed transition matrices, enabling flexible Markovian corruption and iterative denoising over discrete sequences. Design choices in the corruption kernel, model parameterization, and training objective produce notable improvements over previous discrete and continuous approaches, with state-of-the-art results in multiple domains (Austin et al., 2021; Plasser et al., 2023).
1. Forward Diffusion Process
The forward process in D3PMs is a discrete-time Markov chain acting on sequences of categorical variables. Each variable takes values in $\{1, \dots, K\}$, often represented as a one-hot row vector $x \in \{0,1\}^K$. The corruption process is parameterized by a sequence of transition matrices $Q_t$, with $[Q_t]_{ij} = q(x_t = j \mid x_{t-1} = i)$:

$$q(x_t \mid x_{t-1}) = \mathrm{Cat}\big(x_t;\, p = x_{t-1} Q_t\big)$$

Several structured choices for $Q_t$ enable nontrivial forms of corruption:
- Multinomial/Uniform (D3PM-uniform): each token stays in place with probability $1-\beta_t$ and otherwise resamples uniformly over all categories; the stationary distribution is uniform.
- Absorbing-state (D3PM-absorbing): designates one special “mask” category $m$, to which all variables eventually transition. The transition matrix is $Q_t = (1-\beta_t) I + \beta_t \mathbb{1} e_m^\top$, where $e_m$ is the one-hot vector of the mask token.
- Discretized Gaussian (D3PM-Gauss): blurs ordinal categories, with transition probability decaying in the squared difference $(i-j)^2$ between category indices.
- Nearest Neighbor/Embedding-based (D3PM-NN): corruption is restricted to semantically similar tokens via a $k$-NN adjacency matrix over token embeddings.
The closed-form $t$-step marginal is

$$q(x_t \mid x_0) = \mathrm{Cat}\big(x_t;\, p = x_0 \bar{Q}_t\big), \qquad \bar{Q}_t = Q_1 Q_2 \cdots Q_t,$$

where $\bar{Q}_t$ encodes the cumulative effect of the transition kernels.
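As a concrete illustration, here is a minimal NumPy sketch of the uniform and absorbing kernels and the closed-form marginal; the vocabulary size, $\beta$ schedule, and mask index below are illustrative choices, not values from the papers.

```python
import numpy as np

def uniform_kernel(K: int, beta: float) -> np.ndarray:
    """D3PM-uniform: stay put w.p. 1 - beta, else resample uniformly over K categories."""
    return (1.0 - beta) * np.eye(K) + beta * np.ones((K, K)) / K

def absorbing_kernel(K: int, beta: float, mask_id: int) -> np.ndarray:
    """D3PM-absorbing: non-mask tokens jump to the mask state w.p. beta; mask absorbs."""
    Q = (1.0 - beta) * np.eye(K)
    Q[:, mask_id] += beta   # the mask row ends up with probability 1 on itself
    return Q

# Cumulative kernel Qbar_t = Q_1 Q_2 ... Q_t gives the t-step marginal q(x_t | x_0).
K, T, mask_id = 6, 10, 5
betas = np.linspace(0.02, 0.3, T)                 # illustrative schedule
Qbar = np.linalg.multi_dot([absorbing_kernel(K, b, mask_id) for b in betas])
x0 = np.eye(K)[2]                                 # one-hot x_0 = category 2
print(x0 @ Qbar)                                  # q(x_T | x_0): mass shifts to the mask
```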
In continuous time, the forward process can be generalized using a time-inhomogeneous Continuous-Time Markov Chain (CTMC) with generator $R_t$, whose transition kernel $P_{s,t}$ satisfies Kolmogorov's forward equation (Campbell et al., 2022):

$$\frac{\partial}{\partial t} P_{s,t} = P_{s,t} R_t$$
Commuting generators ($R_t R_s = R_s R_t$ for all $s, t$) further enable closed-form marginals via matrix exponentiation, $P_{0,t} = \exp\!\big(\int_0^t R_s \, ds\big)$.
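In the time-homogeneous case the generator trivially commutes with itself, so the marginal kernel is a single matrix exponential. A small SciPy sketch follows; the constant absorbing rate `r` is an illustrative assumption.

```python
import numpy as np
from scipy.linalg import expm

K, mask_id, r = 6, 5, 1.0
R = np.zeros((K, K))
R[:, mask_id] = r              # jump to the mask state at rate r
np.fill_diagonal(R, 0.0)
R -= np.diag(R.sum(axis=1))    # generator rows must sum to zero

t = 0.7
P_t = expm(t * R)              # [exp(tR)]_{ij} = q(x_t = j | x_0 = i)
print(P_t[2])                  # marginal of x_t given x_0 = category 2
```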
2. Reverse Denoising and Training Objective
The reverse process aims to reconstruct $x_0$ by inverting the forward corruption dynamics. The denoising model is constructed via the $x_0$-parameterization:

$$p_\theta(x_{t-1} \mid x_t) \;\propto\; \sum_{\tilde{x}_0} q(x_{t-1}, x_t \mid \tilde{x}_0)\, \tilde{p}_\theta(\tilde{x}_0 \mid x_t)$$

This parameterization mixes the model's prediction for $\tilde{x}_0$ with the analytic forward transition probabilities, ensuring the reverse kernel inherits the sparsity pattern of the forward process while retaining denoising capacity.
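For a single variable this sum has a closed matrix form, since $q(x_{t-1}, x_t \mid x_0) = q(x_t \mid x_{t-1})\, q(x_{t-1} \mid x_0)$ by the Markov property. A NumPy sketch (function and argument names are ours):

```python
import numpy as np

def reverse_step_probs(p_x0, Q_t, Qbar_tm1, x_t):
    """p_theta(x_{t-1} | x_t) ∝ sum_{x0} q(x_{t-1} | x0) q(x_t | x_{t-1}) p~(x0 | x_t).

    p_x0     : (K,) predicted distribution over x_0 given the current x_t
    Q_t      : (K, K) one-step kernel,   Q_t[k, j]      = q(x_t = j | x_{t-1} = k)
    Qbar_tm1 : (K, K) cumulative kernel, Qbar_tm1[i, k] = q(x_{t-1} = k | x_0 = i)
    x_t      : index of the observed category at step t
    """
    unnorm = (p_x0 @ Qbar_tm1) * Q_t[:, x_t]   # (K,) unnormalized over x_{t-1}
    return unnorm / unnorm.sum()
```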
Training minimizes a variational lower bound (ELBO):

$$L_{\mathrm{vb}} = \mathbb{E}_{q(x_0)}\Big[ D_{\mathrm{KL}}\big(q(x_T \mid x_0) \,\|\, p(x_T)\big) + \sum_{t=2}^{T} \mathbb{E}_{q(x_t \mid x_0)} D_{\mathrm{KL}}\big(q(x_{t-1} \mid x_t, x_0) \,\|\, p_\theta(x_{t-1} \mid x_t)\big) - \mathbb{E}_{q(x_1 \mid x_0)} \log p_\theta(x_0 \mid x_1) \Big]$$
An auxiliary cross-entropy loss on the $x_0$ prediction improves stability:

$$L_\lambda = L_{\mathrm{vb}} + \lambda\, \mathbb{E}_{q(x_0)} \mathbb{E}_{q(x_t \mid x_0)}\big[-\log \tilde{p}_\theta(x_0 \mid x_t)\big]$$
For absorbing-state D3PMs (ASD3PMs), the objective further simplifies to a weighted cross-entropy over masked positions with time-dependent weights (Plasser et al., 2023). The continuous-time counterpart (CT-ELBO) upper-bounds the negative log-likelihood via expectations over CTMC jumps and reverse rates (Campbell et al., 2022).
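A PyTorch sketch of an ASD3PM-style objective of this form, assuming a schedule in which a position is masked at step $t$ with probability $t/T$; the `model` interface and the exact constant in the $1/t$ reweighting are our assumptions.

```python
import torch
import torch.nn.functional as F

def asd3pm_loss(model, x0, mask_id, T):
    """Weighted masked cross-entropy for absorbing-state D3PMs (sketch).

    x0: (B, L) clean token ids; model(x_t, t) -> (B, L, K) logits over x_0.
    """
    B, L = x0.shape
    t = torch.randint(1, T + 1, (B,), device=x0.device)                 # uniform timestep
    mask = torch.rand(B, L, device=x0.device) < (t.float() / T).unsqueeze(1)
    x_t = torch.where(mask, torch.full_like(x0, mask_id), x0)           # sample q(x_t | x_0)
    logits = model(x_t, t)                                              # predict x_0
    ce = F.cross_entropy(logits.transpose(1, 2), x0, reduction="none")  # (B, L)
    ce = (ce * mask).sum(dim=1)                # only masked positions contribute
    return ((1.0 / t.float()) * ce).mean()     # time-dependent reweighting
```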
3. Architecture and Implementation
D3PM architectures are tailored to the sequence modality and corruption structure. For symbolic music (Plasser et al., 2023), the design is as follows (a code sketch follows the list):
- Input: a sequence of categorical tokens, one per time-step.
- Local compression: each token is embedded into a continuous vector, and a 1D convolution (kernel size $4$, stride $4$) summarizes four consecutive time-steps into one “quarter-note” segment.
- Global modeling: Deep stack of 24 Transformer blocks (self-attention, layer norm, FFN), hidden size $512$, $8$ heads.
- Decoding: shared transposed convolution to upsample the sequence back to the input resolution, with a linear prediction “head” per symbol.
- Multi-track support: Independent local summarization for each track, shared global Transformer.
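A PyTorch sketch of this layout; the stated dimensions (24 blocks, hidden size $512$, $8$ heads, kernel/stride $4$) follow the list above, while the vocabulary size and the omission of time-step and positional embeddings are simplifications of ours.

```python
import torch
import torch.nn as nn

class MusicD3PMBackbone(nn.Module):
    """Conv-compress / Transformer / conv-decode backbone (sketch)."""

    def __init__(self, K: int, d_model: int = 512, n_layers: int = 24, n_heads: int = 8):
        super().__init__()
        self.embed = nn.Embedding(K, d_model)
        # Local compression: 4 time-steps -> one "quarter-note" segment.
        self.down = nn.Conv1d(d_model, d_model, kernel_size=4, stride=4)
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True, norm_first=True)
        self.transformer = nn.TransformerEncoder(layer, n_layers)
        # Decoding: transposed conv restores the original time resolution.
        self.up = nn.ConvTranspose1d(d_model, d_model, kernel_size=4, stride=4)
        self.head = nn.Linear(d_model, K)   # per-symbol prediction head

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        h = self.embed(tokens).transpose(1, 2)           # (B, d_model, L)
        h = self.down(h).transpose(1, 2)                 # (B, L/4, d_model)
        h = self.transformer(h)
        h = self.up(h.transpose(1, 2)).transpose(1, 2)   # (B, L, d_model)
        return self.head(h)                              # (B, L, K) logits
```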
For images and text, architectures consist of U-Nets or Transformer decoders, with time-step and positional embeddings to represent discrete states and diffusion steps (Austin et al., 2021).
4. Inference and Sampling Strategies
Sampling from D3PMs follows the reverse Markov process, typically initializing at the absorbing or uniform stationary distribution and iteratively refining. For ASD3PMs (Plasser et al., 2023), each step proceeds as follows (a code sketch follows the list):
- Initialization: $x_T = (m, \dots, m)$, with every position set to the mask token.
- Iterative sampling: at each step $t = T, \dots, 1$:
  - Compute logits $\ell_t = f_\theta(x_t, t)$ and probabilities $\tilde{p}_\theta(\tilde{x}_0 \mid x_t) = \operatorname{softmax}(\ell_t)$.
  - Sample $\tilde{x}_0 \sim \tilde{p}_\theta(\cdot \mid x_t)$.
  - With probability $1/t$, replace each remaining mask in $x_t$ with the corresponding entry of $\tilde{x}_0$; otherwise, retain the mask, yielding $x_{t-1}$.
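A sketch of this loop in PyTorch; the `model(x, t)` interface returning per-position logits over $x_0$ is an assumed signature.

```python
import torch

@torch.no_grad()
def asd3pm_sample(model, L, mask_id, T, device="cpu"):
    """Reverse loop following the steps above (batch of one sequence)."""
    x = torch.full((1, L), mask_id, dtype=torch.long, device=device)  # x_T: all masks
    for t in range(T, 0, -1):
        logits = model(x, torch.tensor([t], device=device))           # (1, L, K)
        x0 = torch.distributions.Categorical(logits=logits).sample()  # draw x~_0
        still_masked = x == mask_id
        unmask = still_masked & (torch.rand(1, L, device=device) < 1.0 / t)
        x = torch.where(unmask, x0, x)   # reveal each masked position w.p. 1/t
    return x
```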
Variable step counts and accelerated sampling (e.g., skipping steps with correspondingly adjusted unmasking probabilities) reduce inference time (Plasser et al., 2023). For continuous-time models, high-performance CTMC sampling methods are employed (a sketch of tau-leaping follows the list):
- Uniformization/Jensen’s method: a constant-rate Poisson clock with thinning to accept or reject candidate jumps.
- Tau-leaping: simultaneous state updates over time chunks of width $\tau$, with jump counts drawn as $\mathrm{Poisson}\big(\tau\, R_t(x, \tilde{x})\big)$ for each candidate transition.
- Predictor-corrector: Alternating between generative and mixing steps for improved sample quality (Campbell et al., 2022).
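A NumPy sketch of a single tau-leaping step for a factorized state; resolving simultaneous jumps within one dimension by choosing uniformly among jump targets is our simplification, not the papers' exact treatment.

```python
import numpy as np

def tau_leap_step(x, rate_fn, t, tau, rng):
    """One tau-leaping step for a factorized CTMC (sketch).

    x       : (D,) current categorical state
    rate_fn : rate_fn(x, t) -> (D, K) rates R_t(x, y) per dimension, with zeros
              at the current values (no self-jumps)
    """
    rates = rate_fn(x, t)                  # (D, K)
    jumps = rng.poisson(tau * rates)       # Poisson-distributed jump counts
    x_new = x.copy()
    for d in np.flatnonzero(jumps.sum(axis=1)):
        targets = np.flatnonzero(jumps[d])     # states hit at least once
        x_new[d] = rng.choice(targets)         # simplification: land on one of them
    return x_new
```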
5. Extensions: Classifier Guidance and Conditional Generation
D3PMs support flexible conditioning via post-hoc classifier guidance. A classifier $p_\phi(y \mid x)$ (e.g., predicting note density in music) influences the sampling trajectory:
- At each step $t$, the model's predicted $\tilde{p}_\theta(\tilde{x}_0 \mid x_t)$ is adjusted by the gradient of the classifier loss $\mathcal{L}_\phi$, schematically
$$\log \tilde{p}'_\theta(\tilde{x}_0 \mid x_t) \;\propto\; \log \tilde{p}_\theta(\tilde{x}_0 \mid x_t) - s\, \nabla \mathcal{L}_\phi(y, \tilde{x}_0),$$
where $s$ is a guidance scale. This method enables targeted generation without retraining the diffusion model for each new condition (Plasser et al., 2023).
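A sketch of guidance in logit space; taking the gradient through a softmax relaxation of the one-hot input is one concrete choice of ours, not necessarily the papers' exact procedure.

```python
import torch

def guided_logits(logits, classifier_loss_fn, target, scale):
    """Shift the denoiser's x_0 logits down the classifier-loss gradient (sketch).

    classifier_loss_fn(probs, target) -> scalar loss, where probs = softmax(logits)
    is a relaxed one-hot input to the classifier (an assumption of this sketch).
    """
    logits = logits.detach().requires_grad_(True)
    loss = classifier_loss_fn(logits.softmax(dim=-1), target)
    grad, = torch.autograd.grad(loss, logits)
    return (logits - scale * grad).detach()   # scale plays the role of s above
```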
Note-level infilling leverages the absorbing-state framework: arbitrary positions in are pre-masked, and the reverse process fills “holes” with musically plausible content. This generalizes masked language modeling to arbitrary discrete structured data.
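Infilling reuses the reverse loop with the holes pre-masked and the given positions never resampled; a short sketch under the same assumed `model` interface:

```python
import torch

@torch.no_grad()
def infill(model, x_given, hole_mask, mask_id, T):
    """Pre-mask the holes, then run the reverse loop; given positions stay fixed."""
    x = torch.where(hole_mask, torch.full_like(x_given, mask_id), x_given)
    for t in range(T, 0, -1):
        logits = model(x, torch.tensor([t], device=x.device))
        x0 = torch.distributions.Categorical(logits=logits).sample()
        reveal = (x == mask_id) & (torch.rand(x.shape, device=x.device) < 1.0 / t)
        x = torch.where(reveal & hole_mask, x0, x)   # only holes are ever filled
    return x
```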
6. Evaluation Metrics and Adversarial Critique
Standard quantitative metrics, such as negative log-likelihood (in bits/dim), Inception Score (IS), Fréchet Inception Distance (FID), and domain-specific measures (e.g., frame-wise self-similarity in music), evaluate sample quality; the table below lists CIFAR-10 image results (Austin et al., 2021):
| Model | IS ↑ | FID ↓ | NLL (bits/dim) ↓ |
|---|---|---|---|
| D3PM-absorbing | 6.78 | 30.97 | ≤4.40 |
| D3PM-Gauss | 8.56 | 7.34 | ≤3.44 |
| Continuous DDPM | 9.46 | 3.17 | ≤3.75 |
Empirically, D3PMs with structured kernels and hybrid objectives are competitive or superior to continuous DDPMs in both likelihood and discriminative metrics (Austin et al., 2021).
Frame-wise self-similarity metrics for symbolic music partition pieces into overlapping windows, fit Gaussian approximations to pitch and duration statistics within each window, and aggregate the overlap areas of adjacent windows into consistency and variance scores (Plasser et al., 2023). However, these metrics can be confounded: simulated annealing applied to arbitrary binary images produces piano-rolls that match the self-similarity statistics of real music while lacking genuine musicality, an Anscombe's-quartet-style confounder. This critique demonstrates the need for evaluation measures sensitive to true semantic fidelity rather than statistical resemblance alone.
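For concreteness, a sketch of the overlapping-area computation behind such metrics; window extraction and the exact consistency/variance definitions are simplified here.

```python
import numpy as np
from scipy.stats import norm

def gaussian_overlap(mu1, s1, mu2, s2, lo=0.0, hi=128.0, n=4096):
    """Numerical overlap area of two 1-D Gaussian densities."""
    grid = np.linspace(lo, hi, n)
    pdf = np.minimum(norm.pdf(grid, mu1, s1), norm.pdf(grid, mu2, s2))
    return pdf.sum() * (grid[1] - grid[0])   # Riemann-sum approximation

def self_similarity(values_per_window):
    """Consistency = mean adjacent-window overlap; variability = variance of overlaps."""
    stats = [(np.mean(v), np.std(v) + 1e-6) for v in values_per_window]
    overlaps = [gaussian_overlap(*stats[i], *stats[i + 1])
                for i in range(len(stats) - 1)]
    return float(np.mean(overlaps)), float(np.var(overlaps))
```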
7. Connections, Comparisons, and Theoretical Guarantees
D3PMs establish principled links with autoregressive models and masked language models. Absorbing-state D3PMs with appropriate schedules reproduce the CMLM and BERT objectives, while deterministic absorbing diffusion corresponds to an autoregressive cross-entropy (Austin et al., 2021). Discretized Gaussian kernels approximate the locality-induced bias of continuous DDPMs on ordinal data.
Continuous-time D3PMs provide tighter theoretical control over approximation error. Under bounded reverse-rate and mixing assumptions, the total variation between the sampled and data distributions is quantitatively bounded by the simulation step size, the reverse-rate approximation error, and the forward chain's mixing time (Campbell et al., 2022); schematically,

$$\lVert \hat{p} - p_{\mathrm{data}} \rVert_{TV} \;\le\; C_1\, \epsilon_{\mathrm{rate}} + C_2\, \tau + C_3\, e^{-\gamma T},$$

where $\epsilon_{\mathrm{rate}}$ bounds the reverse-rate error, $\tau$ is the simulation step size, and the final term decays with the time horizon $T$ as the forward chain mixes.
A plausible implication is that systematic selection of $Q_t$ or $R_t$ and efficient parameterization of $\tilde{p}_\theta(\tilde{x}_0 \mid x_t)$ are critical for both sampling fidelity and computational tractability in high-dimensional categorical spaces.
8. Domain-Specific Implementations and Impact
D3PMs have demonstrated state-of-the-art results in symbolic music generation, images, and text (Plasser et al., 2023; Austin et al., 2021; Campbell et al., 2022). Flexible infilling, accelerated sampling, and classifier guidance broaden the applicability across generative tasks. Crucially, metric confounding highlights the imperative for evaluating fidelity beyond statistical resemblance.
Design choices—such as structured corruption kernels (absorbing, Gaussian, nearest neighbor), auxiliary hybrid losses, and model architecture—directly influence D3PMs' ability to match or exceed continuous DDPMs in both sample quality and likelihood. This suggests ongoing research will further refine D3PM methodology for large-scale discrete domains and deepen their connections to established sequence modeling paradigms.