
Denoising Diffusion Probabilistic Models (DDPMs)

Updated 30 June 2025

Denoising diffusion probabilistic models (DDPMs) are a class of likelihood-based deep generative models that synthesize data by gradually transforming simple noise into samples that resemble the target distribution. These models operate by reversing a fixed forward diffusion process that incrementally adds noise to the data, and they have demonstrated competitive or state-of-the-art performance in image synthesis and other generative tasks. The training procedure is rooted in variational inference and is deeply connected to denoising score matching and Langevin dynamics, situating DDPMs at the intersection of probabilistic modeling and nonequilibrium thermodynamics.

1. Principles of Denoising Diffusion Probabilistic Models

DDPMs are formulated as latent variable models that define a Markov chain transforming data samples $x_0$ into a simple noise prior (typically Gaussian) over $T$ steps, and then learn to reverse this noising process. The core components are:

  • Forward Process ($q$): A Markov chain with fixed, time-dependent Gaussian transitions adds noise to each data point in $T$ steps:

$$
q(x_{1:T}|x_0) = \prod_{t=1}^T q(x_t | x_{t-1}), \quad q(x_t | x_{t-1}) = \mathcal{N}\left(x_t; \sqrt{1-\beta_t}\,x_{t-1}, \beta_t I\right),
$$

where $\{\beta_t\}$ is a predefined variance schedule; below, $\alpha_t := 1-\beta_t$ and $\bar{\alpha}_t := \prod_{s=1}^t \alpha_s$.

  • Reverse Process ($p_\theta$): A Markov chain parameterized by a neural network is trained to model the reverse transitions:

$$
p_\theta(x_{0:T}) = p(x_T)\prod_{t=1}^T p_\theta(x_{t-1}|x_t), \quad p_\theta(x_{t-1}|x_t) = \mathcal{N}\left(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t)\right),
$$

with $p(x_T) = \mathcal{N}(0, I)$.

  • Sampling: DDPMs generate samples by iteratively denoising random Gaussian noise, that is, by drawing $x_T \sim p(x_T)$ and recursively applying $p_\theta(x_{t-1}|x_t)$ for $t = T, \ldots, 1$.
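
As a concrete illustration, here is a minimal NumPy sketch of the forward noising chain together with the equivalent single-step closed-form marginal $q(x_t|x_0)$ listed in Section 6. The linear variance schedule and the helper names (`forward_chain`, `forward_marginal`) are illustrative assumptions, not taken from any reference implementation.

```python
import numpy as np

# Illustrative linear variance schedule beta_1, ..., beta_T (an assumption, not prescribed above).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)  # \bar{alpha}_t = prod_{s<=t} (1 - beta_s)

def forward_chain(x0, t, rng):
    """Sample x_t by applying the Gaussian transition q(x_s | x_{s-1}) for s = 1, ..., t."""
    x = x0
    for s in range(t):
        x = np.sqrt(1.0 - betas[s]) * x + np.sqrt(betas[s]) * rng.standard_normal(x.shape)
    return x

def forward_marginal(x0, t, rng):
    """Sample x_t in one shot from the closed-form marginal q(x_t | x_0)."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[t - 1]) * x0 + np.sqrt(1.0 - alpha_bars[t - 1]) * eps

rng = np.random.default_rng(0)
x0 = rng.standard_normal(8)  # toy "data" vector
# Both routes target the same distribution for x_t given x_0.
print(forward_chain(x0, 500, rng).round(2))
print(forward_marginal(x0, 500, rng).round(2))
```

For large $t$ both routes produce nearly pure Gaussian noise, which is what makes $p(x_T) = \mathcal{N}(0, I)$ a suitable prior for the reverse process.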

2. Variational Objective and Connection to Score Matching

The central training objective is a variational lower bound (ELBO) on the data log-likelihood. This objective decomposes as:

$$
-\log p_\theta(x_0) \leq \mathbb{E}_{q} \left[ -\log p(x_T) - \sum_{t=1}^T \log \frac{p_\theta(x_{t-1}|x_t)}{q(x_t|x_{t-1})} \right]
$$

Thanks to the Gaussian assumptions, this bound can be rewritten in terms of closed-form KL divergences between Gaussians.
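
In particular, the forward posterior conditioned on $x_0$ is itself Gaussian with closed-form parameters (a standard identity, stated here for completeness):

$$
q(x_{t-1}|x_t, x_0) = \mathcal{N}\left(x_{t-1}; \tilde{\mu}_t(x_t, x_0), \tilde{\beta}_t I\right),
\qquad
\tilde{\mu}_t(x_t, x_0) = \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\, x_0 + \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\, x_t,
\qquad
\tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\,\beta_t,
$$

so each KL term in the bound compares two Gaussians and reduces to a weighted squared difference between their means.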

A key insight is that the loss term for the reverse process's mean can be written as a weighted mean squared error between the actual noise and the model's noise prediction:

$$
L_{t-1} \propto \mathbb{E}_{x_0, \epsilon}\left[ \frac{\beta_t^2}{2\sigma_t^2 \alpha_t (1-\bar{\alpha}_t)} \left\| \epsilon - \epsilon_\theta\left(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\,\epsilon,\ t\right) \right\|^2 \right]
$$

where $\epsilon \sim \mathcal{N}(0, I)$ and $\epsilon_\theta$ is the neural noise predictor. This is identical in form to the objective of denoising score matching and closely related to annealed Langevin dynamics: training a DDPM amounts to learning the score function of the perturbed data distribution across noise scales, and iterative denoising at inference corresponds to Langevin-style sampling driven by that learned score.
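
This connection can be made explicit: because $q(x_t|x_0)$ is Gaussian, its score is a rescaled version of the injected noise, so a noise predictor doubles as a score estimator (again a standard identity):

$$
\nabla_{x_t} \log q(x_t|x_0) = -\frac{x_t - \sqrt{\bar{\alpha}_t}\, x_0}{1-\bar{\alpha}_t} = -\frac{\epsilon}{\sqrt{1-\bar{\alpha}_t}},
\qquad\text{hence}\qquad
s_\theta(x_t, t) := -\frac{\epsilon_\theta(x_t, t)}{\sqrt{1-\bar{\alpha}_t}}
$$

is, up to scaling, exactly the quantity a denoising score matching model estimates at noise level $t$.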

Empirically, an unweighted version of this loss,

$$
L_{\mathrm{simple}}(\theta) = \mathbb{E}_{t, x_0, \epsilon} \left[ \left\| \epsilon - \epsilon_\theta(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\,\epsilon, t) \right\|^2 \right],
$$

focuses training on the more challenging, highly corrupted samples and leads to improved perceptual sample quality.
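
The following is a minimal NumPy sketch of a single-batch Monte Carlo estimate of $L_{\mathrm{simple}}$; `epsilon_theta` is a placeholder for the neural network (a U-Net in the original paper), and the gradient update itself is omitted, so this sketches the objective rather than a full training loop.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)    # illustrative linear schedule
alpha_bars = np.cumprod(1.0 - betas)  # \bar{alpha}_t

def l_simple(x0_batch, epsilon_theta, rng):
    """Monte Carlo estimate of L_simple: sample t, sample noise, regress the noise."""
    t = rng.integers(1, T + 1, size=x0_batch.shape[0])  # t ~ Uniform{1, ..., T}
    eps = rng.standard_normal(x0_batch.shape)           # target noise epsilon ~ N(0, I)
    a_bar = alpha_bars[t - 1][:, None]
    x_t = np.sqrt(a_bar) * x0_batch + np.sqrt(1.0 - a_bar) * eps
    return np.mean(np.sum((eps - epsilon_theta(x_t, t)) ** 2, axis=-1))

# Placeholder predictor (always zero); a real model is trained by minimizing this loss.
epsilon_theta = lambda x_t, t: np.zeros_like(x_t)
rng = np.random.default_rng(0)
print(l_simple(rng.standard_normal((4, 8)), epsilon_theta, rng))
```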

3. Progressive Decoding and Interpretable Compression

DDPMs admit an interpretation as progressive lossy decompressors. The forward process encodes information about $x_0$ across a sequence of latents, and progressively transmitting $x_T, \ldots, x_0$ allows progressive reconstruction via the estimate

$$
\hat{x}_0 = \frac{x_t - \sqrt{1 - \bar{\alpha}_t}\,\epsilon_\theta(x_t, t)}{\sqrt{\bar{\alpha}_t}}
$$

Unlike classical autoregressive decoders (which follow hard coordinate-wise orderings), the DDPM trajectory is continuous in noise space, yielding a generalization of sequential coding. This interpretation enables DDPMs to be analyzed using information-theoretic rate-distortion theory, showing that most of the coding effort is spent on small perceptual details.
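
A minimal NumPy sketch of this progressive estimate; `epsilon_theta` is again a stand-in for a trained noise predictor, and the schedule is the illustrative one used in the earlier sketches.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)

def predict_x0(x_t, t, epsilon_theta):
    """Progressive estimate of the clean sample x_0 recoverable from a noisy x_t."""
    a_bar = alpha_bars[t - 1]
    return (x_t - np.sqrt(1.0 - a_bar) * epsilon_theta(x_t, t)) / np.sqrt(a_bar)

# With a perfect noise predictor, this exactly inverts the closed-form forward marginal.
rng = np.random.default_rng(0)
x0 = rng.standard_normal(8)
eps = rng.standard_normal(8)
t = 400
x_t = np.sqrt(alpha_bars[t - 1]) * x0 + np.sqrt(1.0 - alpha_bars[t - 1]) * eps
print(np.allclose(predict_x0(x_t, t, lambda x, s: eps), x0))  # True
```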

4. Empirical Performance and Benchmark Comparisons

The original DDPM framework achieves state-of-the-art results on image generation benchmarks:

  • Unconditional CIFAR10: Inception Score $9.46 \pm 0.11$, FID $3.17$, and $\leq 3.75$ bits/dim, competitive with GAN and autoregressive models.
  • LSUN 256×256: Sample quality matches or exceeds ProgressiveGAN, though StyleGAN remains superior by FID on some splits.

In summary (unconditional CIFAR10; IS = Inception Score, FID = Fréchet Inception Distance):

| Model | IS | FID |
|---|---|---|
| NCSN (score matching) | 8.87 | 25.32 |
| SNGAN-DDLS (GAN) | 9.09 | 15.42 |
| StyleGAN2 + ADA (conditional) | 9.74 | 3.26 |
| DDPM ($L_{\mathrm{simple}}$) | 9.46 | 3.17 |

DDPMs close the gap with GANs on perceptual metrics, sometimes approaching or outperforming class-conditional models, while retaining a likelihood-based foundation and more stable training.

5. Theoretical Unification and Future Implications

Denoising diffusion probabilistic models reveal a unification across several areas of generative modeling:

  • Connection to latent variable models: The forward/reverse process is a latent variable model with explicit evidence lower bound.
  • Connection to score matching/Langevin dynamics: Training aligns with denoising score matching objectives and sampling implements annealed Langevin dynamics.
  • Autoregressive-like progressive decoding: Sampling can be seen as generalized lossy decompression.

This unified view enables transfer of architectural, algorithmic, and analytical advances between generative modeling paradigms—score-based models, VAEs, and energy-based models.

Future research directions suggested include leveraging more expressive decoders, integrating hybrid energy/autoregressive modules, efficient sampling approaches, and extending the lossy decompression view to modalities beyond images.

6. Summary of Main Mathematical Expressions

  • Forward process:

$$
q(x_t|x_0) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t}\, x_0, (1-\bar{\alpha}_t)I)
$$

  • Reverse transition update (see the sampling sketch after this list):

$$
x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\, \epsilon_\theta(x_t, t)\right) + \sigma_t z,\quad z\sim \mathcal{N}(0,I)
$$

  • Weighted variational objective:

$$
L = \mathbb{E}_{q}\left[ \mathrm{KL}(q(x_T|x_0)\,\|\,p(x_T)) + \sum_{t>1} \mathrm{KL}(q(x_{t-1}|x_t, x_0)\,\|\,p_\theta(x_{t-1}|x_t)) - \log p_\theta(x_0|x_1) \right]
$$

  • Simplified mean squared error loss:

$$
L_{\mathrm{simple}} = \mathbb{E}_{t,x_0,\epsilon}\left[ \|\epsilon - \epsilon_\theta(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon, t)\|^2 \right]
$$
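
Iterating the reverse transition update yields a complete ancestral sampling loop. Below is a minimal NumPy sketch, assuming $\sigma_t^2 = \beta_t$ (one common choice) and the same illustrative schedule as before; `epsilon_theta` is a placeholder for a trained network.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def sample(epsilon_theta, shape, rng):
    """Ancestral sampling: start from x_T ~ N(0, I) and apply the reverse update for t = T, ..., 1."""
    x = rng.standard_normal(shape)  # x_T ~ N(0, I)
    for t in range(T, 0, -1):
        z = rng.standard_normal(shape) if t > 1 else np.zeros(shape)  # no noise at the final step
        coef = (1.0 - alphas[t - 1]) / np.sqrt(1.0 - alpha_bars[t - 1])
        mean = (x - coef * epsilon_theta(x, t)) / np.sqrt(alphas[t - 1])
        sigma = np.sqrt(betas[t - 1])  # sigma_t^2 = beta_t (one common choice)
        x = mean + sigma * z
    return x

rng = np.random.default_rng(0)
epsilon_theta = lambda x, t: np.zeros_like(x)  # placeholder; a trained U-Net in practice
print(sample(epsilon_theta, (8,), rng).round(2))
```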

7. Impact and Ongoing Developments

DDPMs establish a scalable, robust, and theoretically grounded framework for deep generative modeling—demonstrating that likelihood-based models can achieve or surpass GAN-level sample quality with improved mode coverage and training stability. Their foundational ties to score-based methods and variational inference clarify the algorithm's operating principles and suggest broad potential for interpretability, compression, and scalable generation across data modalities. Ongoing research continues to explore more expressive decoders, more efficient sampling, better rate-distortion behavior in the progressive-coding view, and applications to domains beyond images, leveraging the framework's flexibility and principled mathematical structure.