
Denoising Diffusion Probabilistic Models (DDPMs)

Updated 30 June 2025

Denoising diffusion probabilistic models (DDPMs) are a class of likelihood-based deep generative models that synthesize data by gradually transforming simple noise into samples that resemble the target distribution. These models operate by reversing a fixed forward diffusion process that incrementally adds noise to the data, and they have demonstrated competitive or state-of-the-art performance in image synthesis and other generative tasks. The training procedure is rooted in variational inference and is deeply connected to denoising score matching and Langevin dynamics, situating DDPMs at the intersection of probabilistic modeling and nonequilibrium thermodynamics.

1. Principles of Denoising Diffusion Probabilistic Models

DDPMs are formulated as latent variable models that define a Markov chain transforming data samples $x_0$ into a simple noise prior (typically Gaussian) over $T$ steps, and then learn to reverse this noising process. The core components are:

  • Forward Process ($q$): A Markov chain with fixed, time-dependent Gaussian transitions adds noise to each data point in $T$ steps:

$$
q(x_{1:T}|x_0) = \prod_{t=1}^T q(x_t | x_{t-1}), \quad q(x_t | x_{t-1}) = \mathcal{N}\left(x_t; \sqrt{1-\beta_t}\,x_{t-1}, \beta_t I\right),
$$

where $\{\beta_t\}$ is a predefined variance schedule; below, $\alpha_t := 1-\beta_t$ and $\bar{\alpha}_t := \prod_{s=1}^t \alpha_s$.

  • Reverse Process ($p_\theta$): A Markov chain parameterized by a neural network is trained to model the reverse transitions:

$$
p_\theta(x_{0:T}) = p(x_T)\prod_{t=1}^T p_\theta(x_{t-1}|x_t), \quad p_\theta(x_{t-1}|x_t) = \mathcal{N}\left(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t)\right),
$$

with $p(x_T) = \mathcal{N}(0, I)$.

  • Sampling: DDPMs generate samples by iteratively denoising random Gaussian noise, that is, by drawing $x_T \sim p(x_T)$ and recursively applying $p_\theta(x_{t-1}|x_t)$ for $t = T, \ldots, 1$.
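
As a concrete illustration, here is a minimal NumPy sketch of the forward noising chain together with the equivalent single-step closed-form marginal $q(x_t|x_0)$ listed in Section 6. The linear variance schedule and the helper names (`forward_chain`, `forward_marginal`) are illustrative assumptions, not taken from any reference implementation.

```python
import numpy as np

# Illustrative linear variance schedule beta_1, ..., beta_T (an assumption, not prescribed above).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)  # \bar{alpha}_t = prod_{s<=t} (1 - beta_s)

def forward_chain(x0, t, rng):
    """Sample x_t by applying the Gaussian transition q(x_s | x_{s-1}) for s = 1, ..., t."""
    x = x0
    for s in range(t):
        x = np.sqrt(1.0 - betas[s]) * x + np.sqrt(betas[s]) * rng.standard_normal(x.shape)
    return x

def forward_marginal(x0, t, rng):
    """Sample x_t in one shot from the closed-form marginal q(x_t | x_0)."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[t - 1]) * x0 + np.sqrt(1.0 - alpha_bars[t - 1]) * eps

rng = np.random.default_rng(0)
x0 = rng.standard_normal(8)  # toy "data" vector
# Both routes target the same distribution for x_t given x_0.
print(forward_chain(x0, 500, rng).round(2))
print(forward_marginal(x0, 500, rng).round(2))
```

For large $t$ both routes produce nearly pure Gaussian noise, which is what makes $p(x_T) = \mathcal{N}(0, I)$ a suitable prior for the reverse process.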

2. Variational Objective and Connection to Score Matching

The central training objective is a variational lower bound (ELBO) on the data log-likelihood. This objective decomposes as:

$$
-\log p_\theta(x_0) \leq \mathbb{E}_{q} \left[ -\log p(x_T) - \sum_{t=1}^T \log \frac{p_\theta(x_{t-1}|x_t)}{q(x_t|x_{t-1})} \right]
$$

Thanks to the Gaussian assumptions, this bound can be rewritten in terms of closed-form KL divergences between Gaussians.
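
In particular, the forward posterior conditioned on $x_0$ is itself Gaussian with closed-form parameters (a standard identity, stated here for completeness):

$$
q(x_{t-1}|x_t, x_0) = \mathcal{N}\left(x_{t-1}; \tilde{\mu}_t(x_t, x_0), \tilde{\beta}_t I\right),
\qquad
\tilde{\mu}_t(x_t, x_0) = \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\, x_0 + \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\, x_t,
\qquad
\tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\,\beta_t,
$$

so each KL term in the bound compares two Gaussians and reduces to a weighted squared difference between their means.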

A key insight is that the loss term for the reverse process's mean can be written as a weighted mean squared error between the actual noise and the model's noise prediction:

$$
L_{t-1} \propto \mathbb{E}_{x_0, \epsilon}\left[ \frac{\beta_t^2}{2\sigma_t^2 \alpha_t (1-\bar{\alpha}_t)} \left\| \epsilon - \epsilon_\theta\left(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\,\epsilon,\ t\right) \right\|^2 \right]
$$

where $\epsilon \sim \mathcal{N}(0, I)$ and $\epsilon_\theta$ is the neural noise predictor. This is identical in form to the objective of denoising score matching and closely related to annealed Langevin dynamics: training a DDPM amounts to learning the score function of the perturbed data distribution across noise scales, and iterative denoising at inference corresponds to Langevin-style sampling driven by that learned score.
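
This connection can be made explicit: because $q(x_t|x_0)$ is Gaussian, its score is a rescaled version of the injected noise, so a noise predictor doubles as a score estimator (again a standard identity):

$$
\nabla_{x_t} \log q(x_t|x_0) = -\frac{x_t - \sqrt{\bar{\alpha}_t}\, x_0}{1-\bar{\alpha}_t} = -\frac{\epsilon}{\sqrt{1-\bar{\alpha}_t}},
\qquad\text{hence}\qquad
s_\theta(x_t, t) := -\frac{\epsilon_\theta(x_t, t)}{\sqrt{1-\bar{\alpha}_t}}
$$

is, up to scaling, exactly the quantity a denoising score matching model estimates at noise level $t$.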

Empirically, an unweighted version of this loss,

$$
L_{\mathrm{simple}}(\theta) = \mathbb{E}_{t, x_0, \epsilon} \left[ \left\| \epsilon - \epsilon_\theta(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\,\epsilon, t) \right\|^2 \right],
$$

focuses training on the more challenging, highly corrupted samples and leads to improved perceptual sample quality.
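
The following is a minimal NumPy sketch of a single-batch Monte Carlo estimate of $L_{\mathrm{simple}}$; `epsilon_theta` is a placeholder for the neural network (a U-Net in the original paper), and the gradient update itself is omitted, so this sketches the objective rather than a full training loop.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)    # illustrative linear schedule
alpha_bars = np.cumprod(1.0 - betas)  # \bar{alpha}_t

def l_simple(x0_batch, epsilon_theta, rng):
    """Monte Carlo estimate of L_simple: sample t, sample noise, regress the noise."""
    t = rng.integers(1, T + 1, size=x0_batch.shape[0])  # t ~ Uniform{1, ..., T}
    eps = rng.standard_normal(x0_batch.shape)           # target noise epsilon ~ N(0, I)
    a_bar = alpha_bars[t - 1][:, None]
    x_t = np.sqrt(a_bar) * x0_batch + np.sqrt(1.0 - a_bar) * eps
    return np.mean(np.sum((eps - epsilon_theta(x_t, t)) ** 2, axis=-1))

# Placeholder predictor (always zero); a real model is trained by minimizing this loss.
epsilon_theta = lambda x_t, t: np.zeros_like(x_t)
rng = np.random.default_rng(0)
print(l_simple(rng.standard_normal((4, 8)), epsilon_theta, rng))
```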

3. Progressive Decoding and Interpretable Compression

DDPMs admit an interpretation as progressive lossy decompressors. The forward process encodes information about $x_0$ across a sequence of latents, and progressively transmitting $x_T, \ldots, x_0$ allows progressive reconstruction via the estimate

$$
\hat{x}_0 = \frac{x_t - \sqrt{1 - \bar{\alpha}_t}\,\epsilon_\theta(x_t, t)}{\sqrt{\bar{\alpha}_t}}
$$

Unlike classical autoregressive decoders (which follow hard coordinate-wise orderings), the DDPM trajectory is continuous in noise space, yielding a generalization of sequential coding. This interpretation enables DDPMs to be analyzed using information-theoretic rate-distortion theory, showing that most of the coding effort is spent on small perceptual details.
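
A minimal NumPy sketch of this progressive estimate; `epsilon_theta` is again a stand-in for a trained noise predictor, and the schedule is the illustrative one used in the earlier sketches.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)

def predict_x0(x_t, t, epsilon_theta):
    """Progressive estimate of the clean sample x_0 recoverable from a noisy x_t."""
    a_bar = alpha_bars[t - 1]
    return (x_t - np.sqrt(1.0 - a_bar) * epsilon_theta(x_t, t)) / np.sqrt(a_bar)

# With a perfect noise predictor, this exactly inverts the closed-form forward marginal.
rng = np.random.default_rng(0)
x0 = rng.standard_normal(8)
eps = rng.standard_normal(8)
t = 400
x_t = np.sqrt(alpha_bars[t - 1]) * x0 + np.sqrt(1.0 - alpha_bars[t - 1]) * eps
print(np.allclose(predict_x0(x_t, t, lambda x, s: eps), x0))  # True
```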

4. Empirical Performance and Benchmark Comparisons

The original DDPM framework achieves state-of-the-art results on image generation benchmarks:

  • Unconditional CIFAR10: Inception Score $9.46 \pm 0.11$, FID $3.17$, and $\leq 3.75$ bits/dim, competitive with GAN and autoregressive models.
  • LSUN 256×256: Sample quality matches or exceeds ProgressiveGAN, though StyleGAN remains superior by FID on some splits.

In summary (unconditional CIFAR10; IS = Inception Score, FID = Fréchet Inception Distance):

| Model | IS | FID |
|---|---|---|
| NCSN (score matching) | 8.87 | 25.32 |
| SNGAN-DDLS (GAN) | 9.09 | 15.42 |
| StyleGAN2 + ADA (conditional) | 9.74 | 3.26 |
| DDPM ($L_{\mathrm{simple}}$) | 9.46 | 3.17 |

DDPMs close the gap with GANs on perceptual metrics, sometimes approaching or outperforming class-conditional models, while retaining a likelihood-based foundation and more stable training.

5. Theoretical Unification and Future Implications

Denoising diffusion probabilistic models reveal a unification across several areas of generative modeling:

  • Connection to latent variable models: The forward/reverse process is a latent variable model with explicit evidence lower bound.
  • Connection to score matching/Langevin dynamics: Training aligns with denoising score matching objectives and sampling implements annealed Langevin dynamics.
  • Autoregressive-like progressive decoding: Sampling can be seen as generalized lossy decompression.

This unified view enables transfer of architectural, algorithmic, and analytical advances between generative modeling paradigms—score-based models, VAEs, and energy-based models.

Future research directions suggested include leveraging more expressive decoders, integrating hybrid energy/autoregressive modules, efficient sampling approaches, and extending the lossy decompression view to modalities beyond images.

6. Summary of Main Mathematical Expressions

  • Forward process:

$$
q(x_t|x_0) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t}\, x_0, (1-\bar{\alpha}_t)I)
$$

  • Reverse transition update (see the sampling sketch after this list):

$$
x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\, \epsilon_\theta(x_t, t)\right) + \sigma_t z,\quad z\sim \mathcal{N}(0,I)
$$

  • Weighted variational objective:

$$
L = \mathbb{E}_{q}\left[ \mathrm{KL}(q(x_T|x_0)\,\|\,p(x_T)) + \sum_{t>1} \mathrm{KL}(q(x_{t-1}|x_t, x_0)\,\|\,p_\theta(x_{t-1}|x_t)) - \log p_\theta(x_0|x_1) \right]
$$

  • Simplified mean squared error loss:

$$
L_{\mathrm{simple}} = \mathbb{E}_{t,x_0,\epsilon}\left[ \|\epsilon - \epsilon_\theta(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon, t)\|^2 \right]
$$
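
Iterating the reverse transition update yields a complete ancestral sampling loop. Below is a minimal NumPy sketch, assuming $\sigma_t^2 = \beta_t$ (one common choice) and the same illustrative schedule as before; `epsilon_theta` is a placeholder for a trained network.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def sample(epsilon_theta, shape, rng):
    """Ancestral sampling: start from x_T ~ N(0, I) and apply the reverse update for t = T, ..., 1."""
    x = rng.standard_normal(shape)  # x_T ~ N(0, I)
    for t in range(T, 0, -1):
        z = rng.standard_normal(shape) if t > 1 else np.zeros(shape)  # no noise at the final step
        coef = (1.0 - alphas[t - 1]) / np.sqrt(1.0 - alpha_bars[t - 1])
        mean = (x - coef * epsilon_theta(x, t)) / np.sqrt(alphas[t - 1])
        sigma = np.sqrt(betas[t - 1])  # sigma_t^2 = beta_t (one common choice)
        x = mean + sigma * z
    return x

rng = np.random.default_rng(0)
epsilon_theta = lambda x, t: np.zeros_like(x)  # placeholder; a trained U-Net in practice
print(sample(epsilon_theta, (8,), rng).round(2))
```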

7. Impact and Ongoing Developments

DDPMs establish a scalable, robust, and theoretically grounded framework for deep generative modeling—demonstrating that likelihood-based models can achieve or surpass GAN-level sample quality with improved mode coverage and training stability. Their foundational ties to score-based methods and variational inference clarify the algorithm's operating principles and suggest broad potential for interpretability, compression, and scalable generation across data modalities. Ongoing research continues to explore more expressive decoders, more efficient sampling, better rate-distortion behavior in the progressive-coding view, and applications to domains beyond images, leveraging the framework's flexibility and principled mathematical structure.