Diffusion Probabilistic Models
- Diffusion probabilistic models are deep generative models characterized by a forward noising and a learned reverse denoising process.
- They employ a Markov chain to progressively corrupt data and then iteratively reconstruct it using neural network–predicted parameters.
- Recent advances focus on variance learning, accelerated sampling, and improved reverse process accuracy for enhanced performance.
Diffusion probabilistic models are a class of deep generative latent variable models built around a Markov chain that is trained to reverse a fixed, iterative noising process. Drawing inspiration from nonequilibrium thermodynamics, these models define a forward (diffusion) process that gradually transforms data into a noise distribution and a learned reverse (denoising) process that reconstructs data from noise. This modeling paradigm has established state-of-the-art performance in high-fidelity image synthesis and has demonstrated versatility across numerous application domains.
1. Model Formulation and Training
Diffusion probabilistic models introduce a forward process that progressively corrupts an observed data point $\mathbf{x}_0 \sim q(\mathbf{x}_0)$ with Gaussian noise over $T$ discrete steps. The forward chain is defined by:

$$q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = \mathcal{N}\!\left(\mathbf{x}_t;\, \sqrt{1-\beta_t}\,\mathbf{x}_{t-1},\, \beta_t \mathbf{I}\right),$$

with a fixed variance schedule $\beta_1, \dots, \beta_T$. This process can be compactly described at any time $t$ by:

$$q(\mathbf{x}_t \mid \mathbf{x}_0) = \mathcal{N}\!\left(\mathbf{x}_t;\, \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0,\, (1-\bar{\alpha}_t)\mathbf{I}\right),$$

where $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$.
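For concreteness, the closed-form marginal $q(\mathbf{x}_t \mid \mathbf{x}_0)$ can be sampled in one shot rather than by simulating the chain step by step. The following PyTorch sketch assumes a linear variance schedule with commonly used endpoints ($10^{-4}$ to $0.02$ over $T = 1000$ steps); the schedule choice and tensor shapes are illustrative assumptions, not part of the formal definition.

```python
import torch

# Illustrative linear variance schedule beta_1 ... beta_T (an assumption here;
# other schedules, e.g. cosine, are equally valid).
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)        # \bar{alpha}_t = prod_{s<=t} alpha_s

def q_sample(x0, t):
    """Draw x_t ~ q(x_t | x_0) = N(sqrt(abar_t) * x0, (1 - abar_t) * I)."""
    eps = torch.randn_like(x0)                   # the noise that will later be predicted
    abar = alpha_bars[t].view(-1, *([1] * (x0.dim() - 1)))   # broadcast over batch dims
    x_t = abar.sqrt() * x0 + (1.0 - abar).sqrt() * eps
    return x_t, eps

# Example: corrupt a batch of toy "images" at an intermediate timestep.
x0 = torch.randn(8, 3, 32, 32)
x_t, eps = q_sample(x0, t=torch.full((8,), 500, dtype=torch.long))
```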
The reverse process is modeled by a Markov chain parameterized as:

$$p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) = \mathcal{N}\!\left(\mathbf{x}_{t-1};\, \boldsymbol{\mu}_\theta(\mathbf{x}_t, t),\, \sigma_t^2 \mathbf{I}\right),$$

with the initial state $\mathbf{x}_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ drawn from a standard normal distribution. The parameterized mean $\boldsymbol{\mu}_\theta$ is typically predicted by a neural network trained to estimate the perturbation added at each diffusion step.
Training proceeds by minimizing a variational bound on the negative log-likelihood of the data:

$$\mathbb{E}\!\left[-\log p_\theta(\mathbf{x}_0)\right] \le \mathbb{E}_q\!\left[-\log \frac{p_\theta(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T} \mid \mathbf{x}_0)}\right] =: L.$$

This bound can be efficiently optimized by a reformulation that leads to a mean squared error loss between the actual noise $\boldsymbol{\epsilon}$ and its predicted value $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)$:

$$L_{\text{simple}}(\theta) = \mathbb{E}_{t,\,\mathbf{x}_0,\,\boldsymbol{\epsilon}}\!\left[\left\lVert \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta\!\left(\sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon},\; t\right)\right\rVert^2\right],$$

where $t$ is sampled uniformly from $\{1, \dots, T\}$ and $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ (2006.11239).
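A minimal training step under the simplified objective could look like the sketch below. Here `eps_model` is a hypothetical placeholder for any noise-prediction network taking `(x_t, t)`, and `alpha_bars` is the cumulative-product schedule from the earlier sketch; this is a sketch of the objective, not a reference implementation.

```python
import torch
import torch.nn.functional as F

def ddpm_simple_loss(eps_model, x0, alpha_bars):
    """L_simple: MSE between the injected noise and the network's prediction of it."""
    alpha_bars = alpha_bars.to(x0.device)
    batch = x0.shape[0]
    # Sample one timestep uniformly per example (t ~ Uniform{0, ..., T-1}).
    t = torch.randint(0, alpha_bars.shape[0], (batch,), device=x0.device)
    eps = torch.randn_like(x0)
    abar = alpha_bars[t].view(batch, *([1] * (x0.dim() - 1)))
    x_t = abar.sqrt() * x0 + (1.0 - abar).sqrt() * eps       # one-shot sample of q(x_t | x_0)
    return F.mse_loss(eps_model(x_t, t), eps)
```

In practice this loss is averaged over minibatches and optimized with a standard stochastic optimizer such as Adam.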
2. Denoising Score Matching and Langevin Dynamics
A key insight is the connection between diffusion models and denoising score matching. By reframing the reverse process to predict the injected noise $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)$, the optimization objective becomes equivalent to score matching at multiple noise levels. This facilitates sampling using an update reminiscent of annealed Langevin dynamics:

$$\mathbf{x}_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)\right) + \sigma_t \mathbf{z},$$

where $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ (with $\mathbf{z} = \mathbf{0}$ at the final step). This connection both provides mathematical justification for the denoising approach and establishes a link to generative modeling based on stochastic differential equations (SDEs) (2006.11239).
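Reading the update above as an algorithm gives the ancestral sampling loop sketched below, reusing `betas`, `alphas`, `alpha_bars`, and the hypothetical `eps_model` from the previous sketches and taking the common fixed choice $\sigma_t^2 = \beta_t$ (one of several valid variance choices).

```python
import torch

@torch.no_grad()
def ddpm_sample(eps_model, shape, betas, alphas, alpha_bars):
    """Iterate the Langevin-like denoising update from t = T-1 down to t = 0."""
    x = torch.randn(shape)                                    # x_T ~ N(0, I)
    for t in reversed(range(betas.shape[0])):
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        eps_hat = eps_model(x, t_batch)
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps_hat) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)  # z = 0 at the last step
        x = mean + torch.sqrt(betas[t]) * noise               # sigma_t^2 = beta_t
    return x
```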
3. Extensions and Recent Developments
Multiple extensions have been introduced to improve and generalize diffusion probabilistic models:
- Variance Learning and Noise Scheduling: The "Improved DDPM" approach learns the reverse process variance through a parameterized interpolation and employs a cosine noise schedule for improved sample quality and log-likelihoods. Hybrid training objectives combining the simplified noise-prediction loss with the variational bound further stabilize and enhance performance (2102.09672).
- Sampling Acceleration: Continuous-time and step-skipping strategies (e.g., DDIM, FastDPM) reduce the number of reverse steps required to generate high-quality samples by recasting the process in continuous time and reparameterizing the sampling schedule, often without retraining (2106.00132); a minimal step-skipping sketch follows this list.
- Truncated and Adversarial Variants: Truncated DPMs stop the forward process before it reaches pure noise and use an implicit, learnable prior to match the intermediate noisy state. This yields faster inference with equivalent or superior sample quality and can be cast as a variant of adversarial auto-encoders (2202.09671).
- Contractive Reverse Processes: The contractive DPM (CDPM) framework imposes contraction on the reverse process, which controls and limits the accumulation of errors arising from score estimation mismatch and numerical discretization. Analysis and experiments show that this improves robustness and sample fidelity (2401.13115).
- Optimal Covariance Matching: New techniques enable more accurate estimation of the denoising covariance using analytic identities and direct regression of the diagonal Hessian under the score, which improves sample efficiency especially when using fewer sampling steps (2406.10808).
- Discrete and Unified Token Modeling: Approaches such as RDPM (Recurrent DPM) transition the denoising process into discrete token domains, using recurrent prediction and cross-entropy loss for accelerated and potentially unified multimodal generation (2412.18390).
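To make the step-skipping idea in the sampling-acceleration item concrete, the sketch below shows a deterministic DDIM-style update (the $\eta = 0$ case) that visits only a strided subsequence of the original timesteps. It reuses `alpha_bars` and the hypothetical `eps_model` from the earlier sketches and is a simplified illustration, not a reproduction of any particular implementation.

```python
import torch

@torch.no_grad()
def ddim_sample(eps_model, shape, alpha_bars, num_steps=50):
    """Deterministic (eta = 0) DDIM-style sampling over a strided subset of timesteps."""
    T = alpha_bars.shape[0]
    timesteps = torch.linspace(T - 1, 0, num_steps).long()   # e.g. 50 of the original 1000 steps
    x = torch.randn(shape)                                    # start from pure noise
    for i, t in enumerate(timesteps):
        t_batch = torch.full((shape[0],), int(t), dtype=torch.long)
        eps_hat = eps_model(x, t_batch)
        abar_t = alpha_bars[t]
        # Predict x_0 from the current state, then jump directly to the next kept timestep.
        x0_pred = (x - torch.sqrt(1.0 - abar_t) * eps_hat) / torch.sqrt(abar_t)
        if i + 1 < len(timesteps):
            abar_prev = alpha_bars[timesteps[i + 1]]
            x = torch.sqrt(abar_prev) * x0_pred + torch.sqrt(1.0 - abar_prev) * eps_hat
        else:
            x = x0_pred
    return x
```

Because the update is deterministic given the trained network, only the traversal of the noise schedule changes, which is why such acceleration typically requires no retraining.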
4. Interpretations and Progressive Decoding
The sequential reverse process in diffusion models induces a progressive lossy decompression, in which global structure is generated first, followed by finer details. This can be viewed as a generalization of autoregressive decoding that, unlike conventional autoregressive models, does not impose a fixed sequential order on data dimensions. Each reverse step can be interpreted as incrementally refining the generated sample, with early steps recovering coarse structure and later steps filling in high-frequency details (2006.11239).
5. Empirical Performance and Scalability
Diffusion probabilistic models achieve state-of-the-art results across image, audio, and other data modalities. For instance, unconditional models on CIFAR10 yield Inception Scores around 9.46 and FID scores as low as 3.17; generated images on LSUN datasets are competitive with ProgressiveGAN. The models scale predictably with compute, showing continual improvements in sample quality and likelihood with increased model capacity and training resources (2006.11239, 2102.09672).
Comparative experiments demonstrate that diffusion models achieve higher recall and better coverage of the target distribution's modes compared to GANs, highlighting their strength in modeling distributional diversity (2102.09672). In practical deployment, learning the reverse variance and employing cosine schedules allow for a substantial reduction in sampling steps without significant quality loss.
6. Applications and Broader Impact
Diffusion probabilistic models have wide-ranging applications, including:
- Image synthesis and super-resolution: High-fidelity image generation, inpainting, and enhancement.
- 3D data and medical imaging: Synthesis of volumetric MRI and CT scans, as well as improved downstream task performance through data augmentation (2211.03364).
- Probabilistic time series forecasting: Scenario generation for load, wind, and PV energy forecasting, with competitive or superior results versus GANs, VAEs, and normalizing flows (2212.02977).
- Efficient generative pipelines: Patch-based and upsampling DPMs for reduced memory and compute requirements (2304.07087, 2305.16269).
- Scientific domains and biomolecular modeling: Protein backbone and sequence generation, with equivariant networks and manifold-based diffusion processes (2406.01622).
- Counterfactual and causal reasoning: Latent-space structured diffusion for controlled and semantic generation via causal interventions (2404.17735).
- Probabilistic programming and Bayesian inference: Flexible variational inference in probabilistic programming languages via diffusion model approximations (2311.00474).
- Inverse problems and imaging: Compressive SAR imaging and data-driven reconstruction using guided DPMs (2504.17053).
The flexibility of diffusion models to be conditioned on external information, such as class labels or textual descriptions, enhances their applicability in controlled generation tasks.
7. Theoretical Foundations and Future Directions
The theoretical basis of diffusion probabilistic models lies in nonequilibrium thermodynamics, Markov processes, and connections to stochastic differential equations. Scale-space theory provides a mathematical framework for understanding information degradation and recovery in these models (2309.08511).
Research challenges include:
- Developing sharper theoretical analyses of information loss and reverse process properties.
- Improving acceleration and efficiency to make deployment tractable for real-time and resource-constrained environments.
- Designing conditioning and guidance mechanisms for interpretability and application to new data modalities.
- Integrating diffusion models into broader probabilistic modeling and inference frameworks, enhancing their versatility.
The summary table below provides a concise overview of the core formulation.
| Process | Formula | Description |
|---|---|---|
| Forward diffusion | $q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = \mathcal{N}(\mathbf{x}_t; \sqrt{1-\beta_t}\,\mathbf{x}_{t-1}, \beta_t \mathbf{I})$ | Adds Gaussian noise at each step |
| Cumulative forward | $q(\mathbf{x}_t \mid \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_t; \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0, (1-\bar{\alpha}_t)\mathbf{I})$ | Noise relative to the original data |
| Reverse process | $p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) = \mathcal{N}(\mathbf{x}_{t-1}; \boldsymbol{\mu}_\theta(\mathbf{x}_t, t), \sigma_t^2 \mathbf{I})$ | Learned denoising transitions |
| Training loss | $L_{\text{simple}} = \mathbb{E}_{t,\mathbf{x}_0,\boldsymbol{\epsilon}}\big[\lVert \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)\rVert^2\big]$ | Noise prediction objective |
Diffusion probabilistic models synthesize ideas from variational inference, stochastic processes, score-based modeling, and deep neural architectures; this combination underpins their success in generative modeling and provides a foundation for ongoing research and new applications (2006.11239, 2102.09672, 2106.00132, 2202.09671, 2309.08511, 2401.13115, 2412.18390, 2503.21555, 2504.17053).