Diffusion Generative Models Overview
- Diffusion generative models are deep probabilistic techniques that generate data by inverting an iterative noise-corruption process inspired by thermodynamic diffusion.
- They employ a forward noising process and a learned reverse denoising chain, enabling high-fidelity synthesis across images, audio, and molecular data.
- Their training leverages variational lower bounds and score-matching losses, ensuring stable optimization and adaptability to diverse high-dimensional domains.
Diffusion generative models are a class of deep probabilistic models that synthesize data by inverting a progressive noise-corruption process, leveraging a close analogy to non-equilibrium thermodynamics. Their core design is built around two coupled stochastic processes: a forward ("diffusion" or "noising") chain that incrementally destroys structure in the data, and a learned reverse ("denoising" or "generative") process that reconstructs high-fidelity samples by sequentially removing noise. The success of diffusion models in image, audio, molecular, and multimodal generation tasks is rooted in rigorous variational learning principles, stable training even in high dimensions, and a broad array of algorithmic enhancements that allow the basic framework to adapt to numerous data domains and operational constraints (Torre, 2023; Gallon et al., 2024; Cao et al., 2022; Higham et al., 2023).
1. Thermodynamic Analogy and Fundamental Processes
Diffusion models are motivated by the analogy to physical diffusion in non-equilibrium thermodynamics, in which two miscible fluids mix and form a homogeneous solution as entropy increases. In generative modeling, the forward process is an engineered Markov chain that adds small amounts of Gaussian noise at each discrete or continuous step:
$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\right),$$
with noise variances $\beta_1, \dots, \beta_T$, where each $\beta_t \ll 1$ so that the reverse transitions remain approximately Gaussian (near-reversibility). This gradually transforms complex data distributions into nearly isotropic Gaussian noise (Torre, 2023; Higham et al., 2023).
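Because the forward chain is Gaussian, $x_t$ can be sampled directly from $x_0$ in closed form, $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$ with $\bar\alpha_t = \prod_{s\le t}(1-\beta_s)$. A minimal NumPy sketch, assuming a standard linear $\beta$-schedule (the function name and schedule values are illustrative, not from the source):

```python
import numpy as np

def forward_diffuse(x0, t, betas, rng):
    """Sample x_t ~ q(x_t | x_0) in closed form:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    alpha_bar = np.cumprod(1.0 - betas)[t]
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps

# Illustrative linear schedule over T = 1000 steps.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
rng = np.random.default_rng(0)

x0 = rng.standard_normal(64)                 # toy "data" vector
xT = forward_diffuse(x0, T - 1, betas, rng)  # heavily noised sample
alpha_bar_T = np.cumprod(1.0 - betas)[-1]    # ~4e-5: x_T is essentially pure noise
```

The near-zero terminal $\bar\alpha_T$ is what justifies initializing generation from an isotropic Gaussian.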
In the reverse direction, a neural network parameterizes the denoising kernel
$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right),$$
where $\mu_\theta$ is often defined through explicit noise-prediction,
$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\,\epsilon_\theta(x_t, t)\right), \qquad \alpha_t = 1-\beta_t,\quad \bar\alpha_t = \prod_{s=1}^{t}\alpha_s,$$
and $\epsilon_\theta$ predicts the noise added in the forward process (Torre, 2023; Scotta et al., 2023).
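One step of this reverse chain can be sketched as follows; here a trained network would supply `eps_pred`, and the variance choice $\Sigma_t = \beta_t I$ is one common convention (the function name is illustrative):

```python
import numpy as np

def ddpm_step(x_t, t, eps_pred, betas, rng):
    """One reverse step x_{t-1} ~ p_theta(x_{t-1} | x_t), with the mean
    computed from the predicted noise eps_pred = eps_theta(x_t, t)."""
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    mean = (x_t - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps_pred) / np.sqrt(alphas[t])
    if t == 0:
        return mean                       # no noise is added at the final step
    sigma = np.sqrt(betas[t])             # simple choice: Sigma_t = beta_t * I
    return mean + sigma * rng.standard_normal(x_t.shape)

betas = np.linspace(1e-4, 0.02, 1000)
rng = np.random.default_rng(1)
x = rng.standard_normal(8)
# A zero "network" stands in for eps_theta here; a trained model replaces it.
x_prev = ddpm_step(x, 999, np.zeros_like(x), betas, rng)
```

Full sampling iterates this step from $t = T-1$ down to $t = 0$, starting from $x_T \sim \mathcal{N}(0, I)$.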
In the continuous limit, the forward SDE is
$$dx = f(x, t)\,dt + g(t)\,dw,$$
and the reverse SDE is
$$dx = \left[f(x, t) - g(t)^2\,\nabla_x \log p_t(x)\right]dt + g(t)\,d\bar{w},$$
where $\bar{w}$ is a reverse-time Wiener process. The "score function" $\nabla_x \log p_t(x)$ is estimated by a neural network $s_\theta(x, t)$ (Torre, 2023; Scotta et al., 2023).
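The reverse SDE can be integrated numerically with Euler–Maruyama. The sketch below makes simplifying assumptions not stated in the source: $f = 0$, $g = 1$, and a toy Gaussian data distribution whose exact score $-x/(1+t)$ is known in closed form, so no learned network is needed to check that the sampler recovers the data law:

```python
import numpy as np

def reverse_sde_sample(score, T=5.0, n_steps=500, n_samples=2000, seed=0):
    """Euler-Maruyama integration of the reverse SDE
    dx = [f(x,t) - g(t)^2 * score(x,t)] dt + g(t) d(w_bar),
    specialized to f = 0 and g = 1, run backward from t = T to t = 0."""
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    x = rng.standard_normal(n_samples) * np.sqrt(1.0 + T)  # x_T ~ p_T = N(0, 1+T)
    t = T
    for _ in range(n_steps):
        drift = -score(x, t)                   # f - g^2 * score, with f = 0, g = 1
        x = x - drift * dt + np.sqrt(dt) * rng.standard_normal(n_samples)
        t -= dt
    return x

# Toy check: data ~ N(0,1) and forward dx = dw give p_t = N(0, 1+t),
# so the exact score is -x / (1 + t); samples should return to N(0, 1).
samples = reverse_sde_sample(lambda x, t: -x / (1.0 + t))
```

In practice the analytic score is replaced by the learned $s_\theta(x, t)$, and more elaborate predictor-corrector or ODE-based integrators are used.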
2. Variational Training Objective and Score Matching
Diffusion models maximize likelihood via a variational lower bound (ELBO), which decomposes into a sum of KL divergences measuring the discrepancy between the true (forward) and learned (reverse) transitions:
$$\mathcal{L} = \mathbb{E}_q\!\left[\sum_{t=2}^{T} D_{\mathrm{KL}}\!\left(q(x_{t-1} \mid x_t, x_0)\,\|\,p_\theta(x_{t-1} \mid x_t)\right) - \log p_\theta(x_0 \mid x_1)\right] + \text{const}.$$
For the Gaussian case, these KL terms reduce to mean-squared-error losses on the predicted noise:
$$L_{\text{simple}} = \mathbb{E}_{t,\,x_0,\,\epsilon}\!\left[\left\|\epsilon - \epsilon_\theta(x_t, t)\right\|^2\right].$$
This loss can be interpreted equivalently as denoising score matching (Torre, 2023; Scotta et al., 2023; Higham et al., 2023; Gallon et al., 2024). In continuous time, the ELBO and the score-matching loss are linked through the Fokker–Planck equation governing the evolution of the marginal densities $p_t$ (Torre, 2023; Ding et al., 2024).
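The simplified objective is cheap to estimate by Monte Carlo: sample a timestep, noise a clean batch in closed form, and regress the injected noise. A sketch with a hypothetical `simple_loss` helper (the untrained stand-in network below always predicts zero, so its loss is just $\mathbb{E}\|\epsilon\|^2 \approx 1$ per coordinate):

```python
import numpy as np

def simple_loss(eps_model, x0_batch, betas, rng):
    """Monte-Carlo estimate of L_simple = E_{t, x0, eps} ||eps - eps_theta(x_t, t)||^2."""
    T = len(betas)
    alpha_bar = np.cumprod(1.0 - betas)
    t = rng.integers(0, T, size=len(x0_batch))        # random timestep per example
    eps = rng.standard_normal(x0_batch.shape)         # the noise to be predicted
    a = np.sqrt(alpha_bar[t])[:, None]
    b = np.sqrt(1.0 - alpha_bar[t])[:, None]
    x_t = a * x0_batch + b * eps                      # closed-form forward noising
    return np.mean((eps - eps_model(x_t, t)) ** 2)

betas = np.linspace(1e-4, 0.02, 1000)
rng = np.random.default_rng(2)
x0_batch = rng.standard_normal((256, 16))
# Zero predictor as a baseline; a trained eps_theta would drive this well below 1.
loss = simple_loss(lambda x_t, t: np.zeros_like(x_t), x0_batch, betas, rng)
```

Training consists of minimizing this quantity over the parameters of $\epsilon_\theta$ with stochastic gradient descent.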
3. Noise Schedules, Sampling Algorithms, and Inference
The choice of noise schedule is