Diffusion Probabilistic Modeling

Updated 25 December 2025
  • Diffusion probabilistic modeling is a deep generative technique that learns to invert a forward Markovian diffusion process, turning structured data into noise and recovering it via a learned reverse process.
  • It employs variational training and denoising score matching to optimize the model, achieving superior sample fidelity and diversity compared to GANs and other generative methods.
  • The approach’s flexible architecture avoids latent variable pitfalls found in VAEs and flows, enabling robust performance in controllable generation and high-resolution synthesis.

Diffusion probabilistic modeling defines a class of deep generative models in which high-dimensional data distributions are learned by reversing a diffusion process that gradually corrupts data into noise through a Markovian sequence of Gaussian transitions. The approach is rooted in nonequilibrium thermodynamics: the forward dynamics are analogous to particles diffusing from structured states (data) to unstructured noise, and a trained model learns to invert this process for sample generation. In contrast to VAEs and flows, where latent variables are of lower or equal dimension to the data, diffusion models operate in the original data space and currently outperform GANs and other generative paradigms in sample fidelity and diversity for several domains, including images, video, and scientific applications (Strümke et al., 2023, Gallon et al., 2 Dec 2024).

1. Mathematical Foundations

Let $x_0$ denote a sample from the unknown data distribution $p_{\mathrm{data}}(x)$ in $\mathbb{R}^d$. The model defines a time-indexed sequence via a discrete-time Markov chain:
$$q(x_{1:T}\mid x_0) = \prod_{t=1}^T q(x_t\mid x_{t-1}),\qquad q(x_t\mid x_{t-1}) = \mathcal{N}\bigl(x_t;\,\sqrt{1-\beta_t}\,x_{t-1},\,\beta_t I\bigr),$$
where $\{\beta_t\}$ is a variance schedule, $\alpha_t = 1-\beta_t$, and $\bar\alpha_t = \prod_{i=1}^t \alpha_i$. By induction,

$$q(x_t\mid x_0) = \mathcal{N}\bigl(x_t;\,\sqrt{\bar\alpha_t}\,x_0,\,(1-\bar\alpha_t)I\bigr).$$

For sufficiently large $T$, $q(x_T)$ approaches $\mathcal{N}(0,I)$, the reference noise prior.
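
As an illustration of this closed form, the following PyTorch sketch draws $x_t \sim q(x_t \mid x_0)$ in a single step; the linear schedule constants and names such as `q_sample` are illustrative, not prescribed by the cited papers.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)        # variance schedule beta_1, ..., beta_T (linear, illustrative)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)    # abar_t = prod_{i<=t} alpha_i

def q_sample(x0: torch.Tensor, t: torch.Tensor, eps: torch.Tensor) -> torch.Tensor:
    """Draw x_t ~ q(x_t | x_0) = N(sqrt(abar_t) x_0, (1 - abar_t) I) in one step."""
    abar = alpha_bars[t].view(-1, *([1] * (x0.dim() - 1)))   # broadcast over data dims
    return abar.sqrt() * x0 + (1.0 - abar).sqrt() * eps

# Example: corrupt a batch of 8 images at random timesteps.
x0 = torch.randn(8, 3, 32, 32)
t = torch.randint(0, T, (8,))
xt = q_sample(x0, t, torch.randn_like(x0))
```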

The reverse process parameterizes the time-reversed sequence via
$$p_\theta(x_{0:T}) = p_{\mathrm{prior}}(x_T) \prod_{t=1}^T p_\theta(x_{t-1}\mid x_t),\qquad p_\theta(x_{t-1}\mid x_t) = \mathcal{N}\bigl(x_{t-1};\,\mu_\theta(x_t,t),\,\Sigma_\theta(x_t,t)\bigr),$$
with $\mu_\theta$ and $\Sigma_\theta$ typically parameterized by neural networks so as to approximate the true reverse kernels.

In the continuous-time limit, the discrete Markov chain is replaced by an Itô SDE:
$$dx = -\tfrac{1}{2}\beta(t)\,x\,dt + \sqrt{\beta(t)}\,dw,$$
with the corresponding reverse-time SDE

$$dx = \left[-\tfrac{1}{2}\beta(t)\,x - \beta(t)\,s_\theta(x, t)\right]dt + \sqrt{\beta(t)}\,d\bar{w},$$

integrated backward from $t=T$ to $t=0$, where $s_\theta(x, t)$ approximates $\nabla_x \log q_t(x)$, the score of the forward process (Strümke et al., 2023, Gallon et al., 2 Dec 2024).
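
As a concrete illustration of these reverse-time dynamics, the sketch below integrates the reverse SDE with a plain Euler–Maruyama scheme. The trained score network `score_model(x, t)`, the linear $\beta(t)$ bounds, and the step count are assumptions made for the example, not prescriptions from the cited works.

```python
import torch

def beta(t: torch.Tensor, beta_min: float = 0.1, beta_max: float = 20.0) -> torch.Tensor:
    """Continuous-time noise scale beta(t), linear in t on [0, 1] (illustrative choice)."""
    return beta_min + t * (beta_max - beta_min)

@torch.no_grad()
def reverse_sde_sample(score_model, shape, n_steps: int = 1000) -> torch.Tensor:
    """Euler-Maruyama integration of the reverse SDE from t = 1 down to t = 0."""
    x = torch.randn(shape)                                   # x_T ~ N(0, I)
    dt = 1.0 / n_steps
    for i in range(n_steps, 0, -1):
        t = torch.full((shape[0],), i / n_steps)
        b = beta(t).view(-1, *([1] * (len(shape) - 1)))      # broadcast over data dims
        score = score_model(x, t)                            # approximates grad_x log q_t(x)
        drift = -0.5 * b * x - b * score                     # reverse-time drift
        x = x - drift * dt + (b * dt).sqrt() * torch.randn_like(x)   # backward Euler-Maruyama step
    return x
```

In practice the noise injection is often omitted at the very last step, and higher-order or adaptive SDE/ODE solvers can replace this basic integrator.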

2. Variational Training and Score Matching

Training optimizes the model parameters by maximizing a variational lower bound (ELBO) on the marginal log-likelihood:
$$\log p_\theta(x_0) = \log \int p_\theta(x_{0:T})\,dx_{1:T} \;\geq\; \mathbb{E}_{q}\!\left[\log\frac{p_{\theta}(x_{0:T})}{q(x_{1:T}\mid x_0)}\right] = -\mathcal{L}_{\mathrm{VLB}},$$
with

$$\mathcal{L}_{\mathrm{VLB}} = \mathbb{E}_q\!\left[-\log p_\theta(x_0\mid x_1) + D_{\mathrm{KL}}\bigl(q(x_T\mid x_0)\,\|\,p_{\mathrm{prior}}(x_T)\bigr) + \sum_{t=2}^T D_{\mathrm{KL}}\bigl(q(x_{t-1}\mid x_t,x_0)\,\|\,p_\theta(x_{t-1}\mid x_t)\bigr)\right].$$

If the reverse variances are matched to the true conditionals ($\Sigma_\theta = \tilde\beta_t I$), each KL term reduces to a weighted mean-squared error between the true noise and its prediction. Dropping the per-step weights yields the denoising score matching objective
$$\mathcal{L}_{\mathrm{simple}} = \mathbb{E}_{t, x_0, \epsilon}\!\left[\lVert\epsilon - \epsilon_\theta(x_t, t)\rVert^2\right],\qquad x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon,$$
which is used for efficient optimization (Strümke et al., 2023, Ho et al., 2020, Gallon et al., 2 Dec 2024).
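
Concretely, the reduction uses the Gaussian forward posterior, which is available in closed form (Ho et al., 2020):

$$q(x_{t-1}\mid x_t, x_0) = \mathcal{N}\bigl(x_{t-1};\,\tilde\mu_t(x_t, x_0),\,\tilde\beta_t I\bigr),\qquad \tilde\mu_t(x_t, x_0) = \frac{\sqrt{\bar\alpha_{t-1}}\,\beta_t}{1-\bar\alpha_t}\,x_0 + \frac{\sqrt{\alpha_t}\,(1-\bar\alpha_{t-1})}{1-\bar\alpha_t}\,x_t.$$

Substituting $x_0 = (x_t - \sqrt{1-\bar\alpha_t}\,\epsilon)/\sqrt{\bar\alpha_t}$ gives $\tilde\mu_t = \frac{1}{\sqrt{\alpha_t}}\bigl(x_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\,\epsilon\bigr)$, so matching $\mu_\theta$ to $\tilde\mu_t$ under fixed variances is equivalent to predicting $\epsilon$, which is exactly what $\mathcal{L}_{\mathrm{simple}}$ trains.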

3. Sampling Mechanisms and Inference

After training, sample generation starts from $x_T \sim \mathcal{N}(0,I)$ and recursively applies the learned reverse transitions:
$$\epsilon_\theta = \epsilon_\theta(x_t, t),\qquad \mu = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar\alpha_t}}\,\epsilon_\theta\right),\qquad x_{t-1} = \mu + \sqrt{\tilde\beta_t}\,z,\quad z \sim \mathcal{N}(0, I),$$
where $\tilde\beta_t = \frac{1-\bar\alpha_{t-1}}{1-\bar\alpha_t}\beta_t$.
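
A minimal PyTorch sketch of this recursion, assuming a trained noise-prediction network `eps_model(x, t)` and the schedule tensors from the sketch in Section 1 (zero-based timestep indexing and the function name are illustrative):

```python
import torch

@torch.no_grad()
def ancestral_sample(eps_model, shape, betas, alphas, alpha_bars):
    """Generate samples by iterating the learned reverse transitions from t = T-1 down to 0."""
    x = torch.randn(shape)                                   # x_T ~ N(0, I)
    for t in range(len(betas) - 1, -1, -1):
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        eps = eps_model(x, t_batch)                          # predicted noise eps_theta(x_t, t)
        mu = (x - (1 - alphas[t]) / (1 - alpha_bars[t]).sqrt() * eps) / alphas[t].sqrt()
        if t > 0:
            tilde_beta = (1 - alpha_bars[t - 1]) / (1 - alpha_bars[t]) * betas[t]
            x = mu + tilde_beta.sqrt() * torch.randn_like(x) # stochastic reverse step
        else:
            x = mu                                           # no noise is added at the final step
    return x
```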

Deterministic alternatives (e.g., DDIM) and SDE solvers (e.g., Euler–Maruyama) are used to reduce the number of steps, accelerate inference, or deterministically map noise to samples (Gallon et al., 2 Dec 2024).
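
For example, the deterministic DDIM update (with no injected noise) first estimates the clean sample from the current noise prediction and then re-noises it to the previous level. The sketch below assumes the same `eps_model` and `alpha_bars` as above and is illustrative rather than a reference implementation.

```python
import torch

@torch.no_grad()
def ddim_step(eps_model, x, t, t_prev, alpha_bars):
    """One deterministic DDIM update from timestep t to t_prev (eta = 0)."""
    eps = eps_model(x, torch.full((x.shape[0],), t, dtype=torch.long))
    x0_hat = (x - (1 - alpha_bars[t]).sqrt() * eps) / alpha_bars[t].sqrt()   # predicted clean sample
    abar_prev = alpha_bars[t_prev] if t_prev >= 0 else torch.tensor(1.0)     # convention: abar = 1 at the boundary
    return abar_prev.sqrt() * x0_hat + (1 - abar_prev).sqrt() * eps

# Iterating ddim_step over a strided subsequence of timesteps (e.g., every 20th t)
# yields a few-step, deterministic mapping from noise to samples.
```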

4. Algorithmic Implementation and Architectures

A typical training loop consists of the following steps (a minimal sketch is given after the list):

  • Drawing a minibatch $x_0 \sim p_{\mathrm{data}}$
  • Sampling $t \in \{1, \ldots, T\}$ and $\epsilon \sim \mathcal{N}(0, I)$
  • Computing $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$
  • Predicting $\hat\epsilon = \epsilon_\theta(x_t, t)$
  • Taking a gradient step on the squared loss $\|\epsilon - \hat\epsilon\|^2$
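
A minimal PyTorch sketch of one such step; `eps_model` stands for any time-conditioned noise predictor and `alpha_bars` for the cumulative products $\bar\alpha_t$ of Section 1 (the names and structure are illustrative):

```python
import torch
import torch.nn.functional as F

def train_step(eps_model, optimizer, x0, alpha_bars):
    """One gradient step on the simplified denoising objective L_simple."""
    T = len(alpha_bars)
    t = torch.randint(0, T, (x0.shape[0],))                  # uniform timestep per example
    eps = torch.randn_like(x0)                               # target noise
    abar = alpha_bars[t].view(-1, *([1] * (x0.dim() - 1)))   # broadcast over data dims
    xt = abar.sqrt() * x0 + (1 - abar).sqrt() * eps          # closed-form forward sample x_t
    loss = F.mse_loss(eps_model(xt, t), eps)                 # ||eps - eps_theta(x_t, t)||^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```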

U-Net backbones (often with group normalization, attention, and Fourier/time positional embeddings) are standard. The choice of noise schedule (linear or cosine) and the number of diffusion steps ($T$) are critical hyperparameters (Gallon et al., 2 Dec 2024, Strümke et al., 2023, Ho et al., 2020).
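
For concreteness, both schedules can be constructed as follows; the linear endpoints and the cosine offset `s` are common illustrative choices rather than values fixed by the cited works.

```python
import math
import torch

def linear_beta_schedule(T: int, beta_start: float = 1e-4, beta_end: float = 0.02) -> torch.Tensor:
    """Linearly spaced variances beta_1, ..., beta_T."""
    return torch.linspace(beta_start, beta_end, T)

def cosine_beta_schedule(T: int, s: float = 0.008) -> torch.Tensor:
    """Betas derived from a cosine-shaped alpha-bar curve, clipped for numerical stability."""
    steps = torch.arange(T + 1, dtype=torch.float64)
    f = torch.cos(((steps / T) + s) / (1 + s) * math.pi / 2) ** 2
    alpha_bars = f / f[0]                                    # abar_t, normalized so abar_0 = 1
    betas = 1 - alpha_bars[1:] / alpha_bars[:-1]             # beta_t = 1 - abar_t / abar_{t-1}
    return betas.clamp(max=0.999).float()
```

The cosine variant decays $\bar\alpha_t$ more gradually early in the chain than the linear one, which changes how much signal survives at small $t$.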

Sampling proceeds via ancestral generation as above or via continuous-time SDE integration depending on the application.

5. Thermodynamic and Theoretical Perspectives

Diffusion probabilistic models are formally motivated by analogy to nonequilibrium thermodynamics: forward noising increases entropy by driving data off the low-entropy data manifold, while the reverse process reconstructs structure by descending the free energy functional of the modeled system. The ELBO can be viewed as a discrete action or free-energy gap. This thermodynamic parallel also governs the stability, path properties, and asymptotic behavior of the learned Markov chain (Strümke et al., 2023, Peter, 2023).

6. Extensions, Applications, and Comparative Analysis

Extensions include:

  • Conditional and classifier-free guidance mechanisms for controllable generation (see the sketch after this list).
  • Latent diffusion, where diffusion operates in a compressed latent space to enable high-resolution synthesis.
  • Contractive models, which improve robustness by enforcing a contraction property on the reverse drift (Tang et al., 23 Jan 2024).
  • Truncated diffusion, where the chain is shortened and terminated before reaching pure noise, with a learned implicit prior for accelerated sampling (Zheng et al., 2022).
  • Field models for generative modeling over arbitrary domains, including non-Euclidean manifolds (Zhuang et al., 2023).
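
As an example of the first item, classifier-free guidance blends conditional and unconditional noise predictions at sampling time. The sketch assumes a conditioning-aware `eps_model(x, t, cond)` trained with random condition dropout (so that `cond=None` yields the unconditional prediction); the signature and scale are illustrative.

```python
import torch

def cfg_eps(eps_model, x, t, cond, guidance_scale: float = 3.0) -> torch.Tensor:
    """Classifier-free guidance: push the prediction toward the conditional direction."""
    eps_uncond = eps_model(x, t, None)                       # unconditional branch
    eps_cond = eps_model(x, t, cond)                         # conditional branch
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```

The guided prediction replaces $\epsilon_\theta(x_t, t)$ in any of the sampling routines above; larger scales trade sample diversity for conditional fidelity.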

Diffusion models have demonstrated superior sample quality and stability compared to GANs and VAEs. Unlike VAEs, which require an explicit encoder and are prone to posterior collapse, diffusion models operate exclusively in data space and typically avoid such collapse. Unlike normalizing flows, diffusion models are not constrained by invertibility or tractable Jacobians, allowing flexible architectures and more expressive mappings (Strümke et al., 2023, Gallon et al., 2 Dec 2024).

Diffusion probabilistic modeling has been successfully applied to high-fidelity image, video, and scientific domains, including protein structure generation, probabilistic forecasting, unsupervised disentanglement, and multimodal generation (Yang et al., 2022, Trippe et al., 2022, Kneissl et al., 6 Oct 2025, Wang et al., 13 Dec 2025, Wu et al., 24 Dec 2024).

7. Summary Table: Core Mechanistic Components

| Phase | Distributional Form | Network Role |
| --- | --- | --- |
| Forward process | $q(x_t\mid x_{t-1}) = \mathcal{N}\bigl(\sqrt{1-\beta_t}\,x_{t-1},\,\beta_t I\bigr)$ | None (fixed stochastic process) |
| Reverse process | $p_\theta(x_{t-1}\mid x_t) = \mathcal{N}\bigl(\mu_\theta(x_t, t),\,\Sigma_\theta(x_t, t)\bigr)$ | Parameterized (learned) by a neural network |
| Training loss | $\lVert\epsilon - \epsilon_\theta(x_t, t)\rVert^2$ | Predicts the (conditional) noise |
| Sampling recursion | $\mu = \frac{1}{\sqrt{\alpha_t}}\bigl(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar\alpha_t}}\,\epsilon_\theta\bigr)$ | Inverts the noising step by step |

Diffusion probabilistic models learn to invert a fixed sequence of noising operations by fitting each reverse transition to its theoretical optimum subject to tractable variational bounds or denoising-score-matching criteria. This framework enables a general and robust class of generative models with superior sample quality and extensibility across a wide array of downstream domains and modalities (Strümke et al., 2023, Gallon et al., 2 Dec 2024, Ho et al., 2020).
