Diffusion Probabilistic Modeling

Updated 25 December 2025
  • Diffusion probabilistic modeling is a deep generative technique that learns to invert a forward Markovian diffusion process, turning structured data into noise and recovering it via a learned reverse process.
  • It employs variational training and denoising score matching to optimize the model, achieving superior sample fidelity and diversity compared to GANs and other generative methods.
  • The approach’s flexible architecture avoids latent variable pitfalls found in VAEs and flows, enabling robust performance in controllable generation and high-resolution synthesis.

Diffusion probabilistic modeling defines a class of deep generative models in which high-dimensional data distributions are learned by reversing a diffusion process that gradually corrupts data into noise through a Markovian sequence of Gaussian transitions. The approach is rooted in nonequilibrium thermodynamics: the forward dynamics are analogous to particles diffusing from structured states (data) to unstructured noise, and a trained model learns to invert this process for sample generation. In contrast to VAEs and flows, where latent variables are of lower or equal dimension to the data, diffusion models operate in the original data space and currently outperform GANs and other generative paradigms in sample fidelity and diversity for several domains, including images, video, and scientific applications (Strümke et al., 2023, Gallon et al., 2 Dec 2024).

1. Mathematical Foundations

Let $x_0$ denote a sample from the unknown data distribution $p_{\mathrm{data}}(x)$ in $\mathbb{R}^d$. The model defines a time-indexed sequence via a discrete-time Markov chain:
$$q(x_{1:T}\mid x_0) = \prod_{t=1}^T q(x_t\mid x_{t-1}),\qquad q(x_t\mid x_{t-1}) = \mathcal{N}\bigl(x_t;\,\sqrt{1-\beta_t}\,x_{t-1},\,\beta_t I\bigr),$$
where $\{\beta_t\}$ is a variance schedule, $\alpha_t = 1-\beta_t$, and $\bar\alpha_t = \prod_{i=1}^t \alpha_i$. By induction,

$$q(x_t\mid x_0) = \mathcal{N}\bigl(x_t;\,\sqrt{\bar\alpha_t}\,x_0,\,(1-\bar\alpha_t)I\bigr).$$

For sufficiently large $T$, $q(x_T)$ approaches $\mathcal{N}(0,I)$, the reference noise prior.
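
As an illustration of this closed form, the following PyTorch sketch draws $x_t \sim q(x_t \mid x_0)$ in a single step; the linear schedule constants and names such as `q_sample` are illustrative, not prescribed by the cited papers.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)        # variance schedule beta_1, ..., beta_T (linear, illustrative)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)    # abar_t = prod_{i<=t} alpha_i

def q_sample(x0: torch.Tensor, t: torch.Tensor, eps: torch.Tensor) -> torch.Tensor:
    """Draw x_t ~ q(x_t | x_0) = N(sqrt(abar_t) x_0, (1 - abar_t) I) in one step."""
    abar = alpha_bars[t].view(-1, *([1] * (x0.dim() - 1)))   # broadcast over data dims
    return abar.sqrt() * x0 + (1.0 - abar).sqrt() * eps

# Example: corrupt a batch of 8 images at random timesteps.
x0 = torch.randn(8, 3, 32, 32)
t = torch.randint(0, T, (8,))
xt = q_sample(x0, t, torch.randn_like(x0))
```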

The reverse process parameterizes the time-reversed sequence via
$$p_\theta(x_{0:T}) = p_{\mathrm{prior}}(x_T) \prod_{t=1}^T p_\theta(x_{t-1}\mid x_t),\qquad p_\theta(x_{t-1}\mid x_t) = \mathcal{N}\bigl(x_{t-1};\,\mu_\theta(x_t,t),\,\Sigma_\theta(x_t,t)\bigr),$$
with $\mu_\theta$ and $\Sigma_\theta$ typically parameterized by neural networks so as to approximate the true reverse kernels.

In the continuous-time limit, the discrete Markov chain is replaced by an Itô SDE:
$$dx = -\tfrac{1}{2}\beta(t)\,x\,dt + \sqrt{\beta(t)}\,dw,$$
with the corresponding reverse-time SDE

$$dx = \left[-\tfrac{1}{2}\beta(t)\,x - \beta(t)\,s_\theta(x, t)\right]dt + \sqrt{\beta(t)}\,d\bar{w},$$

integrated backward from $t=T$ to $t=0$, where $s_\theta(x, t)$ approximates $\nabla_x \log q_t(x)$, the score of the forward process (Strümke et al., 2023, Gallon et al., 2 Dec 2024).
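
As a concrete illustration of these reverse-time dynamics, the sketch below integrates the reverse SDE with a plain Euler–Maruyama scheme. The trained score network `score_model(x, t)`, the linear $\beta(t)$ bounds, and the step count are assumptions made for the example, not prescriptions from the cited works.

```python
import torch

def beta(t: torch.Tensor, beta_min: float = 0.1, beta_max: float = 20.0) -> torch.Tensor:
    """Continuous-time noise scale beta(t), linear in t on [0, 1] (illustrative choice)."""
    return beta_min + t * (beta_max - beta_min)

@torch.no_grad()
def reverse_sde_sample(score_model, shape, n_steps: int = 1000) -> torch.Tensor:
    """Euler-Maruyama integration of the reverse SDE from t = 1 down to t = 0."""
    x = torch.randn(shape)                                   # x_T ~ N(0, I)
    dt = 1.0 / n_steps
    for i in range(n_steps, 0, -1):
        t = torch.full((shape[0],), i / n_steps)
        b = beta(t).view(-1, *([1] * (len(shape) - 1)))      # broadcast over data dims
        score = score_model(x, t)                            # approximates grad_x log q_t(x)
        drift = -0.5 * b * x - b * score                     # reverse-time drift
        x = x - drift * dt + (b * dt).sqrt() * torch.randn_like(x)   # backward Euler-Maruyama step
    return x
```

In practice the noise injection is often omitted at the very last step, and higher-order or adaptive SDE/ODE solvers can replace this basic integrator.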

2. Variational Training and Score Matching

Training optimizes the model parameters by maximizing a variational lower bound (ELBO) on the marginal log-likelihood:
$$\log p_\theta(x_0) = \log \int p_\theta(x_{0:T})\,dx_{1:T} \;\geq\; \mathbb{E}_{q}\!\left[\log\frac{p_{\theta}(x_{0:T})}{q(x_{1:T}\mid x_0)}\right] = -\mathcal{L}_{\mathrm{VLB}},$$
with

$$\mathcal{L}_{\mathrm{VLB}} = \mathbb{E}_q\!\left[-\log p_\theta(x_0\mid x_1) + D_{\mathrm{KL}}\bigl(q(x_T\mid x_0)\,\|\,p_{\mathrm{prior}}(x_T)\bigr) + \sum_{t=2}^T D_{\mathrm{KL}}\bigl(q(x_{t-1}\mid x_t,x_0)\,\|\,p_\theta(x_{t-1}\mid x_t)\bigr)\right].$$

If the reverse variances are matched to the true conditionals ($\Sigma_\theta = \tilde\beta_t I$), each KL term reduces to a weighted mean-squared error between the true noise and its prediction. Dropping the per-step weights yields the denoising score matching objective
$$\mathcal{L}_{\mathrm{simple}} = \mathbb{E}_{t, x_0, \epsilon}\!\left[\lVert\epsilon - \epsilon_\theta(x_t, t)\rVert^2\right],\qquad x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon,$$
which is used for efficient optimization (Strümke et al., 2023, Ho et al., 2020, Gallon et al., 2 Dec 2024).
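
Concretely, the reduction uses the Gaussian forward posterior, which is available in closed form (Ho et al., 2020):

$$q(x_{t-1}\mid x_t, x_0) = \mathcal{N}\bigl(x_{t-1};\,\tilde\mu_t(x_t, x_0),\,\tilde\beta_t I\bigr),\qquad \tilde\mu_t(x_t, x_0) = \frac{\sqrt{\bar\alpha_{t-1}}\,\beta_t}{1-\bar\alpha_t}\,x_0 + \frac{\sqrt{\alpha_t}\,(1-\bar\alpha_{t-1})}{1-\bar\alpha_t}\,x_t.$$

Substituting $x_0 = (x_t - \sqrt{1-\bar\alpha_t}\,\epsilon)/\sqrt{\bar\alpha_t}$ gives $\tilde\mu_t = \frac{1}{\sqrt{\alpha_t}}\bigl(x_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\,\epsilon\bigr)$, so matching $\mu_\theta$ to $\tilde\mu_t$ under fixed variances is equivalent to predicting $\epsilon$, which is exactly what $\mathcal{L}_{\mathrm{simple}}$ trains.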

3. Sampling Mechanisms and Inference

After training, sample generation starts from $x_T \sim \mathcal{N}(0,I)$ and recursively applies the learned reverse transitions:
$$\epsilon_\theta = \epsilon_\theta(x_t, t),\qquad \mu = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar\alpha_t}}\,\epsilon_\theta\right),\qquad x_{t-1} = \mu + \sqrt{\tilde\beta_t}\,z,\quad z \sim \mathcal{N}(0, I),$$
where $\tilde\beta_t = \frac{1-\bar\alpha_{t-1}}{1-\bar\alpha_t}\beta_t$.
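
A minimal PyTorch sketch of this recursion, assuming a trained noise-prediction network `eps_model(x, t)` and the schedule tensors from the sketch in Section 1 (zero-based timestep indexing and the function name are illustrative):

```python
import torch

@torch.no_grad()
def ancestral_sample(eps_model, shape, betas, alphas, alpha_bars):
    """Generate samples by iterating the learned reverse transitions from t = T-1 down to 0."""
    x = torch.randn(shape)                                   # x_T ~ N(0, I)
    for t in range(len(betas) - 1, -1, -1):
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        eps = eps_model(x, t_batch)                          # predicted noise eps_theta(x_t, t)
        mu = (x - (1 - alphas[t]) / (1 - alpha_bars[t]).sqrt() * eps) / alphas[t].sqrt()
        if t > 0:
            tilde_beta = (1 - alpha_bars[t - 1]) / (1 - alpha_bars[t]) * betas[t]
            x = mu + tilde_beta.sqrt() * torch.randn_like(x) # stochastic reverse step
        else:
            x = mu                                           # no noise is added at the final step
    return x
```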

Deterministic alternatives (e.g., DDIM) and SDE solvers (e.g., Euler–Maruyama) are used to reduce the number of steps, accelerate inference, or deterministically map noise to samples (Gallon et al., 2 Dec 2024).
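
For example, the deterministic DDIM update (with no injected noise) first estimates the clean sample from the current noise prediction and then re-noises it to the previous level. The sketch below assumes the same `eps_model` and `alpha_bars` as above and is illustrative rather than a reference implementation.

```python
import torch

@torch.no_grad()
def ddim_step(eps_model, x, t, t_prev, alpha_bars):
    """One deterministic DDIM update from timestep t to t_prev (eta = 0)."""
    eps = eps_model(x, torch.full((x.shape[0],), t, dtype=torch.long))
    x0_hat = (x - (1 - alpha_bars[t]).sqrt() * eps) / alpha_bars[t].sqrt()   # predicted clean sample
    abar_prev = alpha_bars[t_prev] if t_prev >= 0 else torch.tensor(1.0)     # convention: abar = 1 at the boundary
    return abar_prev.sqrt() * x0_hat + (1 - abar_prev).sqrt() * eps

# Iterating ddim_step over a strided subsequence of timesteps (e.g., every 20th t)
# yields a few-step, deterministic mapping from noise to samples.
```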

4. Algorithmic Implementation and Architectures

A typical training loop consists of the following steps (a minimal sketch is given after the list):

  • Drawing a minibatch $x_0 \sim p_{\mathrm{data}}$
  • Sampling $t \in \{1, \ldots, T\}$ and $\epsilon \sim \mathcal{N}(0, I)$
  • Computing $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$
  • Predicting $\hat\epsilon = \epsilon_\theta(x_t, t)$
  • Taking a gradient step on the squared loss $\|\epsilon - \hat\epsilon\|^2$
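
A minimal PyTorch sketch of one such step; `eps_model` stands for any time-conditioned noise predictor and `alpha_bars` for the cumulative products $\bar\alpha_t$ of Section 1 (the names and structure are illustrative):

```python
import torch
import torch.nn.functional as F

def train_step(eps_model, optimizer, x0, alpha_bars):
    """One gradient step on the simplified denoising objective L_simple."""
    T = len(alpha_bars)
    t = torch.randint(0, T, (x0.shape[0],))                  # uniform timestep per example
    eps = torch.randn_like(x0)                               # target noise
    abar = alpha_bars[t].view(-1, *([1] * (x0.dim() - 1)))   # broadcast over data dims
    xt = abar.sqrt() * x0 + (1 - abar).sqrt() * eps          # closed-form forward sample x_t
    loss = F.mse_loss(eps_model(xt, t), eps)                 # ||eps - eps_theta(x_t, t)||^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```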

U-Net backbones (often with group normalization, attention, and Fourier/time positional embeddings) are standard. The choice of noise schedule (linear or cosine) and the number of diffusion steps ($T$) are critical hyperparameters (Gallon et al., 2 Dec 2024, Strümke et al., 2023, Ho et al., 2020).
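
For concreteness, both schedules can be constructed as follows; the linear endpoints and the cosine offset `s` are common illustrative choices rather than values fixed by the cited works.

```python
import math
import torch

def linear_beta_schedule(T: int, beta_start: float = 1e-4, beta_end: float = 0.02) -> torch.Tensor:
    """Linearly spaced variances beta_1, ..., beta_T."""
    return torch.linspace(beta_start, beta_end, T)

def cosine_beta_schedule(T: int, s: float = 0.008) -> torch.Tensor:
    """Betas derived from a cosine-shaped alpha-bar curve, clipped for numerical stability."""
    steps = torch.arange(T + 1, dtype=torch.float64)
    f = torch.cos(((steps / T) + s) / (1 + s) * math.pi / 2) ** 2
    alpha_bars = f / f[0]                                    # abar_t, normalized so abar_0 = 1
    betas = 1 - alpha_bars[1:] / alpha_bars[:-1]             # beta_t = 1 - abar_t / abar_{t-1}
    return betas.clamp(max=0.999).float()
```

The cosine variant decays $\bar\alpha_t$ more gradually early in the chain than the linear one, which changes how much signal survives at small $t$.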

Sampling proceeds via ancestral generation as above or via continuous-time SDE integration depending on the application.

5. Thermodynamic and Theoretical Perspectives

Diffusion probabilistic models are formally motivated by analogy to nonequilibrium thermodynamics: forward noising increases entropy by driving data off the low-entropy data manifold, while the reverse process reconstructs structure by descending the free energy functional of the modeled system. The ELBO can be viewed as a discrete action or free-energy gap. This thermodynamic parallel also governs the stability, path properties, and asymptotic behavior of the learned Markov chain (Strümke et al., 2023, Peter, 2023).

6. Extensions, Applications, and Comparative Analysis

Extensions include:

  • Conditional and classifier-free guidance mechanisms for controllable generation (see the sketch after this list).
  • Latent diffusion, where diffusion operates in a compressed latent space to enable high-resolution synthesis.
  • Contractive models, which improve robustness by enforcing a contraction property on the reverse drift (Tang et al., 23 Jan 2024).
  • Truncated diffusion, where the chain is shortened and terminated before reaching pure noise, with a learned implicit prior for accelerated sampling (Zheng et al., 2022).
  • Field models for generative modeling over arbitrary domains, including non-Euclidean manifolds (Zhuang et al., 2023).
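
As an example of the first item, classifier-free guidance blends conditional and unconditional noise predictions at sampling time. The sketch assumes a conditioning-aware `eps_model(x, t, cond)` trained with random condition dropout (so that `cond=None` yields the unconditional prediction); the signature and scale are illustrative.

```python
import torch

def cfg_eps(eps_model, x, t, cond, guidance_scale: float = 3.0) -> torch.Tensor:
    """Classifier-free guidance: push the prediction toward the conditional direction."""
    eps_uncond = eps_model(x, t, None)                       # unconditional branch
    eps_cond = eps_model(x, t, cond)                         # conditional branch
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```

The guided prediction replaces $\epsilon_\theta(x_t, t)$ in any of the sampling routines above; larger scales trade sample diversity for conditional fidelity.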

Diffusion models have demonstrated superior sample quality and stability compared to GANs and VAEs. Unlike VAEs, which require an explicit encoder and are prone to posterior collapse, diffusion models operate exclusively in data space and typically avoid such collapse. Unlike normalizing flows, diffusion models are not constrained by invertibility or tractable Jacobians, allowing flexible architectures and more expressive mappings (Strümke et al., 2023, Gallon et al., 2 Dec 2024).

Diffusion probabilistic modeling has been successfully applied to high-fidelity image, video, and scientific domains, including protein structure generation, probabilistic forecasting, unsupervised disentanglement, and multimodal generation (Yang et al., 2022, Trippe et al., 2022, Kneissl et al., 6 Oct 2025, Wang et al., 13 Dec 2025, Wu et al., 24 Dec 2024).

7. Summary Table: Core Mechanistic Components

| Phase | Distributional Form | Network Role |
| --- | --- | --- |
| Forward process | $q(x_t\mid x_{t-1}) = \mathcal{N}\bigl(\sqrt{1-\beta_t}\,x_{t-1},\,\beta_t I\bigr)$ | None (fixed stochastic process) |
| Reverse process | $p_\theta(x_{t-1}\mid x_t) = \mathcal{N}\bigl(\mu_\theta(x_t, t),\,\Sigma_\theta(x_t, t)\bigr)$ | Parameterized (learned) by a neural network |
| Training loss | $\lVert\epsilon - \epsilon_\theta(x_t, t)\rVert^2$ | Predicts the (conditional) noise |
| Sampling recursion | $\mu = \frac{1}{\sqrt{\alpha_t}}\bigl(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar\alpha_t}}\,\epsilon_\theta\bigr)$ | Inverts the noising step by step |

Diffusion probabilistic models learn to invert a fixed sequence of noising operations by fitting each reverse transition to its theoretical optimum subject to tractable variational bounds or denoising-score-matching criteria. This framework enables a general and robust class of generative models with superior sample quality and extensibility across a wide array of downstream domains and modalities (Strümke et al., 2023, Gallon et al., 2 Dec 2024, Ho et al., 2020).
