
Diffusion and Generative Modeling

Updated 15 February 2026
  • Diffusion and generative modeling is a probabilistic framework that transforms structured data into noise and reverses this process using neural networks.
  • The approach learns to predict and invert noise through denoising steps, achieving state-of-the-art results in image, audio, and molecular synthesis.
  • Recent innovations focus on sampling acceleration, controllability, and broader applications in multi-modal and structured data generation.

Diffusion and generative modeling comprises a class of probabilistic algorithms that synthesize data by simulating the reversal of an artificial noise process. At its core, this methodology transforms structured data into a simple prior (typically Gaussian noise) through a forward diffusion process, then learns a neural parameterization to invert that process via a sequence of denoising steps. Diffusion models have established state-of-the-art results in image, audio, molecular, and structured data synthesis, and underpin a variety of modern generative AI systems. This article surveys the mathematical principles, canonical algorithms, theoretical frameworks, model architectures, application domains, and recent analytical advances characterizing the field.

1. Mathematical Foundations: Forward and Reverse Dynamics

The central structure of diffusion models is a Markov chain or continuous-time stochastic process that iteratively corrupts data with noise. In the discrete-time case, a data vector $x_0 \in \mathbb{R}^d$ evolves under the forward dynamics:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{\alpha_t}\,x_{t-1},\ (1-\alpha_t)I\big),$$

with a variance schedule $\{\beta_t\}$ such that $\alpha_t = 1-\beta_t$ and $\bar\alpha_t = \prod_{i=1}^t \alpha_i$. The closed-form marginal

$$q(x_t \mid x_0) = \mathcal{N}\big(x_t;\ \sqrt{\bar\alpha_t}\,x_0,\ (1-\bar\alpha_t)I\big)$$

shows that after sufficiently many steps, $x_T$ approximates $\mathcal{N}(0,I)$. In continuous time, the variance-preserving stochastic differential equation (SDE) is

$$dx_t = -\tfrac{1}{2}\beta(t)\,x_t\,dt + \sqrt{\beta(t)}\,dW_t,$$

where $\beta(t)$ is a time-dependent noise rate and $W_t$ is standard Brownian motion.
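The discrete forward process above can be sketched directly from the closed-form marginal. The following is a minimal NumPy illustration; the linear schedule endpoints are illustrative assumptions, not values taken from any cited paper:

```python
import numpy as np

# Illustrative sketch of the discrete forward process, assuming a
# hypothetical linear beta schedule (endpoint values are illustrative).
T = 1000
betas = np.linspace(1e-4, 0.02, T)          # variance schedule {beta_t}
alphas = 1.0 - betas                         # alpha_t = 1 - beta_t
alpha_bars = np.cumprod(alphas)              # bar{alpha}_t = prod_i alpha_i

def q_sample(x0, t, rng):
    """Draw x_t ~ q(x_t | x_0) using the closed-form marginal."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps

rng = np.random.default_rng(0)
x0 = rng.standard_normal(8)                  # a toy "data" vector in R^8
xT = q_sample(x0, T - 1, rng)
# After T steps the signal coefficient sqrt(bar{alpha}_T) is nearly 0,
# so x_T is approximately standard Gaussian noise.
print(np.sqrt(alpha_bars[-1]))               # close to 0
```

Note that only the cumulative products $\bar\alpha_t$ are needed to jump directly to any noise level, which is what makes training with uniformly sampled $t$ cheap.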

The reverse (generative) process is formulated as a learned Markov or SDE chain that attempts to invert the forward dynamics. In the discrete case, the reverse transition is learned as

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1};\ \mu_\theta(x_t,t),\ \sigma_t^2 I\big),$$

with $\mu_\theta$ parameterized (typically by a neural network) as a function of $x_t$ and $t$. The reverse SDE is

$$dx_t = \big[f(x_t,t) - g(t)^2\,\nabla_{x_t}\log p_t(x_t)\big]\,dt + g(t)\,d\bar W_t,$$

where $p_t(x_t)$ is the marginal at time $t$ and $\bar W_t$ is time-reversed Brownian motion. The key unifying feature is that the generative process learns to predict, and thus locally invert, the effect of the forward noise (Ding et al., 2024, Gallon et al., 2024, Cao et al., 2022, Higham et al., 2023).
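A single learned reverse transition can be sketched using the standard DDPM parameterization of $\mu_\theta$ in terms of a noise predictor. In the sketch below, `eps_theta` is a zero-valued placeholder standing in for a trained network, and the schedule is the same hypothetical linear one as above:

```python
import numpy as np

# Minimal sketch of one reverse transition p_theta(x_{t-1} | x_t) with the
# DDPM mean parameterization; eps_theta is a placeholder, not a trained model.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def eps_theta(x_t, t):
    # Placeholder for a trained noise-prediction network.
    return np.zeros_like(x_t)

def reverse_step(x_t, t, rng):
    """One ancestral sampling step: x_{t-1} ~ N(mu_theta(x_t, t), sigma_t^2 I)."""
    mu = (x_t - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_theta(x_t, t)) / np.sqrt(alphas[t])
    sigma = np.sqrt(betas[t])                # a common fixed-variance choice
    noise = rng.standard_normal(x_t.shape) if t > 0 else 0.0
    return mu + sigma * noise

rng = np.random.default_rng(0)
x_t = rng.standard_normal(4)
x_prev = reverse_step(x_t, 500, rng)
```

The fixed choice $\sigma_t^2 = \beta_t$ is one of the two standard variances used in DDPMs; the noise term is dropped at the final step so the last output is the predicted mean.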

2. Training Objectives and Algorithmic Procedures

The canonical training criterion is the evidence lower bound (ELBO), which upper-bounds the negative log-likelihood:

$$\log p_\theta(x_0) \geq \mathbb{E}_{q(x_{1:T} \mid x_0)}\left[\log\frac{p_\theta(x_{0:T})}{q(x_{1:T} \mid x_0)}\right].$$

In DDPMs, this decomposes into a sum of KL divergences and a reconstruction likelihood; with a fixed variance schedule, the per-step KL terms reduce to a mean-squared error in noise prediction (Ding et al., 2024, Gallon et al., 2024, Zhen et al., 2024, Torre, 2023):

$$\mathcal{L}_{\rm DM} = \mathbb{E}_{t,x_0,\varepsilon}\big[\|\varepsilon - \epsilon_\theta(x_t, t)\|^2\big], \qquad x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\varepsilon.$$
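This objective is easy to estimate by Monte Carlo: sample a timestep, noise the data with the closed-form marginal, and regress the injected noise. The sketch below assumes a hypothetical linear beta schedule and uses a zero-valued placeholder for $\epsilon_\theta$; in practice the predictor is a deep network such as a U-Net:

```python
import numpy as np

# Monte Carlo sketch of the simplified DDPM objective.
T = 1000
alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 0.02, T))

def eps_theta(x_t, t):
    return np.zeros_like(x_t)                # placeholder noise predictor

def diffusion_loss(x0_batch, rng):
    """Estimate E_{t, x0, eps} || eps - eps_theta(x_t, t) ||^2 on a minibatch."""
    t = rng.integers(0, T, size=len(x0_batch))
    eps = rng.standard_normal(x0_batch.shape)
    a = alpha_bars[t][:, None]
    x_t = np.sqrt(a) * x0_batch + np.sqrt(1.0 - a) * eps
    residual = eps - eps_theta(x_t, t)
    return float(np.mean(np.sum(residual**2, axis=1)))

rng = np.random.default_rng(0)
loss = diffusion_loss(rng.standard_normal((16, 8)), rng)
# With eps_theta == 0 the loss estimates E||eps||^2, roughly the dimension (8).
```

In a real training loop, `loss` would be differentiated with respect to the network parameters and minimized by stochastic gradient descent.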

Score-based models directly estimate the time-dependent score $\nabla_{x_t}\log p_t(x_t)$ via denoising score matching (Higham et al., 2023, Gallon et al., 2024):

$$\mathcal{L}_{\rm SM} = \mathbb{E}_{t,x_0,\varepsilon}\big[w(t)\,\|\varepsilon - \epsilon_\theta(x_t,t)\|^2\big].$$

Sampling in the generative phase proceeds via an "ancestral sampling" scheme, starting from $x_T \sim \mathcal{N}(0,I)$ and iteratively applying the learned reverse updates for each step from $T$ down to $0$. Deterministic (ODE-based) or semi-deterministic (DDIM, consistency models) alternatives are widely used to accelerate sampling without severe loss in fidelity (Ding et al., 2024, Gallon et al., 2024, Cao et al., 2022).
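Deterministic DDIM sampling on a strided subset of timesteps can be sketched as follows; `eps_theta` is again a placeholder for a trained network, and the schedule is the same illustrative linear one used earlier:

```python
import numpy as np

# Sketch of deterministic (eta = 0) DDIM sampling over strided timesteps,
# illustrating why far fewer than T network evaluations can suffice.
T = 1000
alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 0.02, T))

def eps_theta(x_t, t):
    return np.zeros_like(x_t)                # stand-in for a trained network

def ddim_sample(shape, n_steps, rng):
    """Run the DDIM update over n_steps timesteps strided from T-1 to 0."""
    ts = np.linspace(T - 1, 0, n_steps).astype(int)
    x = rng.standard_normal(shape)           # x_T ~ N(0, I)
    for t, t_prev in zip(ts[:-1], ts[1:]):
        eps = eps_theta(x, t)
        # Predict x_0 from the current noisy state, then re-noise to t_prev.
        x0_hat = (x - np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alpha_bars[t])
        x = np.sqrt(alpha_bars[t_prev]) * x0_hat + np.sqrt(1.0 - alpha_bars[t_prev]) * eps
    return x

rng = np.random.default_rng(0)
sample = ddim_sample((4,), n_steps=50, rng=rng)
```

Because each DDIM update is a deterministic function of the current state and the noise prediction, the stride between consecutive timesteps can be made large, which is the basis of the 10–50 step regimes mentioned above.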

3. Analytical Theory and Trajectory Geometry

Recent work has elucidated the structure of diffusion model trajectories. In the case of isotropic Gaussian models, the probability flow ODE admits a closed-form solution:

$$\frac{dx}{dt} = -\tfrac{1}{2}\beta(t)\,x(t) - \tfrac{1}{2}\,g(t)^2\,s(x(t),t),$$

where $s(x,t)$ is the score at time $t$ and $g(t)^2 = \beta(t)$, matching the forward SDE. Under Gaussian data, the solution decomposes as

$$x_t = \alpha_t\,\mu + \frac{\sigma_t}{\sigma_T}\,y_T^{\perp} + \sum_{k=1}^{r}\psi(t,\lambda_k)\,c_k(T)\,u_k,$$

with $y_T^{\perp}$ the off-manifold component, $c_k$ the projections onto the principal axes $u_k$ of the covariance $\Sigma$, and $\psi$ a time- and eigenvalue-dependent scaling (Wang et al., 2023).

Trajectory geometry is thus characterized by:

  • Approximate 2D "rotation": if $r \ll D$, the trajectory $x_t$ interpolates chiefly in the 2-plane spanned by the initial noise $x_T$ and the final signal $x_0$.
  • Coarse-to-fine feature emergence: high-variance (global) features appear early in denoising, while later steps add fine details. The amplification factor for a perturbation shows that early noise affects global layout, whereas late noise affects details.
  • Sampling acceleration: the closed-form interpolation allows skipping early steps ("teleportation") by computing the trajectory analytically through its converged phase; empirical experiments show this permits omitting 30–40% of the steps on datasets like CIFAR-10 with negligible degradation in Fréchet Inception Distance (FID).

This analysis exhibits concrete analogies to GANs: diffusion's reverse process commits first to outlines (high-variance directions) and then fills in details, similar to progressive GAN synthesis (Wang et al., 2023).
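The Gaussian analysis above can be checked numerically in one dimension: for Gaussian data the score of every marginal is available in closed form, and integrating the probability flow ODE from noise back to time zero should recover the data distribution. A sketch, with all constants illustrative:

```python
import numpy as np

# Numerical check: for 1D Gaussian data N(mu, sigma^2), the exact score of
# the VP marginal is linear in x, and Euler integration of the probability
# flow ODE from t=1 to t=0 transports noise back to the data distribution.
beta_min, beta_max = 0.1, 20.0               # illustrative VP schedule
mu, sigma = 2.0, 0.5                          # data distribution N(mu, sigma^2)

def beta(t):
    return beta_min + t * (beta_max - beta_min)

def alpha_bar(t):
    # exp(-int_0^t beta(s) ds) for the linear schedule above
    return np.exp(-(beta_min * t + 0.5 * (beta_max - beta_min) * t**2))

def score(x, t):
    """Exact score of p_t = N(sqrt(ab)*mu, ab*sigma^2 + 1 - ab)."""
    ab = alpha_bar(t)
    mean, var = np.sqrt(ab) * mu, ab * sigma**2 + (1.0 - ab)
    return -(x - mean) / var

# Start exactly at the t=1 marginal (close to N(0, 1) for this schedule).
rng = np.random.default_rng(0)
ab1 = alpha_bar(1.0)
x = np.sqrt(ab1) * mu + np.sqrt(ab1 * sigma**2 + 1 - ab1) * rng.standard_normal(5000)

n_steps = 1000
dt = 1.0 / n_steps
for i in range(n_steps):                      # Euler integration, t: 1 -> 0
    t = 1.0 - i * dt
    drift = -0.5 * beta(t) * x - 0.5 * beta(t) * score(x, t)
    x = x - dt * drift
print(x.mean(), x.std())                      # should approach mu and sigma
```

Because the score is linear in $x$ here, the particle ensemble stays Gaussian along the whole trajectory, which is the one-dimensional analogue of the closed-form decomposition above.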

4. Generalization, Memorization, and Theoretical Guarantees

The generalization properties of diffusion models have been scrutinized via mutual information bounds and excess risk. The generalization error of a generative model is bounded above by

$$\operatorname{GenErr} \leq \sqrt{\frac{M^2}{n}\,I(G;T)},$$

where $G$ is the generated data, $T$ the training set, $n = |T|$, and $M$ bounds the test functions (Yi et al., 2023). For an empirically optimal diffusion model with a deterministic sampler (e.g., DDIM), the model memorizes the training set: the output becomes a soft nearest neighbor of the dataset, and $I(x_0;T) \to \infty$.

However, practical training via stochastic gradient descent introduces an "optimization bias" that breaks exact memorization, empirically allowing the models to generate novel data. To systematically address memorization, alternative objectives have been proposed (e.g., predicting $x_{t-1}$ from $x_t$ rather than the original noise increments), whose population minimizers are mixtures over noisy rather than clean samples, achieving finite mutual information and thus alleviating the memorization problem at the loss minimizer (Yi et al., 2023).
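The "soft nearest neighbor" behavior of the empirically optimal denoiser can be made concrete: under the empirical data distribution, $\mathbb{E}[x_0 \mid x_t]$ is a softmax-weighted average of the training points, and the weights sharpen as the noise level shrinks. A toy sketch with illustrative values:

```python
import numpy as np

# Sketch of why the empirical-optimal denoiser memorizes: E[x_0 | x_t] under
# the empirical distribution is a softmax-weighted average of training points.
rng = np.random.default_rng(0)
train = rng.standard_normal((5, 2))           # toy training set, n=5, d=2

def optimal_denoiser(x_t, alpha_bar):
    """E[x_0 | x_t] for the empirical distribution over `train`."""
    sq_dists = np.sum((x_t - np.sqrt(alpha_bar) * train) ** 2, axis=1)
    logits = -sq_dists / (2.0 * (1.0 - alpha_bar))
    w = np.exp(logits - logits.max())         # stable softmax weights
    w /= w.sum()
    return w @ train

# Near t = 0 (high alpha_bar, low noise) the softmax collapses onto the
# nearest training point: the denoiser output memorizes the dataset.
x_t = np.sqrt(0.999) * train[2] + np.sqrt(0.001) * rng.standard_normal(2)
x0_hat = optimal_denoiser(x_t, alpha_bar=0.999)
```

As $\bar\alpha_t \to 1$ the temperature $2(1-\bar\alpha_t)$ of the softmax goes to zero, which is the mechanism behind the divergent mutual information discussed above.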

5. Algorithmic Advances and Architectural Extensions

Numerous algorithmic innovations have extended the utility and efficiency of diffusion models (Cao et al., 2022, Ding et al., 2024):

  • Sampling Speedups: ODE-based solvers (e.g., DDIM, DPM-Solver, DEIS) allow sampling in 10–50 function evaluations instead of thousands. Accelerated, parallel, and block-sequential models can reduce step counts further by predicting entire denoising trajectories in a single network evaluation (Asthana et al., 2024).
  • Noise Schedules: Careful choice of $\{\beta_t\}$ (linear, cosine, learnable signal-to-noise) impacts sample quality and convergence. Adaptive, exponential, and image-aware schedules have been developed for further acceleration (Asthana et al., 2024).
  • Latent and Object-Centric Models: Latent Diffusion Models (LDM) diffuse in a lower-dimensional autoencoder (VAE) latent space, dramatically reducing computation in high-resolution settings (Cao et al., 2022). Object-centric models like SlotDiffusion combine slot-based factorization with latent diffusion for compositional, structured generation (Wu et al., 2023).
  • Multi-Modal and Structured Data: MT-Diffusion generalizes diffusion modeling to simultaneously generate multiple modalities (e.g., image+label) in a single framework by aggregating encodings in a joint diffusion space and sharing a backbone denoiser across tasks (Chen et al., 2024). Discrete, structured, and non-Euclidean domains are handled via bridge-based extensions and custom SDEs (Liu et al., 2022, Peluchetti, 2023).

6. Applications and Empirical Performance

Diffusion models have demonstrated versatility and competitive performance across diverse applications (Cao et al., 2022, Ding et al., 2024, Peluchetti, 2023, Baranwal et al., 2023, Briden et al., 1 Jan 2025, Rønne et al., 24 Jul 2025):

Empirical metrics include:

| Metric | Formula/Definition | Context |
| --- | --- | --- |
| FID | $\lVert\mu_r-\mu_g\rVert^2 + \operatorname{Tr}\big[\Sigma_r+\Sigma_g-2(\Sigma_r\Sigma_g)^{1/2}\big]$ | Image quality/diversity |
| IS | $\exp\big(\mathbb{E}_x\,D_{\rm KL}[p(y\mid x)\,\Vert\,p(y)]\big)$ | Class-likelihood sharpness |
| PR curve | Precision/recall at distance threshold $\varepsilon$ | Fidelity/diversity (structures) |
| NLL | $-\mathbb{E}_{x\sim\mathrm{data}}[\log p_\theta(x)]$ | Log-likelihood fit, generative |

Typical FID for cutting-edge DDPM and LDM models on CIFAR-10 is ~2.9–3.3 (Zhen et al., 2024, Wang et al., 2023, Ding et al., 2024). SlotDiffusion improves object-centric decomposition and segmentation accuracy by several points over prior slot architectures (Wu et al., 2023). In scientific domains, diffusion-generated cardiac waves replicate statistics and morphologies indistinguishable from those of the biophysical PDE ground truth (Baranwal et al., 2023), while atomistic diffusion achieves closely matched precision-recall curves on cluster and 2D-materials datasets (Rønne et al., 24 Jul 2025).
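The FID formula in the table can be computed directly from Gaussian feature statistics; the sketch below follows that formula, with the caveat that in practice the means and covariances come from Inception-network features of real and generated images:

```python
import numpy as np
from scipy import linalg

# Minimal FID sketch from Gaussian feature statistics, following the formula
# ||mu_r - mu_g||^2 + Tr[Sigma_r + Sigma_g - 2 (Sigma_r Sigma_g)^{1/2}].
def fid(mu_r, sigma_r, mu_g, sigma_g):
    covmean = linalg.sqrtm(sigma_r @ sigma_g)
    if np.iscomplexobj(covmean):
        covmean = covmean.real                # discard tiny imaginary parts
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))

d = 4
mu = np.zeros(d)
cov = np.eye(d)
# Identical distributions give FID = 0; shifting the mean by 1 in each of
# the d coordinates adds d to the score.
print(fid(mu, cov, mu, cov), fid(mu, cov, mu + 1.0, cov))
```

The matrix square root is the numerically delicate step; standard implementations discard small imaginary components that arise from floating-point error, as done here.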

7. Open Problems, Theoretical Perspectives, and Future Directions

Current research addresses several open problems, including tighter convergence and sample-complexity guarantees, principled control of memorization, faster and provably accurate samplers, and extensions to discrete and non-Euclidean data. The field continues to expand by bridging variational inference, stochastic calculus, optimal transport, and deep learning, yielding broad theoretical, algorithmic, and application-level advances in generative modeling.


References: (Wang et al., 2023, Yi et al., 2023, Cao et al., 2022, Asthana et al., 2024, Liu et al., 2022, Peluchetti, 2023, Higham et al., 2023, Du et al., 2022, Zhen et al., 2024, Gallon et al., 2024, Torre, 2023, Wu et al., 2023, Le, 2024, Bartosh et al., 2023, Baranwal et al., 2023, Chen et al., 2024, Ding et al., 2024, Briden et al., 1 Jan 2025, Rønne et al., 24 Jul 2025, Li et al., 2023)
