Diffusion and Generative Modeling
- Diffusion-based generative modeling is a probabilistic framework that transforms structured data into noise and learns to reverse that process with neural networks.
- The approach learns to predict and invert noise through denoising steps, achieving state-of-the-art results in image, audio, and molecular synthesis.
- Recent innovations focus on sampling acceleration, controllability, and broader applications in multi-modal and structured data generation.
Diffusion and generative modeling comprises a class of probabilistic algorithms that synthesize data by simulating the reversal of an artificial noise process. At its core, this methodology transforms structured data into a simple prior (typically Gaussian noise) through a forward diffusion process, then learns a neural parameterization to invert that process via a sequence of denoising steps. Diffusion models have established state-of-the-art results in image, audio, molecular, and structured data synthesis, and underpin a variety of modern generative AI systems. This article surveys the mathematical principles, canonical algorithms, theoretical frameworks, model architectures, application domains, and recent analytical advances characterizing the field.
1. Mathematical Foundations: Forward and Reverse Dynamics
The central structure of diffusion models is a Markov chain or continuous-time stochastic process that iteratively corrupts data with noise. In the discrete-time case, a data vector $x_0 \sim q(x_0)$ evolves under forward dynamics

$$q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\big),$$

with a variance schedule $\beta_t \in (0,1)$ such that $\alpha_t = 1-\beta_t$, $\bar\alpha_t = \prod_{s=1}^{t}\alpha_s$. The closed-form marginal

$$q(x_t \mid x_0) = \mathcal{N}\big(x_t;\ \sqrt{\bar\alpha_t}\,x_0,\ (1-\bar\alpha_t) I\big)$$

shows that after sufficiently many steps, $x_T$ approximates $\mathcal{N}(0, I)$. In continuous time, the variance-preserving stochastic differential equation (SDE) is

$$dx = -\tfrac{1}{2}\beta(t)\,x\,dt + \sqrt{\beta(t)}\,dW_t,$$

where $\beta(t)$ is a time-dependent noise rate and $W_t$ is standard Brownian motion.
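As a concrete illustration, the discrete forward process and its closed-form marginal take only a few lines of NumPy; the schedule values and the helper name `q_sample` are illustrative choices, not taken from any particular library:

```python
# Sketch of the discrete forward process under a linear beta schedule.
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)          # variance schedule beta_t
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)             # abar_t = prod_s alpha_s

def q_sample(x0, t, rng):
    """Draw x_t ~ q(x_t | x_0) = N(sqrt(abar_t) x_0, (1 - abar_t) I)."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return x_t, eps

rng = np.random.default_rng(0)
x0 = rng.standard_normal(8)
xT, _ = q_sample(x0, T - 1, rng)
# After T steps the signal coefficient sqrt(abar_T) is ~0, so x_T ~ N(0, I).
print(np.sqrt(alpha_bars[-1]))
```

Note that sampling any $x_t$ requires only $x_0$ and the precomputed $\bar\alpha_t$, which is what makes minibatch training over random timesteps cheap.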
The reverse (generative) process is formulated as a learned Markov chain or SDE that attempts to invert the forward dynamics. In the discrete case, the reverse transition is learned as

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\big),$$

with $\mu_\theta$ parameterized (typically by a neural network) as a function of $x_t$ and $t$. The reverse SDE is

$$dx = \Big[-\tfrac{1}{2}\beta(t)\,x - \beta(t)\,\nabla_x \log p_t(x)\Big]\,dt + \sqrt{\beta(t)}\,d\bar W_t,$$

where $p_t$ is the marginal at time $t$ and $\bar W_t$ is time-reversed Brownian motion. The key unifying feature is that the generative process learns to predict, and thus locally invert, the effect of the forward noise (Ding et al., 2024, Gallon et al., 2024, Cao et al., 2022, Higham et al., 2023).
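A single learned reverse transition, assuming a noise-prediction network `eps_model(x, t)` (a placeholder callable here, not a specific library's API), can be sketched as:

```python
# One reverse transition p_theta(x_{t-1} | x_t) of a DDPM with fixed
# variance Sigma_t = beta_t I and a noise-prediction parameterization.
import numpy as np

def reverse_step(x_t, t, eps_model, betas, alpha_bars, rng):
    beta_t = betas[t]
    alpha_t = 1.0 - beta_t
    eps = eps_model(x_t, t)
    # Posterior mean: mu = (x_t - beta_t / sqrt(1 - abar_t) * eps) / sqrt(alpha_t)
    mean = (x_t - beta_t / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alpha_t)
    if t == 0:
        return mean                          # final step is noise-free
    z = rng.standard_normal(x_t.shape)
    return mean + np.sqrt(beta_t) * z        # add fresh Gaussian noise
```

Chaining this step from $t = T-1$ down to $t = 0$ yields the ancestral sampler described in Section 2.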
2. Training Objectives and Algorithmic Procedures
The canonical training criterion is the evidence lower bound (ELBO), which upper-bounds the negative log-likelihood:

$$-\log p_\theta(x_0) \le \mathbb{E}_q\Big[-\log \frac{p_\theta(x_{0:T})}{q(x_{1:T}\mid x_0)}\Big].$$

In DDPMs, this decomposes into a sum of KL divergences and a reconstruction likelihood, but with a fixed variance schedule all per-step KLs reduce to a mean-squared error in noise prediction (Ding et al., 2024, Gallon et al., 2024, Zhen et al., 2024, Torre, 2023):

$$\mathcal{L}_{\text{simple}} = \mathbb{E}_{t,\,x_0,\,\epsilon}\big[\,\lVert\epsilon - \epsilon_\theta(x_t, t)\rVert^2\,\big].$$
Score-based models directly estimate the time-dependent score $\nabla_x \log p_t(x)$ via denoising score matching (Higham et al., 2023, Gallon et al., 2024):

$$\mathcal{L}_{\text{DSM}} = \mathbb{E}_{t,\,x_0,\,x_t}\big[\,\lambda(t)\,\lVert s_\theta(x_t, t) - \nabla_{x_t}\log q(x_t \mid x_0)\rVert^2\,\big],$$

where $\lambda(t)$ is a positive weighting function.
Sampling in the generative phase proceeds via an "ancestral sampling" scheme, starting from $x_T \sim \mathcal{N}(0, I)$ and iteratively applying the learned reverse updates for each step from $t = T$ down to $0$. Deterministic (ODE-based) or semi-deterministic (DDIM, consistency models) alternatives are widely used to accelerate sampling without severe loss in fidelity (Ding et al., 2024, Gallon et al., 2024, Cao et al., 2022).
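A deterministic DDIM-style sampler over a subsampled timestep grid can be sketched as follows; `eps_model` is again a placeholder for a trained noise predictor:

```python
# Deterministic (eta = 0) DDIM-style sampling on a skipped timestep schedule.
import numpy as np

def ddim_sample(eps_model, alpha_bars, shape, n_steps, rng):
    T = len(alpha_bars)
    ts = np.linspace(T - 1, 0, n_steps).round().astype(int)   # subsampled grid
    x = rng.standard_normal(shape)                            # x_T ~ N(0, I)
    for t, t_prev in zip(ts[:-1], ts[1:]):
        eps = eps_model(x, t)
        # Predict x_0, then deterministically re-noise it to level t_prev.
        x0_hat = (x - np.sqrt(1 - alpha_bars[t]) * eps) / np.sqrt(alpha_bars[t])
        x = np.sqrt(alpha_bars[t_prev]) * x0_hat + np.sqrt(1 - alpha_bars[t_prev]) * eps
    return x
```

Because each update only needs $\bar\alpha$ at the two grid points, the same trained network supports 1000-step ancestral sampling and, say, 50-step DDIM sampling without retraining.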
3. Analytical Theory and Trajectory Geometry
Recent work has elucidated the structure of diffusion model trajectories. In the case of isotropic Gaussian models, the probability flow ODE

$$\frac{dx}{dt} = -\tfrac{1}{2}\beta(t)\,x - \tfrac{1}{2}\beta(t)\,\nabla_x \log p_t(x),$$

where $\nabla_x \log p_t(x)$ is the score at time $t$, admits a closed-form solution. Under Gaussian data, the solution decomposes into an off-manifold component plus projections onto the principal axes of the data covariance $\Sigma$, each rescaled by a time- and eigenvalue-dependent factor (Wang et al., 2023).
Trajectory geometry is thus characterized by:
- Approximate 2D "rotation": To leading order, the trajectory interpolates chiefly in the 2-plane spanned by the initial noise $x_T$ and the final signal $x_0$.
- Coarse-to-fine feature emergence: Early in denoising, high-variance (global) features appear; later steps add fine details. The amplification factor of a perturbation injected at time $t$ shows that early noise impacts global layout, while late noise impacts details.
- Sampling acceleration: Closed-form interpolation allows skipping early steps ("teleportation") by computing the trajectory analytically for converged phases; empirical experiments show this enables omitting 30–40% of the steps on datasets like CIFAR-10 with negligible degradation in Fréchet Inception Distance (FID).
This analysis exhibits concrete analogies to GANs: diffusion's reverse process commits first to outlines (high-variance directions) and then fills in details, similar to progressive GAN synthesis (Wang et al., 2023).
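For isotropic Gaussian data the score is available analytically, so the probability flow ODE can be integrated numerically and checked against its closed-form solution; for a zero-mean 1-D Gaussian under the variance-preserving process this solution is $x(t) = \sqrt{\mathrm{Var}_t/\mathrm{Var}_T}\,x(T)$. The schedule below is an illustrative choice, not taken from the cited work:

```python
# Probability-flow ODE for 1-D Gaussian data N(0, sigma^2), where the score
# is exact: grad log p_t(x) = -x / Var_t, so dx/dt = -0.5*beta(t)*x*(1 - 1/Var_t).
import numpy as np

sigma2 = 4.0                                  # data variance
beta = lambda t: 0.1 + 19.9 * t               # assumed beta(t) on [0, 1]
int_beta = lambda t: 0.1 * t + 9.95 * t**2    # \int_0^t beta(s) ds
abar = lambda t: np.exp(-int_beta(t))
var = lambda t: abar(t) * sigma2 + (1.0 - abar(t))   # marginal variance

x = 1.5                                       # x(T), a sample at t = 1
n = 20000
dt = 1.0 / n
for i in range(n, 0, -1):                     # Euler steps from t = 1 to t = 0
    t = i * dt
    x -= dt * (-0.5 * beta(t) * x * (1.0 - 1.0 / var(t)))

exact = np.sqrt(var(0.0) / var(1.0)) * 1.5    # closed-form endpoint
print(x, exact)
```

The agreement between the integrated and closed-form endpoints is what licenses the "teleportation" trick above: converged phases of the trajectory need not be simulated step by step.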
4. Generalization, Memorization, and Theoretical Guarantees
The generalization properties of diffusion models have been scrutinized via mutual information bounds and excess risk. The generalization error of a generative model is bounded above by a term of the form

$$\mathrm{gen}(X_g, S) \le M \sqrt{2\, I(X_g; S)},$$

where $X_g$ is the generated data, $S$ the training set, $I(X_g; S)$ their mutual information, and $M$ bounds the test functions (Yi et al., 2023). For an empirically optimal diffusion model with deterministic samplers (e.g., DDIM), the model memorizes the training set: the output becomes a soft nearest neighbor of the dataset, and $I(X_g; S)$ diverges.
However, practical training via stochastic gradient descent introduces an "optimization bias" that breaks exact memorization, empirically allowing the models to generate novel data. To systematically address memorization, alternative objectives have been proposed (e.g., predicting earlier noisy iterates rather than the original noise increments), whose population minimizers are mixtures over noisy rather than clean samples, achieving finite mutual information and thus alleviating the generalization problem at the loss minimizer (Yi et al., 2023).
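The soft-nearest-neighbor behavior of the empirically optimal denoiser can be made concrete: under an empirical data distribution, the posterior mean $\mathbb{E}[x_0 \mid x_t]$ is a softmax-weighted average over training points. A sketch with toy data:

```python
# Empirically optimal denoiser = "soft nearest neighbor" of the training set.
import numpy as np

def optimal_x0(x_t, train, abar):
    """E[x_0 | x_t] when the data distribution is uniform over `train`."""
    # log N(x_t; sqrt(abar) x_i, (1 - abar) I), up to a shared constant
    d2 = np.sum((x_t - np.sqrt(abar) * train) ** 2, axis=1)
    logw = -d2 / (2.0 * (1.0 - abar))
    w = np.exp(logw - logw.max())
    w /= w.sum()                              # softmax weights over training points
    return w @ train                          # convex combination of the data

train = np.array([[0.0, 0.0], [10.0, 10.0]])
# At low noise (abar near 1), the weights collapse onto the closest point.
print(optimal_x0(np.array([0.1, -0.1]), train, abar=0.99))
```

At high noise the weights flatten toward the dataset mean, while at low noise they collapse onto a single training point, which is precisely the memorization regime discussed above.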
5. Algorithmic Advances and Architectural Extensions
Numerous algorithmic innovations have extended the utility and efficiency of diffusion models (Cao et al., 2022, Ding et al., 2024):
- Sampling Speedups: ODE-based solvers (e.g., DDIM, DPM-Solver, DEIS) allow sampling in 10–50 function evaluations instead of thousands. Accelerated, parallel, and block-sequential models can reduce step counts further by predicting entire denoising trajectories in a single network evaluation (Asthana et al., 2024).
- Noise Schedules: Careful choice of the variance schedule $\beta_t$ (linear, cosine, learnable signal-to-noise) impacts sample quality and convergence. Adaptive and exponential, image-aware schedules have been developed for further acceleration (Asthana et al., 2024).
- Latent and Object-Centric Models: Latent Diffusion Models (LDM) diffuse in a lower-dimensional autoencoder (VAE) latent space, dramatically reducing computation in high-resolution settings (Cao et al., 2022). Object-centric models like SlotDiffusion combine slot-based factorization with latent diffusion for compositional, structured generation (Wu et al., 2023).
- Multi-Modal and Structured Data: MT-Diffusion generalizes diffusion modeling to simultaneously generate multiple modalities (e.g., image+label) in a single framework by aggregating encodings in a joint diffusion space and sharing a backbone denoiser across tasks (Chen et al., 2024). Discrete, structured, and non-Euclidean domains are handled via bridge-based extensions and custom SDEs (Liu et al., 2022, Peluchetti, 2023).
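The linear and cosine schedules mentioned above can be compared directly; the cosine construction follows Nichol & Dhariwal's form with offset $s = 0.008$, and the function names are illustrative:

```python
# Two common variance schedules, compared via their signal coefficient abar_t.
import numpy as np

def linear_alpha_bar(T, beta_start=1e-4, beta_end=0.02):
    betas = np.linspace(beta_start, beta_end, T)
    return np.cumprod(1.0 - betas)

def cosine_alpha_bar(T, s=0.008):
    t = np.arange(T + 1) / T
    f = np.cos((t + s) / (1 + s) * np.pi / 2) ** 2
    return (f / f[0])[1:]                     # abar_t for t = 1..T

T = 1000
lin, cos_ = linear_alpha_bar(T), cosine_alpha_bar(T)
# The cosine schedule destroys information more slowly mid-trajectory
# (larger abar at t = T/2), which tends to improve sample quality.
print(lin[T // 2], cos_[T // 2])
```

Both schedules drive $\bar\alpha_T$ toward zero, so the terminal distribution is close to $\mathcal{N}(0, I)$ in either case; they differ in how the signal-to-noise ratio is allocated across intermediate steps.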
6. Applications and Empirical Performance
Diffusion models have demonstrated versatility and competitive performance across diverse applications (Cao et al., 2022, Ding et al., 2024, Peluchetti, 2023, Baranwal et al., 2023, Briden et al., 1 Jan 2025, Rønne et al., 24 Jul 2025):
- Image and Video Synthesis: Achieve state-of-the-art results on photorealism (e.g., Stable Diffusion, Imagen), semantic segmentation, inpainting, and video prediction (Gallon et al., 2024, Wu et al., 2023, Ding et al., 2024).
- Scientific, Biomedical, and Physical Systems: Cardiac excitation wave generation, molecular structure synthesis, and spacecraft trajectory design employing compositional or energy-based guidance (Baranwal et al., 2023, Briden et al., 1 Jan 2025, Rønne et al., 24 Jul 2025).
- Audio, Speech, and Time Series: Generative vocoders (WaveGrad, DiffWave), forecasting/imputation (Cao et al., 2022).
- Multi-modal Generation: Simultaneous synthesis or translation across image, label, representation, and text (Chen et al., 2024).
Empirical metrics include:
| Metric | Formula/Definition | Context |
|---|---|---|
| FID | $\lVert\mu_r-\mu_g\rVert^2 + \mathrm{Tr}\big(\Sigma_r+\Sigma_g-2(\Sigma_r\Sigma_g)^{1/2}\big)$ | Image quality/diversity |
| IS | $\exp\big(\mathbb{E}_x\,\mathrm{KL}(p(y\mid x)\,\Vert\,p(y))\big)$ | Class likelihood sharpness |
| PR curve | Precision/Recall at distance threshold | Fidelity/diversity (structures) |
| NLL | Negative log-likelihood (e.g., bits per dimension) | Log-likelihood fit, generative |
Typical FID for cutting-edge DDPM and LDM models on CIFAR-10 is 2.9–3.3 (Zhen et al., 2024, Wang et al., 2023, Ding et al., 2024). SlotDiffusion improves object-centric decomposition and segmentation accuracy by several points over prior slot architectures (Wu et al., 2023). In scientific domains, diffusion-generated cardiac waves replicate statistics and morphologies indistinguishable from the biophysical PDE ground truth (Baranwal et al., 2023), while atomistic diffusion achieves closely matched precision-recall curves on cluster and 2D materials datasets (Rønne et al., 24 Jul 2025).
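For reference, FID reduces to a short NumPy computation once features are extracted; this sketch uses the identity $\mathrm{Tr}\big((\Sigma_r\Sigma_g)^{1/2}\big) = \sum_i \sqrt{\lambda_i(\Sigma_r\Sigma_g)}$ and is not a substitute for the standard Inception-feature pipeline:

```python
# FID from the Gaussian-fit formula
# ||mu_r - mu_g||^2 + Tr(Sigma_r + Sigma_g - 2 (Sigma_r Sigma_g)^{1/2}).
import numpy as np

def fid(feats_real, feats_gen):
    mu_r, mu_g = feats_real.mean(0), feats_gen.mean(0)
    s_r = np.cov(feats_real, rowvar=False)
    s_g = np.cov(feats_gen, rowvar=False)
    # Eigenvalues of a product of PSD matrices are real and >= 0 up to round-off.
    eig = np.linalg.eigvals(s_r @ s_g)
    tr_sqrt = np.sqrt(np.clip(eig.real, 0, None)).sum()
    return float(((mu_r - mu_g) ** 2).sum()
                 + np.trace(s_r) + np.trace(s_g) - 2 * tr_sqrt)

rng = np.random.default_rng(0)
a = rng.standard_normal((2000, 8))
print(fid(a, a))          # identical feature sets give FID ~ 0
```

In the reported numbers, the features are 2048-dimensional Inception activations; the 8-dimensional features here exist only to exercise the formula.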
7. Open Problems, Theoretical Perspectives, and Future Directions
Current research addresses several open problems:
- Efficiency vs. Fidelity: Trade-offs between sampling speed and sample quality (e.g., consistency models, distillation, optimal-flow schemes) (Ding et al., 2024).
- Controllability and Guidance: Mechanisms for fine control, compositional editing, and reward-based or classifier guidance, including classifier-free approaches (Gallon et al., 2024, Cao et al., 2022, Briden et al., 1 Jan 2025, Rønne et al., 24 Jul 2025).
- Theoretical Guarantees: Statistical and discretization error bounds, convergence of bridge-based generative frameworks, and generalization analysis (e.g., mutual information minimization) (Yi et al., 2023, Liu et al., 2022, Wang et al., 2023).
- Domain Extension and Integration: Extension to discrete, structured, and scientific domains—including proteins, molecules, and physical systems—using specialized diffusion bridges and energy-based methods (Liu et al., 2022, Rønne et al., 24 Jul 2025, Baranwal et al., 2023).
- Optimal Transport and Schrödinger Bridges: Incorporating exact bridges (e.g., Iterated Diffusion Bridge Mixtures) for one-step or iterated optimal generation and transfer across domains (Peluchetti, 2023).
- Architectural Innovations: Exploiting learnable, non-linear, and neural parameterizations (e.g., Neural Diffusion Models), image-aware schedules, block-sequential sampling, and equivariant GNNs in physical sciences (Bartosh et al., 2023, Asthana et al., 2024, Rønne et al., 24 Jul 2025).
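Classifier-free guidance, mentioned above, amounts to one line at sampling time, combining conditional and unconditional noise predictions with a guidance scale $w$; the predictors here are placeholder arrays rather than a specific model's outputs:

```python
# Classifier-free guidance: eps_tilde = (1 + w) * eps_cond - w * eps_uncond.
import numpy as np

def cfg_eps(eps_cond, eps_uncond, w):
    """Guided noise prediction; w = 0 recovers the conditional model."""
    return (1.0 + w) * eps_cond - w * eps_uncond

e_c = np.array([1.0, 0.0])    # conditional prediction (placeholder)
e_u = np.array([0.5, 0.5])    # unconditional prediction (placeholder)
print(cfg_eps(e_c, e_u, w=0.0))
```

Larger $w$ extrapolates further along the conditional direction, trading diversity for adherence to the conditioning signal; the same trained network serves both predictions by randomly dropping the condition during training.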
The field continues to expand by bridging variational inference, stochastic calculus, optimal transport, and deep learning, yielding broad theoretical, algorithmic, and application-level advances in generative modeling.
References: (Wang et al., 2023, Yi et al., 2023, Cao et al., 2022, Asthana et al., 2024, Liu et al., 2022, Peluchetti, 2023, Higham et al., 2023, Du et al., 2022, Zhen et al., 2024, Gallon et al., 2024, Torre, 2023, Wu et al., 2023, Le, 2024, Bartosh et al., 2023, Baranwal et al., 2023, Chen et al., 2024, Ding et al., 2024, Briden et al., 1 Jan 2025, Rønne et al., 24 Jul 2025, Li et al., 2023)