Score-based Generative Models
- Score-based generative models are probabilistic frameworks that use forward diffusion and reverse-time SDEs to transform and generate data.
- They harness denoising score matching with neural networks to approximate gradients of log-densities across varying noise levels.
- Recent advances include critically-damped dynamics, phase space formulations, and preconditioning techniques that improve sampling stability and efficiency.
Score-based generative models (SGMs) are a class of probabilistic generative models defined by the interplay between stochastic differential equations (SDEs) and learned score functions—gradients of log densities at varying noise levels. Their central premise is to perturb data via a forward diffusion (noise-injection) process—rendering it progressively more like a tractable, typically Gaussian, reference distribution—then perform data synthesis by time-reversing the process, whereby a neural network approximates the gradient of the log-density ("score"), conditioning upon the noisy input at each time step. This paradigm, inspired by nonequilibrium statistical mechanics, encompasses a variety of practical architectures and training regimes, ranging from classic variance-preserving SDEs to recent physically-motivated extensions in phase space and Riemannian geometries.
1. Mathematical Foundations: Forward and Reverse SDEs
SGMs typically formalize the generative modeling problem as a pair of continuous-time Markov processes:
- Forward (noising) SDE: Given data $x_0 \sim p_{\mathrm{data}}$, one defines
  $$dx_t = f(x_t, t)\,dt + g(t)\,dw_t,$$
  with $f(\cdot, t)$ the drift vector field and $g(t)$ the diffusion coefficient. For most practical SGMs, this is an Itô process whose marginal $p_t$ evolves from $p_0 = p_{\mathrm{data}}$ to a well-understood prior $p_T$ (often a centered isotropic Gaussian).
- Reverse-time SDE: By Anderson's theorem, the time reversal satisfies
  $$dx_t = \big[f(x_t, t) - g(t)^2\,\nabla_x \log p_t(x_t)\big]\,dt + g(t)\,d\bar{w}_t,$$
  where $\bar{w}_t$ is a reverse-time Wiener process. Samples from the reversed process at $t = 0$ approximate data draws when the process is initialized via $x_T \sim \mathcal{N}(0, I)$ or another tractable prior.
Since $\nabla_x \log p_t(x)$ (the "score") is intractable, it is replaced with a neural network approximation $s_\theta(x, t)$ trained via variants of score matching.
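As a minimal illustration of this forward/reverse pair, the sketch below simulates a variance-preserving SDE with drift $f(x,t) = -\tfrac{1}{2}\beta(t)x$ and diffusion $g(t) = \sqrt{\beta(t)}$, discretized with Euler–Maruyama. The linear $\beta$ schedule and the placeholder `score_fn` are illustrative assumptions, not a specific published implementation.

```python
# Sketch of the forward/reverse SDE pair for a variance-preserving (VP) SDE,
# discretized with Euler-Maruyama. `score_fn` stands in for the learned network
# s_theta(x, t) ~ grad_x log p_t(x); it is a placeholder (assumption).
import numpy as np

def beta(t, beta_min=0.1, beta_max=20.0):
    # Linear noise schedule, a common (but not the only) choice.
    return beta_min + t * (beta_max - beta_min)

def forward_sde_step(x, t, dt, rng):
    # dx = -0.5 * beta(t) * x dt + sqrt(beta(t)) dw
    drift = -0.5 * beta(t) * x
    diffusion = np.sqrt(beta(t))
    return x + drift * dt + diffusion * np.sqrt(dt) * rng.standard_normal(x.shape)

def reverse_sde_step(x, t, dt, score_fn, rng):
    # dx = [f(x,t) - g(t)^2 * score(x,t)] dt + g(t) dw_bar, integrated backwards in time
    drift = -0.5 * beta(t) * x - beta(t) * score_fn(x, t)
    diffusion = np.sqrt(beta(t))
    return x - drift * dt + diffusion * np.sqrt(dt) * rng.standard_normal(x.shape)

def sample(score_fn, dim, n_steps=1000, T=1.0, seed=0):
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    x = rng.standard_normal(dim)          # x_T ~ N(0, I), the tractable prior
    for n in range(n_steps, 0, -1):       # integrate the reverse SDE from t = T to t ~ 0
        x = reverse_sde_step(x, n * dt, dt, score_fn, rng)
    return x
```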
2. Score-Matching Objectives and Training Principles
The centerpiece of SGM training is an objective function that makes the neural approximation match the true score at each time/scale. Typical training uses denoising score matching (DSM), which regresses the network onto the conditional score of the analytic perturbation kernel:
$$\mathcal{L}_{\mathrm{DSM}}(\theta) = \mathbb{E}_{t}\,\mathbb{E}_{x_0 \sim p_{\mathrm{data}}}\,\mathbb{E}_{x_t \sim p_t(\cdot \mid x_0)}\Big[\lambda(t)\,\big\| s_\theta(x_t, t) - \nabla_{x_t} \log p_t(x_t \mid x_0) \big\|^2\Big],$$
where $\lambda(t)$ is a positive weighting function. Variants include maximum-likelihood reweighting, sliced score matching, and hybrid forms adapted to conditional or latent-variable problems.
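A Monte Carlo estimate of this objective is straightforward once the perturbation kernel is analytic. The sketch below assumes a kernel of the form $x_t = \alpha(t)\,x_0 + \sigma(t)\,\epsilon$; `score_fn`, `alpha`, `sigma`, and `weight` are hypothetical placeholders rather than any particular implementation.

```python
# Sketch of a Monte Carlo estimate of the DSM objective, assuming a forward process
# with analytic perturbation kernel x_t = alpha(t) * x_0 + sigma(t) * eps.
import numpy as np

def dsm_objective(score_fn, x0, alpha, sigma, weight, rng):
    """Estimate L_DSM(theta) on a batch x0 of shape (batch, dim)."""
    t = rng.uniform(size=(x0.shape[0], 1))       # time samples, roughly t ~ U(0, 1]
    eps = rng.standard_normal(x0.shape)          # Gaussian perturbation
    a, s = alpha(t), sigma(t)                    # kernel coefficients, shape (batch, 1)
    xt = a * x0 + s * eps                        # sample from p_t(x_t | x_0)
    target = -eps / s                            # = grad_{x_t} log p_t(x_t | x_0)
    residual = score_fn(xt, t) - target
    return np.mean(weight(t) * np.sum(residual ** 2, axis=1, keepdims=True))
```

In practice the same computation is written in an autodiff framework so that gradients with respect to the network parameters can be taken; the NumPy version above only illustrates the structure of the estimator.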
SGM objectives can be extended to various data geometries:
- Riemannian manifolds: Score matching on curved spaces uses the Riemannian gradient and divergence, as shown in RSGM (Bortoli et al., 2022); a small sphere example follows this list.
- Latent variable models: When paired with VAEs or hierarchical encoders, SGM training objectives adapt to probabilistic encoders and cross-entropy bounds (Vahdat et al., 2021).
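As a small concrete example of the geometric ingredient involved (a generic fact about embedded manifolds, not code from the RSGM implementation), the Riemannian gradient on the unit sphere in $\mathbb{R}^3$ is the Euclidean gradient projected onto the tangent space:

```python
# Riemannian gradient on the unit sphere: project the Euclidean gradient onto the
# tangent space T_x S^2 = {v : <v, x> = 0}. Illustrative example only.
import numpy as np

def riemannian_grad_sphere(euclidean_grad, x):
    return euclidean_grad - np.dot(euclidean_grad, x) * x

x = np.array([0.0, 0.0, 1.0])        # a point on the unit sphere
g = np.array([1.0, 2.0, 3.0])        # Euclidean gradient of some log-density at x
rg = riemannian_grad_sphere(g, x)
print(rg, np.dot(rg, x))             # tangent component; inner product with x is 0
```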
Empirically, Hybrid Score Matching (HSM) provides stability, especially when forward SDEs admit analytic perturbation kernels, as it avoids pathologies near $t \to 0$ (small noise levels) and can leverage conditional distributions (Dockhorn et al., 2021).
3. Extensions: Critically-Damped Dynamics, Phase Space, and Preconditioning
Recent advances have extended SGMs beyond scalar, variance-preserving SDEs, motivated by the insight that the choice of forward diffusion process alters the complexity of the reverse (generative) process.
3.1 Critically-Damped Langevin Diffusion (CLD)
CLD (Dockhorn et al., 2021) augments the state space with auxiliary "velocity" variables, interpreting the forward process as a Langevin system,
$$dx_t = M^{-1} v_t\,\beta\,dt, \qquad dv_t = \big(-x_t - \Gamma M^{-1} v_t\big)\,\beta\,dt + \sqrt{2\Gamma\beta}\,dw_t,$$
where critical damping ($\Gamma^2 = 4M$ for mass $M$) allows fast convergence without oscillatory overshooting. The Hamiltonian coupling and velocity-space noise injection (simulated in the sketch after this list):
- Accelerate state-space exploration and mixing,
- Focus noise where ergodicity is needed, and
- Simplify training by reducing the score matching task to only the velocity conditional score $\nabla_{v_t} \log p_t(v_t \mid x_t)$. This yields efficient, stable sampling and empirically outperforms classical SDEs at fixed compute (CIFAR-10 FID 2.25 at 500 NFE).
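As a concrete illustration of the dynamics above, the following sketch simulates the CLD-style forward process with Euler–Maruyama. The constant $\beta$, the zero velocity initialization, and the coefficient names follow the text; the published implementation uses somewhat different schedules and a Gaussian velocity initialization, so treat the constants here as assumptions.

```python
# Illustrative simulation of the CLD-style forward dynamics in (x, v) space.
import numpy as np

def cld_forward(x0, n_steps=1000, T=1.0, M=0.25, beta=4.0, seed=0):
    Gamma = 2.0 * np.sqrt(M)                 # critical damping: Gamma^2 = 4M
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    x = x0.copy()
    v = np.zeros_like(x0)                    # simplification: zero initial velocities
    for _ in range(n_steps):
        dx = (v / M) * beta * dt                                   # Hamiltonian coupling
        dv = (-x - Gamma * v / M) * beta * dt \
             + np.sqrt(2.0 * Gamma * beta * dt) * rng.standard_normal(x.shape)
        x, v = x + dx, v + dv                # noise enters only through the velocity channel
    return x, v
```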
3.2 Phase Space Langevin Diffusion (PSLD) and Complete Recipes
Building on the "complete recipes" literature, PSLD (Pandey et al., 2023) generalizes the forward SDE to all Markovian processes converging to tractable Gaussian priors in augmented phase space. For the joint state $z_t = (x_t, m_t)$ of data and auxiliary momentum variables, the forward process takes the complete-recipe form
$$dz_t = -(D + Q)\,z_t\,\beta(t)\,dt + \sqrt{2 D\,\beta(t)}\,dw_t,$$
with the positive semidefinite diffusion matrix $D$ and skew-symmetric coupling $Q$ constructed for critical damping, and hybrid score matching used for learning. PSLD yields lower FID and iteration complexity when the correct parametrization is used and supports conditional sampling via classifier guidance and inpainting.
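A hedged sketch of the complete-recipe construction: a positive semidefinite $D$ and a skew-symmetric $Q$ define a linear forward drift $-(D+Q)z$ whose stationary distribution is standard normal. The particular block values below are illustrative choices tuned for critical damping, not the exact PSLD parameterization.

```python
# Complete-recipe style drift in augmented phase space: D is PSD (noise only in the
# auxiliary block), Q is skew-symmetric (Hamiltonian coupling). Illustrative values.
import numpy as np

d = 2                                        # data dimension (state is 2d-dimensional)
I = np.eye(d)

Gamma, M_inv = 2.0, 1.0                      # friction and inverse mass chosen so Gamma = 2 * M_inv
D = np.block([[0.0 * I, 0.0 * I],
              [0.0 * I, Gamma * I]])
Q = np.block([[0.0 * I, -M_inv * I],
              [M_inv * I, 0.0 * I]])

F = -(D + Q)                                 # forward drift matrix: dz = F z beta dt + sqrt(2 D beta) dw
eigs = np.linalg.eigvals(F)
print("drift eigenvalues:", np.round(eigs, 3))   # repeated real eigenvalues <=> critical damping
```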
3.3 Preconditioned SGMs
Ill-conditioning arising from anisotropic or batch-normalized data is mitigated by preconditioning (PDS) (Ma et al., 2023). By introducing a preconditioning matrix that rescales coordinates toward more uniform variance, the method accelerates convergence by reducing the effective step-size restrictions inherent in standard isotropic SDE discretizations, without retraining or modifying the core score model.
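The effect of such rescaling can be illustrated with a preconditioned Langevin corrector: for any positive definite (here diagonal) matrix $A$, the update $x \leftarrow x + \varepsilon A\,s_\theta(x,t) + \sqrt{2\varepsilon}\,A^{1/2} z$ leaves the target invariant as $\varepsilon \to 0$, and choosing $A$ to equalize per-coordinate variances relaxes the step-size restriction. This is a generic sketch of the idea, not the specific PDS operator or its placement in the reverse SDE.

```python
# Preconditioned (unadjusted) Langevin corrector with a diagonal matrix A_diag.
# A_diag equalizing per-dimension scales (e.g., inverse data variances) is an
# illustrative choice; the published PDS method differs in detail.
import numpy as np

def preconditioned_langevin_corrector(x, t, score_fn, A_diag, eps, n_steps, rng):
    sqrt_A = np.sqrt(A_diag)
    for _ in range(n_steps):
        z = rng.standard_normal(x.shape)
        x = x + eps * A_diag * score_fn(x, t) + np.sqrt(2.0 * eps) * sqrt_A * z
    return x
```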
4. Sampling Algorithms and Integration Schemes
SGM synthesis is sensitive to numerical stability and step efficiency, especially for large, potentially stiff systems:
- Euler–Maruyama (EM) is simple but unstable for Hamiltonian or stiff SDEs unless very small steps are used.
- Symmetric Splitting Schemes (SSCS), motivated by Strang splitting, analytically integrate the linear (Ornstein–Uhlenbeck or Hamiltonian) part and apply the score-based drift and noise injection as separate nonlinear steps. The practical SSCS pseudocode is:

```
for n = N-1 downto 0:
    x, v = Langevin_half_step(x, v, delta_t_n / 2)    # analytic half-step of the linear Langevin part
    v = v + delta_t_n * 2 * Gamma * (s_theta(x, v, t) + v / Sigma_t_vv)    # score-based (nonlinear) update
    x, v = Langevin_half_step(x, v, delta_t_n / 2)    # second analytic half-step
    t += delta_t_n
```
Hybrid predictor–corrector samplers—alternating reverse SDE steps with Langevin MCMC—are standard when exact flow integration is not possible. Their effectiveness is explained rigorously by recent theory (Lee et al., 2022).
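A minimal predictor–corrector sampler for a VP-SDE might look as follows; the linear $\beta$ schedule and the signal-to-noise-ratio step-size heuristic in the corrector are common choices assumed here for illustration, not a specific reference implementation.

```python
# Hybrid predictor-corrector sampling for a VP-SDE: each iteration takes one
# reverse-time Euler-Maruyama (predictor) step, then a few Langevin MCMC
# (corrector) steps at the same noise level.
import numpy as np

def beta(t, beta_min=0.1, beta_max=20.0):
    return beta_min + t * (beta_max - beta_min)      # linear VP schedule (assumption)

def predictor_corrector_sample(score_fn, dim, n_steps=1000, n_corrector=1,
                               snr=0.16, T=1.0, seed=0):
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    x = rng.standard_normal(dim)                     # x_T ~ N(0, I)
    for n in range(n_steps, 0, -1):
        t = n * dt
        # Predictor: one reverse-time Euler-Maruyama step.
        g2 = beta(t)
        drift = -0.5 * g2 * x - g2 * score_fn(x, t)
        x = x - drift * dt + np.sqrt(g2 * dt) * rng.standard_normal(dim)
        # Corrector: Langevin MCMC with an SNR-based step-size heuristic (assumption).
        for _ in range(n_corrector):
            s = score_fn(x, t)
            z = rng.standard_normal(dim)
            eps = 2.0 * (snr * np.linalg.norm(z) / (np.linalg.norm(s) + 1e-12)) ** 2
            x = x + eps * s + np.sqrt(2.0 * eps) * z
    return x
```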
5. Theoretical Guarantees: Convergence and Polynomial Complexity
SGMs enjoy robust nonasymptotic convergence guarantees, under minimal regularity:
- Wasserstein and TV bounds: If the score error is small and the target distribution has sufficiently decaying tails or (under stronger assumptions) smoothness, a total-variation or Wasserstein error of at most $\varepsilon$ between the sampled and target distributions can be achieved with compute and score-approximation cost polynomial in the dimension and $1/\varepsilon$ (Lee et al., 2022; Gao et al., 2023; Bruno et al., 6 May 2025). Dimension-optimal dependence holds under semiconvexity.
- Sample and network complexity: If the log-relative density is approximable by a neural network of bounded path-norm (e.g., Barron-class or mixtures), then sample complexity can be dimension-free; thus, SGMs can "break the curse of dimensionality" for sub-Gaussian or low-complexity families (Cole et al., 12 Feb 2024).
- Algorithm- and data-dependent generalization: The generalization error of the learned distribution, as measured in KL or Wasserstein metrics, decomposes into initialization (mixing time), discretization bias (step size), empirical score-matching error, and an algorithm-dependent gap; a schematic summary of this decomposition appears below. Explicit functionals of the optimization trajectory (e.g., sum of gradient norms, topological trajectory invariants) closely track observed FID gaps in practical experiments (Dupuis et al., 4 Jun 2025).
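Schematically (an illustrative summary of the decomposition just described, not a bound quoted from any single analysis), the total error in a divergence $\mathrm{d}(\cdot,\cdot)$ such as KL, TV, or Wasserstein behaves as
$$\mathrm{d}\big(\hat{p}_0,\, p_{\mathrm{data}}\big) \;\lesssim\; \varepsilon_{\mathrm{init}}(T) \;+\; \varepsilon_{\mathrm{disc}}(h) \;+\; \varepsilon_{\mathrm{score}} \;+\; \varepsilon_{\mathrm{gen}},$$
where the four terms correspond, respectively, to prior mismatch after finite diffusion time $T$, discretization bias at step size $h$, empirical score-matching error, and the algorithm-dependent generalization gap.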
Recent work using Wasserstein Uncertainty Propagation (WUP) establishes sharp, explicit bounds in Wasserstein or total-variation distance for the error budget incurred by finite data, early stopping, objective choice, network expressiveness, and reference prior mismatch, showing SGM robustness under realistic errors (Mimikos-Stamatopoulos et al., 24 May 2024).
6. Applications and Practical Impact
Image Synthesis
SGMs and their CLD/PSLD variants set state-of-the-art unconditional and conditional generation FIDs on datasets such as CIFAR-10 and CelebA-HQ. For example:
- CLD-SSCS (500 NFE, 108M params) achieves CIFAR-10 FID = 2.25, outperforming VP-SDE and LSGM baselines at comparable model size and evaluation budgets.
- PSLD achieves FID = 2.10 for unconditional CIFAR-10 with probability-flow ODE sampling, and its architectures scale to high resolutions.
Sampling cost per image, measured in neural network forward passes, is substantially reduced when using advanced splitting or preconditioning: e.g., PDS approaches accelerate high-res image synthesis by up to 28x (Ma et al., 2023). SSCS-type samplers yield robust sample quality even for small numbers of discretization steps.
General Data Domains and Extensions
- Time-series synthesis: SGM architectures extend to recurrent settings using conditional score networks (Lim et al., 2023).
- Audio, speech, and complex-valued domains: Training in complex-valued domains (e.g., the STFT of speech) leverages the same SGM infrastructure with domain-adapted architectures (Welker et al., 2022).
- Latent variable and discrete data: Latent SGMs paired with VAE frameworks allow joint end-to-end training and enable compatibility with discrete or otherwise non-continuous data and complex decoders (Vahdat et al., 2021).
- Riemannian domains: Score-based sampling can be generalized to manifolds for geoscience, robotics, or molecular applications via Riemannian SDEs (Bortoli et al., 2022).
7. Limitations, Open Problems, and Future Directions
Algorithmic Limitations and Remedies
- Sampling speed: Despite splitting and preconditioning, sampling remains slower than non-iterative adversarial or flow-based approaches; research continues on distillation and non-Markovian samplers.
- Memorization: Even with vanishing score error, empirical SGMs can degenerate to kernel density estimators over the training set, failing to generalize (i.e., only reproducing blurred versions of seen data points) (Li et al., 10 Jan 2024). This necessitates regularization or model choices that enforce broader support and generative diversity—e.g., kernel-based Wasserstein proximal models (Zhang et al., 9 Feb 2024).
Open Directions
- Adaptive step size and higher-order integrators: Symplectic and adaptive integrators from molecular dynamics could further reduce discretization error in stiff SDEs and phase-space formulations (Dockhorn et al., 2021).
- Noise schedule optimization: The optimal scheduling of noise injection is critical to balancing mixing and approximation error—joint tuning of schedule parameters and network is increasingly tractable via explicit upper bounds (Strasman et al., 7 Feb 2024).
- Generalization theory: Improving algorithm-dependent bounds, incorporating interaction with architecture capacity, and quantifying diversity (beyond marginal metrics) remain open.
- Manifold learning & interpretability: Leveraging kernel-based viewpoints and explicit covariance adaptation to learn (and interpret) the underlying data geometry is a promising direction for improving SGM generalization and transparency.
SGMs provide a mathematically principled and practically potent toolkit for generative modeling across domains. Recent advances in diffusion process design, score matching methodology, and numerical integration yield flexible, robust, and efficient frameworks that continue to unify and strengthen the broader field of probabilistic generative modeling.