Diffusion-Based Generative Models

Updated 24 June 2025

Diffusion-based generative models are a class of probabilistic models that synthesize data by reversing a process of progressive noise injection. These models are constructed by defining a forward process that systematically corrupts data into noise through stochastic dynamics—often formulated as discrete-time Markov chains or continuous-time stochastic differential equations (SDEs)—and a parameterized reverse process, typically realized as a neural network, that reconstructs samples from noise. Initially introduced for modeling high-dimensional images, diffusion-based methods now underpin state-of-the-art generative modeling across diverse modalities, with ongoing research expanding their theoretical and practical landscape.
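
As a concrete illustration of the forward noising process, the widely used variance-preserving schedule admits a closed-form Gaussian perturbation kernel, so a data point can be corrupted to any noise level in a single step. The sketch below is a minimal example under that assumption; the schedule constants and helper name are illustrative, not taken from the text.

```python
import torch

def vp_perturb(x0, t, beta_min=0.1, beta_max=20.0):
    """Sample x_t ~ q(x_t | x_0) for a variance-preserving (VP) schedule:
    x_t = sqrt(alpha_bar(t)) * x0 + sqrt(1 - alpha_bar(t)) * eps, eps ~ N(0, I),
    where alpha_bar(t) = exp(-int_0^t beta(s) ds) and beta(t) is linear in t."""
    log_alpha_bar = -0.5 * t**2 * (beta_max - beta_min) - t * beta_min
    alpha_bar = torch.exp(log_alpha_bar)
    eps = torch.randn_like(x0)
    return alpha_bar.sqrt() * x0 + (1.0 - alpha_bar).sqrt() * eps, eps

# At t = 0 the sample equals the data; as t -> 1 it approaches pure noise.
x0 = torch.randn(4, 2)                     # toy "data" batch
x_half, _ = vp_perturb(x0, torch.tensor(0.5))
```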

1. Variational and Stochastic Foundations

The core of diffusion-based generative modeling leverages a variational inference framework applied to continuous-time stochastic processes. The data generative process is modeled by an Itô SDE

$$\mathrm{d}X = \mu(X, t)\,\mathrm{d}t + \sigma(X, t)\,\mathrm{d}B_t,$$

with initial state $X_0 \sim p_0$ and time-evolving density $p(x, t)$. The forward process incrementally adds noise, producing a sequence of progressively corrupted data distributions. The marginal likelihood at terminal time $T$ is characterized via the Feynman–Kac formula, relating the data density to expectations over the paths of a reverse-time SDE:

$$p(x, T) = \mathbb{E}\left[p_0(Y_T) \exp\left(-\int_0^T \nabla \cdot \mu(Y_s, T-s)\,\mathrm{d}s\right) \,\Bigg|\, Y_0 = x\right],$$

where $Y_s$ follows the corresponding reverse-time SDE. A variational lower bound (ELBO) on the data likelihood is derived using the Girsanov theorem, with the Brownian path as the latent variable:

$$\log p(x, T) \geq \mathbb{E}_{\mathbb{Q}}\left[\log \frac{\mathrm{d}\mathbb{P}}{\mathrm{d}\mathbb{Q}} + \log p_0(Y_T) - \int_0^T \nabla\cdot\mu\,\mathrm{d}s \,\Bigg|\, Y_0 = x\right],$$

which can be written explicitly as

$$\mathrm{ELBO}^{\infty} = \mathbb{E}\left[-\frac{1}{2} \int_0^T \|a(\omega, s)\|_2^2\,\mathrm{d}s + \log p_0(Y_T) - \int_0^T \nabla\cdot\mu\,\mathrm{d}s \,\Bigg|\, Y_0 = x\right],$$

where $a(\omega, s)$ is the variational posterior drift. This construction recovers discrete-time hierarchical VAEs in the appropriate limit and identifies diffusion models as infinitely deep VAEs.
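
To make the link between the continuous-time process and discrete-time hierarchical VAEs concrete, the sketch below simulates the forward SDE with an Euler-Maruyama scheme: each update is one Gaussian Markov transition, so a finite number of steps yields exactly the discrete latent hierarchy whose refinement limit is the continuous-time bound above. The specific drift and diffusion functions at the end are illustrative assumptions.

```python
import torch

def euler_maruyama_forward(x0, mu, sigma, T=1.0, n_steps=1000):
    """Simulate dX = mu(X, t) dt + sigma(X, t) dB_t with Euler-Maruyama.
    Each update X_{k+1} = X_k + mu dt + sigma sqrt(dt) z, z ~ N(0, I),
    is one layer of the discrete hierarchical (Markov) latent chain."""
    dt = T / n_steps
    x = x0.clone()
    for k in range(n_steps):
        t = torch.full((x.shape[0], 1), k * dt)
        z = torch.randn_like(x)
        x = x + mu(x, t) * dt + sigma(x, t) * (dt ** 0.5) * z
    return x

# Illustrative variance-preserving choice of drift and diffusion:
beta = lambda t: 0.1 + t * (20.0 - 0.1)
mu = lambda x, t: -0.5 * beta(t) * x
sigma = lambda x, t: beta(t).sqrt()
x_T = euler_maruyama_forward(torch.randn(8, 2), mu, sigma)   # approximately N(0, I)
```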

2. Score Matching and Likelihood Maximization

A defining aspect of these models is the pivotal role of score functions: the gradient of the log-density of the perturbed data. The reverse SDE is driven by the score function $\nabla \log q(y, s)$, which is unknown and is learned via a parameterized network $s_\theta$. Training proceeds through score matching, in either explicit (ESM) or implicit (ISM) form, with associated losses such as

$$\int_0^T \mathbb{E}_{Y_s}\left[\frac{1}{2} \left\|s(Y_s, s) - \nabla \log q(Y_s, s)\right\|_{\Lambda(s)}^2\right]\mathrm{d}s$$

or

$$\mathbb{E}\left[\frac{1}{2} \|s(Y_s, s)\|_{\Lambda}^2 + \nabla\cdot\!\left(\Lambda^\top s\right)\right].$$

A central theoretical finding is that minimizing the score matching loss corresponds to maximizing a lower bound of the likelihood of the generative diffusion process. Specifically, the ELBO for the "plug-in" reverse SDE (where the learned score replaces the true score in the SDE) reduces to a negative ISM loss plus a boundary term, formally bridging empirical score-based approaches with a likelihood-based justification.
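
To illustrate the ISM objective above, the following sketch estimates $\mathbb{E}\big[\tfrac{1}{2}\|s\|^2 + \nabla\cdot s\big]$ for a score network, taking $\Lambda$ as the identity and estimating the divergence with a Hutchinson probe; the network interface and probe distribution are assumptions, not prescriptions from the text.

```python
import torch

def ism_loss(score_net, y, t, n_probes=1):
    """Implicit score matching loss E[ 0.5 * ||s(y, t)||^2 + div_y s(y, t) ],
    with the divergence estimated by Hutchinson's trick:
    div s ~= E_v[ v^T d/dy (v^T s) ] for random probe vectors v."""
    y = y.detach().requires_grad_(True)
    s = score_net(y, t)                          # (batch, dim) score estimate
    sq_term = 0.5 * (s ** 2).sum(dim=1)
    div = torch.zeros(y.shape[0], device=y.device)
    for _ in range(n_probes):
        v = torch.randn_like(y)
        (grad_vs,) = torch.autograd.grad((s * v).sum(), y, create_graph=True)
        div = div + (grad_vs * v).sum(dim=1) / n_probes
    return (sq_term + div).mean()
```

Averaging this loss over diffusion times and forward-process samples and minimizing it in $\theta$ is, up to the boundary term mentioned above, the same as maximizing the ELBO of the plug-in reverse SDE.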

3. Connections with Normalizing Flows and Infinite-Depth VAEs

The proposed framework unifies diffusion models with continuous-time normalizing flows. When the diffusion coefficient $\sigma = 0$, the SDE becomes an ODE and the variational lower bound (ELBO) is tight, precisely recovering the continuous normalizing flow (a "Neural ODE" formulation). Normalizing flows are thus the deterministic limiting case of the more general stochastic variational setup of diffusion models.

The variational interpretation also demonstrates that diffusion models can be viewed as infinitely deep hierarchical VAEs, as the depth of the Markovian latent variable model approaches infinity and the ELBO converges to the continuous-time bound. The optimal variational inference SDE drift aligns with the exact score function, tightly coupling inference and generative modeling objectives.
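
The $\sigma = 0$ limit can be made concrete with the instantaneous change-of-variables formula used by continuous normalizing flows: integrating the deterministic drift while accumulating its divergence yields an exact log-likelihood. The sketch below assumes a generic drift $v(x, t)$, a known prior log-density at time $T$, and a fixed-step Euler integrator; the exact per-dimension divergence is only practical in low dimensions.

```python
import torch

def divergence(v, x, t):
    """Exact trace of the Jacobian of v at (x, t); fine for toy dimensions
    (swap in a Hutchinson estimator for high-dimensional data)."""
    x = x.detach().requires_grad_(True)
    out = v(x, t)
    div = torch.zeros(x.shape[0])
    for i in range(x.shape[1]):
        div = div + torch.autograd.grad(out[:, i].sum(), x, retain_graph=True)[0][:, i]
    return div

def ode_log_likelihood(v, x0, log_p_T, T=1.0, n_steps=500):
    """Log-likelihood in the deterministic (sigma = 0) limit: integrate
    dx/dt = v(x, t) from the data at t = 0 to t = T and use
    log p_0(x_0) = log p_T(x_T) + int_0^T div v(x(t), t) dt."""
    dt = T / n_steps
    x, logdet = x0, torch.zeros(x0.shape[0])
    for k in range(n_steps):
        t = torch.full((x.shape[0], 1), k * dt)
        logdet = logdet + divergence(v, x, t) * dt
        with torch.no_grad():
            x = x + v(x, t) * dt
    return log_p_T(x) + logdet

# Standard-normal prior at t = T (an illustrative choice):
log_p_T = lambda x: -0.5 * (x ** 2).sum(dim=1) - 0.5 * x.shape[1] * torch.log(torch.tensor(2 * torch.pi))
```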

4. Theoretical Insights and Family of Reverse SDEs

A theoretical advancement is the identification of a family of plug-in reverse SDEs parameterized by $\lambda$, interpolating between fully stochastic ($\lambda = 0$), partially deterministic, and fully deterministic ($\lambda = 1$, ODE) regimes:

$$\mathrm{d}X = \left(\left(1-\frac{\lambda}{2}\right)g^2 s - f\right)\mathrm{d}t + \sqrt{1-\lambda}\,g\,\mathrm{d}B_t.$$

The maximization of the likelihood lower bound via score matching holds for this entire family, not just the stochastic process typically used in practice. This generality theoretically supports flexible implementation choices in designing reverse-time samplers.
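
A minimal sampler for this family, under the assumption of a trained score network and VP-style $f$ and $g$, integrates the plug-in reverse SDE in reversed time with Euler-Maruyama: $\lambda = 0$ reproduces the usual stochastic reverse SDE sampler and $\lambda = 1$ the deterministic ODE sampler. Function names and step counts below are illustrative.

```python
import torch

@torch.no_grad()
def sample_reverse(score_net, f, g, lam=0.0, T=1.0, n_steps=1000, shape=(16, 2)):
    """Euler-Maruyama sampler for the plug-in reverse SDE family
    dY = ((1 - lam/2) g^2 s - f) dtau + sqrt(1 - lam) g dB,
    integrated in reversed time tau = T - t, starting from a N(0, I) prior.
    lam = 0: fully stochastic reverse SDE; lam = 1: deterministic ODE."""
    dtau = T / n_steps
    y = torch.randn(shape)                       # prior sample at t = T
    for k in range(n_steps):
        t = T - k * dtau                         # physical (forward) time
        tt = torch.full((shape[0], 1), t)
        s = score_net(y, tt)                     # learned score at (y, t)
        drift = (1.0 - lam / 2.0) * g(tt) ** 2 * s - f(y, tt)
        y = y + drift * dtau
        if lam < 1.0:
            y = y + (1.0 - lam) ** 0.5 * g(tt) * (dtau ** 0.5) * torch.randn_like(y)
    return y
```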

5. Implications, Limitations, and Future Research

The variational perspective yields several impactful implications:

  • Principled Training of Diffusion Models: Empirically successful score-based methods are theoretically grounded as approximate likelihood maximization procedures, validating their use beyond heuristic observation.
  • Unified Generative Modeling: VAEs, normalizing flows, and diffusion models coalesce within a single stochastic variational framework, clarifying their connections and encouraging hybrid methodologies.
  • Objective Design and Evaluation: The link between score matching and likelihood reveals new avenues for loss weighting and estimator selection, such as non-uniform sampling of diffusion time or adjusted mixture weighting, aimed at reducing bias and variance.
  • Modeling Power and Generalization: The framework accommodates richer stochastic inference processes, hinting at extensions to non-Markovian, structured, or discrete data, and to diffusion models with more complex SDE dynamics.

Challenges and open questions remain, such as fully explaining observed sample quality, generalizing to high-dimensional and structured data, and establishing optimal objective formulations for complex domains.


Summary Table: Main Theoretical and Practical Features

| Aspect | Contribution |
| --- | --- |
| Variational framework | SDE-based Feynman–Kac / Girsanov ELBO; unifies diffusion models with VAEs and flows |
| Score matching | Maximizes a lower bound on the model likelihood; practical and theoretical bridge |
| Continuous-time flows | Recovered as the deterministic ($\sigma = 0$) limit |
| Infinite-depth VAE | Diffusion models are infinitely deep hierarchical VAEs |
| Family of reverse SDEs | Likelihood lower bound valid for the entire $\lambda$-parameterized plug-in SDE family |
| Implications | Improved estimator design, richer inference models, theoretical clarification |

The variational formalism for diffusion-based generative models and score matching provides the first rigorous likelihood-based justification for the core mechanics of these models, situates score-matching training within a maximum likelihood framework, and articulates their mathematical relationship to flows and VAEs. This theoretical foundation underpins current and future advances in training objectives, model architectures, and domain extensions for diffusion models in generative modeling.