Diffusion-Based Generative Modeling
- Diffusion-based generative modeling is a family of probabilistic models that reverse a designed noising process using stochastic differential equations.
- It employs deep neural networks to estimate score functions and invert complex forward processes, enabling expressive and stable data synthesis.
- The flexible framework enables geometry-aware noise schedules and regularization, leading to improved convergence and empirical performance.
Diffusion-based generative modeling encompasses a family of probabilistic frameworks in which the synthesis of new data is formulated as the reversal of a carefully constructed “noising” process. By incrementally transforming structured data into noise via a parameterized Markov process—often realized through stochastic differential equations (SDEs)—and then learning to invert this process with deep neural networks, diffusion models provide a theoretically robust and highly expressive foundation for modern generative modeling. The continued evolution of this field integrates advances in geometry, optimization, sampling, and downstream applications, with numerous theoretical and practical consequences.
1. Mathematical Foundations and Core Mechanisms
Diffusion-based generative models are constructed on a pair of stochastic processes defined on the data space $\mathbb{R}^d$: a forward (noising) process that gradually corrupts a sample into a known simple distribution (usually isotropic Gaussian), and a reverse (generative) process that attempts to invert this transformation. The forward process is most generally specified as the solution to an SDE
$$dx_t = f(x_t, t)\,dt + g(t)\,dW_t,$$
where $f$ is the drift, $g$ is the diffusion coefficient, and $W_t$ is standard Brownian motion.
The canonical forms for $f$ and $g$—such as those in DDPMs (variance-preserving), SMLD/VE (variance-exploding), or critically-damped Langevin—are special, hand-designed cases. In modern frameworks, like “A Flexible Diffusion Model” (Du et al., 2022), the spatial part of the SDE is generalized to
$$dx_t = -\big[D(x_t) + Q(x_t)\big]\nabla U(x_t)\,dt + \Gamma(x_t)\,dt + \sqrt{2\,D(x_t)}\,dW_t, \qquad U(x) = \tfrac{1}{2}\|x\|^2,$$
where $D(x)$ is a data-dependent positive-definite symmetric matrix (Riemannian metric), $Q(x)$ is an anti-symmetric (symplectic) matrix, and $\Gamma_i(x) = \sum_j \partial_{x_j}\big(D_{ij}(x) + Q_{ij}(x)\big)$ is the divergence correction, enabling anisotropic and potentially Hamiltonian structure in the forward process.
The reverse process, crucially, relies on estimates of the score function $\nabla_x \log p_t(x)$, with the corresponding reverse-time SDE given by
$$dx_t = \big[f(x_t, t) - g(t)^2\,\nabla_x \log p_t(x_t)\big]\,dt + g(t)\,d\bar{W}_t,$$
where $\bar{W}_t$ is a reverse-time Brownian motion; the relevant marginals and conditionals are typically intractable, and the score is learned via a neural network $s_\theta(x, t)$.
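To make the forward/reverse pair concrete, the following minimal sketch (not code from the paper) simulates a variance-preserving instance of both SDEs with Euler-Maruyama steps; the placeholder `score_fn` stands in for the learned network $s_\theta$, and the schedule values are illustrative.

```python
import numpy as np

def forward_noise(x0, betas, rng):
    """Euler-Maruyama simulation of the variance-preserving forward SDE
    dx = -0.5 * beta(t) * x dt + sqrt(beta(t)) dW."""
    x = x0.copy()
    dt = 1.0 / len(betas)
    for beta in betas:
        x = x - 0.5 * beta * x * dt + np.sqrt(beta * dt) * rng.standard_normal(x.shape)
    return x

def reverse_sample(x_T, betas, score_fn, rng):
    """Euler-Maruyama simulation of the reverse-time SDE
    dx = [-0.5 * beta(t) * x - beta(t) * score(x, t)] dt + sqrt(beta(t)) dW_bar,
    integrated backwards from t = 1 to t = 0."""
    x = x_T.copy()
    n = len(betas)
    dt = 1.0 / n
    for i in reversed(range(n)):
        beta = betas[i]
        t = (i + 1) / n
        drift = -0.5 * beta * x - beta * score_fn(x, t)
        x = x - drift * dt + np.sqrt(beta * dt) * rng.standard_normal(x.shape)
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    betas = np.linspace(0.1, 20.0, 1000)             # toy VP noise schedule
    x0 = rng.standard_normal((16, 2)) * 0.1 + 3.0    # toy data away from the origin
    x_T = forward_noise(x0, betas, rng)              # approximately standard Gaussian
    # with a learned score the reverse run would recover the data distribution;
    # here the score of a standard Gaussian (-x) is used purely as a placeholder
    samples = reverse_sample(x_T, betas, lambda x, t: -x, rng)
```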
2. Theoretical Advances: Abstract Parameterization and Guarantees
Recent developments have revealed that the family of valid forward SDEs is much broader than previously exploited. The abstract formalism in (Du et al., 2022) shows:
- Gaussian Stationarity: The parameterization with arbitrary symmetric positive-definite $D$ and anti-symmetric $Q$ preserves the stationary distribution as a standard Gaussian, providing theoretical assurance for generative modeling.
- Completeness: The decomposition into Riemannian/anisotropic ($D$) and symplectic/Hamiltonian ($Q$) components is shown to be complete for flows preserving stationarity.
- Ergodicity and Mixing: Under regularity (e.g., Hörmander’s condition), the reverse-time SDEs remain ergodic, even if the spatial diffusion matrix is degenerate, provided the symplectic/Hamiltonian component is sufficiently nontrivial to mix the space.
These properties enable the “FP-diffusion” model to efficiently represent inhomogeneous and anisotropic noising processes, opening avenues for geometry-aware training and sampling.
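As a minimal numerical illustration of the stationarity guarantee, the sketch below builds a constant symmetric positive-definite $D$ and anti-symmetric $Q$ (so the divergence correction vanishes) and checks that the resulting SDE mixes toward the standard Gaussian; helper names such as `make_spd` are assumptions for this sketch, not code from the paper.

```python
import numpy as np

def make_spd(A):
    """Symmetric positive-definite matrix from an arbitrary square matrix."""
    return A @ A.T + 1e-3 * np.eye(A.shape[0])

def make_antisymmetric(B):
    """Anti-symmetric (Q = -Q^T) matrix from an arbitrary square matrix."""
    return 0.5 * (B - B.T)

def gaussian_stationary_drift(x, D, Q):
    """Drift of dx = -(D + Q) grad U(x) dt + sqrt(2 D) dW with U(x) = ||x||^2 / 2,
    whose stationary law is the standard Gaussian (the divergence correction
    term vanishes because D and Q are constant here)."""
    return -(D + Q) @ x

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    d = 2
    D = make_spd(rng.standard_normal((d, d)))
    Q = make_antisymmetric(rng.standard_normal((d, d)))
    L = np.linalg.cholesky(2.0 * D)          # diffusion factor, sqrt(2 D)

    # long Euler-Maruyama run: the empirical covariance should approach the identity
    x = np.zeros(d)
    dt, n_steps, samples = 1e-2, 200_000, []
    for k in range(n_steps):
        x = x + gaussian_stationary_drift(x, D, Q) * dt + np.sqrt(dt) * (L @ rng.standard_normal(d))
        if k > n_steps // 2:
            samples.append(x.copy())
    print(np.cov(np.array(samples).T))        # roughly the 2x2 identity matrix
```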
3. Optimization and Training Methodology
Diffusion models are commonly trained with denoising score matching (ESM/DSM losses), e.g.
$$\mathcal{L}(\theta) = \mathbb{E}_{t,\,x_0,\,x_t \mid x_0}\Big[\lambda(t)\,\big\|\,s_\theta(x_t, t) - \nabla_{x_t}\log p_t(x_t \mid x_0)\,\big\|^2_{\Lambda(x_t, t)}\Big],$$
where $\Lambda$ is the (possibly anisotropic) preconditioning attached to the local diffusion geometry. Under the abstract SDE parameterization, conditional training—using known conditional scores of $p_t(x_t \mid x_0)$—can be utilized to sidestep intractable global score computation.
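A minimal, isotropic instance of this objective (a sketch under a toy VP schedule with $\beta(t)=1$, omitting the anisotropic preconditioning $\Lambda$; `ScoreNet` and the training data are illustrative) looks as follows.

```python
import torch
import torch.nn as nn

class ScoreNet(nn.Module):
    """Tiny MLP score model s_theta(x, t); purely illustrative."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, 128), nn.SiLU(),
            nn.Linear(128, 128), nn.SiLU(),
            nn.Linear(128, dim),
        )

    def forward(self, x, t):
        return self.net(torch.cat([x, t[:, None]], dim=-1))

def dsm_loss(model, x0, eps=1e-5):
    """Denoising score matching under a toy VP schedule with beta(t) = 1:
    p(x_t | x_0) = N(alpha_t x_0, sigma_t^2 I) with alpha_t = exp(-t/2),
    conditional score = -(x_t - alpha_t x_0) / sigma_t^2 = -noise / sigma_t,
    and the usual lambda(t) = sigma_t^2 weighting."""
    t = torch.rand(x0.shape[0], device=x0.device) * (1.0 - eps) + eps
    alpha = torch.exp(-0.5 * t)[:, None]
    sigma = torch.sqrt(1.0 - alpha ** 2)
    noise = torch.randn_like(x0)
    x_t = alpha * x0 + sigma * noise
    target = -noise / sigma                     # conditional score of p(x_t | x_0)
    pred = model(x_t, t)
    return ((sigma ** 2) * (pred - target) ** 2).sum(dim=-1).mean()

if __name__ == "__main__":
    model = ScoreNet(dim=2)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    x0 = torch.randn(256, 2) * 0.5 + 2.0        # toy data
    for _ in range(100):
        loss = dsm_loss(model, x0)
        opt.zero_grad(); loss.backward(); opt.step()
```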
This approach is variationally justified: as established in (Huang et al., 2021), minimizing the score matching loss is equivalent to maximizing a variational evidence lower bound (ELBO) on the likelihood associated with the learned reverse SDE. The variational perspective unifies diffusion models, continuous-time normalizing flows, and VAEs into a single framework for likelihood-based estimation using path-wise latent variables.
Furthermore, FP-diffusion facilitates explicit regularization on the forward paths, such as penalties on the kinetic energy—a concept drawn from continuous normalizing flows—enabling enhanced control and robustness during model learning.
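The kinetic-energy regularizer can be sketched as a simple path-space penalty on the drift; the function below is a generic illustration in the spirit of continuous normalizing flows, not the paper's exact regularizer.

```python
import torch

def kinetic_energy_penalty(drift_fn, x_t, t):
    """Path-space regularizer: expected squared norm of the drift at sampled
    forward-process states. Penalizing it favors short, smooth transport
    trajectories. `drift_fn` is assumed to return f(x, t)."""
    v = drift_fn(x_t, t)
    return 0.5 * (v ** 2).sum(dim=-1).mean()

# illustrative usage with a linear drift and states drawn from the forward process
x_t = torch.randn(64, 2)
t = torch.rand(64)
penalty = kinetic_energy_penalty(lambda x, t: -0.5 * x, x_t, t)
# total objective (sketch): score-matching loss + lam * penalty, for a chosen weight lam
```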
4. Unification and Extension Beyond Fixed Schedules
The spatially flexible SDE formalism includes earlier models as special cases (VP, VE, critically-damped Langevin), but transcends them by permitting spatially inhomogeneous, learnable noising schedules. This generalization allows the forward diffusion to be better matched to the dataset's geometry—for example, concentrating noise addition along directions aligned with the data subspace when the data lie near a low-dimensional manifold or carry physical structure.
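As a toy illustration of geometry-matched noising (a hand-built construction, not the learned schedule of FP-diffusion), one can assemble an anisotropic diffusion matrix from the principal directions of data lying near a one-dimensional subspace:

```python
import numpy as np

# toy data concentrated near a 1-D subspace of R^2
rng = np.random.default_rng(2)
u = np.array([1.0, 1.0]) / np.sqrt(2.0)
data = np.outer(rng.standard_normal(500), u) + 0.05 * rng.standard_normal((500, 2))

# dominant principal direction of the data
eigvals, eigvecs = np.linalg.eigh(np.cov(data.T))
v = eigvecs[:, -1]                                # direction of largest variance

# anisotropic metric: strong diffusion along the data subspace, weak off it
# (which direction should receive more noise is a modeling choice; FP-diffusion learns it)
on, off = 1.0, 0.1
D = on * np.outer(v, v) + off * (np.eye(2) - np.outer(v, v))
print(np.round(D, 3))                             # SPD matrix aligned with the data direction
```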
Broader theoretical contributions, such as the connection between bridge processes and diffusion generative modeling (Liu et al., 2022), as well as the links to optimal control and Hamilton–Jacobi–Bellman equations (Berner et al., 2022), further consolidate these unifications, expanding the design space for future diffusion-based approaches.
5. Empirical Performance and Architectural Details
Empirical validation on synthetic, MNIST, and CIFAR10 datasets demonstrates the practical efficacy of flexible, geometry-aware SDE parameterization:
- Low-dimensional synthetic data: FP-diffusion models with learned anisotropic noising outperform fixed-isotropic baselines in aligning generated samples to the data manifold, particularly when regularization is employed to shape transport toward optimal projection directions.
- Image data (MNIST, CIFAR10): Two-stage (“Mix”) training—in which the SDE parameters and score network are first optimized jointly, then the score function is fine-tuned with fixed SDE—achieves competitive or superior negative log-likelihood and FID scores relative to both normalizing flow and state-of-the-art diffusion models, often with fewer model parameters. This indicates a substantial efficiency gain. Sharp and diverse samples are obtained, with the model filling the data manifold efficiently and avoiding unnatural denoising paths.
For practical implementation, the model relies on a neural network parameterizing the score, trained against the tractable conditional marginals of the forward process. With spatial flexibility, specialized architectures—e.g., U-Nets with position-dependent modulation—can further enhance performance by capturing inhomogeneous data structure.
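A skeleton of the two-stage ("Mix") procedure, with illustrative learnable SDE parameters and a stand-in objective (the real conditional score-matching loss and architectures are not reproduced here), might look as follows.

```python
import torch
import torch.nn as nn

# illustrative components: a small score network plus learnable SDE geometry
# (a factor for D and a generator for the anti-symmetric Q)
dim = 2
score_net = nn.Sequential(nn.Linear(dim + 1, 64), nn.SiLU(), nn.Linear(64, dim))
sde_params = nn.ParameterDict({
    "L": nn.Parameter(torch.eye(dim)),           # D = L L^T + eps * I  (symmetric positive-definite)
    "B": nn.Parameter(torch.zeros(dim, dim)),    # Q = B - B^T          (anti-symmetric)
})

def training_loss():
    """Stand-in quadratic objective so the skeleton runs end to end; in practice
    this would be the (conditional) score-matching loss of the flexible SDE."""
    D = sde_params["L"] @ sde_params["L"].T + 1e-3 * torch.eye(dim)
    Q = sde_params["B"] - sde_params["B"].T
    x = torch.randn(128, dim)
    t = torch.rand(128)
    s = score_net(torch.cat([x, t[:, None]], dim=-1))
    return ((s + x @ (D + Q).T) ** 2).mean()

# Stage 1 ("Mix"): optimize SDE parameters and score network jointly
opt_joint = torch.optim.Adam(list(score_net.parameters()) + list(sde_params.parameters()), lr=1e-3)
for _ in range(200):
    loss = training_loss()
    opt_joint.zero_grad(); loss.backward(); opt_joint.step()

# Stage 2: freeze the SDE geometry and fine-tune the score network alone
for p in sde_params.parameters():
    p.requires_grad_(False)
opt_score = torch.optim.Adam(score_net.parameters(), lr=1e-4)
for _ in range(200):
    loss = training_loss()
    opt_score.zero_grad(); loss.backward(); opt_score.step()
```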
6. Comparison with Classical and Alternative Diffusion Models
The key improvements of the flexible, geometry-aware approach over standard diffusion models are:
| Aspect | Standard SDE/VP/VE | Flexible Parameterization (FP-Diffusion) |
|---|---|---|
| Noise schedule | Scalar, fixed | Learnable, spatially varying, anisotropic |
| Stationarity | Fixed Gaussian | Guaranteed Gaussian, regardless of $D$, $Q$ |
| Geometry adaptation | None | Explicit adaptation to data manifold geometry |
| Model coverage | VP/VE/Langevin only | Strict superset, general SDEs |
| Optimization | Standard DSM/ELBO | Score matching + regularization in path space |
This approach explicitly allows the forward process to be tailored and regularized in a data-dependent manner, creating smoother, more efficient generative trajectories—especially important in data with pronounced geometric (manifold) structure.
7. Implementation Considerations and Practical Guidance
- Model parameterization: $D$ and $Q$ can be chosen as either fixed or learnable matrices (potentially low-rank for efficiency; see the sketch after this list).
- Training: Use a two-stage strategy, optimizing SDE parameters and score jointly, then fixing the SDE and fine-tuning the score. The necessity of this “Mix” phase was empirically established, as pure joint training underperforms.
- Regularization: Incorporate terms penalizing the kinetic energy or deep path metrics to stabilize trajectory learning.
- Computation: While more flexible SDEs slightly increase per-step computation due to spatial dependency, overall efficiency gains arise from improved convergence and better data matching, allowing smaller models and/or fewer sampling steps.
- Applications: The model is applicable to a range of domains from synthetic low-dimensional data to complex image datasets, and is especially suited to settings where data exhibit significant geometric or manifold structure.
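For the low-rank option mentioned in the first bullet, a sketch of how $D$ and $Q$ might be parameterized with O(d·r) storage and matrix-vector cost is shown below; the class and method names are illustrative, not from the paper.

```python
import torch
import torch.nn as nn

class LowRankSDEGeometry(nn.Module):
    """Illustrative low-rank parameterization of the SDE geometry:
    D = U U^T + delta * I  (symmetric positive-definite),
    Q = V W^T - W V^T      (anti-symmetric),
    with rank r << dim, so matrix-vector products cost O(dim * r)."""
    def __init__(self, dim, rank, delta=1e-3):
        super().__init__()
        self.U = nn.Parameter(0.01 * torch.randn(dim, rank))
        self.V = nn.Parameter(0.01 * torch.randn(dim, rank))
        self.W = nn.Parameter(0.01 * torch.randn(dim, rank))
        self.delta = delta

    def apply_D(self, x):            # x: (batch, dim); returns rows D x_i
        return (x @ self.U) @ self.U.T + self.delta * x

    def apply_Q(self, x):            # returns rows Q x_i for Q = V W^T - W V^T
        return (x @ self.W) @ self.V.T - (x @ self.V) @ self.W.T

geom = LowRankSDEGeometry(dim=784, rank=16)
x = torch.randn(32, 784)
drift = -(geom.apply_D(x) + geom.apply_Q(x))   # drift toward the Gaussian stationary law
```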
Diffusion-based generative modeling, particularly with flexible SDE parameterization, represents a theoretically grounded generalization of earlier models, accommodating data-dependent, geometry-aware noising strategies and supporting explicit regularization. These developments not only yield empirical improvements but also create a unified mathematical framework for diffusion, flow, and score-based generation, with robust guarantees on stationary distribution, convergence, and extensibility, as detailed in (Du et al., 2022) and related literature.