Diffusion Generative Models Overview
- Diffusion generative models are deep probabilistic techniques that generate data by inverting an iterative noise-corruption process inspired by thermodynamic diffusion.
- They employ a forward noising process and a learned reverse denoising chain, enabling high-fidelity synthesis across images, audio, and molecular data.
- Their training leverages variational lower bounds and score-matching losses, ensuring stable optimization and adaptability to diverse high-dimensional domains.
Diffusion generative models are a class of deep probabilistic models that synthesize data by inverting a progressive noise-corruption process, leveraging a close analogy to non-equilibrium thermodynamics. Their core design is built around two coupled stochastic processes: a forward ("diffusion" or "noising") chain that incrementally destroys structure in the data, and a learned reverse ("denoising" or "generative") process that reconstructs high-fidelity samples by sequentially removing noise. The success of diffusion models in image, audio, molecular, and multimodal generation tasks is rooted in rigorous variational learning principles, stable training even in high dimensions, and a broad array of algorithmic enhancements that allow the basic framework to adapt to numerous data domains and operational constraints (Torre, 2023; Gallon et al., 2024; Cao et al., 2022; Higham et al., 2023).
1. Thermodynamic Analogy and Fundamental Processes
Diffusion models are motivated by the analogy to physical diffusion in non-equilibrium thermodynamics, in which two miscible fluids mix and form a homogeneous solution as entropy increases. In generative modeling, the forward process is an engineered Markov chain that adds small amounts of Gaussian noise at each discrete or continuous step:
$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\right),$$
with noise variances $\beta_1, \dots, \beta_T$, where each $\beta_t \ll 1$ so that the reverse transitions remain approximately Gaussian (near-reversibility). This gradually transforms complex data distributions into nearly isotropic Gaussian noise (Torre, 2023; Higham et al., 2023).
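Because the forward chain is Gaussian, $x_t$ can be sampled directly from $x_0$ in closed form, $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$ with $\bar\alpha_t = \prod_{s\le t}(1-\beta_s)$. A minimal NumPy sketch, assuming a standard linear $\beta$-schedule (the function name and schedule values are illustrative, not from the source):

```python
import numpy as np

def forward_diffuse(x0, t, betas, rng):
    """Sample x_t ~ q(x_t | x_0) in closed form:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    alpha_bar = np.cumprod(1.0 - betas)[t]
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps

# Illustrative linear schedule over T = 1000 steps.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
rng = np.random.default_rng(0)

x0 = rng.standard_normal(64)                 # toy "data" vector
xT = forward_diffuse(x0, T - 1, betas, rng)  # heavily noised sample
alpha_bar_T = np.cumprod(1.0 - betas)[-1]    # ~4e-5: x_T is essentially pure noise
```

The near-zero terminal $\bar\alpha_T$ is what justifies initializing generation from an isotropic Gaussian.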
In the reverse direction, a neural network parameterizes the denoising kernel
$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right),$$
where $\mu_\theta$ is often defined through explicit noise-prediction,
$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\,\epsilon_\theta(x_t, t)\right), \qquad \alpha_t = 1-\beta_t,\quad \bar\alpha_t = \prod_{s=1}^{t}\alpha_s,$$
and $\epsilon_\theta$ predicts the noise added in the forward process (Torre, 2023; Scotta et al., 2023).
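One step of this reverse chain can be sketched as follows; here a trained network would supply `eps_pred`, and the variance choice $\Sigma_t = \beta_t I$ is one common convention (the function name is illustrative):

```python
import numpy as np

def ddpm_step(x_t, t, eps_pred, betas, rng):
    """One reverse step x_{t-1} ~ p_theta(x_{t-1} | x_t), with the mean
    computed from the predicted noise eps_pred = eps_theta(x_t, t)."""
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    mean = (x_t - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps_pred) / np.sqrt(alphas[t])
    if t == 0:
        return mean                       # no noise is added at the final step
    sigma = np.sqrt(betas[t])             # simple choice: Sigma_t = beta_t * I
    return mean + sigma * rng.standard_normal(x_t.shape)

betas = np.linspace(1e-4, 0.02, 1000)
rng = np.random.default_rng(1)
x = rng.standard_normal(8)
# A zero "network" stands in for eps_theta here; a trained model replaces it.
x_prev = ddpm_step(x, 999, np.zeros_like(x), betas, rng)
```

Full sampling iterates this step from $t = T-1$ down to $t = 0$, starting from $x_T \sim \mathcal{N}(0, I)$.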
In the continuous limit, the forward SDE is
$$dx = f(x, t)\,dt + g(t)\,dw,$$
and the reverse SDE is
$$dx = \left[f(x, t) - g(t)^2\,\nabla_x \log p_t(x)\right]dt + g(t)\,d\bar{w},$$
where $\bar{w}$ is a reverse-time Wiener process. The "score function" $\nabla_x \log p_t(x)$ is estimated by a neural network $s_\theta(x, t)$ (Torre, 2023; Scotta et al., 2023).
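The reverse SDE can be integrated numerically with Euler–Maruyama. The sketch below makes simplifying assumptions not stated in the source: $f = 0$, $g = 1$, and a toy Gaussian data distribution whose exact score $-x/(1+t)$ is known in closed form, so no learned network is needed to check that the sampler recovers the data law:

```python
import numpy as np

def reverse_sde_sample(score, T=5.0, n_steps=500, n_samples=2000, seed=0):
    """Euler-Maruyama integration of the reverse SDE
    dx = [f(x,t) - g(t)^2 * score(x,t)] dt + g(t) d(w_bar),
    specialized to f = 0 and g = 1, run backward from t = T to t = 0."""
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    x = rng.standard_normal(n_samples) * np.sqrt(1.0 + T)  # x_T ~ p_T = N(0, 1+T)
    t = T
    for _ in range(n_steps):
        drift = -score(x, t)                   # f - g^2 * score, with f = 0, g = 1
        x = x - drift * dt + np.sqrt(dt) * rng.standard_normal(n_samples)
        t -= dt
    return x

# Toy check: data ~ N(0,1) and forward dx = dw give p_t = N(0, 1+t),
# so the exact score is -x / (1 + t); samples should return to N(0, 1).
samples = reverse_sde_sample(lambda x, t: -x / (1.0 + t))
```

In practice the analytic score is replaced by the learned $s_\theta(x, t)$, and more elaborate predictor-corrector or ODE-based integrators are used.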
2. Variational Training Objective and Score Matching
Diffusion models maximize likelihood via a variational lower bound (ELBO), which decomposes into a sum of KL divergences measuring the discrepancy between the true (forward) and learned (reverse) transitions:
$$\mathcal{L} = \mathbb{E}_q\!\left[\sum_{t=2}^{T} D_{\mathrm{KL}}\!\left(q(x_{t-1} \mid x_t, x_0)\,\|\,p_\theta(x_{t-1} \mid x_t)\right) - \log p_\theta(x_0 \mid x_1)\right] + \text{const}.$$
For the Gaussian case, these KL terms reduce to mean-squared-error losses on the predicted noise:
$$L_{\text{simple}} = \mathbb{E}_{t,\,x_0,\,\epsilon}\!\left[\left\|\epsilon - \epsilon_\theta(x_t, t)\right\|^2\right].$$
This loss can be interpreted equivalently as denoising score matching (Torre, 2023; Scotta et al., 2023; Higham et al., 2023; Gallon et al., 2024). In continuous time, the ELBO and the score-matching loss are linked through the Fokker–Planck equation governing the evolution of the marginal densities $p_t$ (Torre, 2023; Ding et al., 2024).
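The simplified objective is cheap to estimate by Monte Carlo: sample a timestep, noise a clean batch in closed form, and regress the injected noise. A sketch with a hypothetical `simple_loss` helper (the untrained stand-in network below always predicts zero, so its loss is just $\mathbb{E}\|\epsilon\|^2 \approx 1$ per coordinate):

```python
import numpy as np

def simple_loss(eps_model, x0_batch, betas, rng):
    """Monte-Carlo estimate of L_simple = E_{t, x0, eps} ||eps - eps_theta(x_t, t)||^2."""
    T = len(betas)
    alpha_bar = np.cumprod(1.0 - betas)
    t = rng.integers(0, T, size=len(x0_batch))        # random timestep per example
    eps = rng.standard_normal(x0_batch.shape)         # the noise to be predicted
    a = np.sqrt(alpha_bar[t])[:, None]
    b = np.sqrt(1.0 - alpha_bar[t])[:, None]
    x_t = a * x0_batch + b * eps                      # closed-form forward noising
    return np.mean((eps - eps_model(x_t, t)) ** 2)

betas = np.linspace(1e-4, 0.02, 1000)
rng = np.random.default_rng(2)
x0_batch = rng.standard_normal((256, 16))
# Zero predictor as a baseline; a trained eps_theta would drive this well below 1.
loss = simple_loss(lambda x_t, t: np.zeros_like(x_t), x0_batch, betas, rng)
```

Training consists of minimizing this quantity over the parameters of $\epsilon_\theta$ with stochastic gradient descent.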
3. Noise Schedules, Sampling Algorithms, and Inference
The choice of noise schedule is