Diffusion Probabilistic Models
- Diffusion probabilistic models are deep generative frameworks that gradually reverse a noising process using neural networks to synthesize complex data distributions.
- They utilize advanced ODE and SDE solvers to accelerate sampling, achieving high-quality outputs with significant speed improvements and competitive FID scores.
- Recent research optimizes architecture, covariance design, and efficiency while expanding applications to representation learning, inverse problems, and multi-modal data synthesis.
Diffusion probabilistic models (DPMs) are a class of deep generative models that synthesize high-quality data by learning to reverse a gradual noising process, transforming simple noise distributions into complex data distributions (such as natural images, audio, or graphs). DPMs operate via a forward process (diffusion) that iteratively corrupts data with noise and a reverse process (denoising) trained to invert this diffusion, typically parameterized by neural networks. The theoretical foundation and practical utility of DPMs have been substantially advanced over the past several years, with numerous developments in acceleration, coverage of diverse modalities, improved sample quality, and applications extending well beyond generation, such as representation learning and inverse problems.
1. Mathematical Formulation and Theoretical Foundations
The canonical discrete-time DPM, known as the denoising diffusion probabilistic model (DDPM), constructs a Markov chain that gradually corrupts samples from a data distribution $q(x_0)$:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t \mathbf{I}\right), \qquad t = 1, \dots, T,$$

where $\beta_t$ is a variance schedule. The forward process admits the closed-form marginal $q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t)\mathbf{I}\right)$, with $\alpha_t = 1-\beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t}\alpha_s$. At synthesis time, a neural network estimates either the score $\nabla_{x_t}\log q(x_t)$ (gradient of the log-density) or the mean of the denoising transition $p_\theta(x_{t-1}\mid x_t)$. The continuous-time analogue uses stochastic differential equations (SDEs) of the form $\mathrm{d}x = f(x,t)\,\mathrm{d}t + g(t)\,\mathrm{d}w$, with the reverse SDE given by $\mathrm{d}x = \left[f(x,t) - g(t)^2\,\nabla_x \log p_t(x)\right]\mathrm{d}t + g(t)\,\mathrm{d}\bar{w}$. This connection to score matching is particularly salient, as DPMs can be viewed as generative frameworks that advance and unify ideas from both VAEs and score-based models.
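To make the discrete formulation concrete, the following sketch draws $x_t$ from the closed-form marginal $q(x_t \mid x_0)$ and evaluates the standard noise-prediction objective; the `model` callable and the linear $\beta$ schedule are illustrative assumptions, not tied to any particular paper.

```python
import torch

def make_schedule(T=1000, beta_min=1e-4, beta_max=0.02):
    # Linear variance schedule beta_t and cumulative products alpha_bar_t.
    betas = torch.linspace(beta_min, beta_max, T)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    return betas, alpha_bars

def ddpm_loss(model, x0, alpha_bars):
    # Sample a timestep, corrupt x0 via q(x_t | x_0), and regress the injected noise.
    T = alpha_bars.shape[0]
    t = torch.randint(0, T, (x0.shape[0],), device=x0.device)
    a_bar = alpha_bars.to(x0.device)[t].view(-1, *([1] * (x0.dim() - 1)))
    eps = torch.randn_like(x0)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps
    return torch.mean((model(x_t, t) - eps) ** 2)
```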
Recent theoretical treatments, such as FreeFlow (Sun et al., 2023), recast DPM dynamics as Wasserstein gradient flows in the space of probability measures, equating DPM evolution to time-dependent optimal transport. Both stochastic (SDE-based) and deterministic (ODE-based) solvers—such as DPM-Solver—fall within this geometric framework, which incorporates both Lagrangian (particle-following) and Eulerian (density-evolving) perspectives, bringing new clarity to the mechanisms and potential defects (e.g., shock formation in trajectory-based sampling).
2. Accelerated Sampling and Solver Development
A critical bottleneck of classical DPMs is slow sampling, often requiring hundreds or thousands of neural network evaluations. FastDPM (Kong et al., 2021) and DPM-Solver (Lu et al., 2022) exemplify the core methodology for accelerated inference: they reformulate DPM sampling as the integration of a structured, semi-linear ODE,

$$\frac{\mathrm{d}x_t}{\mathrm{d}t} = f(t)\,x_t + \frac{g^2(t)}{2\sigma_t}\,\epsilon_\theta(x_t, t),$$

and solve it using high-order Taylor or exponential-integrator techniques. For example, DPM-Solver computes the linear term analytically and applies explicit high-order solvers to the nonlinear (neural network) term, yielding samples of comparable or better quality than the original DPM (e.g., FID 4.70 with 10 steps and FID 2.87 with 20 steps on CIFAR-10) and a substantial sampling speedup without retraining. The fast-sampling design space has further expanded with solver schedule optimization (Liu et al., 2023), where per-timestep choices of prediction type, expansion order, and corrector are selected adaptively via search frameworks, achieving FID 2.69 with 10 evaluations on CIFAR-10 (attributed directly to these solver decisions).
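A first-order member of this solver family reduces to an exponential-integrator update in half-log-SNR time. The sketch below is a minimal illustration under the assumption that the signal/noise schedules `alphas` and `sigmas` are precomputed for the chosen timestep grid; it is not the reference implementation.

```python
import torch

def dpm_solver_1_step(eps_model, x, t_prev, t_cur, alphas, sigmas):
    # One first-order exponential-integrator step for the semi-linear diffusion ODE.
    # alphas[t], sigmas[t] are the signal / noise scales of q(x_t | x_0); lambda = log(alpha / sigma).
    lam_prev = torch.log(alphas[t_prev] / sigmas[t_prev])
    lam_cur = torch.log(alphas[t_cur] / sigmas[t_cur])
    h = lam_cur - lam_prev                       # step size in half-log-SNR time
    eps = eps_model(x, t_prev)                   # a single network evaluation per step
    return (alphas[t_cur] / alphas[t_prev]) * x - sigmas[t_cur] * torch.expm1(h) * eps

def sample(eps_model, x_T, timesteps, alphas, sigmas):
    # Integrate from high noise (t = T) down toward t = 0 over a coarse grid of timesteps.
    x = x_T
    for t_prev, t_cur in zip(timesteps[:-1], timesteps[1:]):
        x = dpm_solver_1_step(eps_model, x, t_prev, t_cur, alphas, sigmas)
    return x
```

Higher-order variants of the same scheme reuse extra network evaluations (or previous steps) to approximate higher Taylor terms of the noise prediction in $\lambda$, which is where most of the few-step accuracy gains come from.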
Methods such as lookahead extrapolation (Zhang et al., 2023) have also been proposed to refine mean estimates in the reverse process by exploiting correlation in predictions across adjacent timesteps, further reducing FID—especially in the fewer-step regime. Quantum algorithms for DPM solution (Wang et al., 20 Feb 2025) have recently been explored, embedding nonlinear diffusion ODEs in larger linear systems via Carleman linearization, and solving these with quantum linear system solvers or Hamiltonian simulation—foreshadowing potential quantum acceleration for high-dimensional generative models.
3. Model Architecture Optimizations and Efficiency
The need to deploy DPMs on resource-limited platforms has driven research into slimmer architectures and parameter efficiency. The Spectral Diffusion (SD) model (Yang et al., 2022) integrates wavelet gating modules, replacing standard down/up-sampling units with discrete wavelet transforms, enabling frequency-dynamic feature extraction and efficient high-frequency recovery. Spectrum-aware distillation further weights the training loss by spectrum magnitude, emphasizing rare high-frequency details. This combination yields roughly 8–18× reductions in parameter count and compute (e.g., a 21M-parameter UNet versus several hundred million parameters for latent diffusion models) and 3–5× faster sampling, while preserving competitive sample fidelity.
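As a rough illustration of the wavelet-gating idea (an assumption-laden sketch, not the SD architecture itself), the snippet below performs a single-level Haar-style average/difference decomposition and re-weights the four subbands with a learned channel gate, so that downsampling preserves controllable amounts of high-frequency detail.

```python
import torch
import torch.nn as nn

def haar_split(x):
    # Haar-style average/difference decomposition of an (N, C, H, W) feature map
    # into four (N, C, H/2, W/2) subbands: LL, LH, HL, HH.
    rows_lo, rows_hi = (x[..., 0::2, :] + x[..., 1::2, :]) / 2, (x[..., 0::2, :] - x[..., 1::2, :]) / 2
    ll, lh = (rows_lo[..., 0::2] + rows_lo[..., 1::2]) / 2, (rows_lo[..., 0::2] - rows_lo[..., 1::2]) / 2
    hl, hh = (rows_hi[..., 0::2] + rows_hi[..., 1::2]) / 2, (rows_hi[..., 0::2] - rows_hi[..., 1::2]) / 2
    return ll, lh, hl, hh

class WaveletGate(nn.Module):
    # Hypothetical gating module: learns per-channel weights for each subband.
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(4 * channels, 4 * channels), nn.Sigmoid())

    def forward(self, x):
        subbands = haar_split(x)
        stats = torch.cat([s.mean(dim=(2, 3)) for s in subbands], dim=1)   # (N, 4C) subband statistics
        gates = self.gate(stats).chunk(4, dim=1)                            # one gate vector per subband
        gated = [s * g.unsqueeze(-1).unsqueeze(-1) for s, g in zip(subbands, gates)]
        return torch.cat(gated, dim=1)                                      # (N, 4C, H/2, W/2) features
```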
Slim models have also motivated dynamic model schedule optimization (Liu et al., 2023), where mixtures of large and small pre-trained models are assigned to specific reverse steps, improving the quality-efficiency trade-off (e.g., faster generation for Stable Diffusion with maintained output quality).
4. Covariance Design, Stochasticity, and Robustness
The variance (covariance) structure in the reverse process governs the generative expressiveness and robustness of DPMs: diagonal or full covariances allow more expressive modeling than isotropic designs. Optimal diagonal covariance estimators (and their corrections under imperfect mean prediction) have been derived (Bao et al., 2022), with a two-stage training procedure yielding significant improvements in sample quality and ELBO, particularly when few steps are used (critical in accelerated sampling).
FastDPM (Kong et al., 2021) demonstrates that the optimal degree of stochasticity in the reverse process (DDPM-rev vs. DDIM-rev, controlled by the scale of the injected noise) depends on both the data domain (images vs. audio) and the amount of conditioning information: images benefit from determinism (lowest FID with the fully deterministic reverse process), while audio requires higher stochasticity to model natural variability, and stronger conditioning reduces this requirement.
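A common way to expose this stochasticity knob is the DDIM-style parameterization with a coefficient $\eta \in [0, 1]$; the sketch below is a generic reverse step under that assumption (schedule handling and clamping omitted), not the exact FastDPM update.

```python
import torch

def reverse_step(eps_model, x_t, t, t_prev, alpha_bars, eta=0.0):
    # One reverse step; eta = 0 is deterministic (DDIM-like), larger eta injects more noise.
    a_t, a_prev = alpha_bars[t], alpha_bars[t_prev]
    eps = eps_model(x_t, t)
    x0_hat = (x_t - (1 - a_t).sqrt() * eps) / a_t.sqrt()              # predicted clean sample
    sigma = eta * ((1 - a_prev) / (1 - a_t)).sqrt() * (1 - a_t / a_prev).sqrt()
    mean = a_prev.sqrt() * x0_hat + (1 - a_prev - sigma ** 2).sqrt() * eps
    return mean + sigma * torch.randn_like(x_t)
```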
Contractive DPMs (CDPMs) (Tang et al., 23 Jan 2024) further address error robustness by designing the reverse SDE to be contractive, enforcing a monotonicity (dissipativity) condition on the drift. This contractivity provably reduces both score-matching and discretization error contributions, enabling retraining-free model upgrades and delivering empirical improvements on MNIST, CIFAR-10 (FID 2.47), and Swiss Roll, with bounded Wasserstein-2 errors.
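In generic form (a textbook-style statement under synchronous coupling, not the paper's exact constants), a reverse drift $\mu$ satisfying a one-sided monotonicity condition yields an exponential Wasserstein-2 contraction between any two reverse-process trajectories:

$$\langle \mu(x,t) - \mu(y,t),\; x - y\rangle \;\le\; -\lambda\,\|x - y\|^2 \quad \forall\, x, y
\qquad\Longrightarrow\qquad
W_2\!\left(p_t, q_t\right) \;\le\; e^{-\lambda t}\, W_2\!\left(p_0, q_0\right),$$

so score-matching and discretization errors introduced early in sampling are damped rather than amplified as the trajectory evolves.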
5. Representation Learning, Disentanglement, and Non-Generative Tasks
Beyond generation, DPMs have shown considerable promise in representation learning and unsupervised disentanglement. Models such as Graffe (Chen et al., 8 May 2025) adapt DPMs for self-supervised graph representation learning by maximizing the conditional mutual information between graphs and their encoded representations, with the negative log denoising score-matching loss as a tractable lower bound. Strong empirical results are demonstrated across node and graph classification benchmarks (state-of-the-art on 9 of 11 real-world datasets).
In image domains, DPMs act as denoising autoencoders; for example, pre-trained DPM autoencoding (PDAE) (Zhang et al., 2022) leverages existing pretrained DPM weights as decoders, filling the “information gap” between the predicted and true mean via a learned mean-shift network, enabling efficient and effective representation learning, smooth latent space interpolation, and attribute manipulation. Disentanglement is further advanced by DisDiff (Yang et al., 2023), which decomposes gradient fields into semantically aligned sub-fields—the unsupervised approach yields sharper disentanglement than leading VAE- and GAN-based methods on synthetic and real datasets.
For vision tasks, knowledge transfer methods (RepFusion) (Yang et al., 2023) extract features from DPMs at optimal denoising steps (found by reinforcement learning), and dynamically supervise discriminative student networks, improving performance on semantic segmentation and keypoint detection.
6. Extensions to New Data Modalities and Practical Applications
DPMs have been adapted to diverse modalities and practical challenges. Categorical diffusion models extend DPMs to synthetic location trajectory generation (Dirmeier et al., 19 Feb 2024) by encoding discrete sequences in a continuous latent space, diffusing in this space with transformer-based networks and mapping back to discrete outputs; such models capture high-entropy, realistic discrete sequence statistics for synthetic mobility data.
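One generic recipe for diffusing over discrete sequences is the embed-diffuse-round pattern sketched below; the encoder, rounding rule, and transformer denoiser in the cited work may differ, so treat this as an illustrative assumption.

```python
import torch
import torch.nn as nn

class LatentSequenceCodec(nn.Module):
    # Hypothetical codec: map discrete tokens to continuous latents for diffusion,
    # and map denoised latents back to the nearest token embedding ("rounding").
    def __init__(self, vocab_size, dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)

    def encode(self, tokens):
        return self.embed(tokens)                       # (batch, seq_len, dim) continuous latents

    def round(self, z):
        emb = self.embed.weight.unsqueeze(0).expand(z.shape[0], -1, -1)
        dists = torch.cdist(z, emb)                     # (batch, seq_len, vocab_size) distances
        return dists.argmin(dim=-1)                     # (batch, seq_len) discrete tokens
```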
MRI applications (Fan et al., 2023) span reconstruction (integrating data-consistency projections), conditional cross-modality translation, segmentation (including under weak supervision), anomaly detection using inpainting, and processing of high-dimensional (3D/4D) MRI requiring latent diffusion. DPMs in biomedical and other signal domains benefit from their explicit likelihood characterization and fidelity-diversity advantages, though training and inference costs remain substantial.
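The data-consistency projection commonly used in diffusion-based MRI reconstruction re-imposes the acquired k-space samples after each denoising step; a minimal single-coil Cartesian sketch (the sampling `mask` and measured k-space `y` are assumed inputs) is:

```python
import torch

def data_consistency(x, y, mask):
    # x: current image estimate, y: measured k-space, mask: 1 where samples were acquired.
    k = torch.fft.fft2(x)
    k = mask * y + (1 - mask) * k      # keep measured frequencies, trust the model elsewhere
    return torch.fft.ifft2(k).real
```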
Research into frequency-based noise control (Jiralerspong et al., 14 Feb 2025) demonstrates that the choice of noising operator in the forward process can modulate DPM inductive bias, selectively improving learning in frequency bands relevant to specific datasets (e.g., enhancing recovery for image corruption tasks when high-frequency content is masked). Dynamic or data-driven frequency schedules are posited as a frontier for principled DPM design.
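To illustrate how the noising operator can target specific frequency bands, the sketch below draws Gaussian noise and re-weights its spectrum; the filter `freq_weight` is an illustrative assumption, not the schedule proposed in the cited work.

```python
import torch

def spectrally_shaped_noise(shape, freq_weight):
    # White Gaussian noise with its 2D spectrum re-weighted, e.g. to suppress or
    # emphasize high frequencies during the forward (noising) process.
    eps = torch.randn(shape)
    eps_f = torch.fft.fft2(eps)
    return torch.fft.ifft2(freq_weight * eps_f).real
```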
7. Future Directions and Open Problems
Ongoing and future challenges in DPM research include:
- Further acceleration and few-step sampling without retraining, via adaptive per-timestep solver and model schedules, quantum ODE solvers (Wang et al., 20 Feb 2025), or lookahead mean refinement.
- Improved robustness and error contraction without losing sample diversity, e.g., through dynamic contractive models or adaptive stochasticity.
- Automated and theoretically principled selection of frequency- or data-driven inductive biases for targeted applications.
- Scalable, disentangled, or semantically controllable DPMs for tasks in both Euclidean and non-Euclidean (graph) domains.
- Extending DPMs for probabilistic modeling and conditional generation in diverse tasks (text-to-image, inverse problems, self-supervised learning).
- Direct integration into applied domains, such as privacy-preserving data synthesis, complex medical imaging pipelines, and multi-modal generation unifying text, image, and audio.
The field continues to unify, expand, and interpret diffusion probabilistic models, steadily establishing them as foundational tools in modern generative modeling and beyond.