A Mathematical Introduction to Diffusion Models

Published 2 Jul 2026 in cs.LG and math.PR | (2607.01693v1)

Abstract: These notes give a proof-oriented introduction to diffusion models from the viewpoint of sampling, tracing a single arc from classical sampling dynamics to modern diffusion samplers, their error analysis, and inference-time control. Throughout, the material is layered into core definitions and identities proved in full, representative estimates proved under simplifying assumptions, and research-level theorems stated with a proof roadmap. The intended audience is beginning graduate students with a background in probability but no prior exposure to stochastic differential equations, stochastic numerics, or diffusion models.


Authors (1)

Jianfeng Lu

Summary

The paper formulates diffusion models using discrepancies, Langevin dynamics, and Markov chains, establishing explicit error bounds and convergence properties.
It bridges continuous SDE formulations with discrete implementations by leveraging score-based estimation and the Tweedie identity for effective denoising.
The paper rigorously decomposes sampling error into initialization, discretization, and score-estimation terms, introduces a rejection-sampling correction for high accuracy, and highlights trade-offs between computational cost and sampling accuracy.

Technical Overview of "A Mathematical Introduction to Diffusion Models" (2607.01693)

Structural Foundations of Sampling: Discrepancy, MCMC, and Langevin Dynamics

The manuscript places diffusion models within the framework of probabilistic sampling and the evolution of distributions. The sampling objective is formulated using discrepancies such as total variation, Wasserstein-$2$, and KL divergence. A recurring theme is the reduction of sampling from an intractable target to the construction of Markov chains or diffusions that converge to the desired stationary measure, with attention to ergodicity, invariance, and mixing mechanisms. The Langevin diffusion, a continuous-time stochastic differential equation (SDE) whose drift is the gradient of the log-density, is emphasized for its role both in theoretical convergence (entropy dissipation via log-Sobolev inequalities) and in practical algorithms, namely the Unadjusted Langevin Algorithm (ULA) and its Metropolis-corrected variant (MALA).
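The ULA proposal and the Metropolis correction that turns it into MALA can be sketched in a few lines. This is a minimal one-dimensional illustration, not the paper's code: the standard Gaussian target, the step size `h`, and all function names are assumptions chosen for the example.

```python
import math
import random

def log_target(x):
    # Log-density of the (assumed) standard Gaussian target, up to a constant.
    return -0.5 * x * x

def grad_log_target(x):
    return -x

def log_proposal(x, y, h):
    # Log-density, up to a constant, of the ULA proposal y ~ N(x + h*grad, 2h).
    mean = x + h * grad_log_target(x)
    return -((y - mean) ** 2) / (4.0 * h)

def accept_prob(x, y, h):
    # Metropolis-Hastings acceptance probability for the move x -> y.
    log_ratio = (log_target(y) + log_proposal(y, x, h)
                 - log_target(x) - log_proposal(x, y, h))
    return math.exp(min(0.0, log_ratio))

def mala_step(x, h, rng):
    # One MALA transition: a ULA proposal followed by an accept-reject step.
    y = x + h * grad_log_target(x) + math.sqrt(2.0 * h) * rng.gauss(0.0, 1.0)
    return y if rng.random() < accept_prob(x, y, h) else x

def run_mala(n_steps, h=0.5, seed=0):
    rng = random.Random(seed)
    x, samples = 0.0, []
    for _ in range(n_steps):
        x = mala_step(x, h, rng)
        samples.append(x)
    return samples
```

Dropping the accept-reject step in `mala_step` (always returning `y`) recovers ULA. The acceptance rule is what makes the chain reversible with respect to the target, which can be checked pointwise through detailed balance.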

Numerical discretization error, initialization bias, and the separation of mixing and approximation error are treated with full technical rigor. ULA is shown to exhibit stationary bias with fixed step size, quantifiable in closed form for Gaussians; the transition to MALA, leveraging density ratio accept-reject steps, eliminates stationary bias and is accompanied by explicit, non-asymptotic mixing bounds under strong log-concavity and isoperimetry. The technical machinery deployed here—KL and Wasserstein discrepancies, log-Sobolev contraction, and Talagrand inequalities—serves throughout the later development of diffusion and score-based methodologies.
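The closed-form Gaussian bias can be reproduced with a short calculation. The sketch below assumes a one-dimensional target $N(0, \sigma^2)$ and step size $h < 2\sigma^2$, so that ULA reads $x_{k+1} = (1 - h/\sigma^2)\,x_k + \sqrt{2h}\,\xi_k$; the function names are illustrative and the paper's normalization may differ.

```python
def ula_variance_closed_form(sigma2, h):
    # Fixed point of v -> (1 - h/sigma2)^2 * v + 2h, valid for 0 < h < 2*sigma2.
    return sigma2 / (1.0 - h / (2.0 * sigma2))

def ula_variance_by_iteration(sigma2, h, n_steps=2000, v0=0.0):
    # Propagate the variance of the ULA iterates exactly (no sampling noise).
    a = 1.0 - h / sigma2
    v = v0
    for _ in range(n_steps):
        v = a * a * v + 2.0 * h
    return v

# With sigma2 = 1 and h = 0.1 the stationary variance is 1 / 0.95, about 1.0526,
# so ULA overestimates the target variance by roughly 5 percent.
```

The bias vanishes only as $h \to 0$, which is the trade-off the Metropolis correction removes.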

Score-Based Diffusion Models: Continuous, Discrete, and Reverse-Time Formulations

The central arc constructs modern diffusion models as a continuum and discretization of Langevin-type processes, transitioning from explicit density-driven sampling to data-driven score (denoising) estimation. Forward noising SDEs are parameterized both as variance-exploding and variance-preserving processes, with Gaussian channels used to model the progressive corruption of data. The Tweedie identity is leveraged as a formal bridge between score estimation and denoising posterior mean estimation: learning the conditional expectation of the clean data given noisy observations is equivalent to estimating the score function of the noised law.
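As a concrete check of the Tweedie identity, consider a toy case that is an assumption of this example rather than taken from the paper: data uniform on $\{-1, +1\}$, observed as $x = x_0 + \sigma\varepsilon$ with $\varepsilon \sim N(0,1)$. The identity $\mathbb{E}[x_0 \mid x] = x + \sigma^2 \nabla \log p_\sigma(x)$ can then be verified numerically.

```python
import math

def log_noised_density(x, sigma):
    # log p_sigma(x), up to a constant, for data uniform on {-1, +1} plus N(0, sigma^2) noise.
    a = -((x - 1.0) ** 2) / (2.0 * sigma ** 2)
    b = -((x + 1.0) ** 2) / (2.0 * sigma ** 2)
    m = max(a, b)
    return m + math.log(0.5 * math.exp(a - m) + 0.5 * math.exp(b - m))

def score(x, sigma, eps=1e-5):
    # Score of the noised law via a central finite difference.
    return (log_noised_density(x + eps, sigma)
            - log_noised_density(x - eps, sigma)) / (2.0 * eps)

def posterior_mean(x, sigma):
    # E[x0 | x] computed directly from Bayes' rule over the two atoms.
    w_plus = math.exp(-((x - 1.0) ** 2) / (2.0 * sigma ** 2))
    w_minus = math.exp(-((x + 1.0) ** 2) / (2.0 * sigma ** 2))
    return (w_plus - w_minus) / (w_plus + w_minus)

def tweedie_denoiser(x, sigma):
    # Denoiser obtained from the score through the Tweedie identity.
    return x + sigma ** 2 * score(x, sigma)
```

In this toy case both quantities equal $\tanh(x/\sigma^2)$, so a learned denoiser and a learned score carry the same information.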

Reverse-time sampling is formalized in both SDE (stochastic) and probability-flow ODE (deterministic) representations using the Fokker–Planck framework. Reverse SDEs combine a time-reversed drift with a score-driven denoising term, and probability-flow ODEs are derived as exact transport equations for the evolution of the marginals. Numerical discretization (DDPM) is systematically connected to these dynamics, with rigorous analysis of errors arising from initialization, score mismatch, and local discretization. Notably, score matching is established as a population-level regression on Gaussian perturbations, while the error decomposition for the full sampling algorithm is expressed as a sum of initialization, discretization, and time-weighted score errors.
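The probability-flow ODE can be illustrated on a case where the score is known exactly. The sketch below assumes an Ornstein–Uhlenbeck forward process $dX = -X\,dt + \sqrt{2}\,dW$ started from a one-dimensional Gaussian $N(m, s^2)$; this normalization, the Euler integrator, and the function names are choices made for the example and may not match the paper's conventions.

```python
import math

def marginal_mean_var(t, m, s2):
    # Mean and variance of X_t for the OU forward process started at N(m, s2).
    return m * math.exp(-t), 1.0 + (s2 - 1.0) * math.exp(-2.0 * t)

def score(x, t, m, s2):
    # Exact score of the Gaussian marginal p_t.
    mu, v = marginal_mean_var(t, m, s2)
    return -(x - mu) / v

def probability_flow_sample(x_T, T, m, s2, n_steps=20000):
    # Euler steps backward in time for dx/dt = -x - score(x, t),
    # i.e. drift f = -x minus half the squared diffusion (g^2 = 2) times the score.
    dt = T / n_steps
    x, t = x_T, T
    for _ in range(n_steps):
        x -= dt * (-x - score(x, t, m, s2))
        t -= dt
    return x
```

For Gaussian marginals the flow preserves the standardized coordinate $(x - \mu_t)/\sqrt{v_t}$, so a point at $z$ standard deviations from the mean at time $T$ should land near $m + s z$ at time $0$. In practice the exact score is replaced by a learned one, which is where the score-error term enters.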

Stochastic Localization, Polchinski Flow, and Geometric Covariance Budgets

The text introduces stochastic localization as a measure-valued martingale process elucidating the geometric and analytic behavior of posteriors under progressive Gaussian observation. Localization connects directly to posterior mean, covariance evolution, and martingale structure, relating the Hessian of the log-density (score Jacobian) to posterior covariance via explicit identities. The total covariance budget along the localization path is bounded dimensionally, providing a key analytic tool for quantifying score-smoothness regularization and discretization error in diffusion models.
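One standard form of these identities, written here in the Gaussian-channel parameterization $X_\sigma = X_0 + \sigma\varepsilon$ (an assumption of this sketch; the notes may instead use the localization parameterization, in which case the constants change), is:

```latex
\[
\nabla \log p_\sigma(x) = \frac{\mathbb{E}[X_0 \mid X_\sigma = x] - x}{\sigma^2},
\qquad
\nabla^2 \log p_\sigma(x) = \frac{\operatorname{Cov}[X_0 \mid X_\sigma = x]}{\sigma^4} - \frac{I}{\sigma^2}.
\]
```

As a sanity check, for $X_0 \sim N(0, s^2 I)$ the posterior covariance is $\frac{s^2\sigma^2}{s^2+\sigma^2} I$, and the second identity gives $\nabla^2 \log p_\sigma = -\frac{1}{s^2+\sigma^2} I$, which matches the Hessian of the log-density of $N(0, (s^2+\sigma^2) I)$.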

The Polchinski equation, a finite-dimensional version of the renormalization group flow, describes the time evolution of the effective potential $U_t = -\log p_t$ under forward diffusion via a nonlinear, viscous Hamilton–Jacobi PDE. This perspective highlights the progressive smoothing of the data distribution at positive noise and the sharpening of the learned score in reverse. These geometric and renormalization interpretations are structurally aligned with the analysis of sampling errors using pathwise and covariance-based arguments.
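The Hamilton–Jacobi structure can be seen in a short derivation. Assume, for illustration only, that the forward diffusion is pure heat flow, $\partial_t p_t = \tfrac12 \Delta p_t$; the notes may use a different normalization or include a drift term. Substituting $p_t = e^{-U_t}$ gives:

```latex
\[
\partial_t p_t = -e^{-U_t}\,\partial_t U_t,
\qquad
\Delta p_t = e^{-U_t}\bigl(|\nabla U_t|^2 - \Delta U_t\bigr),
\]
\[
\text{so that}\qquad
\partial_t U_t = \tfrac12 \Delta U_t - \tfrac12 |\nabla U_t|^2 .
\]
```

The Laplacian is the viscous term and the quadratic gradient term is the Hamiltonian, which is why the equation is nonlinear even though the density itself evolves linearly.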

Rigorous Error Decomposition: Discretization, Score Estimation, and High-Accuracy Correction

A full path-space telescoping KL criterion is developed, reducing global error in the final sample distribution to the sum of initialization error and integrated one-step kernel errors. For DDPM-type samplers, the local one-step kernel error is decomposed into mean (score) error and discretization (freezing the score) error, both of which are made explicit via the Gaussian structure of the kernels and Girsanov’s theorem. Non-uniform grid choices and time-adaptive step sizes are shown to efficiently balance time-resolved discretization error; further, the overall discretization cost is tightly controlled by the cumulative posterior covariance along the path rather than naive Lipschitz constants.
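The telescoping structure can be checked on a toy two-step Gaussian chain. In the sketch below, the true chain `p` and the approximate chain `q` each have a Gaussian initial law and a kernel $X_1 = aX_0 + b + \text{noise}$ with a shared noise variance; the dictionary layout and parameter names are assumptions made for this example. The path-space KL is the initialization KL plus the expected one-step kernel KL, and it upper-bounds the KL between the final marginals.

```python
import math

def kl_gauss(m1, v1, m2, v2):
    # KL(N(m1, v1) || N(m2, v2)) in one dimension.
    return 0.5 * (math.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

def path_and_marginal_kl(p, q):
    # Initialization error: KL between the two initial laws.
    init_kl = kl_gauss(p["m0"], p["v0"], q["m0"], q["v0"])
    # One-step kernel error: both kernels are Gaussian with variance s2, so the
    # KL is the squared mean gap over 2*s2, averaged over X0 drawn from p.
    da, db = p["a"] - q["a"], p["b"] - q["b"]
    second_moment = (da * da * (p["v0"] + p["m0"] ** 2)
                     + 2.0 * da * db * p["m0"] + db * db)
    step_kl = second_moment / (2.0 * p["s2"])

    def marginal(c):
        # Mean and variance of X1 under chain c.
        return c["a"] * c["m0"] + c["b"], c["a"] ** 2 * c["v0"] + c["s2"]

    (mp, vp), (mq, vq) = marginal(p), marginal(q)
    return init_kl + step_kl, kl_gauss(mp, vp, mq, vq)
```

With more steps the same chain rule adds one expected kernel KL per step, which is the sum that the discretization and score-error terms are then plugged into.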

First-order rejection sampling (FORS) is introduced as a log-density-free high-accuracy correction mechanism, enabling local proposals to be corrected through randomized, score-only accept-reject steps. This allows polylogarithmic dependence on target accuracy for the total number of reverse steps, in contrast to polynomial costs under Euler–Maruyama. The limiting dependence on ambient dimension $d$ is made explicit through covariance budget arguments, with references to recent work achieving nearly linear scaling in $d$ for high-dimensional problems.

Discrete Diffusion and Finite State Spaces

The framework is extended to discrete state spaces through the construction of discrete diffusion models using time-inhomogeneous Markov chains and their continuous-time (CTMC) analogues. Reverse kernels are characterized via exact Bayes formulas, and ratio scores (density ratios over edges) supplant gradients in the estimation of optimal denoisers and reverse probabilities. The unique structure of masked diffusion is highlighted, wherein token masking—rather than continuous perturbation—serves as the corruption mechanism, and denoising posteriors are learned via cross-entropy losses. Theoretical analysis parallels the continuous case, with initialization, per-step reverse kernel error, and block factorization bias (for parallel decoding) all rigorously represented in the overall KL objective.
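The Bayes formula for a reverse kernel is easy to state on a finite state space. The sketch below is illustrative: it assumes a single forward step with row-stochastic matrix `Q`, a current marginal `p`, and strictly positive next-step marginals, and the function names are hypothetical.

```python
def push_forward(p, Q):
    # Marginal after one forward step: p_next[y] = sum_x p[x] * Q[x][y].
    n = len(p)
    return [sum(p[x] * Q[x][y] for x in range(n)) for y in range(n)]

def reverse_kernel(p, Q):
    # Bayes: R[y][x] = P(X_{t-1} = x | X_t = y) = p[x] * Q[x][y] / p_next[y].
    # Assumes every entry of p_next is strictly positive.
    n = len(p)
    p_next = push_forward(p, Q)
    return [[p[x] * Q[x][y] / p_next[y] for x in range(n)] for y in range(n)]

# Toy masked-diffusion step: states 0 and 1 are tokens, state 2 is the mask.
Q_mask = [[0.7, 0.0, 0.3],
          [0.0, 0.7, 0.3],
          [0.0, 0.0, 1.0]]
p_data = [0.6, 0.4, 0.0]
R = reverse_kernel(p_data, Q_mask)
```

In this toy step, the reverse kernel from the mask state reproduces the data distribution over tokens, while an unmasked token is kept as is. The ratio `p[x] / p_next[y]` is the discrete analogue of the score.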

For continuous-time discrete diffusion, reverse-time rates are derived via Bayes along edges, and score-entropy learning objectives are formulated directly from forward-path likelihood ratios. High-accuracy sampling is achieved via uniformization, with event counts and runtime scaling made explicit in terms of the state-space structure.
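The idea behind uniformization can be sketched for a time-homogeneous chain; the reverse-time rates in the notes are time-dependent, so this is a simplification. With rate matrix $Q$ and a dominating rate $\lambda \ge \max_i |Q_{ii}|$, the transition matrix is a Poisson mixture of powers of $P = I + Q/\lambda$. The function names and tolerance below are assumptions of the example.

```python
import math

def mat_mul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def uniformized_transition(Q, t, tol=1e-12, max_terms=10000):
    # exp(t*Q) = sum_k Poisson(k; lam*t) * P^k with P = I + Q/lam.
    n = len(Q)
    lam = max(-Q[i][i] for i in range(n))  # dominates every exit rate
    P = [[(1.0 if i == j else 0.0) + Q[i][j] / lam for j in range(n)]
         for i in range(n)]
    power = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    weight = math.exp(-lam * t)            # Poisson probability of zero events
    total = [[weight * power[i][j] for j in range(n)] for i in range(n)]
    mass = weight
    for k in range(1, max_terms):
        if 1.0 - mass <= tol:              # remaining Poisson tail is negligible
            break
        power = mat_mul(power, P)
        weight *= lam * t / k
        mass += weight
        for i in range(n):
            for j in range(n):
                total[i][j] += weight * power[i][j]
    return total
```

The sampling version draws the number of candidate events from a Poisson law with mean $\lambda t$ and applies $P$ at each one, so the expected event count, and hence the runtime, scales with $\lambda t$.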

Inference-Time Control: Reward Tilting, Guidance, and RL

The final sections formalize inference-time modification of pretrained diffusion samplers through reward tilting and KL-regularized control. Reward-tilted targets are characterized as exponentiated reweightings of the base law, and their score functions differ from the base score by explicit additive corrections, computable as gradients of the log of noisy likelihoods or value functions (posterior-expected reward). The Doob $h$-transform construction establishes the optimal guided process at both the path-space and Markov-kernel levels, with value functions propagated backward as martingales. Sequential Monte Carlo schemes, Feynman–Kac reweighting, and inference-time reinforcement learning methods are rigorously compared under a unifying KL-regularized control objective, making precise the mathematical duality between reward improvement and closeness to the pretrained measure. All constructions are extended to the discrete case without reliance on calculus.
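The trade-off between reward and closeness to the base law can be made concrete on a finite state space. The sketch below assumes a base distribution `p`, a reward vector `r`, and a temperature `beta` (all hypothetical names). The tilted law $q^\star \propto p\,e^{r/\beta}$ maximizes $\mathbb{E}_q[r] - \beta\,\mathrm{KL}(q \,\|\, p)$, and the optimal value is $\beta \log Z$.

```python
import math

def tilt(p, r, beta):
    # Reward-tilted law q*(x) proportional to p(x) * exp(r(x) / beta).
    w = [pi * math.exp(ri / beta) for pi, ri in zip(p, r)]
    z = sum(w)
    return [wi / z for wi in w], z

def objective(q, p, r, beta):
    # KL-regularized objective E_q[r] - beta * KL(q || p).
    reward = sum(qi * ri for qi, ri in zip(q, r))
    kl = sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0.0)
    return reward - beta * kl

p_base = [0.5, 0.3, 0.2]
reward = [0.0, 1.0, 2.0]
q_star, z = tilt(p_base, reward, beta=0.5)
```

A small `beta` pushes `q_star` toward the highest-reward state, while a large `beta` keeps it near `p_base`. No gradients are involved, in line with the calculus-free discrete setting.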

Conclusion

This manuscript presents a proof-oriented, self-contained treatment of diffusion models that unifies classical sampling theory, stochastic processes, and modern score-based generative modeling within a common analytic framework. The exposition develops explicit error bounds, geometric perspectives, and the path-space tools that support both the practical deployment of diffusion models and the design of control and steering methods at inference time. Theoretical implications are emphasized through tight error decompositions and geometric budget identities, and practical considerations are made concrete via detailed analysis of initialization, discretization, and high-accuracy correction mechanisms. The technical depth of the notes makes them a valuable resource for researchers seeking to understand or extend the mathematical infrastructure underlying modern diffusion and score-based generative models.