Lecture Notes in Probabilistic Diffusion Models
(2312.10393v1)
Published 16 Dec 2023 in cs.LG and cs.AI
Abstract: Diffusion models are loosely modelled based on non-equilibrium thermodynamics, where diffusion refers to particles flowing from high-concentration regions towards low-concentration regions. In statistics, the meaning is quite similar, namely the process of transforming a complex distribution $p_{\text{complex}}$ on $\mathbb{R}^d$ to a simple distribution $p_{\text{prior}}$ on the same domain. This constitutes a Markov chain of diffusion steps of slowly adding random noise to data, followed by a reverse diffusion process in which the data is reconstructed from the noise. The diffusion model learns the data manifold to which the original and thus the reconstructed data samples belong, by training on a large number of data points. While the diffusion process pushes a data sample off the data manifold, the reverse process finds a trajectory back to the data manifold. Diffusion models have, unlike variational autoencoder and flow models, latent variables with the same dimensionality as the original data, and they are currently (at the time of writing, 2023) outperforming other approaches, including Generative Adversarial Networks (GANs), to modelling the distribution of, e.g., natural images.
Summary
The paper outlines a self-contained, rigorous mathematical framework for probabilistic diffusion models, highlighting the forward noising process.
It details the reverse diffusion process where neural networks approximate intractable distributions using an ELBO-based loss.
The lecture notes further discuss advanced sampling strategies like DDPM and DDIM, and explore both classifier and classifier-free guidance methods.
These lecture notes provide a self-contained mathematical description of probabilistic diffusion models, focusing on the fundamental concepts rather than specific implementation details (2312.10393).
1. Forward Diffusion Process
The core idea is to gradually destroy structure in a data sample $x_0$ (e.g., an image) by adding Gaussian noise over $T$ discrete time steps. This creates a sequence $x_1, x_2, \dots, x_T$.
This process is defined as a Markov chain where each step $x_t$ depends only on $x_{t-1}$:
$$q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\right)$$
Here, $\beta_t$ is the variance of the noise added at step $t$, determined by a predefined variance schedule (e.g., linear or cosine). Typically, $\beta_t$ increases with $t$, starting small (e.g., $10^{-4}$) and ending larger (e.g., $0.02$).
Letting $\alpha_t = 1 - \beta_t$ and $\bar\alpha_t = \prod_{i=1}^{t} \alpha_i$, a key property is that we can sample $x_t$ directly from $x_0$ without iterating through all intermediate steps:
$$q(x_t \mid x_0) = \mathcal{N}\left(x_t;\ \sqrt{\bar\alpha_t}\,x_0,\ (1-\bar\alpha_t) I\right)$$
This can be written using the reparameterization trick:
$$x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon, \quad \text{where } \epsilon \sim \mathcal{N}(0, I).$$
As $t \to T$ (often $T = 1000$), $\bar\alpha_t$ shrinks towards 0, so $x_T$ is approximately distributed as a standard Gaussian $\mathcal{N}(0, I)$, effectively transforming the complex data distribution $p_{\text{complex}}$ into a simple prior $p_{\text{prior}}$.
This forward process is fixed and does not require any training.
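The closed-form sampling of $x_t$ from $x_0$ can be illustrated with a minimal sketch. It assumes a linear schedule from $10^{-4}$ to $0.02$ with $T = 1000$ and $T \ge 2$; the function names (`linear_beta_schedule`, `cumulative_alpha_bars`, `q_sample`) are hypothetical, and plain Python lists stand in for the tensors a real implementation would use.

```python
import math
import random


def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Return [beta_1, ..., beta_T], linearly spaced (list index 0 holds beta_1)."""
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + i * step for i in range(T)]


def cumulative_alpha_bars(betas):
    """Return [alpha_bar_1, ..., alpha_bar_T], alpha_bar_t = prod_{i<=t} (1 - beta_i)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def q_sample(x0, t, alpha_bars, rng):
    """Draw x_t ~ q(x_t | x_0) in one shot; x0 is a list of floats, t is in 1..T."""
    a_bar = alpha_bars[t - 1]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [math.sqrt(a_bar) * x + math.sqrt(1.0 - a_bar) * e for x, e in zip(x0, eps)]
    return x_t, eps


T = 1000
betas = linear_beta_schedule(T)
alpha_bars = cumulative_alpha_bars(betas)
rng = random.Random(0)
x0 = [1.0, -2.0, 0.5]
x_mid, eps_mid = q_sample(x0, 500, alpha_bars, rng)  # partially noised sample
x_end, eps_end = q_sample(x0, T, alpha_bars, rng)    # almost pure noise
```

With this schedule $\bar\alpha_T$ is very close to zero, so `x_end` is nearly identical to the noise `eps_end` that generated it, which is the sense in which $x_T$ forgets $x_0$.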
2. Reverse Diffusion Process (Generative Modelling)
The goal of generative modeling is to reverse this process: starting from noise $x_T \sim \mathcal{N}(0, I)$, gradually denoise it step-by-step to obtain a sample $x_0$ that looks like it came from the original data distribution $p_{\text{complex}}$.
Ideally, we would use the true reverse probability $q(x_{t-1} \mid x_t)$, but calculating this is intractable because it requires knowledge of the entire data distribution $q(x_0)$.
Instead, we approximate the reverse step with a parameterized distribution, typically a Gaussian, learned by a neural network:
$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right)$$
The neural network (often a U-Net architecture) takes the noisy sample $x_t$ and the time step $t$ as input and outputs the parameters (mean $\mu_\theta$ and variance $\Sigma_\theta$) of the Gaussian distribution for the previous step $x_{t-1}$.
3. The Loss Function
The objective is to maximize the likelihood of the training data $x_0$ under the learned reverse process $p_\theta$. Directly maximizing $\log p_\theta(x_0)$ is intractable.
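Instead, we maximize the Evidence Lower Bound (ELBO), similar to Variational Autoencoders (VAEs). The ELBO for diffusion models can be written as:
$$\log p_\theta(x_0) \ge \mathbb{E}_{q(x_{1:T} \mid x_0)}\left[\log \frac{p_\theta(x_{0:T})}{q(x_{1:T} \mid x_0)}\right]$$
The negative ELBO, which serves as the loss $L$ to be minimized, can be further decomposed into a sum of terms, most of them KL divergences comparing the forward and reverse processes at each step: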
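$$L = L_0 + \sum_{t=2}^{T} L_{t-1} + L_T$$
$$L_{t-1} = \mathbb{E}_{q(x_t \mid x_0)}\left[D_{\mathrm{KL}}\left(q(x_{t-1} \mid x_t, x_0)\,\|\,p_\theta(x_{t-1} \mid x_t)\right)\right] \quad \text{for } 2 \le t \le T$$
The term $q(x_{t-1} \mid x_t, x_0)$ can be computed analytically and is also a Gaussian:
$$q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\left(x_{t-1};\ \tilde\mu_t(x_t, x_0),\ \tilde\beta_t I\right)$$
where $\tilde\mu_t$ and $\tilde\beta_t$ are functions of $\alpha_t$, $\beta_t$, $\bar\alpha_t$, $x_t$, and $x_0$.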
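The closed forms are not spelled out in this summary. For reference, the standard expressions from the DDPM literature [ho_2020], written in the notation used here and with the convention $\bar\alpha_0 = 1$, are:

$$\tilde\mu_t(x_t, x_0) = \frac{\sqrt{\bar\alpha_{t-1}}\,\beta_t}{1-\bar\alpha_t}\,x_0 + \frac{\sqrt{\alpha_t}\,(1-\bar\alpha_{t-1})}{1-\bar\alpha_t}\,x_t, \qquad \tilde\beta_t = \frac{1-\bar\alpha_{t-1}}{1-\bar\alpha_t}\,\beta_t$$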
To minimize the KL divergence $L_{t-1}$, the variance of $p_\theta(x_{t-1} \mid x_t)$ is often fixed to $\tilde\beta_t I$, and the neural network only needs to learn the mean $\mu_\theta(x_t, t)$ to match $\tilde\mu_t(x_t, x_0)$.
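To see why only the mean matters once the variance is fixed, recall that the KL divergence between two Gaussians with the same covariance $\sigma^2 I$ reduces to $\|\mu_1 - \mu_2\|^2 / (2\sigma^2)$. Under the assumption that $\Sigma_\theta = \tilde\beta_t I$, each term therefore becomes a weighted squared error between means:

$$L_{t-1} = \mathbb{E}_{q(x_t \mid x_0)}\left[\frac{1}{2\tilde\beta_t}\left\|\tilde\mu_t(x_t, x_0) - \mu_\theta(x_t, t)\right\|^2\right]$$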
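Since $\tilde\mu_t$ depends on the unknown original $x_0$, the network can be trained to predict $x_0$ from $x_t$ and $t$, denoted $\hat{x}_\theta(x_t, t)$.
Alternatively, using the relation $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon_t$, the network can be trained to predict the noise $\epsilon_t$ added at step $t$, denoted $\hat\epsilon_\theta(x_t, t)$. This formulation relates $\mu_\theta$ to $\hat\epsilon_\theta$.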
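The link between the mean and the noise prediction can be made explicit with a short worked step. Solving the forward relation for $x_0$ gives $x_0 = \left(x_t - \sqrt{1-\bar\alpha_t}\,\epsilon_t\right)/\sqrt{\bar\alpha_t}$. Substituting this into the standard closed form of $\tilde\mu_t$ from [ho_2020] and using $\beta_t + \alpha_t(1-\bar\alpha_{t-1}) = 1 - \bar\alpha_t$, the $x_t$ coefficients collapse to $1/\sqrt{\alpha_t}$. Replacing $\epsilon_t$ by the network output then gives:

$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\,\hat\epsilon_\theta(x_t, t)\right)$$

This is the same mean that appears in the DDPM sampling step, where $\beta_t$ is written as $1 - \alpha_t$.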
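Empirically, a simplified loss function that drops the per-step weighting works well [ho_2020]:
$$L_{\text{simple}} = \mathbb{E}_{t, x_0, \epsilon}\left[\left\|\hat\epsilon_\theta\left(\sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon,\ t\right) - \epsilon\right\|^2\right]$$
Training with this loss proceeds as follows: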
repeat
    sample x_0 from data distribution
    sample t from Uniform({1, ..., T})
    sample epsilon from N(0, I)
    calculate x_t = sqrt(alpha_bar_t)*x_0 + sqrt(1 - alpha_bar_t)*epsilon
    take gradient descent step on: || epsilon_hat_theta(x_t, t) - epsilon ||^2
until convergence
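The training loop above can be sketched end to end on a toy problem. Everything specific here is an assumption made for illustration: the data are scalars equal to $\pm 1$, the schedule has only $T = 10$ steps, "until convergence" is replaced by a fixed number of steps, and the network is replaced by a hypothetical per-timestep linear predictor `w[t] * x_t + b[t]` trained with hand-written gradients. A real model would be a U-Net trained with an autodiff framework.

```python
import math
import random

T = 10
betas = [0.1 + 0.4 * i / (T - 1) for i in range(T)]  # toy schedule (assumption)
alpha_bars, running = [], 1.0
for beta in betas:
    running *= 1.0 - beta
    alpha_bars.append(running)


def sample_x0(rng):
    """Toy data distribution (assumption): x_0 is -1 or +1 with equal probability."""
    return rng.choice([-1.0, 1.0])


def noisy_pair(rng):
    """Sample (t, x_t, eps) as in the training loop: t uniform, eps ~ N(0, 1)."""
    t = rng.randint(1, T)  # uniform over {1, ..., T}, both ends inclusive
    eps = rng.gauss(0.0, 1.0)
    a_bar = alpha_bars[t - 1]
    x_t = math.sqrt(a_bar) * sample_x0(rng) + math.sqrt(1.0 - a_bar) * eps
    return t, x_t, eps


def mean_loss(w, b, n, seed):
    """Monte Carlo estimate of the simplified loss for fixed parameters."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        t, x_t, eps = noisy_pair(rng)
        total += (w[t - 1] * x_t + b[t - 1] - eps) ** 2
    return total / n


# Hypothetical stand-in for the network: eps_hat(x_t, t) = w[t] * x_t + b[t].
w, b = [0.0] * T, [0.0] * T
loss_before = mean_loss(w, b, 4000, seed=1)

rng = random.Random(0)
learning_rate = 0.02
for _ in range(5000):
    t, x_t, eps = noisy_pair(rng)
    residual = w[t - 1] * x_t + b[t - 1] - eps
    # Gradient of residual**2 with respect to w[t] and b[t].
    w[t - 1] -= learning_rate * 2.0 * residual * x_t
    b[t - 1] -= learning_rate * 2.0 * residual

loss_after = mean_loss(w, b, 4000, seed=1)
```

A linear predictor cannot drive the loss to zero on this data, but the estimated loss should still drop well below its initial value of about 1, which is the loss of always predicting zero noise.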
4. Reverse Samplers
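DDPM (Denoising Diffusion Probabilistic Models):
Uses the learned noise prediction $\hat\epsilon_\theta(x_t, t)$ to sample $x_{t-1}$ from $x_t$ stochastically.
The sampling step is derived from $p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\left(x_{t-1};\ \mu_\theta(x_t, t),\ \tilde\beta_t I\right)$ where $\mu_\theta$ is expressed using $\hat\epsilon_\theta$.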
sample x_T from N(0, I)
for t = T down to 1:
    sample z from N(0, I) if t > 1 else z = 0
    calculate epsilon_hat = epsilon_hat_theta(x_t, t)
    x_{t-1} = (1/sqrt(alpha_t)) * (x_t - (1-alpha_t)/sqrt(1-alpha_bar_t) * epsilon_hat) + sqrt(beta_tilde_t) * z
return x_0
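The DDPM loop can be exercised without a trained network if the data distribution is simple enough that the ideal noise prediction is known in closed form. The sketch below assumes scalar data distributed as $\mathcal{N}(2, 0.5^2)$, for which $\mathbb{E}[\epsilon \mid x_t]$ is linear in $x_t$. The function `eps_hat` is a hypothetical stand-in for $\hat\epsilon_\theta$, and $\tilde\beta_t = \frac{1-\bar\alpha_{t-1}}{1-\bar\alpha_t}\beta_t$ is the standard expression from [ho_2020].

```python
import math
import random

T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]
alphas = [1.0 - beta for beta in betas]
alpha_bars, running = [], 1.0
for alpha in alphas:
    running *= alpha
    alpha_bars.append(running)

DATA_MEAN, DATA_STD = 2.0, 0.5  # toy data distribution N(2, 0.5^2) (assumption)


def eps_hat(x_t, t):
    """Closed-form E[eps | x_t] for Gaussian data; stands in for the trained network."""
    a_bar = alpha_bars[t - 1]
    marginal_var = a_bar * DATA_STD ** 2 + (1.0 - a_bar)
    return math.sqrt(1.0 - a_bar) * (x_t - math.sqrt(a_bar) * DATA_MEAN) / marginal_var


def ddpm_sample(rng):
    """Run the stochastic reverse chain from x_T ~ N(0, 1) down to x_0."""
    x = rng.gauss(0.0, 1.0)
    for t in range(T, 0, -1):
        alpha_t, a_bar = alphas[t - 1], alpha_bars[t - 1]
        a_bar_prev = alpha_bars[t - 2] if t > 1 else 1.0  # alpha_bar_0 = 1
        beta_tilde = (1.0 - a_bar_prev) / (1.0 - a_bar) * betas[t - 1]
        z = rng.gauss(0.0, 1.0) if t > 1 else 0.0  # no noise on the final step
        mean = (x - (1.0 - alpha_t) / math.sqrt(1.0 - a_bar) * eps_hat(x, t)) / math.sqrt(alpha_t)
        x = mean + math.sqrt(beta_tilde) * z
    return x


rng = random.Random(0)
samples = [ddpm_sample(rng) for _ in range(200)]
sample_mean = sum(samples) / len(samples)
sample_std = math.sqrt(sum((s - sample_mean) ** 2 for s in samples) / len(samples))
```

If the sampler is implemented correctly, `sample_mean` and `sample_std` should land close to the assumed data mean and standard deviation.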
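DDIM (Denoising Diffusion Implicit Models):
Uses the same noise prediction to compute x_{t-1} from x_t deterministically, without injecting fresh noise at each step.
sample x_T from N(0, I)
for t = T down to 1:
    calculate epsilon_hat = epsilon_hat_theta(x_t, t)
    predicted_x0 = (1/sqrt(alpha_bar_t)) * (x_t - sqrt(1-alpha_bar_t) * epsilon_hat)
    # Direction pointing to x_t (with alpha_bar_0 = 1, so this term vanishes at t = 1)
    term2 = sqrt(1 - alpha_bar_{t-1}) * epsilon_hat
    x_{t-1} = sqrt(alpha_bar_{t-1}) * predicted_x0 + term2
return x_0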
DDIM often allows for faster sampling by using fewer steps (evaluating the reverse process on a subset of $\{1, \dots, T\}$) without significant quality loss [ddim_arxiv].
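Sampling on a subset of time steps can be sketched by letting the deterministic DDIM update jump from $t$ directly to an earlier step instead of $t-1$. The example below assumes the same toy setting as a closed-form check: scalar data distributed as $\mathcal{N}(2, 0.5^2)$ and a hypothetical `eps_hat` that returns the exact $\mathbb{E}[\epsilon \mid x_t]$ in place of a trained network. It uses 50 of the 1000 steps.

```python
import math
import random

T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]
alpha_bars, running = [], 1.0
for beta in betas:
    running *= 1.0 - beta
    alpha_bars.append(running)

DATA_MEAN, DATA_STD = 2.0, 0.5  # toy data distribution N(2, 0.5^2) (assumption)


def alpha_bar(t):
    """alpha_bar_t for t in 0..T, using the convention alpha_bar_0 = 1."""
    return 1.0 if t == 0 else alpha_bars[t - 1]


def eps_hat(x_t, t):
    """Closed-form E[eps | x_t] for Gaussian data; stands in for the trained network."""
    a_bar = alpha_bar(t)
    marginal_var = a_bar * DATA_STD ** 2 + (1.0 - a_bar)
    return math.sqrt(1.0 - a_bar) * (x_t - math.sqrt(a_bar) * DATA_MEAN) / marginal_var


def ddim_sample(x_T, timesteps):
    """Deterministic DDIM pass over a decreasing list of timesteps that ends at 0."""
    x = x_T
    for t, t_prev in zip(timesteps[:-1], timesteps[1:]):
        e = eps_hat(x, t)
        predicted_x0 = (x - math.sqrt(1.0 - alpha_bar(t)) * e) / math.sqrt(alpha_bar(t))
        direction = math.sqrt(1.0 - alpha_bar(t_prev)) * e  # points back towards x_t
        x = math.sqrt(alpha_bar(t_prev)) * predicted_x0 + direction
    return x


timesteps = list(range(T, 0, -20)) + [0]  # 1000, 980, ..., 20, 0: 50 updates
rng = random.Random(0)
samples = [ddim_sample(rng.gauss(0.0, 1.0), timesteps) for _ in range(500)]
sample_mean = sum(samples) / len(samples)
sample_std = math.sqrt(sum((s - sample_mean) ** 2 for s in samples) / len(samples))
```

Because no noise is injected, the same starting point `x_T` always maps to the same `x_0`. The coarse step grid introduces some discretization error, so the sample spread is expected to be close to, but not exactly, the assumed data standard deviation.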
5. Text-Prompting
Classifier Guidance: Uses a separately trained classifier $p_\phi(y \mid x_t)$ (where $y$ is a class label, e.g., derived from text) to guide the generation process towards samples matching the condition $y$. The mean of the reverse step is modified by adding a term proportional to the gradient of the classifier's log-likelihood, $\nabla_{x_t} \log p_\phi(y \mid x_t)$. CLIP embeddings can be used to provide this guidance signal from text prompts [dhariwal_nichol_arxiv, Nichol2022glide].
Modified mean for sampling $x_{t-1}$: $\mu_\theta(x_t, t) + s \cdot \Sigma_\theta(x_t, t)\,\nabla_{x_t} \log p_\phi(y \mid x_t)$ (where $s$ is a guidance scale).
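The modified mean can be illustrated on scalars with a toy classifier whose gradient is known analytically. The logistic classifier, its parameters, and the placeholder values for the mean and variance below are all assumptions for illustration. A real system would obtain the gradient by backpropagating through a trained classifier on noisy inputs.

```python
import math


def grad_log_classifier(x_t, weight, bias):
    """d/dx log p(y=1 | x) for a toy classifier p(y=1 | x) = sigmoid(weight * x + bias)."""
    p = 1.0 / (1.0 + math.exp(-(weight * x_t + bias)))
    return weight * (1.0 - p)


def guided_mean(mu, sigma2, grad, scale):
    """Shift the reverse-step mean along the classifier gradient: mu + s * Sigma * grad."""
    return mu + scale * sigma2 * grad


mu, sigma2 = 0.3, 0.01  # placeholder outputs of the reverse model at some step
grad = grad_log_classifier(x_t=0.3, weight=2.0, bias=0.0)
shifted = guided_mean(mu, sigma2, grad, scale=5.0)
```

With a positive classifier weight the gradient is positive, so the mean moves towards larger values, where the toy classifier assigns higher probability to the target class. A scale of zero leaves the mean unchanged.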
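Classifier-Free Guidance: Avoids the need for an external classifier by training a single conditional diffusion model $\hat\epsilon_\theta(x_t, y, t)$ that takes the conditioning information $y$ (e.g., text embedding) as input. During training, the condition $y$ is randomly dropped (replaced with a null embedding $\varnothing$). At sampling time, the effective noise prediction is extrapolated from the unconditional prediction towards the conditional one. In one common parameterization (conventions for the scale differ between papers):
$$\tilde\epsilon_\theta(x_t, y, t) = \hat\epsilon_\theta(x_t, \varnothing, t) + s\left(\hat\epsilon_\theta(x_t, y, t) - \hat\epsilon_\theta(x_t, \varnothing, t)\right)$$
This $\tilde\epsilon_\theta$ is then used in the DDPM or DDIM sampling loop, where $s$ is the guidance scale hyperparameter [ho2021classifierfree].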
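A minimal sketch of the extrapolation step follows. It assumes the convention in which the guided prediction is the unconditional prediction plus `scale` times the difference between the conditional and unconditional predictions. Other papers parameterize the scale differently, so this is one possible reading rather than the definitive formula. The function name and the example vectors are hypothetical.

```python
def classifier_free_guidance(eps_cond, eps_uncond, scale):
    """Extrapolate from the unconditional towards the conditional noise prediction."""
    return [u + scale * (c - u) for c, u in zip(eps_cond, eps_uncond)]


eps_cond = [0.5, -1.0, 0.2]    # placeholder for eps_hat(x_t, y, t)
eps_uncond = [0.1, -0.4, 0.2]  # placeholder for eps_hat(x_t, null, t)
eps_tilde = classifier_free_guidance(eps_cond, eps_uncond, scale=3.0)
```

Under this convention a scale of 0 returns the unconditional prediction, a scale of 1 returns the conditional one, and larger values push further in the direction that the condition suggests.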
6. Appendix: Mathematical Concepts
Monte Carlo Estimation: Explains how to approximate expectations $\mathbb{E}_{X \sim p}[f(X)]$ by averaging $f(x)$ over samples $x$ drawn from $p$.
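As a small illustration, the sketch below estimates $\mathbb{E}[X^2]$ for $X \sim \mathcal{N}(0, 1)$, whose exact value is 1. The helper name `monte_carlo_expectation` is hypothetical.

```python
import random


def monte_carlo_expectation(f, sampler, n):
    """Approximate E[f(X)] by the average of f over n independent draws."""
    return sum(f(sampler()) for _ in range(n)) / n


rng = random.Random(0)
estimate = monte_carlo_expectation(lambda x: x * x, lambda: rng.gauss(0.0, 1.0), 20000)
```

The error of such an estimate shrinks at a rate proportional to $1/\sqrt{n}$, so with 20000 samples the result should be within a few hundredths of the exact value.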
Reparameterization Trick: Describes how to compute gradients of expectations $\mathbb{E}_{X \sim p_\theta}[f(X)]$ with respect to parameters $\theta$ of the distribution $p_\theta$. This is crucial for training models like VAEs and diffusion models, allowing gradients to flow through the sampling process. It involves expressing the random variable $X$ as a deterministic, differentiable function of $\theta$ and an auxiliary random variable with a fixed distribution (e.g., $X = \mu_\theta + \sigma_\theta \cdot Z$, where $Z \sim \mathcal{N}(0, I)$).
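The trick can be checked numerically on a case with a known answer. The sketch below takes $f(x) = x^2$ and a scalar $X = \mu + \sigma Z$, so that $\mathbb{E}[f(X)] = \mu^2 + \sigma^2$ and the exact gradients are $2\mu$ and $2\sigma$. The function name and the chosen values of $\mu$ and $\sigma$ are assumptions for illustration, and the derivatives are written by hand rather than obtained from an autodiff library.

```python
import random


def reparam_gradients(mu, sigma, n, rng):
    """Pathwise estimates of d/dmu and d/dsigma of E[X^2], where X = mu + sigma * Z."""
    grad_mu = grad_sigma = 0.0
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)
        x = mu + sigma * z        # differentiable in (mu, sigma) for fixed z
        df_dx = 2.0 * x           # derivative of f(x) = x^2
        grad_mu += df_dx          # chain rule with dx/dmu = 1
        grad_sigma += df_dx * z   # chain rule with dx/dsigma = z
    return grad_mu / n, grad_sigma / n


grad_mu, grad_sigma = reparam_gradients(mu=1.5, sigma=0.5, n=50000, rng=random.Random(0))
```

For these values the exact gradients are 3.0 and 1.0, and the two Monte Carlo estimates should be close to them.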
In summary, the lecture notes provide a rigorous mathematical foundation for diffusion models, explaining the forward noising process, the learned reverse denoising process, the derivation of the ELBO-based loss function (simplified to noise prediction error), different sampling strategies (DDPM, DDIM), and methods for conditioning the generation on text prompts (classifier and classifier-free guidance) (2312.10393).