Lecture Notes in Probabilistic Diffusion Models (2312.10393v1)

Published 16 Dec 2023 in cs.LG and cs.AI

Abstract: Diffusion models are loosely modelled based on non-equilibrium thermodynamics, where diffusion refers to particles flowing from high-concentration regions towards low-concentration regions. In statistics, the meaning is quite similar, namely the process of transforming a complex distribution $p_{\text{complex}}$ on $\mathbb{R}^d$ to a simple distribution $p_{\text{prior}}$ on the same domain. This constitutes a Markov chain of diffusion steps of slowly adding random noise to data, followed by a reverse diffusion process in which the data is reconstructed from the noise. The diffusion model learns the data manifold to which the original and thus the reconstructed data samples belong, by training on a large number of data points. While the diffusion process pushes a data sample off the data manifold, the reverse process finds a trajectory back to the data manifold. Diffusion models have, unlike variational autoencoder and flow models, latent variables with the same dimensionality as the original data, and they are currently (at the time of writing, 2023) outperforming other approaches, including Generative Adversarial Networks (GANs), to modelling the distribution of, e.g., natural images.

Summary

The paper outlines a self-contained, rigorous mathematical framework for probabilistic diffusion models, highlighting the forward noising process.
It details the reverse diffusion process where neural networks approximate intractable distributions using an ELBO-based loss.
The lecture notes further discuss advanced sampling strategies like DDPM and DDIM, and explore both classifier and classifier-free guidance methods.

These lecture notes provide a self-contained mathematical description of probabilistic diffusion models, focusing on the fundamental concepts rather than specific implementation details (2312.10393).

1. Forward Diffusion Process

The core idea is to gradually destroy structure in a data sample $x_0$ (e.g., an image) by adding Gaussian noise over $T$ discrete time steps. This creates a sequence $x_1, x_2, \dots, x_T$.
This process is defined as a Markov chain where each step $x_t$ depends only on $x_{t-1}$:

$q(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{1-\beta_t} x_{t-1}, \beta_t \mathbf{I})$
Here, $\beta_t$ is the variance of the noise added at step $t$, determined by a predefined variance schedule (e.g., linear or cosine). Typically, $\beta_t$ increases with $t$, starting small (e.g., $10^{-4}$) and ending larger (e.g., $0.02$).
Letting $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{i=1}^t \alpha_i$, a key property is that we can sample $x_t$ directly from $x_0$ without iterating through all intermediate steps:

$q(x_t | x_0) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t} x_0, (1-\bar{\alpha}_t) \mathbf{I})$
This can be written using the reparameterization trick: $x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1-\bar{\alpha}_t} \epsilon$, where $\epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$.
As $t$ grows, $\bar{\alpha}_t$ shrinks towards $0$, so for large $T$ (often $T=1000$) the distribution of $x_T$ is close to a standard Gaussian $\mathcal{N}(\mathbf{0}, \mathbf{I})$, effectively transforming the complex data distribution $p_{\text{complex}}$ into a simple prior $p_{\text{prior}}$.
This forward process is fixed and does not require any training.
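
The closed-form sampling of $x_t$ from $x_0$ can be sketched in a few lines of NumPy. This is a minimal illustration rather than the notes' own code: the linear schedule endpoints follow the typical values quoted above, and the helper names `linear_beta_schedule` and `q_sample` are assumptions made for this example.

```python
import numpy as np

def linear_beta_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Return beta_1..beta_T on a linear grid (array index 0 holds beta_1)."""
    return np.linspace(beta_start, beta_end, T)

def q_sample(x0, t, alpha_bar, rng):
    """Draw x_t ~ q(x_t | x_0) in one shot for a 1-based step t."""
    eps = rng.standard_normal(x0.shape)
    a = alpha_bar[t - 1]
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps, eps

betas = linear_beta_schedule()
alpha_bar = np.cumprod(1.0 - betas)   # alpha_bar[t-1] = prod_{i<=t} (1 - beta_i)

rng = np.random.default_rng(0)
x0 = np.full(10_000, 3.0)             # toy "data": every sample equals 3
x_mid, _ = q_sample(x0, 500, alpha_bar, rng)
x_T, _ = q_sample(x0, 1000, alpha_bar, rng)
```

With this schedule `alpha_bar[-1]` is tiny, so `x_T` is statistically very close to standard Gaussian noise even though every `x0` entry was 3.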

2. Reverse Diffusion Process (Generative Modelling)

The goal of generative modelling is to reverse this process: starting from noise $x_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$, gradually denoise it step-by-step to obtain a sample $x_0$ that looks like it came from the original data distribution $p_{\text{complex}}$.
Ideally, we would use the true reverse probability $q(x_{t-1} | x_t)$, but calculating this is intractable because it requires knowledge of the entire data distribution $q(x_0)$.
Instead, we approximate the reverse step with a parameterized distribution, typically a Gaussian, learned by a neural network:

$p_{\theta}(x_{t-1} | x_t) = \mathcal{N}(x_{t-1}; \mu_{\theta}(x_t, t), \Sigma_{\theta}(x_t, t))$
The neural network (often a U-Net architecture) takes the noisy sample $x_t$ and the time step $t$ as input and outputs the parameters (mean $\mu_{\theta}$ and variance $\Sigma_{\theta}$) of the Gaussian distribution for the previous step $x_{t-1}$.

3. The Loss Function

The objective is to maximize the likelihood of the training data $x_0$ under the learned reverse process $p_{\theta}$. Directly maximizing $\log p_{\theta}(x_0)$ is intractable.
Instead, we maximize the Evidence Lower Bound (ELBO), similar to Variational Autoencoders (VAEs). The ELBO for diffusion models can be written as:

$\log p_{\theta}(x_0) \geq E_{q(x_{1:T}|x_0)} \left[ \log \frac{p_{\theta}(x_{0:T})}{q(x_{1:T} | x_0)} \right]$
The negative ELBO, which serves as the loss $L$ to be minimized, can be decomposed into a reconstruction term $L_0$, a sum of KL divergence terms comparing the forward and reverse processes at each step, and a prior-matching term $L_T$:

$L = L_0 + \sum_{t=2}^T L_{t-1} + L_T$

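$L_{t-1} = E_{q(x_{t}|x_0)} \left[ D_{KL}(q(x_{t-1}|x_t,x_0) || p_{\theta}(x_{t-1}|x_t)) \right]$ for $2 \le t \le T$

$L_0 = -E_{q(x_1|x_0)} \left[ \log p_{\theta}(x_0|x_1) \right]$ and $L_T = D_{KL}(q(x_T|x_0) || p(x_T))$, where $p(x_T) = \mathcal{N}(\mathbf{0}, \mathbf{I})$ is the prior.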
The term $q(x_{t-1}|x_t, x_0)$ can be computed analytically and is also a Gaussian:

$q(x_{t-1}|x_t, x_0) = \mathcal{N}(x_{t-1}; \tilde{\mu}_t(x_t, x_0), \tilde{\beta}_t \mathbf{I})$

where $\tilde{\mu}_t(x_t, x_0) = \frac{\sqrt{\bar{\alpha}_{t-1}} \beta_t}{1-\bar{\alpha}_t} x_0 + \frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t} x_t$ and $\tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t} \beta_t$.
To simplify the KL divergence in $L_{t-1}$, the variance of $p_{\theta}(x_{t-1}|x_t)$ is often fixed to $\tilde{\beta}_t \mathbf{I}$, so that the neural network only needs to learn the mean $\mu_{\theta}(x_t, t)$ to match $\tilde{\mu}_t(x_t, x_0)$.
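
Why fixing the variance turns each term into a mean-matching problem can be seen from the standard closed form for the KL divergence between two Gaussians with identical covariance $\sigma^2 \mathbf{I}$. The short derivation below is a sketch added for illustration; it assumes $t \ge 2$ so that $\tilde{\beta}_t > 0$.

$D_{KL}(\mathcal{N}(\mu_1, \sigma^2 \mathbf{I}) || \mathcal{N}(\mu_2, \sigma^2 \mathbf{I})) = \frac{1}{2\sigma^2} || \mu_1 - \mu_2 ||_2^2$

Applying this with $\mu_1 = \tilde{\mu}_t(x_t, x_0)$, $\mu_2 = \mu_{\theta}(x_t, t)$ and $\sigma^2 = \tilde{\beta}_t$ gives

$L_{t-1} = E_{q(x_t|x_0)} \left[ \frac{1}{2\tilde{\beta}_t} || \tilde{\mu}_t(x_t, x_0) - \mu_{\theta}(x_t, t) ||_2^2 \right]$

so each KL term becomes a weighted squared error between the true posterior mean and the predicted mean.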
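Since $\tilde{\mu}_t$ depends on the original $x_0$, which is unknown at sampling time, the network can be trained to predict $x_0$ from $x_t$ and $t$, denoted $\hat{x}_{\theta}(x_t, t)$.
Alternatively, using the relation $x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1-\bar{\alpha}_t} \epsilon_t$, the network can be trained to predict the noise $\epsilon_t$ that was mixed into $x_0$ to produce $x_t$, denoted $\hat{\epsilon}_{\theta}(x_t, t)$. Substituting $x_0 = (x_t - \sqrt{1-\bar{\alpha}_t} \epsilon_t) / \sqrt{\bar{\alpha}_t}$ into $\tilde{\mu}_t$ gives $\mu_{\theta}(x_t, t) = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}} \hat{\epsilon}_{\theta}(x_t, t) \right)$.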
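
The equivalence between the $x_0$-based and noise-based forms of the posterior mean can be checked numerically. The sketch below uses the standard DDPM expression for $\tilde{\mu}_t$ and an assumed linear schedule; the function names are hypothetical and chosen only for this example.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)                 # assumed linear schedule
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def posterior_mean_from_x0(x_t, x0, t):
    """tilde_mu_t(x_t, x_0) for a 1-based step t >= 2."""
    ab_t, ab_prev = alpha_bar[t - 1], alpha_bar[t - 2]
    coef_x0 = np.sqrt(ab_prev) * betas[t - 1] / (1.0 - ab_t)
    coef_xt = np.sqrt(alphas[t - 1]) * (1.0 - ab_prev) / (1.0 - ab_t)
    return coef_x0 * x0 + coef_xt * x_t

def posterior_mean_from_eps(x_t, eps, t):
    """The same mean, written in terms of the noise eps instead of x_0."""
    ab_t = alpha_bar[t - 1]
    return (x_t - betas[t - 1] / np.sqrt(1.0 - ab_t) * eps) / np.sqrt(alphas[t - 1])

rng = np.random.default_rng(0)
x0 = rng.standard_normal(5)
eps = rng.standard_normal(5)
t = 400
x_t = np.sqrt(alpha_bar[t - 1]) * x0 + np.sqrt(1.0 - alpha_bar[t - 1]) * eps
gap = np.max(np.abs(posterior_mean_from_x0(x_t, x0, t)
                    - posterior_mean_from_eps(x_t, eps, t)))
```

The two expressions agree up to floating-point error, which is why a network that predicts the noise is enough to parameterize the reverse-step mean.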
Empirically, a simplified loss function works well [ho_2020]:

$L_{\text{simple}} = E_{t \sim U(\{1, \dots, T\}), x_0 \sim p_{\text{data}}, \epsilon_t \sim \mathcal{N}(0,I)} \left[ || \hat{\epsilon}_{\theta}(\sqrt{\bar{\alpha}_t} x_0 + \sqrt{1-\bar{\alpha}_t} \epsilon_t, t) - \epsilon_t ||_2^2 \right]$

This is a mean squared error between the true noise $\epsilon_t$ and the noise predicted by the network $\hat{\epsilon}_{\theta}$.
Training Algorithm (Simplified DDPM Loss):

repeat
    sample x_0 from data distribution
    sample t from Uniform({1, ..., T})
    sample epsilon from N(0, I)
    calculate x_t = sqrt(alpha_bar_t)*x_0 + sqrt(1 - alpha_bar_t)*epsilon
    take gradient descent step on: || epsilon_hat_theta(x_t, t) - epsilon ||^2
until convergence
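
One iteration of the loop above, up to the gradient step, can be sketched as a Monte Carlo estimate of $L_{\text{simple}}$ on a batch. NumPy has no automatic differentiation, so the parameter update itself is left to whichever framework hosts the network. `eps_model` is a placeholder for the trained network and `zero_model` is a hypothetical stand-in used only to make the example runnable.

```python
import numpy as np

def simple_loss(eps_model, x0_batch, alpha_bar, rng):
    """One Monte Carlo estimate of L_simple on a batch of shape (n, d)."""
    n = x0_batch.shape[0]
    T = alpha_bar.shape[0]
    t = rng.integers(1, T + 1, size=n)               # t ~ Uniform({1, ..., T})
    eps = rng.standard_normal(x0_batch.shape)        # eps ~ N(0, I)
    a = alpha_bar[t - 1][:, None]                    # broadcast over features
    x_t = np.sqrt(a) * x0_batch + np.sqrt(1.0 - a) * eps
    residual = eps_model(x_t, t) - eps
    return np.mean(np.sum(residual ** 2, axis=1))    # squared L2 norm, batch mean

def zero_model(x_t, t):
    """Hypothetical placeholder for the network: always predicts zero noise."""
    return np.zeros_like(x_t)

alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))  # assumed schedule
rng = np.random.default_rng(0)
x0_batch = rng.standard_normal((4096, 2))
loss = simple_loss(zero_model, x0_batch, alpha_bar, rng)
```

For the zero predictor the loss is just the average of $||\epsilon||_2^2$, which is close to the data dimension (here 2); a trained network should drive the value below this baseline.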

4. Reverse Samplers

DDPM (Denoising Diffusion Probabilistic Models):

Uses the learned noise prediction $\hat{\epsilon}_{\theta}(x_t, t)$ to sample $x_{t-1}$ from $x_t$ stochastically.
The sampling step is derived from $p_{\theta}(x_{t-1}|x_t) = \mathcal{N}(x_{t-1}; \mu_{\theta}(x_t, t), \tilde{\beta}_t \mathbf{I})$ where $\mu_{\theta}$ is expressed using $\hat{\epsilon}_{\theta}$.
Sampling Algorithm (DDPM):

sample x_T from N(0, I)
for t = T down to 1:
    sample z from N(0, I) if t > 1 else z = 0
    calculate epsilon_hat = epsilon_hat_theta(x_t, t)
    # beta_tilde_t = (1 - alpha_bar_{t-1}) / (1 - alpha_bar_t) * beta_t, with alpha_bar_0 = 1
    x_{t-1} = (1/sqrt(alpha_t)) * (x_t - (1-alpha_t)/sqrt(1-alpha_bar_t) * epsilon_hat) + sqrt(beta_tilde_t) * z
return x_0
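
The DDPM loop can be exercised end to end on a toy problem where the ideal noise predictor is known in closed form. The sketch below assumes one-dimensional data $x_0 \sim \mathcal{N}(m, s^2)$, for which $E[\epsilon | x_t]$ is linear in $x_t$; this oracle stands in for the trained network and is an assumption of the example, as are the function names and the linear schedule.

```python
import numpy as np

def make_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linear schedule; arrays are indexed by t = 0..T with alpha_bar[0] = 1."""
    betas = np.concatenate([[0.0], np.linspace(beta_start, beta_end, T)])
    alpha_bar = np.cumprod(1.0 - betas)
    return betas, alpha_bar

def gaussian_eps_oracle(m, s, alpha_bar):
    """Exact E[eps | x_t] when x_0 ~ N(m, s^2); replaces the trained network."""
    def eps_hat(x_t, t):
        var_t = alpha_bar[t] * s**2 + 1.0 - alpha_bar[t]
        return np.sqrt(1.0 - alpha_bar[t]) * (x_t - np.sqrt(alpha_bar[t]) * m) / var_t
    return eps_hat

def ddpm_sample(eps_hat, betas, alpha_bar, n, rng):
    """Ancestral sampling with the reverse variance fixed to beta_tilde_t."""
    T = len(betas) - 1
    x = rng.standard_normal(n)                                   # x_T ~ N(0, I)
    for t in range(T, 0, -1):
        beta_tilde = (1.0 - alpha_bar[t - 1]) / (1.0 - alpha_bar[t]) * betas[t]
        z = rng.standard_normal(n) if t > 1 else np.zeros(n)     # no noise at t = 1
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps_hat(x, t)) / np.sqrt(1.0 - betas[t])
        x = mean + np.sqrt(beta_tilde) * z
    return x

betas, alpha_bar = make_schedule()
rng = np.random.default_rng(0)
samples = ddpm_sample(gaussian_eps_oracle(2.0, 0.5, alpha_bar), betas, alpha_bar, 5000, rng)
```

Starting from pure noise, the generated samples should have mean near 2 and standard deviation near 0.5, matching the toy data distribution up to sampling and discretization error.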

DDIM (Denoising Diffusion Implicit Models):

Introduces a non-Markovian forward process that shares the same marginals $q(x_t|x_0)$ as DDPM, allowing the same trained model $\hat{\epsilon}_{\theta}$ to be used.
Introduces a parameter $\sigma_t$ controlling the stochasticity of the reverse step.
Setting $\sigma_t = 0$ gives the deterministic DDIM sampler, in which a fixed $x_T$ always maps to the same $x_0$.
Sampling Algorithm (DDIM, deterministic case $\sigma_t=0$):

sample x_T from N(0, I)
for t = T down to 1:
    calculate epsilon_hat = epsilon_hat_theta(x_t, t)
    predicted_x0 = (1/sqrt(alpha_bar_t)) * (x_t - sqrt(1-alpha_bar_t) * epsilon_hat)
    # Direction pointing to x_t
    term2 = sqrt(1 - alpha_bar_{t-1}) * epsilon_hat
    # Convention: alpha_bar_0 = 1, so the final step returns predicted_x0
    x_{t-1} = sqrt(alpha_bar_{t-1}) * predicted_x0 + term2
return x_0

DDIM often allows for faster sampling by using fewer steps (evaluating the reverse process on a subset of $\{1, \dots, T\}$) without significant quality loss [ddim_arxiv].
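
A deterministic DDIM sampler that visits only a subset of the time steps can be sketched as follows. As a stand-in for the trained network, the example uses the exact noise predictor for toy data $x_0 \sim \mathcal{N}(m, s^2)$; the evenly spaced step subset, the linear schedule and all function names are assumptions made for illustration.

```python
import numpy as np

def ddim_sample(eps_hat, alpha_bar, x_T, num_steps=50):
    """Deterministic DDIM (sigma_t = 0) on an evenly spaced subset of {1, ..., T}."""
    T = len(alpha_bar) - 1
    steps = np.unique(np.linspace(1, T, num_steps).round().astype(int))[::-1]
    prev_steps = np.append(steps[1:], 0)              # last update lands on t = 0
    x = x_T.copy()
    for t, t_prev in zip(steps, prev_steps):
        e = eps_hat(x, t)
        x0_pred = (x - np.sqrt(1.0 - alpha_bar[t]) * e) / np.sqrt(alpha_bar[t])
        x = np.sqrt(alpha_bar[t_prev]) * x0_pred + np.sqrt(1.0 - alpha_bar[t_prev]) * e
    return x

# Schedule indexed by t = 0..T with alpha_bar[0] = 1 (assumed linear betas).
alpha_bar = np.cumprod(1.0 - np.concatenate([[0.0], np.linspace(1e-4, 0.02, 1000)]))

def eps_hat(x_t, t, m=2.0, s=0.5):
    """Exact E[eps | x_t] for toy data x_0 ~ N(m, s^2); replaces the trained network."""
    var_t = alpha_bar[t] * s**2 + 1.0 - alpha_bar[t]
    return np.sqrt(1.0 - alpha_bar[t]) * (x_t - np.sqrt(alpha_bar[t]) * m) / var_t

x_T = np.random.default_rng(0).standard_normal(5000)
x0_a = ddim_sample(eps_hat, alpha_bar, x_T)
x0_b = ddim_sample(eps_hat, alpha_bar, x_T)           # same x_T, so same output
```

Because no fresh noise is drawn inside the loop, both calls return identical arrays, and 50 network evaluations replace the 1000 of the full chain. Coarser step grids introduce some discretization error, so the sample spread is only approximately that of the data.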

5. Text-Prompting

Classifier Guidance: Uses a separately trained classifier $p_{\phi}(y|x_t)$ (where $y$ is a class label, e.g., derived from text) to guide the generation process towards samples matching the condition $y$. The mean of the reverse step is modified by adding a term proportional to the gradient of the classifier's log-likelihood: $\nabla_{x_t} \log p_{\phi}(y|x_t)$. CLIP embeddings can be used to provide this guidance signal from text prompts [dhariwal_nichol_arxiv, Nichol2022glide].
- Modified mean for sampling $x_{t-1}$: $\mu_{\theta}(x_t, t) + s \cdot \Sigma_{\theta}(x_t, t) \nabla_{x_t} \log p_{\phi}(y|x_t)$ (where $s$ is a guidance scale).
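
The shifted mean can be illustrated with a classifier simple enough to differentiate by hand. The sketch below assumes a hypothetical logistic classifier $p_{\phi}(y=1|x) = \sigma(w \cdot x + b)$ and an isotropic covariance $\Sigma = \sigma^2 \mathbf{I}$; a real guidance classifier would be a neural network trained on noisy inputs that also receives $t$, with the gradient obtained by automatic differentiation.

```python
import numpy as np

def log_p_y_given_x(x, w, b):
    """log p_phi(y=1 | x) for a hypothetical logistic classifier."""
    return -np.logaddexp(0.0, -(x @ w + b))

def grad_log_p_y_given_x(x, w, b):
    """Analytic gradient of log sigmoid(w.x + b) with respect to x."""
    return (1.0 - 1.0 / (1.0 + np.exp(-(x @ w + b)))) * w

def guided_mean(mu, sigma2, grad, s):
    """Shift the reverse-step mean by s * Sigma * grad, with Sigma = sigma2 * I."""
    return mu + s * sigma2 * grad

w, b = np.array([1.0, -2.0]), 0.5     # illustrative classifier parameters
x_t = np.array([0.3, 0.1])
g = grad_log_p_y_given_x(x_t, w, b)
mu_guided = guided_mean(mu=x_t * 0.99, sigma2=0.01, grad=g, s=3.0)
```

The guided mean moves in the direction that raises the classifier's log-probability of the target class, and a larger $s$ moves it further.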
Classifier-Free Guidance: Avoids the need for an external classifier by training a single conditional diffusion model $\hat{\epsilon}_{\theta}(x_t, y, t)$ that takes the conditioning information $y$ (e.g., text embedding) as input. During training, the condition $y$ is randomly dropped (replaced with a null embedding $\emptyset$). At sampling time, the effective noise prediction is extrapolated:

$\tilde{\epsilon}_{\theta}(x_t, y, t) = \hat{\epsilon}_{\theta}(x_t, \emptyset, t) + s \cdot (\hat{\epsilon}_{\theta}(x_t, y, t) - \hat{\epsilon}_{\theta}(x_t, \emptyset, t))$

This $\tilde{\epsilon}_{\theta}$ is then used in the DDPM or DDIM sampling loop. Here $s$ is the guidance scale hyperparameter: $s = 1$ recovers the purely conditional prediction, and $s > 1$ extrapolates beyond it [ho2021classifierfree].
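
The extrapolation is a one-line combination of two network outputs. In the sketch below the two predictions are passed in as plain arrays; in practice they would come from two forward passes of the same model, one with $y$ and one with the null embedding. The function name `cfg_eps` and the numbers are illustrative assumptions.

```python
import numpy as np

def cfg_eps(eps_uncond, eps_cond, s):
    """Classifier-free guidance: move from the unconditional prediction
    towards (and, for s > 1, past) the conditional one."""
    return eps_uncond + s * (eps_cond - eps_uncond)

eps_uncond = np.array([0.0, 1.0])     # stand-in for eps_hat(x_t, null, t)
eps_cond = np.array([1.0, 3.0])       # stand-in for eps_hat(x_t, y, t)

eps_s0 = cfg_eps(eps_uncond, eps_cond, 0.0)   # ignores the condition
eps_s1 = cfg_eps(eps_uncond, eps_cond, 1.0)   # plain conditional prediction
eps_s2 = cfg_eps(eps_uncond, eps_cond, 2.0)   # extrapolated prediction
```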

6. Appendix: Mathematical Concepts

Monte Carlo Estimation: Explains how to approximate expectations $E_{X \sim p}[f(X)]$ by averaging $f(x)$ over samples $x$ drawn from $p$.
Reparameterization Trick: Describes how to compute gradients of expectations $E_{X \sim p_{\theta}}[f(X)]$ with respect to parameters $\theta$ of the distribution $p_{\theta}$. This is crucial for training models like VAEs and diffusion models, allowing gradients to flow through the sampling process. It involves expressing the random variable $X$ as a deterministic, differentiable function of $\theta$ and an auxiliary random variable with a fixed distribution (e.g., $X = \mu_{\theta} + \sigma_{\theta} \cdot Z$, where $Z \sim \mathcal{N}(0, I)$).
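
Both ideas can be tried on a case with a known answer. The sketch below takes $f(x) = x^2$ and $X \sim \mathcal{N}(\mu, \sigma^2)$, chosen purely for illustration because $E[X^2] = \mu^2 + \sigma^2$ has exact gradients $2\mu$ and $2\sigma$ to compare against.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.5, 0.7
z = rng.standard_normal(200_000)       # auxiliary noise Z ~ N(0, 1)
x = mu + sigma * z                     # reparameterized X ~ N(mu, sigma^2)

# Monte Carlo estimate of E[f(X)] with f(x) = x^2; exact value is mu^2 + sigma^2.
mc_estimate = np.mean(x ** 2)

# Pathwise gradients: differentiate f(mu + sigma * z) inside the average.
grad_mu = np.mean(2.0 * x)             # exact: 2 * mu
grad_sigma = np.mean(2.0 * x * z)      # exact: 2 * sigma
```

Because the randomness lives entirely in `z`, the averages are ordinary differentiable functions of `mu` and `sigma`, which is what lets gradients pass through the sampling step.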

In summary, the lecture notes provide a rigorous mathematical foundation for diffusion models, explaining the forward noising process, the learned reverse denoising process, the derivation of the ELBO-based loss function (simplified to noise prediction error), different sampling strategies (DDPM, DDIM), and methods for conditioning the generation on text prompts (classifier and classifier-free guidance) (2312.10393).

