Lecture Notes in Probabilistic Diffusion Models (2312.10393v1)

Published 16 Dec 2023 in cs.LG and cs.AI

Abstract: Diffusion models are loosely modelled based on non-equilibrium thermodynamics, where diffusion refers to particles flowing from high-concentration regions towards low-concentration regions. In statistics, the meaning is quite similar, namely the process of transforming a complex distribution $p_{\text{complex}}$ on $\mathbb{R}^d$ to a simple distribution $p_{\text{prior}}$ on the same domain. This constitutes a Markov chain of diffusion steps of slowly adding random noise to data, followed by a reverse diffusion process in which the data is reconstructed from the noise. The diffusion model learns the data manifold to which the original and thus the reconstructed data samples belong, by training on a large number of data points. While the diffusion process pushes a data sample off the data manifold, the reverse process finds a trajectory back to the data manifold. Diffusion models have -- unlike variational autoencoder and flow models -- latent variables with the same dimensionality as the original data, and they are currently (at the time of writing, 2023) outperforming other approaches -- including Generative Adversarial Networks (GANs) -- to modelling the distribution of, e.g., natural images.

Summary

  • The paper outlines a self-contained, rigorous mathematical framework for probabilistic diffusion models, highlighting the forward noising process.
  • It details the reverse diffusion process where neural networks approximate intractable distributions using an ELBO-based loss.
  • The lecture notes further discuss advanced sampling strategies like DDPM and DDIM, and explore both classifier and classifier-free guidance methods.

These lecture notes provide a self-contained mathematical description of probabilistic diffusion models, focusing on the fundamental concepts rather than specific implementation details (2312.10393).

1. Forward Diffusion Process

  • The core idea is to gradually destroy structure in a data sample $x_0$ (e.g., an image) by adding Gaussian noise over $T$ discrete time steps. This creates a sequence $x_1, x_2, \dots, x_T$.
  • This process is defined as a Markov chain where each step $x_t$ depends only on $x_{t-1}$:

    $$q(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{1-\beta_t} x_{t-1}, \beta_t \mathbf{I})$$

  • Here, $\beta_t$ is the variance of the noise added at step $t$, determined by a predefined variance schedule (e.g., linear or cosine). Typically, $\beta_t$ increases with $t$, starting small ($10^{-4}$) and ending larger (e.g., $0.02$).
  • Letting $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{i=1}^t \alpha_i$, a key property is that we can sample $x_t$ directly from $x_0$ without iterating through all intermediate steps:

    $$q(x_t | x_0) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t} x_0, (1-\bar{\alpha}_t) \mathbf{I})$$

  • This can be written using the reparameterization trick: $x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1-\bar{\alpha}_t} \epsilon$, where $\epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$.
  • As $t \to T$ (often $T=1000$), $x_T$ approaches a standard Gaussian distribution $\mathcal{N}(\mathbf{0}, \mathbf{I})$, effectively transforming the complex data distribution $p_{\text{complex}}$ into a simple prior $p_{\text{prior}}$.
  • This forward process is fixed and does not require any training.
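
The closed-form marginal $q(x_t | x_0)$ can be checked numerically. The following is a minimal sketch in plain Python, assuming scalar data and a linear schedule from $10^{-4}$ to $0.02$ with $T=1000$. The function names (`linear_beta_schedule`, `cumulative_alpha_bar`, `q_sample`, `iterate_moments`) are illustrative and do not come from the lecture notes.

```python
import math
import random


def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Return [beta_1, ..., beta_T], linearly spaced (stored 0-indexed)."""
    if T == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + i * step for i in range(T)]


def cumulative_alpha_bar(betas):
    """Return [alpha_bar_1, ..., alpha_bar_T] with alpha_t = 1 - beta_t."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def q_sample(x0, t, alpha_bars, rng):
    """Draw x_t ~ q(x_t | x_0) in one shot; t is 1-indexed, x0 is a scalar."""
    a_bar = alpha_bars[t - 1]
    eps = rng.gauss(0.0, 1.0)
    return math.sqrt(a_bar) * x0 + math.sqrt(1.0 - a_bar) * eps, eps


def iterate_moments(betas):
    """Track the mean coefficient and variance of x_t given x_0, one step at a time."""
    coef, var, out = 1.0, 0.0, []
    for beta in betas:
        coef *= math.sqrt(1.0 - beta)    # the mean is scaled by sqrt(alpha_t)
        var = (1.0 - beta) * var + beta  # old variance shrinks, then beta_t is added
        out.append((coef, var))
    return out


betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alpha_bar(betas)
moments = iterate_moments(betas)
x_t, eps = q_sample(x0=1.0, t=500, alpha_bars=alpha_bars, rng=random.Random(0))
```

Iterating the single-step Gaussian gives a mean coefficient whose square equals $\bar{\alpha}_t$ and a variance equal to $1-\bar{\alpha}_t$, which is why `q_sample` can skip the intermediate steps.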

2. Reverse Diffusion Process (Generative Modelling)

  • The goal of generative modeling is to reverse this process: starting from noise $x_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$, gradually denoise it step-by-step to obtain a sample $x_0$ that looks like it came from the original data distribution $p_{\text{complex}}$.
  • Ideally, we would use the true reverse probability $q(x_{t-1} | x_t)$, but calculating this is intractable because it requires knowledge of the entire data distribution $q(x_0)$.
  • Instead, we approximate the reverse step with a parameterized distribution, typically a Gaussian, learned by a neural network:

    $$p_{\theta}(x_{t-1} | x_t) = \mathcal{N}(x_{t-1}; \mu_{\theta}(x_t, t), \Sigma_{\theta}(x_t, t))$$

  • The neural network (often a U-Net architecture) takes the noisy sample $x_t$ and the time step $t$ as input and outputs the parameters (mean $\mu_{\theta}$ and variance $\Sigma_{\theta}$) of the Gaussian distribution for the previous step $x_{t-1}$.
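
The interface of one reverse step can be sketched as follows for scalar data. This is only an illustration: `mean_fn` and `var_fn` stand in for the network outputs $\mu_{\theta}$ and $\Sigma_{\theta}$, and the toy functions passed in below are hypothetical placeholders, not trained models.

```python
import math
import random


def reverse_step(x_t, t, mean_fn, var_fn, rng):
    """Sample x_{t-1} ~ N(mean_fn(x_t, t), var_fn(x_t, t)) for a scalar x_t."""
    mean = mean_fn(x_t, t)
    var = var_fn(x_t, t)
    return mean + math.sqrt(var) * rng.gauss(0.0, 1.0)


def toy_mean(x_t, t):
    """Hypothetical stand-in for mu_theta: shrink the sample slightly."""
    return 0.9 * x_t


def toy_var(x_t, t):
    """Hypothetical stand-in for Sigma_theta: small fixed variance."""
    return 0.01


rng = random.Random(0)
x = rng.gauss(0.0, 1.0)      # x_T ~ N(0, 1)
for t in range(10, 0, -1):   # t = T, ..., 1 with T = 10
    x = reverse_step(x, t, toy_mean, toy_var, rng)
```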

3. The Loss Function

  • The objective is to maximize the likelihood of the training data $x_0$ under the learned reverse process $p_{\theta}$. Directly maximizing $\log p_{\theta}(x_0)$ is intractable.
  • Instead, we maximize the Evidence Lower Bound (ELBO), similar to Variational Autoencoders (VAEs). The ELBO for diffusion models can be written as:

    $$\log p_{\theta}(x_0) \geq E_{q(x_{1:T}|x_0)} \left[ \log \frac{p_{\theta}(x_{0:T})}{q(x_{1:T} | x_0)} \right]$$

  • The negative of this ELBO, which serves as the loss $L$ to be minimized, can be further decomposed into a sum of terms, most of them KL divergences comparing the forward and reverse processes at each step:

    $$L = L_0 + \sum_{t=2}^T L_{t-1} + L_T$$

    $$L_{t-1} = E_{q(x_{t}|x_0)} \left[ D_{KL}(q(x_{t-1}|x_t,x_0) || p_{\theta}(x_{t-1}|x_t)) \right] \quad \text{for } 2 \le t \le T$$

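  • The term $q(x_{t-1}|x_t, x_0)$ can be computed analytically and is also a Gaussian:

    $$q(x_{t-1}|x_t, x_0) = \mathcal{N}(x_{t-1}; \tilde{\mu}_t(x_t, x_0), \tilde{\beta}_t \mathbf{I})$$

    where $\tilde{\mu}_t$ and $\tilde{\beta}_t$ are functions of $\alpha_t$, $\beta_t$, $\bar{\alpha}_t$, $x_t$, and $x_0$. In the standard DDPM derivation [ho_2020] they are $\tilde{\mu}_t(x_t, x_0) = \frac{\sqrt{\bar{\alpha}_{t-1}} \beta_t}{1-\bar{\alpha}_t} x_0 + \frac{\sqrt{\alpha_t} (1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t} x_t$ and $\tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t} \beta_t$.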

  • To minimize the KL divergence $L_{t-1}$, the variance of $p_{\theta}(x_{t-1}|x_t)$ is often fixed to $\tilde{\beta}_t \mathbf{I}$, and the neural network only needs to learn the mean $\mu_{\theta}(x_t, t)$ to match $\tilde{\mu}_t(x_t, x_0)$.
  • Since $\tilde{\mu}_t$ depends on the unknown original $x_0$, the network can be trained to predict $x_0$ from $x_t$ and $t$, denoted $\hat{x}_{\theta}(x_t, t)$.
  • Alternatively, using the relation $x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1-\bar{\alpha}_t} \epsilon_t$, the network can be trained to predict the noise $\epsilon_t$ that was mixed into $x_0$ to produce $x_t$, denoted $\hat{\epsilon}_{\theta}(x_t, t)$. Solving this relation for $x_0$ and substituting into $\tilde{\mu}_t$ expresses the mean through the noise prediction: $\mu_{\theta}(x_t, t) = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}} \hat{\epsilon}_{\theta}(x_t, t) \right)$.
  • Empirically, a simplified loss function works well [ho_2020]:

    $$L_{\text{simple}} = E_{t \sim U(1, T), x_0 \sim p_{\text{data}}, \epsilon_t \sim \mathcal{N}(0,I)} \left[ || \hat{\epsilon}_{\theta}(\sqrt{\bar{\alpha}_t} x_0 + \sqrt{1-\bar{\alpha}_t} \epsilon_t, t) - \epsilon_t ||_2^2 \right]$$

    This is a mean squared error between the true noise $\epsilon_t$ and the noise predicted by the network $\hat{\epsilon}_{\theta}$.

  • Training Algorithm (Simplified DDPM Loss):

repeat
    sample x_0 from data distribution
    sample t from Uniform({1, ..., T})
    sample epsilon from N(0, I)
    calculate x_t = sqrt(alpha_bar_t)*x_0 + sqrt(1 - alpha_bar_t)*epsilon
    take gradient descent step on: || epsilon_hat_theta(x_t, t) - epsilon ||^2
until convergence
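
The training loop can be made concrete with a deliberately tiny example. This is a sketch under strong assumptions, not the setup of the lecture notes: the data are scalars drawn from $\{-1, +1\}$, the schedule is a coarse hypothetical one with $T=10$, and the "network" is a separate linear predictor `w[t] * x_t + b[t]` per time step, trained by plain stochastic gradient descent.

```python
import math
import random

T = 10
# Coarse, hypothetical schedule so that 10 steps already reach heavy noise.
betas = [0.05 * (i + 1) for i in range(T)]
alpha_bars, running = [], 1.0
for beta in betas:
    running *= 1.0 - beta
    alpha_bars.append(running)

# Toy stand-in for the network: one linear noise predictor per time step.
w = [0.0] * T
b = [0.0] * T


def epsilon_hat(x_t, t):
    return w[t - 1] * x_t + b[t - 1]


def noisy_example(rng):
    """Build one training tuple (x_t, t, epsilon) as in the algorithm above."""
    x0 = rng.choice([-1.0, 1.0])   # sample x_0 from the toy data distribution
    t = rng.randint(1, T)          # t ~ Uniform({1, ..., T})
    eps = rng.gauss(0.0, 1.0)      # epsilon ~ N(0, 1)
    a_bar = alpha_bars[t - 1]
    x_t = math.sqrt(a_bar) * x0 + math.sqrt(1.0 - a_bar) * eps
    return x_t, t, eps


def mean_loss(examples):
    """Average squared noise-prediction error over a fixed set of examples."""
    return sum((epsilon_hat(x_t, t) - eps) ** 2 for x_t, t, eps in examples) / len(examples)


rng = random.Random(0)
held_out = [noisy_example(rng) for _ in range(2000)]
loss_before = mean_loss(held_out)

lr = 0.02
for _ in range(20000):
    x_t, t, eps = noisy_example(rng)
    residual = epsilon_hat(x_t, t) - eps
    # Gradient of residual**2 with respect to w[t] and b[t].
    w[t - 1] -= lr * 2.0 * residual * x_t
    b[t - 1] -= lr * 2.0 * residual

loss_after = mean_loss(held_out)
```

For this toy data the best linear predictor has slope $\sqrt{1-\bar{\alpha}_t}$, so the learned `w` values should grow with $t$. A real model would replace the per-step linear map with a single network that takes $t$ as an input.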

4. Reverse Samplers

  • DDPM (Denoising Diffusion Probabilistic Models):
    • Uses the learned noise prediction $\hat{\epsilon}_{\theta}(x_t, t)$ to sample $x_{t-1}$ from $x_t$ stochastically.
    • The sampling step is derived from $p_{\theta}(x_{t-1}|x_t) = \mathcal{N}(x_{t-1}; \mu_{\theta}(x_t, t), \tilde{\beta}_t \mathbf{I})$ where $\mu_{\theta}$ is expressed using $\hat{\epsilon}_{\theta}$.
    • Sampling Algorithm (DDPM):
      sample x_T from N(0, I)
      for t = T down to 1:
          sample z from N(0, I) if t > 1 else z = 0
          calculate epsilon_hat = epsilon_hat_theta(x_t, t)
          x_{t-1} = (1/sqrt(alpha_t)) * (x_t - (1-alpha_t)/sqrt(1-alpha_bar_t) * epsilon_hat) + sqrt(beta_tilde_t) * z
      return x_0
  • DDIM (Denoising Diffusion Implicit Models):
    • Introduces a non-Markovian forward process that shares the same marginals $q(x_t|x_0)$ as DDPM, allowing the same trained model $\hat{\epsilon}_{\theta}$ to be used.
    • Introduces a parameter $\sigma_t$ controlling the stochasticity of the reverse step. Setting $\sigma_t = 0$ gives the deterministic DDIM sampler, in which a fixed $x_T$ always maps to the same $x_0$.
    • Sampling Algorithm (DDIM, deterministic case $\sigma_t=0$):
      sample x_T from N(0, I)
      for t = T down to 1:  # convention: alpha_bar_0 = 1, used when t = 1
          calculate epsilon_hat = epsilon_hat_theta(x_t, t)
          predicted_x0 = (1/sqrt(alpha_bar_t)) * (x_t - sqrt(1-alpha_bar_t) * epsilon_hat)
          # Direction pointing to x_t
          term2 = sqrt(1 - alpha_bar_{t-1}) * epsilon_hat
          x_{t-1} = sqrt(alpha_bar_{t-1}) * predicted_x0 + term2
      return x_0
    • DDIM often allows for faster sampling by using fewer steps (evaluating the reverse process on a subset of $\{1, \dots, T\}$) without significant quality loss [ddim_arxiv].
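
Both samplers can be tried out without a trained network by choosing data for which the optimal noise prediction is known in closed form. The sketch below assumes scalar data $x_0 \in \{-1, +1\}$ with equal probability, for which $E[x_0 | x_t] = \tanh\left(\sqrt{\bar{\alpha}_t} x_t / (1-\bar{\alpha}_t)\right)$. This `epsilon_hat` is a stand-in for a learned $\hat{\epsilon}_{\theta}$, and the schedule, `stride`, and function names are illustrative assumptions.

```python
import math
import random

T = 1000
# betas[t] and alpha_bars[t] are indexed by t = 1..T; index 0 holds the t = 0 convention.
betas = [0.0] + [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]
alpha_bars = [1.0]  # alpha_bar_0 = 1
for t in range(1, T + 1):
    alpha_bars.append(alpha_bars[-1] * (1.0 - betas[t]))


def epsilon_hat(x_t, t):
    """Exact E[epsilon | x_t] for toy data x_0 in {-1, +1}; replaces a trained network."""
    a_bar = alpha_bars[t]
    x0_mean = math.tanh(math.sqrt(a_bar) * x_t / (1.0 - a_bar))  # E[x_0 | x_t]
    return (x_t - math.sqrt(a_bar) * x0_mean) / math.sqrt(1.0 - a_bar)


def ddpm_sample(rng):
    """Stochastic ancestral sampling over all T steps."""
    x = rng.gauss(0.0, 1.0)  # x_T ~ N(0, 1)
    for t in range(T, 0, -1):
        z = rng.gauss(0.0, 1.0) if t > 1 else 0.0
        alpha_t = 1.0 - betas[t]
        beta_tilde = (1.0 - alpha_bars[t - 1]) / (1.0 - alpha_bars[t]) * betas[t]
        mean = (x - betas[t] / math.sqrt(1.0 - alpha_bars[t]) * epsilon_hat(x, t)) / math.sqrt(alpha_t)
        x = mean + math.sqrt(beta_tilde) * z
    return x


def ddim_sample(x_T, stride=20):
    """Deterministic DDIM (sigma_t = 0) on the sub-sequence T, T - stride, ..., down to 0."""
    x = x_T
    for t in range(T, 0, -stride):
        t_prev = max(t - stride, 0)
        eps = epsilon_hat(x, t)
        pred_x0 = (x - math.sqrt(1.0 - alpha_bars[t]) * eps) / math.sqrt(alpha_bars[t])
        x = math.sqrt(alpha_bars[t_prev]) * pred_x0 + math.sqrt(1.0 - alpha_bars[t_prev]) * eps
    return x


rng = random.Random(0)
ddpm_samples = [ddpm_sample(rng) for _ in range(100)]
noise = [rng.gauss(0.0, 1.0) for _ in range(100)]
ddim_samples = [ddim_sample(x_T) for x_T in noise]
```

With `stride=20` the DDIM loop evaluates the noise predictor 50 times instead of 1000, and calling `ddim_sample` twice with the same `x_T` returns the same value, which illustrates the deterministic mapping from $x_T$ to $x_0$.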

5. Text-Prompting

  • Classifier Guidance: Uses a separately trained classifier $p_{\phi}(y|x_t)$ (where $y$ is a class label, e.g., derived from text) to guide the generation process towards samples matching the condition $y$. The mean of the reverse step is modified by adding a term proportional to the gradient of the classifier's log-likelihood: $\nabla_{x_t} \log p_{\phi}(y|x_t)$. CLIP embeddings can be used to provide this guidance signal from text prompts [dhariwal_nichol_arxiv, Nichol2022glide].
    • Modified mean for sampling $x_{t-1}$: $\mu_{\theta}(x_t, t) + s \cdot \Sigma_{\theta}(x_t, t) \nabla_{x_t} \log p_{\phi}(y|x_t)$ (where $s$ is a guidance scale).
  • Classifier-Free Guidance: Avoids the need for an external classifier by training a single conditional diffusion model $\hat{\epsilon}_{\theta}(x_t, y, t)$ that takes the conditioning information $y$ (e.g., text embedding) as input. During training, the condition $y$ is randomly dropped (replaced with a null embedding $\emptyset$). At sampling time, the effective noise prediction is extrapolated:

    $$\tilde{\epsilon}_{\theta}(x_t, y, t) = \hat{\epsilon}_{\theta}(x_t, \emptyset, t) + s \cdot (\hat{\epsilon}_{\theta}(x_t, y, t) - \hat{\epsilon}_{\theta}(x_t, \emptyset, t))$$

    This $\tilde{\epsilon}_{\theta}$ is then used in the DDPM or DDIM sampling loop. $s$ is the guidance scale hyperparameter [ho2021classifierfree].
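
The two guidance rules reduce to short arithmetic once the network outputs are available. The sketch below uses made-up numbers in place of real model outputs; the function names and the example vectors are hypothetical and only illustrate how the guidance scale $s$ enters.

```python
def classifier_guided_mean(mean, variance, grad_log_classifier, scale):
    """Return mu + s * Sigma * grad_x log p_phi(y | x_t) for scalar inputs."""
    return mean + scale * variance * grad_log_classifier


def classifier_free_epsilon(eps_uncond, eps_cond, scale):
    """Combine unconditional and conditional noise predictions coordinate by coordinate."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Hypothetical network outputs for a 3-dimensional x_t.
eps_uncond = [0.10, -0.20, 0.30]  # epsilon_hat(x_t, null, t)
eps_cond = [0.30, -0.10, 0.00]    # epsilon_hat(x_t, y, t)

unconditional = classifier_free_epsilon(eps_uncond, eps_cond, scale=0.0)
conditional = classifier_free_epsilon(eps_uncond, eps_cond, scale=1.0)
extrapolated = classifier_free_epsilon(eps_uncond, eps_cond, scale=3.0)
shifted_mean = classifier_guided_mean(mean=1.0, variance=0.5, grad_log_classifier=2.0, scale=2.0)
```

Setting $s=0$ recovers the unconditional prediction and $s=1$ the conditional one, while $s>1$ moves past the conditional prediction in the direction away from the unconditional one.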

6. Appendix: Mathematical Concepts

  • Monte Carlo Estimation: Explains how to approximate expectations $E_{X \sim p}[f(X)]$ by averaging $f(x)$ over samples $x$ drawn from $p$.
  • Reparameterization Trick: Describes how to compute gradients of expectations $E_{X \sim p_{\theta}}[f(X)]$ with respect to parameters $\theta$ of the distribution $p_{\theta}$. This is crucial for training models like VAEs and diffusion models, allowing gradients to flow through the sampling process. It involves expressing the random variable $X$ as a deterministic, differentiable function of $\theta$ and an auxiliary random variable with a fixed distribution (e.g., $X = \mu_{\theta} + \sigma_{\theta} \cdot Z$, where $Z \sim \mathcal{N}(0, I)$).
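
Both concepts can be illustrated together on a case with a known answer. The sketch below assumes a scalar $X = \mu + \sigma Z$ and the test function $f(x) = x^2$, for which $E[X^2] = \mu^2 + \sigma^2$ and $\partial E[X^2] / \partial \mu = 2\mu$. The function name and the chosen values of $\mu$ and $\sigma$ are illustrative.

```python
import random


def reparameterized_estimates(mu, sigma, n_samples, rng):
    """Monte Carlo estimates of E[X^2] and d/dmu E[X^2] for X = mu + sigma * Z."""
    total_value, total_grad = 0.0, 0.0
    for _ in range(n_samples):
        z = rng.gauss(0.0, 1.0)  # auxiliary noise with a fixed distribution
        x = mu + sigma * z       # deterministic and differentiable in mu and sigma
        total_value += x * x     # f(x) = x^2
        total_grad += 2.0 * x    # d f(mu + sigma * z) / d mu = 2 * x
    return total_value / n_samples, total_grad / n_samples


value, grad_mu = reparameterized_estimates(mu=1.5, sigma=0.5, n_samples=20000, rng=random.Random(0))
```

Here `value` should be close to $1.5^2 + 0.5^2 = 2.5$ and `grad_mu` close to $2 \cdot 1.5 = 3$, with the error shrinking roughly as $1/\sqrt{n}$ in the number of samples.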

In summary, the lecture notes provide a rigorous mathematical foundation for diffusion models, explaining the forward noising process, the learned reverse denoising process, the derivation of the ELBO-based loss function (simplified to noise prediction error), different sampling strategies (DDPM, DDIM), and methods for conditioning the generation on text prompts (classifier and classifier-free guidance) (2312.10393).