
Tutorial on Diffusion Models for Imaging and Vision

Published 26 Mar 2024 in cs.LG and cs.CV | (arXiv:2403.18103v3)

Abstract: The astonishing growth of generative tools in recent years has empowered many exciting applications in text-to-image generation and text-to-video generation. The underlying principle behind these generative tools is the concept of diffusion, a particular sampling mechanism that has overcome some shortcomings that were deemed difficult in the previous approaches. The goal of this tutorial is to discuss the essential ideas underlying the diffusion models. The target audience of this tutorial includes undergraduate and graduate students who are interested in doing research on diffusion models or applying these models to solve other problems.


Summary

  • The paper introduces a tutorial overview of diffusion models, detailing core techniques such as VAEs, DDPMs, SMLD, and SDEs.
  • The paper demonstrates how incremental denoising processes and noise prediction variants enable effective image synthesis.
  • The paper unifies multiple generative approaches under a stochastic differential equations framework, highlighting applications to inverse imaging problems.

Diffusion Models for Imaging and Vision: A Tutorial Overview

This tutorial elucidates the foundational concepts of diffusion models, a class of generative models that have recently achieved remarkable success in image and video synthesis. It focuses on the core ideas, providing a pedagogical treatment suitable for students and researchers entering the field. The tutorial covers variational autoencoders (VAEs), denoising diffusion probabilistic models (DDPMs), score-matching Langevin dynamics (SMLD), and stochastic differential equations (SDEs).

VAEs as a Foundation

The tutorial begins with a review of variational autoencoders (VAEs), framing them as an encoder-decoder architecture that maps data $\vx$ to a latent variable $\vz$ and back. The "variational" aspect comes from variational inference: the distributions $p(\vx)$, $p(\vz|\vx)$, and $p(\vx|\vz)$ are generally intractable, so VAEs introduce tractable proxy distributions $q_{\vphi}(\vz|\vx)$ and $p_{\vtheta}(\vx|\vz)$, typically chosen as Gaussians. The encoder learns the parameters $\vphi$ of $q_{\vphi}(\vz|\vx)$, while the decoder learns the parameters $\vtheta$ of $p_{\vtheta}(\vx|\vz)$.

To optimize $\vphi$ and $\vtheta$, the Evidence Lower Bound (ELBO) is introduced:

$\text{ELBO}(\vx) = \E_{q_{\vphi}(\vz|\vx)}\left[ \log \frac{p(\vx,\vz)}{q_{\vphi}(\vz|\vx)} \right].$

Maximizing the ELBO is equivalent to minimizing the KL divergence between $q_{\vphi}(\vz|\vx)$ and $p(\vz|\vx)$, while also ensuring good reconstruction. The ELBO can be decomposed into a reconstruction term and a prior matching term:

$\text{ELBO}(\vx) = \E_{q_{\vphi}(\vz|\vx)}[ \log p_{\vtheta}(\vx|\vz) ] - \mathbb{D}_{\text{KL}}\big(q_{\vphi}(\vz|\vx) \,\|\, p(\vz)\big).$

The reconstruction term encourages the decoder to generate realistic images, while the prior matching term forces the latent distribution to be close to a standard Gaussian $\calN(0,\mI)$.
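Both terms can be sketched numerically. A minimal example, assuming a diagonal-Gaussian encoder $q_{\vphi}(\vz|\vx)$, a standard Gaussian prior, and a unit-variance Gaussian decoder; `mu`, `log_var`, and `decode` are illustrative placeholders for the network outputs, not part of the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def elbo_terms(x, mu, log_var, decode, n_samples=64):
    """Monte-Carlo reconstruction term plus closed-form KL to N(0, I).

    q_phi(z|x) = N(mu, diag(exp(log_var))); prior p(z) = N(0, I).
    `decode` maps z to the mean of a unit-variance Gaussian p_theta(x|z).
    """
    d = mu.size
    # Reparameterized samples z = mu + sigma * eps
    eps = rng.standard_normal((n_samples, d))
    z = mu + np.exp(0.5 * log_var) * eps
    # Reconstruction term E_q[log p_theta(x|z)], up to an additive constant
    recon = -0.5 * np.mean(np.sum((decode(z) - x) ** 2, axis=1))
    # Prior matching term: KL between two diagonal Gaussians, in closed form
    kl = 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)
    return recon, kl

# Toy check: an encoder already matched to the prior has zero KL penalty.
x = np.zeros(4)
recon, kl = elbo_terms(x, mu=np.zeros(4), log_var=np.zeros(4), decode=lambda z: z)
```

The ELBO is then `recon - kl`, maximized over the encoder and decoder parameters.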

DDPMs: Incremental Refinement

DDPMs are presented as incremental updates within an encoder-decoder structure. Key to DDPMs is the concept of a denoiser, which facilitates transitions between states. A variational diffusion model consists of a sequence of states $\vx_0, \vx_1, \ldots, \vx_T$, where $\vx_0$ is the original image and $\vx_T$ is a latent variable following a standard Gaussian distribution. The forward process gradually adds noise to the image, while the reverse process iteratively denoises the image starting from $\vx_T$.

The transition distribution $q_{\vphi}(\vx_t|\vx_{t-1})$ is defined as a Gaussian:

$q_{\vphi}(\vx_t|\vx_{t-1}) = \calN(\vx_t \,|\, \sqrt{\alpha_t} \vx_{t-1},(1-\alpha_t)\mI),$

where $\alpha_t$ is a variance schedule. This choice ensures that as $t$ increases, $\vx_t$ converges to a white Gaussian noise vector. The conditional distribution $q_{\vphi}(\vx_t|\vx_0)$ is also Gaussian:

$q_{\vphi}(\vx_t|\vx_0) = \calN(\vx_t \,|\, \sqrt{\overline{\alpha}_t} \vx_0, \;\; (1-\overline{\alpha}_t)\mI),$

where $\overline{\alpha}_t = \prod_{i=1}^{t} \alpha_i$.
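Because $q_{\vphi}(\vx_t|\vx_0)$ is Gaussian, any $\vx_t$ can be drawn from $\vx_0$ in a single step rather than by iterating the chain. A minimal sketch; the linear schedule below is an illustrative assumption, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
# Illustrative variance schedule: alpha_t close to 1, decreasing slowly
beta = np.linspace(1e-4, 0.02, T)
alpha = 1.0 - beta
alpha_bar = np.cumprod(alpha)  # \bar{alpha}_t = prod_{i<=t} alpha_i

def q_sample(x0, t):
    """Draw x_t ~ q(x_t | x_0) = N(sqrt(abar_t) x0, (1 - abar_t) I)."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

x0 = np.ones(8)
x_early, x_late = q_sample(x0, 10), q_sample(x0, T - 1)
```

At small $t$ the sample stays close to $\vx_0$ ($\overline{\alpha}_t \approx 1$); at $t = T$ almost no signal remains and $\vx_T$ is essentially white Gaussian noise.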

The ELBO for DDPMs is:

$\text{ELBO}_{\vphi,\vtheta}(\vx) = \E_{q_{\vphi}(\vx_1|\vx_0)}[\log p_{\vtheta}(\vx_0|\vx_1)] - \E_{q_{\vphi}(\vx_{T-1}|\vx_0)} \Big[ \mathbb{D}_{\text{KL}}\Big(q_{\vphi}(\vx_T|\vx_{T-1}) \,\|\, p(\vx_T) \Big) \Big] - \sum_{t=1}^{T-1} \E_{q_{\vphi}(\vx_{t-1},\vx_{t+1}|\vx_0)} \Big[ \mathbb{D}_{\text{KL}}\Big(q_{\vphi}(\vx_t|\vx_{t-1}) \,\|\, p_{\vtheta}(\vx_t|\vx_{t+1}) \Big) \Big].$

To simplify training, the consistency term is rewritten using Bayes' theorem:

$q(\vx_t|\vx_{t-1},\vx_0) = \frac{q(\vx_{t-1}|\vx_t,\vx_0) q(\vx_t|\vx_0)}{q(\vx_{t-1}|\vx_0)}.$

This leads to a more tractable ELBO:

$\text{ELBO}_{\vphi,\vtheta}(\vx) = \E_{q_{\vphi}(\vx_1|\vx_0)}[\log p_{\vtheta}(\vx_0|\vx_1)] - \mathbb{D}_{\text{KL}}\Big(q_{\vphi}(\vx_T|\vx_0) \,\|\, p(\vx_T) \Big) - \sum_{t=2}^{T} \E_{q_{\vphi}(\vx_{t}|\vx_0)} \Big[ \mathbb{D}_{\text{KL}}\Big(q_{\vphi}(\vx_{t-1}|\vx_t,\vx_0) \,\|\, p_{\vtheta}(\vx_{t-1}|\vx_{t}) \Big) \Big].$

The distribution $q_{\vphi}(\vx_{t-1}|\vx_t,\vx_0)$ is Gaussian with a mean $\vmu_q(\vx_t,\vx_0)$ and a covariance $\mSigma_q(t)$ that can be computed analytically. The reverse process $p_{\vtheta}(\vx_{t-1}|\vx_t)$ is also modeled as a Gaussian, with a mean $\vmu_{\vtheta}(\vx_t)$ predicted by a neural network and a variance $\sigma_q^2(t)\mI$.
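Concretely, completing the square in the Bayes identity above gives the standard closed forms (consistent with the Gaussian forward process):

$\vmu_q(\vx_t,\vx_0) = \frac{\sqrt{\alpha_t}\,(1-\overline{\alpha}_{t-1})\,\vx_t + \sqrt{\overline{\alpha}_{t-1}}\,(1-\alpha_t)\,\vx_0}{1-\overline{\alpha}_t}, \qquad \sigma_q^2(t) = \frac{(1-\alpha_t)(1-\overline{\alpha}_{t-1})}{1-\overline{\alpha}_t},$

with $\mSigma_q(t) = \sigma_q^2(t)\mI$.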

The training objective then becomes:

$\vtheta^* = \argmin_{\vtheta} \sum_{t=1}^{T} \frac{1}{2\sigma_q^2(t)} \frac{(1-\alpha_t)^2 \overline{\alpha}_{t-1}}{(1-\overline{\alpha}_t)^2} \E_{q(\vx_{t}|\vx_0)} \Big[ \left\| \widehat{\vx}_{\vtheta}(\vx_t)-\vx_0 \right\|^2 \Big],$

where $\widehat{\vx}_{\vtheta}(\vx_t)$ is a neural network that estimates $\vx_0$ from $\vx_t$.

The tutorial also discusses a noise prediction variant, where the network learns to predict the noise $\vepsilon_0$ added to $\vx_0$ to obtain $\vx_t$.
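In the noise-prediction variant the per-step objective reduces to a regression on the injected noise, $\|\vepsilon_{\vtheta}(\vx_t, t) - \vepsilon_0\|^2$. A minimal training-step sketch; the schedule and the `eps_net` argument are illustrative placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
beta = np.linspace(1e-4, 0.02, T)  # illustrative schedule
alpha_bar = np.cumprod(1.0 - beta)

def ddpm_noise_loss(eps_net, x0_batch):
    """One Monte-Carlo estimate of E_{t,eps} || eps_net(x_t, t) - eps ||^2."""
    n, d = x0_batch.shape
    t = rng.integers(0, T, size=n)     # random timestep per sample
    eps = rng.standard_normal((n, d))  # the noise the network must predict
    ab = alpha_bar[t][:, None]
    x_t = np.sqrt(ab) * x0_batch + np.sqrt(1.0 - ab) * eps
    return np.mean(np.sum((eps_net(x_t, t) - eps) ** 2, axis=1))

# A trivial predictor that always outputs zero leaves the full noise
# energy E||eps||^2 = d in the loss.
x0 = rng.standard_normal((256, 4))
loss_zero = ddpm_noise_loss(lambda x_t, t: np.zeros_like(x_t), x0)
```

A trained network drives this loss well below the zero-predictor baseline.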

Finally, the tutorial introduces Inversion by Direct Iteration (InDI), which frames generative diffusion models from a pure denoising perspective.

SMLD: Langevin Dynamics

SMLD provides an alternative approach to generative modeling based on Langevin dynamics, which iteratively refines samples using the gradient of the data distribution. Langevin dynamics uses the following iterative procedure:

$\vx_{t+1} = \vx_t + \tau \nabla_{\vx} \log p(\vx_t) + \sqrt{2\tau} \vz, \qquad \vz \sim \calN(0,\mI),$

where $\tau$ is a step size. The gradient $\nabla_{\vx} \log p(\vx)$ is known as Stein's score function; its neural-network approximation is denoted $\vs_{\vtheta}(\vx)$. Since $p(\vx)$ is unknown, the score is estimated using score matching techniques.
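As a sanity check on the iteration itself, running it with the exact score of a standard Gaussian, $\nabla_{\vx}\log p(\vx) = -\vx$, drives samples toward $\calN(0,\mI)$. A sketch under that assumption; in practice the exact score is replaced by the learned $\vs_{\vtheta}$:

```python
import numpy as np

rng = np.random.default_rng(0)

def langevin_sample(score, x0, tau=0.01, n_steps=2000):
    """Langevin dynamics: x <- x + tau * score(x) + sqrt(2 tau) z."""
    x = x0.copy()
    for _ in range(n_steps):
        z = rng.standard_normal(x.shape)
        x = x + tau * score(x) + np.sqrt(2.0 * tau) * z
    return x

# Target p(x) = N(0, I), whose score is -x. Start far from the mode.
x0 = 5.0 * np.ones((4000, 2))
samples = langevin_sample(lambda x: -x, x0)
```

After enough steps the empirical mean is near 0 and the variance near 1, i.e. the chain has forgotten its initialization and samples from the target (up to a small discretization bias of order $\tau$).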

Explicit score matching (ESM) approximates $p(\vx)$ using kernel density estimation and minimizes the following loss:

$J_{\text{ESM}}(\vtheta) = \E_{q(\vx)} \|\vs_{\vtheta}(\vx) - \nabla_{\vx} \log q(\vx)\|^2.$

Denoising score matching (DSM) avoids the need for kernel density estimation by minimizing:

$J_{\text{DSM}}(\vtheta) = \E_{q(\vx,\vx')}\left[ \frac{1}{2}\left\|\vs_{\vtheta}(\vx) - \nabla_{\vx} \log q(\vx|\vx') \right\|^2\right],$

where $q(\vx|\vx')$ is a conditional distribution, typically a Gaussian. This loss is equivalent to training the network to predict the noise added to the data.
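For a Gaussian kernel $q(\vx|\vx') = \calN(\vx \,|\, \vx', \sigma^2\mI)$ the target score is $\nabla_{\vx}\log q(\vx|\vx') = -(\vx-\vx')/\sigma^2$, which is why DSM amounts to noise prediction. A minimal sketch; `score_net` is a placeholder for the model:

```python
import numpy as np

rng = np.random.default_rng(0)

def dsm_loss(score_net, x_clean, sigma=0.5, n=4096):
    """Monte-Carlo J_DSM with a Gaussian kernel: perturb clean data x',
    then regress score_net(x) onto -(x - x') / sigma^2."""
    idx = rng.integers(0, len(x_clean), size=n)
    x_prime = x_clean[idx]
    eps = rng.standard_normal(x_prime.shape)
    x = x_prime + sigma * eps                  # x ~ q(x | x')
    target = -(x - x_prime) / sigma ** 2       # = nabla_x log q(x | x')
    return 0.5 * np.mean(np.sum((score_net(x) - target) ** 2, axis=1))

# Baseline: a zero score leaves the full target energy d / (2 sigma^2).
x_clean = rng.standard_normal((100, 2))
loss_zero = dsm_loss(lambda x: np.zeros_like(x), x_clean)
```

Training `score_net` to minimize this loss recovers (an approximation of) the score of the smoothed data distribution, without ever forming a kernel density estimate.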

SDEs: A Unifying Framework

The tutorial presents stochastic differential equations (SDEs) as a unifying framework for understanding diffusion models. SDEs describe the evolution of a probability distribution over time. The forward SDE is given by:

$d\vx = \vf (\vx,t) \; dt + g(t) \; d\vw,$

where $\vf(\vx,t)$ is the drift coefficient and $g(t)$ is the diffusion coefficient. The reverse SDE is:

$d \vx = [\vf(\vx,t) - g(t)^2 \nabla_{\vx} \log p_t(\vx)] \; dt + g(t) d\overline{\vw},$

where $p_t(\vx)$ is the probability distribution of $\vx$ at time $t$.

The tutorial shows how DDPMs and SMLD can be expressed as specific instances of this general SDE framework. The DDPM forward sampling equation can be written as:

$d\vx = -\frac{\beta(t)}{2}\; \vx \; dt + \sqrt{\beta(t)}d \vw,$

while the SMLD forward sampling equation can be written as:

$d\vx = \sqrt{\frac{d[\sigma(t)^2]}{dt}} \; d\vw.$

Numerical methods for solving SDEs, such as Euler and Runge-Kutta methods, are briefly discussed. A predictor-corrector approach that combines an SDE solver with score matching is also presented.
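The simplest of these solvers is the Euler-Maruyama scheme: step the drift forward in time and add a $\sqrt{\Delta t}$-scaled Gaussian increment for the diffusion. A sketch applied to the DDPM forward SDE above, with a constant $\beta$ chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def euler_maruyama(drift, diffusion, x0, t0=0.0, t1=1.0, n_steps=500):
    """Integrate dx = f(x,t) dt + g(t) dw with the Euler-Maruyama scheme."""
    x, t = x0.copy(), t0
    dt = (t1 - t0) / n_steps
    for _ in range(n_steps):
        dw = np.sqrt(dt) * rng.standard_normal(x.shape)
        x = x + drift(x, t) * dt + diffusion(t) * dw
        t += dt
    return x

# DDPM forward SDE with constant beta: dx = -(beta/2) x dt + sqrt(beta) dw.
beta = 10.0
x0 = np.full((5000, 1), 3.0)
xT = euler_maruyama(lambda x, t: -0.5 * beta * x, lambda t: np.sqrt(beta), x0)
```

This is an Ornstein-Uhlenbeck process, so by $t = 1$ the ensemble has contracted toward zero mean and unit variance, matching the claim that the forward process whitens the data.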

Conclusion

The tutorial concludes by emphasizing the underlying unity of diffusion models, highlighting the incremental nature of the denoising process, and noting potential limitations and future research directions. It also highlights the applicability of diffusion models to inverse problems.
