Comprehensive Analysis of "Diffusion Models in Vision: A Survey"
The survey "Diffusion Models in Vision: A Survey", authored by Florinel-Alin Croitoru, Vlad Hondru, Radu Tudor Ionescu, and Mubarak Shah, provides an extensive review of diffusion models in computer vision. It classifies the field's advances and methodologies into a coherent taxonomy and elaborates on their practical applications and theoretical underpinnings.
Overview of Diffusion Models
Diffusion models are a category of deep generative models known for their ability to generate high-quality and diverse samples, albeit at the cost of significant computational resources. These models operate in two stages (a minimal code sketch follows the list below):
- Forward Diffusion Stage: The input data is incrementally perturbed by adding Gaussian noise over several steps.
- Reverse Diffusion Stage: A model is tasked with recovering the original input by learning to reverse the diffusion process step-by-step.
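To make the two stages concrete, below is a minimal PyTorch sketch of the closed-form forward noising used by DDPM-style models and of a single learned reverse step. The noise-prediction network `eps_model`, the linear schedule, and all hyperparameters are illustrative assumptions rather than the survey's reference implementation.

```python
import torch

# Illustrative linear noise schedule (an assumption; schedules vary by paper).
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)  # abar_t = prod_{s<=t} alpha_s

def forward_diffuse(x0: torch.Tensor, t: int):
    """Sample x_t ~ q(x_t | x_0) in closed form:
    x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps."""
    eps = torch.randn_like(x0)
    xt = alpha_bars[t].sqrt() * x0 + (1 - alpha_bars[t]).sqrt() * eps
    return xt, eps

@torch.no_grad()
def reverse_step(eps_model, xt: torch.Tensor, t: int) -> torch.Tensor:
    """One ancestral reverse step: predict the injected noise, then sample x_{t-1}."""
    eps_hat = eps_model(xt, t)  # hypothetical noise-prediction network
    mean = (xt - betas[t] / (1 - alpha_bars[t]).sqrt() * eps_hat) / alphas[t].sqrt()
    if t == 0:
        return mean
    return mean + betas[t].sqrt() * torch.randn_like(xt)  # sigma_t^2 = beta_t choice
```

Training such a model then reduces to regressing `eps_hat` onto the true noise `eps` injected by the forward pass.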
This survey categorizes diffusion models into three generic frameworks; their core equations are summarized after the list below:
- Denoising Diffusion Probabilistic Models (DDPMs): Rooted in non-equilibrium thermodynamics, these models perturb the data with Gaussian noise in a Markovian manner and require many steps in the learned reverse (generative) process.
- Noise-Conditioned Score Networks (NCSNs): These models employ score matching to estimate the score function (the gradient of the log data density) of the perturbed data distribution at multiple noise levels. Sampling is performed with annealed Langevin dynamics.
- Stochastic Differential Equations (SDEs): This formulation generalizes both DDPMs and NCSNs by treating diffusion as a continuous-time process governed by an SDE. Samples are generated by solving the corresponding reverse-time SDE.
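For reference, the defining equations of the three frameworks can be written compactly as follows; the notation (beta_t for the noise schedule, s_theta for the score network, f and g for the drift and diffusion coefficients) follows the standard formulations covered by the survey.

```latex
% DDPM: Markovian Gaussian forward kernel
q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t \mathbf{I}\big)

% NCSN: score estimation at noise level \sigma, followed by Langevin sampling
s_\theta(x, \sigma) \approx \nabla_x \log p_\sigma(x), \qquad
x_{i+1} = x_i + \tfrac{\gamma}{2}\, s_\theta(x_i, \sigma) + \sqrt{\gamma}\, z_i,
\quad z_i \sim \mathcal{N}(0, \mathbf{I})

% SDE view: forward diffusion and its reverse-time counterpart
\mathrm{d}x = f(x,t)\,\mathrm{d}t + g(t)\,\mathrm{d}w, \qquad
\mathrm{d}x = \big[f(x,t) - g(t)^2 \nabla_x \log p_t(x)\big]\,\mathrm{d}t + g(t)\,\mathrm{d}\bar{w}
```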
Relations to Other Generative Models
The paper draws comparisons between diffusion models and other deep generative models, pinpointing similarities and distinctions:
- Variational Auto-Encoders (VAEs): Both model families encode data into a latent space and are trained with variational objectives. However, VAEs compress the input into a lower-dimensional representation, whereas the latent variables of diffusion models retain the original data dimensionality.
- Generative Adversarial Networks (GANs): GANs sample faster, since generation requires a single network pass rather than many denoising steps, but diffusion models produce diverse, high-quality samples and avoid the mode-collapse issues typical of GANs.
- Autoregressive Models: These models carry an inherent unidirectional bias from their fixed generation order; hybrids such as Autoregressive Diffusion Models (ARDMs) mitigate this bias by blending autoregressive generation with diffusion-style, order-agnostic training.
- Normalizing Flows: Both model families map data to Gaussian noise, but normalizing flows rely on invertible transformations, which impose significant architectural constraints; hybrids such as DiffFlow relax these constraints by combining flows with diffusion.
Categorization and Applications in Vision
The survey introduces a multi-perspective categorization based on task, denoising condition, and underlying framework. It discusses:
- Image Generation: Diffusion models have surpassed GANs in the quality of generated images on standard benchmarks. Extensive work has targeted sampling efficiency, for example by substituting the Gaussian noise distribution with more expressive alternatives and by integrating hybrid architectures.
- Conditional Image Synthesis: By conditioning on various signals (e.g., text, class labels), these models have shown strong capabilities in guided image generation. Recent approaches rely on classifier-free guidance, sketched in code after this list.
- Image-to-Image Translation: These models facilitate tasks such as colorization, image inpainting, and super-resolution. Techniques like annealed Langevin dynamics and Brownian bridges improve performance and efficiency.
- Text-to-Image Synthesis: State-of-the-art models such as Stable Diffusion and Imagen exhibit strong text-to-image generative capabilities, producing images that faithfully and often creatively reflect their textual descriptions.
- Medical Image Analysis: Applications extend to reconstructing images from measurements, addressing inverse problems, and anomaly detection.
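To illustrate the guidance mechanism mentioned in the conditional-synthesis item above, here is a minimal sketch of classifier-free guidance at sampling time; `eps_model`, its `cond` argument, and the default guidance scale `w` are illustrative assumptions, not a specific model's API.

```python
import torch

@torch.no_grad()
def guided_eps(eps_model, xt, t, cond, w: float = 7.5):
    """Classifier-free guidance: extrapolate from the unconditional noise
    prediction toward the conditional one. cond=None denotes the null
    (unconditional) token; w controls how strongly the condition is enforced."""
    eps_uncond = eps_model(xt, t, cond=None)  # unconditional branch
    eps_cond = eps_model(xt, t, cond=cond)    # conditional branch
    return eps_uncond + w * (eps_cond - eps_uncond)
```

The guided prediction is then plugged into the ordinary reverse step in place of the raw network output; larger `w` trades sample diversity for fidelity to the condition.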
Current Limitations and Future Directions
Despite their success, the primary limitation of diffusion models remains their computational cost: sampling is far slower than with GANs because it requires many sequential denoising steps. Recent advances therefore focus on reducing the number of denoising steps while preserving image quality.
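A representative example of this line of work is DDIM-style deterministic sampling, which can jump over large blocks of timesteps. The sketch below is a minimal illustration under the same assumptions as before (a hypothetical noise-prediction network `eps_model` and a precomputed `alpha_bars` schedule).

```python
import torch

@torch.no_grad()
def ddim_step(eps_model, xt, t: int, t_prev: int, alpha_bars):
    """Deterministic DDIM update from timestep t to an earlier t_prev (t_prev < t),
    allowing large jumps and thus far fewer denoising steps than ancestral sampling."""
    eps_hat = eps_model(xt, t)
    # Predict the clean image from the current noise estimate.
    x0_hat = (xt - (1 - alpha_bars[t]).sqrt() * eps_hat) / alpha_bars[t].sqrt()
    # Re-noise x0_hat to the earlier timestep along the deterministic (eta = 0) path.
    return alpha_bars[t_prev].sqrt() * x0_hat + (1 - alpha_bars[t_prev]).sqrt() * eps_hat
```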
Looking ahead, several future research directions are proposed:
- Model Efficiency: Incorporating efficient update rules from gradient-based optimization methods (e.g., momentum) into the sampling process; see the illustrative sketch after this list.
- Task Exploration: Expanding applications to uncharted areas such as video anomaly detection and visual question answering.
- Representation Learning: Exploring the representational utility of diffusion models for downstream tasks and data augmentation.
- Multimodal Integration: Combining conditioning signals from multiple modalities (e.g., text and class labels) to build more versatile, multi-purpose models.
- Long-term Temporal Modeling: Improving generative capabilities in video by capturing long-term object interactions.
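As a purely illustrative sketch of the first direction above (an assumption about what such a sampler could look like, not an established method from the survey), one could augment a Langevin-style update with a heavy-ball momentum term borrowed from gradient-based optimizers:

```python
import torch

@torch.no_grad()
def momentum_langevin(score_model, x, steps: int = 100,
                      gamma: float = 1e-4, mu: float = 0.9):
    """Illustrative only: a Langevin-style sampler with a heavy-ball momentum
    term. score_model is a hypothetical score network s_theta(x) ~ grad log p(x)."""
    v = torch.zeros_like(x)
    for _ in range(steps):
        score = score_model(x)
        v = mu * v + 0.5 * gamma * score                  # momentum accumulation
        x = x + v + gamma ** 0.5 * torch.randn_like(x)    # noisy update
    return x
```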
In conclusion, the survey by Croitoru and colleagues offers a thorough and insightful analysis of the rapid advances and broad applications of diffusion models in computer vision. As the field progresses, these models are well positioned to deliver robust generative solutions across many domains.