Text-to-Image Diffusion Models in Generative AI: An Expert Analysis
This paper offers a comprehensive survey of text-to-image diffusion models, situated within the broader context of diffusion models in generative AI. The authors begin with a foundational overview of how diffusion models synthesize images, before discussing the conditioning and guidance mechanisms used to steer generation toward desired outputs. The survey then moves to state-of-the-art methods in text-conditioned image synthesis, addressing both standard text-to-image generation and applications beyond it, such as text-guided creative generation and image editing. The paper also outlines current challenges and proposes promising avenues for future research.
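For readers new to the formulation the survey builds on, the standard denoising diffusion setup can be summarized as follows; this is a conventional rendering of the DDPM objective, not an equation reproduced from the paper.

```latex
% Forward (noising) process with variance schedule \alpha_t:
q(x_t \mid x_0) = \mathcal{N}\!\big(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t)I\big),
\qquad \bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s
% Simplified training objective: a network \epsilon_\theta learns to predict the added noise
L_{\text{simple}} = \mathbb{E}_{x_0,\ \epsilon \sim \mathcal{N}(0, I),\ t}\,
\big\|\, \epsilon - \epsilon_\theta(x_t, t) \,\big\|^2
```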
The authors segment the analysis of text-to-image diffusion models into two primary frameworks: those operating in pixel space and those operating in latent space. Notable among the pixel-space models are GLIDE and Imagen, both of which rely on classifier-free guidance. GLIDE pairs the diffusion model with a transformer-based text encoder and achieves strong image fidelity. Imagen, by contrast, uses a large frozen pretrained language model (T5) as its text encoder, supporting the finding that scaling the text encoder improves both image fidelity and text-image alignment.
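Classifier-free guidance is simple enough to show concretely. The sketch below is a generic illustration rather than code from GLIDE or Imagen; `eps_model`, `text_emb`, and `null_emb` are hypothetical placeholders for a noise-prediction network, a caption embedding, and the empty-prompt embedding.

```python
def cfg_noise_estimate(eps_model, x_t, t, text_emb, null_emb, guidance_scale=7.5):
    """Classifier-free guidance: blend conditional and unconditional noise predictions.

    The same network is queried twice; the final estimate extrapolates away from
    the unconditional prediction toward the text-conditioned one.
    """
    eps_cond = eps_model(x_t, t, text_emb)     # prediction conditioned on the caption
    eps_uncond = eps_model(x_t, t, null_emb)   # prediction with the caption dropped
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```

Larger guidance scales trade diversity for tighter adherence to the prompt, which is why both GLIDE and Imagen treat the scale as a key sampling hyperparameter.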
Latent-space models, such as Stable Diffusion and DALL-E 2, represent the other category. Stable Diffusion, which builds on the Latent Diffusion Model (LDM), uses a VQGAN-style autoencoder to compress images into a lower-dimensional latent space, cutting the computational cost of the diffusion process while preserving perceptual detail. DALL-E 2 builds on CLIP's joint text-image embedding space: a prior maps a caption's CLIP text embedding to a corresponding image embedding, and a diffusion decoder renders the final image, illustrating the benefit of learning a text-to-image latent prior.
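The core idea of latent diffusion, running the denoising loop in the autoencoder's compressed space and decoding only at the end, can be sketched as follows. The objects `vae`, `unet`, and `scheduler` are placeholders assumed to behave like an autoencoder, a text-conditioned denoiser, and a noise scheduler; this is a minimal sketch, not any particular library's API.

```python
import torch

@torch.no_grad()
def latent_diffusion_sample(vae, unet, scheduler, text_emb, latent_shape):
    """Sample an image by denoising in latent space, then decoding (LDM-style sketch)."""
    z = torch.randn(latent_shape)          # start from Gaussian noise in the latent space
    for t in scheduler.timesteps:          # iterate over the reverse-diffusion timesteps
        eps = unet(z, t, text_emb)         # text-conditioned noise prediction on the latent
        z = scheduler.step(eps, t, z)      # one reverse update toward a clean latent
    return vae.decode(z)                   # decode the clean latent back to pixel space
```

Because the latent grid is much smaller than the pixel grid, every denoising step is correspondingly cheaper, which is the main efficiency argument the survey attributes to this family of models.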
Building on these pioneering works, the paper reviews subsequent advances in architecture and methodology, covering multimodal guidance, specialized denoisers, sketch-assisted spatial control, and concept control via textual inversion. It also examines retrieval mechanisms for handling out-of-distribution prompts.
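Of these techniques, textual inversion is easy to convey in code: a single new token embedding is optimized with the ordinary diffusion loss while the rest of the model stays frozen. The sketch below uses hypothetical names (`unet`, `scheduler`, `embed_prompt_with`, `new_token_emb`) rather than the API of any specific implementation.

```python
import torch
import torch.nn.functional as F

def textual_inversion_step(unet, scheduler, embed_prompt_with, new_token_emb,
                           concept_latents, optimizer):
    """One optimization step of textual inversion (sketch).

    Only `new_token_emb`, the embedding of a placeholder token such as "<my-concept>",
    receives gradients; the denoiser and text encoder are frozen.
    """
    noise = torch.randn_like(concept_latents)
    t = torch.randint(0, scheduler.num_train_timesteps, (concept_latents.shape[0],))
    noisy = scheduler.add_noise(concept_latents, noise, t)  # forward-noise the concept images' latents
    cond = embed_prompt_with(new_token_emb)                 # prompt embedding containing the new token
    pred = unet(noisy, t, cond)                             # frozen denoiser predicts the noise
    loss = F.mse_loss(pred, noise)                          # same objective as ordinary diffusion training
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```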
Evaluations of text-to-image methods consider both technical benchmarks and ethical considerations. On the technical side, FID gauges image fidelity and the CLIP score gauges text-image alignment, and recent benchmarks emphasize both dimensions. Ethical issues range from dataset biases to potential misuse and privacy risks, necessitating continual monitoring and adjustment of models.
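As an illustration of the alignment side of this evaluation, the snippet below computes a CLIP score as the cosine similarity between CLIP image and text embeddings, using OpenAI's reference CLIP package; FID requires a reference image set and is typically computed with separate tooling.

```python
import torch
import clip                      # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git
from PIL import Image

def clip_score(image_path: str, caption: str, device: str = "cpu") -> float:
    """Cosine similarity between CLIP image and text embeddings (higher = better alignment)."""
    model, preprocess = clip.load("ViT-B/32", device=device)
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    text = clip.tokenize([caption]).to(device)
    with torch.no_grad():
        img_emb = model.encode_image(image)
        txt_emb = model.encode_text(text)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return (img_emb @ txt_emb.T).item()
```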
Practical applications beyond text-to-image generation, such as visual art generation, text-to-video synthesis, and 3D content creation, underscore the broad utility of diffusion models. Text-guided image editing is another significant application area: because deterministic samplers can be inverted, a real image can be mapped back to a noise code and then regenerated under a modified prompt.
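A common way to obtain such a noise code is DDIM inversion, which runs the deterministic DDIM update in reverse. The sketch below is an approximate, illustrative version; `eps_model` and `text_emb` are placeholders, and `alphas_cumprod[t]` denotes the cumulative schedule term corresponding to timestep t.

```python
import torch

@torch.no_grad()
def ddim_invert(eps_model, x0, alphas_cumprod, text_emb):
    """Map a clean image (or its latent) x0 to an approximate DDIM noise code.

    Uses the standard approximation of reusing the noise estimate at step t
    when stepping "forward in time" from t to t+1.
    """
    x = x0
    for t in range(len(alphas_cumprod) - 1):
        a_t, a_next = alphas_cumprod[t], alphas_cumprod[t + 1]
        eps = eps_model(x, t, text_emb)                          # noise estimate at the current step
        x0_pred = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()      # implied clean-image prediction
        x = a_next.sqrt() * x0_pred + (1 - a_next).sqrt() * eps  # deterministic DDIM step, reversed
    return x  # approximate terminal noise; resample from it with an edited prompt to perform the edit
```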
In contemplating future directions, the authors emphasize the need for more diverse evaluation methods, unified multimodal frameworks, and collaboration across research fields. Addressing these avenues promises to extend the capabilities of generative models and cement their role in advancing computational creativity.
Overall, this paper maps the text-to-image diffusion landscape thoroughly, offering valuable insights and serving as a foundational resource for experienced researchers in the domain of generative AI.