Text-to-image Diffusion Models in Generative AI: A Survey (2303.07909v3)

Published 14 Mar 2023 in cs.CV, cs.AI, and cs.LG

Abstract: This survey reviews the progress of diffusion models in generating images from text, i.e., text-to-image diffusion models. As a self-contained work, this survey starts with a brief introduction of how diffusion models work for image synthesis, followed by the background for text-conditioned image synthesis. Based on that, we present an organized review of pioneering methods and their improvements on text-to-image generation. We further summarize applications beyond image generation, such as text-guided generation for various modalities like videos, and text-guided image editing. Beyond the progress made so far, we discuss existing challenges and promising future directions.

Text-to-Image Diffusion Models in Generative AI: An Expert Analysis

This paper offers a comprehensive survey of text-to-image diffusion models, situated within the broader context of diffusion models in generative AI. The authors begin with a foundational overview of how diffusion models operate in image synthesis, before discussing methods for conditioning or guiding models to improve learning. The survey then transitions to state-of-the-art methods in text-conditioned image synthesis, addressing both standard text-to-image generation and applications beyond it, such as text-guided creative generation and image editing. The paper also outlines current challenges and proposes promising future research avenues.
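For readers who want the mechanics behind this overview, the standard denoising diffusion formulation (following Ho et al.'s DDPM, on which the surveyed models build) can be stated compactly: a fixed forward process gradually Gaussian-noises an image, and a network is trained to predict the injected noise. The notation below is the standard one, not this survey's specifically.

```latex
% Forward (noising) process with variance schedule \beta_1, \dots, \beta_T:
q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t \mathbf{I}\right)

% Marginal that lets x_t be sampled directly from x_0,
% with \alpha_t = 1-\beta_t and \bar{\alpha}_t = \prod_{s=1}^{t}\alpha_s:
q(x_t \mid x_0) = \mathcal{N}\left(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t)\mathbf{I}\right)

% Simplified training objective: the network \epsilon_\theta predicts the noise.
\mathcal{L}_{\text{simple}} = \mathbb{E}_{x_0,\ \epsilon \sim \mathcal{N}(0,\mathbf{I}),\ t}\left[\left\|\epsilon - \epsilon_\theta\!\left(\sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\ t\right)\right\|^2\right]
```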

The authors segment the analysis of text-to-image diffusion models into two primary frameworks: those operating in pixel space and those in latent space. Notable among pixel-space models are GLIDE and Imagen, both of which adopt classifier-free guidance. GLIDE utilizes a transformer-based text encoder and exhibits strong performance on fidelity metrics. In contrast, Imagen leverages a large frozen pretrained LLM for text encoding, reinforcing the finding that larger language models enhance both image fidelity and text-image alignment.
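To make classifier-free guidance concrete, here is a minimal sketch: the `denoiser` interface and the scale value are illustrative assumptions, not GLIDE's or Imagen's actual code.

```python
import torch

def cfg_noise_prediction(denoiser, x_t, t, text_emb, null_emb, guidance_scale=7.5):
    """Classifier-free guidance: extrapolate from the unconditional noise
    prediction toward the text-conditional one.

    denoiser(x, t, cond) -> predicted noise epsilon (assumed interface).
    """
    eps_uncond = denoiser(x_t, t, null_emb)  # conditioned on the empty prompt
    eps_cond = denoiser(x_t, t, text_emb)    # conditioned on the actual prompt
    # guidance_scale = 1.0 recovers plain conditional sampling;
    # larger values trade sample diversity for text-image alignment.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```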

Latent space models, such as Stable Diffusion and DALL-E 2, represent another category. Stable Diffusion, an extension of the Latent Diffusion Model (LDM), employs a VQGAN-style autoencoder to compress images into a lower-dimensional latent space, substantially reducing computational cost while preserving fine detail. DALL-E 2 builds on CLIP's shared text-image embedding space, using a prior that maps text embeddings to image embeddings and a decoder that renders them, which highlights the advantages of learning a text-image latent prior.
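As a concrete illustration of the latent-space pipeline (text encoding, latent denoising, VAE decoding), a few lines with Hugging Face's diffusers library suffice; the model ID and parameter values here are examples rather than anything prescribed by the survey.

```python
import torch
from diffusers import StableDiffusionPipeline

# Load a pretrained latent diffusion pipeline (model ID is an example).
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Denoising happens in the autoencoder's compressed latent space
# (e.g., 64x64x4 latents for a 512x512 image); the VAE decoder then
# maps the final latent back to pixels.
image = pipe(
    "a watercolor painting of a fox in a snowy forest",
    num_inference_steps=50,
    guidance_scale=7.5,  # classifier-free guidance strength
).images[0]
image.save("fox.png")
```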

Building on these pioneering works, the paper reviews advances in architecture and methodology, covering multimodal guidance, specialized denoisers, sketch-assisted spatial control, and concept control via textual inversion. The need for retrieval mechanisms to handle out-of-distribution cases is also examined.
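Textual inversion is perhaps the easiest of these to state compactly: the diffusion model stays frozen, and only a new token embedding is optimized with the ordinary denoising loss on a handful of concept images. The sketch below assumes placeholder interfaces (`vae`, `unet`, `text_encoder`, `add_noise`, `concept_dataloader`) rather than any particular codebase.

```python
import torch

# New learnable embedding for a pseudo-token such as "<my-concept>";
# everything else (text encoder, U-Net, VAE) remains frozen.
concept_emb = torch.randn(1, 768, requires_grad=True)  # 768 = example dim
optimizer = torch.optim.AdamW([concept_emb], lr=5e-4)
num_train_steps = 1000  # length of the noise schedule (example value)

for image, tokens in concept_dataloader:              # a few exemplar images
    latents = vae.encode(image)                       # assumed interface
    noise = torch.randn_like(latents)
    t = torch.randint(0, num_train_steps, (latents.shape[0],))
    noisy = add_noise(latents, noise, t)              # samples q(x_t | x_0)

    # Splice the learned embedding into the prompt at the pseudo-token slot
    # (the `override` keyword is a hypothetical hook, not a real API).
    cond = text_encoder(tokens, override={"<my-concept>": concept_emb})

    loss = torch.nn.functional.mse_loss(unet(noisy, t, cond), noise)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```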

Evaluations of text-to-image methods consider both technical benchmarks and ethical considerations. Fréchet Inception Distance (FID), CLIP score, and related metrics facilitate technical evaluation, while recent benchmarks emphasize fidelity and text-image alignment. Ethical issues range from dataset biases to the potential for misuse and privacy risks, necessitating continual monitoring and adjustment of models.
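On the technical side, both headline metrics are available off the shelf; the snippet below sketches them with the torchmetrics library (the class names are torchmetrics', and the dummy tensors stand in for real image batches).

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.multimodal.clip_score import CLIPScore

# Dummy data for illustration; in practice, load real and generated images
# as uint8 tensors of shape (N, 3, H, W).
real_images = torch.randint(0, 255, (8, 3, 299, 299), dtype=torch.uint8)
generated_images = torch.randint(0, 255, (8, 3, 299, 299), dtype=torch.uint8)
prompts = ["a photo of a cat"] * 8

# FID: distance between Inception feature statistics of real vs. generated
# images. Lower is better.
fid = FrechetInceptionDistance(feature=2048)
fid.update(real_images, real=True)
fid.update(generated_images, real=False)
print("FID:", fid.compute().item())

# CLIP score: scaled cosine similarity between CLIP image and text
# embeddings. Higher means better text-image alignment.
clip_score = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")
print("CLIPScore:", clip_score(generated_images, prompts).item())
```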

Practical applications beyond text-to-image generation, such as visual art generation, text-to-video synthesis, and 3D content creation, underscore the broad utility of diffusion models. Text-guided image editing, which benefits from diffusion models’ inversion properties, represents another significant application area.
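The inversion property referred to here is what makes editing tractable: because the DDIM sampler is deterministic, its update can be run in reverse to recover a noise latent that reconstructs the input image, after which sampling can be re-run under an edited prompt. A schematic sketch follows, with `eps_model` and the schedule variables as assumed interfaces.

```python
import torch

@torch.no_grad()
def ddim_invert(eps_model, x0, text_emb, alphas_cumprod, num_steps):
    """Run the deterministic DDIM update in reverse (t -> t+1) to map a
    clean image x0 to a noise latent x_T that reconstructs it.
    """
    x = x0
    timesteps = torch.linspace(0, len(alphas_cumprod) - 1, num_steps).long()
    for t_cur, t_next in zip(timesteps[:-1], timesteps[1:]):
        a_cur, a_next = alphas_cumprod[t_cur], alphas_cumprod[t_next]
        eps = eps_model(x, t_cur, text_emb)  # assumed interface
        # Predict x0 from the current noisy sample, then step forward
        # along the deterministic DDIM trajectory toward higher noise.
        x0_pred = (x - (1 - a_cur).sqrt() * eps) / a_cur.sqrt()
        x = a_next.sqrt() * x0_pred + (1 - a_next).sqrt() * eps
    return x  # approximately x_T; edit the prompt and re-run sampling
```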

In contemplating future directions, the authors emphasize the need for diverse evaluation methods, unified multi-modality frameworks, and collaboration across research fields. Addressing these avenues promises to extend the capabilities of generative models and cement their role in advancing computational creativity.

Overall, this paper offers a thorough map of the text-to-image diffusion landscape, providing valuable insights and serving as a foundational resource for researchers working in generative AI.

Authors (5)
  1. Chenshuang Zhang (16 papers)
  2. Chaoning Zhang (66 papers)
  3. Mengchun Zhang (9 papers)
  4. In So Kweon (156 papers)
  5. Junmo Kim (90 papers)
Citations (212)