Overview of "Controllable Generation with Text-to-Image Diffusion Models: A Survey"
The paper "Controllable Generation with Text-to-Image Diffusion Models: A Survey" addresses the evolving landscape of text-guided visual generation through diffusion models. Recognizing the limitations of relying solely on text conditions, the authors present a comprehensive review of literature focusing on control mechanisms that accommodate novel conditions beyond text prompts.
Theoretical Foundations
The survey begins with an introduction to denoising diffusion probabilistic models (DDPMs) and their foundational role in generating high-quality images from noise. The text-to-image diffusion models discussed include GLIDE, Imagen, DALL·E 2, Latent Diffusion Models (LDM), and Stable Diffusion, each characterized by a distinct architecture and training dataset. These models serve as the backbones on which the surveyed condition-control mechanisms are built.
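As a brief refresher (standard DDPM notation, not taken from the survey's own exposition), the forward process gradually corrupts an image x_0 with Gaussian noise, and a network ε_θ is trained to undo it; text-to-image models additionally condition ε_θ on a text embedding c:

```latex
% Forward (noising) process, with \bar{\alpha}_t = \prod_{s=1}^{t} (1 - \beta_s):
q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t \mathbf{I}\big), \qquad
q(x_t \mid x_0) = \mathcal{N}\big(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1-\bar{\alpha}_t)\mathbf{I}\big)

% Simplified training objective; the noise predictor is conditioned on the text embedding c:
\mathcal{L}_{\text{simple}} = \mathbb{E}_{x_0,\ \epsilon \sim \mathcal{N}(0,\mathbf{I}),\ t}
\Big[ \big\| \epsilon - \epsilon_\theta\big(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon,\ t,\ c\big) \big\|^2 \Big]
```

Latent Diffusion Models and Stable Diffusion apply the same objective in the latent space of a pretrained autoencoder rather than directly in pixel space.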
Mechanisms for Controllable Generation
Controlling text-to-image models with novel conditions is a central theme, and the authors outline two primary mechanisms: conditional score prediction and condition-guided score estimation.
- Conditional Score Prediction incorporates the novel condition directly into the generative model's score (noise) prediction, through model-based, tuning-based, or training-free approaches. In all three, the denoising network itself receives the condition, so the predicted score already steers sampling toward it.
- Condition-Guided Score Estimation leverages additional models to estimate condition-related gradients from intermediate latent features and adds them to the score during sampling, steering generation toward the condition without retraining the diffusion backbone, in the spirit of classifier guidance rather than classifier-free guidance (CFG); see the sketch after this list.
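A minimal PyTorch sketch of the distinction, assuming a generic noise predictor `eps_model` and an auxiliary scoring network `condition_energy` (both hypothetical placeholders, not components named in the survey):

```python
import torch

@torch.no_grad()
def conditional_score_prediction(eps_model, x_t, t, text_emb, cond_emb):
    """Mechanism 1: the novel condition is an input of the denoiser itself,
    so the predicted noise/score is already condition-aware."""
    return eps_model(x_t, t, text_emb, cond_emb)

def condition_guided_score_estimation(eps_model, condition_energy, x_t, t,
                                      text_emb, target, scale=1.0):
    """Mechanism 2: an auxiliary model scores how well the noisy latent matches
    the target condition; its gradient w.r.t. x_t nudges sampling toward the
    condition without touching eps_model (classifier-guidance style)."""
    with torch.no_grad():
        eps = eps_model(x_t, t, text_emb)            # plain text-conditional prediction
    x_t = x_t.detach().requires_grad_(True)
    match = condition_energy(x_t, t, target)         # higher = better match to the condition
    grad = torch.autograd.grad(match.sum(), x_t)[0]  # ~ gradient of log p(condition | x_t)
    return eps - scale * grad                        # guided prediction handed to the sampler
```

Here `scale` folds in the usual √(1−ᾱ_t) factor from classifier guidance; the essential contrast is that the first mechanism changes what the denoiser predicts, while the second only post-processes its prediction.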
Categorization of Conditional Generation Methods
The survey categorizes approaches into specific applications based on novel conditions:
- Personalization: Covers subject-, person-, style-, interaction-, image-, and distribution-driven generation. By tuning model parameters or using model-based score prediction, these methods produce outputs that reflect a unique subject or style taken from reference images (a loading example follows this list).
- Spatial Control: Explores methods that use spatial conditions such as layouts and masks for structure-driven generation, employing both conditional score prediction and condition-guided score estimation (see the ControlNet-style sketch after this list).
- Advanced Text-Conditioned Generation: Tackles challenges of textual alignment and multilingual generation by refining attention mechanisms or integrating multilingual models.
- In-Context and Brain/Sound-Guided Generation: Extends conditioning beyond explicit textual and visual cues, drawing on in-context examples and brain or sound signals to drive generation.
- Text Rendering: Enhances the capability of models to generate visually coherent text within images, leveraging text encoders and training adjustments.
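As a concrete illustration of tuning-based personalization at inference time (not the survey's own code), a concept embedding learned by textual inversion can be loaded into a Stable Diffusion pipeline with Hugging Face diffusers; the checkpoint and concept IDs below are illustrative:

```python
import torch
from diffusers import StableDiffusionPipeline

# Base text-to-image pipeline (checkpoint ID is illustrative).
pipe = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Load a learned concept embedding that binds a new placeholder token
# (here assumed to be <cat-toy>) to a subject taken from reference images.
pipe.load_textual_inversion("sd-concepts-library/cat-toy")

# The placeholder token now personalizes generation toward that subject.
image = pipe("a <cat-toy> floating in space, digital art").images[0]
image.save("personalized.png")
```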
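For structure-driven spatial control, a widely used model-based form of conditional score prediction is ControlNet; a minimal diffusers sketch (checkpoint IDs and the input path are illustrative) could look like this:

```python
import torch
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
from diffusers.utils import load_image

# Auxiliary network that injects a spatial condition (here, Canny edges)
# into the denoiser's score prediction; IDs are illustrative.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# The edge map fixes where structures appear, while the prompt still
# controls content and style.
edge_map = load_image("edges.png")  # placeholder path to a precomputed edge map
image = pipe("a futuristic living room", image=edge_map,
             num_inference_steps=30).images[0]
image.save("spatially_controlled.png")
```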
Frameworks for Multiple Conditions and Universal Control
For scenarios requiring the integration of multiple conditions, the paper reviews methods based on joint training, continual learning, weight fusion, attention-based integration, and guidance composition. The authors also discuss frameworks for universal control that accommodate arbitrary novel conditions through generalizable conditional score prediction or condition-guided score estimation.
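As one concrete instance of guidance composition, several conditions can be combined by summing their classifier-free-guidance offsets around a shared unconditional prediction; the sketch below assumes a generic `eps_model` and embedding placeholders rather than any specific framework from the survey:

```python
import torch

@torch.no_grad()
def composed_guidance(eps_model, x_t, t, uncond_emb, cond_embs, weights):
    """Guidance composition: each condition contributes a weighted offset
    (its conditional prediction minus the unconditional one), and the offsets
    are summed into a single guided score used by the sampler."""
    eps_uncond = eps_model(x_t, t, uncond_emb)
    eps = eps_uncond.clone()
    for cond_emb, w in zip(cond_embs, weights):
        eps = eps + w * (eps_model(x_t, t, cond_emb) - eps_uncond)
    return eps
```

This composition happens purely at sampling time, whereas weight fusion and attention-based integration merge the models' parameters or attention features before sampling begins.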
Implications and Future Directions
This survey underscores the transformative potential of controllable generation techniques, not only enhancing image synthesis capabilities but also broadening applications in personalization, image manipulation, and 3D reconstruction. The exploration suggests a future where adaptable and responsive generative models can seamlessly align with multifaceted user requirements across diverse domains.
Overall, the paper provides a structured and detailed examination of the state-of-the-art in controllable text-to-image generation, offering valuable insights and a foundation for future advancements in artificial intelligence-driven content creation.