
Text to Image Generation and Editing: A Survey (2505.02527v1)

Published 5 May 2025 in cs.CV

Abstract: Text-to-image generation (T2I) refers to the text-guided generation of high-quality images. In the past few years, T2I has attracted widespread attention and numerous works have emerged. In this survey, we comprehensively review 141 works conducted from 2021 to 2024. First, we introduce the four foundation model architectures of T2I (autoregressive, non-autoregressive, GAN, and diffusion) and the commonly used key technologies (autoencoders, attention, and classifier-free guidance). Second, we systematically compare these methods along two directions, T2I generation and T2I editing, including the encoders and key technologies they use. We also compare the performance of these studies side by side in terms of datasets, evaluation metrics, training resources, and inference speed. Beyond the four foundation models, we survey other T2I works, such as energy-based models and recent Mamba-based and multimodal approaches. We further investigate the potential social impact of T2I and propose some solutions. Finally, we offer unique insights for improving the performance of T2I models and discuss possible future development directions. In summary, this survey is the first systematic and comprehensive overview of T2I, aiming to provide a valuable guide for future researchers and to stimulate continued progress in this field.

Summary

  • The paper provides a comprehensive survey analyzing the four core T2I architectures: autoregressive, non-autoregressive, GAN, and diffusion models.
  • It details key technologies like autoencoders, attention mechanisms, and classifier-free guidance that enhance the mapping of text to high-quality images.
  • The survey addresses social impacts and ethical challenges, recommending dataset filtering and model transparency to mitigate bias in image generation.

Survey on Text to Image Generation and Editing

This essay provides a detailed analysis of the paper, "Text to Image Generation and Editing: A Survey" (2505.02527), which offers a comprehensive review of recent advancements in text-to-image (T2I) generation technologies. The paper systematically classifies existing models, assesses their performance across various dimensions, and discusses potential future developments in the field.

Foundation Models for T2I

The paper identifies four core architectures foundational to T2I generation: autoregressive models, non-autoregressive models, generative adversarial networks (GANs), and diffusion models. Each architecture maps descriptive text inputs to visual outputs through a distinct mechanism. Autoregressive and non-autoregressive models exploit sequence modeling, with significant advances arising from transformer architectures. GANs, historically dominant in image synthesis, continue to evolve to address mode collapse and training instability. Diffusion models, which progressively corrupt data with noise and learn to reverse that corruption, have gained prominence for their robustness in generating high-quality, diverse images.
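To make the diffusion mechanism concrete, the sketch below implements the closed-form forward (noising) process used in DDPM-style training. The step count, beta schedule, and placeholder image are illustrative assumptions, and the denoising network itself is left abstract; this is a minimal sketch, not the implementation of any specific surveyed model.

```python
import numpy as np

# Assumed hyperparameters (illustrative): 1000 steps, linear beta schedule.
T = 1000
betas = np.linspace(1e-4, 0.02, T)        # per-step noise variances
alphas_bar = np.cumprod(1.0 - betas)      # cumulative signal retention

def forward_noise(x0, t, rng=np.random.default_rng(0)):
    """Sample x_t ~ q(x_t | x_0) in closed form:
    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps
    return xt, eps

# A denoiser eps_theta(x_t, t, text) would be trained to predict eps from x_t;
# generation then reverses the chain, starting from pure Gaussian noise.
x0 = np.zeros((3, 64, 64))                # placeholder "image"
xt, eps = forward_noise(x0, t=500)
```

Text conditioning enters through the denoiser, which is where the encoders and guidance techniques discussed next come into play.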

Key Technologies

Key technologies pivotal to these models include autoencoders, attention mechanisms, and classifier-free guidance. These technologies underpin the ability to effectively encode textual information, model dependencies in sequential data, and refine image generation to align closely with descriptive prompts.
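As a concrete illustration of classifier-free guidance, the sketch below shows how a single guided prediction is formed at each sampling step. The denoiser, embeddings, and default guidance scale are hypothetical placeholders, not any specific paper's implementation.

```python
def cfg_predict(denoiser, x_t, t, text_emb, null_emb, guidance_scale=7.5):
    """Classifier-free guidance: run the (hypothetical) denoiser twice,
    once with the prompt embedding and once with a learned 'null' embedding,
    then extrapolate from the unconditional toward the conditional output.

    guidance_scale = 1.0 recovers the plain conditional prediction;
    larger values trade sample diversity for tighter prompt alignment.
    """
    eps_cond = denoiser(x_t, t, text_emb)      # text-conditional noise estimate
    eps_uncond = denoiser(x_t, t, null_emb)    # unconditional noise estimate
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```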

Comparative Analysis of T2I Methods

The paper undertakes a systematic comparison of T2I generation and editing methods. By evaluating methods based on architectures, encoder usage, and key technologies, it highlights performance differences in terms of dataset usage, evaluation metrics, and resource efficiency. This comparative analysis extends to novel methods including energy-based models and multi-modal approaches that integrate text inputs with other data modalities.

Performance and Evaluation

The survey evaluates each model's performance across several benchmarks, focusing on fidelity, diversity, and alignment with textual descriptions. It discusses metrics such as Fréchet Inception Distance (FID), Inception Score (IS), and human evaluation. Together, these measures assess not only the visual quality of generated images but also their semantic and aesthetic relevance to the input text.
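For reference, FID is computed by fitting Gaussians to Inception-v3 features of real and generated images and measuring their Fréchet distance. The sketch below assumes the feature means and covariances have already been extracted; it is a minimal rendering of the standard formula, not the survey's own tooling.

```python
import numpy as np
from scipy import linalg

def fid(mu_real, cov_real, mu_gen, cov_gen):
    """Frechet Inception Distance between two Gaussians:
    FID = ||mu_r - mu_g||^2 + Tr(C_r + C_g - 2 (C_r C_g)^{1/2})."""
    covmean = linalg.sqrtm(cov_real @ cov_gen)
    if np.iscomplexobj(covmean):
        covmean = covmean.real     # discard tiny imaginary parts from numerics
    diff = mu_real - mu_gen
    return float(diff @ diff + np.trace(cov_real + cov_gen - 2.0 * covmean))
```

Lower FID indicates generated-image statistics closer to those of real images; note that neither FID nor IS measures text-image alignment directly, which motivates the complementary human evaluation the survey discusses.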

Social Impacts and Ethical Considerations

A significant portion of the survey addresses the social implications of T2I models. The paper highlights concerns surrounding ethical use, including biases inherent in large datasets and the risk of generating harmful content. It proposes solutions such as dataset filtering and the development of transparent models that mitigate these biases.

Future Directions

Future advancements are expected in several areas:

  • Model Scaling: The continued expansion of model scale in terms of parameters and training data to improve fidelity and diversity.
  • Prompt Optimization: Enhanced techniques for refining input prompts to better guide image synthesis.
  • Hybrid Models: Exploration of hybrid architectures combining the strengths of different foundational models like GANs and diffusion models.
  • Video Generation: Transitioning from static image generation to dynamic video synthesis, leveraging similar text-based guidance.

Conclusion

The survey conducted by Yang et al. provides a critical resource for understanding the current state and future trajectory of T2I research. It synthesizes extensive research findings and offers a roadmap for ongoing developments in AI-driven image generation and editing technologies. This work supports future innovations by elucidating the technical intricacies and societal considerations associated with T2I systems.
