Evaluating Text to Image Synthesis: A Survey and Taxonomy of Image Quality Metrics
Introduction
The field of text-conditioned image generation has advanced rapidly, enabled by models that integrate language and vision through training on large-scale datasets. This progress has increased the demand for generated images that are both high in quality and coherently aligned with their textual prompts. In response, novel evaluation metrics have been developed that aim to mimic human judgments of image quality and text-image alignment. In this work, a comprehensive survey of existing text-to-image (T2I) evaluation metrics is presented alongside a proposed taxonomy for categorizing these metrics. Additionally, the paper explores promising approaches for optimizing T2I synthesis and discusses the ongoing challenges and limitations of current evaluation frameworks.
Taxonomy Development
The core contribution of this work is a new taxonomy for T2I evaluation metrics. Before the emergence of diffusion-based image generation, evaluation focused predominantly on image quality measures such as the Inception Score (IS) and the Fréchet Inception Distance (FID). The proposed taxonomy addresses the need for a structured way to evaluate the more complex, compositional relationship between text and images. It divides metrics into two principal categories, pure image-based metrics and text-conditioned image quality metrics, each further subdivided according to whether the metric measures general image quality or compositional quality.
Image Metrics
- Distribution-based metrics: These metrics, including IS and FID, use statistical measures to compare the distributions of real and generated images, focusing solely on image quality without considering the text condition (see the FID sketch after this list).
- Single image metrics: Unlike distribution-based metrics, these assess the quality of individual images based on structural and semantic composition. Recent approaches utilize deep learning models that predict human judgments for aesthetic and visual quality.
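As a concrete illustration of the distribution-based family, the following minimal sketch computes FID from pre-extracted Inception-v3 activations. The feature-extraction step is omitted, and the function and variable names (`frechet_inception_distance`, `feats_real`, `feats_gen`) are illustrative assumptions, not definitions from the survey.

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_inception_distance(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    """FID between two sets of Inception features, given as (N, D) arrays."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    sigma_r = np.cov(feats_real, rowvar=False)
    sigma_g = np.cov(feats_gen, rowvar=False)

    # Matrix square root of the covariance product; numerical noise can
    # introduce tiny imaginary components, which are discarded.
    covmean = sqrtm(sigma_r @ sigma_g)
    if np.iscomplexobj(covmean):
        covmean = covmean.real

    # Squared distance between the means plus the covariance trace term.
    return float(np.sum((mu_r - mu_g) ** 2)
                 + np.trace(sigma_r + sigma_g - 2.0 * covmean))
```

Lower values indicate that the generated feature distribution lies closer to the real one; the score says nothing about whether any individual image matches its prompt.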
Text-Image Alignment Metrics
- Embedding-based metrics: Evaluate general text-conditioned image quality from learned embedding representations of the vision and language inputs, using models such as CLIP and BLIP to compute the cosine similarity between text and image embeddings (see the CLIP-based sketch after this list).
- Content-based metrics: Delve deeper into the qualitative aspects of generated images by examining compositional quality through content analysis, such as object accuracy, spatial relations, and attribute alignment.
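The following is a minimal sketch of an embedding-based alignment score in the spirit of CLIPScore, assuming the Hugging Face `transformers` CLIP implementation; the checkpoint and the helper name `clip_alignment_score` are illustrative choices rather than the survey's own definitions. The published CLIPScore additionally rescales the cosine similarity, which is omitted here.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_alignment_score(image_path: str, prompt: str) -> float:
    """Cosine similarity between the CLIP image and text embeddings."""
    image = Image.open(image_path)
    inputs = processor(text=[prompt], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])
    # Normalize so the dot product equals the cosine similarity.
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return float((img_emb * txt_emb).sum())
```

A single scalar of this kind captures overall alignment but cannot tell which part of the prompt (object, attribute, relation) was missed, which is the gap content-based metrics target.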
Evaluation Metrics Overview
The review highlights the evolution of metrics tailored to specific aspects of T2I synthesis. Embedding-based metrics leverage pre-trained models to assess alignment between text and image representations. In contrast, content-based metrics offer a more granular evaluation by decomposing the prompt into distinct components and measuring how well each is realized. Various approaches, such as visual question answering (VQA) models and object detection techniques, are employed to validate the compositionality between the textual descriptions and their visual counterparts.
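The sketch below illustrates the VQA-based flavor of content evaluation, in the style of question-generation metrics such as TIFA. It assumes the Hugging Face `transformers` visual-question-answering pipeline; the hand-written question set and the helper name `content_alignment_score` are hypothetical, since such metrics normally generate the questions automatically from the prompt.

```python
from PIL import Image
from transformers import pipeline

# Hypothetical decomposition of the prompt "a red bicycle leaning against a wall"
# into per-component yes/no questions (object, attribute, spatial relation).
QUESTIONS = {
    "object":    "Is there a bicycle in the image?",
    "attribute": "Is the bicycle red?",
    "relation":  "Is the bicycle leaning against a wall?",
}

vqa = pipeline("visual-question-answering",
               model="dandelin/vilt-b32-finetuned-vqa")

def content_alignment_score(image_path: str) -> float:
    """Fraction of prompt-derived questions answered 'yes' by the VQA model."""
    image = Image.open(image_path)
    hits = 0
    for question in QUESTIONS.values():
        answer = vqa(image=image, question=question)[0]["answer"]
        hits += answer.strip().lower() == "yes"
    return hits / len(QUESTIONS)
```

Because each question targets one component, the per-question answers double as component-level diagnostics rather than a single aggregate score.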
Optimization Approaches
In discussing optimization methods for T2I synthesis, the paper emphasizes the value of incorporating human judgments into the modeling process. Techniques such as fine-tuning generators on high-quality samples selected by reward models, or applying reinforcement learning against such reward signals, show potential to enhance text-image alignment and bring generated images closer to human preferences.
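A minimal sketch of the first strategy, reward-guided sample selection for fine-tuning, is given below. The generator and reward-model interfaces (`generator.sample`, `reward_model.score`) are illustrative assumptions, not APIs from the survey; any alignment metric that approximates human preference could stand in for the reward model.

```python
import torch

def collect_finetuning_set(generator, reward_model, prompts,
                           n_candidates: int = 8, keep_top: int = 1):
    """Keep the highest-reward candidates per prompt for supervised fine-tuning.

    Assumed interfaces (hypothetical):
      generator.sample(prompt, n)      -> list of n candidate images
      reward_model.score(prompt, img)  -> scalar approximating human preference
    """
    selected = []
    for prompt in prompts:
        candidates = generator.sample(prompt, n_candidates)
        scores = torch.tensor([reward_model.score(prompt, img)
                               for img in candidates])
        # Best-of-n selection: retain only the top-scoring candidates.
        best = scores.topk(keep_top).indices.tolist()
        selected.extend((prompt, candidates[i]) for i in best)
    return selected  # fine-tune the generator on these (prompt, image) pairs
```

The reinforcement-learning variant instead backpropagates the reward signal into the generator directly, rather than filtering a static training set.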
Challenges and Future Directions
One of the key challenges addressed is the development of evaluation frameworks that can account for the intricate and diverse aspects of image quality in relation to the text. The need for evaluation metrics that can offer detailed component-level insights and the importance of constructing more comprehensive and complex benchmark datasets are underscored. Additionally, the adaptation of existing models and metrics to understand and assess visio-linguistic compositionality more effectively is discussed as an avenue for future research.
Conclusion
Through the establishment of a new taxonomy for T2I evaluation metrics and the scrutiny of existing metrics and optimization approaches, this work lays a foundation for future advances in T2I synthesis evaluation. By addressing current limitations and proposing directions for future research, the paper contributes to the evolving landscape of generative AI, pushing towards models that generate images that are not only of high quality but also compositionally aligned with their textual descriptions.