Evaluating Text to Image Synthesis: A Survey and Taxonomy of Image Quality Metrics
Introduction
The field of text-conditioned image generation has advanced rapidly, enabled by models that integrate language and vision through training on large-scale datasets. This progress has increased the demand for generated images that are both high in quality and coherently aligned with their textual prompts. In response, novel evaluation metrics have been developed that aim to mimic human judgments of image quality and text-image alignment. In this work, a comprehensive survey of existing text-to-image (T2I) evaluation metrics is presented alongside a proposed taxonomy for categorizing these metrics. Additionally, the paper explores promising approaches for optimizing T2I synthesis and discusses the ongoing challenges and limitations of current evaluation frameworks.
Taxonomy Development
The core contribution of this work is a new taxonomy for T2I evaluation metrics. Before the emergence of diffusion-based image generation, evaluation focused predominantly on image quality measures such as the Inception Score (IS) and the Fréchet Inception Distance (FID). The proposed taxonomy addresses the need for a structured way to evaluate the more complex, compositional relationship between text and images. It divides metrics into two principal categories, pure image-based metrics and text-conditioned image quality metrics, each further subdivided according to whether the metric measures general image quality or compositional quality.
Image Metrics
- Distribution-based metrics: These metrics, including IS and FID, use statistical measures to compare the distributions of real and generated images, focusing solely on image quality without considering the text condition (see the FID sketch after this list).
- Single image metrics: Unlike distribution-based metrics, these assess the quality of individual images based on structural and semantic composition. Recent approaches utilize deep learning models that predict human judgments for aesthetic and visual quality.
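As a concrete illustration of the distribution-based family, the following minimal sketch computes FID from pre-extracted Inception-v3 activations. The feature-extraction step is omitted, and the function and variable names (`frechet_inception_distance`, `feats_real`, `feats_gen`) are illustrative assumptions, not definitions from the survey.

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_inception_distance(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    """FID between two sets of Inception features, given as (N, D) arrays."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    sigma_r = np.cov(feats_real, rowvar=False)
    sigma_g = np.cov(feats_gen, rowvar=False)

    # Matrix square root of the covariance product; numerical noise can
    # introduce tiny imaginary components, which are discarded.
    covmean = sqrtm(sigma_r @ sigma_g)
    if np.iscomplexobj(covmean):
        covmean = covmean.real

    # Squared distance between the means plus the covariance trace term.
    return float(np.sum((mu_r - mu_g) ** 2)
                 + np.trace(sigma_r + sigma_g - 2.0 * covmean))
```

Lower values indicate that the generated feature distribution lies closer to the real one; the score says nothing about whether any individual image matches its prompt.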
Text-Image Alignment Metrics
- Embedding-based metrics: Evaluate general text-conditioned image quality from learned embedding representations of the vision and language inputs, using models such as CLIP and BLIP to compute the cosine similarity between text and image embeddings (see the CLIP-based sketch after this list).
- Content-based metrics: Delve deeper into the qualitative aspects of generated images by examining compositional quality through content analysis, such as object accuracy, spatial relations, and attribute alignment.
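The following is a minimal sketch of an embedding-based alignment score in the spirit of CLIPScore, assuming the Hugging Face `transformers` CLIP implementation; the checkpoint and the helper name `clip_alignment_score` are illustrative choices rather than the survey's own definitions. The published CLIPScore additionally rescales the cosine similarity, which is omitted here.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_alignment_score(image_path: str, prompt: str) -> float:
    """Cosine similarity between the CLIP image and text embeddings."""
    image = Image.open(image_path)
    inputs = processor(text=[prompt], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])
    # Normalize so the dot product equals the cosine similarity.
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return float((img_emb * txt_emb).sum())
```

A single scalar of this kind captures overall alignment but cannot tell which part of the prompt (object, attribute, relation) was missed, which is the gap content-based metrics target.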
Evaluation Metrics Overview
The review highlights the evolution of metrics tailored to specific aspects of T2I synthesis. Embedding-based metrics leverage pre-trained models to assess alignment between text and image representations. In contrast, content-based metrics offer a more granular evaluation by decomposing the prompt into distinct components and measuring how well each is realized. Various approaches, such as visual question answering (VQA) models and object detection techniques, are employed to validate the compositionality between the textual descriptions and their visual counterparts.
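The sketch below illustrates the VQA-based flavor of content evaluation, in the style of question-generation metrics such as TIFA. It assumes the Hugging Face `transformers` visual-question-answering pipeline; the hand-written question set and the helper name `content_alignment_score` are hypothetical, since such metrics normally generate the questions automatically from the prompt.

```python
from PIL import Image
from transformers import pipeline

# Hypothetical decomposition of the prompt "a red bicycle leaning against a wall"
# into per-component yes/no questions (object, attribute, spatial relation).
QUESTIONS = {
    "object":    "Is there a bicycle in the image?",
    "attribute": "Is the bicycle red?",
    "relation":  "Is the bicycle leaning against a wall?",
}

vqa = pipeline("visual-question-answering",
               model="dandelin/vilt-b32-finetuned-vqa")

def content_alignment_score(image_path: str) -> float:
    """Fraction of prompt-derived questions answered 'yes' by the VQA model."""
    image = Image.open(image_path)
    hits = 0
    for question in QUESTIONS.values():
        answer = vqa(image=image, question=question)[0]["answer"]
        hits += answer.strip().lower() == "yes"
    return hits / len(QUESTIONS)
```

Because each question targets one component, the per-question answers double as component-level diagnostics rather than a single aggregate score.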
Optimization Approaches
In discussing optimization methods for T2I synthesis, the paper emphasizes the value of incorporating human judgments into the modeling process. Techniques such as fine-tuning generators on high-quality samples selected by reward models, or applying reinforcement learning against such reward signals, show potential to enhance text-image alignment and bring generated images closer to human preferences.
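A minimal sketch of the first strategy, reward-guided sample selection for fine-tuning, is given below. The generator and reward-model interfaces (`generator.sample`, `reward_model.score`) are illustrative assumptions, not APIs from the survey; any alignment metric that approximates human preference could stand in for the reward model.

```python
import torch

def collect_finetuning_set(generator, reward_model, prompts,
                           n_candidates: int = 8, keep_top: int = 1):
    """Keep the highest-reward candidates per prompt for supervised fine-tuning.

    Assumed interfaces (hypothetical):
      generator.sample(prompt, n)      -> list of n candidate images
      reward_model.score(prompt, img)  -> scalar approximating human preference
    """
    selected = []
    for prompt in prompts:
        candidates = generator.sample(prompt, n_candidates)
        scores = torch.tensor([reward_model.score(prompt, img)
                               for img in candidates])
        # Best-of-n selection: retain only the top-scoring candidates.
        best = scores.topk(keep_top).indices.tolist()
        selected.extend((prompt, candidates[i]) for i in best)
    return selected  # fine-tune the generator on these (prompt, image) pairs
```

The reinforcement-learning variant instead backpropagates the reward signal into the generator directly, rather than filtering a static training set.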
Challenges and Future Directions
One of the key challenges addressed is the development of evaluation frameworks that can account for the intricate and diverse aspects of image quality in relation to the text. The need for evaluation metrics that can offer detailed component-level insights and the importance of constructing more comprehensive and complex benchmark datasets are underscored. Additionally, the adaptation of existing models and metrics to understand and assess visio-linguistic compositionality more effectively is discussed as an avenue for future research.
Conclusion
Through the establishment of a new taxonomy for T2I evaluation metrics and the scrutiny of existing metrics and optimization approaches, this work lays a foundation for future advances in T2I synthesis evaluation. By addressing current limitations and proposing directions for future research, the paper contributes to the evolving landscape of generative AI, pushing towards models that generate images that are not only of high quality but also compositionally aligned with their textual descriptions.