Who Evaluates the Evaluations? Objectively Scoring Text-to-Image Prompt Coherence Metrics with T2IScoreScore (TS2) (2404.04251v3)

Published 5 Apr 2024 in cs.CV, cs.AI, and cs.CL

Abstract: With advances in the quality of text-to-image (T2I) models has come interest in benchmarking their prompt faithfulness -- the semantic coherence of generated images to the prompts they were conditioned on. A variety of T2I faithfulness metrics have been proposed, leveraging advances in cross-modal embeddings and vision-language models (VLMs). However, these metrics are not rigorously compared and benchmarked, instead presented with correlation to human Likert scores over a set of easy-to-discriminate images against seemingly weak baselines. We introduce T2IScoreScore, a curated set of semantic error graphs containing a prompt and a set of increasingly erroneous images. These allow us to rigorously judge whether a given prompt faithfulness metric can correctly order images with respect to their objective error count and significantly discriminate between different error nodes, using meta-metric scores derived from established statistical tests. Surprisingly, we find that the state-of-the-art VLM-based metrics (e.g., TIFA, DSG, LLMScore, VIEScore) we tested fail to significantly outperform simple (and supposedly worse) feature-based metrics like CLIPScore, particularly on a hard subset of naturally-occurring T2I model errors. TS2 will enable the development of better T2I prompt faithfulness metrics through more rigorous comparison of their conformity to expected orderings and separations under objective criteria.

Who Evaluates the Evaluations? A Benchmark for Text-to-Image Prompt Coherence Metrics

Introduction

The landscape of text-to-image (T2I) models has witnessed rapid advancements, propelling the fidelity and semantic coherence of generated images to unprecedented levels. Despite this progress, aligning generated images with their text prompts, a cornerstone of evaluating T2I model performance, remains a persistent challenge. The heterogeneity among the automated prompt faithfulness metrics proposed to measure this alignment underscores the need for a standardized benchmark. This paper introduces T2IScoreScore (TS2), a meticulously curated set of semantic error graphs (SEGs) and corresponding meta-metrics that aim to objectively assess the efficacy of T2I prompt faithfulness metrics.

Related Work

A broad survey of existing benchmarks reveals a disjointed landscape in which each metric employs a distinct evaluation methodology, often designed to highlight its own strengths. Ad-hoc tests against prior baselines are common, but they fall short of offering a consistent or objective comparison framework. Our investigation highlights the absence of benchmarks that rigorously compare T2I prompt coherence metrics against clearly defined, objective errors rather than against correlation with subjective human judgments.

The Dataset

T2IScoreScore (TS2) distinguishes itself through a unique structure that emphasizes a high image-to-prompt ratio. This design enables the construction of semantic error graphs (SEGs), in which images are organized by increasing deviation from the original prompt. The dataset comprises 165 SEGs, spanning synthetic errors as well as natural misinterpretations produced by real T2I models, setting the stage for comprehensive metric evaluations.
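
To make the SEG structure concrete, below is a minimal sketch of how one graph might be represented in code. The class and field names are illustrative assumptions, not the dataset's actual schema; the point is simply that each prompt carries nodes of images grouped by their objective error count.

```python
from dataclasses import dataclass, field

@dataclass
class ErrorNode:
    """One node in a semantic error graph: images sharing the same error count."""
    error_count: int                 # objective number of semantic errors w.r.t. the prompt
    image_paths: list[str] = field(default_factory=list)

@dataclass
class SemanticErrorGraph:
    """A prompt plus images grouped by increasing deviation from that prompt."""
    prompt: str
    nodes: list[ErrorNode] = field(default_factory=list)

# Hypothetical example: one faithful node and two increasingly erroneous nodes.
seg = SemanticErrorGraph(
    prompt="a red cube on top of a blue sphere",
    nodes=[
        ErrorNode(0, ["img_00.png", "img_01.png"]),   # fully faithful images
        ErrorNode(1, ["img_10.png"]),                 # one error, e.g. wrong color
        ErrorNode(2, ["img_20.png", "img_21.png"]),   # two errors
    ],
)
```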

Meta-Metrics

The cornerstone of our evaluation framework lies in two meta-metrics: an ordering score (ranking correctness) and a separation score. The ordering score leverages Spearman's rank correlation to assess a metric's ability to correctly order images by their semantic deviation from the prompt, while the separation score employs the two-sample Kolmogorov–Smirnov statistic to evaluate a metric's capability to distinguish between sets of images reflecting distinct semantic errors. Together, these meta-metrics provide a robust measure of a T2I prompt faithfulness metric's performance.
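
A minimal sketch of how these two meta-metrics can be computed for a single SEG, using SciPy's standard implementations of Spearman's rho and the two-sample Kolmogorov–Smirnov test. The aggregation choices here (negating scores, averaging KS statistics over node pairs) are assumptions for illustration rather than the paper's exact formulation.

```python
from itertools import combinations
from scipy.stats import spearmanr, ks_2samp

def ordering_score(error_counts, metric_scores):
    """Spearman correlation between objective error counts and (negated) metric scores.

    A good faithfulness metric assigns lower scores to images with more errors,
    so we correlate error counts against -score; +1 indicates a perfect ordering.
    """
    rho, _ = spearmanr(error_counts, [-s for s in metric_scores])
    return rho

def separation_score(scores_by_node):
    """Mean two-sample KS statistic over all pairs of error nodes.

    `scores_by_node` maps an error count to the metric scores of the images in
    that node; larger KS statistics mean better-separated score distributions.
    """
    stats = [ks_2samp(scores_by_node[a], scores_by_node[b]).statistic
             for a, b in combinations(sorted(scores_by_node), 2)]
    return sum(stats) / len(stats)

# Hypothetical scores for a single semantic error graph.
print(ordering_score([0, 0, 1, 2, 2], [0.91, 0.88, 0.70, 0.55, 0.60]))
print(separation_score({0: [0.91, 0.88], 1: [0.70, 0.72], 2: [0.55, 0.60]}))
```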

Experiments

Our experiments span a broad spectrum of T2I faithfulness metrics, evaluating each on the newly proposed TS2 benchmark. The paper presents a comparative analysis across metric classes, including embedding-based metrics like CLIPScore and newer vision-language model (VLM)-based metrics such as TIFA and DSG. The results reveal an intriguing finding: simpler feature-based metrics like CLIPScore remain competitive, especially on the challenging subset of naturally occurring errors. This observation suggests that feature-based metrics still provide a valuable baseline alongside more sophisticated VLM-based approaches.
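
For reference, the CLIPScore baseline is simply a rescaled, clipped cosine similarity between CLIP image and text embeddings: CLIPScore(image, text) = w * max(cos(E_img, E_txt), 0) with w = 2.5. The sketch below uses the Hugging Face CLIP implementation; the checkpoint and preprocessing choices are assumptions for illustration and may differ from the evaluation pipeline used in the paper.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, prompt: str, w: float = 2.5) -> float:
    """Cosine similarity between CLIP image and text embeddings, clipped at 0 and rescaled."""
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])
    cos = torch.nn.functional.cosine_similarity(img_emb, txt_emb).item()
    return w * max(cos, 0.0)

# Hypothetical usage: score one generated image against its prompt.
# score = clip_score(Image.open("img_00.png"), "a red cube on top of a blue sphere")
```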

Discussion and Conclusion

The comparative analysis offered by TS2 yields critical insights into the current state of T2I prompt coherence metric development. Notably, the strong showing of simpler metrics on complex, naturally occurring model errors points to a path forward for metric development focused not only on agreement with human judgment but also on objective semantic error identification. Our research emphasizes the need to bridge the gap between subjective preference and objective error-based evaluation, advocating a multifaceted approach to metric development. As the T2I field continues to evolve, TS2 stands as a pivotal benchmark, guiding the refinement of evaluation metrics toward more accurate, reliable, and semantically coherent image generation.

Acknowledgements and Impact Statement

This research highlights the indispensable role of precise evaluation tools like TS2 in advancing T2I technology. By providing an objective benchmark, TS2 enables a deeper understanding and refinement of prompt faithfulness metrics, ensuring their alignment with the semantic content of text prompts. This contributes to the development of more effective and semantically aware T2I models, bolstering the reliability of generated images across a wide array of applications.

References (58)
  1. Introducing our multimodal models (fuyu-8b), 2023. URL https://www.adept.ai/blog/fuyu-8b.
  2. V. W. Berger and Y. Zhou. Kolmogorov–Smirnov test: Overview. Wiley StatsRef: Statistics Reference Online, 2014.
  3. Easily Accessible Text-to-Image Generation Amplifies Demographic Stereotypes at Large Scale. In Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency, FAccT ’23, pages 1493–1504. Association for Computing Machinery. ISBN 9798400701924. doi: 10.1145/3593013.3594095. URL https://dl.acm.org/doi/10.1145/3593013.3594095.
  4. Microsoft coco captions: Data collection and evaluation server, 2015.
  5. Dall-eval: Probing the reasoning skills and social biases of text-to-image generation models, 2023.
  6. Davidsonian scene graph: Improving reliability in fine-grained evaluation for text-to-image generation, 2024.
  7. Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023.
  8. Deep generative image models using a laplacian pyramid of adversarial networks. Advances in neural information processing systems, 28, 2015.
  9. Training-free structured diffusion guidance for compositional text-to-image synthesis. arXiv preprint arXiv:2212.05032, 2022.
  10. Layoutgpt: Compositional visual planning and generation with large language models. Advances in Neural Information Processing Systems, 36, 2024.
  11. Clipscore: A reference-free evaluation metric for image captioning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7514–7528, 2021.
  12. Framing image description as a ranking task: Data, models and evaluation metrics. In M. Wooldridge and Q. Yang, editors, IJCAI 2015 - Proceedings of the 24th International Joint Conference on Artificial Intelligence, IJCAI International Joint Conference on Artificial Intelligence, pages 4188–4192. International Joint Conferences on Artificial Intelligence, 2015. 24th International Joint Conference on Artificial Intelligence, IJCAI 2015 ; Conference date: 25-07-2015 Through 31-07-2015.
  13. TIFA: Accurate and Interpretable Text-to-Image Faithfulness Evaluation with Question Answering. URL http://arxiv.org/abs/2303.11897.
  14. T2i-compbench: A comprehensive benchmark for open-world compositional text-to-image generation, 2023.
  15. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1125–1134, 2017.
  16. Scaling up visual and vision-language representation learning with noisy text supervision. In International conference on machine learning, pages 4904–4916. PMLR, 2021.
  17. Llms can’t plan, but can help planning in llm-modulo frameworks. arXiv preprint arXiv:2402.01817, 2024.
  18. Pick-a-pic: An open dataset of user preferences for text-to-image generation. Advances in Neural Information Processing Systems, 36, 2024.
  19. A. N. Kolmogorov. Sulla determinazione empirica di una legge di distribuzione. Giornale dell'Istituto Italiano degli Attuari, 4:89–91, 1933.
  20. Viescore: Towards explainable metrics for conditional image synthesis evaluation. arXiv preprint arXiv:2312.14867, 2023a.
  21. Imagenhub: Standardizing the evaluation of conditional image generation models, 2023b.
  22. Aligning text-to-image models using human feedback. arXiv preprint arXiv:2302.12192, 2023.
  23. mplug: Effective and efficient vision-language learning by cross-modal skip-connections. arXiv preprint arXiv:2205.12005, 2022a.
  24. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In ICML, 2022b.
  25. Llm-grounded diffusion: Enhancing prompt understanding of text-to-image diffusion models with large language models. arXiv preprint arXiv:2305.13655, 2023.
  26. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014.
  27. Improved baselines with visual instruction tuning, 2023a.
  28. Visual instruction tuning, 2023b.
  29. Compositional visual generation with composable diffusion models. In European Conference on Computer Vision, pages 423–439. Springer, 2022.
  30. Llmscore: Unveiling the power of large language models in text-to-image synthesis evaluation. arXiv preprint arXiv:2305.11116, 2023.
  31. Semantic complexity in end-to-end spoken language understanding. arXiv preprint arXiv:2008.02858, 2020.
  32. Sdedit: Guided image synthesis and editing with stochastic differential equations. arXiv preprint arXiv:2108.01073, 2021.
  33. Improving the numerical reasoning skills of pretrained language models. arXiv preprint arXiv:2205.06733, 2022.
  34. Human evaluation of text-to-image models on a multi-task benchmark. arXiv preprint arXiv:2211.12112, 2022.
  35. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models, 2016.
  36. Sdxl: Improving latent diffusion models for high-resolution image synthesis, 2023.
  37. Kolmogorov-smirnov two-sample tests. Concepts of nonparametric theory, pages 318–344, 1981.
  38. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
  39. Zero-shot text-to-image generation. In International Conference on Machine Learning, pages 8821–8831. PMLR, 2021.
  40. Hierarchical text-conditional image generation with clip latents, 2022.
  41. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10684–10695, June 2022.
  42. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479–36494, 2022.
  43. Improved techniques for training gans. Advances in neural information processing systems, 29, 2016.
  44. M. Saxon and W. Y. Wang. Multilingual conceptual coverage in text-to-image models. In A. Rogers, J. Boyd-Graber, and N. Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4831–4848, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.266. URL https://aclanthology.org/2023.acl-long.266.
  45. Peco: Examining single sentence label leakage in natural language inference datasets through progressive evaluation of cluster outliers. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 3053–3066, 2023.
  46. Lost in translation? translation errors and challenges for fair assessment of text-to-image models on multilingual concepts. arXiv preprint arXiv:2403.11092, 2024.
  47. C. Spearman. The proof and measurement of association between two things. The American Journal of Psychology, 15(1):72–101, 1904. ISSN 00029556. URL http://www.jstor.org/stable/1412159.
  48. Numeracy enhances the literacy of language models. In M.-F. Moens, X. Huang, L. Specia, and S. W.-t. Yih, editors, Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6960–6967, Online and Punta Cana, Dominican Republic, Nov. 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.557. URL https://aclanthology.org/2021.emnlp-main.557.
  49. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
  50. Large language models still can’t plan (a benchmark for llms on planning and reasoning about change). arXiv preprint arXiv:2206.10498, 2022.
  51. Imagine that! abstract-to-intricate text-to-image synthesis with scene graph hallucination diffusion. Advances in Neural Information Processing Systems, 36, 2024.
  52. What you see is what you read? improving text-image alignment evaluation, 2023.
  53. mplug-owl: Modularization empowers large language models with multimodality, 2023.
  54. Scaling autoregressive models for content-rich text-to-image generation. arXiv preprint arXiv:2206.10789, 2(3):5, 2022.
  55. Improving text-to-image generation with object layout guidance. Multimedia Tools and Applications, 80(18):27423–27443, 2021.
  56. Alignscore: Evaluating factual consistency with a unified alignment function. arXiv preprint arXiv:2305.16739, 2023.
  57. Judging llm-as-a-judge with mt-bench and chatbot arena, 2023.
  58. Towards understanding sample variance in visually grounded language generation: Evaluations and observations. arXiv preprint arXiv:2010.03644, 2020.
Authors (6)
  1. Michael Saxon (27 papers)
  2. Fatima Jahara (1 paper)
  3. Mahsa Khoshnoodi (3 papers)
  4. Yujie Lu (42 papers)
  5. Aditya Sharma (32 papers)
  6. William Yang Wang (254 papers)
Citations (6)