Who Evaluates the Evaluations? Objectively Scoring Text-to-Image Prompt Coherence Metrics with T2IScoreScore (TS2) (2404.04251v3)

Published 5 Apr 2024 in cs.CV, cs.AI, and cs.CL

Abstract: With advances in the quality of text-to-image (T2I) models has come interest in benchmarking their prompt faithfulness -- the semantic coherence of generated images to the prompts they were conditioned on. A variety of T2I faithfulness metrics have been proposed, leveraging advances in cross-modal embeddings and vision-language models (VLMs). However, these metrics are not rigorously compared and benchmarked, instead presented with correlation to human Likert scores over a set of easy-to-discriminate images against seemingly weak baselines. We introduce T2IScoreScore, a curated set of semantic error graphs containing a prompt and a set of increasingly erroneous images. These allow us to rigorously judge whether a given prompt faithfulness metric can correctly order images with respect to their objective error count and significantly discriminate between different error nodes, using meta-metric scores derived from established statistical tests. Surprisingly, we find that the state-of-the-art VLM-based metrics (e.g., TIFA, DSG, LLMScore, VIEScore) we tested fail to significantly outperform simple (and supposedly worse) feature-based metrics like CLIPScore, particularly on a hard subset of naturally-occurring T2I model errors. TS2 will enable the development of better T2I prompt faithfulness metrics through more rigorous comparison of their conformity to expected orderings and separations under objective criteria.

Who Evaluates the Evaluations? A Benchmark for Text-to-Image Prompt Coherence Metrics

Introduction

The landscape of text-to-image (T2I) models has witnessed rapid advancements, propelling the fidelity and semantic coherence of generated images to unprecedented levels. Despite this progress, aligning generated images with their text prompts, a cornerstone of evaluating T2I model performance, remains a persistent challenge. The heterogeneity among the automated prompt faithfulness metrics proposed to measure this alignment underscores the need for a standardized benchmark. This paper introduces T2IScoreScore (TS2), a meticulously curated set of semantic error graphs (SEGs) and corresponding meta-metrics that aim to objectively assess the efficacy of T2I prompt faithfulness metrics.

Related Work

A broad survey of existing benchmarks reveals a disjointed landscape in which each metric employs a distinct evaluation methodology, often designed to highlight its own strengths. Ad-hoc tests against prior baselines are common, but they fall short of offering a consistent or objective comparison framework. Our investigation highlights the absence of benchmarks that rigorously compare T2I prompt coherence metrics against clearly defined, objective errors rather than against correlation with subjective human judgments.

The Dataset

T2IScoreScore (TS2) distinguishes itself through a unique structure that emphasizes a high image-to-prompt ratio. This design enables the construction of semantic error graphs (SEGs), in which images are organized by increasing deviation from the original prompt. The dataset comprises 165 SEGs, spanning synthetic errors as well as natural misinterpretations produced by real T2I models, setting the stage for comprehensive metric evaluations.
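
To make the SEG structure concrete, below is a minimal sketch of how one graph might be represented in code. The class and field names are illustrative assumptions, not the dataset's actual schema; the point is simply that each prompt carries nodes of images grouped by their objective error count.

```python
from dataclasses import dataclass, field

@dataclass
class ErrorNode:
    """One node in a semantic error graph: images sharing the same error count."""
    error_count: int                 # objective number of semantic errors w.r.t. the prompt
    image_paths: list[str] = field(default_factory=list)

@dataclass
class SemanticErrorGraph:
    """A prompt plus images grouped by increasing deviation from that prompt."""
    prompt: str
    nodes: list[ErrorNode] = field(default_factory=list)

# Hypothetical example: one faithful node and two increasingly erroneous nodes.
seg = SemanticErrorGraph(
    prompt="a red cube on top of a blue sphere",
    nodes=[
        ErrorNode(0, ["img_00.png", "img_01.png"]),   # fully faithful images
        ErrorNode(1, ["img_10.png"]),                 # one error, e.g. wrong color
        ErrorNode(2, ["img_20.png", "img_21.png"]),   # two errors
    ],
)
```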

Meta-Metrics

The cornerstone of our evaluation framework lies in two meta-metrics: an ordering score (ranking correctness) and a separation score. The ordering score leverages Spearman's rank correlation to assess a metric's ability to correctly order images by their semantic deviation from the prompt, while the separation score employs the two-sample Kolmogorov–Smirnov statistic to evaluate a metric's capability to distinguish between sets of images reflecting distinct semantic errors. Together, these meta-metrics provide a robust measure of a T2I prompt faithfulness metric's performance.
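
A minimal sketch of how these two meta-metrics can be computed for a single SEG, using SciPy's standard implementations of Spearman's rho and the two-sample Kolmogorov–Smirnov test. The aggregation choices here (negating scores, averaging KS statistics over node pairs) are assumptions for illustration rather than the paper's exact formulation.

```python
from itertools import combinations
from scipy.stats import spearmanr, ks_2samp

def ordering_score(error_counts, metric_scores):
    """Spearman correlation between objective error counts and (negated) metric scores.

    A good faithfulness metric assigns lower scores to images with more errors,
    so we correlate error counts against -score; +1 indicates a perfect ordering.
    """
    rho, _ = spearmanr(error_counts, [-s for s in metric_scores])
    return rho

def separation_score(scores_by_node):
    """Mean two-sample KS statistic over all pairs of error nodes.

    `scores_by_node` maps an error count to the metric scores of the images in
    that node; larger KS statistics mean better-separated score distributions.
    """
    stats = [ks_2samp(scores_by_node[a], scores_by_node[b]).statistic
             for a, b in combinations(sorted(scores_by_node), 2)]
    return sum(stats) / len(stats)

# Hypothetical scores for a single semantic error graph.
print(ordering_score([0, 0, 1, 2, 2], [0.91, 0.88, 0.70, 0.55, 0.60]))
print(separation_score({0: [0.91, 0.88], 1: [0.70, 0.72], 2: [0.55, 0.60]}))
```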

Experiments

Our experiments span a broad spectrum of T2I faithfulness metrics, evaluating each on the newly proposed TS2 benchmark. The paper presents a comparative analysis across metric classes, including embedding-based metrics like CLIPScore and newer vision-language model (VLM)-based metrics such as TIFA and DSG. The results reveal an intriguing finding: simpler feature-based metrics like CLIPScore remain competitive, especially on the challenging subset of naturally occurring errors. This observation suggests that feature-based metrics still provide a valuable baseline alongside more sophisticated VLM-based approaches.
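
For reference, the CLIPScore baseline is simply a rescaled, clipped cosine similarity between CLIP image and text embeddings: CLIPScore(image, text) = w * max(cos(E_img, E_txt), 0) with w = 2.5. The sketch below uses the Hugging Face CLIP implementation; the checkpoint and preprocessing choices are assumptions for illustration and may differ from the evaluation pipeline used in the paper.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, prompt: str, w: float = 2.5) -> float:
    """Cosine similarity between CLIP image and text embeddings, clipped at 0 and rescaled."""
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])
    cos = torch.nn.functional.cosine_similarity(img_emb, txt_emb).item()
    return w * max(cos, 0.0)

# Hypothetical usage: score one generated image against its prompt.
# score = clip_score(Image.open("img_00.png"), "a red cube on top of a blue sphere")
```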

Discussion and Conclusion

The comparative analysis offered by TS2 yields critical insights into the current state of T2I prompt coherence metric development. Notably, the strong showing of simpler metrics on complex, naturally occurring model errors points to a path forward for metric development focused not only on agreement with human judgment but also on objective semantic error identification. Our research emphasizes the need to bridge the gap between subjective preference and objective error-based evaluation, advocating a multifaceted approach to metric development. As the T2I field continues to evolve, TS2 stands as a pivotal benchmark, guiding the refinement of evaluation metrics toward more accurate, reliable, and semantically coherent image generation.

Acknowledgements and Impact Statement

This research highlights the indispensable role of precise evaluation tools like TS2 in advancing T2I technology. By providing an objective benchmark, TS2 enables a deeper understanding and refinement of prompt faithfulness metrics, ensuring their alignment with the semantic content of text prompts. This contributes to the development of more effective and semantically aware T2I models, bolstering the reliability of generated images across a wide array of applications.

References (58)
  1. Introducing our multimodal models (fuyu-8b), 2023. URL https://www.adept.ai/blog/fuyu-8b.
  2. V. W. Berger and Y. Zhou. Kolmogorov–Smirnov test: Overview. Wiley StatsRef: Statistics Reference Online, 2014.
  3. Easily Accessible Text-to-Image Generation Amplifies Demographic Stereotypes at Large Scale. In Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency, FAccT ’23, pages 1493–1504. Association for Computing Machinery. ISBN 9798400701924. doi: 10.1145/3593013.3594095. URL https://dl.acm.org/doi/10.1145/3593013.3594095.
  4. Microsoft coco captions: Data collection and evaluation server, 2015.
  5. Dall-eval: Probing the reasoning skills and social biases of text-to-image generation models, 2023.
  6. Davidsonian scene graph: Improving reliability in fine-grained evaluation for text-to-image generation, 2024.
  7. Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023.
  8. Deep generative image models using a laplacian pyramid of adversarial networks. Advances in neural information processing systems, 28, 2015.
  9. Training-free structured diffusion guidance for compositional text-to-image synthesis. arXiv preprint arXiv:2212.05032, 2022.
  10. Layoutgpt: Compositional visual planning and generation with large language models. Advances in Neural Information Processing Systems, 36, 2024.
  11. Clipscore: A reference-free evaluation metric for image captioning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7514–7528, 2021.
  12. Framing image description as a ranking task: Data, models and evaluation metrics. In M. Wooldridge and Q. Yang, editors, IJCAI 2015 - Proceedings of the 24th International Joint Conference on Artificial Intelligence, IJCAI International Joint Conference on Artificial Intelligence, pages 4188–4192. International Joint Conferences on Artificial Intelligence, 2015. 24th International Joint Conference on Artificial Intelligence, IJCAI 2015 ; Conference date: 25-07-2015 Through 31-07-2015.
  13. TIFA: Accurate and Interpretable Text-to-Image Faithfulness Evaluation with Question Answering. URL http://arxiv.org/abs/2303.11897.
  14. T2i-compbench: A comprehensive benchmark for open-world compositional text-to-image generation, 2023.
  15. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1125–1134, 2017.
  16. Scaling up visual and vision-language representation learning with noisy text supervision. In International conference on machine learning, pages 4904–4916. PMLR, 2021.
  17. Llms can’t plan, but can help planning in llm-modulo frameworks. arXiv preprint arXiv:2402.01817, 2024.
  18. Pick-a-pic: An open dataset of user preferences for text-to-image generation. Advances in Neural Information Processing Systems, 36, 2024.
  19. A. N. Kolmogorov. Sulla determinazione empirica di una legge di distribuzione. Giornale dell'Istituto Italiano degli Attuari, 4:89–91, 1933.
  20. Viescore: Towards explainable metrics for conditional image synthesis evaluation. arXiv preprint arXiv:2312.14867, 2023a.
  21. Imagenhub: Standardizing the evaluation of conditional image generation models, 2023b.
  22. Aligning text-to-image models using human feedback. arXiv preprint arXiv:2302.12192, 2023.
  23. mplug: Effective and efficient vision-language learning by cross-modal skip-connections. arXiv preprint arXiv:2205.12005, 2022a.
  24. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In ICML, 2022b.
  25. Llm-grounded diffusion: Enhancing prompt understanding of text-to-image diffusion models with large language models. arXiv preprint arXiv:2305.13655, 2023.
  26. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014.
  27. Improved baselines with visual instruction tuning, 2023a.
  28. Visual instruction tuning, 2023b.
  29. Compositional visual generation with composable diffusion models. In European Conference on Computer Vision, pages 423–439. Springer, 2022.
  30. Llmscore: Unveiling the power of large language models in text-to-image synthesis evaluation. arXiv preprint arXiv:2305.11116, 2023.
  31. Semantic complexity in end-to-end spoken language understanding. arXiv preprint arXiv:2008.02858, 2020.
  32. Sdedit: Guided image synthesis and editing with stochastic differential equations. arXiv preprint arXiv:2108.01073, 2021.
  33. Improving the numerical reasoning skills of pretrained language models. arXiv preprint arXiv:2205.06733, 2022.
  34. Human evaluation of text-to-image models on a multi-task benchmark. arXiv preprint arXiv:2211.12112, 2022.
  35. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models, 2016.
  36. Sdxl: Improving latent diffusion models for high-resolution image synthesis, 2023.
  37. Kolmogorov-smirnov two-sample tests. Concepts of nonparametric theory, pages 318–344, 1981.
  38. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
  39. Zero-shot text-to-image generation. In International Conference on Machine Learning, pages 8821–8831. PMLR, 2021.
  40. Hierarchical text-conditional image generation with clip latents, 2022.
  41. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10684–10695, June 2022.
  42. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479–36494, 2022.
  43. Improved techniques for training gans. Advances in neural information processing systems, 29, 2016.
  44. M. Saxon and W. Y. Wang. Multilingual conceptual coverage in text-to-image models. In A. Rogers, J. Boyd-Graber, and N. Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4831–4848, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.266. URL https://aclanthology.org/2023.acl-long.266.
  45. Peco: Examining single sentence label leakage in natural language inference datasets through progressive evaluation of cluster outliers. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 3053–3066, 2023.
  46. Lost in translation? translation errors and challenges for fair assessment of text-to-image models on multilingual concepts. arXiv preprint arXiv:2403.11092, 2024.
  47. C. Spearman. The proof and measurement of association between two things. The American Journal of Psychology, 15(1):72–101, 1904. ISSN 00029556. URL http://www.jstor.org/stable/1412159.
  48. Numeracy enhances the literacy of language models. In M.-F. Moens, X. Huang, L. Specia, and S. W.-t. Yih, editors, Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6960–6967, Online and Punta Cana, Dominican Republic, Nov. 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.557. URL https://aclanthology.org/2021.emnlp-main.557.
  49. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
  50. Large language models still can’t plan (a benchmark for llms on planning and reasoning about change). arXiv preprint arXiv:2206.10498, 2022.
  51. Imagine that! abstract-to-intricate text-to-image synthesis with scene graph hallucination diffusion. Advances in Neural Information Processing Systems, 36, 2024.
  52. What you see is what you read? improving text-image alignment evaluation, 2023.
  53. mplug-owl: Modularization empowers large language models with multimodality, 2023.
  54. Scaling autoregressive models for content-rich text-to-image generation. arXiv preprint arXiv:2206.10789, 2(3):5, 2022.
  55. Improving text-to-image generation with object layout guidance. Multimedia Tools and Applications, 80(18):27423–27443, 2021.
  56. Alignscore: Evaluating factual consistency with a unified alignment function. arXiv preprint arXiv:2305.16739, 2023.
  57. Judging llm-as-a-judge with mt-bench and chatbot arena, 2023.
  58. Towards understanding sample variance in visually grounded language generation: Evaluations and observations. arXiv preprint arXiv:2010.03644, 2020.
Authors (6)
  1. Michael Saxon (27 papers)
  2. Fatima Jahara (1 paper)
  3. Mahsa Khoshnoodi (3 papers)
  4. Yujie Lu (42 papers)
  5. Aditya Sharma (32 papers)
  6. William Yang Wang (254 papers)
Citations (6)