Who Evaluates the Evaluations? Objectively Scoring Text-to-Image Prompt Coherence Metrics with T2IScoreScore (TS2) (2404.04251v3)
Abstract: With advances in the quality of text-to-image (T2I) models has come interest in benchmarking their prompt faithfulness -- the semantic coherence of generated images to the prompts they were conditioned on. A variety of T2I faithfulness metrics have been proposed, leveraging advances in cross-modal embeddings and vision-language models (VLMs). However, these metrics are not rigorously compared and benchmarked; instead, they are presented with correlations to human Likert scores over a set of easy-to-discriminate images against seemingly weak baselines. We introduce T2IScoreScore, a curated set of semantic error graphs containing a prompt and a set of increasingly erroneous images. These allow us to rigorously judge whether a given prompt faithfulness metric can correctly order images with respect to their objective error count and significantly discriminate between different error nodes, using meta-metric scores derived from established statistical tests. Surprisingly, we find that the state-of-the-art VLM-based metrics we tested (e.g., TIFA, DSG, LLMScore, VIEScore) fail to significantly outperform simple (and supposedly worse) feature-based metrics like CLIPScore, particularly on a hard subset of naturally occurring T2I model errors. TS2 will enable the development of better T2I prompt faithfulness metrics through more rigorous comparison of their conformity to expected orderings and separations under objective criteria.
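As a concrete illustration of the two meta-metric ideas described above, the sketch below scores a single semantic error graph in two ways: a Spearman rank correlation between each image's objective error count and its faithfulness-metric score (ordering), and Kolmogorov-Smirnov two-sample tests between the score distributions of adjacent error nodes (separation). This is a minimal sketch under assumed interfaces, not the released TS2 implementation; the function name `score_error_graph`, the input format, and the example scores are hypothetical.

```python
# Minimal sketch of the ordering and separation meta-metrics from the abstract;
# not the official TS2 code. Assumes each error-graph node is keyed by its
# objective error count and holds the faithfulness-metric scores of its images.
from scipy.stats import spearmanr, ks_2samp

def score_error_graph(scores_by_node):
    """scores_by_node: dict {error_count: [metric scores for that node's images]}."""
    # Ordering: a good metric should assign lower scores as errors accumulate,
    # so the Spearman correlation between error count and score should be
    # strongly negative.
    counts, scores = [], []
    for n_errors, node_scores in scores_by_node.items():
        counts.extend([n_errors] * len(node_scores))
        scores.extend(node_scores)
    ordering_rho, _ = spearmanr(counts, scores)

    # Separation: a two-sample Kolmogorov-Smirnov test between adjacent nodes
    # checks whether the metric meaningfully distinguishes consecutive error
    # levels (small p-values indicate good separation).
    nodes = sorted(scores_by_node)
    separation_pvals = [
        ks_2samp(scores_by_node[a], scores_by_node[b]).pvalue
        for a, b in zip(nodes, nodes[1:])
    ]
    return ordering_rho, separation_pvals

# Hypothetical CLIPScore-like outputs for a graph with 0-, 1-, and 2-error nodes.
example = {0: [0.81, 0.79, 0.83], 1: [0.74, 0.76, 0.72], 2: [0.63, 0.61, 0.65]}
rho, pvals = score_error_graph(example)
print(f"ordering rho = {rho:.2f}; adjacent-node KS p-values = {pvals}")
```

Aggregating such per-graph scores across the benchmark would yield a single meta-metric value per faithfulness metric, which is the kind of head-to-head comparison the abstract describes.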
- Introducing our multimodal models (Fuyu-8B), 2023. URL https://www.adept.ai/blog/fuyu-8b.
- V. W. Berger and Y. Zhou. Kolmogorov–Smirnov test: Overview. Wiley StatsRef: Statistics Reference Online, 2014.
- Easily Accessible Text-to-Image Generation Amplifies Demographic Stereotypes at Large Scale. In Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency, FAccT ’23, pages 1493–1504. Association for Computing Machinery. ISBN 9798400701924. doi: 10.1145/3593013.3594095. URL https://dl.acm.org/doi/10.1145/3593013.3594095.
- Microsoft COCO Captions: Data collection and evaluation server, 2015.
- DALL-Eval: Probing the reasoning skills and social biases of text-to-image generation models, 2023.
- Davidsonian scene graph: Improving reliability in fine-grained evaluation for text-to-image generation, 2024.
- InstructBLIP: Towards general-purpose vision-language models with instruction tuning, 2023.
- Deep generative image models using a Laplacian pyramid of adversarial networks. Advances in Neural Information Processing Systems, 28, 2015.
- Training-free structured diffusion guidance for compositional text-to-image synthesis. arXiv preprint arXiv:2212.05032, 2022.
- LayoutGPT: Compositional visual planning and generation with large language models. Advances in Neural Information Processing Systems, 36, 2024.
- CLIPScore: A reference-free evaluation metric for image captioning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7514–7528, 2021.
- Framing image description as a ranking task: Data, models and evaluation metrics. In M. Wooldridge and Q. Yang, editors, Proceedings of the 24th International Joint Conference on Artificial Intelligence (IJCAI 2015), pages 4188–4192. International Joint Conferences on Artificial Intelligence, 2015.
- TIFA: Accurate and interpretable text-to-image faithfulness evaluation with question answering, 2023. URL http://arxiv.org/abs/2303.11897.
- T2I-CompBench: A comprehensive benchmark for open-world compositional text-to-image generation, 2023.
- Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1125–1134, 2017.
- Scaling up visual and vision-language representation learning with noisy text supervision. In International conference on machine learning, pages 4904–4916. PMLR, 2021.
- LLMs can't plan, but can help planning in LLM-modulo frameworks. arXiv preprint arXiv:2402.01817, 2024.
- Pick-a-Pic: An open dataset of user preferences for text-to-image generation. Advances in Neural Information Processing Systems, 36, 2024.
- A. N. Kolmogorov. Sulla determinazione empirica di una legge di distribuzione. Giornale dell'Istituto Italiano degli Attuari, 4:89–91, 1933.
- VIEScore: Towards explainable metrics for conditional image synthesis evaluation. arXiv preprint arXiv:2312.14867, 2023a.
- ImagenHub: Standardizing the evaluation of conditional image generation models, 2023b.
- Aligning text-to-image models using human feedback. arXiv preprint arXiv:2302.12192, 2023.
- mPLUG: Effective and efficient vision-language learning by cross-modal skip-connections. arXiv preprint arXiv:2205.12005, 2022a.
- BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In ICML, 2022b.
- LLM-grounded diffusion: Enhancing prompt understanding of text-to-image diffusion models with large language models. arXiv preprint arXiv:2305.13655, 2023.
- Microsoft COCO: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6–12, 2014, Proceedings, Part V, pages 740–755. Springer, 2014.
- Improved baselines with visual instruction tuning, 2023a.
- Visual instruction tuning, 2023b.
- Compositional visual generation with composable diffusion models. In European Conference on Computer Vision, pages 423–439. Springer, 2022.
- LLMScore: Unveiling the power of large language models in text-to-image synthesis evaluation. arXiv preprint arXiv:2305.11116, 2023.
- Semantic complexity in end-to-end spoken language understanding. arXiv preprint arXiv:2008.02858, 2020.
- SDEdit: Guided image synthesis and editing with stochastic differential equations. arXiv preprint arXiv:2108.01073, 2021.
- Improving the numerical reasoning skills of pretrained language models. arXiv preprint arXiv:2205.06733, 2022.
- Human evaluation of text-to-image models on a multi-task benchmark. arXiv preprint arXiv:2211.12112, 2022.
- Flickr30k Entities: Collecting region-to-phrase correspondences for richer image-to-sentence models, 2016.
- SDXL: Improving latent diffusion models for high-resolution image synthesis, 2023.
- Kolmogorov–Smirnov two-sample tests. Concepts of Nonparametric Theory, pages 318–344, 1981.
- Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
- Zero-shot text-to-image generation. In International Conference on Machine Learning, pages 8821–8831. PMLR, 2021.
- Hierarchical text-conditional image generation with clip latents, 2022.
- High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10684–10695, June 2022.
- Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479–36494, 2022.
- Improved techniques for training GANs. Advances in Neural Information Processing Systems, 29, 2016.
- M. Saxon and W. Y. Wang. Multilingual conceptual coverage in text-to-image models. In A. Rogers, J. Boyd-Graber, and N. Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4831–4848, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.266. URL https://aclanthology.org/2023.acl-long.266.
- PECO: Examining single sentence label leakage in natural language inference datasets through progressive evaluation of cluster outliers. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 3053–3066, 2023.
- Lost in translation? translation errors and challenges for fair assessment of text-to-image models on multilingual concepts. arXiv preprint arXiv:2403.11092, 2024.
- C. Spearman. The proof and measurement of association between two things. The American Journal of Psychology, 15(1):72–101, 1904. ISSN 00029556. URL http://www.jstor.org/stable/1412159.
- Numeracy enhances the literacy of language models. In M.-F. Moens, X. Huang, L. Specia, and S. W.-t. Yih, editors, Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6960–6967, Online and Punta Cana, Dominican Republic, Nov. 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.557. URL https://aclanthology.org/2021.emnlp-main.557.
- LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
- Large language models still can't plan (a benchmark for LLMs on planning and reasoning about change). arXiv preprint arXiv:2206.10498, 2022.
- Imagine That! Abstract-to-intricate text-to-image synthesis with scene graph hallucination diffusion. Advances in Neural Information Processing Systems, 36, 2024.
- What you see is what you read? improving text-image alignment evaluation, 2023.
- mPLUG-Owl: Modularization empowers large language models with multimodality, 2023.
- Scaling autoregressive models for content-rich text-to-image generation. arXiv preprint arXiv:2206.10789, 2(3):5, 2022.
- Improving text-to-image generation with object layout guidance. Multimedia Tools and Applications, 80(18):27423–27443, 2021.
- AlignScore: Evaluating factual consistency with a unified alignment function. arXiv preprint arXiv:2305.16739, 2023.
- Judging LLM-as-a-judge with MT-Bench and Chatbot Arena, 2023.
- Towards understanding sample variance in visually grounded language generation: Evaluations and observations. arXiv preprint arXiv:2010.03644, 2020.
Authors: Michael Saxon, Fatima Jahara, Mahsa Khoshnoodi, Yujie Lu, Aditya Sharma, William Yang Wang