Comprehensive Evaluation of Text-to-Image Models with Enhanced Metrics and Benchmarking
Introduction
Text-to-Image (T2I) models have made substantial progress in generating images from descriptive text, but accurately evaluating how well generated images align with their prompts remains challenging. This paper introduces Gecko2K, a new benchmark for rigorously testing T2I models across a range of skills and sub-skills, and proposes the Gecko metric, an auto-evaluation method that improves agreement with human judgement by enforcing prompt coverage in generated questions, filtering out hallucinated questions, and normalizing scores over the model's probabilistic outputs.
Benchmarking and Human Judgements
Gecko2K comprises two subsets: Gecko(R) and Gecko(S). Gecko(R) rebalances existing benchmark datasets so that skills are represented more evenly, while Gecko(S) is written from scratch with controlled coverage of skills and sub-skills, designed to expose model proficiency on fine-grained tasks.
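To make the rebalancing idea concrete, here is a minimal sketch of skill-balanced prompt selection. The `Prompt` schema and `resample_by_skill` helper are illustrative assumptions, not the paper's actual pipeline.

```python
import random
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Prompt:
    text: str
    skill: str            # e.g. "counting", "spatial relations"
    sub_skill: str = ""   # finer-grained tag, populated in Gecko(S)

def resample_by_skill(prompts, per_skill, seed=0):
    """Draw up to `per_skill` prompts per skill so that no single skill
    dominates the benchmark the way it can in the source datasets."""
    rng = random.Random(seed)
    by_skill = defaultdict(list)
    for p in prompts:
        by_skill[p.skill].append(p)
    balanced = []
    for group in by_skill.values():
        balanced.extend(rng.sample(group, min(per_skill, len(group))))
    return balanced
```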
Human judgements were gathered under four annotation templates to compare both model performance and metric quality. Across all templates, SDXL was rated the strongest model on Gecko(R) and Muse on Gecko(S), with the model ordering on Gecko(S) statistically significant across templates.
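As an illustration of how such an ordering can be tested, the following is a generic paired-test sketch using `scipy.stats.wilcoxon`; the paper's exact statistical procedure is not detailed here and may differ.

```python
from scipy.stats import wilcoxon

def ordering_is_significant(ratings_a, ratings_b, alpha=0.05):
    """Paired test on per-prompt human ratings for two models; a small
    p-value indicates the observed ordering is unlikely under the null
    hypothesis of no difference between the models."""
    _, p_value = wilcoxon(ratings_a, ratings_b)
    return p_value < alpha
```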
Automatic Evaluation Metrics
This work compares a broad set of T2I auto-evaluation metrics against a large dataset of human annotations (over 100K ratings). The proposed Gecko metric outperforms existing metrics across all test conditions in Gecko2K, a result attributed to three main improvements (a sketch of the full pipeline follows the list):
- Coverage enforcement ensures that every keyword in the prompt is covered by at least one generated question, fixing a common failure mode of earlier QA-based approaches that leave parts of the prompt untested.
- Natural language inference (NLI) filtering removes questions generated from hallucinated content, i.e., statements not actually entailed by the prompt.
- Improved scoring normalizes each VQA answer score over the candidate answer set before aggregation, so the final score reflects the model's answer confidence rather than a hard right-or-wrong decision.
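The sketch below shows how the three components could compose. Every interface here (`extract_keywords`, `qa_model`, `nli_model`, `vqa_model`, and their attributes) is a hypothetical stand-in, not the paper's implementation.

```python
def extract_keywords(prompt):
    """Naive keyword stand-in: non-trivial words of the prompt.
    Purely illustrative; an LLM-based tagger would be used in practice."""
    stopwords = {"a", "an", "the", "of", "on", "in", "and", "with"}
    return [w for w in prompt.lower().split() if w not in stopwords]

def gecko_score(prompt, image, qa_model, nli_model, vqa_model):
    """Sketch of the three Gecko components; the model objects are
    assumed callables standing in for the LLM, NLI, and VQA backbones."""
    # 1. Coverage enforcement: if any prompt keyword is left uncovered
    #    by the generated questions, ask the QA model to target it.
    keywords = set(extract_keywords(prompt))
    questions = qa_model.generate_questions(prompt)
    missing = keywords - {kw for q in questions for kw in q.keywords}
    if missing:
        questions += qa_model.generate_questions(prompt, focus=missing)

    # 2. NLI filtering: keep only questions whose underlying statement
    #    is entailed by the prompt, discarding hallucinated ones.
    questions = [q for q in questions
                 if nli_model.entails(premise=prompt, hypothesis=q.statement)]

    # 3. Normalized VQA scoring: score each question by the probability
    #    the VQA model assigns to the expected answer, normalized over
    #    the candidate answers, then average across questions.
    scores = []
    for q in questions:
        probs = vqa_model.answer_probs(image, q.text, candidates=q.answers)
        scores.append(probs[q.expected] / sum(probs.values()))
    return sum(scores) / len(scores) if scores else 0.0
```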
The results show that Gecko correlates more strongly with human judgement than both traditional embedding-based metrics such as CLIP and more advanced QA-based methods.
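For reference, a metric-versus-human meta-evaluation of this kind is typically computed as a correlation over shared (prompt, image) pairs. This sketch uses SciPy; the paper's exact correlation protocol (e.g., grouping by annotation template) may differ.

```python
from scipy.stats import pearsonr, spearmanr

def correlate_with_humans(metric_scores, human_scores):
    """Meta-evaluate an auto-eval metric by correlating its scores with
    human ratings over the same (prompt, image) pairs."""
    rho, _ = spearmanr(metric_scores, human_scores)
    r, _ = pearsonr(metric_scores, human_scores)
    return {"spearman": rho, "pearson": r}
```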
Discussion and Future Work
The paper provides a thorough analysis of model performance and metric reliability across a comprehensive, rigorously constructed dataset. Gecko marks a significant advance in automatic evaluation, with potential applications beyond T2I alignment in other areas of AI that require nuanced joint interpretation of language and imagery.
Future developments could explore automated methods for further refining the selection and generation of prompts in Gecko(S), potentially integrating emerging insights from psycholinguistics and cognitive science to model human perception and interpretation more closely. Additionally, extending the Gecko metric to incorporate multimodal embeddings could offer even finer distinctions in model evaluations, particularly in handling more abstract or subjective prompt elements.
Through meticulous benchmark construction and novel metric formulation, this work advances our capacity to critically assess and compare the capabilities of text-to-image models, guiding future developments in the field.