Comprehensive Evaluation of Text-to-Image Models with Enhanced Metrics and Benchmarking
Introduction
Text-to-Image (T2I) models have made substantial progress in generating images from descriptive text, but accurately evaluating how well generated images align with their prompts remains challenging. This paper introduces Gecko2K, a new benchmark for rigorously testing T2I models across a range of skills and sub-skills, and proposes the Gecko metric, an auto-evaluation method that improves agreement with human judgement by enforcing prompt coverage in generated questions, filtering out hallucinated questions, and normalizing scores over the model's probabilistic outputs.
Benchmarking and Human Judgements
Gecko2K comprises two subsets: Gecko(R) and Gecko(S). Gecko(R) rebalances existing benchmark datasets so that skills are represented more evenly, while Gecko(S) is written from scratch with controlled coverage of skills and sub-skills, designed to expose model proficiency on fine-grained tasks.
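To make the rebalancing idea concrete, here is a minimal sketch of skill-balanced prompt selection. The `Prompt` schema and `resample_by_skill` helper are illustrative assumptions, not the paper's actual pipeline.

```python
import random
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Prompt:
    text: str
    skill: str            # e.g. "counting", "spatial relations"
    sub_skill: str = ""   # finer-grained tag, populated in Gecko(S)

def resample_by_skill(prompts, per_skill, seed=0):
    """Draw up to `per_skill` prompts per skill so that no single skill
    dominates the benchmark the way it can in the source datasets."""
    rng = random.Random(seed)
    by_skill = defaultdict(list)
    for p in prompts:
        by_skill[p.skill].append(p)
    balanced = []
    for group in by_skill.values():
        balanced.extend(rng.sample(group, min(per_skill, len(group))))
    return balanced
```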
Human judgements were gathered under four annotation templates to compare both model performance and metric quality. Across all templates, SDXL was rated the strongest model on Gecko(R) and Muse on Gecko(S), with the model ordering on Gecko(S) statistically significant across templates.
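As an illustration of how such an ordering can be tested, the following is a generic paired-test sketch using `scipy.stats.wilcoxon`; the paper's exact statistical procedure is not detailed here and may differ.

```python
from scipy.stats import wilcoxon

def ordering_is_significant(ratings_a, ratings_b, alpha=0.05):
    """Paired test on per-prompt human ratings for two models; a small
    p-value indicates the observed ordering is unlikely under the null
    hypothesis of no difference between the models."""
    _, p_value = wilcoxon(ratings_a, ratings_b)
    return p_value < alpha
```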
Automatic Evaluation Metrics
This work compares a broad set of T2I auto-evaluation metrics against a large dataset of human annotations (over 100K ratings). The proposed Gecko metric outperforms existing metrics across all test conditions in Gecko2K, a result attributed to three main improvements (a sketch of the full pipeline follows the list):
- Coverage enforcement ensures that every keyword in the prompt is covered by at least one generated question, fixing a common failure mode of earlier QA-based approaches that leave parts of the prompt untested.
- Natural language inference (NLI) filtering removes questions generated from hallucinated content, i.e., statements not actually entailed by the prompt.
- Improved scoring normalizes each VQA answer score over the candidate answer set before aggregation, so the final score reflects the model's answer confidence rather than a hard right-or-wrong decision.
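The sketch below shows how the three components could compose. Every interface here (`extract_keywords`, `qa_model`, `nli_model`, `vqa_model`, and their attributes) is a hypothetical stand-in, not the paper's implementation.

```python
def extract_keywords(prompt):
    """Naive keyword stand-in: non-trivial words of the prompt.
    Purely illustrative; an LLM-based tagger would be used in practice."""
    stopwords = {"a", "an", "the", "of", "on", "in", "and", "with"}
    return [w for w in prompt.lower().split() if w not in stopwords]

def gecko_score(prompt, image, qa_model, nli_model, vqa_model):
    """Sketch of the three Gecko components; the model objects are
    assumed callables standing in for the LLM, NLI, and VQA backbones."""
    # 1. Coverage enforcement: if any prompt keyword is left uncovered
    #    by the generated questions, ask the QA model to target it.
    keywords = set(extract_keywords(prompt))
    questions = qa_model.generate_questions(prompt)
    missing = keywords - {kw for q in questions for kw in q.keywords}
    if missing:
        questions += qa_model.generate_questions(prompt, focus=missing)

    # 2. NLI filtering: keep only questions whose underlying statement
    #    is entailed by the prompt, discarding hallucinated ones.
    questions = [q for q in questions
                 if nli_model.entails(premise=prompt, hypothesis=q.statement)]

    # 3. Normalized VQA scoring: score each question by the probability
    #    the VQA model assigns to the expected answer, normalized over
    #    the candidate answers, then average across questions.
    scores = []
    for q in questions:
        probs = vqa_model.answer_probs(image, q.text, candidates=q.answers)
        scores.append(probs[q.expected] / sum(probs.values()))
    return sum(scores) / len(scores) if scores else 0.0
```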
The results show that Gecko correlates more strongly with human judgement than both traditional embedding-based metrics such as CLIP and more advanced QA-based methods.
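For reference, a metric-versus-human meta-evaluation of this kind is typically computed as a correlation over shared (prompt, image) pairs. This sketch uses SciPy; the paper's exact correlation protocol (e.g., grouping by annotation template) may differ.

```python
from scipy.stats import pearsonr, spearmanr

def correlate_with_humans(metric_scores, human_scores):
    """Meta-evaluate an auto-eval metric by correlating its scores with
    human ratings over the same (prompt, image) pairs."""
    rho, _ = spearmanr(metric_scores, human_scores)
    r, _ = pearsonr(metric_scores, human_scores)
    return {"spearman": rho, "pearson": r}
```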
Discussion and Future Work
The paper provides a thorough analysis of model performance and metric reliability across a comprehensive, rigorously constructed dataset. Gecko marks a significant advance in automatic evaluation, with potential applications beyond T2I alignment in other areas of AI that require nuanced joint interpretation of language and imagery.
Future developments could explore automated methods for further refining the selection and generation of prompts in Gecko(S), potentially integrating emerging insights from psycholinguistics and cognitive science to model human perception and interpretation more closely. Additionally, extending the Gecko metric to incorporate multimodal embeddings could offer even finer distinctions in model evaluations, particularly in handling more abstract or subjective prompt elements.
Through meticulous benchmark construction and novel metric formulation, this work advances our capacity to critically assess and compare the capabilities of text-to-image models, guiding future developments in the field.