GenEval: An Object-Focused Framework for Evaluating Text-to-Image Alignment (2310.11513v1)

Published 17 Oct 2023 in cs.CV and cs.LG

Abstract: Recent breakthroughs in diffusion models, multimodal pretraining, and efficient finetuning have led to an explosion of text-to-image generative models. Given human evaluation is expensive and difficult to scale, automated methods are critical for evaluating the increasingly large number of new models. However, most current automated evaluation metrics like FID or CLIPScore only offer a holistic measure of image quality or image-text alignment, and are unsuited for fine-grained or instance-level analysis. In this paper, we introduce GenEval, an object-focused framework to evaluate compositional image properties such as object co-occurrence, position, count, and color. We show that current object detection models can be leveraged to evaluate text-to-image models on a variety of generation tasks with strong human agreement, and that other discriminative vision models can be linked to this pipeline to further verify properties like object color. We then evaluate several open-source text-to-image models and analyze their relative generative capabilities on our benchmark. We find that recent models demonstrate significant improvement on these tasks, though they are still lacking in complex capabilities such as spatial relations and attribute binding. Finally, we demonstrate how GenEval might be used to help discover existing failure modes, in order to inform development of the next generation of text-to-image models. Our code to run the GenEval framework is publicly available at https://github.com/djghosh13/geneval.

An Assessment of GenEval: A Framework for Evaluating Text-to-Image Models

This paper introduces GenEval, a framework for evaluating the capabilities of text-to-image (T2I) models. Motivated by the rapid development of diffusion models and multimodal pretraining, the authors seek to address the deficiencies of existing automated evaluation methods in the face of a burgeoning number of T2I models. Unlike traditional metrics such as Fréchet Inception Distance (FID) or CLIPScore, which offer only holistic measures of image quality or image-text alignment, GenEval takes a finer-grained, object-focused approach capable of analyzing compositional and instance-level image properties.
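
To make the contrast concrete, a holistic metric such as CLIPScore collapses an image-caption pair into a single similarity number, which by construction cannot say which object or attribute failed. Below is a minimal CLIPScore-style sketch using the Hugging Face transformers CLIP API; the checkpoint name is one common choice, not necessarily the paper's exact setup.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# One scalar per image-text pair: no per-object breakdown is possible.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, caption: str) -> float:
    inputs = processor(text=[caption], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])
    # Cosine similarity, clipped at 0 and rescaled by 2.5,
    # as in the CLIPScore definition (Hessel et al., 2021).
    sim = torch.nn.functional.cosine_similarity(img_emb, txt_emb).item()
    return 2.5 * max(0.0, sim)
```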

The authors motivate the framework by the growing impracticality of manual T2I evaluation and the inadequacy of current automated metrics for compositional analysis. GenEval uses an object detection model to verify the presence and properties of objects in a generated image, checking attributes such as object count, color, and relative position against the prompt, as shown in the sketch below. This approach yields more detailed insight into T2I model performance than a single holistic score.
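
Concretely, once a detector returns labeled boxes, each prompt reduces to a simple predicate over the detections. The following is a minimal sketch of such checks; the Detection structure, helper names, and confidence threshold are illustrative assumptions, not GenEval's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    label: str    # object class, e.g. "dog"
    box: tuple    # (x0, y0, x1, y1) in pixels
    score: float  # detector confidence

def detections_of(dets, label, thresh=0.3):
    """Detections of a given class above a confidence threshold."""
    return [d for d in dets if d.label == label and d.score >= thresh]

def check_presence(dets, label):
    """'a photo of a dog': at least one confident detection of the class."""
    return len(detections_of(dets, label)) >= 1

def check_count(dets, label, n):
    """'three dogs': exactly n confident detections of the class."""
    return len(detections_of(dets, label)) == n

def check_color(dets, label, color, classify_color):
    """'a red dog': some detection of the class is classified as the color.
    classify_color(box) is assumed to crop the region and return a color name."""
    return any(classify_color(d.box) == color
               for d in detections_of(dets, label))
```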

The framework leverages existing object detection and discriminative vision models rather than training new evaluators. Applying GenEval to several open-source T2I models, the authors find that while recent models show substantial improvement in some areas, they still struggle with complex compositional tasks such as spatial relations and attribute binding.

The benchmark results show that while rendering a single object and matching its color succeed at high rates, spatial positioning and attribute binding remain weak across models. For instance, the paper reports that even an advanced model such as IF-XL correctly handles only 15% of spatial-relation prompts and 35% of attribute-binding prompts, underscoring significant room for improvement in these areas.
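
Spatial-relation prompts of the form "a dog to the left of a bench" can in principle be scored from the same detector outputs by comparing bounding-box centroids, which also shows why they are demanding: both objects must be present and correctly placed. A hedged sketch follows, reusing the detections_of helper from the sketch above; the relation vocabulary and pixel margin are assumptions, not the paper's exact rules.

```python
def centroid(box):
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2, (y0 + y1) / 2)

def check_relation(dets, label_a, relation, label_b, margin=10):
    """True if some detection of label_a stands in `relation` to some
    detection of label_b. `margin` (pixels) avoids counting near-ties."""
    for a in detections_of(dets, label_a):
        ax, ay = centroid(a.box)
        for b in detections_of(dets, label_b):
            bx, by = centroid(b.box)
            if relation == "left of" and ax < bx - margin:
                return True
            if relation == "right of" and ax > bx + margin:
                return True
            if relation == "above" and ay < by - margin:  # image y grows downward
                return True
            if relation == "below" and ay > by + margin:
                return True
    return False
```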

The paper also reports a human evaluation study in which GenEval's automated judgments agreed with human annotations of image correctness 83% of the time, suggesting it captures image-text alignment reliably and outperforms CLIPScore on complex compositional tasks. Each task is benchmarked across popular T2I models, including Stable Diffusion and DeepFloyd's IF models, giving a consistent account of these models' current capabilities and limitations.

In conclusion, GenEval offers an automated, interpretable, and modular solution for evaluating T2I models, with applications extending to the discovery of failure modes that can inform next-generation model development. The challenges in spatial reasoning and attribute binding that GenEval highlights suggest focal points for future research, and the framework demonstrates how discriminative models can be repurposed to assess generative ones, serving both practical deployment and theoretical understanding. With the code publicly available, GenEval lays groundwork for further contributions to the growing field of AI-driven image generation.

Authors (3)
  1. Dhruba Ghosh
  2. Hanna Hajishirzi
  3. Ludwig Schmidt