GenEval: An Object-Focused Framework for Evaluating Text-to-Image Alignment (2310.11513v1)

Published 17 Oct 2023 in cs.CV and cs.LG

Abstract: Recent breakthroughs in diffusion models, multimodal pretraining, and efficient finetuning have led to an explosion of text-to-image generative models. Given human evaluation is expensive and difficult to scale, automated methods are critical for evaluating the increasingly large number of new models. However, most current automated evaluation metrics like FID or CLIPScore only offer a holistic measure of image quality or image-text alignment, and are unsuited for fine-grained or instance-level analysis. In this paper, we introduce GenEval, an object-focused framework to evaluate compositional image properties such as object co-occurrence, position, count, and color. We show that current object detection models can be leveraged to evaluate text-to-image models on a variety of generation tasks with strong human agreement, and that other discriminative vision models can be linked to this pipeline to further verify properties like object color. We then evaluate several open-source text-to-image models and analyze their relative generative capabilities on our benchmark. We find that recent models demonstrate significant improvement on these tasks, though they are still lacking in complex capabilities such as spatial relations and attribute binding. Finally, we demonstrate how GenEval might be used to help discover existing failure modes, in order to inform development of the next generation of text-to-image models. Our code to run the GenEval framework is publicly available at https://github.com/djghosh13/geneval.

An Assessment of GenEval: A Framework for Evaluating Text-to-Image Models

This paper introduces GenEval, a framework for evaluating the capabilities of text-to-image (T2I) models. Motivated by the rapid development of diffusion models and multimodal pretraining, the authors seek to address the deficiencies of existing automated evaluation methods in the face of a burgeoning number of T2I models. Unlike traditional metrics such as Fréchet Inception Distance (FID) or CLIPScore, which offer only holistic measures of image quality or image-text alignment, GenEval takes a finer-grained, object-focused approach capable of analyzing compositional and instance-level image properties.
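
To make the contrast concrete, a holistic metric such as CLIPScore collapses an image-caption pair into a single similarity number, which by construction cannot say which object or attribute failed. Below is a minimal CLIPScore-style sketch using the Hugging Face transformers CLIP API; the checkpoint name is one common choice, not necessarily the paper's exact setup.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# One scalar per image-text pair: no per-object breakdown is possible.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, caption: str) -> float:
    inputs = processor(text=[caption], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])
    # Cosine similarity, clipped at 0 and rescaled by 2.5,
    # as in the CLIPScore definition (Hessel et al., 2021).
    sim = torch.nn.functional.cosine_similarity(img_emb, txt_emb).item()
    return 2.5 * max(0.0, sim)
```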

The authors motivate the framework by the growing impracticality of manual T2I evaluation and the inadequacy of current automated metrics for compositional analysis. GenEval uses an object detection model to verify the presence and properties of objects in a generated image, checking attributes such as object count, color, and relative position against the prompt, as shown in the sketch below. This approach yields more detailed insight into T2I model performance than a single holistic score.
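
Concretely, once a detector returns labeled boxes, each prompt reduces to a simple predicate over the detections. The following is a minimal sketch of such checks; the Detection structure, helper names, and confidence threshold are illustrative assumptions, not GenEval's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    label: str    # object class, e.g. "dog"
    box: tuple    # (x0, y0, x1, y1) in pixels
    score: float  # detector confidence

def detections_of(dets, label, thresh=0.3):
    """Detections of a given class above a confidence threshold."""
    return [d for d in dets if d.label == label and d.score >= thresh]

def check_presence(dets, label):
    """'a photo of a dog': at least one confident detection of the class."""
    return len(detections_of(dets, label)) >= 1

def check_count(dets, label, n):
    """'three dogs': exactly n confident detections of the class."""
    return len(detections_of(dets, label)) == n

def check_color(dets, label, color, classify_color):
    """'a red dog': some detection of the class is classified as the color.
    classify_color(box) is assumed to crop the region and return a color name."""
    return any(classify_color(d.box) == color
               for d in detections_of(dets, label))
```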

The framework leverages existing object detection and discriminative vision models rather than training new evaluators. Applying GenEval to several open-source T2I models, the authors find that while recent models show substantial improvement in some areas, they still struggle with complex compositional tasks such as spatial relations and attribute binding.

The benchmark results show that while rendering a single object and matching its color succeed at high rates, spatial positioning and attribute binding remain weak across models. For instance, the paper reports that even an advanced model such as IF-XL correctly handles only 15% of spatial-relation prompts and 35% of attribute-binding prompts, underscoring significant room for improvement in these areas.
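
Spatial-relation prompts of the form "a dog to the left of a bench" can in principle be scored from the same detector outputs by comparing bounding-box centroids, which also shows why they are demanding: both objects must be present and correctly placed. A hedged sketch follows, reusing the detections_of helper from the sketch above; the relation vocabulary and pixel margin are assumptions, not the paper's exact rules.

```python
def centroid(box):
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2, (y0 + y1) / 2)

def check_relation(dets, label_a, relation, label_b, margin=10):
    """True if some detection of label_a stands in `relation` to some
    detection of label_b. `margin` (pixels) avoids counting near-ties."""
    for a in detections_of(dets, label_a):
        ax, ay = centroid(a.box)
        for b in detections_of(dets, label_b):
            bx, by = centroid(b.box)
            if relation == "left of" and ax < bx - margin:
                return True
            if relation == "right of" and ax > bx + margin:
                return True
            if relation == "above" and ay < by - margin:  # image y grows downward
                return True
            if relation == "below" and ay > by + margin:
                return True
    return False
```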

The paper also reports a human evaluation study in which GenEval's automated judgments agreed with human annotations of image correctness 83% of the time, suggesting it captures image-text alignment reliably and outperforms CLIPScore on complex compositional tasks. Each task is benchmarked across popular T2I models, including Stable Diffusion and DeepFloyd's IF models, giving a consistent account of these models' current capabilities and limitations.

In conclusion, GenEval offers an automated, interpretable, and modular solution for evaluating T2I models, with applications extending to the discovery of failure modes that can inform next-generation model development. The challenges in spatial reasoning and attribute binding that GenEval highlights suggest focal points for future research, and the framework demonstrates how discriminative models can be repurposed to assess generative ones, serving both practical deployment and theoretical understanding. With the code publicly available, GenEval lays groundwork for further contributions to the growing field of AI-driven image generation.

Authors (3)
  1. Dhruba Ghosh
  2. Hanna Hajishirzi
  3. Ludwig Schmidt