A Contrastive Compositional Benchmark for Text-to-Image Synthesis: A Study with Unified Text-to-Image Fidelity Metrics (2312.02338v2)

Published 4 Dec 2023 in cs.CV, cs.AI, and cs.MM

Abstract: Text-to-image (T2I) synthesis has recently achieved significant advancements. However, challenges remain in the model's compositionality, which is the ability to create new combinations from known components. We introduce Winoground-T2I, a benchmark designed to evaluate the compositionality of T2I models. This benchmark includes 11K complex, high-quality contrastive sentence pairs spanning 20 categories. These contrastive sentence pairs with subtle differences enable fine-grained evaluations of T2I synthesis models. Additionally, to address the inconsistency across different metrics, we propose a strategy that evaluates the reliability of various metrics by using comparative sentence pairs. We use Winoground-T2I with a dual objective: to evaluate the performance of T2I models and the metrics used for their evaluation. Finally, we provide insights into the strengths and weaknesses of these metrics and the capabilities of current T2I models in tackling challenges across a range of complex compositional categories. Our benchmark is publicly available at https://github.com/zhuxiangru/Winoground-T2I .

Text-to-Image (T2I) synthesis has advanced rapidly, with models like Stable Diffusion, Midjourney, and DALL-E becoming increasingly popular in creative fields. Despite these advances, the models still face significant challenges with compositionality: the ability to generate novel combinations of known components from complex textual prompts.

To address this, the authors introduce Winoground-T2I, a benchmark for assessing T2I models' compositional understanding. It comprises approximately 11,000 contrastive sentence pairs spanning a diverse range of 20 categories. Each pair is crafted to differ in a subtle yet distinct way, enabling precise, fine-grained evaluation; meticulous filtering criteria remove unreasonable or visually incoherent pairs to ensure quality and applicability in realistic scenarios.
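
To make the benchmark's structure concrete, here is a minimal sketch of how one contrastive pair might be represented in code. The class, field names, and example captions are illustrative assumptions, not the repository's released schema; see https://github.com/zhuxiangru/Winoground-T2I for the actual data format.

```python
# Illustrative sketch only: the field names and example captions are
# assumptions, not the schema released in the Winoground-T2I repository.
from dataclasses import dataclass

@dataclass
class ContrastivePair:
    caption_a: str   # first sentence of the pair
    caption_b: str   # second sentence, differing in one compositional detail
    category: str    # one of the benchmark's 20 compositional categories

# A hypothetical pair: the captions swap a single spatial relation, so a
# faithful T2I model must render two visibly different scenes.
pair = ContrastivePair(
    caption_a="A cat sitting on a cardboard box",
    caption_b="A cat sitting inside a cardboard box",
    category="spatial relationship",
)
```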

Moreover, the paper tackles the inconsistency observed across different T2I evaluation metrics. It introduces a methodical strategy for assessing the metrics themselves, using the contrastive sentence pairs for fine-grained analysis. This evaluation focuses on each metric's alignment with human preferences, intra-pair consistency, discriminability, stability, and efficiency.
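
As an illustration of one of these criteria, the sketch below shows a Winoground-style intra-pair consistency check: a metric passes on a pair when it scores each generated image higher against its own caption than against the other caption. The `score` callable stands in for any image-text alignment metric (a CLIPScore-like function, say); this formulation is an assumption for illustration, not the paper's exact protocol.

```python
# Minimal sketch of a Winoground-style intra-pair consistency check.
# `score(image, text)` is a placeholder for any image-text alignment
# metric; the exact criterion here is an assumed, simplified formulation.
from typing import Callable, Sequence, Tuple

def intra_pair_consistency(
    pairs: Sequence[Tuple[object, object, str, str]],  # (img_a, img_b, cap_a, cap_b)
    score: Callable[[object, str], float],
) -> float:
    """Fraction of pairs where the metric prefers each image's own caption."""
    consistent = 0
    for img_a, img_b, cap_a, cap_b in pairs:
        if (score(img_a, cap_a) > score(img_a, cap_b)
                and score(img_b, cap_b) > score(img_b, cap_a)):
            consistent += 1
    return consistent / len(pairs) if pairs else 0.0
```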

The benchmark, paired with the metric judged most reliable, was then used to rigorously test current T2I models. The analysis highlights the models' strengths in accurately generating images for compositional categories such as color, material, and spatial relationships. However, it also identifies significant room for improvement on less common attributes and relationships, which remain substantially difficult for these models.
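
A hypothetical end-to-end loop for this kind of per-category evaluation might look like the following. Here `generate` and `score` are placeholders for a real T2I model and a chosen alignment metric, and the pair fields reuse the illustrative `ContrastivePair` sketch above; none of this is the paper's actual implementation.

```python
# Hypothetical per-category evaluation loop; not the paper's implementation.
from collections import defaultdict

def evaluate_by_category(pairs, generate, score):
    """Mean alignment score per compositional category.

    `pairs` yields ContrastivePair-like objects; `generate(caption)` is a
    placeholder T2I model and `score(image, caption)` a placeholder metric.
    """
    per_category = defaultdict(list)
    for p in pairs:
        for caption in (p.caption_a, p.caption_b):
            image = generate(caption)  # synthesize an image for each caption
            per_category[p.category].append(score(image, caption))
    # Averaging per category reveals which compositional phenomena
    # (color, material, spatial relations, ...) a model handles well.
    return {c: sum(v) / len(v) for c, v in per_category.items()}
```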

The Winoground-T2I benchmark, along with the insights gained from its use, promises to steer future research toward improving the compositionality and overall performance of T2I synthesis models. The comprehensive analysis of benchmark results, together with the strategy for selecting reliable metrics, provides an essential foundation for developing models with more nuanced understanding and generation capabilities. The Winoground-T2I repository is publicly accessible, giving researchers an additional tool for advancing the field of T2I synthesis.

Authors (7)
  1. Xiangru Zhu
  2. Penglei Sun
  3. Chengyu Wang
  4. Jingping Liu
  5. Zhixu Li
  6. Yanghua Xiao
  7. Jun Huang