A Contrastive Compositional Benchmark for Text-to-Image Synthesis: A Study with Unified Text-to-Image Fidelity Metrics (2312.02338v2)

Published 4 Dec 2023 in cs.CV, cs.AI, and cs.MM

Abstract: Text-to-image (T2I) synthesis has recently achieved significant advancements. However, challenges remain in the model's compositionality, which is the ability to create new combinations from known components. We introduce Winoground-T2I, a benchmark designed to evaluate the compositionality of T2I models. This benchmark includes 11K complex, high-quality contrastive sentence pairs spanning 20 categories. These contrastive sentence pairs with subtle differences enable fine-grained evaluations of T2I synthesis models. Additionally, to address the inconsistency across different metrics, we propose a strategy that evaluates the reliability of various metrics by using comparative sentence pairs. We use Winoground-T2I with a dual objective: to evaluate the performance of T2I models and the metrics used for their evaluation. Finally, we provide insights into the strengths and weaknesses of these metrics and the capabilities of current T2I models in tackling challenges across a range of complex compositional categories. Our benchmark is publicly available at https://github.com/zhuxiangru/Winoground-T2I .

Text-to-Image (T2I) synthesis has advanced rapidly, with models like Stable Diffusion, Midjourney, and DALL-E becoming increasingly popular in creative fields. Despite these advances, the models still face significant challenges with compositionality: the ability to generate novel combinations of known components from complex textual prompts.

To address this, the authors introduce Winoground-T2I, a benchmark for assessing T2I models' compositional understanding. It comprises approximately 11,000 contrastive sentence pairs spanning a diverse range of 20 categories. Each pair is crafted to differ in a subtle yet distinct way, enabling precise, fine-grained evaluation; meticulous filtering criteria remove unreasonable or visually incoherent pairs to ensure quality and applicability in realistic scenarios.
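
To make the benchmark's structure concrete, here is a minimal sketch of how one contrastive pair might be represented in code. The class, field names, and example captions are illustrative assumptions, not the repository's released schema; see https://github.com/zhuxiangru/Winoground-T2I for the actual data format.

```python
# Illustrative sketch only: the field names and example captions are
# assumptions, not the schema released in the Winoground-T2I repository.
from dataclasses import dataclass

@dataclass
class ContrastivePair:
    caption_a: str   # first sentence of the pair
    caption_b: str   # second sentence, differing in one compositional detail
    category: str    # one of the benchmark's 20 compositional categories

# A hypothetical pair: the captions swap a single spatial relation, so a
# faithful T2I model must render two visibly different scenes.
pair = ContrastivePair(
    caption_a="A cat sitting on a cardboard box",
    caption_b="A cat sitting inside a cardboard box",
    category="spatial relationship",
)
```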

Moreover, the paper tackles the inconsistency observed across different T2I evaluation metrics. It introduces a methodical strategy for assessing the metrics themselves, using the contrastive sentence pairs for fine-grained analysis. This evaluation focuses on each metric's alignment with human preferences, intra-pair consistency, discriminability, stability, and efficiency.
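
As an illustration of one of these criteria, the sketch below shows a Winoground-style intra-pair consistency check: a metric passes on a pair when it scores each generated image higher against its own caption than against the other caption. The `score` callable stands in for any image-text alignment metric (a CLIPScore-like function, say); this formulation is an assumption for illustration, not the paper's exact protocol.

```python
# Minimal sketch of a Winoground-style intra-pair consistency check.
# `score(image, text)` is a placeholder for any image-text alignment
# metric; the exact criterion here is an assumed, simplified formulation.
from typing import Callable, Sequence, Tuple

def intra_pair_consistency(
    pairs: Sequence[Tuple[object, object, str, str]],  # (img_a, img_b, cap_a, cap_b)
    score: Callable[[object, str], float],
) -> float:
    """Fraction of pairs where the metric prefers each image's own caption."""
    consistent = 0
    for img_a, img_b, cap_a, cap_b in pairs:
        if (score(img_a, cap_a) > score(img_a, cap_b)
                and score(img_b, cap_b) > score(img_b, cap_a)):
            consistent += 1
    return consistent / len(pairs) if pairs else 0.0
```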

The benchmark, paired with the metric judged most reliable, was then used to rigorously test current T2I models. The analysis highlights the models' strengths in accurately generating images for compositional categories such as color, material, and spatial relationships. However, it also identifies significant room for improvement on less common attributes and relationships, which remain substantially difficult for these models.
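
A hypothetical end-to-end loop for this kind of per-category evaluation might look like the following. Here `generate` and `score` are placeholders for a real T2I model and a chosen alignment metric, and the pair fields reuse the illustrative `ContrastivePair` sketch above; none of this is the paper's actual implementation.

```python
# Hypothetical per-category evaluation loop; not the paper's implementation.
from collections import defaultdict

def evaluate_by_category(pairs, generate, score):
    """Mean alignment score per compositional category.

    `pairs` yields ContrastivePair-like objects; `generate(caption)` is a
    placeholder T2I model and `score(image, caption)` a placeholder metric.
    """
    per_category = defaultdict(list)
    for p in pairs:
        for caption in (p.caption_a, p.caption_b):
            image = generate(caption)  # synthesize an image for each caption
            per_category[p.category].append(score(image, caption))
    # Averaging per category reveals which compositional phenomena
    # (color, material, spatial relations, ...) a model handles well.
    return {c: sum(v) / len(v) for c, v in per_category.items()}
```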

The Winoground-T2I benchmark, along with the insights gained from its use, promises to steer future research toward improving the compositionality and overall performance of T2I synthesis models. The comprehensive analysis of benchmark results, together with the strategy for selecting reliable metrics, provides an essential foundation for developing models with more nuanced understanding and generation capabilities. The Winoground-T2I repository is publicly accessible, giving researchers an additional tool for advancing the field of T2I synthesis.

Authors (7)
  1. Xiangru Zhu
  2. Penglei Sun
  3. Chengyu Wang
  4. Jingping Liu
  5. Zhixu Li
  6. Yanghua Xiao
  7. Jun Huang