Text-to-image (T2I) synthesis has advanced rapidly, with models such as Stable Diffusion, Midjourney, and DALL-E becoming increasingly popular in creative fields. Despite these advances, the models still struggle with compositionality: the ability to generate novel combinations of known components from complex textual prompts.
Addressing this, researchers have created a new benchmark named Winoground-T2I to assess T2I models' compositional understanding. This benchmark includes approximately 11,000 contrastive sentence pairs across a diverse range of 20 categories. These pairs were crafted to reflect subtle yet distinct differences, thus allowing precise evaluations. To ensure quality and applicability in realistic scenarios, meticulous criteria have been established to filter out unreasonable or visually incoherent sentence pairs.
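To make the idea of a categorized contrastive pair concrete, here is a minimal Python sketch. It is an illustration only: the `ContrastivePair` record, its field names, the category labels, and the example sentences are all assumptions made for this sketch and are not taken from the Winoground-T2I release. The paper's own filtering criteria are not reproduced here; the check below is only a trivial sanity test.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable


def _normalize(sentence: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(sentence.lower().split())


@dataclass(frozen=True)
class ContrastivePair:
    """One benchmark item (hypothetical layout, not the official schema)."""
    category: str    # e.g. "spatial" or "color"; labels are made up here
    sentence_a: str
    sentence_b: str

    def is_contrastive(self) -> bool:
        # The two sentences must still differ after normalization.
        return _normalize(self.sentence_a) != _normalize(self.sentence_b)


def pairs_per_category(pairs: Iterable[ContrastivePair]) -> Counter:
    """Count usable pairs in each category, skipping degenerate ones."""
    return Counter(p.category for p in pairs if p.is_contrastive())


# Invented example data, for illustration only.
example_pairs = [
    ContrastivePair("spatial", "a cat on a mat", "a mat on a cat"),
    ContrastivePair("color", "a red cup", "A red  cup"),  # degenerate pair
]
counts = pairs_per_category(example_pairs)
```

In this sketch the second pair is dropped because its sentences are identical once case and spacing are normalized, so only the "spatial" category receives a count.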
Moreover, the paper tackles the inconsistency observed across T2I evaluation metrics. It introduces a systematic strategy for evaluating the metrics themselves, using the contrastive sentence pairs for fine-grained assessment. The evaluation focuses on each metric's alignment with human preferences, intra-pair consistency, discriminability, stability, and efficiency.
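Two of these criteria can be illustrated with a short Python sketch. The paper's exact formulas are not given in this summary, so both functions are assumptions: `discrimination_rate` uses a Winoground-style check that each image scores higher against its own sentence than against the contrasting one, and `pairwise_agreement` is a simple stand-in for alignment with human preferences. The `PairScores` record and its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class PairScores:
    """Metric scores for one contrastive pair (hypothetical layout).

    s_ij is the score of the image generated from sentence i
    when it is compared against sentence j.
    """
    s00: float
    s01: float
    s10: float
    s11: float


def is_discriminated(p: PairScores) -> bool:
    # Each image should match its own sentence better than the other one.
    return p.s00 > p.s01 and p.s11 > p.s10


def discrimination_rate(pairs: Sequence[PairScores]) -> float:
    """Fraction of pairs that the metric separates in both directions."""
    if not pairs:
        raise ValueError("need at least one pair")
    return sum(is_discriminated(p) for p in pairs) / len(pairs)


def pairwise_agreement(metric: Sequence[float], human: Sequence[float]) -> float:
    """Fraction of item pairs ordered the same way by metric and humans.

    Item pairs tied under either scoring are skipped.
    """
    if len(metric) != len(human):
        raise ValueError("score lists must have the same length")
    agree = total = 0
    for i in range(len(metric)):
        for j in range(i + 1, len(metric)):
            dm = metric[i] - metric[j]
            dh = human[i] - human[j]
            if dm == 0 or dh == 0:
                continue
            total += 1
            agree += (dm > 0) == (dh > 0)
    if total == 0:
        raise ValueError("no comparable item pairs")
    return agree / total


# Invented scores, for illustration only.
rate = discrimination_rate([
    PairScores(0.9, 0.4, 0.3, 0.8),  # separated in both directions
    PairScores(0.5, 0.6, 0.2, 0.7),  # first image prefers the wrong sentence
])
agreement = pairwise_agreement([0.1, 0.9, 0.5], [1.0, 2.0, 3.0])
```

Stability and efficiency would need repeated runs and timing measurements, which this sketch leaves out.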
Using the benchmark together with a metric found to be reliable, the authors rigorously test current T2I models. The analysis highlights the models' strengths in accurately rendering attributes such as color and material, as well as spatial relationships. However, it also identifies significant room for improvement on less common attributes and relationships, which remain substantially more difficult for these models.
The Winoground-T2I benchmark, along with the insights gained from its use, promises to steer future research towards enhancing the compositionality and overall performance of T2I synthesis models. The comprehensive analysis of benchmark results and selection strategy for reliable metrics provide an essential foundation for developing models with a more nuanced understanding and generation capability. The repository for Winoground-T2I has also been made publicly accessible, offering researchers an additional tool for advancing the field of T2I synthesis.