Introduction
Text-to-image (T2I) generation has advanced rapidly through state-of-the-art diffusion models such as Stable Diffusion, alongside earlier approaches built on generative adversarial networks and transformer architectures. Nevertheless, current models still struggle to compose multiple objects with distinct attributes and relationships into a coherent complex scene. Recognizing this limitation, the paper under discussion introduces T2I-CompBench, a comprehensive benchmark dedicated to open-world compositional T2I generation.
Benchmark and Evaluation Metrics
T2I-CompBench comprises 6,000 prompts organized into three main categories: attribute binding (further split into color, shape, and texture), object relationships (spatial and non-spatial), and complex compositions. The authors argue that existing evaluation metrics are insufficient for these settings and therefore propose category-specific metrics, such as disentangled BLIP-VQA for attribute binding and UniDet-based evaluation for spatial relationships. For complex prompts, they introduce a unified 3-in-1 metric that combines the best-performing metric from each sub-category. These metrics are empirically validated to correlate closely with human perception.
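A minimal sketch of how such a VQA-based attribute-binding score could be wired together is shown below; the question template and the `vqa_yes_probability` scoring hook are illustrative assumptions standing in for a real BLIP-VQA model, not the authors' exact implementation.

```python
from typing import Callable, List

# Hypothetical hook: returns the probability that a VQA model (e.g. BLIP-VQA)
# answers "yes" to a question about the image.
VqaScorer = Callable[[str, str], float]  # (image_path, question) -> prob of "yes"

def attribute_binding_score(image_path: str,
                            noun_phrases: List[str],
                            vqa_yes_probability: VqaScorer) -> float:
    """Disentangled VQA-style scoring: ask one question per attribute-noun
    phrase (e.g. "a green bench") and multiply the yes-probabilities, so a
    single wrongly bound attribute drags the whole score down."""
    score = 1.0
    for phrase in noun_phrases:
        score *= vqa_yes_probability(image_path, f"Is there {phrase} in the image?")
    return score

# Example usage with a dummy scorer standing in for a real VQA model.
if __name__ == "__main__":
    dummy_scorer: VqaScorer = lambda image, question: 0.9
    print(attribute_binding_score("sample.png",
                                  ["a green bench", "a red car"],
                                  dummy_scorer))
```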
Generative Model Fine-tuning with GORS
The paper proposes Generative mOdel finetuning with Reward-driven Sample selection (GORS), a new method tailored to enhance compositional T2I generation. GORS fine-tunes the model on its own generated images, selecting those with high compositional alignment to the prompt and weighting the fine-tuning loss by that alignment score, so that well-composed samples contribute more to the update. Empirical results demonstrate the effectiveness of GORS, which outperforms existing methods, with the improvements also confirmed by human judgments.
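A rough sketch of such a reward-weighted objective is given below, under the assumption that compositional alignment is available as a scalar reward per generated sample; the threshold value and the plain per-sample MSE denoising loss are simplifications for illustration, not the paper's exact training setup.

```python
import torch

def gors_weighted_loss(noise_pred: torch.Tensor,
                       noise_target: torch.Tensor,
                       rewards: torch.Tensor,
                       threshold: float = 0.7) -> torch.Tensor:
    """Reward-weighted denoising loss in the spirit of GORS: samples whose
    compositional-alignment reward falls below the threshold are dropped,
    and the remaining per-sample losses are scaled by their rewards."""
    per_sample = torch.mean((noise_pred - noise_target) ** 2, dim=(1, 2, 3))
    keep = (rewards >= threshold).float()      # select well-aligned samples
    weights = rewards * keep                   # weight survivors by their reward
    denom = weights.sum().clamp(min=1e-8)      # avoid division by zero
    return (weights * per_sample).sum() / denom

# Example: a batch of 4 latents, 2 of which clear the reward threshold.
if __name__ == "__main__":
    pred = torch.randn(4, 4, 64, 64)
    target = torch.randn(4, 4, 64, 64)
    rewards = torch.tensor([0.9, 0.4, 0.8, 0.2])
    print(gors_weighted_loss(pred, target, rewards))
```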
Discussion and Conclusion
The paper's contributions are T2I-CompBench as a benchmark, together with newly developed evaluation metrics that promise more faithful assessment of compositional imagery, and the GORS method, which raises the bar for compositional capabilities in T2I models, with quantitative and qualitative results reinforcing its efficacy. Acknowledged limitations include the absence of a single unified evaluation metric covering all composition types and the need to consider possible biases and negative societal impacts of model-generated content. The authors point to future work, especially developing a unified metric that leverages the reasoning capabilities of multimodal LLMs.