Introduction
Text-to-image (T2I) generation has advanced rapidly through state-of-the-art diffusion models such as Stable Diffusion, alongside earlier approaches built on generative adversarial networks and transformer architectures. Nevertheless, current models still struggle to compose multiple objects with distinct attributes and relationships into a coherent complex scene. Recognizing this limitation, the paper under discussion introduces T2I-CompBench, a comprehensive benchmark dedicated to open-world compositional T2I generation.
Benchmark and Evaluation Metrics
T2I-CompBench comprises 6,000 prompts organized into three main categories: attribute binding (further split into color, shape, and texture), object relationships (spatial and non-spatial), and complex compositions. The authors argue that existing evaluation metrics are insufficient for these settings and therefore propose category-specific metrics, such as disentangled BLIP-VQA for attribute binding and UniDet-based evaluation for spatial relationships. For complex prompts, they introduce a unified 3-in-1 metric that combines the best-performing metric from each sub-category. These metrics are empirically validated to correlate closely with human perception.
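A minimal sketch of how such a VQA-based attribute-binding score could be wired together is shown below; the question template and the `vqa_yes_probability` scoring hook are illustrative assumptions standing in for a real BLIP-VQA model, not the authors' exact implementation.

```python
from typing import Callable, List

# Hypothetical hook: returns the probability that a VQA model (e.g. BLIP-VQA)
# answers "yes" to a question about the image.
VqaScorer = Callable[[str, str], float]  # (image_path, question) -> prob of "yes"

def attribute_binding_score(image_path: str,
                            noun_phrases: List[str],
                            vqa_yes_probability: VqaScorer) -> float:
    """Disentangled VQA-style scoring: ask one question per attribute-noun
    phrase (e.g. "a green bench") and multiply the yes-probabilities, so a
    single wrongly bound attribute drags the whole score down."""
    score = 1.0
    for phrase in noun_phrases:
        score *= vqa_yes_probability(image_path, f"Is there {phrase} in the image?")
    return score

# Example usage with a dummy scorer standing in for a real VQA model.
if __name__ == "__main__":
    dummy_scorer: VqaScorer = lambda image, question: 0.9
    print(attribute_binding_score("sample.png",
                                  ["a green bench", "a red car"],
                                  dummy_scorer))
```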
Generative Model Fine-tuning with GORS
The paper proposes Generative mOdel finetuning with Reward-driven Sample selection (GORS), a new method tailored to enhance compositional T2I generation. GORS fine-tunes the model on its own generated images, selecting those with high compositional alignment to the prompt and weighting the fine-tuning loss by that alignment score, so that well-composed samples contribute more to the update. Empirical results demonstrate the effectiveness of GORS, which outperforms existing methods, with the improvements also confirmed by human judgments.
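A rough sketch of such a reward-weighted objective is given below, under the assumption that compositional alignment is available as a scalar reward per generated sample; the threshold value and the plain per-sample MSE denoising loss are simplifications for illustration, not the paper's exact training setup.

```python
import torch

def gors_weighted_loss(noise_pred: torch.Tensor,
                       noise_target: torch.Tensor,
                       rewards: torch.Tensor,
                       threshold: float = 0.7) -> torch.Tensor:
    """Reward-weighted denoising loss in the spirit of GORS: samples whose
    compositional-alignment reward falls below the threshold are dropped,
    and the remaining per-sample losses are scaled by their rewards."""
    per_sample = torch.mean((noise_pred - noise_target) ** 2, dim=(1, 2, 3))
    keep = (rewards >= threshold).float()      # select well-aligned samples
    weights = rewards * keep                   # weight survivors by their reward
    denom = weights.sum().clamp(min=1e-8)      # avoid division by zero
    return (weights * per_sample).sum() / denom

# Example: a batch of 4 latents, 2 of which clear the reward threshold.
if __name__ == "__main__":
    pred = torch.randn(4, 4, 64, 64)
    target = torch.randn(4, 4, 64, 64)
    rewards = torch.tensor([0.9, 0.4, 0.8, 0.2])
    print(gors_weighted_loss(pred, target, rewards))
```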
Discussion and Conclusion
The paper's contributions are T2I-CompBench as a benchmark, together with newly developed evaluation metrics that promise more faithful assessment of compositional imagery, and the GORS method, which raises the bar for compositional capabilities in T2I models, with quantitative and qualitative results reinforcing its efficacy. Acknowledged limitations include the absence of a single unified evaluation metric covering all composition types and the need to consider possible biases and negative societal impacts of model-generated content. The authors point to future work, especially developing a unified metric that leverages the reasoning capabilities of multimodal LLMs.