
ConceptMix: A Compositional Image Generation Benchmark with Controllable Difficulty (2408.14339v1)

Published 26 Aug 2024 in cs.CV

Abstract: Compositionality is a critical capability in Text-to-Image (T2I) models, as it reflects their ability to understand and combine multiple concepts from text descriptions. Existing evaluations of compositional capability rely heavily on human-designed text prompts or fixed templates, limiting their diversity and complexity, and yielding low discriminative power. We propose ConceptMix, a scalable, controllable, and customizable benchmark which automatically evaluates compositional generation ability of T2I models. This is done in two stages. First, ConceptMix generates the text prompts: concretely, using categories of visual concepts (e.g., objects, colors, shapes, spatial relationships), it randomly samples an object and k-tuples of visual concepts, then uses GPT-4o to generate text prompts for image generation based on these sampled concepts. Second, ConceptMix evaluates the images generated in response to these prompts: concretely, it checks how many of the k concepts actually appeared in the image by generating one question per visual concept and using a strong VLM to answer them. Through administering ConceptMix to a diverse set of T2I models (proprietary as well as open ones) using increasing values of k, we show that ConceptMix has higher discrimination power than earlier benchmarks. Specifically, ConceptMix reveals that the performance of several models, especially open models, drops dramatically with increased k. Importantly, it also provides insight into the lack of prompt diversity in widely-used training datasets. Additionally, we conduct extensive human studies to validate the design of ConceptMix and compare our automatic grading with human judgement. We hope it will guide future T2I model development.

ConceptMix: A Comprehensive Benchmark for Compositional Image Generation

The paper "ConceptMix: A Compositional Image Generation Benchmark with Controllable Difficulty" authored by Xindi Wu, Dingli Yu, Yangsibo Huang, Olga Russakovsky, and Sanjeev Arora addresses a fundamental challenge in the field of Text-to-Image (T2I) generation: the evaluation of models' compositional capabilities. This benchmark provides a scalable and customizable framework to assess how effectively T2I models can generate images that cohesively incorporate multiple visual concepts described in text prompts.

Main Contributions

The authors introduce ConceptMix, a benchmark that evaluates the compositional generation abilities of T2I models through a structured and scalable approach. ConceptMix's evaluation operates in two stages: prompt generation and image assessment.

  1. Prompt Generation: Rather than relying on fixed templates, ConceptMix generates diverse and complex text prompts using GPT-4o. It randomly selects visual concept categories (e.g., objects, colors, shapes, spatial relationships) and creates prompts by combining one object with k additional concepts.
  2. Image Assessment: ConceptMix evaluates the generated images by checking how many visual concepts are correctly depicted. Each concept in the prompt is translated into a question answered by a strong Vision-Language Model (VLM), specifically GPT-4o, to determine whether the image meets the criteria set by the prompt.
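The two stages can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the concept pools, category names, and the strict all-concepts grading rule below are assumptions for the sketch, and the real pipeline delegates prompt writing and question answering to GPT-4o.

```python
import random

# Hypothetical concept pools; the paper samples from its own category lists.
CONCEPTS = {
    "color": ["red", "blue", "green"],
    "shape": ["round", "square"],
    "spatial": ["to the left of a tree", "under a table"],
}
OBJECTS = ["dog", "cup", "bicycle"]

def sample_concepts(k, seed=None):
    """Stage 1 (sketch): pick one object plus k concepts drawn from
    k distinct categories; a prompt is then written from this tuple."""
    rng = random.Random(seed)
    obj = rng.choice(OBJECTS)
    cats = rng.sample(list(CONCEPTS), k)
    return obj, {c: rng.choice(CONCEPTS[c]) for c in cats}

def grade(answers):
    """Stage 2 (sketch): one yes/no VLM verdict per concept.
    Returns the fraction of concepts depicted and a strict pass flag."""
    return sum(answers) / len(answers), all(answers)

obj, concepts = sample_concepts(k=2, seed=0)
print(obj, concepts)
print(grade([True, True, False]))  # partial credit, strict fail
```

Increasing k simply draws more concept categories per prompt, which is how the benchmark's difficulty dial works.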

Evaluation Pipeline and Results

The evaluation pipeline involves:

  • Concept Sampling: Randomly sampling from eight categories of concepts.
  • Concept Binding: Creating a structured JSON representation of visual concepts, ensuring coherent binding.
  • Prompt Validation: Filtering out implausible prompts and maintaining ones that test T2I models' creativity and compositionality.
  • Concept Scoring: Generating questions for each concept to evaluate generated images rigorously.
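The concept-binding and concept-scoring steps above can be illustrated with a short sketch. The JSON schema and question templates here are hypothetical; the paper's pipeline has GPT-4o produce both the structured binding and the per-concept questions.

```python
import json

def bind_concepts(obj, concepts):
    """Concept binding (sketch): a structured JSON record tying each
    sampled concept to the object so bindings stay coherent."""
    return json.dumps({"object": obj, "concepts": concepts}, sort_keys=True)

def questions_for(obj, concepts):
    """Concept scoring (sketch): one yes/no question per concept,
    plus one for the object itself. Wording is illustrative only."""
    qs = [f"Is there a {obj} in the image?"]
    qs += [f"Is the {obj} {val}? (category: {cat})"
           for cat, val in concepts.items()]
    return qs

record = bind_concepts("cup", {"color": "red", "shape": "round"})
print(record)
print(questions_for("cup", {"color": "red", "shape": "round"}))
```

Each question is then posed to the grading VLM against the generated image, yielding one binary verdict per concept.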

By examining models with increasing k, ConceptMix distinguishes itself from prior benchmarks through higher discrimination power regarding compositional generation abilities. Numerical results showcase significant performance drops for several models as k increases, especially highlighting the disparity between proprietary models like DALL·E 3 and open models. DALL·E 3 consistently outperforms others, with a pronounced performance drop at k=5, underscoring the challenge of generating complex compositions.
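Aggregating per-image verdicts into a per-k curve is straightforward; a sketch, with made-up numbers rather than the paper's reported results:

```python
def score_by_k(results):
    """For each k, the fraction of images in which every sampled
    concept was verified; steeper drops mean lower compositionality."""
    return {k: sum(passes) / len(passes) for k, passes in results.items()}

# Illustrative verdicts only (1 = all concepts depicted, 0 = otherwise).
print(score_by_k({1: [1, 1, 1, 0], 3: [1, 0, 0, 0], 5: [0, 0, 0, 0]}))
```

Comparing these curves across models is what gives the benchmark its discrimination power.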

Analysis and Validation

The authors conducted extensive IRB-approved human studies to validate ConceptMix's design. The studies revealed high consistency between automated grading and human judgments, supporting ConceptMix's reliability. Human annotations showed some variance, particularly for spatial reasoning and subjective judgments such as "style," but alignment with the automated grading remained robust overall.

Implications and Future Directions

  • Theoretical and Practical Implications: ConceptMix provides a more granular understanding of T2I models' strengths and weaknesses. It encourages the development of more sophisticated models capable of handling complex visual descriptions. The benchmark's comprehensive approach to compositional evaluation fills a gap left by previous models focusing on limited concept categories.
  • Training Data Analysis: Investigations into models' training data, particularly LAION-5B, reveal a lack of complex concept combinations, contributing to the observed limitations in compositional capabilities. This insight underscores the need for diversified and complex training datasets to improve T2I models.
  • Future Developments: Further advancements in LLMs and VLMs can enhance benchmarks like ConceptMix, enabling even more precise evaluations. Additionally, methodologies emerging from ConceptMix can guide the creation of datasets with higher compositional complexity.

Conclusion

ConceptMix sets a new standard for evaluating compositionality in T2I models by providing a robust, diverse, and scalable benchmarking framework. Its innovative use of LLMs for prompt generation and VLMs for image assessment, validated by human judgments, positions it as a significant tool for advancing research in T2I generation. As T2I models evolve, benchmarks like ConceptMix will be crucial in driving their development towards achieving more complex and realistic image generation capabilities.
