
Easier Painting Than Thinking: Can Text-to-Image Models Set the Stage, but Not Direct the Play? (2509.03516v1)

Published 3 Sep 2025 in cs.CV

Abstract: Text-to-image (T2I) generation aims to synthesize images from textual prompts, which jointly specify what must be shown and imply what can be inferred, thereby corresponding to two core capabilities: composition and reasoning. However, with the emerging advances of T2I models in reasoning beyond composition, existing benchmarks reveal clear limitations in providing comprehensive evaluations across and within these capabilities. Meanwhile, these advances also enable models to handle more complex prompts, whereas current benchmarks remain limited to low scene density and simplified one-to-one reasoning. To address these limitations, we propose T2I-CoReBench, a comprehensive and complex benchmark that evaluates both composition and reasoning capabilities of T2I models. To ensure comprehensiveness, we structure composition around scene graph elements (instance, attribute, and relation) and reasoning around the philosophical framework of inference (deductive, inductive, and abductive), formulating a 12-dimensional evaluation taxonomy. To increase complexity, driven by the inherent complexities of real-world scenarios, we curate each prompt with high compositional density for composition and multi-step inference for reasoning. We also pair each prompt with a checklist that specifies individual yes/no questions to assess each intended element independently to facilitate fine-grained and reliable evaluation. In statistics, our benchmark comprises 1,080 challenging prompts and around 13,500 checklist questions. Experiments across 27 current T2I models reveal that their composition capability still remains limited in complex high-density scenarios, while the reasoning capability lags even further behind as a critical bottleneck, with all models struggling to infer implicit elements from prompts. Our project page: https://t2i-corebench.github.io/.


Summary

  • The paper introduces T2I-CoReBench, a benchmark that systematically evaluates both compositional and reasoning capabilities of text-to-image models using a 12-dimensional taxonomy.
  • Experimental results show that while compositional performance is advancing, a substantial performance gap remains in multi-step and abductive reasoning tasks.
  • The study highlights a trade-off between explicit intermediate reasoning and compositional accuracy, emphasizing the need for robust, checklist-based evaluation protocols.

Comprehensive Evaluation of Text-to-Image Models: T2I-CoReBench and the Limits of Reasoning

Introduction

The paper "Easier Painting Than Thinking: Can Text-to-Image Models Set the Stage, but Not Direct the Play?" (2509.03516) presents T2I-CoReBench, a benchmark designed to systematically evaluate both compositional and reasoning capabilities of text-to-image (T2I) generative models. The work addresses the lack of comprehensive and complex evaluation protocols in existing benchmarks, particularly in scenarios requiring high compositional density and multi-step reasoning. The authors introduce a 12-dimensional taxonomy, covering explicit scene composition and a broad spectrum of reasoning types, and provide a large-scale, fine-grained evaluation protocol using checklist-based questions and automated MLLM-based assessment.

Benchmark Design: Taxonomy and Complexity

T2I-CoReBench is constructed to address two core limitations in prior benchmarks: insufficient coverage of both composition and reasoning, and inadequate complexity in prompt design. The benchmark is structured around a 12-dimensional taxonomy, split between composition (multi-instance, multi-attribute, multi-relation, text rendering) and reasoning (deductive, inductive, and abductive, each with multiple subtypes), as illustrated in Figure 2.

Figure 2: The T2I-CoReBench taxonomy spans composition and reasoning, with high prompt complexity and checklist granularity.

Composition dimensions are grounded in scene graph theory, requiring models to generate images with numerous explicit elements, attributes, and relations, as well as complex text rendering. Reasoning dimensions are based on philosophical frameworks, including deductive (logical, behavioral, hypothetical, procedural), inductive (generalization, analogical), and abductive (commonsense, reconstructive) reasoning. Each prompt is paired with a checklist of atomic yes/no questions, enabling fine-grained, interpretable evaluation.
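
To make this pairing concrete, one prompt-plus-checklist record might be represented as in the sketch below. The schema, field names, and example content are illustrative assumptions for exposition, not the benchmark's released data format.

```python
from dataclasses import dataclass, field

@dataclass
class ChecklistItem:
    question: str          # atomic yes/no question about one intended element
    expected: str = "yes"  # each question should be answerable "yes" from a faithful image

@dataclass
class BenchmarkPrompt:
    prompt_id: str
    capability: str        # "composition" or "reasoning"
    dimension: str         # one of the 12 taxonomy dimensions
    prompt: str            # the text given to the T2I model
    checklist: list[ChecklistItem] = field(default_factory=list)

# Hypothetical record for an abductive-reasoning prompt.
example = BenchmarkPrompt(
    prompt_id="abductive-0001",
    capability="reasoning",
    dimension="abductive_commonsense",
    prompt="A kitchen the morning after a large dinner party.",
    checklist=[
        ChecklistItem("Are there multiple used dishes or glasses visible?"),
        ChecklistItem("Is the scene clearly a kitchen?"),
        ChecklistItem("Does the lighting suggest morning?"),
    ],
)
```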

To ensure complexity, prompts are constructed with high scene density (e.g., roughly 20 instances per prompt), multi-step inference, and one-to-many or many-to-one causal chains. Data generation leverages multiple state-of-the-art large reasoning models (LRMs) for diversity, followed by rigorous human verification (Figure 3).

Figure 3: T2I-CoReBench provides challenging examples across all 12 dimensions, including dense composition and multi-step reasoning.

Evaluation Protocol and Automation

The evaluation protocol is centered on checklist-based verification, where each generated image is assessed against a set of objective, atomic questions. This approach overcomes the limitations of CLIPScore and direct MLLM-based scoring, which are unreliable for complex, multi-element scenes and implicit reasoning. The authors employ Gemini 2.5 Flash as the primary MLLM evaluator, selected for its strong human alignment and cost efficiency, and validate results with open-source MLLMs for reproducibility.

The protocol enforces strict visual evidence requirements: a "yes" is only assigned if the queried element is unambiguously present in the image, independent of the prompt. This design ensures that evaluation is robust to hallucinations and prompt-image mismatches.
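
Under these rules, scoring reduces to asking the MLLM judge each checklist question about the image alone and counting affirmative answers. The sketch below assumes the per-image score is the fraction of "yes" judgments; the judge is passed in as a callable so any MLLM backend can be used (the paper's primary judge is Gemini 2.5 Flash), and the instruction wording is illustrative rather than the authors' exact template.

```python
from typing import Callable

# Illustrative judge instructions enforcing the strict visual-evidence rule.
JUDGE_INSTRUCTIONS = (
    "Look only at the image. Answer 'yes' only if the queried element is "
    "unambiguously visible in the image itself; otherwise answer 'no'."
)

def score_image(
    image_path: str,
    checklist: list[str],
    judge: Callable[[str, str], str],  # (image_path, question) -> "yes"/"no"
) -> float:
    """Score one generated image as the fraction of 'yes' judgments."""
    answers = [
        judge(image_path, f"{JUDGE_INSTRUCTIONS}\nQuestion: {q}")
        for q in checklist
    ]
    return sum(a.strip().lower().startswith("yes") for a in answers) / len(answers)
```

Keeping the original prompt out of the judge's context is what makes this check robust: the judge can only confirm what is actually rendered, not what the prompt claims should be there.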

Experimental Results and Analysis

The benchmark is used to evaluate 27 T2I models (21 open-source, 6 closed-source) spanning diffusion, autoregressive, and unified architectures. The results reveal several key findings:

  • Composition is improving but unsolved: Closed-source models (e.g., Imagen 4 Ultra) achieve up to 82.4 in composition, with open-source models (e.g., Qwen-Image) approaching this performance. However, all models struggle with high-density, multi-attribute, and multi-relation prompts, indicating that compositional control remains an open challenge.
  • Reasoning is the primary bottleneck: Even the best models exhibit a significant performance gap between composition and reasoning (e.g., Imagen 4 Ultra: 82.4 vs. 72.9; Qwen-Image: 78.0 vs. 49.3). Reasoning tasks involving implicit inference, multi-step deduction, and abductive reconstruction are particularly challenging.
  • Explicit intermediate reasoning (thinking) yields mixed results: Incorporating intermediate reasoning (e.g., BAGEL w/ Think) improves reasoning scores but can degrade composition, suggesting a trade-off between explicit and implicit information extraction.
  • MLLM-based condition encoding is advantageous: Models leveraging large multimodal encoders (e.g., Qwen-Image) outperform others, highlighting the importance of strong language-vision alignment for both composition and reasoning.

Figure 1: Example outputs from GPT-Image, illustrating the gap between compositional fidelity and reasoning accuracy.

A human alignment study demonstrates that closed-source MLLMs (e.g., Gemini 2.5 Pro, OpenAI o3) outperform open-source models in checklist evaluation, but the best open-source MLLMs (e.g., Qwen2.5-VL-72B) still provide reliable, reproducible results.

Implications and Future Directions

The findings have several implications for the development and evaluation of T2I models:

  • Benchmarking must address both explicit and implicit generation: Faithful image synthesis requires not only compositional accuracy but also the ability to infer and render implicit, contextually appropriate elements.
  • Reasoning remains a critical research frontier: Current architectures, even with LLM/MLLM integration, are insufficient for robust multi-step and abductive reasoning. Progress will require new training data, architectures, and possibly explicit reasoning modules.
  • Automated, fine-grained evaluation is essential: Checklist-based protocols, combined with strong MLLM evaluators, provide scalable, interpretable, and reproducible assessment, enabling rapid iteration and fair comparison across models.
  • Integration of LLM-style reasoning paradigms: Techniques such as chain-of-thought, self-consistency, and retrieval-augmented generation should be explored within T2I pipelines to improve implicit inference and compositional control.
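
As one illustration of the last point, a "reason, then render" pipeline first asks an LLM to expand the implicit prompt into an explicit scene description and only then passes it to the T2I model. The sketch below is a hedged pattern, not the paper's method; the expansion template is an assumption, and the LLM and T2I backends are passed in as generic callables.

```python
from typing import Callable

# Illustrative expansion template; wording is an assumption for exposition.
EXPANSION_TEMPLATE = (
    "Rewrite the following image request as an explicit scene description. "
    "List every object, attribute, and relation that must appear, including "
    "elements that are only implied, and keep all stated constraints.\n\n"
    "Request: {prompt}"
)

def reason_then_render(
    user_prompt: str,
    llm: Callable[[str], str],     # text in -> expanded scene description
    t2i: Callable[[str], object],  # scene description -> generated image
):
    """Expand implicit requirements with an LLM, then render the result."""
    explicit_scene = llm(EXPANSION_TEMPLATE.format(prompt=user_prompt))
    image = t2i(explicit_scene)
    return image, explicit_scene  # keep the intermediate reasoning for inspection
```

Consistent with the mixed BAGEL "w/ Think" results above, such expansion can help the reasoning dimensions while risking loss of dense compositional detail, so the expansion step must preserve every explicit constraint from the original prompt.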

Conclusion

T2I-CoReBench establishes a new standard for comprehensive, complex evaluation of T2I models, revealing that while compositional capabilities are advancing, reasoning remains a significant bottleneck. The benchmark's design and findings underscore the need for future research on reasoning-aware architectures, richer training data, and advanced evaluation protocols. Addressing these challenges is essential for T2I models to move beyond "setting the stage" to "directing the play" in real-world generative tasks.
