
Draw ALL Your Imagine: A Holistic Benchmark and Agent Framework for Complex Instruction-based Image Generation (2505.24787v1)

Published 30 May 2025 in cs.CV and cs.CL

Abstract: Recent advancements in text-to-image (T2I) generation have enabled models to produce high-quality images from textual descriptions. However, these models often struggle with complex instructions involving multiple objects, attributes, and spatial relationships. Existing benchmarks for evaluating T2I models primarily focus on general text-image alignment and fail to capture the nuanced requirements of complex, multi-faceted prompts. Given this gap, we introduce LongBench-T2I, a comprehensive benchmark specifically designed to evaluate T2I models under complex instructions. LongBench-T2I consists of 500 intricately designed prompts spanning nine diverse visual evaluation dimensions, enabling a thorough assessment of a model's ability to follow complex instructions. Beyond benchmarking, we propose an agent framework (Plan2Gen) that facilitates complex instruction-driven image generation without requiring additional model training. This framework integrates seamlessly with existing T2I models, using LLMs to interpret and decompose complex prompts, thereby guiding the generation process more effectively. As existing evaluation metrics, such as CLIPScore, fail to adequately capture the nuances of complex instructions, we introduce an evaluation toolkit that automates the quality assessment of generated images using a set of multi-dimensional metrics. The data and code are released at https://github.com/yczhou001/LongBench-T2I.

A Holistic Benchmark and Agent Framework for Complex Instruction-based Image Generation

In the domain of text-to-image (T2I) generation, recent advancements have facilitated the creation of models capable of converting textual descriptions into high-quality images. However, as these models advance, they often encounter difficulties when processing complex instructions that involve multiple objects, detailed attributes, and intricate spatial relationships. The paper "Draw ALL Your Imagine: A Holistic Benchmark and Agent Framework for Complex Instruction-based Image Generation" addresses this challenge by introducing the LongBench-T2I benchmark and proposing a novel agent framework called Plan2Gen.

LongBench-T2I: Benchmark Overview

LongBench-T2I is a comprehensive benchmark crafted to rigorously evaluate T2I models under complex instructions. It consists of 500 prompts, each meticulously constructed to span nine distinct visual evaluation dimensions, allowing researchers to assess different facets of a model's ability to interpret and adhere to detailed instructions.
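For concreteness, each benchmark entry can be pictured as a long prompt paired with per-dimension evaluation targets. The following is a minimal sketch of such a record; the field names and dimension labels are illustrative placeholders, not the dataset's actual schema, which is available in the released repository.

```python
from dataclasses import dataclass, field

# Illustrative labels for the nine evaluation dimensions.
# These names are hypothetical stand-ins; consult the released
# dataset for the actual dimension definitions.
DIMENSIONS = [
    "object", "background", "color", "texture", "lighting",
    "text", "composition", "pose", "special_effects",
]

@dataclass
class BenchmarkEntry:
    """One LongBench-T2I-style record (illustrative layout only)."""
    prompt: str  # the long, complex instruction to render
    dimension_notes: dict = field(default_factory=dict)  # per-dimension expectations

entry = BenchmarkEntry(
    prompt="A rain-soaked night market: a vendor in a red apron ...",
    dimension_notes={d: "expected detail for this dimension" for d in DIMENSIONS},
)
```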

Existing benchmarks, such as DrawBench, DPG-Bench, and T2I-CompBench, focus primarily on basic compositional capabilities, like object relation and attribute binding. These benchmarks, while useful, often lack the depth needed to fully evaluate a model's performance on multifaceted prompts. LongBench-T2I fills this gap by providing a standardized evaluation framework that captures more complex scene compositions and interactions, potentially catalyzing the development of more refined models.

Plan2Gen: Agent Framework

Plan2Gen is introduced as an innovative framework for generating images from complex instructions without requiring additional model training. By leveraging LLMs to interpret and decompose complex prompts, Plan2Gen directs the image generation process via a structured approach:

  1. Scene Decomposition: An LLM analyzes the complex instruction and decomposes the scene into three primary components: background, midground, and foreground.
  2. Iterative Generation: Each layer is generated in turn and validated against its sub-prompt. If inconsistencies with the original instruction are detected, the framework triggers a refinement pass, repeating until the layer satisfactorily aligns with the sub-prompt or a predefined iteration limit is reached (see the sketch after this list).
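This control flow amounts to a plan-then-generate loop. The Python sketch below is a minimal reconstruction under stated assumptions: `llm_decompose`, `generate_layer`, and `llm_validate` are hypothetical stand-ins for an LLM planner, a T2I backend, and an LLM/VLM judge, respectively; they are not functions from the paper's released code, and the refinement cap is assumed.

```python
MAX_REFINEMENTS = 3  # assumed cap on refinement rounds per layer

def plan2gen(instruction, llm_decompose, generate_layer, llm_validate):
    """Sketch of a layered plan-then-generate loop in the spirit of Plan2Gen.

    Assumed (hypothetical) interfaces:
      llm_decompose(instruction) -> {"background": str, "midground": str, "foreground": str}
      generate_layer(sub_prompt, canvas) -> image with the new layer composited in
      llm_validate(image, sub_prompt) -> (ok: bool, feedback: str)
    """
    sub_prompts = llm_decompose(instruction)  # 1. scene decomposition
    canvas = None
    for layer in ("background", "midground", "foreground"):
        sub_prompt = sub_prompts[layer]
        for _ in range(MAX_REFINEMENTS):      # 2. iterative generation
            canvas = generate_layer(sub_prompt, canvas)
            ok, feedback = llm_validate(canvas, sub_prompt)
            if ok:
                break
            # Fold the judge's feedback back into the sub-prompt and retry.
            sub_prompt = f"{sub_prompt}\nFix: {feedback}"
    return canvas
```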

This approach not only improves the alignment of generated images with textual prompts but also outperforms existing generation methods in compositional complexity and detail fidelity.

Experimental Insights

Through extensive experimental evaluations, Plan2Gen notably exceeds the performance of several leading proprietary and open-source models on the LongBench-T2I benchmark. The framework's ability to produce coherent and highly detailed scenes reflects its robustness, outperforming other models across multiple visual dimensions such as background consistency, lighting accuracy, and composition fidelity.
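Because CLIPScore alone does not capture these per-dimension requirements, the paper's evaluation toolkit scores each dimension separately with an LLM-based judge. Below is a minimal aggregation sketch, assuming a `judge` callable that returns a numeric score per dimension; the judge interface and scoring scale are assumptions for illustration, not the toolkit's actual API.

```python
from statistics import mean

def evaluate_image(image, prompt, dimensions, judge):
    """Score one generated image on each dimension and average.

    Assumed interface: judge(image, prompt, dimension) -> float score
    (e.g., 1-5). Both the callable and the scale are hypothetical.
    """
    scores = {d: judge(image, prompt, d) for d in dimensions}
    scores["overall"] = mean(scores.values())  # simple unweighted mean
    return scores
```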

Human evaluations further underscore Plan2Gen's efficacy, consistently rating its outputs favorably against strong models such as GPT-4o. This indicates that the framework's layered planning and iterative validation effectively address the intrinsic challenges of long-context image generation.

Implications and Future Work

The introduction of LongBench-T2I offers a substantial contribution to the field, enabling more nuanced evaluation of T2I models and encouraging innovations in complex instruction-following capabilities. The demonstration of Plan2Gen's effectiveness suggests promising paths for future research and development of T2I models that better adapt to intricate and user-specific demands.

Looking forward, the paper indicates several possible avenues for further exploration. These include refining the scene decomposition method, optimizing iterative validation processes, and investigating the potential for integrating additional multimodal cues to enhance the fidelity of generated scenes. As these developments unfold, they are poised to drive significant progress in AI's ability to process complex instructions with precision and creativity.

By addressing the limitations of existing evaluation metrics and introducing a structured generation approach, the paper lays an essential foundation for advancing T2I model capabilities, potentially leading AI toward more authentic interaction and response to complex human inputs.

Authors (3)
  1. Yucheng Zhou (37 papers)
  2. Jiahao Yuan (16 papers)
  3. Qianning Wang (7 papers)