Introduction to CompAgent
In pursuit of better compositional text-to-image generation, researchers from Tsinghua University and Huawei's Noah's Ark Lab present CompAgent, an approach designed to generate images from complex text prompts. Compositional text-to-image synthesis is difficult because a single scene often must depict multiple objects with varying attributes and relationships, and current models struggle to render those attributes and relationships accurately and coherently. To manage this complexity, CompAgent employs an LLM as an orchestrating agent that adopts a divide-and-conquer strategy, methodically breaking down and processing the multi-faceted elements of the textual description.
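To make the compositional challenge concrete, the following minimal sketch shows one way a complex prompt could be broken into per-object attributes and pairwise relations that a generator must preserve. It is our own illustration: the prompt, field names, and structure are assumptions, not CompAgent's actual schema.

```python
# Illustrative only: a hypothetical decomposition of a compositional prompt.
# The prompt, field names, and structure are assumptions for exposition.
prompt = "a red cube on top of a blue sphere, next to a yellow cat"

decomposition = {
    "objects": [
        {"name": "cube", "attributes": ["red"]},
        {"name": "sphere", "attributes": ["blue"]},
        {"name": "cat", "attributes": ["yellow"]},
    ],
    "relations": [
        ("cube", "on top of", "sphere"),
        ("sphere", "next to", "cat"),
    ],
}

# A compositional generator must bind each attribute to the right object and
# respect every relation -- exactly the failure modes CompAgent targets.
for obj in decomposition["objects"]:
    print(obj["name"], obj["attributes"])
```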
CompAgent Framework and Mechanisms
CompAgent operates through a sequence of phases governed by an LLM agent. In the first phase, the agent decomposes the text prompt into individual objects and their attributes while predicting a coherent scene layout. In the planning and tool-use phase, the agent reasons over the text, devises a generation strategy that accounts for object attributes and interrelations, and then invokes a suite of specialized tools for image synthesis. The final phase is verification and feedback, in which generated images are checked for attribute correctness and refined with corrections, sometimes incorporating human feedback to improve quality.
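A minimal sketch of this three-phase flow, treating each phase as a black box; every function name below is a hypothetical placeholder rather than CompAgent's actual API.

```python
# Hedged sketch of the divide-and-conquer flow described above.
# All helpers are hypothetical placeholders, not the paper's interfaces.

def decompose_prompt(prompt):
    """Phase 1 (assumed): an LLM splits the prompt into objects with
    attributes and predicts a coherent scene layout."""
    raise NotImplementedError("stand-in for the LLM decomposition step")

def plan_and_generate(prompt, objects, layout):
    """Phase 2 (assumed): the LLM agent reasons about attributes and
    relationships, selects tools, and synthesizes a candidate image."""
    raise NotImplementedError("stand-in for tool selection and synthesis")

def verify_and_correct(image, objects, layout):
    """Phase 3 (assumed): a multimodal model checks attribute correctness;
    flagged errors are fixed by local editing, optionally with human feedback."""
    raise NotImplementedError("stand-in for verification and feedback")

def comp_agent(prompt):
    objects, layout = decompose_prompt(prompt)
    image = plan_and_generate(prompt, objects, layout)
    return verify_and_correct(image, objects, layout)
```

In the system itself, the planning and tool selection are performed by the LLM agent's own reasoning rather than by fixed rules as in this stub.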
Tools and Models for Image Composition
CompAgent introduces a triad of tools for composing multiple objects into a seamless image, guided by the specified scene layout. The first is a tuning-free multi-concept customization model that ensures attribute fidelity under spatial layout constraints. The second, a layout-to-image generation model, captures object relationships by generating images conditioned on the bounding-box layout. The third, a local image editing method, corrects attribute errors identified during verification. The toolkit also includes state-of-the-art text-to-image generation models and multimodal models for handling simpler text prompts and assessing attribute correctness.
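One way to picture the agent's routing among these tools is a simple dispatch over the decomposed prompt. The sketch below is our own illustration: the predicates, tool labels, and interface are assumptions, not the paper's implementation.

```python
# Illustrative routing of a decomposed prompt to the tools described above.
# Predicate names, tool labels, and the interface are assumptions.

def choose_tool(objects, relations, attributes_may_leak):
    """Pick a generation tool for one prompt (hypothetical logic)."""
    if len(objects) <= 1 and not relations:
        # Simple prompts: a standard state-of-the-art text-to-image model suffices.
        return "text_to_image"
    if attributes_may_leak:
        # Multiple objects whose attributes must stay bound to the right object:
        # the tuning-free multi-concept customization model, constrained by the
        # predicted spatial layout.
        return "multi_concept_customization"
    # Object relationships: generate conditioned on the bounding-box layout.
    return "layout_to_image"

# Attribute errors flagged by the verifier would then be patched in place,
# e.g. image = local_edit(image, errors)  # local image editing (assumed name)
```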
Empirical Evaluation of CompAgent
CompAgent is empirically validated on T2I-CompBench, an open-world benchmark for compositional text-to-image generation, where it improves on existing approaches by more than 10%. Its versatility is further demonstrated by its applicability to related tasks such as image editing and object placement, underscoring its promise as a robust solution for generating high-fidelity images from complex descriptive texts.
Conclusion and Contributions
CompAgent represents a significant advance in text-to-image generation, emphasizing controlled, detailed, and relationship-aware image synthesis. It demonstrates the value of combining an LLM agent's planning and reasoning capabilities with a diverse toolkit that covers a spectrum of compositional scenarios. Spanning planning, execution, and verification, CompAgent offers a new paradigm in image generation that promises more intuitive and controlled creation of visual content from textual descriptions.