Introduction to CompAgent
In pursuit of better compositional text-to-image generation, researchers from Tsinghua University and Huawei's Noah's Ark Lab present CompAgent, an approach designed to generate images from complex text prompts. Compositional text-to-image synthesis is difficult because a single scene often must depict multiple objects with varying attributes and relationships, and current models struggle to render those attributes and relationships accurately and coherently. To manage this complexity, CompAgent employs an LLM as an orchestrating agent that adopts a divide-and-conquer strategy, methodically breaking down and processing the multi-faceted elements of the textual description.
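To make the compositional challenge concrete, the following minimal sketch shows one way a complex prompt could be broken into per-object attributes and pairwise relations that a generator must preserve. It is our own illustration: the prompt, field names, and structure are assumptions, not CompAgent's actual schema.

```python
# Illustrative only: a hypothetical decomposition of a compositional prompt.
# The prompt, field names, and structure are assumptions for exposition.
prompt = "a red cube on top of a blue sphere, next to a yellow cat"

decomposition = {
    "objects": [
        {"name": "cube", "attributes": ["red"]},
        {"name": "sphere", "attributes": ["blue"]},
        {"name": "cat", "attributes": ["yellow"]},
    ],
    "relations": [
        ("cube", "on top of", "sphere"),
        ("sphere", "next to", "cat"),
    ],
}

# A compositional generator must bind each attribute to the right object and
# respect every relation -- exactly the failure modes CompAgent targets.
for obj in decomposition["objects"]:
    print(obj["name"], obj["attributes"])
```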
CompAgent Framework and Mechanisms
CompAgent operates through a sequence of phases governed by an LLM agent. In the first phase, the agent decomposes the text prompt into individual objects and their attributes while predicting a coherent scene layout. In the planning and tool-use phase, the agent reasons over the text, devises a generation strategy that accounts for object attributes and interrelations, and then invokes a suite of specialized tools for image synthesis. The final phase is verification and feedback, in which generated images are checked for attribute correctness and refined with corrections, sometimes incorporating human feedback to improve quality.
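A minimal sketch of this three-phase flow, treating each phase as a black box; every function name below is a hypothetical placeholder rather than CompAgent's actual API.

```python
# Hedged sketch of the divide-and-conquer flow described above.
# All helpers are hypothetical placeholders, not the paper's interfaces.

def decompose_prompt(prompt):
    """Phase 1 (assumed): an LLM splits the prompt into objects with
    attributes and predicts a coherent scene layout."""
    raise NotImplementedError("stand-in for the LLM decomposition step")

def plan_and_generate(prompt, objects, layout):
    """Phase 2 (assumed): the LLM agent reasons about attributes and
    relationships, selects tools, and synthesizes a candidate image."""
    raise NotImplementedError("stand-in for tool selection and synthesis")

def verify_and_correct(image, objects, layout):
    """Phase 3 (assumed): a multimodal model checks attribute correctness;
    flagged errors are fixed by local editing, optionally with human feedback."""
    raise NotImplementedError("stand-in for verification and feedback")

def comp_agent(prompt):
    objects, layout = decompose_prompt(prompt)
    image = plan_and_generate(prompt, objects, layout)
    return verify_and_correct(image, objects, layout)
```

In the system itself, the planning and tool selection are performed by the LLM agent's own reasoning rather than by fixed rules as in this stub.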
Tools and Models for Image Composition
CompAgent introduces a triad of tools for composing multiple objects into a seamless image, guided by the specified scene layout. The first is a tuning-free multi-concept customization model that ensures attribute fidelity under spatial layout constraints. The second, a layout-to-image generation model, captures object relationships by generating images conditioned on the bounding-box layout. The third, a local image editing method, corrects attribute errors identified during verification. The toolkit also includes state-of-the-art text-to-image generation models and multimodal models for handling simpler text prompts and assessing attribute correctness.
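One way to picture the agent's routing among these tools is a simple dispatch over the decomposed prompt. The sketch below is our own illustration: the predicates, tool labels, and interface are assumptions, not the paper's implementation.

```python
# Illustrative routing of a decomposed prompt to the tools described above.
# Predicate names, tool labels, and the interface are assumptions.

def choose_tool(objects, relations, attributes_may_leak):
    """Pick a generation tool for one prompt (hypothetical logic)."""
    if len(objects) <= 1 and not relations:
        # Simple prompts: a standard state-of-the-art text-to-image model suffices.
        return "text_to_image"
    if attributes_may_leak:
        # Multiple objects whose attributes must stay bound to the right object:
        # the tuning-free multi-concept customization model, constrained by the
        # predicted spatial layout.
        return "multi_concept_customization"
    # Object relationships: generate conditioned on the bounding-box layout.
    return "layout_to_image"

# Attribute errors flagged by the verifier would then be patched in place,
# e.g. image = local_edit(image, errors)  # local image editing (assumed name)
```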
Empirical Evaluation of CompAgent
CompAgent is empirically validated on T2I-CompBench, an open-world benchmark for compositional text-to-image generation, where it improves on existing approaches by more than 10%. Its versatility is further demonstrated by its applicability to related tasks such as image editing and object placement, underscoring its promise as a robust solution for generating high-fidelity images from complex descriptive texts.
Conclusion and Contributions
CompAgent represents a significant advance in text-to-image generation, emphasizing controlled, detailed, and relationship-aware image synthesis. It demonstrates the value of combining an LLM agent's planning and reasoning capabilities with a diverse toolkit that covers a spectrum of compositional scenarios. Spanning planning, execution, and verification, CompAgent offers a new paradigm in image generation that promises more intuitive and controlled creation of visual content from textual descriptions.