GenArtist: Multimodal LLM as an Agent for Unified Image Generation and Editing

Published 8 Jul 2024 in cs.CV | (2407.05600v2)

Abstract: Despite the success achieved by existing image generation and editing methods, current models still struggle with complex problems including intricate text prompts, and the absence of verification and self-correction mechanisms makes the generated images unreliable. Meanwhile, a single model tends to specialize in particular tasks and possess the corresponding capabilities, making it inadequate for fulfilling all user requirements. We propose GenArtist, a unified image generation and editing system, coordinated by a multimodal LLM (MLLM) agent. We integrate a comprehensive range of existing models into the tool library and utilize the agent for tool selection and execution. For a complex problem, the MLLM agent decomposes it into simpler sub-problems and constructs a tree structure to systematically plan the procedure of generation, editing, and self-correction with step-by-step verification. By automatically generating missing position-related inputs and incorporating position information, the appropriate tool can be effectively employed to address each sub-problem. Experiments demonstrate that GenArtist can perform various generation and editing tasks, achieving state-of-the-art performance and surpassing existing models such as SDXL and DALL-E 3, as can be seen in Fig. 1. Project page is https://zhenyuw16.github.io/GenArtist_page.

Summary

The paper introduces a unified multimodal LLM agent that systematically coordinates complex image generation and editing tasks.
It presents a planning tree with integrated step-by-step verification to ensure accuracy and aesthetic quality at every stage.
The system reports state-of-the-art results, improving on DALL-E 3 by more than 7% on T2I-CompBench and achieving leading performance on the MagicBrush image editing benchmark.

Overview of GenArtist: Multimodal LLM as an Agent for Unified Image Generation and Editing

The paper presents GenArtist, a system that addresses open challenges in image generation and editing by using a multimodal LLM (MLLM) as a central coordinating agent. Previous models, despite their advances, struggle with complex tasks that involve intricate text prompts. They also lack mechanisms for verification and self-correction, which leads to inconsistent and unreliable image quality. GenArtist tackles both problems with an integrated approach: it draws on a diverse toolset and uses the MLLM agent to plan and execute tasks systematically.

GenArtist integrates a wide range of existing models into a comprehensive tool library, from which the MLLM agent selects and executes the most appropriate tool for a given task. The agent decomposes an intricate problem into simpler sub-problems, which improves reliability. For generation tasks, it parses the text prompt into discrete object concepts and background elements. For editing tasks, it breaks a complex instruction into single-step editing actions, so that each step can be executed more precisely. When a tool needs position-related inputs that the user did not supply, the agent generates them automatically.
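The tool selection described above can be pictured with a small registry. The following is a minimal sketch, not the paper's implementation: the `Tool` and `SubProblem` classes, the tool names, the skill labels, and the hand-written decomposition are all hypothetical placeholders, and in GenArtist the decomposition and selection are performed by the MLLM agent rather than by a lookup.

```python
from dataclasses import dataclass


@dataclass
class Tool:
    name: str                     # hypothetical tool identifier
    task: str                     # "generation" or "editing"
    skills: set                   # capabilities the tool is assumed to have
    needs_position: bool = False  # True if the tool requires layout or box inputs


@dataclass
class SubProblem:
    task: str         # "generation" or "editing"
    skill: str        # capability this sub-problem calls for
    instruction: str  # single-step instruction text


# Placeholder library; these entries are illustrative, not the paper's tool list.
TOOL_LIBRARY = [
    Tool("text_to_image", "generation", {"general"}),
    Tool("layout_to_image", "generation", {"general", "multi_object"}, needs_position=True),
    Tool("object_addition", "editing", {"add"}, needs_position=True),
    Tool("attribute_edit", "editing", {"attribute"}),
]


def select_tool(task, skill, library=TOOL_LIBRARY):
    """Return the first tool that matches the task type and skill, else None."""
    for tool in library:
        if tool.task == task and skill in tool.skills:
            return tool
    return None


# A complex edit, decomposed by hand here into single-step sub-problems.
steps = [
    SubProblem("editing", "add", "add a hat to the dog"),
    SubProblem("editing", "attribute", "make the hat red"),
]

for step in steps:
    tool = select_tool(step.task, step.skill)
    note = " (position input must be generated first)" if tool.needs_position else ""
    print(f"{step.instruction!r} -> {tool.name}{note}")
```

The `needs_position` flag marks the tools for which missing position-related inputs would have to be produced before the tool can run.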

The core innovation of GenArtist lies in its planning tree structure enriched with step-by-step verification. This system constructs a planning tree, where each node represents a generation or editing operation, with verification mechanisms at each step to ensure correctness before progressing to subsequent tasks. The agent's verification process not only addresses correctness concerning object attributes and relationships but also assesses aesthetic quality.
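A planning tree with step-by-step verification can be sketched as a depth-first walk that backtracks when a check fails. This is a toy illustration under stated assumptions, not the paper's algorithm: `PlanNode`, `execute`, and `verify` are hypothetical names, images are stood in for by strings, child nodes are treated as alternative correction steps, and the verifier is a simple substring check where GenArtist would query the MLLM.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class PlanNode:
    name: str
    # Takes the current "image" (a string in this toy example), returns a new one.
    operation: Callable[[Optional[str]], str]
    # Alternative follow-up steps to try if this node's result fails verification.
    children: List["PlanNode"] = field(default_factory=list)


def execute(node, image, verify, trace):
    """Run the node, verify its result, and fall back to child nodes on failure.

    Returns the first result that passes verification, or None if the whole
    subtree fails.
    """
    result = node.operation(image)
    trace.append(node.name)
    if verify(result):
        return result
    for child in node.children:
        fixed = execute(child, result, verify, trace)
        if fixed is not None:
            return fixed
    return None


# Toy verifier: every required concept must appear in the result string.
required = {"dog", "red hat"}


def verify(image):
    return all(concept in image for concept in required)


root = PlanNode(
    "generate",
    lambda _: "dog",
    [
        PlanNode("edit_with_tool_a", lambda img: img + " + blue hat"),
        PlanNode("edit_with_tool_b", lambda img: img + " + red hat"),
    ],
)

trace = []
final = execute(root, None, verify, trace)
print(final)  # dog + red hat
print(trace)  # ['generate', 'edit_with_tool_a', 'edit_with_tool_b']
```

In the toy run, the initial generation and the first edit both fail verification, so the walk backtracks and the second edit is tried and accepted.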

The results reported in the paper indicate that GenArtist achieves state-of-the-art performance across multiple benchmarks. Specifically, GenArtist improves on DALL-E 3 by more than 7% on the T2I-CompBench benchmark for text-to-image generation and achieves leading performance on the MagicBrush benchmark for image editing. These results support the effectiveness of a unified approach that combines step-by-step verification with a diverse set of model-based tools.

The implications of this research are substantial both theoretically and practically. Theoretically, GenArtist exemplifies the potential of utilizing AI agents as central coordinators for complex visual tasks, laying a foundation for further exploration into agent-enriched multimodal systems. Practically, the system's unified approach to image generation and editing with high degrees of accuracy and control opens avenues for more reliable deployment in applications requiring precise visual outputs.

Looking ahead, the field is likely to see more agents that interact with diverse model-based tools, extending such systems to a broader range of tasks. Improvements in the positional and contextual sensitivity of MLLMs could also make these systems more effective, since the agent depends on that sensitivity when it generates position-related inputs and verifies results.

Through its comprehensive design and demonstrable performance improvements, GenArtist presents a well-rounded advancement towards realizing highly reliable and versatile image generation and editing systems.
