Overview of "GoT: Unleashing Reasoning Capability of Multimodal LLM for Visual Generation and Editing"
The paper "GoT: Unleashing Reasoning Capability of Multimodal LLM for Visual Generation and Editing" introduces the Generation Chain-of-Thought (GoT) framework, a paradigm that integrates explicit reasoning into text-to-image (T2I) generation and editing. It addresses the gap between the advanced reasoning capabilities of multimodal large language models (MLLMs) and the direct prompt-to-image mapping of conventional image generators.
Methodology
GoT Paradigm:
GoT leverages semantic-spatial reasoning: before producing an image, the model generates a step-by-step natural-language explanation that guides generation. Each reasoning chain combines semantic analysis with spatial coordination, equipping the model with a comprehensive understanding of the visual scene. Because the reasoning pairs semantic descriptions with precise spatial arrangements, it is distinctly multimodal and makes image composition explicit.
Data Construction:
The authors construct a large-scale dataset of over 9 million samples to train semantic-spatial reasoning chains for visual tasks. Built with an automated annotation pipeline that leverages LLMs and MLLMs, the dataset captures detailed semantic-spatial relationships, providing a solid foundation for the reasoning-based framework.
Unified Framework:
The GoT framework couples Qwen2.5-VL, a state-of-the-art MLLM, with a diffusion model through a novel Semantic-Spatial Guidance Module. The module enables end-to-end generation by channeling both semantic guidance and explicit spatial control into the diffusion process. Because the reasoning chain is an explicit intermediate, users can inspect and modify individual reasoning steps, steering image edits toward their stated preferences.
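As an illustration of how explicit spatial control could be derived from GoT coordinates, the sketch below rasterizes normalized bounding boxes into a latent-resolution mask. This is a hypothetical stand-in for the Semantic-Spatial Guidance Module's internals, which the paper implements inside the diffusion model; the function name and the hard-mask formulation are assumptions:

```python
def boxes_to_mask(boxes, size=64):
    """Rasterize normalized (x1, y1, x2, y2) boxes into a size x size
    binary mask at latent resolution: one plausible way spatial
    guidance could gate where the diffusion model places each object."""
    mask = [[0.0] * size for _ in range(size)]
    for x1, y1, x2, y2 in boxes:
        c1, r1 = int(x1 * size), int(y1 * size)
        c2, r2 = int(x2 * size), int(y2 * size)
        for r in range(r1, r2):
            for c in range(c1, c2):
                mask[r][c] = 1.0
    return mask


# A single box covering the bottom-right quadrant of the frame.
mask = boxes_to_mask([(0.5, 0.5, 1.0, 1.0)])
```

In the actual framework the spatial signal is injected as conditioning features rather than a hard binary mask; the sketch only shows the coordinate-to-spatial-layout translation step.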
Results
The results demonstrate the effective integration of reasoning into visual generation, outperforming baseline methods in both image generation and editing tasks. Specifically, GoT achieves superior performance on benchmarks like GenEval, particularly excelling in tasks requiring object arrangement and attribute binding.
Qualitative Evaluation:
Compared with state-of-the-art models, GoT maintains competitive accuracy in image-attribute alignment while adding interactive customization. This is a practical advantage over traditional models: users can manipulate the output image by editing the reasoning steps directly, achieving the desired edits with fine-grained semantic-spatial alignment.
Implications and Future Work
Reasoning-driven visual synthesis has substantial implications for advancing AI's comprehension of visual tasks, mirroring how humans reason about scenes. GoT's interactive capabilities point toward user-centric applications in which people engage dynamically with automated image generation systems.
The research opens avenues for further exploration in multimodal interactions, where more complex reasoning capabilities could be integrated into diverse visual tasks, including video synthesis and real-time digital media applications. Moreover, this framework can potentially extend to other domains requiring reasoning-based synthesis, suggesting a paradigm shift in how AI models interpret and generate multimodal content.