Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs (2401.11708v3)

Published 22 Jan 2024 in cs.CV, cs.AI, and cs.LG

Abstract: Diffusion models have exhibited exceptional performance in text-to-image generation and editing. However, existing methods often face challenges when handling complex text prompts that involve multiple objects with multiple attributes and relationships. In this paper, we propose a brand new training-free text-to-image generation/editing framework, namely Recaption, Plan and Generate (RPG), harnessing the powerful chain-of-thought reasoning ability of multimodal LLMs to enhance the compositionality of text-to-image diffusion models. Our approach employs the MLLM as a global planner to decompose the process of generating complex images into multiple simpler generation tasks within subregions. We propose complementary regional diffusion to enable region-wise compositional generation. Furthermore, we integrate text-guided image generation and editing within the proposed RPG in a closed-loop fashion, thereby enhancing generalization ability. Extensive experiments demonstrate that our RPG outperforms state-of-the-art text-to-image diffusion models, including DALL-E 3 and SDXL, particularly in multi-category object composition and text-image semantic alignment. Notably, our RPG framework exhibits wide compatibility with various MLLM architectures (e.g., MiniGPT-4) and diffusion backbones (e.g., ControlNet). Our code is available at: https://github.com/YangLing0818/RPG-DiffusionMaster

Introduction

The paper introduces a new text-to-image generation and editing framework called Recaption, Plan and Generate (RPG), which leverages the chain-of-thought reasoning ability of multimodal LLMs (MLLMs) to enhance the compositionality of diffusion models. RPG employs an MLLM as a global planner that decomposes a complex image generation task into simpler sub-tasks, each tied to a distinct subregion of the image. It introduces complementary regional diffusion for region-wise compositional generation and unifies text-guided generation and editing in a closed-loop fashion. Experiments show that RPG outperforms established models such as DALL-E 3 and SDXL, particularly on complex prompts involving multiple object categories and on text-image semantic alignment.

Methodology Overview

The RPG framework requires no additional training and follows a three-stage strategy: Multimodal Recaptioning, Chain-of-Thought (CoT) Planning, and Complementary Regional Diffusion. The MLLM first decomposes the text prompt into descriptive subprompts, providing richer detail that improves semantic alignment during the diffusion process. CoT planning then assigns each subprompt to a complementary subregion, recasting the complex generation task as a collection of simpler regional ones. Finally, complementary regional diffusion generates each subregion under its subprompt and spatially merges the results, sidestepping content conflicts where regions overlap.
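
To make the three-stage pipeline concrete, the sketch below mimics its control flow in plain Python/NumPy. The helpers mllm_recaption_and_plan and denoise_region are hypothetical stubs standing in for the MLLM planner and a text-conditioned diffusion denoiser, and the simple averaging over overlapping boxes is only an illustration of region-wise merging, not the paper's exact weighting scheme.

```python
import numpy as np

# Hypothetical stand-ins: a real implementation would call an MLLM (e.g., GPT-4 or
# MiniGPT-4) for recaptioning/planning and a diffusion backbone (e.g., SDXL) for denoising.
def mllm_recaption_and_plan(prompt):
    """Recaption the prompt into subprompts and assign each an (x0, y0, x1, y1) latent box."""
    return [
        ("a red vintage car parked on the left", (0, 0, 32, 64)),
        ("a golden retriever sitting on the right", (32, 0, 64, 64)),
    ]

def denoise_region(latent_region, subprompt, step):
    """Placeholder for one text-conditioned denoising step on a regional latent."""
    return latent_region - 0.01 * np.random.randn(*latent_region.shape)

def rpg_generate(prompt, steps=50, latent_hw=(64, 64), channels=4):
    plan = mllm_recaption_and_plan(prompt)            # 1) recaptioning + 2) CoT planning
    latent = np.random.randn(channels, *latent_hw)    # shared global latent
    for step in range(steps):                         # 3) complementary regional diffusion
        merged = np.zeros_like(latent)
        counts = np.zeros(latent_hw)
        for subprompt, (x0, y0, x1, y1) in plan:
            region = denoise_region(latent[:, y0:y1, x0:x1], subprompt, step)
            merged[:, y0:y1, x0:x1] += region         # paste the regional result back
            counts[y0:y1, x0:x1] += 1.0
        latent = merged / np.maximum(counts, 1.0)     # average where regions overlap
    return latent
```

The structural point is that every denoising step operates on per-region latents guided by per-region subprompts, and the regions are merged back into a single latent before the next step, which is what lets one image satisfy several subprompts at once.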

Compositional Generation and Editing

The RPG framework demonstrates versatility in handling both generation and editing tasks. For editing, it employs MLLMs to provide feedback identifying semantic discrepancies between generated images and target prompts, leverages CoT planning to delineate editing instructions, and utilizes contour-based diffusion for precise region modification. The framework refines the generation process iteratively through a closed-loop implementation that incorporates feedback from earlier rounds of editing.
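
The closed loop can be pictured as a critique-and-edit cycle. In the sketch below, mllm_feedback and regional_diffusion_edit are hypothetical stubs for the MLLM critic and the region-restricted diffusion editor; they illustrate only the loop structure, not the actual models or the paper's contour extraction.

```python
import numpy as np

def mllm_feedback(image, target_prompt):
    """Hypothetical MLLM critic: report whether the image matches the prompt and,
    if not, return (edit instruction, region mask) pairs for the discrepancies."""
    mask = np.zeros(image.shape[:2], dtype=bool)
    mask[64:192, 64:192] = True                        # toy region flagged for editing
    return False, [("recolor the cup from green to blue", mask)]

def regional_diffusion_edit(image, instruction, mask):
    """Placeholder for a contour/region-guided diffusion edit restricted to `mask`."""
    edited = image.copy()
    edited[mask] = np.clip(edited[mask] + 0.01, 0.0, 1.0)
    return edited

def rpg_edit_closed_loop(image, target_prompt, max_rounds=3):
    for _ in range(max_rounds):
        aligned, edit_plan = mllm_feedback(image, target_prompt)      # spot discrepancies
        if aligned:
            break
        for instruction, mask in edit_plan:                           # apply planned regional edits
            image = regional_diffusion_edit(image, instruction, mask)
    return image

edited = rpg_edit_closed_loop(np.random.rand(256, 256, 3), "a blue cup on a wooden table")
```

Each round feeds the previously edited image back to the critic, so discrepancies that survive one round can still be corrected in a later one.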

Experiments and Findings

RPG is evaluated extensively: qualitative figures in the paper illustrate its superiority in aligning complex textual prompts with generated image content, and quantitative results are reported on multiple datasets and benchmarks, including T2I-CompBench. RPG adapts to different MLLM architectures and diffusion backbones, demonstrating its flexibility and potential for wide application. In image editing comparisons, RPG outperforms other state-of-the-art methods, producing more precise and semantically aligned edits, and iterative refinement yields further improvements.

Conclusion and Outlook

The RPG framework sets a new bar in handling complex and compositional text-to-image tasks, effectively leveraging the reasoning capabilities of MLLMs to plan image compositions for diffusion models. It presents a training-free, versatile approach and is compatible with various architecture types. Future research will aim at expanding the RPG framework to accommodate even more complex modalities and apply it to a broader spectrum of practical scenarios, solidifying text-to-image generation's position as a key technology in creative and design applications.

References (80)
  1. Spatext: Spatio-textual representation for controllable image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  18370–18380, 2023.
  2. Multidiffusion: Fusing diffusion paths for controlled image generation. arXiv preprint arXiv:2302.08113, 2023.
  3. Improving image generation with better captions. https://cdn.openai.com/papers/dall-e-3.pdf, 2023.
  4. Instructpix2pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  18392–18402, 2023.
  5. Masactrl: Tuning-free mutual self-attention control for consistent image synthesis and editing. arXiv preprint arXiv:2304.08465, 2023.
  6. Introducing ChatGPT. OpenAI, 2022.
  7. Attend-and-excite: Attention-based semantic guidance for text-to-image diffusion models. ACM Transactions on Graphics (TOG), 42(4):1–10, 2023.
  8. Pixart-alpha: Fast training of diffusion transformer for photorealistic text-to-image synthesis. arXiv preprint arXiv:2310.00426, 2023a.
  9. Training-free layout control with cross-attention guidance. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp.  5343–5353, 2024.
  10. Llava-interactive: An all-in-one demo for image chat, segmentation, generation and editing. arXiv preprint arXiv:2311.00571, 2023b.
  11. Palm: Scaling language modeling with pathways. Journal of Machine Learning Research, 24(240):1–113, 2023.
  12. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416, 2022.
  13. Generative adversarial networks: An overview. IEEE signal processing magazine, 35(1):53–65, 2018.
  14. Diffusion models beat gans on image synthesis. Advances in neural information processing systems, 34:8780–8794, 2021.
  15. Dreamllm: Synergistic multimodal comprehension and creation. arXiv preprint arXiv:2309.11499, 2023.
  16. Boosting text-to-image diffusion models with fine-grained semantic rewards. arXiv preprint arXiv:2305.19599, 2023.
  17. Training-free structured diffusion guidance for compositional text-to-image synthesis. In The Eleventh International Conference on Learning Representations, 2022.
  18. Layoutgpt: Compositional visual planning and generation with large language models. arXiv preprint arXiv:2305.15393, 2023.
  19. Guiding instruction-based image editing via multimodal large language models. arXiv preprint arXiv:2309.17102, 2023.
  20. Visual programming: Compositional visual reasoning without training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  14953–14962, 2023.
  21. Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626, 2022.
  22. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.
  23. T2i-compbench: A comprehensive benchmark for open-world compositional text-to-image generation. arXiv preprint arXiv:2307.06350, 2023a.
  24. Composer: Creative and controllable image synthesis with composable conditions. arXiv preprint arXiv:2302.09778, 2023b.
  25. Opt-iml: Scaling language model instruction meta learning through the lens of generalization. arXiv preprint arXiv:2212.12017, 2022.
  26. Segment anything. arXiv preprint arXiv:2304.02643, 2023.
  27. Generating images with multimodal language models. arXiv preprint arXiv:2305.17216, 2023.
  28. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023a.
  29. Gligen: Open-set grounded text-to-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  22511–22521, 2023b.
  30. Stablellava: Enhanced visual instruction tuning with synthesized image-dialogue data. arXiv preprint arXiv:2308.10253, 2023c.
  31. Llm-grounded diffusion: Enhancing prompt understanding of text-to-image diffusion models with large language models. arXiv preprint arXiv:2305.13655, 2023.
  32. Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023.
  33. Compositional visual generation with composable diffusion models. In European Conference on Computer Vision, pp.  423–439. Springer, 2022.
  34. T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. arXiv preprint arXiv:2302.08453, 2023.
  35. Crosslingual generalization through multitask finetuning. arXiv preprint arXiv:2211.01786, 2022.
  36. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021.
  37. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
  38. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
  39. Kosmos-g: Generating images in context with multimodal large language models. arXiv preprint arXiv:2310.02992, 2023.
  40. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023.
  41. Layoutllm-t2i: Eliciting layout guidance from llm for text-to-image generation. In Proceedings of the 31st ACM International Conference on Multimedia, pp.  643–654, 2023.
  42. Learning transferable visual models from natural language supervision. In International conference on machine learning, pp.  8748–8763. PMLR, 2021.
  43. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020.
  44. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 1(2):3, 2022.
  45. Linguistic binding in diffusion models: Enhancing attribute correspondence through attention map alignment. arXiv preprint arXiv:2306.08877, 2023.
  46. Generative adversarial text to image synthesis. In International conference on machine learning, pp.  1060–1069. PMLR, 2016.
  47. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.  10684–10695, 2022.
  48. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  22500–22510, 2023.
  49. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479–36494, 2022.
  50. Deep unsupervised learning using nonequilibrium thermodynamics. In International conference on machine learning, pp.  2256–2265. PMLR, 2015.
  51. Generative modeling by estimating gradients of the data distribution. Advances in neural information processing systems, 32, 2019.
  52. Improved techniques for training score-based generative models. Advances in neural information processing systems, 33:12438–12448, 2020.
  53. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020.
  54. Learning to summarize with human feedback. Advances in Neural Information Processing Systems, 33:3008–3021, 2020.
  55. Dreamsync: Aligning text-to-image generation with image understanding feedback. arXiv preprint arXiv:2311.17946, 2023a.
  56. Generative pretraining in multimodality. arXiv preprint arXiv:2307.05222, 2023b.
  57. Vipergpt: Visual inference via python execution for reasoning. arXiv preprint arXiv:2303.08128, 2023.
  58. Galactica: A large language model for science. arXiv preprint arXiv:2211.09085, 2022.
  59. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
  60. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
  61. Compositional text-to-image synthesis with attention map control of diffusion models. arXiv preprint arXiv:2305.13921, 2023.
  62. Bloom: A 176b-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100, 2022.
  63. Visual chatgpt: Talking, drawing and editing with visual foundation models. arXiv preprint arXiv:2303.04671, 2023a.
  64. Self-correcting llm-controlled diffusion models. arXiv preprint arXiv:2311.16090, 2023b.
  65. Boxdiff: Text-to-image synthesis with training-free box-constrained diffusion. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.  7452–7461, 2023.
  66. Imagereward: Learning and evaluating human preferences for text-to-image generation. arXiv preprint arXiv:2304.05977, 2023.
  67. Baichuan 2: Open large-scale language models. arXiv preprint arXiv:2309.10305, 2023a.
  68. Improving diffusion-based image synthesis with context prediction. In Thirty-seventh Conference on Neural Information Processing Systems, 2023b.
  69. Diffusion models: A comprehensive survey of methods and applications. ACM Computing Surveys, 56(4):1–39, 2023c.
  70. Mm-react: Prompting chatgpt for multimodal reasoning and action. arXiv preprint arXiv:2303.11381, 2023d.
  71. Reco: Region-controlled text-to-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  14246–14255, 2023e.
  72. Scaling autoregressive multi-modal models: Pretraining and instruction tuning. arXiv preprint arXiv:2309.02591, 2023.
  73. Glm-130b: An open bilingual pre-trained model. arXiv preprint arXiv:2210.02414, 2022.
  74. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.  3836–3847, 2023a.
  75. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022.
  76. Controllable text-to-image generation with gpt-4. arXiv preprint arXiv:2305.18583, 2023b.
  77. Enhanced visual instruction tuning for text-rich image understanding. In NeurIPS 2023 Workshop on Instruction Tuning and Instruction Following, 2023c.
  78. Multimodal chain-of-thought reasoning in language models. arXiv preprint arXiv:2302.00923, 2023d.
  79. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.
  80. Generalized decoding for pixel, image, and language. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  15116–15127, 2023.
Authors (6)
  1. Ling Yang (88 papers)
  2. Zhaochen Yu (7 papers)
  3. Chenlin Meng (39 papers)
  4. Minkai Xu (40 papers)
  5. Stefano Ermon (279 papers)
  6. Bin Cui (165 papers)
Citations (72)