DivCon: Divide and Conquer for Progressive Text-to-Image Generation (2403.06400v2)
Abstract: Diffusion-driven text-to-image (T2I) generation has achieved remarkable advancements. To further improve T2I models' capability in numerical and spatial reasoning, the layout is employed as an intermediary to bridge LLMs and layout-based diffusion models. However, these methods still struggle with generating images from textual prompts containing multiple objects and complicated spatial relationships. To tackle this challenge, we introduce a divide-and-conquer approach that decouples the T2I generation task into simple subtasks. Our approach divides the layout prediction stage into numerical & spatial reasoning and bounding box prediction. Then, the layout-to-image generation stage is conducted iteratively to reconstruct objects from easy ones to difficult ones. We conduct experiments on the HRS and NSR-1K benchmarks, and our approach outperforms previous state-of-the-art models by notable margins. In addition, visual results demonstrate that our approach significantly improves controllability and consistency when generating multiple objects from complex textual prompts.
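The abstract describes a two-stage pipeline: layout prediction split into (a) numerical & spatial reasoning and (b) bounding box prediction, followed by iterative layout-to-image generation from easy objects to difficult ones. The sketch below is a minimal, illustrative reconstruction of that control flow under assumptions; every function name (`reason_about_objects`, `predict_boxes`, `layout_to_image_step`, etc.) is a hypothetical placeholder and not the paper's actual interface.

```python
# Illustrative sketch of a divide-and-conquer T2I pipeline as outlined in the
# abstract. All functions below are assumed placeholders, not the paper's API.

from dataclasses import dataclass
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # normalized (x0, y0, x1, y1)


@dataclass
class ObjectSpec:
    name: str          # object category, e.g. "cat"
    count: int         # number of instances requested by the prompt
    relation: str      # spatial relation phrase, e.g. "left of the dog"
    difficulty: float  # assumed heuristic score used to order generation


def reason_about_objects(prompt: str) -> List[ObjectSpec]:
    """Stage 1a (assumed): an LLM extracts object counts and spatial relations."""
    raise NotImplementedError("placeholder for an LLM reasoning call")


def predict_boxes(objects: List[ObjectSpec]) -> Dict[str, List[Box]]:
    """Stage 1b (assumed): a second pass turns parsed relations into bounding boxes."""
    raise NotImplementedError("placeholder for an LLM box-prediction call")


def layout_to_image_step(prompt: str, boxes: List[Box], previous):
    """Assumed wrapper around a layout-based diffusion model."""
    raise NotImplementedError("placeholder for a layout-conditioned diffusion call")


def divcon_pipeline(prompt: str):
    # Stage 1: divide layout prediction into reasoning and box prediction.
    objects = reason_about_objects(prompt)
    layout = predict_boxes(objects)

    # Stage 2: iterative layout-to-image generation, easy objects first,
    # conditioning each step on the partially generated image.
    image = None
    for spec in sorted(objects, key=lambda o: o.difficulty):
        image = layout_to_image_step(prompt, layout.get(spec.name, []), previous=image)
    return image
```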
- Yuhao Jia
- Wenhan Tan