Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs (2401.11708v3)

Published 22 Jan 2024 in cs.CV, cs.AI, and cs.LG

Abstract: Diffusion models have exhibited exceptional performance in text-to-image generation and editing. However, existing methods often face challenges when handling complex text prompts that involve multiple objects with multiple attributes and relationships. In this paper, we propose a brand new training-free text-to-image generation/editing framework, namely Recaption, Plan and Generate (RPG), harnessing the powerful chain-of-thought reasoning ability of multimodal LLMs to enhance the compositionality of text-to-image diffusion models. Our approach employs the MLLM as a global planner to decompose the process of generating complex images into multiple simpler generation tasks within subregions. We propose complementary regional diffusion to enable region-wise compositional generation. Furthermore, we integrate text-guided image generation and editing within the proposed RPG in a closed-loop fashion, thereby enhancing generalization ability. Extensive experiments demonstrate that our RPG outperforms state-of-the-art text-to-image diffusion models, including DALL-E 3 and SDXL, particularly in multi-category object composition and text-image semantic alignment. Notably, our RPG framework exhibits wide compatibility with various MLLM architectures (e.g., MiniGPT-4) and diffusion backbones (e.g., ControlNet). Our code is available at: https://github.com/YangLing0818/RPG-DiffusionMaster

Introduction

The paper introduces a new text-to-image generation and editing framework called Recaption, Plan and Generate (RPG), which leverages the chain-of-thought reasoning ability of multimodal LLMs (MLLMs) to enhance the compositionality of diffusion models. RPG employs an MLLM as a global planner that decomposes a complex image generation task into simpler sub-tasks, each tied to a distinct subregion of the image. It introduces complementary regional diffusion for region-wise compositional generation and unifies text-guided generation and editing in a closed-loop fashion. Experiments show that RPG outperforms established models such as DALL-E 3 and SDXL, particularly on complex prompts involving multiple object categories and on text-image semantic alignment.

Methodology Overview

The RPG framework requires no additional training and follows a three-stage strategy: Multimodal Recaptioning, Chain-of-Thought (CoT) Planning, and Complementary Regional Diffusion. The MLLM first decomposes the text prompt into descriptive subprompts, providing richer detail that improves semantic alignment during the diffusion process. CoT planning then assigns each subprompt to a complementary subregion, recasting the complex generation task as a collection of simpler regional ones. Finally, complementary regional diffusion generates each subregion under its subprompt and spatially merges the results, sidestepping content conflicts where regions overlap.
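
To make the three-stage pipeline concrete, the sketch below mimics its control flow in plain Python/NumPy. The helpers mllm_recaption_and_plan and denoise_region are hypothetical stubs standing in for the MLLM planner and a text-conditioned diffusion denoiser, and the simple averaging over overlapping boxes is only an illustration of region-wise merging, not the paper's exact weighting scheme.

```python
import numpy as np

# Hypothetical stand-ins: a real implementation would call an MLLM (e.g., GPT-4 or
# MiniGPT-4) for recaptioning/planning and a diffusion backbone (e.g., SDXL) for denoising.
def mllm_recaption_and_plan(prompt):
    """Recaption the prompt into subprompts and assign each an (x0, y0, x1, y1) latent box."""
    return [
        ("a red vintage car parked on the left", (0, 0, 32, 64)),
        ("a golden retriever sitting on the right", (32, 0, 64, 64)),
    ]

def denoise_region(latent_region, subprompt, step):
    """Placeholder for one text-conditioned denoising step on a regional latent."""
    return latent_region - 0.01 * np.random.randn(*latent_region.shape)

def rpg_generate(prompt, steps=50, latent_hw=(64, 64), channels=4):
    plan = mllm_recaption_and_plan(prompt)            # 1) recaptioning + 2) CoT planning
    latent = np.random.randn(channels, *latent_hw)    # shared global latent
    for step in range(steps):                         # 3) complementary regional diffusion
        merged = np.zeros_like(latent)
        counts = np.zeros(latent_hw)
        for subprompt, (x0, y0, x1, y1) in plan:
            region = denoise_region(latent[:, y0:y1, x0:x1], subprompt, step)
            merged[:, y0:y1, x0:x1] += region         # paste the regional result back
            counts[y0:y1, x0:x1] += 1.0
        latent = merged / np.maximum(counts, 1.0)     # average where regions overlap
    return latent
```

The structural point is that every denoising step operates on per-region latents guided by per-region subprompts, and the regions are merged back into a single latent before the next step, which is what lets one image satisfy several subprompts at once.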

Compositional Generation and Editing

The RPG framework demonstrates versatility in handling both generation and editing tasks. For editing, it employs MLLMs to provide feedback identifying semantic discrepancies between generated images and target prompts, leverages CoT planning to delineate editing instructions, and utilizes contour-based diffusion for precise region modification. The framework refines the generation process iteratively through a closed-loop implementation that incorporates feedback from earlier rounds of editing.
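
The closed loop can be pictured as a critique-and-edit cycle. In the sketch below, mllm_feedback and regional_diffusion_edit are hypothetical stubs for the MLLM critic and the region-restricted diffusion editor; they illustrate only the loop structure, not the actual models or the paper's contour extraction.

```python
import numpy as np

def mllm_feedback(image, target_prompt):
    """Hypothetical MLLM critic: report whether the image matches the prompt and,
    if not, return (edit instruction, region mask) pairs for the discrepancies."""
    mask = np.zeros(image.shape[:2], dtype=bool)
    mask[64:192, 64:192] = True                        # toy region flagged for editing
    return False, [("recolor the cup from green to blue", mask)]

def regional_diffusion_edit(image, instruction, mask):
    """Placeholder for a contour/region-guided diffusion edit restricted to `mask`."""
    edited = image.copy()
    edited[mask] = np.clip(edited[mask] + 0.01, 0.0, 1.0)
    return edited

def rpg_edit_closed_loop(image, target_prompt, max_rounds=3):
    for _ in range(max_rounds):
        aligned, edit_plan = mllm_feedback(image, target_prompt)      # spot discrepancies
        if aligned:
            break
        for instruction, mask in edit_plan:                           # apply planned regional edits
            image = regional_diffusion_edit(image, instruction, mask)
    return image

edited = rpg_edit_closed_loop(np.random.rand(256, 256, 3), "a blue cup on a wooden table")
```

Each round feeds the previously edited image back to the critic, so discrepancies that survive one round can still be corrected in a later one.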

Experiments and Findings

RPG is evaluated extensively: qualitative figures in the paper illustrate its superiority in aligning complex textual prompts with generated image content, and quantitative results are reported on multiple datasets and benchmarks, including T2I-CompBench. RPG adapts to different MLLM architectures and diffusion backbones, demonstrating its flexibility and potential for wide application. In image editing comparisons, RPG outperforms other state-of-the-art methods, producing more precise and semantically aligned edits, and iterative refinement yields further improvements.

Conclusion and Outlook

The RPG framework sets a new bar in handling complex and compositional text-to-image tasks, effectively leveraging the reasoning capabilities of MLLMs to plan image compositions for diffusion models. It presents a training-free, versatile approach and is compatible with various architecture types. Future research will aim at expanding the RPG framework to accommodate even more complex modalities and apply it to a broader spectrum of practical scenarios, solidifying text-to-image generation's position as a key technology in creative and design applications.

References (80)
  1. Spatext: Spatio-textual representation for controllable image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  18370–18380, 2023.
  2. Multidiffusion: Fusing diffusion paths for controlled image generation. arXiv preprint arXiv:2302.08113, 2023.
  3. Improving image generation with better captions. https://cdn.openai.com/papers/dall-e-3.pdf, 2023.
  4. Instructpix2pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  18392–18402, 2023.
  5. Masactrl: Tuning-free mutual self-attention control for consistent image synthesis and editing. arXiv preprint arXiv:2304.08465, 2023.
  6. Introducing ChatGPT. OpenAI, 2022.
  7. Attend-and-excite: Attention-based semantic guidance for text-to-image diffusion models. ACM Transactions on Graphics (TOG), 42(4):1–10, 2023.
  8. Pixart-alpha: Fast training of diffusion transformer for photorealistic text-to-image synthesis. arXiv preprint arXiv:2310.00426, 2023a.
  9. Training-free layout control with cross-attention guidance. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp.  5343–5353, 2024.
  10. Llava-interactive: An all-in-one demo for image chat, segmentation, generation and editing. arXiv preprint arXiv:2311.00571, 2023b.
  11. Palm: Scaling language modeling with pathways. Journal of Machine Learning Research, 24(240):1–113, 2023.
  12. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416, 2022.
  13. Generative adversarial networks: An overview. IEEE signal processing magazine, 35(1):53–65, 2018.
  14. Diffusion models beat gans on image synthesis. Advances in neural information processing systems, 34:8780–8794, 2021.
  15. Dreamllm: Synergistic multimodal comprehension and creation. arXiv preprint arXiv:2309.11499, 2023.
  16. Boosting text-to-image diffusion models with fine-grained semantic rewards. arXiv preprint arXiv:2305.19599, 2023.
  17. Training-free structured diffusion guidance for compositional text-to-image synthesis. In The Eleventh International Conference on Learning Representations, 2022.
  18. Layoutgpt: Compositional visual planning and generation with large language models. arXiv preprint arXiv:2305.15393, 2023.
  19. Guiding instruction-based image editing via multimodal large language models. arXiv preprint arXiv:2309.17102, 2023.
  20. Visual programming: Compositional visual reasoning without training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  14953–14962, 2023.
  21. Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626, 2022.
  22. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.
  23. T2i-compbench: A comprehensive benchmark for open-world compositional text-to-image generation. arXiv preprint arXiv:2307.06350, 2023a.
  24. Composer: Creative and controllable image synthesis with composable conditions. arXiv preprint arXiv:2302.09778, 2023b.
  25. Opt-iml: Scaling language model instruction meta learning through the lens of generalization. arXiv preprint arXiv:2212.12017, 2022.
  26. Segment anything. arXiv preprint arXiv:2304.02643, 2023.
  27. Generating images with multimodal language models. arXiv preprint arXiv:2305.17216, 2023.
  28. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023a.
  29. Gligen: Open-set grounded text-to-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  22511–22521, 2023b.
  30. Stablellava: Enhanced visual instruction tuning with synthesized image-dialogue data. arXiv preprint arXiv:2308.10253, 2023c.
  31. Llm-grounded diffusion: Enhancing prompt understanding of text-to-image diffusion models with large language models. arXiv preprint arXiv:2305.13655, 2023.
  32. Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023.
  33. Compositional visual generation with composable diffusion models. In European Conference on Computer Vision, pp.  423–439. Springer, 2022.
  34. T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. arXiv preprint arXiv:2302.08453, 2023.
  35. Crosslingual generalization through multitask finetuning. arXiv preprint arXiv:2211.01786, 2022.
  36. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021.
  37. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
  38. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
  39. Kosmos-g: Generating images in context with multimodal large language models. arXiv preprint arXiv:2310.02992, 2023.
  40. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023.
  41. Layoutllm-t2i: Eliciting layout guidance from llm for text-to-image generation. In Proceedings of the 31st ACM International Conference on Multimedia, pp.  643–654, 2023.
  42. Learning transferable visual models from natural language supervision. In International conference on machine learning, pp.  8748–8763. PMLR, 2021.
  43. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020.
  44. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 1(2):3, 2022.
  45. Linguistic binding in diffusion models: Enhancing attribute correspondence through attention map alignment. arXiv preprint arXiv:2306.08877, 2023.
  46. Generative adversarial text to image synthesis. In International conference on machine learning, pp.  1060–1069. PMLR, 2016.
  47. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.  10684–10695, 2022.
  48. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  22500–22510, 2023.
  49. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479–36494, 2022.
  50. Deep unsupervised learning using nonequilibrium thermodynamics. In International conference on machine learning, pp.  2256–2265. PMLR, 2015.
  51. Generative modeling by estimating gradients of the data distribution. Advances in neural information processing systems, 32, 2019.
  52. Improved techniques for training score-based generative models. Advances in neural information processing systems, 33:12438–12448, 2020.
  53. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020.
  54. Learning to summarize with human feedback. Advances in Neural Information Processing Systems, 33:3008–3021, 2020.
  55. Dreamsync: Aligning text-to-image generation with image understanding feedback. arXiv preprint arXiv:2311.17946, 2023a.
  56. Generative pretraining in multimodality. arXiv preprint arXiv:2307.05222, 2023b.
  57. Vipergpt: Visual inference via python execution for reasoning. arXiv preprint arXiv:2303.08128, 2023.
  58. Galactica: A large language model for science. arXiv preprint arXiv:2211.09085, 2022.
  59. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
  60. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
  61. Compositional text-to-image synthesis with attention map control of diffusion models. arXiv preprint arXiv:2305.13921, 2023.
  62. Bloom: A 176b-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100, 2022.
  63. Visual chatgpt: Talking, drawing and editing with visual foundation models. arXiv preprint arXiv:2303.04671, 2023a.
  64. Self-correcting llm-controlled diffusion models. arXiv preprint arXiv:2311.16090, 2023b.
  65. Boxdiff: Text-to-image synthesis with training-free box-constrained diffusion. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.  7452–7461, 2023.
  66. Imagereward: Learning and evaluating human preferences for text-to-image generation. arXiv preprint arXiv:2304.05977, 2023.
  67. Baichuan 2: Open large-scale language models. arXiv preprint arXiv:2309.10305, 2023a.
  68. Improving diffusion-based image synthesis with context prediction. In Thirty-seventh Conference on Neural Information Processing Systems, 2023b.
  69. Diffusion models: A comprehensive survey of methods and applications. ACM Computing Surveys, 56(4):1–39, 2023c.
  70. Mm-react: Prompting chatgpt for multimodal reasoning and action. arXiv preprint arXiv:2303.11381, 2023d.
  71. Reco: Region-controlled text-to-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  14246–14255, 2023e.
  72. Scaling autoregressive multi-modal models: Pretraining and instruction tuning. arXiv preprint arXiv:2309.02591, 2023.
  73. Glm-130b: An open bilingual pre-trained model. arXiv preprint arXiv:2210.02414, 2022.
  74. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.  3836–3847, 2023a.
  75. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022.
  76. Controllable text-to-image generation with gpt-4. arXiv preprint arXiv:2305.18583, 2023b.
  77. Enhanced visual instruction tuning for text-rich image understanding. In NeurIPS 2023 Workshop on Instruction Tuning and Instruction Following, 2023c.
  78. Multimodal chain-of-thought reasoning in language models. arXiv preprint arXiv:2302.00923, 2023d.
  79. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.
  80. Generalized decoding for pixel, image, and language. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  15116–15127, 2023.
Authors (6)
  1. Ling Yang (88 papers)
  2. Zhaochen Yu (7 papers)
  3. Chenlin Meng (39 papers)
  4. Minkai Xu (40 papers)
  5. Stefano Ermon (279 papers)
  6. Bin Cui (165 papers)
Citations (72)