MuLan: Multimodal-LLM Agent for Progressive and Interactive Multi-Object Diffusion (2402.12741v2)

Published 20 Feb 2024 in cs.CV

Abstract: Existing text-to-image models still struggle to generate images of multiple objects, especially in handling their spatial positions, relative sizes, overlapping, and attribute bindings. To efficiently address these challenges, we develop a training-free Multimodal-LLM agent (MuLan) that, like a human painter, can progressively generate multiple objects with intricate planning and feedback control. MuLan harnesses an LLM to decompose a prompt into a sequence of sub-tasks, each generating only one object by stable diffusion, conditioned on previously generated objects. Unlike existing LLM-grounded methods, MuLan only produces a high-level plan at the beginning, while the exact size and location of each object are determined for each sub-task by an LLM and attention guidance. Moreover, MuLan adopts a vision-LLM (VLM) to provide feedback on the image generated in each sub-task and control the diffusion model to re-generate the image if it violates the original prompt. Hence, each model in every step of MuLan only needs to address an easy sub-task it is specialized for. The multi-step process also allows human users to monitor the generation process and make preferred changes at any intermediate step via text prompts, thereby improving the human-AI collaboration experience. We collect 200 prompts containing multiple objects with spatial relationships and attribute bindings from different benchmarks to evaluate MuLan. The results demonstrate the superiority of MuLan in generating multiple objects over baselines and its creativity when collaborating with human users. The code is available at https://github.com/measure-infinity/mulan-code.

Progressive Multi-Object Generation with a Multimodal LLM (MuLan)

Introduction

The development and refinement of diffusion models have been a cornerstone of progress in generative AI, particularly within the domain of text-to-image (T2I) synthesis. Despite notable achievements, existing state-of-the-art models such as Stable Diffusion and DALL-E struggle with generating images from prompts involving intricate object relations—be it spatial positioning, relative sizes, or attribute consistency. To bridge this gap, we introduce MuLan, a training-free, Multimodal-LLM Agent geared towards progressive multi-object generation, leveraging LLMs for task decomposition and vision-LLMs (VLMs) for iterative feedback control.

Related Work

The emergence of diffusion models has catalyzed breakthroughs in T2I generation, with models like Stable Diffusion XL showcasing near-commercial-grade performance. However, their limitations become evident when generating complex images with multiple objects. Previous efforts to improve T2I controllability have led to approaches that use LLMs for layout generation and optimization, but these techniques often fall short on spatial reasoning and layout precision.

The MuLan Framework

MuLan addresses the aforementioned limitations by employing a sequential generation strategy, akin to how a human artist might approach a complex drawing. The process begins with an LLM decomposing a given prompt into manageable object-centric sub-tasks, guiding the generation of one object at a time while considering previously generated content. Each object's generation benefits from attention-guided diffusion, ensuring accurate positioning and attribute adherence. Critically, MuLan introduces a VLM-based feedback loop to correct any deviations from the initial prompt during the generative process. This innovative architecture allows for precise control over the composition of multiple objects, a notable advancement over existing methods.
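The control flow described above can be summarized in a short sketch. The helper functions below (plan_subtasks, place_and_generate, vlm_check) are toy stand-ins for the LLM planner, the attention-guided single-object diffusion step, and the VLM feedback check; the names, data structures, and retry budget are assumptions for illustration, not the authors' released implementation.

```python
# Minimal, self-contained sketch of a MuLan-style progressive generation loop.
# The planner, diffusion step, and VLM checker are toy stand-ins (hypothetical,
# not the authors' code) so that only the control flow is demonstrated.

import random
from dataclasses import dataclass, field

MAX_RETRIES = 3  # assumed per-sub-task regeneration budget


@dataclass
class SubTask:
    object_desc: str  # e.g. "a red apple"
    relation: str     # spatial relation to previously generated objects


@dataclass
class Canvas:
    objects: list = field(default_factory=list)  # stands in for the partial image


def plan_subtasks(prompt: str) -> list[SubTask]:
    """Stand-in for the LLM planner: one single-object sub-task per clause."""
    return [SubTask(obj.strip(), relation="as described") for obj in prompt.split(",")]


def place_and_generate(subtask: SubTask, canvas: Canvas) -> Canvas:
    """Stand-in for per-step layout (LLM + attention guidance) and one diffusion step."""
    return Canvas(objects=canvas.objects + [subtask.object_desc])


def vlm_check(canvas: Canvas, subtask: SubTask) -> bool:
    """Stand-in for VLM feedback: fail randomly to exercise the regeneration path."""
    return random.random() > 0.3


def mulan_generate(prompt: str) -> Canvas:
    canvas = Canvas()
    for subtask in plan_subtasks(prompt):          # high-level plan is made once, up front
        for _ in range(MAX_RETRIES):
            candidate = place_and_generate(subtask, canvas)
            if vlm_check(candidate, subtask):      # feedback control per sub-task
                canvas = candidate                 # accept; next object conditions on it
                break
        else:
            canvas = candidate                     # keep the best effort after retries
    return canvas


if __name__ == "__main__":
    print(mulan_generate("a red apple on a table, a blue vase to its left").objects)
```

The point the sketch makes concrete is the division of labor: planning happens once, while per-object layout and VLM feedback run inside the loop, so each component only has to solve a small, local problem.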

Experimental Validation

To assess MuLan's efficacy, we compiled a test suite of 200 complex prompts from various benchmarks, analyzing performance across dimensions such as object completeness, attribute binding accuracy, and spatial relationship fidelity. Our findings demonstrate that MuLan significantly outperforms baseline models in these areas, as indicated by both quantitative results and human evaluations. This success underscores the potential of MuLan to redefine the standards for T2I generation, especially in scenarios demanding high degrees of compositional control.
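As a rough sketch of how per-prompt judgments along these three dimensions could be aggregated into summary scores, the snippet below uses an assumed record format and unweighted averaging; it illustrates the evaluation axes, not the paper's exact scoring protocol.

```python
# Toy aggregation of per-prompt evaluation results along the three dimensions
# discussed above. The record format and equal weighting are illustrative
# assumptions, not the paper's scoring protocol.

from statistics import mean

# One record per evaluated prompt, each judged on the three axes (scores in [0, 1]).
results = [
    {"completeness": 1.0, "attribute_binding": 1.0, "spatial": 1.0},
    {"completeness": 1.0, "attribute_binding": 0.5, "spatial": 1.0},
    {"completeness": 0.5, "attribute_binding": 1.0, "spatial": 0.0},
]


def summarize(records):
    dims = ("completeness", "attribute_binding", "spatial")
    per_dim = {d: mean(r[d] for r in records) for d in dims}
    per_dim["overall"] = mean(per_dim[d] for d in dims)  # unweighted average (assumed)
    return per_dim


print(summarize(results))
```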

Discussion and Future Directions

The introduction of MuLan represents a pivotal shift towards a more nuanced and capable form of T2I generation. By meticulously combining the strengths of LLMs and VLMs, MuLan not only surmounts the challenges posed by complex prompts but also showcases the untapped potential of multimodal AI collaboration. Looking forward, our work lays the foundational groundwork for further explorations into the synergistic integration of language and visual models, heralding a new era of generative AI that is both more creative and more controlled.

Limitations and Ethical Considerations

While MuLan advances the field of generative AI, its reliance on sequential processing for complex scenes introduces higher computational demands, potentially impacting scalability and efficiency. Additionally, the dependency on LLMs for prompt decomposition may introduce vulnerabilities to inaccuracies in understanding or processing complex prompts. As with all AI research, it is imperative to remain vigilant about the ethical implications, especially concerning the generation of misleading or harmful content. Continuous scrutiny and refinement of models like MuLan are essential to ensure their benefits are realized without unintended negative consequences.

In conclusion, MuLan's ability to navigate the challenges of multi-object T2I generation, backed by empirical validation, not only enhances our understanding of the field but also paves the way for more sophisticated and reliable generative models. Recognizing its potential and limitations will be pivotal in driving future AI research and applications toward more beneficial and ethical outcomes.

Authors (5)
  1. Sen Li (59 papers)
  2. Ruochen Wang (29 papers)
  3. Cho-Jui Hsieh (211 papers)
  4. Minhao Cheng (43 papers)
  5. Tianyi Zhou (172 papers)