DiffusionGPT: LLM-Driven Text-to-Image Generation System (2401.10061v1)

Published 18 Jan 2024 in cs.CV and cs.AI

Abstract: Diffusion models have opened up new avenues for the field of image generation, resulting in the proliferation of high-quality models shared on open-source platforms. However, a major challenge persists: current text-to-image systems are often unable to handle diverse inputs, or are limited to single-model results. Current unified attempts often address one of two orthogonal aspects: i) parsing diverse prompts at the input stage; ii) activating an expert model for output. To combine the best of both worlds, we propose DiffusionGPT, which leverages a Large Language Model (LLM) to offer a unified generation system capable of seamlessly accommodating various types of prompts and integrating domain-expert models. DiffusionGPT constructs domain-specific trees for various generative models based on prior knowledge. When provided with an input, the LLM parses the prompt and employs the Tree-of-Thought to guide the selection of an appropriate model, thereby relaxing input constraints and ensuring exceptional performance across diverse domains. Moreover, we introduce Advantage Databases, in which the Tree-of-Thought is enriched with human feedback, aligning the model selection process with human preferences. Through extensive experiments and comparisons, we demonstrate the effectiveness of DiffusionGPT, showcasing its potential for pushing the boundaries of image synthesis in diverse domains.

Authors (9)
  1. Jie Qin (68 papers)
  2. Jie Wu (230 papers)
  3. Weifeng Chen (22 papers)
  4. Yuxi Ren (16 papers)
  5. Huixia Li (16 papers)
  6. Hefeng Wu (35 papers)
  7. Xuefeng Xiao (51 papers)
  8. Rui Wang (996 papers)
  9. Shilei Wen (42 papers)
Citations (13)

Summary

Overview of DiffusionGPT

The intersection of language understanding and image generation is a rapidly advancing area of AI research, with generative models pushing the boundaries of how machines create visual content from textual descriptions. This summary focuses on the recently developed DiffusionGPT system, which stands apart for its use of LLMs to orchestrate the text-to-image generation process.

Challenge and Solution

Current image generation models, while powerful, often stumble on diverse prompts and domain-specific requests. DiffusionGPT addresses this gap: the system capitalizes on an LLM to parse the user's prompt and select the most apt generative model from a bespoke collection of experts organized as a 'Tree-of-Thought.' Advantage Databases tailor this selection to human preferences, harnessing feedback to refine the ranking of candidate models and making the system an all-inclusive approach to image generation.
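To make these two structures concrete, here is a minimal sketch of how a Tree-of-Thought of models and an Advantage Database might be represented. The node layout, category tags, model names, and preference scores are all illustrative assumptions, not the paper's actual data structures.

```python
from dataclasses import dataclass, field

# Hypothetical Tree-of-Thought layout: internal nodes carry category tags,
# leaves carry the names of registered domain-expert generative models.
@dataclass
class ToTNode:
    tag: str
    children: list["ToTNode"] = field(default_factory=list)
    models: list[str] = field(default_factory=list)  # non-empty only at leaves

# Toy tree with two top-level domains and a few expert checkpoints each.
tree = ToTNode("root", children=[
    ToTNode("people", models=["realistic-portrait-xl", "anime-face-v2"]),
    ToTNode("scenery", models=["landscape-diffusion-v1", "cityscape-v1"]),
])

# Hypothetical Advantage Database: per category, models scored by aggregated
# human feedback, used to rank the candidates found at a leaf.
advantage_db = {
    "people": {"realistic-portrait-xl": 0.82, "anime-face-v2": 0.64},
    "scenery": {"landscape-diffusion-v1": 0.77, "cityscape-v1": 0.58},
}

def rank_candidates(tag: str, candidates: list[str]) -> list[str]:
    """Order candidate models by stored human-preference score, best first."""
    scores = advantage_db.get(tag, {})
    return sorted(candidates, key=lambda m: scores.get(m, 0.0), reverse=True)
```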

Methodology Insights

The core of DiffusionGPT comprises several interconnected components, each serving a particular function. It begins with the Prompt Parse Agent, which uses the LLM to decipher various types of prompts, whether direct descriptions or metaphorical in nature. Following this parsing stage, the Tree-of-Thought of Models identifies the candidate set of domain-expert models for the desired image. The process concludes with the Model Selection step, where human feedback plays a crucial role in determining the most suitable model to fulfill the request, followed by execution and generation of the final image.
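Continuing the sketch above, this is one way the four stages could be wired together. `call_llm`, `run_expert_model`, and the traversal prompts are hypothetical placeholders; the paper drives these steps with engineered LLM prompts rather than this simplified control flow.

```python
def call_llm(instruction: str, text: str) -> str:
    """Placeholder for a chat-model call; wire up a real LLM client here."""
    raise NotImplementedError

def run_expert_model(name: str, prompt: str):
    """Placeholder: load the chosen checkpoint and sample an image from it."""
    raise NotImplementedError

def parse_prompt(user_input: str) -> str:
    # 1) Prompt Parse Agent: reduce any input form (description, instruction,
    #    metaphor, ...) to the core description of the desired image.
    return call_llm("Extract the core image description from:", user_input)

def traverse_tree(node: ToTNode, description: str) -> ToTNode:
    # 2) Tree-of-Thought search: at each level, ask the LLM which child
    #    category best matches the parsed description, down to a leaf.
    while node.children:
        tags = [c.tag for c in node.children]
        choice = call_llm(f"Pick the best category from {tags} for:",
                          description)
        node = next((c for c in node.children if c.tag == choice),
                    node.children[0])
    return node

def generate(user_input: str):
    description = parse_prompt(user_input)
    leaf = traverse_tree(tree, description)
    # 3) Model Selection: rank the leaf's candidates with the Advantage
    #    Database so the pick reflects accumulated human feedback.
    best = rank_candidates(leaf.tag, leaf.models)[0]
    # 4) Execution of Generation: run the selected expert on the description.
    return run_expert_model(best, description)
```

A design point worth noting: because selection lives outside the expert models themselves, new checkpoints can be added by registering a leaf and a preference score, without retraining any generator.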

Performance and Effectiveness

DiffusionGPT distinguishes itself by producing images that are not only semantically aligned with the given text prompts but also aesthetically pleasing. Comparisons with established models such as SD1.5 demonstrate DiffusionGPT's superior capability in generating realistic and detailed images, especially when the subjects are complex or human-oriented. Quantitative results and user studies further underline its effectiveness, confirming that images generated by DiffusionGPT frequently outrank alternatives in visual appeal and alignment with human preferences.

Conclusion and Forward Look

Summarizing its contributions, the paper highlights DiffusionGPT's unique position as a versatile, high-performing, and user-aligned solution for text-to-image generation. It also paves the way for further developments, suggesting future enhancements such as direct feedback incorporation, expansion of the pool of candidate models, and application to a wide array of generative tasks beyond image synthesis. DiffusionGPT encapsulates the current evolution in generative AI, offering a glimpse of a future where AI-generated images respond even more closely to the vast spectrum of human imagination.
