DiffusionGPT: LLM-Driven Text-to-Image Generation System (2401.10061v1)

Published 18 Jan 2024 in cs.CV and cs.AI

Abstract: Diffusion models have opened up new avenues for the field of image generation, resulting in the proliferation of high-quality models shared on open-source platforms. However, a major challenge persists: current text-to-image systems are often unable to handle diverse inputs, or are limited to single-model results. Current unified attempts often address one of two orthogonal aspects: i) parsing diverse prompts at the input stage; ii) activating an expert model for output. To combine the best of both worlds, we propose DiffusionGPT, which leverages a Large Language Model (LLM) to offer a unified generation system capable of seamlessly accommodating various types of prompts and integrating domain-expert models. DiffusionGPT constructs domain-specific trees for various generative models based on prior knowledge. When provided with an input, the LLM parses the prompt and employs the Tree-of-Thought to guide the selection of an appropriate model, thereby relaxing input constraints and ensuring exceptional performance across diverse domains. Moreover, we introduce Advantage Databases, in which the Tree-of-Thought is enriched with human feedback, aligning the model selection process with human preferences. Through extensive experiments and comparisons, we demonstrate the effectiveness of DiffusionGPT, showcasing its potential for pushing the boundaries of image synthesis in diverse domains.

Authors (9)
  1. Jie Qin (68 papers)
  2. Jie Wu (230 papers)
  3. Weifeng Chen (22 papers)
  4. Yuxi Ren (16 papers)
  5. Huixia Li (16 papers)
  6. Hefeng Wu (35 papers)
  7. Xuefeng Xiao (51 papers)
  8. Rui Wang (996 papers)
  9. Shilei Wen (42 papers)
Citations (13)

Summary

Overview of DiffusionGPT

The intersection of language understanding and image generation is a rapidly advancing area of AI research, with generative models pushing the boundaries of how machines create visual content from textual descriptions. This summary focuses on the recently developed DiffusionGPT system, which stands apart for its use of LLMs to orchestrate the text-to-image generation process.

Challenge and Solution

Current image generation models, while powerful, often stumble on diverse prompts and domain-specific requests. DiffusionGPT addresses this gap: the system capitalizes on an LLM to parse the user's prompt and select the most apt generative model from a bespoke collection of experts organized as a 'Tree-of-Thought.' Advantage Databases tailor this selection to human preferences, harnessing feedback to refine the ranking of candidate models and making the system an all-inclusive approach to image generation.
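To make these two structures concrete, here is a minimal sketch of how a Tree-of-Thought of models and an Advantage Database might be represented. The node layout, category tags, model names, and preference scores are all illustrative assumptions, not the paper's actual data structures.

```python
from dataclasses import dataclass, field

# Hypothetical Tree-of-Thought layout: internal nodes carry category tags,
# leaves carry the names of registered domain-expert generative models.
@dataclass
class ToTNode:
    tag: str
    children: list["ToTNode"] = field(default_factory=list)
    models: list[str] = field(default_factory=list)  # non-empty only at leaves

# Toy tree with two top-level domains and a few expert checkpoints each.
tree = ToTNode("root", children=[
    ToTNode("people", models=["realistic-portrait-xl", "anime-face-v2"]),
    ToTNode("scenery", models=["landscape-diffusion-v1", "cityscape-v1"]),
])

# Hypothetical Advantage Database: per category, models scored by aggregated
# human feedback, used to rank the candidates found at a leaf.
advantage_db = {
    "people": {"realistic-portrait-xl": 0.82, "anime-face-v2": 0.64},
    "scenery": {"landscape-diffusion-v1": 0.77, "cityscape-v1": 0.58},
}

def rank_candidates(tag: str, candidates: list[str]) -> list[str]:
    """Order candidate models by stored human-preference score, best first."""
    scores = advantage_db.get(tag, {})
    return sorted(candidates, key=lambda m: scores.get(m, 0.0), reverse=True)
```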

Methodology Insights

The core of DiffusionGPT comprises several interconnected components, each serving a particular function. It begins with the Prompt Parse Agent, which uses the LLM to decipher various types of prompts, whether direct descriptions or metaphorical in nature. Following this parsing stage, the Tree-of-Thought of Models identifies the candidate set of domain-expert models for the desired image. The process concludes with the Model Selection step, where human feedback plays a crucial role in determining the most suitable model to fulfill the request, followed by execution and generation of the final image.
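Continuing the sketch above, this is one way the four stages could be wired together. `call_llm`, `run_expert_model`, and the traversal prompts are hypothetical placeholders; the paper drives these steps with engineered LLM prompts rather than this simplified control flow.

```python
def call_llm(instruction: str, text: str) -> str:
    """Placeholder for a chat-model call; wire up a real LLM client here."""
    raise NotImplementedError

def run_expert_model(name: str, prompt: str):
    """Placeholder: load the chosen checkpoint and sample an image from it."""
    raise NotImplementedError

def parse_prompt(user_input: str) -> str:
    # 1) Prompt Parse Agent: reduce any input form (description, instruction,
    #    metaphor, ...) to the core description of the desired image.
    return call_llm("Extract the core image description from:", user_input)

def traverse_tree(node: ToTNode, description: str) -> ToTNode:
    # 2) Tree-of-Thought search: at each level, ask the LLM which child
    #    category best matches the parsed description, down to a leaf.
    while node.children:
        tags = [c.tag for c in node.children]
        choice = call_llm(f"Pick the best category from {tags} for:",
                          description)
        node = next((c for c in node.children if c.tag == choice),
                    node.children[0])
    return node

def generate(user_input: str):
    description = parse_prompt(user_input)
    leaf = traverse_tree(tree, description)
    # 3) Model Selection: rank the leaf's candidates with the Advantage
    #    Database so the pick reflects accumulated human feedback.
    best = rank_candidates(leaf.tag, leaf.models)[0]
    # 4) Execution of Generation: run the selected expert on the description.
    return run_expert_model(best, description)
```

A design point worth noting: because selection lives outside the expert models themselves, new checkpoints can be added by registering a leaf and a preference score, without retraining any generator.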

Performance and Effectiveness

DiffusionGPT distinguishes itself by producing images that are not only semantically aligned with the given text prompts but also aesthetically pleasing. Comparisons with established models such as SD1.5 demonstrate DiffusionGPT's superior capability in generating realistic and detailed images, especially when the subjects are complex or human-oriented. Quantitative results and user studies further underline its effectiveness, confirming that images generated by DiffusionGPT frequently outrank alternatives in visual appeal and alignment with human preferences.

Conclusion and Forward Look

Summarizing its contributions, the paper highlights DiffusionGPT's unique position as a versatile, high-performing, and user-aligned solution for text-to-image generation. It also paves the way for further developments, suggesting future enhancements such as direct feedback incorporation, expansion of the pool of candidate models, and application to a wide array of generative tasks beyond image synthesis. DiffusionGPT encapsulates the current evolution in generative AI, offering a glimpse of a future where AI-generated images respond even more closely to the vast spectrum of human imagination.
