DiffusionGPT: LLM-Driven Text-to-Image Generation System (2401.10061v1)
Abstract: Diffusion models have opened up new avenues for image generation, resulting in a proliferation of high-quality models shared on open-source platforms. However, a major challenge persists: current text-to-image systems are often unable to handle diverse inputs, or are limited to the results of a single model. Existing unified attempts typically address one of two orthogonal aspects: i) parsing diverse prompts at the input stage; ii) activating expert models for output. To combine the best of both worlds, we propose DiffusionGPT, which leverages Large Language Models (LLMs) to offer a unified generation system capable of seamlessly accommodating various types of prompts and integrating domain-expert models. DiffusionGPT constructs domain-specific Trees for various generative models based on prior knowledge. When provided with an input, the LLM parses the prompt and employs the Tree-of-Thought to guide the selection of an appropriate model, thereby relaxing input constraints and ensuring exceptional performance across diverse domains. Moreover, we introduce Advantage Databases, where the Tree-of-Thought is enriched with human feedback, aligning the model selection process with human preferences. Through extensive experiments and comparisons, we demonstrate the effectiveness of DiffusionGPT, showcasing its potential for pushing the boundaries of image synthesis in diverse domains.
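The abstract describes a pipeline with three mechanisms: LLM-based prompt parsing, Tree-of-Thought search over a tree of expert models, and re-ranking of candidates with human-feedback scores from an Advantage Database. The Python sketch below illustrates how these pieces might fit together; the tree contents, model names, advantage scores, and the keyword-based `parse_prompt` stand-in are all hypothetical placeholders, not details taken from the paper.

```python
# Minimal sketch of a DiffusionGPT-style selection pipeline.
# All model names, tree contents, and advantage scores below are
# hypothetical placeholders, not values from the paper.

from dataclasses import dataclass, field


@dataclass
class TreeNode:
    name: str
    children: list["TreeNode"] = field(default_factory=list)
    models: list[str] = field(default_factory=list)  # leaf: candidate expert models


# Toy domain tree: the structure the Tree-of-Thought search walks over.
MODEL_TREE = TreeNode("root", children=[
    TreeNode("photorealistic", models=["realistic-expert-v1", "photo-expert-v2"]),
    TreeNode("anime", models=["anime-expert-v1"]),
])

# Toy "Advantage Database": human-feedback scores per (domain tag, model).
ADVANTAGE_DB = {
    ("photorealistic", "realistic-expert-v1"): 0.71,
    ("photorealistic", "photo-expert-v2"): 0.64,
    ("anime", "anime-expert-v1"): 0.80,
}


def parse_prompt(prompt: str) -> str:
    """Stand-in for the LLM parsing step: map a free-form prompt to a
    domain tag. A real system would query an LLM here instead."""
    return "anime" if "anime" in prompt.lower() else "photorealistic"


def select_model(tag: str) -> str:
    """Descend the tree to the branch matching the tag, then re-rank
    that branch's candidate models by their advantage scores."""
    for node in MODEL_TREE.children:
        if node.name == tag:
            return max(node.models, key=lambda m: ADVANTAGE_DB.get((tag, m), 0.0))
    raise ValueError(f"no expert branch for tag {tag!r}")


def generate(prompt: str) -> str:
    tag = parse_prompt(prompt)
    model = select_model(tag)
    # A real system would invoke the chosen diffusion model here.
    return f"[{model}] <- {prompt}"


if __name__ == "__main__":
    print(generate("an anime portrait of a fox spirit"))
    print(generate("a photorealistic mountain lake at dawn"))
```

In this toy version the human-feedback re-ranking is a dictionary lookup; the point is only that model selection is a tree walk followed by a preference-weighted choice among the leaf's candidates, rather than a single fixed model.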
Authors:
- Jie Qin
- Jie Wu
- Weifeng Chen
- Yuxi Ren
- Huixia Li
- Hefeng Wu
- Xuefeng Xiao
- Rui Wang
- Shilei Wen