LLMGA: Multimodal Large Language Model based Generation Assistant (2311.16500v4)
Abstract: In this paper, we introduce a Multimodal LLM-based Generation Assistant (LLMGA) that leverages the extensive knowledge of LLMs and their proficiency in reasoning, comprehension, and response to assist users in image generation and editing. Diverging from existing approaches in which Multimodal LLMs (MLLMs) generate fixed-size embeddings to control Stable Diffusion (SD), our LLMGA provides detailed language generation prompts for precise control over SD. This not only augments the LLM's context understanding but also reduces noise in generation prompts, yields images with more intricate and precise content, and elevates the interpretability of the network. To this end, we curate a comprehensive dataset covering prompt refinement, similar image generation, inpainting and outpainting, and instruction-based editing. Moreover, we propose a two-stage training scheme. In the first stage, we train the MLLM to grasp the properties of image generation and editing, enabling it to generate detailed prompts. In the second stage, we optimize SD to align with the MLLM's generation prompts. Additionally, we propose a reference-based restoration network to alleviate texture, brightness, and contrast disparities between generated and preserved regions during inpainting and outpainting. Extensive experiments show that LLMGA has promising generation and editing capabilities and enables more flexible and expansive applications in an interactive manner.
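The core inference flow the abstract describes (an MLLM rewrites a terse user request into a detailed generation prompt, which then conditions SD) can be sketched as below. This is a minimal illustrative mock, not the paper's actual API: `mllm_refine_prompt` and `generate_image` are hypothetical stand-ins for the trained MLLM and the SD backend.

```python
# Hypothetical sketch of LLMGA's inference flow. All function names and the
# refinement template are illustrative assumptions, not the paper's code.

def mllm_refine_prompt(user_request: str) -> str:
    """Stand-in for the MLLM: expand a short user request into a detailed
    generation prompt (the paper trains an MLLM to produce such prompts)."""
    # A real system would query the trained MLLM; here we only mimic the
    # output shape: the original request enriched with descriptive detail.
    return (
        f"{user_request}, highly detailed, coherent composition, "
        "natural lighting, rich textures"
    )

def generate_image(detailed_prompt: str) -> dict:
    """Stand-in for the SD call conditioned on the detailed prompt."""
    # With diffusers this would be roughly: pipe(detailed_prompt).images[0]
    return {"prompt_used": detailed_prompt, "image": "<image tensor>"}

user_request = "a cat sitting on a windowsill"
detailed = mllm_refine_prompt(user_request)
result = generate_image(detailed)
print(result["prompt_used"])
```

The point of the design, per the abstract, is that the SD model is conditioned on an explicit natural-language prompt rather than an opaque fixed-size embedding, which keeps the control signal human-readable.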
Authors: Bin Xia, Shiyin Wang, Yingfan Tao, Yitong Wang, Jiaya Jia