Generating Images with Multimodal Language Models (2305.17216v3)

Published 26 May 2023 in cs.CL, cs.CV, and cs.LG

Abstract: We propose a method to fuse frozen text-only LLMs with pre-trained image encoder and decoder models, by mapping between their embedding spaces. Our model demonstrates a wide suite of multimodal capabilities: image retrieval, novel image generation, and multimodal dialogue. Ours is the first approach capable of conditioning on arbitrarily interleaved image and text inputs to generate coherent image (and text) outputs. To achieve strong performance on image generation, we propose an efficient mapping network to ground the LLM to an off-the-shelf text-to-image generation model. This mapping network translates hidden representations of text into the embedding space of the visual models, enabling us to leverage the strong text representations of the LLM for visual outputs. Our approach outperforms baseline generation models on tasks with longer and more complex language. In addition to novel image generation, our model is also capable of image retrieval from a prespecified dataset, and decides whether to retrieve or generate at inference time. This is done with a learnt decision module which conditions on the hidden representations of the LLM. Our model exhibits a wider range of capabilities compared to prior multimodal LLMs. It can process image-and-text inputs, and produce retrieved images, generated images, and generated text -- outperforming non-LLM based generation models across several text-to-image tasks that measure context dependence.

Generating Images with Multimodal Language Models: An Overview

The paper "Generating Images with Multimodal LLMs" presents an innovative approach to fusing LLMs with pre-trained image encoders and decoders by mapping between their embedding spaces. The authors focus on creating a model that leverages the strengths of LLMs in text processing to extend capabilities to multimodal tasks, such as image retrieval, novel image generation, and multimodal dialogue.

Methodology and Model Architecture

The authors introduce GILL (Generating Images with Large Language Models), which processes arbitrarily interleaved image-and-text inputs to produce coherent image and text outputs. The novelty of the approach lies in an efficient mapping network that translates hidden representations of text into the embedding space of a visual model. This mapping grounds the LLM to an off-the-shelf text-to-image generator, specifically Stable Diffusion, allowing it to produce relevant visual outputs.
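To make this flow concrete, the sketch below shows how hidden states for learned [IMG] tokens can be extracted from a frozen LLM and passed to a linear retrieve-vs-generate decision head of the kind the abstract describes. The checkpoint (facebook/opt-125m), the token names, and the linear head are illustrative assumptions standing in for the paper's actual components, not the authors' released code.

```python
# Illustrative sketch only: a frozen LLM produces hidden states for appended
# [IMG] tokens, and a (hypothetical) linear head decides retrieve vs. generate.
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")  # small stand-in LLM
llm = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")

# Add special [IMG] tokens; in GILL these embeddings and the mapper are the
# trainable pieces on the language side, while the LLM itself stays frozen.
num_img_tokens = 8
img_tokens = [f"[IMG{i}]" for i in range(num_img_tokens)]
tokenizer.add_tokens(img_tokens)
llm.resize_token_embeddings(len(tokenizer))
llm.requires_grad_(False)  # freeze everything for this sketch

prompt = "A watercolor painting of a lighthouse at dusk " + "".join(img_tokens)
inputs = tokenizer(prompt, return_tensors="pt")
hidden = llm(**inputs, output_hidden_states=True).hidden_states[-1]

# Hidden states at the [IMG] positions (the last num_img_tokens positions here).
img_hidden = hidden[:, -num_img_tokens:, :]

# Hypothetical decision head: a linear probe on the [IMG] representations that
# chooses between retrieving an existing image and generating a new one.
decide = nn.Linear(img_hidden.size(-1), 2)
retrieve_or_generate = decide(img_hidden.mean(dim=1)).argmax(dim=-1)
print("decision (0=retrieve, 1=generate):", retrieve_or_generate.item())
```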

The architecture keeps the LLM weights frozen, preserving the capabilities learned during text-only pretraining. The main architectural addition is the GILLMapper module, a lightweight Transformer conditioned on the hidden representations of special learned [IMG] tokens appended to the text. This module efficiently maps the LLM's output embedding space into the input space of the image generation model, enabling image synthesis.
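As an illustration of such a translation network, here is a minimal PyTorch sketch of a GILLMapper-style module: a small Transformer decoder with learned queries that maps the [IMG]-token hidden states onto a sequence of conditioning vectors shaped like a CLIP text-encoder output, trained by regressing onto the frozen text encoder's embeddings of the caption. The dimensions, depth, and loss shown are assumptions for illustration, not the paper's exact configuration.

```python
# Hypothetical GILLMapper-style mapping network (names and sizes are illustrative).
import torch
import torch.nn as nn

class GILLMapperSketch(nn.Module):
    """Maps LLM hidden states at learned [IMG] token positions into the
    conditioning space expected by a frozen text-to-image model."""

    def __init__(self, llm_dim=4096, sd_dim=768, num_queries=77, depth=4, heads=8):
        super().__init__()
        # Learned query embeddings, one per output conditioning vector
        # (e.g. 77 to mimic a CLIP text-encoder sequence for Stable Diffusion).
        self.queries = nn.Parameter(torch.randn(num_queries, sd_dim))
        self.in_proj = nn.Linear(llm_dim, sd_dim)
        layer = nn.TransformerDecoderLayer(d_model=sd_dim, nhead=heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=depth)
        self.out_proj = nn.Linear(sd_dim, sd_dim)

    def forward(self, img_token_hidden):
        # img_token_hidden: (batch, num_img_tokens, llm_dim) hidden states the
        # frozen LLM produced for the appended [IMG] tokens.
        memory = self.in_proj(img_token_hidden)
        queries = self.queries.unsqueeze(0).expand(memory.size(0), -1, -1)
        out = self.decoder(tgt=queries, memory=memory)
        return self.out_proj(out)  # (batch, num_queries, sd_dim)

# Training signal (sketch): regress onto the frozen text encoder's embeddings
# of the ground-truth caption; placeholders stand in for real tensors.
mapper = GILLMapperSketch()
fake_hidden = torch.randn(2, 8, 4096)   # placeholder LLM hidden states
fake_target = torch.randn(2, 77, 768)   # placeholder CLIP text embeddings
loss = nn.functional.mse_loss(mapper(fake_hidden), fake_target)
loss.backward()
```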

Results and Evaluation

Experimental results demonstrate that GILL outperforms baseline generation models on tasks requiring longer and more complex language context. Its ability to process multimodal context allows it to outperform non-LLM-based generation models, particularly in dialogue-conditioned image generation. The paper reports quantitative results on VIST (Visual Storytelling) and VisDial (Visual Dialog), highlighting GILL's improved performance in generating relevant images when conditioned on rich textual and visual context.
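For context on how relevance to a ground-truth image can be quantified in such evaluations, below is a hedged sketch that scores a generated image against a reference image by cosine similarity of CLIP image embeddings. The specific CLIP checkpoint and the use of image-image similarity are illustrative assumptions, not necessarily the paper's exact evaluation protocol.

```python
# Illustrative relevance scoring: cosine similarity of CLIP image embeddings.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_image_similarity(generated: Image.Image, reference: Image.Image) -> float:
    """Cosine similarity between CLIP image embeddings of two images."""
    batch = processor(images=[generated, reference], return_tensors="pt")
    with torch.no_grad():
        feats = clip.get_image_features(**batch)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    return float(feats[0] @ feats[1])

# Example usage with placeholder images (replace with model outputs and
# VIST/VisDial ground-truth images).
gen = Image.new("RGB", (256, 256), "white")
ref = Image.new("RGB", (256, 256), "white")
print(clip_image_similarity(gen, ref))
```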

Implications and Future Directions

The research provides compelling evidence that integrating LLMs with visual models can expand the capabilities of multimodal LLMs. This approach has significant implications for the future of artificial intelligence, particularly in applications involving AI assistants that require both text processing and image generation capabilities. The ability to produce interleaved multimodal outputs enhances the model's utility in a variety of tasks, from creative endeavors to providing visual content in response to queries.

The modular nature of GILL allows it to potentially benefit from advances in LLMs and visual models, suggesting future directions for scaling up the architecture. This could involve utilizing larger LLMs, more sophisticated image generation backbones, or finetuning on diverse datasets to improve alignment with the visual generation model.

Conclusion

In summary, the paper presents a significant step forward in enhancing the multimodal capabilities of LLMs by grounding them to visual outputs. Through efficient architectural innovations and robust evaluation, the authors demonstrate the potential of GILL for processing image-and-text inputs and generating image-and-text outputs, providing a promising foundation for future advancements in multimodal AI systems.

Authors (3)
  1. Jing Yu Koh
  2. Daniel Fried
  3. Ruslan Salakhutdinov
Citations (190)