Generating Images with Multimodal Language Models
Abstract: We propose a method to fuse frozen text-only LLMs with pre-trained image encoder and decoder models, by mapping between their embedding spaces. Our model demonstrates a wide suite of multimodal capabilities: image retrieval, novel image generation, and multimodal dialogue. Ours is the first approach capable of conditioning on arbitrarily interleaved image and text inputs to generate coherent image (and text) outputs. To achieve strong performance on image generation, we propose an efficient mapping network to ground the LLM to an off-the-shelf text-to-image generation model. This mapping network translates hidden representations of text into the embedding space of the visual models, enabling us to leverage the strong text representations of the LLM for visual outputs. Our approach outperforms baseline generation models on tasks with longer and more complex language. In addition to novel image generation, our model is also capable of image retrieval from a prespecified dataset, and decides whether to retrieve or generate at inference time. This is done with a learnt decision module which conditions on the hidden representations of the LLM. Our model exhibits a wider range of capabilities compared to prior multimodal LLMs. It can process image-and-text inputs, and produce retrieved images, generated images, and generated text -- outperforming non-LLM based generation models across several text-to-image tasks that measure context dependence.
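The core idea in the abstract — a small mapping network that projects the frozen LLM's hidden states into the conditioning space of an off-the-shelf text-to-image model, plus a learned decision module that chooses between retrieval and generation — can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the dimensions (`LLM_DIM`, `IMG_TXT_DIM`), the number of output query tokens, the use of a single linear layer, and the logistic decision head are stand-ins, not the paper's actual architecture or trained weights.

```python
import numpy as np

rng = np.random.default_rng(0)

LLM_DIM = 4096      # hidden size of the frozen LLM (assumed for illustration)
IMG_TXT_DIM = 768   # text-conditioning dim of the image generator (assumed)
N_QUERIES = 4       # number of learned output tokens fed to the generator (assumed)

# Hypothetical mapping network: here just a linear map from the LLM's
# hidden states to the image model's text-embedding space. The real
# mapping network is trained; these weights are random placeholders.
W_map = rng.normal(scale=0.02, size=(LLM_DIM, IMG_TXT_DIM))
b_map = np.zeros(IMG_TXT_DIM)

def map_to_image_space(llm_hidden):
    """llm_hidden: (N_QUERIES, LLM_DIM) hidden states from the frozen LLM."""
    return llm_hidden @ W_map + b_map  # -> (N_QUERIES, IMG_TXT_DIM)

# Hypothetical decision module: a logistic head on the pooled hidden
# states that scores "generate a novel image" vs. "retrieve from a dataset".
w_dec = rng.normal(scale=0.02, size=LLM_DIM)

def decide_generate(llm_hidden):
    pooled = llm_hidden.mean(axis=0)                  # (LLM_DIM,)
    p_generate = 1.0 / (1.0 + np.exp(-pooled @ w_dec))
    return p_generate > 0.5                           # True -> generate, False -> retrieve

hidden = rng.normal(size=(N_QUERIES, LLM_DIM))
cond = map_to_image_space(hidden)
print(cond.shape)   # (4, 768): conditioning vectors passed to the image model
print(decide_generate(hidden))
```

In use, the conditioning vectors produced by the mapping network would replace the text-encoder output that the image generator normally consumes, which is what lets the frozen LLM drive visual outputs without fine-tuning either model end to end.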