Grounding Language Models to Images for Multimodal Inputs and Outputs (2301.13823v4)

Published 31 Jan 2023 in cs.CL, cs.AI, cs.CV, and cs.LG

Abstract: We propose an efficient method to ground pretrained text-only LLMs to the visual domain, enabling them to process arbitrarily interleaved image-and-text data, and generate text interleaved with retrieved images. Our method leverages the abilities of LLMs learnt from large scale text-only pretraining, such as in-context learning and free-form text generation. We keep the LLM frozen, and finetune input and output linear layers to enable cross-modality interactions. This allows our model to process arbitrarily interleaved image-and-text inputs, and generate free-form text interleaved with retrieved images. We achieve strong zero-shot performance on grounded tasks such as contextual image retrieval and multimodal dialogue, and showcase compelling interactive abilities. Our approach works with any off-the-shelf LLM and paves the way towards an effective, general solution for leveraging pretrained LLMs in visually grounded settings.

The paper "Grounding LLMs to Images for Multimodal Inputs and Outputs" (Koh et al., 2023 ) introduces Frozen Retrieval Over Multimodal Data for Autoregressive Generation (FROMAGe), a method for grounding pre-trained text-only LLMs to the visual domain. This enables the model to process arbitrarily interleaved image-and-text data and generate text interleaved with retrieved images. The key idea is to leverage the existing capabilities of LLMs, such as in-context learning and free-form text generation, while adapting them to handle visual information.

The approach involves keeping the LLM frozen and fine-tuning input and output linear layers to facilitate cross-modality interactions. The model is trained with a multi-task objective:

  • Image captioning: learning to process interleaved multimodal inputs.
  • Image-text retrieval: learning to produce interleaved multimodal outputs.

For image captioning, visual embeddings are extracted using a pre-trained visual encoder. A linear mapping $\mathbf{W}_c \in \mathbb{R}^{m \times kd}$ is learned via a maximum-likelihood objective to map these embeddings into the input space of the LLM, where $m$ is the dimension of the visual embeddings, $k$ is the number of input vectors produced, and $d$ is the hidden dimensionality of the LLM.
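
As a rough sketch of this input-side mapping (PyTorch is assumed, and the dimensions are illustrative rather than the paper's exact settings):

```python
import torch
import torch.nn as nn

# Illustrative dimensions (assumptions, not the paper's exact values):
# m = visual embedding dim, k = number of soft tokens, d = LLM hidden dim.
m, k, d = 1024, 4, 4096

# W_c maps one visual embedding to k vectors in the LLM's input space.
W_c = nn.Linear(m, k * d)

def visual_prefix(image_emb: torch.Tensor) -> torch.Tensor:
    """image_emb: (batch, m) output of the frozen visual encoder."""
    out = W_c(image_emb)          # (batch, k * d)
    return out.view(-1, k, d)     # (batch, k, d): consumed by the LLM like k word embeddings
```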

For image-text retrieval, the LLM learns a new [RET] token representing an image. Another linear mapping, $\mathbf{W}_t \in \mathbb{R}^{p \times q}$, is trained with contrastive learning to map the [RET] embedding for a caption close to the visual embedding of its paired image. The visual embeddings $v_{\phi}(y_i)$ are mapped into the same retrieval space using a linear mapping $\mathbf{W}_i \in \mathbb{R}^{m \times q}$, where:

  • $p$: dimensionality of the [RET] hidden representation from the last hidden layer of the LLM
  • $q$: retrieval dimension, with $q < p$
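
A minimal sketch of how these retrieval mappings and the new [RET] token could be set up, assuming PyTorch and a Hugging Face-style tokenizer/model interface (all names and dimensions are illustrative):

```python
import torch.nn as nn

# Illustrative dimensions (assumptions): p = LLM hidden size, q = retrieval dim, m = visual dim.
p, q, m = 4096, 256, 1024

W_t = nn.Linear(p, q, bias=False)   # [RET] hidden state -> shared retrieval space
W_i = nn.Linear(m, q, bias=False)   # visual embedding   -> shared retrieval space

# Registering the new [RET] token, assuming hypothetical Hugging Face-style
# `tokenizer` and `model` objects (shown only as comments):
# tokenizer.add_special_tokens({"additional_special_tokens": ["[RET]"]})
# model.resize_token_embeddings(len(tokenizer))
# ret_token_id = tokenizer.convert_tokens_to_ids("[RET]")
```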

The normalized cosine similarity for the image and text embeddings is computed as:

$\text{sim}(x, y) = \frac{\left(h_{\theta}(x)^T \mathbf{W}_t\right)\left(v_{\phi}(y)^T \mathbf{W}_i\right)^T}{\lVert h_{\theta}(x)^T \mathbf{W}_t \rVert \, \lVert v_{\phi}(y)^T \mathbf{W}_i \rVert}$

Where:

  • $x$: caption
  • $y$: paired image
  • $h_{\theta}(x)$: output of the last hidden layer of the LLM for the [RET] token
  • $v_{\phi}(y)$: output of the visual encoder for the image
  • $\mathbf{W}_t$: linear mapping applied to the [RET] hidden representation from the last hidden layer of the LLM
  • $\mathbf{W}_i$: linear mapping applied to the visual embedding
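
A hedged sketch of this similarity computation in PyTorch, taking the mapping layers as arguments (shapes and names are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def retrieval_similarity(ret_hidden: torch.Tensor,   # (N, p): h_theta(x) for the [RET] token
                         visual_emb: torch.Tensor,   # (N, m): v_phi(y) from the visual encoder
                         W_t: nn.Linear,             # maps p -> q
                         W_i: nn.Linear) -> torch.Tensor:
    """Return the (N, N) matrix of normalized cosine similarities sim(x_i, y_j)."""
    text_feat = F.normalize(W_t(ret_hidden), dim=-1)   # unit-norm rows in retrieval space
    img_feat = F.normalize(W_i(visual_emb), dim=-1)
    return text_feat @ img_feat.T
```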

The InfoNCE loss is minimized for text-to-image (t2i) and image-to-text (i2t) retrieval over a batch of $N$ text-image pairs $(x_i, y_i)$. The loss functions are:

$\mathcal{L}_{\text{t2i}} = -\frac{1}{N} \sum_{i=1}^N \left( \log \frac{\exp(\text{sim}(x_i, y_i) / \tau)}{\sum_{j=1}^N \exp(\text{sim}(x_i, y_j) / \tau)} \right)$

$\mathcal{L}_{\text{i2t}} = -\frac{1}{N} \sum_{i=1}^N \left( \log \frac{\exp(\text{sim}(y_i, x_i) / \tau)}{\sum_{j=1}^N \exp(\text{sim}(y_i, x_j) / \tau)} \right)$

Where $\tau$ is a learnable temperature parameter.
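
These two losses reduce to a symmetric cross-entropy over the similarity matrix. A sketch under the same assumptions as the previous snippet (the log-scale parameterization of the temperature is a common choice assumed here, not taken from the paper):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Learnable temperature tau, stored on a log scale (an assumed parameterization).
log_tau = nn.Parameter(torch.zeros(()))

def infonce_losses(sim_matrix: torch.Tensor):
    """sim_matrix[i, j] = sim(x_i, y_j) for a batch of N caption-image pairs."""
    logits = sim_matrix / log_tau.exp()
    targets = torch.arange(sim_matrix.size(0), device=sim_matrix.device)
    loss_t2i = F.cross_entropy(logits, targets)      # each caption should match its own image
    loss_i2t = F.cross_entropy(logits.T, targets)    # each image should match its own caption
    return loss_t2i, loss_i2t
```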

The final training loss is a weighted sum of the captioning loss $\mathcal{L}_{\text{c}}$ and the retrieval losses:

$\mathcal{L} = \lambda_c \mathcal{L}_{\text{c}} + \lambda_r (\mathcal{L}_{\text{t2i}} + \mathcal{L}_{\text{i2t}})$

Where $\lambda_c$ is the captioning loss weight and $\lambda_r$ is the retrieval loss weight.

During training, only the linear mappings ($\mathbf{W}_c$, $\mathbf{W}_t$, and $\mathbf{W}_i$) and the [RET] embedding vector are updated.
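
Putting the pieces together, a sketch of the overall objective and the parameter-efficient optimizer setup (module names, loss weights, and the learning rate are illustrative assumptions, not the authors' settings):

```python
import itertools
import torch
import torch.nn as nn

# Hypothetical modules mirroring the three mappings and the [RET] embedding
# (dimensions are illustrative assumptions).
m, k, d, p, q = 1024, 4, 4096, 4096, 256
W_c = nn.Linear(m, k * d)                            # visual embedding -> LLM input space
W_t = nn.Linear(p, q, bias=False)                    # [RET] hidden state -> retrieval space
W_i = nn.Linear(m, q, bias=False)                    # visual embedding -> retrieval space
ret_token_embedding = nn.Parameter(torch.randn(d))   # input-embedding row for the new [RET] token

# Illustrative loss weights (values assumed here).
lambda_c, lambda_r = 1.0, 1.0

def total_loss(loss_c, loss_t2i, loss_i2t):
    return lambda_c * loss_c + lambda_r * (loss_t2i + loss_i2t)

# Only the mappings and the [RET] embedding receive gradients; the LLM and the
# visual encoder stay frozen and contribute no parameters to the optimizer.
optimizer = torch.optim.Adam(
    itertools.chain(W_c.parameters(), W_t.parameters(), W_i.parameters(), [ret_token_embedding]),
    lr=1e-3,  # illustrative learning rate
)
```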

The paper evaluates FROMAGe on tasks such as contextual image retrieval and visual dialogue, demonstrating strong zero-shot performance. Key findings include:

  • Autoregressive LLMs can perform text-to-image retrieval with greater sensitivity to input text compared to existing models.
  • The existing capabilities of pre-trained text-only LLMs can be leveraged for visually grounded tasks.

Experiments on the Visual Storytelling (VIST) dataset [huang2016visual] show that FROMAGe outperforms CLIP [radford2021learning] in contextual image retrieval, especially when provided with longer, temporally dependent sentences and interleaved image-and-text context. On Visual Dialog (VisDial) [das2017visual], FROMAGe achieves competitive results in zero-shot text answer selection and significantly outperforms prior work in text-to-image retrieval. Ablation studies validate the importance of freezing the LLM and using a dedicated retrieval token. The paper also presents results showing a positive correlation between model size and performance.

References (60)
  1. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.
  2. Cm3: A causal masked multimodal model of the internet. arXiv preprint arXiv:2201.07520, 2022.
  3. Flamingo: a visual language model for few-shot learning. NeurIPS, 2022.
  4. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, 2005.
  5. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pp.  610–623, 2021.
  6. Multimodal datasets: misogyny, pornography, and malignant stereotypes. arXiv preprint arXiv:2110.01963, 2021.
  7. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021.
  8. Language models are few-shot learners. NeurIPS, 2020.
  9. Data distributional properties drive emergent few-shot learning in transformers. NeurIPS, 2022.
  10. Learning a similarity metric discriminatively, with application to face verification. In CVPR, 2005.
  11. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.
  12. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416, 2022.
  13. Transformer-xl: Attentive language models beyond a fixed-length context. ACL, 2019.
  14. Visual dialog. In CVPR, 2017.
  15. LLM.int8(): 8-bit matrix multiplication for transformers at scale. NeurIPS, 2022.
  16. Cogview2: Faster and better text-to-image generation via hierarchical transformers. arXiv preprint arXiv:2204.14217, 2022.
  17. Magma–multimodal augmentation of generative models through adapter-based finetuning. EMNLP, 2022.
  18. Taming transformers for high-resolution image synthesis. In CVPR, 2021.
  19. Realtoxicityprompts: Evaluating neural toxic degeneration in language models. EMNLP, 2020.
  20. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In CVPR, 2017.
  21. Training compute-optimal large language models. NeurIPS, 2022.
  22. The curious case of neural text degeneration. ICLR, 2020.
  23. Parameter-efficient transfer learning for nlp. In ICML, 2019.
  24. Visual storytelling. In NAACL-HLT, 2016.
  25. Scaling up visual and vision-language representation learning with noisy text supervision. In ICLR, 2021.
  26. Adam: A method for stochastic optimization. ICLR, 2015.
  27. The power of scale for parameter-efficient prompt tuning. EMNLP, 2021.
  28. The winograd schema challenge. In Thirteenth international conference on the principles of knowledge representation and reasoning, 2012.
  29. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In ICML, 2022a.
  30. Prefix-tuning: Optimizing continuous prompts for generation. ACL, 2021.
  31. Contrastive decoding: Open-ended text generation as optimization. arXiv preprint arXiv:2210.15097, 2022b.
  32. Microsoft coco: Common objects in context. In ECCV, 2014.
  33. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. NeurIPS, 2019.
  34. Pretrained transformers as universal computation engines. AAAI, 2022.
  35. Linearly mapping from image to text space. arXiv preprint arXiv:2209.15162, 2022.
  36. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
  37. Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155, 2022.
  38. Bleu: a method for automatic evaluation of machine translation. In ACL, 2002.
  39. Pytorch: An imperative style, high-performance deep learning library. NeurIPS, 2019.
  40. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
  41. Learning transferable visual models from natural language supervision. In ICLR, 2021.
  42. Scaling language models: Methods, analysis & insights from training gopher. arXiv preprint arXiv:2112.11446, 2021.
  43. Zero-shot text-to-image generation. In ICML, 2021.
  44. Generative adversarial text to image synthesis. In ICML, 2016.
  45. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114, 2021.
  46. Neural machine translation of rare words with subword units. ACL, 2015.
  47. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. ACL, 2018.
  48. Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990, 2022.
  49. Progressive generation of long text with pretrained language models. NAACL, 2021.
  50. Transcending scaling laws with 0.1% extra compute. arXiv preprint arXiv:2210.11399, 2022.
  51. Multimodal few-shot learning with frozen language models. NeurIPS, 2021.
  52. Attention is all you need. NeurIPS, 2017.
  53. Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. ICML, 2022.
  54. Finetuned language models are zero-shot learners. ICLR, 2021.
  55. Emergent abilities of large language models. TMLR, 2022.
  56. Re3: Generating longer stories with recursive reprompting and revision. EMNLP, 2022.
  57. Vector-quantized image modeling with improved vqgan. ICLR, 2021.
  58. Scaling autoregressive models for content-rich text-to-image generation. TMLR, 2022a.
  59. Multimodal knowledge alignment with reinforcement learning. arXiv preprint arXiv:2205.12630, 2022b.
  60. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022.
Authors (3)
  1. Jing Yu Koh
  2. Ruslan Salakhutdinov
  3. Daniel Fried
Citations (98)