Grounding Language Models to Images for Multimodal Inputs and Outputs

(arXiv:2301.13823)
Published Jan 31, 2023 in cs.CL, cs.AI, cs.CV, and cs.LG

Abstract

We propose an efficient method to ground pretrained text-only language models to the visual domain, enabling them to process arbitrarily interleaved image-and-text data and to generate text interleaved with retrieved images. Our method leverages the abilities language models learn from large-scale text-only pretraining, such as in-context learning and free-form text generation. We keep the language model frozen and finetune only input and output linear layers to enable cross-modality interactions; this lets the model consume arbitrarily interleaved image-and-text inputs and produce free-form text interleaved with retrieved images. We achieve strong zero-shot performance on grounded tasks such as contextual image retrieval and multimodal dialogue, and showcase compelling interactive abilities. Our approach works with any off-the-shelf language model and paves the way towards an effective, general solution for leveraging pretrained language models in visually grounded settings.
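The abstract describes a concrete architecture: a frozen language model bridged to the visual domain through a small set of trainable linear layers. The sketch below illustrates that idea in PyTorch. It is a minimal, illustrative implementation rather than the authors' code: the frozen LM and the CLIP-style image encoder are assumed inputs, and names such as `FrozenLMGrounding`, `visual_to_lm`, and `retrieval_scores` are hypothetical.

```python
import torch.nn as nn
import torch.nn.functional as F


class FrozenLMGrounding(nn.Module):
    """Minimal sketch: ground a frozen text-only LM to images with
    trainable input/output linear layers (illustrative names throughout)."""

    def __init__(self, lm, visual_encoder, lm_dim, visual_dim, ret_dim=256):
        super().__init__()
        self.lm = lm                          # pretrained LM, kept frozen
        self.visual_encoder = visual_encoder  # e.g. a CLIP-style image encoder, also frozen
        for module in (self.lm, self.visual_encoder):
            for p in module.parameters():
                p.requires_grad = False

        # Trainable input layer: project image features into the LM's input
        # embedding space, so image "tokens" can be interleaved with text tokens.
        self.visual_to_lm = nn.Linear(visual_dim, lm_dim)
        # Trainable output layers: project a retrieval token's hidden state and
        # image features into a shared space for contrastive image retrieval.
        self.text_to_ret = nn.Linear(lm_dim, ret_dim)
        self.visual_to_ret = nn.Linear(visual_dim, ret_dim)

    def embed_images(self, pixel_values):
        """Map a batch of images to LM-input embeddings, one pseudo-token each."""
        feats = self.visual_encoder(pixel_values)     # (B, visual_dim)
        return self.visual_to_lm(feats).unsqueeze(1)  # (B, 1, lm_dim)

    def retrieval_scores(self, ret_hidden, pixel_values):
        """Cosine similarities between retrieval-token states and candidate images."""
        feats = self.visual_encoder(pixel_values)      # (B, visual_dim)
        t = F.normalize(self.text_to_ret(ret_hidden), dim=-1)
        v = F.normalize(self.visual_to_ret(feats), dim=-1)
        return t @ v.T                                 # (B, B) similarity matrix
```

Under this setup only the three linear layers receive gradients; training them with a captioning-style loss over the LM's outputs and a contrastive loss over the retrieval scores would, roughly, give the frozen model the interleaved text-and-retrieved-image behavior the abstract describes.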

