Bridging Vision and Language Spaces with Assignment Prediction (2404.09632v1)

Published 15 Apr 2024 in cs.CV and cs.LG

Abstract: This paper introduces VLAP, a novel approach that bridges pretrained vision models and LLMs to make frozen LLMs understand the visual world. VLAP transforms the embedding space of pretrained vision models into the LLMs' word embedding space using a single linear layer for efficient and general-purpose visual and language understanding. Specifically, we harness well-established word embeddings to bridge the two modality embedding spaces. The visual and text representations are simultaneously assigned to a set of word embeddings within pretrained LLMs by formulating the assignment procedure as an optimal transport problem. We predict the assignment of one modality from the representation of the other, enforcing consistent assignments for paired multimodal data. This allows vision and language representations to carry the same information, grounding the frozen LLMs' word embedding space in visual data. Moreover, the robust semantic taxonomy of LLMs can be preserved with visual data, since LLMs interpret and reason about linguistic information through correlations between word embeddings. Experimental results show that VLAP achieves substantial improvements over previous linear transformation-based approaches across a range of vision-language tasks, including image captioning, visual question answering, and cross-modal retrieval. We also demonstrate that the learned visual representations hold the semantic taxonomy of LLMs, making visual semantic arithmetic possible.
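
The abstract describes a simple recipe: a single trainable linear layer maps frozen vision features into the frozen LLM's word embedding space, paired image and text features are softly assigned to the LLM's word embeddings via optimal transport, and each modality is trained to predict the other's assignment. The PyTorch sketch below is only illustrative, not the authors' implementation; the names `sinkhorn` and `VisionWordBridge`, the Sinkhorn-style relaxation, the temperature, and all shapes are assumptions made for the example.

```python
import torch
import torch.nn.functional as F


def sinkhorn(scores: torch.Tensor, eps: float = 0.05, n_iters: int = 3) -> torch.Tensor:
    """Sinkhorn-Knopp iterations: relax a batch-by-vocabulary score matrix into a
    balanced soft-assignment matrix (an entropic optimal-transport solution)."""
    Q = torch.exp(scores / eps)
    Q = Q / Q.sum()
    n_rows, n_cols = Q.shape
    for _ in range(n_iters):
        Q = Q / (Q.sum(dim=0, keepdim=True) * n_cols)  # balance word-embedding (column) marginals
        Q = Q / (Q.sum(dim=1, keepdim=True) * n_rows)  # balance sample (row) marginals
    return Q * n_rows  # each row is now a distribution over word embeddings


class VisionWordBridge(torch.nn.Module):
    """Illustrative module (names and hyperparameters are assumptions, not the paper's code):
    a single trainable linear layer maps frozen vision features into the frozen LLM's
    word-embedding space; paired image and text features are softly assigned to word
    embeddings, and each modality is trained to predict the other's assignment."""

    def __init__(self, vision_dim: int, word_embeddings: torch.Tensor, temperature: float = 0.1):
        super().__init__()
        self.proj = torch.nn.Linear(vision_dim, word_embeddings.shape[1])  # the only trained parameters
        self.register_buffer("words", F.normalize(word_embeddings, dim=-1))  # frozen LLM word embeddings
        self.temperature = temperature

    def forward(self, vision_feats: torch.Tensor, text_feats: torch.Tensor) -> torch.Tensor:
        v = F.normalize(self.proj(vision_feats), dim=-1)  # project image features into word space
        t = F.normalize(text_feats, dim=-1)               # text features already live in word space
        sim_v = v @ self.words.T                          # image-to-word similarities
        sim_t = t @ self.words.T                          # text-to-word similarities
        with torch.no_grad():                             # assignments serve as fixed targets
            q_v = sinkhorn(sim_v)
            q_t = sinkhorn(sim_t)
        # Swapped prediction: the image branch predicts the text assignment and vice versa,
        # enforcing consistent word-level assignments for paired multimodal data.
        loss_v = -(q_t * F.log_softmax(sim_v / self.temperature, dim=-1)).sum(dim=-1).mean()
        loss_t = -(q_v * F.log_softmax(sim_t / self.temperature, dim=-1)).sum(dim=-1).mean()
        return loss_v + loss_t


# Toy usage with random tensors; dimensions are placeholders, not the paper's settings.
bridge = VisionWordBridge(vision_dim=768, word_embeddings=torch.randn(1000, 512))
loss = bridge(torch.randn(8, 768), torch.randn(8, 512))
```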

Authors (3)
  1. Jungin Park (16 papers)
  2. Jiyoung Lee (42 papers)
  3. Kwanghoon Sohn (53 papers)