
Masked Vision and Language Modeling for Multi-modal Representation Learning (2208.02131v2)

Published 3 Aug 2022 in cs.CV, cs.CL, and cs.LG

Abstract: In this paper, we study how to use masked signal modeling in vision and language (V+L) representation learning. Instead of developing masked language modeling (MLM) and masked image modeling (MIM) independently, we propose to build joint masked vision and language modeling, where the masked signal of one modality is reconstructed with the help of the other modality. This is motivated by the nature of image-text paired data, in which the image and the text convey almost the same information but in different formats. The masked signal reconstruction of one modality conditioned on the other modality can also implicitly learn cross-modal alignment between language tokens and image patches. Our experiments on various V+L tasks show that the proposed method, along with common V+L alignment losses, achieves state-of-the-art performance in the regime of millions of pre-training data. We also outperform other competitors by a significant margin in limited-data scenarios.
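
As a rough illustration of the joint masking idea described in the abstract, the following sketch builds the two reconstruction examples for one image-text pair: masked text paired with the image, and a masked image paired with the text. It is not the authors' implementation. The function names, the `[MASK]` placeholder, the masking ratios, the use of string IDs in place of real image patches, and the choice to leave the conditioning modality fully unmasked are all assumptions made for illustration, and no model or loss is included.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder for a masked token or patch


def mask_sequence(items, ratio, rng):
    """Mask a random subset of items; return the masked copy and the targets."""
    n_mask = max(1, round(len(items) * ratio))
    masked_idx = set(rng.sample(range(len(items)), n_mask))
    masked = [MASK if i in masked_idx else x for i, x in enumerate(items)]
    targets = {i: items[i] for i in sorted(masked_idx)}
    return masked, targets


def build_joint_masking_examples(tokens, patches,
                                 text_ratio=0.3, image_ratio=0.6, seed=0):
    """Build both reconstruction directions for one image-text pair.

    The ratios are illustrative defaults, not values taken from the paper.
    """
    rng = random.Random(seed)
    masked_tokens, token_targets = mask_sequence(tokens, text_ratio, rng)
    masked_patches, patch_targets = mask_sequence(patches, image_ratio, rng)
    return {
        # Reconstruct masked text tokens, conditioned on the image (assumed unmasked).
        "text_given_image": {
            "inputs": (masked_tokens, list(patches)),
            "targets": token_targets,
        },
        # Reconstruct masked image patches, conditioned on the text (assumed unmasked).
        "image_given_text": {
            "inputs": (list(tokens), masked_patches),
            "targets": patch_targets,
        },
    }


if __name__ == "__main__":
    tokens = "a brown dog runs across the green field today .".split()
    patches = [f"patch_{i}" for i in range(10)]  # stand-ins for image patches
    examples = build_joint_masking_examples(tokens, patches)
    for name, example in examples.items():
        print(name, example["inputs"], example["targets"])
```

In each direction, a model would be trained to predict the `targets` from the `inputs`, so the only source of information about a masked position, beyond its unmasked neighbors, is the other modality.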

Authors (6)
  1. Gukyeong Kwon (14 papers)
  2. Zhaowei Cai (22 papers)
  3. Avinash Ravichandran (35 papers)
  4. Erhan Bas (7 papers)
  5. Rahul Bhotika (13 papers)
  6. Stefano Soatto (179 papers)
Citations (54)