
VL-BEiT: Generative Vision-Language Pretraining (2206.01127v2)

Published 2 Jun 2022 in cs.CV and cs.CL

Abstract: We introduce a vision-language foundation model called VL-BEiT, which is a bidirectional multimodal Transformer learned by generative pretraining. Our minimalist solution conducts masked prediction on both monomodal and multimodal data with a shared Transformer. Specifically, we perform masked vision-language modeling on image-text pairs, masked language modeling on texts, and masked image modeling on images. VL-BEiT is learned from scratch with one unified pretraining task, one shared backbone, and one-stage training. Our method is conceptually simple and empirically effective. Experimental results show that VL-BEiT obtains strong results on various vision-language benchmarks, such as visual question answering, visual reasoning, and image-text retrieval. Moreover, our method learns transferable visual features, achieving competitive performance on image classification and semantic segmentation.
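The "one unified pretraining task" the abstract describes amounts to applying the same masked cross-entropy loss to three input streams through a single backbone. The PyTorch sketch below illustrates that structure under stated assumptions; it is not the authors' released code. The names (SharedBackbone, masked_ce, vl_beit_step), the vanilla nn.TransformerEncoder stand-in for the paper's backbone, the vocabulary sizes, and the equal loss weighting across streams are all assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedBackbone(nn.Module):
    """Single Transformer shared across text, image, and image-text inputs.
    Hypothetical stand-in: the paper's actual block design is not reproduced."""
    def __init__(self, dim=768, depth=12, heads=12,
                 text_vocab=30522, visual_vocab=8192):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        # Two prediction heads: word tokens for the (vision-)language
        # modeling streams, discrete visual tokens (a BEiT-style tokenizer
        # vocabulary) for masked image modeling.
        self.text_head = nn.Linear(dim, text_vocab)
        self.image_head = nn.Linear(dim, visual_vocab)

    def forward(self, x):  # x: (batch, seq, dim) input embeddings
        return self.encoder(x)

def masked_ce(logits, targets, mask):
    """Cross-entropy computed only at masked positions."""
    targets = targets.masked_fill(~mask, -100)  # ignore unmasked positions
    return F.cross_entropy(logits.transpose(1, 2), targets, ignore_index=-100)

# One unified objective: the same masked-prediction loss applied to the
# three data streams, then summed (equal weighting is an assumption).
def vl_beit_step(model, text_batch, image_batch, pair_batch):
    # Masked language modeling on text-only data.
    mlm = masked_ce(model.text_head(model(text_batch["emb"])),
                    text_batch["ids"], text_batch["mask"])
    # Masked image modeling on image-only data.
    mim = masked_ce(model.image_head(model(image_batch["emb"])),
                    image_batch["ids"], image_batch["mask"])
    # Masked vision-language modeling on image-text pairs: concatenate both
    # modalities and predict masked positions in each.
    h = model(torch.cat([pair_batch["txt_emb"], pair_batch["img_emb"]], dim=1))
    t = pair_batch["txt_emb"].size(1)
    mvlm = (masked_ce(model.text_head(h[:, :t]),
                      pair_batch["txt_ids"], pair_batch["txt_mask"])
            + masked_ce(model.image_head(h[:, t:]),
                        pair_batch["img_ids"], pair_batch["img_mask"]))
    return mlm + mim + mvlm
```

The design point the abstract emphasizes is that nothing task-specific is added during pretraining: the same backbone and the same masked-prediction loss serve texts, images, and image-text pairs in a single training stage.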

Authors (4)
  1. Hangbo Bao (17 papers)
  2. Wenhui Wang (47 papers)
  3. Li Dong (154 papers)
  4. Furu Wei (291 papers)
Citations (42)