EVE: Efficient Vision-Language Pre-training with Masked Prediction and Modality-Aware MoE (2308.11971v2)

Published 23 Aug 2023 in cs.CV, cs.CL, cs.LG, and cs.MM

Abstract: Building scalable vision-language models to learn from diverse, multimodal data remains an open challenge. In this paper, we introduce an Efficient Vision-languagE foundation model, namely EVE, which is one unified multimodal Transformer pre-trained solely by one unified pre-training task. Specifically, EVE encodes both vision and language within a shared Transformer network integrated with modality-aware sparse Mixture-of-Experts (MoE) modules, which capture modality-specific information by selectively switching to different experts. To unify pre-training tasks of vision and language, EVE performs masked signal modeling on image-text pairs to reconstruct masked signals, i.e., image pixels and text tokens, given visible signals. This simple yet effective pre-training objective accelerates training by 3.5x compared to the model pre-trained with Image-Text Contrastive and Image-Text Matching losses. Owing to the combination of the unified architecture and pre-training task, EVE is easy to scale up, enabling better downstream performance with fewer resources and faster training speed. Despite its simplicity, EVE achieves state-of-the-art performance on various vision-language downstream tasks, including visual question answering, visual reasoning, and image-text retrieval.
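To make the architectural idea concrete, the sketch below shows a minimal modality-aware sparse MoE feed-forward block in PyTorch: image and text tokens share the surrounding Transformer, but each modality is routed to its own pool of experts by a learned gate. This is an illustrative sketch only, not the authors' implementation; the class names, the per-modality gates, top-1 routing, expert counts, and dimensions are all assumptions for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ModalityAwareMoE(nn.Module):
    """Illustrative modality-aware sparse MoE feed-forward block.

    Each token is routed to an expert drawn from the pool belonging to its
    modality (vision or text). Top-1 routing and all hyperparameters here
    are assumptions for the sketch, not EVE's exact design.
    """

    def __init__(self, dim: int, num_experts_per_modality: int = 2):
        super().__init__()
        hidden = 4 * dim  # assumed FFN expansion factor

        def make_expert():
            return nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

        self.vision_experts = nn.ModuleList(make_expert() for _ in range(num_experts_per_modality))
        self.text_experts = nn.ModuleList(make_expert() for _ in range(num_experts_per_modality))
        self.vision_gate = nn.Linear(dim, num_experts_per_modality)
        self.text_gate = nn.Linear(dim, num_experts_per_modality)

    def _route(self, x, gate, experts):
        # Top-1 routing: each token is processed by its highest-scoring expert,
        # weighted by that expert's gate probability.
        scores = F.softmax(gate(x), dim=-1)   # (tokens, num_experts)
        top1 = scores.argmax(dim=-1)          # (tokens,)
        out = torch.zeros_like(x)
        for idx, expert in enumerate(experts):
            mask = top1 == idx
            if mask.any():
                out[mask] = expert(x[mask]) * scores[mask, idx].unsqueeze(-1)
        return out

    def forward(self, tokens: torch.Tensor, is_vision: torch.Tensor) -> torch.Tensor:
        # tokens: (N, dim) flattened sequence; is_vision: (N,) boolean modality mask.
        out = torch.zeros_like(tokens)
        if is_vision.any():
            out[is_vision] = self._route(tokens[is_vision], self.vision_gate, self.vision_experts)
        if (~is_vision).any():
            out[~is_vision] = self._route(tokens[~is_vision], self.text_gate, self.text_experts)
        return out


# Usage sketch: 16 image-patch tokens followed by 8 text tokens, dim 768.
x = torch.randn(24, 768)
modality = torch.tensor([True] * 16 + [False] * 8)
moe = ModalityAwareMoE(dim=768)
print(moe(x, modality).shape)  # torch.Size([24, 768])
```

In the paper, a block like this sits inside the shared Transformer, and the whole model is trained with the single masked signal modeling objective described above: masked image pixels and masked text tokens are reconstructed from the visible signals.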

Authors (7)
  1. Junyi Chen (31 papers)
  2. Longteng Guo (31 papers)
  3. Jia Sun (17 papers)
  4. Shuai Shao (57 papers)
  5. Zehuan Yuan (65 papers)
  6. Liang Lin (318 papers)
  7. Dongyu Zhang (32 papers)
Citations (3)