EVE: Efficient Vision-Language Pre-training with Masked Prediction and Modality-Aware MoE (2308.11971v2)
Abstract: Building scalable vision-language models that learn from diverse, multimodal data remains an open challenge. In this paper, we introduce an Efficient Vision-languagE foundation model, namely EVE, a unified multimodal Transformer pre-trained with a single unified pre-training task. Specifically, EVE encodes both vision and language within a shared Transformer network integrated with modality-aware sparse Mixture-of-Experts (MoE) modules, which capture modality-specific information by selectively switching to different experts. To unify the pre-training of vision and language, EVE performs masked signal modeling on image-text pairs, reconstructing masked signals, i.e., image pixels and text tokens, from the visible ones. This simple yet effective pre-training objective accelerates training by 3.5x compared to a model pre-trained with Image-Text Contrastive and Image-Text Matching losses. Owing to the combination of the unified architecture and pre-training task, EVE is easy to scale up, enabling better downstream performance with fewer resources and faster training. Despite its simplicity, EVE achieves state-of-the-art performance on various vision-language downstream tasks, including visual question answering, visual reasoning, and image-text retrieval.
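The abstract describes two ideas concretely enough to sketch: modality-aware sparse MoE routing (tokens dispatched to modality-specific expert FFNs with sparse switching) and a unified masked-signal objective (pixel regression for masked image patches, cross-entropy for masked text tokens). The PyTorch sketch below is an illustrative reading of those ideas, not the authors' released code; names such as `ModalityAwareMoE`, `vision_experts`, and `masked_signal_loss`, along with the top-1 routing and loss weighting, are assumptions for illustration.

```python
# Illustrative sketch (not the paper's implementation): a modality-aware MoE
# feed-forward block plus a unified masked-signal loss, assuming a shared
# Transformer backbone produces per-token hidden states.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ModalityAwareMoE(nn.Module):
    """Route each token to an expert pool chosen by its modality (hypothetical design)."""

    def __init__(self, dim: int, experts_per_modality: int = 2, hidden: int = 2048):
        super().__init__()
        def make_expert():
            return nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
        self.vision_experts = nn.ModuleList([make_expert() for _ in range(experts_per_modality)])
        self.text_experts = nn.ModuleList([make_expert() for _ in range(experts_per_modality)])
        # One router per modality scores its experts; top-1 switching keeps the layer sparse.
        self.vision_router = nn.Linear(dim, experts_per_modality)
        self.text_router = nn.Linear(dim, experts_per_modality)

    def _dispatch(self, x, router, experts):
        gates = router(x).softmax(dim=-1)      # (num_tokens, num_experts)
        top_gate, top_idx = gates.max(dim=-1)  # top-1 expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(experts):
            sel = top_idx == i
            if sel.any():
                out[sel] = top_gate[sel].unsqueeze(-1) * expert(x[sel])
        return out

    def forward(self, tokens, is_vision):
        # tokens: (num_tokens, dim); is_vision: (num_tokens,) bool modality flags
        out = torch.zeros_like(tokens)
        if is_vision.any():
            out[is_vision] = self._dispatch(tokens[is_vision], self.vision_router, self.vision_experts)
        if (~is_vision).any():
            out[~is_vision] = self._dispatch(tokens[~is_vision], self.text_router, self.text_experts)
        return out


def masked_signal_loss(pixel_pred, pixel_target, pixel_mask,
                       token_logits, token_target, token_mask):
    """Unified objective: regress masked image pixels, predict masked text tokens."""
    image_loss = F.mse_loss(pixel_pred[pixel_mask], pixel_target[pixel_mask])
    text_loss = F.cross_entropy(token_logits[token_mask], token_target[token_mask])
    return image_loss + text_loss  # equal weighting is an assumption
```

In use, one would flatten the batch and sequence dimensions, build the `is_vision` mask from the token layout, and substitute this block for the dense FFN in each Transformer layer; the actual expert counts, routing rule, and loss weights in EVE may differ.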
- Junyi Chen
- Longteng Guo
- Jia Sun
- Shuai Shao
- Zehuan Yuan
- Liang Lin
- Dongyu Zhang