Overview of VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts
The paper presents VLMo, a unified vision-language pretrained model designed to handle both vision-language classification and image-text retrieval. VLMo introduces the Mixture-of-Modality-Experts (MoME) Transformer, which unifies the roles of a dual encoder and a fusion encoder within a single modular Transformer architecture: the model can be used as a dual encoder for efficient retrieval or as a fusion encoder for deep cross-modal interaction in classification tasks.
Core Contributions
- Mixture-of-Modality-Experts Transformer: The core innovation of VLMo is the MoME Transformer, which replaces the standard feed-forward network in each Transformer block with a pool of modality-specific experts (a vision expert, a language expert, and a vision-language expert) while sharing the self-attention layers across modalities. This design lets the model encode images and text separately or jointly, preserving modality-specific processing in the experts while aligning the modalities through the shared attention parameters (a minimal sketch of such a block follows this list).
- Stagewise Pre-Training Strategy: VLMo adopts a stagewise pre-training approach that draws on large image-only and text-only corpora in addition to image-text pairs. Leveraging the scale and diversity of uni-modal data improves the model's generalization across both modalities, a notable departure from relying solely on paired datasets (a simplified staging schedule is sketched below).
- Unified Pre-Training Tasks: VLMo is jointly pre-trained with image-text contrastive learning, masked language modeling, and image-text matching. This joint training encourages VLMo to capture the nuanced relationships between visual and linguistic content (see the loss sketch below).
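To make the block structure concrete, here is a minimal PyTorch sketch of a MoME-style Transformer block with pre-norm residual connections. The class and parameter names are illustrative and not taken from the official VLMo implementation.

```python
import torch
import torch.nn as nn

class MoMEBlock(nn.Module):
    """Sketch of a Mixture-of-Modality-Experts Transformer block."""
    def __init__(self, dim=768, num_heads=12, mlp_ratio=4):
        super().__init__()
        # Self-attention is shared across all modalities.
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # One feed-forward "expert" per input type: vision, language,
        # and vision-language (fused) tokens.
        self.norm2 = nn.LayerNorm(dim)
        self.experts = nn.ModuleDict({
            name: nn.Sequential(
                nn.Linear(dim, dim * mlp_ratio),
                nn.GELU(),
                nn.Linear(dim * mlp_ratio, dim),
            )
            for name in ("vision", "language", "vl")
        })

    def forward(self, x, modality):
        # Shared attention aligns visual and textual representations.
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        # The feed-forward expert is selected by the input modality.
        x = x + self.experts[modality](self.norm2(x))
        return x
```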
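The stagewise strategy can be pictured as a simple freezing schedule. The sketch below assumes blocks shaped like the `MoMEBlock` above; the helper `set_trainable` and the stage names are hypothetical, but the pattern follows the paper's description: vision pre-training first, then text pre-training with the attention weights frozen, then full vision-language pre-training.

```python
def set_trainable(module, flag):
    # Hypothetical helper: toggle gradient updates for a submodule.
    for p in module.parameters():
        p.requires_grad = flag

def configure_stage(blocks, stage):
    for blk in blocks:
        if stage == "vision":
            # Stage 1: train attention and the vision expert on image-only data.
            set_trainable(blk.attn, True)
            set_trainable(blk.experts["vision"], True)
            set_trainable(blk.experts["language"], False)
            set_trainable(blk.experts["vl"], False)
        elif stage == "language":
            # Stage 2: keep the attention frozen, train the language expert
            # on text-only data with masked language modeling.
            set_trainable(blk.attn, False)
            set_trainable(blk.experts["language"], True)
        else:  # "vision_language"
            # Stage 3: train the full model on image-text pairs.
            set_trainable(blk, True)
```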
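A hedged sketch of the joint objective follows: the three losses are computed on the same batch and summed. `model.encode_image`, `model.encode_text`, and `model.fuse` are assumed interfaces standing in for the dual-encoder and fusion-encoder paths, not the released VLMo API.

```python
import torch
import torch.nn.functional as F

def pretraining_step(model, images, text_ids, mlm_labels, itm_labels):
    # Image-text contrastive (ITC) loss on the dual-encoder outputs.
    img_emb = F.normalize(model.encode_image(images), dim=-1)
    txt_emb = F.normalize(model.encode_text(text_ids), dim=-1)
    logits = img_emb @ txt_emb.t() / model.temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    itc = (F.cross_entropy(logits, targets) +
           F.cross_entropy(logits.t(), targets)) / 2

    # Masked language modeling (MLM) and image-text matching (ITM)
    # use the fusion-encoder path over the fused image-text sequence.
    mlm_logits, itm_logits = model.fuse(images, text_ids)
    mlm = F.cross_entropy(mlm_logits.view(-1, mlm_logits.size(-1)),
                          mlm_labels.view(-1), ignore_index=-100)
    itm = F.cross_entropy(itm_logits, itm_labels)
    return itc + mlm + itm
```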
Experimental Outcomes
VLMo demonstrates strong performance on well-recognized vision-language tasks such as Visual Question Answering (VQA) and Natural Language for Visual Reasoning (NLVR2). For example, the large-size VLMo model outperformed prior models, achieving a VQA test-dev score of 82.88 and an NLVR2 test accuracy of 88.62. These results underscore the efficacy of combining the dual-encoder and fusion-encoder approaches within a single unified model.
In addition to the classification benchmarks, VLMo shows substantial gains on image-text retrieval tasks. Because the dual-encoder path encodes images and text independently, candidate embeddings can be precomputed and reused, making retrieval both accurate and fast compared with fusion-encoder models that must jointly re-encode every image-text pair.
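The efficiency argument can be illustrated with a short sketch: the gallery is encoded once offline, and each query reduces to a single similarity matrix multiplication. The `model.encode_*` methods below are assumed interfaces, consistent with the sketches above.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def retrieve_images(model, query_text_ids, gallery_images, top_k=5):
    # Encode the image gallery once; these embeddings can be cached offline.
    img_emb = F.normalize(model.encode_image(gallery_images), dim=-1)
    # Encode the text queries independently of the gallery.
    txt_emb = F.normalize(model.encode_text(query_text_ids), dim=-1)
    scores = txt_emb @ img_emb.t()              # cosine similarities
    return scores.topk(top_k, dim=-1).indices   # top-k gallery indices per query
```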
Implications and Future Directions
The VLMo framework has practical implications for AI applications that require joint processing of textual and visual information, such as multimedia search engines, content moderation tools, and interactive AI systems. The architectural ideas behind its modality integration may also inform more sophisticated multimodal pre-training models.
For future work, scaling up the model and applying VLMo to generation tasks such as image captioning appear promising. Another avenue is extending VLMo to modalities beyond vision and language, such as video or audio, toward a more broadly multimodal platform.
In conclusion, VLMo represents a significant advance in vision-language pre-training. Its combination of modality-specific experts with shared cross-modal attention offers an efficient, versatile approach to the diverse challenges of vision-language tasks.