An Empirical Study of Training End-to-End Vision-and-Language Transformers
The paper presents a comprehensive empirical study of how to train fully transformer-based Vision-and-Language (VL) models end to end, organized as a framework the authors call METER. The focus is on closing the performance gap that end-to-end, vision-transformer-based models typically show relative to region-feature-based approaches. Notably, METER reaches 77.64% accuracy on the VQAv2 test-std set with only 4 million images for pre-training, surpassing the previous region-feature-based state of the art by 1.04% and the previous best fully transformer-based model by 1.6%. Further scaling pushes accuracy to 80.54%.
Core Architecture and Methodology
METER's design is examined through systematic experimentation along several dimensions:
- Vision Encoders: Compares a range of vision transformers, including CLIP-ViT, Swin Transformer, DeiT, and BEiT.
- Text Encoders: Compares text encoders such as BERT, RoBERTa, and DeBERTa.
- Multimodal Fusion: Evaluates merged attention (a single transformer over the concatenated text and image tokens) versus co-attention (separate streams that cross-attend to each other) for integrating visual and textual features; see the sketch after this list.
- Architectural Design: Contrasts encoder-only and encoder-decoder structures.
- Pre-Training Objectives: Uses masked language modeling and image-text matching, and examines whether adding masked image modeling helps downstream performance.
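To make the fusion distinction concrete, below is a minimal PyTorch sketch of the two styles. This is not the authors' implementation; the hidden size, head count, single-layer depth, and placement of layer norms are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MergedAttentionBlock(nn.Module):
    """Merged attention: text and image tokens are concatenated and
    processed jointly by a single self-attention layer (shared parameters)."""
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, text, image):
        x = torch.cat([text, image], dim=1)            # (B, Lt + Lv, D)
        h = self.norm1(x)
        x = x + self.attn(h, h, h)[0]                  # joint self-attention
        x = x + self.ffn(self.norm2(x))
        return x[:, : text.size(1)], x[:, text.size(1):]

class CoAttentionBlock(nn.Module):
    """Co-attention: each modality has its own self-attention and additionally
    cross-attends to the other modality (separate parameters per modality).
    Layer norms are omitted for brevity."""
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.self_txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn_txt = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.ffn_img = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, text, image):
        text = text + self.self_txt(text, text, text)[0]
        image = image + self.self_img(image, image, image)[0]
        text = text + self.cross_txt(text, image, image)[0]    # text queries attend to image
        image = image + self.cross_img(image, text, text)[0]   # image queries attend to text
        return text + self.ffn_txt(text), image + self.ffn_img(image)
```

Several such blocks would be stacked on top of the vision and text encoders; the key difference the paper probes is that co-attention keeps separate parameters for each modality.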
Key Findings
The empirical results highlight several insights into the effective training of VL transformers:
- Vision Transformer Significance: The choice of vision transformer has a large impact on performance, with CLIP-ViT-224/16 and Swin Transformer proving notably effective.
- Fusion Mechanics: Co-attention outperforms merged attention in the studied configurations, suggesting that modality-specific parameters in the fusion module matter.
- Encoder-Only Architecture: Encoder-only models outperform encoder-decoder models on discriminative tasks such as VQA.
- Optimization Strategies: Assigning a smaller learning rate to the pre-trained encoders and a larger one to the newly initialized fusion layers improves training convergence (see the sketch below).
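A minimal PyTorch sketch of that optimizer setup follows. The attribute names (`text_encoder`, `vision_encoder`, `fusion`, `head`) and the learning-rate values are placeholders, not the paper's exact configuration.

```python
import itertools
import torch

def build_optimizer(model, base_lr=1e-5, new_layer_lr_mult=5.0, weight_decay=0.01):
    """Lower learning rate for the pre-trained backbones, higher rate for the
    randomly initialized fusion layers and task head.
    (Attribute names and hyperparameter values are illustrative.)"""
    pretrained_params = itertools.chain(
        model.text_encoder.parameters(),     # e.g. RoBERTa
        model.vision_encoder.parameters(),   # e.g. CLIP-ViT or Swin Transformer
    )
    new_params = itertools.chain(
        model.fusion.parameters(),           # co-attention fusion layers
        model.head.parameters(),             # task-specific head (e.g. VQA classifier)
    )
    param_groups = [
        {"params": list(pretrained_params), "lr": base_lr},
        {"params": list(new_params), "lr": base_lr * new_layer_lr_mult},
    ]
    return torch.optim.AdamW(param_groups, lr=base_lr, weight_decay=weight_decay)
```

In practice the multiplier and weight decay would be tuned per task.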
Practical and Theoretical Implications
The findings influence both the practical application and theoretical understanding of VL models:
- Efficient Pre-Training: Competitive performance is achievable with only 4M pre-training images, far fewer than many large-scale VL models require, which reduces the computational cost of pre-training.
- Task-Specific Optimization: Architecture should be aligned with the target task; for classification-style tasks, encoder-only models are the better fit (a sketch of such a head follows this list).
- Vision Versus Language Encoder Insights: Strong performance on vision-only or language-only benchmarks does not guarantee strong transfer to cross-modal tasks, emphasizing the distinct challenges of multimodal processing.
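As an illustration of the encoder-only, classification-style setup, here is a sketch of a discriminative head over the fused representations. The pooling strategy, layer sizes, and answer-vocabulary size are assumptions, not the paper's exact head.

```python
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    """Discriminative head on top of an encoder-only VL model: pool the fused
    text and image streams and classify (e.g. over a VQA answer vocabulary).
    Pooling choice and layer sizes are illustrative assumptions."""
    def __init__(self, dim=768, num_answers=3129):  # 3129 is a common VQAv2 answer-vocab size
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * dim, 2 * dim),
            nn.LayerNorm(2 * dim),
            nn.GELU(),
            nn.Linear(2 * dim, num_answers),
        )

    def forward(self, fused_text, fused_image):
        # Use the first token of each stream as its pooled representation.
        pooled = torch.cat([fused_text[:, 0], fused_image[:, 0]], dim=-1)
        return self.mlp(pooled)
```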
Future Directions in AI
The research suggests further exploration of model and data scaling, and extending the scope of VL tasks to generative settings. Broader directions, including multilingual data and alignment strategies for heterogeneous task requirements, could also build on this foundation.
Overall, this work charts a path for harnessing transformers in a VL context, delivering state-of-the-art performance alongside a clearer understanding of multimodal interactions.