An Empirical Study of Training End-to-End Vision-and-Language Transformers
The paper presents a comprehensive empirical study of how to train fully transformer-based Vision-and-Language (VL) models end to end, organized as a framework the authors call METER. The focus is on closing the performance gap that end-to-end, vision-transformer-based models typically show relative to region-feature-based approaches. Notably, METER reaches 77.64% accuracy on the VQAv2 test-std set with only 4 million images for pre-training, surpassing the previous region-feature-based state of the art by 1.04% and the previous best fully transformer-based model by 1.6%. Further scaling pushes accuracy to 80.54%.
Core Architecture and Methodology
METER's design is examined through systematic experimentation along several dimensions:
- Vision Encoders: Compares a range of vision transformers, including CLIP-ViT, Swin Transformer, DeiT, and BEiT.
- Text Encoders: Compares text encoders such as BERT, RoBERTa, and DeBERTa.
- Multimodal Fusion: Evaluates merged attention (a single transformer over the concatenated text and image tokens) versus co-attention (separate streams that cross-attend to each other) for integrating visual and textual features; see the sketch after this list.
- Architectural Design: Contrasts encoder-only and encoder-decoder structures.
- Pre-Training Objectives: Uses masked language modeling and image-text matching, and examines whether adding masked image modeling helps downstream performance.
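To make the fusion distinction concrete, below is a minimal PyTorch sketch of the two styles. This is not the authors' implementation; the hidden size, head count, single-layer depth, and placement of layer norms are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MergedAttentionBlock(nn.Module):
    """Merged attention: text and image tokens are concatenated and
    processed jointly by a single self-attention layer (shared parameters)."""
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, text, image):
        x = torch.cat([text, image], dim=1)            # (B, Lt + Lv, D)
        h = self.norm1(x)
        x = x + self.attn(h, h, h)[0]                  # joint self-attention
        x = x + self.ffn(self.norm2(x))
        return x[:, : text.size(1)], x[:, text.size(1):]

class CoAttentionBlock(nn.Module):
    """Co-attention: each modality has its own self-attention and additionally
    cross-attends to the other modality (separate parameters per modality).
    Layer norms are omitted for brevity."""
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.self_txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn_txt = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.ffn_img = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, text, image):
        text = text + self.self_txt(text, text, text)[0]
        image = image + self.self_img(image, image, image)[0]
        text = text + self.cross_txt(text, image, image)[0]    # text queries attend to image
        image = image + self.cross_img(image, text, text)[0]   # image queries attend to text
        return text + self.ffn_txt(text), image + self.ffn_img(image)
```

Several such blocks would be stacked on top of the vision and text encoders; the key difference the paper probes is that co-attention keeps separate parameters for each modality.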
Key Findings
The empirical results highlight several insights into the effective training of VL transformers:
- Vision Transformer Significance: The choice of vision transformer has a large impact on performance, with CLIP-ViT-224/16 and Swin Transformer proving notably effective.
- Fusion Mechanics: Co-attention outperforms merged attention in the studied configurations, suggesting that modality-specific parameters in the fusion module matter.
- Encoder-Only Architecture: Encoder-only models outperform encoder-decoder models on discriminative tasks such as VQA.
- Optimization Strategies: Assigning a smaller learning rate to the pre-trained encoders and a larger one to the newly initialized fusion layers improves training convergence (see the sketch below).
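A minimal PyTorch sketch of that optimizer setup follows. The attribute names (`text_encoder`, `vision_encoder`, `fusion`, `head`) and the learning-rate values are placeholders, not the paper's exact configuration.

```python
import itertools
import torch

def build_optimizer(model, base_lr=1e-5, new_layer_lr_mult=5.0, weight_decay=0.01):
    """Lower learning rate for the pre-trained backbones, higher rate for the
    randomly initialized fusion layers and task head.
    (Attribute names and hyperparameter values are illustrative.)"""
    pretrained_params = itertools.chain(
        model.text_encoder.parameters(),     # e.g. RoBERTa
        model.vision_encoder.parameters(),   # e.g. CLIP-ViT or Swin Transformer
    )
    new_params = itertools.chain(
        model.fusion.parameters(),           # co-attention fusion layers
        model.head.parameters(),             # task-specific head (e.g. VQA classifier)
    )
    param_groups = [
        {"params": list(pretrained_params), "lr": base_lr},
        {"params": list(new_params), "lr": base_lr * new_layer_lr_mult},
    ]
    return torch.optim.AdamW(param_groups, lr=base_lr, weight_decay=weight_decay)
```

In practice the multiplier and weight decay would be tuned per task.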
Practical and Theoretical Implications
The findings influence both the practical application and theoretical understanding of VL models:
- Efficient Pre-Training: Competitive performance is achievable with only 4M pre-training images, far fewer than many large-scale VL models require, which reduces the computational cost of pre-training.
- Task-Specific Optimization: Architecture should be aligned with the target task; for classification-style tasks, encoder-only models are the better fit (a sketch of such a head follows this list).
- Vision Versus Language Encoder Insights: Strong performance on vision-only or language-only benchmarks does not guarantee strong transfer to cross-modal tasks, emphasizing the distinct challenges of multimodal processing.
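As an illustration of the encoder-only, classification-style setup, here is a sketch of a discriminative head over the fused representations. The pooling strategy, layer sizes, and answer-vocabulary size are assumptions, not the paper's exact head.

```python
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    """Discriminative head on top of an encoder-only VL model: pool the fused
    text and image streams and classify (e.g. over a VQA answer vocabulary).
    Pooling choice and layer sizes are illustrative assumptions."""
    def __init__(self, dim=768, num_answers=3129):  # 3129 is a common VQAv2 answer-vocab size
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * dim, 2 * dim),
            nn.LayerNorm(2 * dim),
            nn.GELU(),
            nn.Linear(2 * dim, num_answers),
        )

    def forward(self, fused_text, fused_image):
        # Use the first token of each stream as its pooled representation.
        pooled = torch.cat([fused_text[:, 0], fused_image[:, 0]], dim=-1)
        return self.mlp(pooled)
```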
Future Directions in AI
The research suggests further exploration of model and data scaling, and extending the scope of VL tasks to generative settings. Broader directions, including multilingual data and alignment strategies for heterogeneous task requirements, could also build on this foundation.
Overall, this work charts a path for harnessing transformers in a VL context, delivering state-of-the-art performance alongside a clearer understanding of multimodal interactions.