Align before Fuse: Vision and Language Representation Learning with Momentum Distillation
The paper introduces a vision-language pre-training (VLP) framework named "Align before Fuse" (ALBEF), which addresses two limitations of earlier VLP approaches: image and text features are fed to the multimodal encoder without first being aligned, and web-crawled image-text pairs provide noisy supervision. The key innovation of ALBEF is a contrastive loss that aligns image and text representations before they are fused through cross-modal attention, combined with momentum distillation to improve representation learning from noisy data.
Key Contributions
ALBEF Framework
The ALBEF model consists of an image encoder based on ViT-B/16, a text encoder initialized with the first six layers of BERT_base, and a multimodal encoder initialized with the last six layers of BERT_base, in which cross-attention layers let text features attend to image features. Aligning the unimodal representations before fusion makes the subsequent learning of image-text interactions more grounded; a structural sketch follows.
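To make the structure concrete, here is a minimal PyTorch-style sketch of the forward pass. This is not the authors' implementation: the encoder modules are assumed to be supplied externally (stand-ins for ViT-B/16, BERT_base layers 1-6, and BERT_base layers 7-12 with cross-attention), and the names (`ALBEFSketch`, `vision_proj`, `text_proj`) are illustrative. The projection to a lower-dimensional normalized space (256-d in the paper) is used by the contrastive loss.

```python
# Structural sketch of the ALBEF forward pass (illustrative, not the official code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ALBEFSketch(nn.Module):
    def __init__(self, vision_encoder, text_layers, fusion_layers, embed_dim=256):
        super().__init__()
        self.visual_encoder = vision_encoder      # ViT-B/16 -> patch embeddings
        self.text_encoder = text_layers           # BERT_base layers 1-6
        self.multimodal_encoder = fusion_layers   # BERT_base layers 7-12 + cross-attention
        # projection heads producing the low-dimensional features used by the ITC loss
        self.vision_proj = nn.Linear(768, embed_dim)
        self.text_proj = nn.Linear(768, embed_dim)

    def forward(self, image, text_ids, text_mask):
        img_embeds = self.visual_encoder(image)               # (B, N_patches + 1, 768)
        txt_embeds = self.text_encoder(text_ids, text_mask)   # (B, L, 768)

        # unimodal [CLS] features, projected and L2-normalized for the contrastive loss
        img_feat = F.normalize(self.vision_proj(img_embeds[:, 0]), dim=-1)
        txt_feat = F.normalize(self.text_proj(txt_embeds[:, 0]), dim=-1)

        # fusion: text features attend to image features via cross-attention
        fused = self.multimodal_encoder(txt_embeds, context=img_embeds, mask=text_mask)
        return img_feat, txt_feat, fused
```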
Intermediate Image-Text Contrastive Loss
ALBEF introduces an intermediate image-text contrastive (ITC) loss on the representations produced by the unimodal encoders. This loss serves three purposes (a simplified implementation is sketched after the list):
- Aligns visual and textual features, facilitating subsequent cross-modal learning.
- Enhances the understanding of image and text semantics independently.
- Learns a common low-dimensional embedding space for images and texts, which enables the image-text matching (ITM) objective to find more informative samples through contrastive hard negative mining.
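As referenced above, the following is a simplified in-batch version of the ITC loss, assuming the normalized `img_feat` and `txt_feat` projections from the earlier sketch. ALBEF additionally draws negatives from a momentum queue and uses soft targets from the momentum model; both are omitted here for clarity.

```python
# Simplified in-batch image-text contrastive (ITC) loss.
import torch
import torch.nn.functional as F

def itc_loss(img_feat, txt_feat, temperature=0.07):
    # similarity matrix of shape (B, B); entry (i, j) compares image i with text j
    sim = img_feat @ txt_feat.t() / temperature
    # matched image-text pairs lie on the diagonal
    targets = torch.arange(sim.size(0), device=sim.device)
    loss_i2t = F.cross_entropy(sim, targets)      # image-to-text direction
    loss_t2i = F.cross_entropy(sim.t(), targets)  # text-to-image direction
    return (loss_i2t + loss_t2i) / 2
```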
Momentum Distillation
To handle noisy supervision from web data, the authors propose momentum distillation (MoD). A momentum model, maintained as an exponential moving average (EMA) of the base model's parameters, generates soft pseudo-targets that serve as additional training signals. Training against these soft targets keeps the model from being penalized for producing reasonable outputs that differ from the noisy web annotation, which improves generalization. A sketch of the EMA update and the distilled loss follows.
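The sketch below shows the two ingredients of MoD under stated assumptions: the EMA parameter update and a loss that mixes the hard (one-hot) target with the momentum model's soft distribution. The defaults (m = 0.995, alpha = 0.4) follow the paper's reported settings; the function names are illustrative.

```python
# Momentum distillation sketch: EMA update plus a distilled objective.
import copy
import torch
import torch.nn.functional as F

# initialization (done once): momentum_model = copy.deepcopy(model)

@torch.no_grad()
def ema_update(model, momentum_model, m=0.995):
    # momentum params <- m * momentum params + (1 - m) * online params
    for p, p_m in zip(model.parameters(), momentum_model.parameters()):
        p_m.data.mul_(m).add_(p.data, alpha=1 - m)

def distilled_loss(logits, logits_momentum, hard_targets, alpha=0.4):
    # cross-entropy against the (possibly noisy) one-hot web target
    ce = F.cross_entropy(logits, hard_targets)
    # KL-style term pulling the prediction toward the momentum model's soft pseudo-targets
    soft = F.softmax(logits_momentum.detach(), dim=-1)
    kd = -(soft * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()
    return (1 - alpha) * ce + alpha * kd
```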
Theoretical Underpinning
The paper provides a mutual information maximization perspective, showing that the ITC and masked language modeling (MLM) losses in ALBEF maximize a lower bound on the mutual information between different views of an image-text pair. Under this interpretation, momentum distillation generates new, semantically similar views, so training enforces invariance to semantic-preserving transformations. The standard InfoNCE-style bound underlying this view is reproduced below.
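For reference, this is the standard InfoNCE bound (van den Oord et al.) that one direction of the ITC loss instantiates. The notation here is ours, not the paper's: a and b are the two views (image and text), s is the scaled similarity score, and \hat{B} is the candidate set containing the positive and the negatives.

```latex
% One direction (e.g., image-to-text) of an InfoNCE-style contrastive objective.
% s(a,b): similarity score (cosine similarity over a temperature in ALBEF);
% \hat{B}: the positive b together with the sampled negatives.
\mathcal{L}_{\mathrm{NCE}}
  = -\,\mathbb{E}\!\left[
      \log \frac{\exp\big(s(a,b)\big)}
                {\sum_{\hat{b} \in \hat{B}} \exp\big(s(a,\hat{b})\big)}
    \right],
\qquad
I(a;b) \;\ge\; \log|\hat{B}| - \mathcal{L}_{\mathrm{NCE}}.
```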
Numerical Results
ALBEF showcases substantial numerical improvements across multiple vision-language benchmarks:
- On image-text retrieval, ALBEF outperforms methods such as CLIP and ALIGN that are pre-trained on orders-of-magnitude larger datasets.
- It achieves absolute improvements of 2.37% on Visual Question Answering (VQA) and 3.84% on Natural Language for Visual Reasoning (NLVR2) over the previous state of the art (VILLA).
- ALBEF also enjoys faster inference, largely because it does not depend on a pre-trained object detector to extract region features.
Implications and Future Directions
The practical implications of this research are extensive. By not requiring bounding box annotations or high-resolution images, ALBEF simplifies the pre-training process while simultaneously achieving superior performance. This methodological advancement opens avenues for more efficient and scalable VLP models that can leverage vast yet noisy datasets available on the web.
The utilization of momentum distillation for generating pseudo-targets suggests potential advancements in self-supervised learning frameworks. Future research could explore further refinements to the MoD approach or extend its application to other domains within multi-modal learning frameworks.
Conclusion
The ALBEF framework represents a significant step forward in vision-language representation learning by introducing an effective alignment mechanism before fusion and leveraging momentum distillation to handle noisy annotations. This results in superior and more efficient models, providing a solid foundation for future explorations and enhancements in multi-modal representation learning.
The proposed methods, validated through strong numerical results, could inspire future developments in the domain, particularly in addressing the scalability and efficiency challenges posed by large-scale, noisy pre-training data.