Advanced Baselines for Vision-Language Pre-training
The paper "Improved baselines for vision-language pre-training" explores the nuances and enhancements in the landscape of vision-language pre-training (VLP), particularly focusing on contrastive and non-contrastive learning methodologies. The key emphasis of this paper is the optimization of the CLIP (Contrastive Language–Image Pretraining), a leading framework in this domain, by integrating advanced training techniques and particularly analyzing the role of non-contrastive learning approaches in improving model performance.
The authors introduce four VLP baselines, SiamLIP, BYOLIP, BarLIP, and SwALIP, which translate non-contrastive losses from self-supervised learning (SSL) into the multimodal domain and evaluate whether they enhance CLIP's performance. These models adapt consistency-based, redundancy-reduction, and clustering-based objectives from successful SSL frameworks such as SimSiam, BYOL, Barlow Twins, and SwAV. Despite this methodological breadth, the gains from the alternative loss functions were marginal and largely disappeared once standard CLIP was trained with a stronger recipe.
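To make the idea of adapting an SSL objective across modalities concrete, the sketch below applies a Barlow Twins-style redundancy-reduction loss to image and text embeddings, roughly in the spirit of BarLIP: the cross-correlation matrix between the two modalities is pushed toward the identity. The function name, normalization details, and off-diagonal weight are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def cross_modal_barlow_loss(image_emb, text_emb, lambda_offdiag=5e-3):
    """Hypothetical BarLIP-like loss: decorrelate features across modalities.

    image_emb, text_emb: (N, D) projected embeddings for N image-text pairs.
    lambda_offdiag weights the redundancy-reduction term (assumed value).
    """
    n = image_emb.size(0)
    # Standardize each feature dimension over the batch.
    z_img = (image_emb - image_emb.mean(0)) / (image_emb.std(0) + 1e-6)
    z_txt = (text_emb - text_emb.mean(0)) / (text_emb.std(0) + 1e-6)

    # (D, D) cross-correlation matrix between image and text features.
    c = z_img.t() @ z_txt / n

    on_diag = (torch.diagonal(c) - 1).pow(2).sum()               # pull diagonal to 1
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()  # push off-diagonal to 0
    return on_diag + lambda_offdiag * off_diag
```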
A major component of the investigation is the refinement of CLIP's training recipe. The researchers curated a regimen that combines strong data augmentations, enhanced projector networks, label smoothing, and regularization techniques. This recipe, dubbed CLIP2, significantly elevated model performance, achieving up to an 11% improvement in zero-shot image classification on ImageNet and surpassing prior VLP methods that relied on more complex CLIP modifications and non-contrastive losses. Notably, applying these well-established training techniques yielded simpler models that were also more efficient than their complex counterparts.
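Two of these ingredients translate directly into code: an MLP projector on top of each encoder and label smoothing applied to the contrastive pairing targets. The sketch below illustrates both under assumed dimensions and hyperparameters; the class and function names, layer sizes, temperature, and smoothing value are not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    """Simple MLP projector on top of an encoder (illustrative sizes)."""
    def __init__(self, in_dim=768, hidden_dim=2048, out_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, x):
        return self.net(x)

def smoothed_clip_loss(image_emb, text_emb, temperature=0.07, smoothing=0.1):
    """CLIP-style loss with label smoothing on the pairing targets (sketch)."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets, label_smoothing=smoothing)
    loss_t2i = F.cross_entropy(logits.t(), targets, label_smoothing=smoothing)
    return 0.5 * (loss_i2t + loss_t2i)
```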
Specifically, the paper outlines several key findings:
- While the non-contrastive baselines provided some uplift over a basic CLIP implementation, that advantage dissipated once the improved training recipe was used.
- The numerical results underscore the value of advanced training techniques: CLIP2 achieved a notable boost in zero-shot ImageNet classification accuracy, showing that much of the performance gap stems from the training recipe rather than the loss function.
- The approach also scaled well across dataset sizes, outperforming more complex frameworks on large-scale datasets such as Open29M and YFCC15M.
The theoretical implications are considerable: the analysis clarifies where efficiency and effectiveness gaps in VLP originate, and it argues for carefully tuned baselines and robust training-recipe design over intricate model extensions. Practically, these findings suggest that VLP research should first optimize and simplify existing methodologies before incorporating experimental components.
Looking forward, this research suggests an informed approach for future development: prioritize strong, broadly applicable training recipes, evaluate well-tuned baseline models first, and incorporate additional learning strategies only when their benefit is justified. As large-scale models and datasets become increasingly prevalent in AI and machine learning, solid strategies for improving model performance remain paramount for researchers striving toward computational efficiency and robust multimodal learning.