- The paper introduces the CAT filtering strategy that selectively retains high-quality image-text pairs to reduce dataset noise in pre-training pipelines.
- The paper proposes concept distillation to efficiently incorporate pre-trained unimodal representations for enhanced vision-language alignment.
- The paper applies hard-negative noise contrastive estimation to boost model discrimination and achieve superior zero-shot task performance.
Filtering, Distillation, and Hard Negatives for Vision-Language Pre-Training
The paper presents methodologies for enhancing the contrastive pre-training pipeline of vision-language models (VLMs). Contrastive learning is increasingly favored for zero-shot vision-language tasks due to its scalability and effectiveness. However, these models are typically trained on large, noisy datasets that degrade their performance. The authors address three pivotal aspects: dataset noise, model initialization, and the contrastive learning objective.
Key Contributions
- Complexity, Action, and Text-spotting (CAT) Filtering: This filtering strategy is proposed to reduce dataset noise. It selectively retains informative image-text pairs based on caption complexity, the presence of actions, and text spotted in images, which helps eliminate uninformative pairs prevalent in large datasets such as LAION-2B. The paper shows that the dataset can be shrunk substantially while maintaining or improving accuracy across a range of zero-shot tasks (a filtering sketch follows this list).
- Concept Distillation (CD): The paper proposes leveraging pre-trained unimodal representations by distilling concepts (objects and attributes) into dual-encoder models. This approach retains the advantages of strong pre-trained vision models without increasing training complexity. Linear classifiers trained on top of frozen pre-trained encoders produce soft object and attribute targets that are distilled into the dual encoder, strengthening vision-language alignment (see the distillation sketch after this list).
- Hard-negative Noise Contrastive Estimation (HN-NCE): The authors modify the classical InfoNCE loss by up-weighting hard negatives via model-based importance sampling. This adjusts the contribution of difficult negatives within a batch, enhancing the model's discriminative ability without adding computational overhead (a loss sketch follows this list).
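As a rough illustration of the CAT idea, the hedged sketch below filters image-text pairs using a caption-complexity proxy, the presence of an action verb, and overlap between the caption and text spotted in the image. The paper relies on dedicated caption parsers and an OCR model; here spaCy and a precomputed `ocr_text` string stand in, and all thresholds are illustrative rather than the paper's settings.

```python
# Hedged sketch of CAT-style filtering. spaCy approximates caption analysis;
# the OCR output is assumed to be precomputed elsewhere. Thresholds are illustrative.
import spacy

nlp = spacy.load("en_core_web_sm")

def keep_pair(caption: str, ocr_text: str,
              min_objects: int = 1,
              require_action: bool = True,
              max_ocr_overlap: float = 0.5) -> bool:
    doc = nlp(caption)

    # Complexity proxy: number of noun phrases in the caption.
    n_objects = sum(1 for _ in doc.noun_chunks)

    # Action proxy: the caption contains at least one verb.
    has_action = any(tok.pos_ == "VERB" for tok in doc)

    # Text-spotting proxy: fraction of caption words that also appear in the
    # text spotted inside the image (high overlap suggests the caption merely
    # transcribes rendered text rather than describing the scene).
    cap_words = {tok.lower_ for tok in doc if tok.is_alpha}
    ocr_words = set(ocr_text.lower().split())
    overlap = len(cap_words & ocr_words) / max(len(cap_words), 1)

    return (n_objects >= min_objects
            and (has_action or not require_action)
            and overlap <= max_ocr_overlap)
```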
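Concept distillation can be sketched as follows, assuming a frozen teacher encoder with a pre-trained linear concept head and a student image tower with its own prediction head. The soft cross-entropy here is a simplified stand-in for the paper's distillation objective, and the names `teacher`, `object_head`, `student`, and `student_head` are placeholders, not the paper's API.

```python
# Minimal concept-distillation sketch: the frozen teacher produces soft
# object/attribute targets, and the student image tower is trained to match them.
import torch
import torch.nn.functional as F

@torch.no_grad()
def teacher_concept_targets(teacher, object_head, image):
    feats = teacher(image)                        # frozen pre-trained features
    return F.softmax(object_head(feats), dim=-1)  # soft concept labels

def concept_distillation_loss(student, student_head, teacher, object_head, image):
    targets = teacher_concept_targets(teacher, object_head, image)
    logits = student_head(student(image))         # student predicts the same concepts
    # Soft cross-entropy between student predictions and teacher soft labels.
    return -(targets * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()
```

In practice this term would be added to the contrastive loss, so the dual encoder learns alignment and inherits concept knowledge from the frozen teacher at the same time.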
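The HN-NCE objective can be sketched as a contrastive loss in which harder (more similar) negatives receive larger weights. The version below covers only the image-to-text direction, whereas the paper applies the loss symmetrically in both directions; `beta` and `alpha` are illustrative hyper-parameters and the exact weighting details follow the paper.

```python
# Hedged sketch of a hard-negative-weighted contrastive loss in the spirit of HN-NCE.
import torch

def hn_nce_loss(image_emb, text_emb, tau=0.07, beta=0.5, alpha=1.0):
    # image_emb, text_emb: (N, D) L2-normalized embeddings of matched pairs.
    sim = image_emb @ text_emb.t() / tau          # (N, N) similarity matrix
    pos = sim.diag()                              # positives on the diagonal

    # Importance weights for negatives: sharper for harder negatives,
    # rescaled so they average to 1 across each row; positives are excluded.
    neg_mask = ~torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    w = torch.exp(beta * sim) * neg_mask
    w = w / w.sum(dim=1, keepdim=True) * neg_mask.sum(dim=1, keepdim=True)

    # Standard InfoNCE denominator, with negatives re-weighted by w.
    denom = alpha * pos.exp() + (w * sim.exp() * neg_mask).sum(dim=1)
    return -(pos - denom.log()).mean()
```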
Results and Implications
The combined pipeline improves performance on 20 of 29 benchmark tasks compared to baseline models, while significantly reducing training data through the filtered LAION-CAT subset. These results suggest that VLMs can be scaled more efficiently and point toward faster progress in few-shot and zero-shot scenarios.
Methodologically, these results suggest that multimodal models can be improved substantially by refining data preprocessing, strategically reusing pre-trained models, and employing stronger training objectives. Practically, such gains matter most for applications that involve minimal task-specific fine-tuning, pointing toward more general models that adapt to new tasks without additional labeled data.
Prospects for Future Research
The findings invite several avenues for future exploration, notably the adaptation of these methods to more complex architectures, such as encoder-decoder systems, which often yield stronger zero-shot capabilities. How well the techniques scale to datasets with extreme noise levels also warrants further investigation. Future work might extend these filtering and training techniques to other domains, reflecting the ongoing evolution of AI models toward more versatile and resource-efficient paradigms.
The paper offers a practical toolkit for the community aiming to push the boundaries of contrastive learning in vision-language models, making significant contributions to both the methodological and practical sides of VLM research.