Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm (2110.05208v2)

Published 11 Oct 2021 in cs.CV

Abstract: Recently, large-scale Contrastive Language-Image Pre-training (CLIP) has attracted unprecedented attention for its impressive zero-shot recognition ability and excellent transferability to downstream tasks. However, CLIP is quite data-hungry and requires 400M image-text pairs for pre-training, thereby restricting its adoption. This work proposes a novel training paradigm, Data efficient CLIP (DeCLIP), to alleviate this limitation. We demonstrate that by carefully utilizing the widespread supervision among the image-text pairs, our DeCLIP can learn generic visual features more efficiently. Instead of using the single image-text contrastive supervision, we fully exploit data potential through the use of (1) self-supervision within each modality; (2) multi-view supervision across modalities; (3) nearest-neighbor supervision from other similar pairs. Benefiting from intrinsic supervision, our DeCLIP-ResNet50 can achieve 60.4% zero-shot top-1 accuracy on ImageNet, which is 0.8% above CLIP-ResNet50 while using 7.1× fewer data. Our DeCLIP-ResNet50 outperforms its counterpart in 8 out of 11 visual datasets when transferred to downstream tasks. Moreover, scaling up the model and computing also works well in our framework. Our code, dataset and models are released at: https://github.com/Sense-GVT/DeCLIP

Overview of DeCLIP

The paper "Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm" presents an innovative framework, termed DeCLIP, which aims to enhance the data efficiency of Contrastive Language-Image Pre-training (CLIP). The primary motivation behind this work is the substantial data requirement of CLIP models, which typically rely on vast datasets of image-text pairs to achieve high zero-shot accuracy and transferability. DeCLIP introduces multiple supervisory strategies to improve data efficiency, utilizing the inherent structure within image-text pairs to achieve comparable performance with significantly less data.

Key Contributions

The authors propose three supervisory mechanisms that exploit signals already present in image-text data (a hedged sketch of how these losses could be combined appears after the list):

  1. Self-Supervision (SS): DeCLIP incorporates self-supervision within each modality, applying techniques like SimSiam for images and masked language modeling for text to extract meaningful representations independently.
  2. Multi-View Supervision (MVS): By creating multiple augmented views of images and texts, DeCLIP expands the dataset through diverse permutations of image-text pairs. This contributes to more robust feature learning by capturing invariances and structural nuances.
  3. Nearest-Neighbor Supervision (NNS): This method involves utilizing semantically similar image-text pairs as additional sources of supervision, seeking nearest neighbors in the feature embedding space to provide enriched training signals.
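
The following PyTorch sketch illustrates how these three signals could be combined in one training step. It is a simplified assumption-laden reconstruction, not the authors' implementation: `image_enc`, `text_enc`, `predictor`, `mlm_loss`, the FIFO `queue`, and the loss weights are all illustrative placeholders.

```python
# Hypothetical sketch of DeCLIP-style loss composition (PyTorch).
import torch
import torch.nn.functional as F

def clip_loss(img_emb, txt_emb, temperature=0.07):
    """Standard InfoNCE over a batch: matching image/text pairs are positives."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(len(img_emb), device=img_emb.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

def simsiam_loss(p, z):
    """Negative cosine similarity with stop-gradient on the target branch."""
    return -F.cosine_similarity(p, z.detach(), dim=-1).mean()

def nn_loss(img_emb, txt_emb, queue, temperature=0.07):
    """Nearest-neighbor supervision: pair each image with the nearest neighbor
    of its caption embedding drawn from a queue of past text embeddings."""
    txt_emb = F.normalize(txt_emb, dim=-1)
    queue = F.normalize(queue, dim=-1)
    nn_idx = (txt_emb @ queue.t()).argmax(dim=-1)
    return clip_loss(img_emb, queue[nn_idx], temperature)

def declip_step(img_v1, img_v2, txt_v1, txt_v2,
                image_enc, text_enc, predictor, queue, mlm_loss):
    # (1) Self-supervision within each modality:
    #     SimSiam-style loss between two image views, masked LM loss on text.
    z1, z2 = image_enc(img_v1), image_enc(img_v2)
    l_iss = (simsiam_loss(predictor(z1), z2) + simsiam_loss(predictor(z2), z1)) / 2
    l_tss = mlm_loss(txt_v1)

    # (2) Multi-view supervision: contrast every image view with every text view.
    t1, t2 = text_enc(txt_v1), text_enc(txt_v2)
    l_mvs = (clip_loss(z1, t1) + clip_loss(z1, t2) +
             clip_loss(z2, t1) + clip_loss(z2, t2)) / 4

    # (3) Nearest-neighbor supervision from a queue of recent text embeddings.
    l_nns = (nn_loss(z1, t1, queue) + nn_loss(z2, t1, queue)) / 2

    # Loss weights are illustrative; the paper tunes its own coefficients.
    return l_mvs + 0.2 * (l_iss + l_tss) + 0.2 * l_nns
```

In this reading, the multi-view term is the standard CLIP objective applied to augmented views, while the self-supervised and nearest-neighbor terms act as auxiliary losses that extract extra signal from the same batch of pairs.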

Experimental Results

DeCLIP demonstrates strong zero-shot capabilities with a significant reduction in required data volume. With a ResNet50 image encoder, DeCLIP reaches 60.4% zero-shot top-1 accuracy on ImageNet using 56 million image-text pairs, 7.1× less data than the 400 million pairs used to train the original CLIP, and 0.8% higher than CLIP-ResNet50. When transferred to downstream tasks, DeCLIP-ResNet50 outperforms its CLIP counterpart on 8 of 11 visual datasets, indicating superior data efficiency and transferability.
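
The zero-shot numbers above are obtained without any ImageNet training labels: class names are turned into text prompts, encoded once, and each image is assigned the class whose text embedding is most similar. The sketch below illustrates this evaluation protocol under stated assumptions; `image_encoder`, `text_encoder`, `tokenize`, and the single prompt template are placeholders rather than the exact DeCLIP evaluation code.

```python
# Hedged sketch of CLIP-style zero-shot top-1 evaluation (PyTorch).
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_accuracy(image_encoder, text_encoder, tokenize, loader, class_names):
    # Build one text embedding per class from a simple prompt template.
    prompts = tokenize([f"a photo of a {name}" for name in class_names])
    class_emb = F.normalize(text_encoder(prompts), dim=-1)

    correct = total = 0
    for images, labels in loader:
        img_emb = F.normalize(image_encoder(images), dim=-1)
        preds = (img_emb @ class_emb.t()).argmax(dim=-1)  # nearest class in embedding space
        correct += (preds == labels).sum().item()
        total += labels.numel()
    return correct / total
```

CLIP-style evaluations often ensemble several prompt templates per class; a single template is used here for brevity.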

Implications and Future Directions

The approach outlined in DeCLIP signals a pivotal step towards more sustainable AI model training. Reducing data requirements without sacrificing performance could broaden the accessibility and applicability of advanced pre-trained models in resource-constrained environments. The successful integration of self-supervision and multi-view/nearest-neighbor strategies in a contrastive paradigm may inspire new architectures and training regimes beyond vision-language tasks.

Future research could investigate the extension of DeCLIP to other modalities, such as acoustic signals, further sharpening AI's multi-modal capabilities. Additionally, exploring more refined or automated data augmentation strategies to enhance MVS could unlock additional performance gains, potentially establishing a new standard in model pre-training methodologies.

In summary, DeCLIP presents a compelling case for leveraging intrinsic supervisory signals, advocating for more intelligent data utilization in pre-trained models. The framework not only advances technical boundaries in AI but also fosters more accessible and environmentally conscious AI development practices.

Authors (8)
  1. Yangguang Li (44 papers)
  2. Feng Liang (61 papers)
  3. Lichen Zhao (5 papers)
  4. Yufeng Cui (12 papers)
  5. Wanli Ouyang (358 papers)
  6. Jing Shao (109 papers)
  7. Fengwei Yu (23 papers)
  8. Junjie Yan (109 papers)
Citations (392)