
Self-supervised Pretraining of Visual Features in the Wild (2103.01988v2)

Published 2 Mar 2021 in cs.CV and cs.AI

Abstract: Recently, self-supervised learning methods like MoCo, SimCLR, BYOL and SwAV have reduced the gap with supervised methods. These results have been achieved in a control environment, that is the highly curated ImageNet dataset. However, the premise of self-supervised learning is that it can learn from any random image and from any unbounded dataset. In this work, we explore if self-supervision lives to its expectation by training large models on random, uncurated images with no supervision. Our final SElf-supERvised (SEER) model, a RegNetY with 1.3B parameters trained on 1B random images with 512 GPUs achieves 84.2% top-1 accuracy, surpassing the best self-supervised pretrained model by 1% and confirming that self-supervised learning works in a real world setting. Interestingly, we also observe that self-supervised models are good few-shot learners achieving 77.9% top-1 with access to only 10% of ImageNet. Code: https://github.com/facebookresearch/vissl

Self-supervised Pretraining of Visual Features in the Wild

This paper tackles the challenge of training visual models with self-supervised methods without relying on curated datasets. The researchers investigate the efficacy of self-supervised learning (SSL) in a real-world setting by pretraining large models on uncurated data, asking whether such models can match the performance of models pretrained on curated datasets like ImageNet.

Approach and Methodology

The authors focus on the RegNet architecture, in particular RegNetY models, and use the SwAV self-supervised learning method to pretrain on 1 billion images. Training the 1.3B-parameter RegNetY, named SEER, is distributed across 512 GPUs to keep SSL efficient and tractable at this scale. SwAV, a clustering-based method, is central to the approach: it encourages consistent cluster assignments across different augmented views of the same image, pushing the model toward semantically meaningful representations.
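
As a rough illustration of the swapped-prediction idea behind SwAV (a minimal sketch, not the authors' implementation; the plain softmax standing in for Sinkhorn-Knopp, the temperature, and the function name are assumptions made here for brevity):

```python
import torch
import torch.nn.functional as F

def swav_swapped_loss(z1, z2, prototypes, temperature=0.1):
    """Sketch of a SwAV-style swapped-prediction loss.

    z1, z2     : L2-normalized embeddings of two augmented views, shape (B, D)
    prototypes : learnable prototype matrix, shape (K, D)
    """
    # Similarity of each view to the K prototypes
    scores1 = z1 @ prototypes.t()          # (B, K)
    scores2 = z2 @ prototypes.t()

    # "Codes" (soft cluster assignments). SwAV obtains these with a
    # Sinkhorn-Knopp balancing step; a plain softmax stands in for it here.
    with torch.no_grad():
        q1 = F.softmax(scores1 / temperature, dim=1)
        q2 = F.softmax(scores2 / temperature, dim=1)

    # Swapped prediction: predict view 2's code from view 1 and vice versa.
    p1 = F.log_softmax(scores1 / temperature, dim=1)
    p2 = F.log_softmax(scores2 / temperature, dim=1)
    return -0.5 * ((q2 * p1).sum(dim=1).mean() + (q1 * p2).sum(dim=1).mean())
```

The actual SwAV objective additionally enforces balanced assignments through the Sinkhorn-Knopp normalization and can use a queue of past embeddings when batches are small; the sketch only conveys the swapped-prediction structure.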

Key to this methodology is a gradient checkpointing strategy paired with mixed-precision training, which relieves memory and compute constraints when scaling up model capacity. The pretraining data is a billion-scale collection of random Instagram images, deliberately left without any manual curation in order to test SSL's performance in an unconstrained environment.
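
A minimal PyTorch sketch of how gradient checkpointing and mixed precision can be combined (the toy model, segment count, and optimizer settings are illustrative assumptions, not the paper's training setup):

```python
import torch
from torch.utils.checkpoint import checkpoint_sequential

# Illustrative stand-in for a large trunk split into checkpointable stages.
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 64, 3, padding=1), torch.nn.ReLU(),
    torch.nn.Conv2d(64, 64, 3, padding=1), torch.nn.ReLU(),
    torch.nn.AdaptiveAvgPool2d(1), torch.nn.Flatten(),
    torch.nn.Linear(64, 10),
).cuda()

optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler()  # loss scaling for mixed precision

def train_step(images, targets, num_segments=2):
    # images and targets are assumed to already live on the same GPU.
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():  # half-precision forward pass
        # Recompute activations segment by segment during backward
        # instead of storing them, trading compute for memory.
        logits = checkpoint_sequential(model, num_segments, images,
                                       use_reentrant=False)
        loss = torch.nn.functional.cross_entropy(logits, targets)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    return loss.item()
```

Checkpointing trades extra forward compute for a much smaller activation footprint, which is what makes trunks in the billion-parameter range fit within per-GPU memory budgets.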

Main Results

The SEER model reaches 84.2% top-1 accuracy on ImageNet, surpassing the previous best self-supervised pretrained model by 1%. The paper also highlights the model's strength as a few-shot learner: with access to only 10% of ImageNet, it reaches 77.9% top-1 accuracy, showcasing SSL's potential for label-efficient learning.
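
For readers who want a feel for what such a low-shot evaluation looks like in practice, here is a hedged sketch that sub-samples 10% of a labeled training set and fine-tunes a backbone on it (the backbone, dataset path, and hyperparameters are placeholders, not the paper's protocol; SEER checkpoints themselves are distributed through the VISSL repository):

```python
import torch
import torchvision
from torch.utils.data import DataLoader, Subset

# Hypothetical stand-in for a self-supervised pretrained trunk.
model = torchvision.models.resnet50(weights=None)
model.fc = torch.nn.Linear(model.fc.in_features, 1000)

transform = torchvision.transforms.Compose([
    torchvision.transforms.RandomResizedCrop(224),
    torchvision.transforms.ToTensor(),
])
# "/path/to/imagenet/train" is a placeholder, not a path from the paper.
full_train = torchvision.datasets.ImageFolder("/path/to/imagenet/train", transform)

# Keep a random 10% of the training images to mimic the low-shot setting.
indices = torch.randperm(len(full_train))[: len(full_train) // 10]
loader = DataLoader(Subset(full_train, indices), batch_size=256, shuffle=True)

optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
loss_fn = torch.nn.CrossEntropyLoss()

for images, targets in loader:  # one fine-tuning epoch over the 10% subset
    optimizer.zero_grad(set_to_none=True)
    loss = loss_fn(model(images), targets)
    loss.backward()
    optimizer.step()
```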

Implications and Comparisons

A significant contribution of this work is the demonstration that performance scales positively with both dataset size and model capacity. This finding extends SSL's applicability beyond controlled laboratory conditions, suggesting that robust, efficient models can be trained from the vast, uncurated data pools typical of real-world settings.

The paper compares SEER against both semi-supervised and self-supervised methods trained on curated datasets. Remarkably, even with access to only a fraction of the labeled data, SEER's performance remains competitive. This positions SEER as a cost-effective and scalable alternative to approaches built on curated data, which often require extensive preprocessing and filtering.

Considerations for Future Developments

These findings could shift the paradigm of visual model training by diminishing reliance on carefully curated datasets, promoting more democratized access to model training for research and industry. The paper also opens pathways for future research in continuous learning systems, which might take advantage of a seamlessly expanding data landscape, thus enhancing model adaptability and robustness.

In conclusion, the research presents compelling evidence that SSL methods can effectively scale with large, uncurated datasets to produce highly competitive models. By bridging the gap between theoretical potential and practical applicability, this work sets a precedent for further advancements in leveraging uncurated data in AI and deep learning environments.

Authors (11)
  1. Priya Goyal (15 papers)
  2. Mathilde Caron (25 papers)
  3. Benjamin Lefaudeux (1 paper)
  4. Min Xu (169 papers)
  5. Pengchao Wang (3 papers)
  6. Vivek Pai (1 paper)
  7. Mannat Singh (13 papers)
  8. Vitaliy Liptchinsky (12 papers)
  9. Ishan Misra (65 papers)
  10. Armand Joulin (81 papers)
  11. Piotr Bojanowski (50 papers)
Citations (253)