Overview of "Vision Models Are More Robust And Fair When Pretrained On Uncurated Images Without Supervision"
The paper "Vision Models Are More Robust And Fair When Pretrained On Uncurated Images Without Supervision" investigates self-supervised learning (SSL) for training vision models at scale. It tests the hypothesis that vision models become more robust and fair when pretrained on a diverse set of uncurated internet images, rather than on traditional object-centric datasets such as ImageNet.
Methodology and Approach
The authors use a discriminative SSL framework, building on recent advances that sidestep the labeling bottleneck of supervised learning. The method relies on a pretext task that discriminates between images without labels, yielding high-quality features that transfer to numerous downstream tasks. Central to the paper is training on a wide spectrum of internet images without preprocessing or explicit curation, so that the data better reflects global diversity.
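To make the idea of a discriminative, label-free pretext task concrete, here is a minimal sketch of a contrastive InfoNCE-style objective. This is an illustrative stand-in, not necessarily the paper's exact loss: given two augmented views of each image, the loss pulls matching embeddings together and pushes non-matching ones apart, with no labels involved. All names and the temperature value are hypothetical.

```python
import numpy as np

def info_nce_loss(z_a, z_b, temperature=0.1):
    """Contrastive loss between two batches of embeddings.

    z_a, z_b: (batch, dim) arrays where z_a[i] and z_b[i] are embeddings
    of two augmented views of the same image. The positive pair for row i
    is column i; every other column acts as a negative. No labels needed.
    """
    # L2-normalize so the dot product is cosine similarity.
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature            # (batch, batch) similarities
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))           # cross-entropy on positives
```

Minimizing this loss teaches the encoder which crops come from the same underlying image, which is what allows feature learning directly from uncurated, unlabeled data.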
A significant contribution is the training of a 10-billion-parameter model, large enough to learn from such data volumes without underfitting. Fully Sharded Data Parallel (FSDP) training is used to handle the computational challenges of models at this scale, reducing per-device memory usage and improving computational efficiency.
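The core idea behind FSDP can be illustrated with a toy single-process sketch: each worker stores only a 1/world_size shard of the parameters at rest, and the full tensor is all-gathered only when a layer actually needs it, then discarded. This is a conceptual illustration, not the PyTorch FSDP API; the function names are hypothetical.

```python
import numpy as np

def shard_params(params, world_size):
    """Split a flat parameter vector into near-equal shards, one per rank.

    At rest, each rank holds only its own shard, so resident parameter
    memory per device drops by roughly a factor of world_size.
    """
    return np.array_split(params, world_size)

def all_gather(shards):
    """Reassemble the full parameter vector.

    FSDP performs this gather just before a layer's forward/backward pass
    and frees the full copy immediately afterwards.
    """
    return np.concatenate(shards)

params = np.arange(10, dtype=np.float32)   # toy "model weights"
shards = shard_params(params, world_size=4)
full = all_gather(shards)                  # identical to the original params
```

The same sharding is applied to gradients and optimizer state, which is what makes a 10-billion-parameter model fit across commodity accelerator memory.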
Experimental Results
The pretrained model is evaluated on more than 50 benchmarks covering fairness, robustness to distribution shift, and fine-grained recognition. Noteworthy findings include:
- Fairness and Bias: Models pretrained on uncurated data exhibit reduced gender and skin-tone biases compared with models trained on datasets like ImageNet.
- Robustness: The model generalizes better out of domain on benchmarks with distribution shift, indicating that SSL pretraining improves robustness over traditional supervised approaches.
- Representation Quality: Through linear probing, the model outperforms state-of-the-art supervised and self-supervised models on the majority of evaluated tasks, reflecting the quality of visual features learned during the unsupervised pretraining phase.
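Linear probing, the evaluation used for the representation-quality claim above, means freezing the pretrained backbone and fitting only a linear classifier on its features. Below is a minimal self-contained sketch; the "backbone" is a fixed random projection standing in for the pretrained model, and all names and hyperparameters are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def backbone(x, W):
    """Frozen feature extractor: W is fixed and never updated."""
    feats = np.maximum(x @ W, 0.0)  # stand-in for pretrained features
    return feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8)

def train_linear_probe(feats, labels, lr=1.0, steps=500):
    """Fit only a binary logistic-regression head on the frozen features."""
    w = np.zeros(feats.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(feats @ w)))        # predicted probabilities
        w -= lr * feats.T @ (p - labels) / len(labels)  # gradient step on head only
    return w

# Toy data: the label depends only on the first input coordinate.
x = rng.normal(size=(200, 8))
y = (x[:, 0] > 0).astype(float)
W = rng.normal(size=(8, 16))          # frozen "pretrained" weights
feats = backbone(x, W)
w = train_linear_probe(feats, y)
acc = np.mean(((feats @ w) > 0) == y)  # probe accuracy on the toy task
```

Because only the linear head is trained, probe accuracy directly measures how linearly separable, and hence how useful, the frozen features are.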
Implications and Future Directions
The research underscores the potential of pairing vast, uncurated datasets with scaled model architectures to surpass traditional supervised learning benchmarks in both performance and ethical dimensions. Training on diverse, uncurated data helps address issues of bias and fairness, promoting more equitable AI systems.
The findings also hint at scalable training methodologies as pivotal for future AI developments. As demonstrated, increases in model size correlate with improved task performance and fairness, suggesting further exploration could yield even more substantial gains.
Given these insights, future work could focus on:
- Extending the model's applicability to a broader range of domains beyond computer vision.
- Investigating the benefits of unsupervised pretraining in multimodal and sequential data contexts.
- Exploring more efficient ways to scale model training without incurring prohibitive computational costs.
The paper provides a comprehensive framework for advancing vision tasks while mitigating ethical concerns, setting a benchmark for future AI research in responsibly harnessing large-scale datasets.