Revisiting Weakly Supervised Pre-Training of Visual Perception Models (2201.08371v2)

Published 20 Jan 2022 in cs.CV

Abstract: Model pre-training is a cornerstone of modern visual recognition systems. Although fully supervised pre-training on datasets like ImageNet is still the de-facto standard, recent studies suggest that large-scale weakly supervised pre-training can outperform fully supervised approaches. This paper revisits weakly-supervised pre-training of models using hashtag supervision with modern versions of residual networks and the largest-ever dataset of images and corresponding hashtags. We study the performance of the resulting models in various transfer-learning settings including zero-shot transfer. We also compare our models with those obtained via large-scale self-supervised learning. We find our weakly-supervised models to be very competitive across all settings, and find they substantially outperform their self-supervised counterparts. We also include an investigation into whether our models learned potentially troubling associations or stereotypes. Overall, our results provide a compelling argument for the use of weakly supervised learning in the development of visual recognition systems. Our models, Supervised Weakly through hashtAGs (SWAG), are available publicly.

Authors (10)
  1. Mannat Singh (13 papers)
  2. Laura Gustafson (11 papers)
  3. Aaron Adcock (10 papers)
  4. Vinicius de Freitas Reis (1 paper)
  5. Bugra Gedik (8 papers)
  6. Raj Prateek Kosaraju (3 papers)
  7. Dhruv Mahajan (38 papers)
  8. Ross Girshick (75 papers)
  9. Piotr Dollár (49 papers)
  10. Laurens van der Maaten (54 papers)
Citations (108)

Summary

Revisiting Weakly Supervised Pre-Training of Visual Perception Models

This paper offers a thorough examination of weakly supervised pre-training for visual perception models, particularly in comparison with fully supervised and self-supervised techniques. The research revisits hashtag supervision as a weakly supervised approach, leveraging the largest dataset of images and corresponding hashtags compiled to date. The resulting models, termed SWAG (Supervised Weakly through hashtAGs), are evaluated across a range of transfer-learning settings, including zero-shot transfer, and compared with models pre-trained via large-scale self-supervised learning.

The dataset was collected from public Instagram images using hashtag-based labeling. Construction involved filtering and canonicalizing hashtags into uniform categories, then re-sampling the images to balance the distribution across classes. This procedure reportedly reduced the noise inherent in hashtag data, yielding a more effective pre-training dataset.
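The canonicalize-then-resample step can be sketched as follows. This is a minimal illustration, not the paper's exact procedure: the function names, the synonym map, and the uniform per-label target are assumptions made for the example.

```python
import random

def canonicalize(hashtag, synonym_map):
    """Map raw hashtag variants (case, plurals, spellings) to one canonical label.

    `synonym_map` is a hypothetical dict, e.g. {"doggo": "dog", "dogs": "dog"}.
    """
    tag = hashtag.lower().lstrip("#")
    return synonym_map.get(tag, tag)

def resample_balanced(images_by_label, target_per_label, seed=0):
    """Downsample head classes and upsample tail classes (with replacement)
    so every label contributes roughly `target_per_label` examples."""
    rng = random.Random(seed)
    sampled = []
    for label, images in images_by_label.items():
        if len(images) >= target_per_label:
            sampled.extend(rng.sample(images, target_per_label))
        else:
            sampled.extend(rng.choices(images, k=target_per_label))
    return sampled
```

Balancing in this way counteracts the heavy Zipfian skew of user-generated hashtags, where a few popular tags would otherwise dominate training.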

Once pre-trained, the models were transferred to diverse image-classification tasks and assessed for their off-the-shelf performance in zero-shot and few-shot settings, underscoring the paper's focus on learning new visual concepts efficiently, without additional task-specific training.

Key findings show that the weakly supervised models are competitive across a range of benchmarks, often matching or surpassing state-of-the-art models pre-trained under other paradigms, including leading self-supervised learners. In zero-shot transfer in particular, the models demonstrated a notable advantage, with Platt scaling used to calibrate their output scores.
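Platt scaling fits a sigmoid over raw classifier scores on held-out data, mapping them to calibrated probabilities. A minimal sketch using plain gradient descent on the log loss; the learning rate, epoch count, and function names are illustrative assumptions, not the paper's implementation:

```python
import math

def fit_platt(scores, labels, lr=0.1, epochs=500):
    """Fit sigmoid parameters (a, b) by gradient descent on the log loss,
    so that sigmoid(a * score + b) approximates P(label = 1 | score)."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(-(a * s + b)))
            grad_a += (p - y) * s  # d(log loss)/da for this example
            grad_b += (p - y)      # d(log loss)/db for this example
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

def platt_prob(score, a, b):
    """Calibrated probability for a raw score under fitted (a, b)."""
    return 1.0 / (1.0 + math.exp(-(a * score + b)))
```

Because the learned map is monotone, calibration reshapes confidence values without changing the ranking of predictions.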

A limitation addressed in the paper is the risk of inheriting harmful or biased associations from the training data, owing to the noisy nature of user-generated hashtags. Experiments probing for stereotypes and biases suggest these models may be less prone to such issues than certain LLMs, yet the authors advocate cautious deployment and further analysis for bias mitigation.

The broader implication is that weakly supervised learning frameworks can support scalable, versatile visual perception systems in which the burden of intensive data labeling is circumvented without substantial performance trade-offs.

Plausible future directions include improved noise-filtering algorithms and hybrids of weakly and self-supervised strategies for greater pre-training efficacy. As AI integrates into complex real-world applications, the paper's findings reinforce the viability of less resource-intensive yet highly effective training paradigms for advancing visual recognition.
