
Scaling and Benchmarking Self-Supervised Visual Representation Learning (1905.01235v2)

Published 3 May 2019 in cs.CV, cs.AI, and cs.LG

Abstract: Self-supervised learning aims to learn representations from the data itself without explicit manual supervision. Existing efforts ignore a crucial aspect of self-supervised learning - the ability to scale to large amount of data because self-supervision requires no manual labels. In this work, we revisit this principle and scale two popular self-supervised approaches to 100 million images. We show that by scaling on various axes (including data size and problem 'hardness'), one can largely match or even exceed the performance of supervised pre-training on a variety of tasks such as object detection, surface normal estimation (3D) and visual navigation using reinforcement learning. Scaling these methods also provides many interesting insights into the limitations of current self-supervised techniques and evaluations. We conclude that current self-supervised methods are not 'hard' enough to take full advantage of large scale data and do not seem to learn effective high level semantic representations. We also introduce an extensive benchmark across 9 different datasets and tasks. We believe that such a benchmark along with comparable evaluation settings is necessary to make meaningful progress. Code is at: https://github.com/facebookresearch/fair_self_supervision_benchmark.

Scaling and Benchmarking Self-Supervised Visual Representation Learning

The paper "Scaling and Benchmarking Self-Supervised Visual Representation Learning" investigates self-supervised learning (SSL) approaches for visual representation learning. Unlike traditional supervised learning, SSL utilizes the data itself to generate labels, obviating the need for manually labeled datasets. This work systematically explores how SSL can be effectively scaled and analyzed through comprehensive benchmarks.

Overview

The paper revisits the core principle of SSL: scalability. Because SSL does not rely on manual labels, it should in principle scale efficiently with data. The paper examines two prominent SSL methods, Jigsaw and Colorization, along three axes: data size, model capacity, and problem complexity. This yields meaningful insights into SSL's current capabilities and limitations, especially relative to supervised pre-training.
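As a concrete illustration of how such pretext tasks manufacture labels from the data itself, below is a minimal sketch of the Jigsaw task in PyTorch. This is an illustrative reconstruction, not the authors' code: the 3x3 patch grid, the fixed permutation set, and names like `JigsawHead` are assumptions.

```python
import torch
import torch.nn as nn

# Minimal sketch of the Jigsaw pretext task (illustrative, not the authors'
# implementation). An image is cut into a 3x3 grid of patches, the patches
# are shuffled by a permutation drawn from a fixed set, and the network is
# trained to predict which permutation was applied.

PERMUTATIONS = [torch.randperm(9) for _ in range(100)]  # fixed permutation set

def make_jigsaw_example(image: torch.Tensor):
    """image: (C, H, W) with H and W divisible by 3; returns shuffled patches
    and the index of the permutation used, which serves as the 'free' label."""
    c, h, w = image.shape
    ph, pw = h // 3, w // 3
    patches = [image[:, i * ph:(i + 1) * ph, j * pw:(j + 1) * pw]
               for i in range(3) for j in range(3)]       # row-major 3x3 grid
    label = int(torch.randint(len(PERMUTATIONS), ()))
    shuffled = torch.stack([patches[k] for k in PERMUTATIONS[label]])
    return shuffled, label

class JigsawHead(nn.Module):
    """Shared per-patch encoder plus a classifier over the permutation set."""
    def __init__(self, encoder: nn.Module, feat_dim: int, n_perms: int):
        super().__init__()
        self.encoder = encoder                       # e.g. a ConvNet trunk
        self.fc = nn.Linear(9 * feat_dim, n_perms)

    def forward(self, patches):                      # patches: (B, 9, C, ph, pw)
        b = patches.shape[0]
        feats = self.encoder(patches.flatten(0, 1))  # (B*9, feat_dim)
        return self.fc(feats.reshape(b, -1))         # (B, n_perms) logits
```

Colorization works analogously: roughly speaking, the free label is the color information removed from a grayscale input, which the network must predict back.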

Key Components

  1. Scaling Data Size: The scalability of SSL methods is tested by pre-training on datasets of up to 100 million images. Transfer performance improves log-linearly with data size, and Jigsaw consistently outperforms Colorization, indicating that it exploits large-scale data more effectively.
  2. Scaling Model Capacity: The paper compares a higher-capacity model (ResNet-50) against a lower-capacity one (AlexNet). ResNet-50 captures the benefits of increased data size far better than AlexNet, underscoring the importance of larger models in SSL.
  3. Scaling Problem Complexity: Increasing the difficulty of the pretext task, for example by enlarging the permutation set used in Jigsaw, makes the advantage of high-capacity models more pronounced; tuning this 'hardness' is crucial to SSL performance (see the permutation-set sketch after this list).
  4. Benchmarking Suite: An extensive suite spanning nine diverse datasets and tasks is proposed to compare SSL methods objectively. The benchmarks include image classification, low-shot classification, object detection, scene geometry (3D), and visual navigation, offering a holistic view of representational quality under fixed, comparable evaluation settings (see the linear-probe sketch below).
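To make the 'hardness' knob in item 3 concrete, the sketch below shows one common way to build a Jigsaw permutation set of a given size: greedily keeping permutations mutually far apart in Hamming distance so the classes stay well separated. This is an illustrative recipe, not the paper's exact procedure; `build_permutation_set` and its defaults are hypothetical.

```python
import random

def hamming(p, q):
    """Number of positions at which two permutations differ."""
    return sum(a != b for a, b in zip(p, q))

def build_permutation_set(n_perms, n_patches=9, n_candidates=10000, seed=0):
    """Greedily select permutations that are mutually far apart in Hamming
    distance. Growing n_perms is the 'problem hardness' axis discussed above:
    more permutation classes make the pretext task harder."""
    rng = random.Random(seed)
    pool = [tuple(rng.sample(range(n_patches), n_patches))
            for _ in range(n_candidates)]
    chosen = [pool.pop()]
    while len(chosen) < n_perms:
        # Take the candidate whose distance to its nearest chosen permutation
        # is largest (farthest-point greedy selection).
        best = max(pool, key=lambda p: min(hamming(p, c) for c in chosen))
        pool.remove(best)
        chosen.append(best)
    return chosen

perms = build_permutation_set(n_perms=100)  # scale this up to harden the task
```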
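And to make item 4's comparable evaluation settings concrete: the benchmark's core protocol keeps the pre-trained network frozen and fits only a lightweight linear classifier on its features, so scores reflect representation quality rather than fine-tuning budget. Below is a hedged scikit-learn sketch of such a linear probe; the function name and the fixed `C` are illustrative choices.

```python
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

def linear_probe(train_feats, train_labels, test_feats, test_labels, C=1.0):
    """Fit a linear classifier on frozen features and report test accuracy.
    Sketch of the standard linear-probe protocol; a real benchmark run would
    cross-validate C and fix preprocessing identically across methods."""
    clf = LinearSVC(C=C).fit(train_feats, train_labels)
    return accuracy_score(test_labels, clf.predict(test_feats))

# Features would come from a frozen self-supervised trunk; the arrays here
# are hypothetical:
#   acc = linear_probe(f_train, y_train, f_test, y_test)
```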

Findings

  • Data and Model Interplay: Gains from the three axes (data size, model capacity, and problem complexity) are complementary; the largest improvements come from scaling them together (a toy illustration of the log-linear data trend follows this list).
  • Comparison to Supervised Learning: On tasks such as surface normal estimation and visual navigation, SSL representations match or exceed their supervised counterparts. On semantic tasks such as image classification, however, supervised pre-training retains a significant lead.
  • Pre-training and Transfer Domains: How closely the pre-training domain matches the transfer domain strongly influences SSL effectiveness; models pre-trained on data similar to the target task transfer better.
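To pin down what 'log-linear' means in the data-scaling finding above, the sketch below fits accuracy against log10 of the pre-training set size. All numbers are synthetic, generated purely for illustration; only the fitting recipe, not the values, reflects the paper's analysis.

```python
import numpy as np

# Toy illustration of log-linear scaling: accuracy grows roughly linearly in
# log10(pre-training set size). The data below is synthetic, for illustration
# only; it is NOT taken from the paper.
rng = np.random.default_rng(0)
sizes = np.array([1e6, 1e7, 5e7, 1e8])                 # pre-training images
acc = 40 + 5 * np.log10(sizes) + rng.normal(0, 0.3, sizes.shape)

slope, intercept = np.polyfit(np.log10(sizes), acc, deg=1)
print(f"~{slope:.1f} accuracy points per 10x more unlabeled data")
```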

Implications and Future Directions

The implications of these findings are twofold. Practically, they suggest that SSL with large datasets and high-capacity models can match supervised pre-training in select domains. Theoretically, they point to the need for harder pretext tasks that fully exploit data and model capacity.

Moving forward, research could focus on:

  • Designing complex pretext tasks that better capture semantic abstractions.
  • Exploring SSL's resilience to domain shifts and its ability to generalize across diverse datasets.
  • Further enhancing benchmarks to cover a broader spectrum of representation learning evaluations.

Conclusion

The exploration into scaling SSL methods underscores their potential, yet highlights that current methods do not fully capitalize on the available data and model capacity. Establishing standardized benchmarks is pivotal for measuring meaningful progress. Researchers are encouraged to develop SSL techniques that overcome these limitations, striving to match or surpass supervised pre-training in image understanding.

Authors: Priya Goyal, Dhruv Mahajan, Abhinav Gupta, Ishan Misra
Citations: 384