- The paper introduces DetCon, a contrastive detection objective that extracts object-level signals to enable efficient pretraining.
- The method achieves comparable or better transfer performance with up to ten times less pretraining than prior methods, substantially reducing computational cost.
- The paper demonstrates improved transfer learning across diverse tasks, including COCO detection, instance segmentation, and semantic segmentation.
Summary of "Efficient Visual Pretraining with Contrastive Detection"
The paper "Efficient Visual Pretraining with Contrastive Detection" introduces contrastive detection (DetCon), a self-supervised objective designed to reduce the substantial computational cost of state-of-the-art self-supervised pretraining. The method learns object-level features that are consistent across image augmentations, enabling efficient transfer learning with notable gains in both accuracy and computational efficiency over prior approaches.
Key Contributions
The research highlights several pivotal contributions to the domain of computer vision and self-supervised learning:
- Contrastive Detection Objective: The proposed DetCon objective extracts learning signals from individual objects within images, increasing the amount of information gained from each training example without requiring additional resources. Unsupervised segmentation techniques generate approximate object-based regions, and features pooled within these regions are contrasted across augmentations.
- Reduced Computational Requirements: DetCon matches or exceeds the performance of existing state-of-the-art methods on a wide range of downstream tasks with significantly less pretraining, reducing pretraining requirements by up to tenfold.
- Improved Transfer Learning: Models pretrained with DetCon demonstrate enhanced transfer capabilities across multiple complex tasks including COCO detection and instance segmentation, semantic segmentation on PASCAL and Cityscapes, and NYU depth estimation. Notably, a DetCon-pretrained model rivaled SEER, a large-scale self-supervised system requiring significantly more data, illustrating the efficiency of the proposed approach.
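To make the objective concrete, here is a minimal numpy sketch of the core idea: features are average-pooled under each segmentation mask, and pooled vectors for the same mask under two augmented views are treated as positives in a per-mask InfoNCE loss. This is an illustration under stated assumptions (function names, array shapes, and the temperature value are ours), not the paper's implementation.

```python
import numpy as np

def mask_pool(features, masks):
    """Average-pool a flattened feature map under each binary mask.

    features: (H*W, D) flattened feature map from the backbone
    masks:    (K, H*W) approximate object masks (assumed non-empty)
    returns:  (K, D) one pooled feature vector per region
    """
    weights = masks / np.maximum(masks.sum(axis=1, keepdims=True), 1.0)
    return weights @ features

def detcon_loss(z1, z2, temperature=0.1):
    """Per-mask InfoNCE: for each mask, its pooled feature in view 1 is
    attracted to the same mask's feature in view 2 (the diagonal) and
    repelled from the other masks' features (off-diagonal negatives)."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = (z1 @ z2.T) / temperature                 # (K, K) similarities
    # log-softmax over negatives for each row; positives on the diagonal
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(len(z1))
    return -log_prob[idx, idx].mean()
```

In the paper the two views come from independent augmentations of the same image and the masks are produced by an off-the-shelf unsupervised segmenter; the sketch above only shows how the pooled region features enter the contrastive objective.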
Results
The empirical evaluation shows that DetCon outperforms existing methods such as SimCLR and BYOL on a variety of transfer learning tasks. DetCon models are particularly strong on complex, multi-object scenes, narrowing the gap between self-supervised and supervised pretraining on datasets such as COCO.
- Pretraining Efficiency: Experiments show a marked reduction in pretraining time while achieving state-of-the-art performance on several downstream tasks, quantified by comparing models on object detection and instance segmentation.
- Model Scalability: The research extends to large model architectures, demonstrating that DetCon sustains its advantages when scaling to deeper networks, thereby reinforcing its utility in large-scale applications.
Implications and Future Work
The introduction of DetCon paves the way for more resource-efficient training of self-supervised models, which is increasingly critical as datasets become larger and more complex. Its ability to reduce dependency on extensive labeled data aligns well with contemporary trends towards utilizing vast unlabeled data reservoirs.
This paper also raises intriguing avenues for future research, particularly in improving the quality and accuracy of the segmentations used during pretraining; better masks could further strengthen the learning signal. Additionally, integrating DetCon with other contrastive learning objectives could yield novel self-supervised strategies beyond the current state of the art.
In summary, DetCon offers a promising pathway in self-supervised learning, enhancing both the efficiency and efficacy of visual pretraining, and setting a strong foundation for future developments in the discipline.