- The paper presents DenseCL, a novel self-supervised method that leverages a dense contrastive loss to improve local feature learning for dense prediction tasks.
- It employs two parallel projection heads, one global and one dense, preserving spatial detail while contrasting features.
- Empirical results show improved object detection and segmentation performance, with gains of up to 3.0% mIoU and notable AP improvements on standard benchmarks.
Dense Contrastive Learning for Self-Supervised Visual Pre-Training
The paper introduces Dense Contrastive Learning (DenseCL), a self-supervised learning method designed to improve visual pre-training for dense prediction tasks such as object detection and semantic segmentation. Traditional self-supervised techniques are typically optimized for image-level objectives such as classification, which can be suboptimal for tasks requiring pixel-level precision. DenseCL addresses this gap by operating directly on local feature correspondences through a dense pairwise contrastive loss.
Methodology
DenseCL extends conventional contrastive learning by operating on local image features rather than a single global image-level representation. The architecture consists of a backbone network that outputs dense feature maps, which are processed by two parallel projection heads: a global head and a dense head. Unlike the global head, which pools the feature map into one vector, the dense projection head preserves spatial information by using 1x1 convolution layers instead of pooling, producing a projected feature vector at every spatial location.
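The sketch below illustrates this two-head design in PyTorch. It is a minimal illustration, not the authors' released code: the class names, layer widths, and output dimension (128) are assumptions chosen for clarity. The point is only that the dense head keeps the H x W grid that global pooling would discard.

```python
import torch.nn as nn

class GlobalHead(nn.Module):
    """MoCo-style global head: global average pooling followed by an MLP."""
    def __init__(self, in_dim=2048, hidden_dim=2048, out_dim=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, feat):                 # feat: (B, C, H, W)
        pooled = feat.mean(dim=(2, 3))       # collapse the spatial grid -> (B, C)
        return self.mlp(pooled)              # one vector per image: (B, out_dim)

class DenseHead(nn.Module):
    """Dense head: 1x1 convolutions keep the spatial grid intact."""
    def __init__(self, in_dim=2048, hidden_dim=2048, out_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_dim, hidden_dim, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden_dim, out_dim, kernel_size=1),
        )

    def forward(self, feat):                 # feat: (B, C, H, W)
        return self.conv(feat)               # one vector per location: (B, out_dim, H, W)
```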
The core of DenseCL is its loss function, which supplements the standard image-level InfoNCE loss with a dense counterpart. The dense loss applies a pairwise contrastive objective to local regions of two augmented views of the same image. The positive sample for each local feature vector is defined by cross-view correspondence: each location in one view is matched to the most similar location in the other view, using cosine similarity computed on the downsampled backbone feature maps.
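A minimal sketch of such a dense InfoNCE term is shown below, simplified to a single image pair with a precomputed set of negatives. The momentum encoder and negative queue used in the paper are omitted, and all function and variable names are hypothetical assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def dense_contrastive_loss(q_dense, k_dense, q_back, k_back, negatives, tau=0.2):
    """
    q_dense, k_dense: dense-head outputs for the two views, shape (D, H, W)
    q_back,  k_back : backbone feature maps used only for matching, shape (C, H, W)
    negatives       : negative dense feature vectors, shape (K, D)
    """
    D, H, W = q_dense.shape
    q = F.normalize(q_dense.reshape(D, H * W).t(), dim=1)    # (HW, D)
    k = F.normalize(k_dense.reshape(D, H * W).t(), dim=1)    # (HW, D)

    # Cross-view correspondence: match each location in view 1 to the most
    # similar backbone feature in view 2 (cosine similarity, argmax).
    bq = F.normalize(q_back.reshape(-1, H * W).t(), dim=1)   # (HW, C)
    bk = F.normalize(k_back.reshape(-1, H * W).t(), dim=1)   # (HW, C)
    match = (bq @ bk.t()).argmax(dim=1)                      # (HW,) positive index per location

    pos = (q * k[match]).sum(dim=1, keepdim=True)            # (HW, 1) positive logits
    neg = q @ F.normalize(negatives, dim=1).t()              # (HW, K) negative logits
    logits = torch.cat([pos, neg], dim=1) / tau
    labels = torch.zeros(H * W, dtype=torch.long)            # the positive sits at index 0
    return F.cross_entropy(logits, labels)                   # averaged over all locations
```

The overall training objective then combines this dense term with the usual global InfoNCE loss computed on the pooled vectors, weighted by a balancing coefficient.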
Results
DenseCL demonstrates superior performance on several dense prediction tasks compared to the MoCo-v2 baseline, with negligible computational overhead. When transferred to downstream tasks such as object detection and segmentation on PASCAL VOC and COCO, it yields consistent gains: AP improves by 2.0% on PASCAL VOC detection and by 0.9% on COCO instance segmentation, while semantic segmentation improves by 3.0% mIoU on PASCAL VOC and by 1.8% mIoU on Cityscapes.
The empirical findings suggest that DenseCL chiefly improves localization accuracy, as is particularly evident in stricter metrics such as AP75, which reward precise bounding box predictions.
Implications and Future Directions
DenseCL sets a precedent for improving self-supervised models through dense, feature-level learning approaches. The ability to train models without labeled data for tasks traditionally requiring precise annotations opens new avenues for scaling computer vision applications, particularly in domains where labeled data is sparse or costly to obtain.
Theoretically, DenseCL suggests a reconsideration of how spatial information is leveraged in self-supervised learning frameworks. Practically, it helps close the gap between pre-training and fine-tuning objectives, enhancing the transferability of self-supervised models.
Future developments could explore incorporating DenseCL with more intricate architectures, further optimizing the model for computational efficiency, and extending its application to other domains that require fine-grained feature correspondence.
In summary, DenseCL presents a tangible advancement in self-supervised learning for dense prediction tasks, enhancing both theoretical understanding and practical application.