DetCo: Unsupervised Contrastive Learning for Object Detection (2102.04803v2)

Published 9 Feb 2021 in cs.CV

Abstract: Unsupervised contrastive learning achieves great success in learning image representations with CNN. Unlike most recent methods that focused on improving accuracy of image classification, we present a novel contrastive learning approach, named DetCo, which fully explores the contrasts between global image and local image patches to learn discriminative representations for object detection. DetCo has several appealing benefits. (1) It is carefully designed by investigating the weaknesses of current self-supervised methods, which discard important representations for object detection. (2) DetCo builds hierarchical intermediate contrastive losses between global image and local patches to improve object detection, while maintaining global representations for image recognition. Theoretical analysis shows that the local patches actually remove the contextual information of an image, improving the lower bound of mutual information for better contrastive learning. (3) Extensive experiments on PASCAL VOC, COCO and Cityscapes demonstrate that DetCo not only outperforms state-of-the-art methods on object detection, but also on segmentation, pose estimation, and 3D shape prediction, while it is still competitive on image classification. For example, on PASCAL VOC, DetCo-100ep achieves 57.4 mAP, which is on par with the result of MoCov2-800ep. Moreover, DetCo consistently outperforms supervised method by 1.6/1.2/1.0 AP on Mask RCNN-C4/FPN/RetinaNet with 1x schedule. Code will be released at https://github.com/xieenze/DetCo.

Authors (8)

Enze Xie (84 papers)
Jian Ding (132 papers)
Wenhai Wang (123 papers)
Xiaohang Zhan (27 papers)
Hang Xu (205 papers)
Peize Sun (33 papers)
Zhenguo Li (195 papers)
Ping Luo (340 papers)

Citations (300)

Summary

The paper presents a novel approach using multi-level supervision and global-local contrastive learning to enhance both object detection and image classification.
The method surpasses InsLoc by 6.9% in ImageNet top-1 accuracy and exceeds SwAV by 6.9 AP on COCO detection with Mask R-CNN C4.
Its balanced design for dense and instance-level prediction informs future research in integrated vision systems and complex scene understanding.

An Overview of DetCo: Unsupervised Contrastive Learning for Object Detection

The paper "DetCo: Unsupervised Contrastive Learning for Object Detection" introduces a self-supervised learning approach aimed specifically at object detection. The work addresses a fundamental challenge in contrastive learning: the typical trade-off between object detection and image classification performance. While existing methods often excel in one area at the expense of the other, DetCo reports improvements in both by combining multi-level supervision with global-local contrastive learning.

Core Innovations in DetCo

DetCo's key contributions stem from two primary innovations: multi-level supervision and global-local contrastive learning.

Multi-Level Supervision: Traditional contrastive learning methods tend to focus only on the final layers of the neural network to ensure feature discrimination for classification. DetCo, however, applies supervision across multiple layers of the feature pyramid. This approach is crucial for object detection, where effective discrimination across various levels is necessary for accurate predictions. By using intermediate layer supervision, DetCo enhances representational quality throughout the network, leading to better performance on dense prediction tasks.
Global and Local Contrastive Learning: This framework extends contrastive learning beyond the global image perspective to include local regions or patches. This cross-level contrastive learning ensures that both global and local representations are optimized, thereby improving instance-level discrimination necessary for detection, while also maintaining strong image-level features for classification. This duality in learning tasks ensures that DetCo achieves a balanced improvement in both object detection and image classification tasks.
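
The two ideas above can be combined in a single loss. The following is a minimal sketch in plain Python, not the authors' implementation: it assumes a standard InfoNCE loss, and it assumes that each network stage contributes three terms (global-to-global, local-to-local, and global-to-local) that are summed with a per-stage weight. The function names, the dictionary keys, the temperature, and the weights are all hypothetical choices made for illustration.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors given as lists."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(dot(v, v)) or 1.0
    return [x / norm for x in v]


def info_nce(query, positive_key, negative_keys, temperature=0.2):
    """InfoNCE loss for one query: negative log-softmax of the positive key."""
    q = normalize(query)
    keys = [normalize(positive_key)] + [normalize(k) for k in negative_keys]
    logits = [dot(q, k) / temperature for k in keys]
    # Log-sum-exp with max subtraction for numerical stability.
    top = max(logits)
    log_sum = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_sum - logits[0]


def multi_level_loss(levels, weights, temperature=0.2):
    """Weighted sum of global/local contrastive terms over network stages.

    Each entry of `levels` is a dict (hypothetical layout) with:
      g_q, g_k     : global-image query and key features at this stage
      l_q, l_k     : local-patch query and key features at this stage
      neg_g, neg_l : lists of negative global and local keys
    """
    total = 0.0
    for feats, weight in zip(levels, weights):
        g2g = info_nce(feats["g_q"], feats["g_k"], feats["neg_g"], temperature)
        l2l = info_nce(feats["l_q"], feats["l_k"], feats["neg_l"], temperature)
        # Cross term: global query contrasted against local-patch keys.
        g2l = info_nce(feats["g_q"], feats["l_k"], feats["neg_l"], temperature)
        total += weight * (g2g + l2l + g2l)
    return total


if __name__ == "__main__":
    # Toy 2-D features for two stages; real features would come from a backbone.
    stage = {
        "g_q": [1.0, 0.0], "g_k": [0.9, 0.1],
        "l_q": [0.8, 0.2], "l_k": [1.0, 0.1],
        "neg_g": [[0.0, 1.0], [-1.0, 0.0]],
        "neg_l": [[0.1, 1.0], [-0.5, 0.5]],
    }
    # Illustrative weights: the shallower stage counts less than the deeper one.
    print(multi_level_loss([stage, stage], weights=[0.5, 1.0]))
```

Weighting shallow stages below deep ones is one plausible design choice, since early features are less semantic; the sketch leaves the weights as a free parameter rather than claiming particular values.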

Experimental Performance

DetCo is evaluated on several datasets, including PASCAL VOC, COCO, and Cityscapes. On ImageNet classification, DetCo surpasses InsLoc and DenseCL by 6.9% and 5% in top-1 accuracy, respectively. On COCO detection with Mask R-CNN C4, DetCo exceeds SwAV by 6.9 AP. DetCo also raises Sparse R-CNN’s performance from 45.0 AP to 46.5 AP, a gain of 1.5 AP that the overview describes as a new state of the art on the COCO benchmark.

Implications and Future Directions

DetCo's improvements are significant for several reasons. First, it ensures that models are versatile across a range of tasks from detection to classification, which is valuable in holistic vision systems. The ability to achieve strong performance on two traditionally conflicting tasks with a single model architecture could inform future developments in unsupervised learning and feature extraction strategies.

In terms of future research directions, DetCo opens avenues for exploring the integration of global-local representations in other multimodal settings or complex vision tasks like 3D object detection and scene understanding. Additionally, the approach could be extended to work with transformer-based architectures or in conjunction with reinforcement learning for even more robust scene understanding.

Overall, DetCo represents a substantial advancement in self-supervised learning for computer vision, providing a balanced and effective approach for tackling both object detection and image classification within a unified framework.

GitHub

GitHub - xieenze/DetCo (269 stars)