
Propagate Yourself: Exploring Pixel-Level Consistency for Unsupervised Visual Representation Learning (2011.10043v2)

Published 19 Nov 2020 in cs.CV and cs.LG

Abstract: Contrastive learning methods for unsupervised visual representation learning have reached remarkable levels of transfer performance. We argue that the power of contrastive learning has yet to be fully unleashed, as current methods are trained only on instance-level pretext tasks, leading to representations that may be sub-optimal for downstream tasks requiring dense pixel predictions. In this paper, we introduce pixel-level pretext tasks for learning dense feature representations. The first task directly applies contrastive learning at the pixel level. We additionally propose a pixel-to-propagation consistency task that produces better results, even surpassing the state-of-the-art approaches by a large margin. Specifically, it achieves 60.2 AP, 41.4 / 40.5 mAP and 77.2 mIoU when transferred to Pascal VOC object detection (C4), COCO object detection (FPN / C4) and Cityscapes semantic segmentation using a ResNet-50 backbone network, which are 2.6 AP, 0.8 / 1.0 mAP and 1.0 mIoU better than the previous best methods built on instance-level contrastive learning. Moreover, the pixel-level pretext tasks are found to be effective for pre-training not only regular backbone networks but also head networks used for dense downstream tasks, and are complementary to instance-level contrastive methods. These results demonstrate the strong potential of defining pretext tasks at the pixel level, and suggest a new path forward in unsupervised visual representation learning. Code is available at \url{https://github.com/zdaxie/PixPro}.

Authors (6)
  1. Zhenda Xie (51 papers)
  2. Yutong Lin (15 papers)
  3. Zheng Zhang (488 papers)
  4. Yue Cao (147 papers)
  5. Stephen Lin (72 papers)
  6. Han Hu (196 papers)
Citations (388)

Summary

Exploring Pixel-Level Consistency for Unsupervised Visual Representation Learning

The paper "Propagate Yourself: Exploring Pixel-Level Consistency for Unsupervised Visual Representation Learning" introduces an innovative approach to unsupervised visual representation learning by shifting the focus from instance-level to pixel-level pretext tasks. Traditional methods predominantly rely on instance discrimination, potentially limiting spatial sensitivity needed for pixel-based tasks such as object detection and semantic segmentation. This work addresses this gap, proposing a framework that enhances pixel-level consistency through novel pretext tasks.

Methodology and Key Contributions

The authors propose two main pixel-level pretext tasks: PixContrast and PixPro. PixContrast extends contrastive learning to the pixel level, treating each pixel as its own class and using a contrastive loss to pull together spatially corresponding pixels across two augmented views while pushing apart spatially distant ones (a minimal sketch follows).
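To make the pixel-level contrastive idea concrete, here is a minimal PyTorch-style sketch of an InfoNCE loss over pixels from two augmented views. The function name, the normalized-distance rule for assigning positives, and the threshold and temperature values are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def pix_contrast_loss(feat1, feat2, coords1, coords2, pos_thresh=0.7, tau=0.3):
    """Illustrative pixel-level InfoNCE loss (hypothetical helper, not the official code).

    feat1, feat2:     (N, C) per-pixel features from two augmented views
    coords1, coords2: (N, 2) pixel locations warped back to the original image space
    pos_thresh:       pixels closer than this (normalized) distance count as positives
    tau:              softmax temperature
    """
    f1 = F.normalize(feat1, dim=1)
    f2 = F.normalize(feat2, dim=1)

    # Cosine similarity between every pixel of view 1 and every pixel of view 2.
    sim = f1 @ f2.t() / tau                                # (N, N)

    # Positives: pixel pairs that land near each other in the original image.
    dist = torch.cdist(coords1, coords2)                   # (N, N)
    pos = (dist < pos_thresh).float()

    # For each pixel in view 1, pull its positives and push all other pixels away.
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    loss = -(pos * log_prob).sum(dim=1) / pos.sum(dim=1).clamp(min=1)
    return loss.mean()
```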

PixPro introduces a more sophisticated pixel-to-propagation consistency task. It consists of two branches: a regular encoder branch and a branch with a pixel propagation module (PPM), which smooths each pixel's feature by propagating features from similar pixels within the same feature map. A consistency loss then encourages the propagated features from one view to agree with the plain encoder features of the other view at spatially corresponding pixels. Importantly, unlike PixContrast, PixPro relies solely on positive pairs, bypassing the complexities of handling negative pairs. Empirical evidence in the paper shows that PixPro significantly outperforms PixContrast, promoting spatial smoothness without sacrificing pixel-level sensitivity.
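The sketch below illustrates the two ideas described above: a pixel propagation module that replaces each pixel's feature with a similarity-weighted sum of transformed features from the other pixels, and a positive-pairs-only consistency loss between the propagated features of one view and the encoder features of the other. Layer sizes, default values, and all names here are illustrative assumptions; the official implementation is in the linked PixPro repository.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PixelPropagation(nn.Module):
    """Sketch of a pixel propagation module (illustrative, not the official code):
    each pixel's output is a similarity-weighted sum of transformed features from
    every pixel in the same feature map."""

    def __init__(self, dim=256, gamma=2.0):
        super().__init__()
        self.gamma = gamma                                  # sharpens the similarity weights
        # g(.): a small per-pixel transform; exact depth/width is an assumption.
        self.transform = nn.Sequential(
            nn.Conv2d(dim, dim, 1), nn.BatchNorm2d(dim), nn.ReLU(inplace=True),
            nn.Conv2d(dim, dim, 1),
        )

    def forward(self, x):                                   # x: (B, C, H, W)
        b, c, h, w = x.shape
        flat = F.normalize(x.flatten(2), dim=1)             # (B, C, HW)
        sim = torch.einsum('bci,bcj->bij', flat, flat)      # pairwise cosine similarity
        sim = sim.clamp(min=0) ** self.gamma                # keep only similar pixels
        g = self.transform(x).flatten(2)                    # (B, C, HW)
        out = torch.einsum('bij,bcj->bci', sim, g)          # propagate features
        return out.view(b, c, h, w)

def pixpro_consistency_loss(y1, z2, pos_mask):
    """Positive-pairs-only consistency: propagated features y1 (view 1) should match
    encoder features z2 (view 2) at spatially corresponding pixels given by pos_mask."""
    y1 = F.normalize(y1.flatten(2), dim=1)                  # (B, C, HW)
    z2 = F.normalize(z2.flatten(2), dim=1)
    cos = torch.einsum('bci,bcj->bij', y1, z2)              # (B, HW, HW)
    return -(cos * pos_mask).sum() / pos_mask.sum().clamp(min=1)
```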

Numerical Results

The proposed method, demonstrated with a ResNet-50 backbone, exhibits superior transfer performance on standard benchmarks for dense prediction tasks:

  • Pascal VOC (Faster R-CNN R50-C4): Achieved 60.2 AP, surpassing previous best methods by 2.6 AP.
  • COCO (Mask R-CNN R50-FPN / R50-C4): Reported 41.4 / 40.5 mAP, indicating improvements of 0.8 / 1.0 mAP over state-of-the-art methods.
  • Cityscapes Semantic Segmentation: Obtained 77.2 mIoU, a 1.0 mIoU increase compared to leading methods.

These results underscore the effectiveness of pixel-level pretext tasks in enhancing feature representations for downstream tasks that require spatially sensitive inference.

Implications and Future Directions

The introduction of pixel-level tasks in unsupervised learning shifts the paradigm towards more granular feature representation learning. The success of PixPro suggests several broader implications:

  1. Enhanced Feature Pre-training: Dense prediction tasks benefit immensely from spatial sensitivity and smoothness, suggesting a rethinking of pre-training strategies for related tasks.
  2. Network Alignment: The paper highlights the benefit of aligning pre-training architectures with those used in downstream tasks, including pre-training the head networks used for dense prediction, and reports gains in semi-supervised scenarios where labeled data is limited.
  3. Complementary Nature: The paper shows that combining pixel-level methods with instance-level tasks can yield comprehensive feature learning, maintaining categorization capabilities while enhancing spatial inference.

Potential future work could explore extending pixel-level learning to other modalities, such as video or multi-modal datasets, further enhancing the adaptability and robustness of unsupervised learning frameworks.

In summary, this paper offers a well-founded advancement in unsupervised learning, with rigorous numerical benchmarks validating the approach. The integration of pixel-level consistency highlights a promising direction for more adaptive and application-specific pre-training methodologies in computer vision.