- The paper presents a unified framework that integrates masked image modeling and contrastive learning to enhance both holistic and discriminative representations.
- It achieves state-of-the-art performance on ImageNet and other benchmarks with notable gains in accuracy and feature separability.
- The methodology employs innovative pixel shifting augmentation and a dual-branch architecture, enabling efficient feature alignment and faster convergence.
Contrastive Masked Autoencoders: Advancing Self-Supervised Vision Representations
Introduction
The paper "Contrastive Masked Autoencoders are Stronger Vision Learners" (2207.13532) introduces Contrastive Masked Autoencoders (CMAE), a self-supervised learning framework that unifies masked image modeling (MIM) and contrastive learning (CL) to address limitations of current vision representation learning. MIM, exemplified by methods such as MAE and SimMIM, excels at learning holistic, context-sensitive representations but often exhibits suboptimal discriminative power across instances. In contrast, CL methods are effective at producing discriminative features but frequently lack spatial sensitivity. Existing efforts to combine these paradigms have resulted in marginal gains, largely due to incompatibilities in augmentation strategies, model architectures, and loss formulations.
This work proposes an integrated framework that carefully resolves these incompatibilities. CMAE demonstrates clear empirical superiority across major vision benchmarks, achieving state-of-the-art performance in classification, segmentation, and detection, and providing insights on representation transferability, model scaling, and the internal feature structure induced by joint contrastive and reconstruction objectives.
Methodology
Joint Contrastive and Reconstruction Pretraining
CMAE consists of a dual-branch pretraining structure:
- Online Branch: An asymmetric encoder-decoder, with the encoder operating on a randomly masked input (as in MIM) and feeding both a pixel reconstruction decoder and a feature decoder.
- Momentum Branch: A momentum-updated encoder, consuming an augmented, unmasked version of the same image to provide a stable contrastive target.
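A minimal PyTorch-style sketch of this dual-branch layout follows. The module names, sizes, and the use of nn.TransformerEncoder in place of the paper's ViT blocks are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the CMAE dual-branch layout described above (illustrative only;
# module sizes and nn.TransformerEncoder stand in for the paper's ViT blocks).
import copy
import torch.nn as nn

class CMAESketch(nn.Module):
    def __init__(self, dim=768, depth=12, dec_dim=512, proj_dim=256, patch_pixels=16 * 16 * 3):
        super().__init__()
        layer = lambda d, h: nn.TransformerEncoderLayer(d, nhead=h, batch_first=True)
        # Online branch: encoder runs only on the visible (unmasked) tokens.
        self.online_encoder = nn.TransformerEncoder(layer(dim, 12), num_layers=depth)
        # Pixel decoder reconstructs the raw pixels of masked patches (MIM objective).
        self.pixel_decoder = nn.Sequential(
            nn.Linear(dim, dec_dim),
            nn.TransformerEncoder(layer(dec_dim, 8), num_layers=8),
            nn.Linear(dec_dim, patch_pixels),
        )
        # Feature decoder recovers features for the masked positions so that a global
        # representation of the full image can be contrasted against the target branch.
        self.feature_decoder = nn.Sequential(
            nn.Linear(dim, dec_dim),
            nn.TransformerEncoder(layer(dec_dim, 8), num_layers=2),
        )
        self.online_proj = nn.Linear(dec_dim, proj_dim)
        # Momentum branch: EMA copy of the online encoder; it consumes the full,
        # unmasked view and is never updated by gradients.
        self.momentum_encoder = copy.deepcopy(self.online_encoder)
        self.momentum_proj = nn.Linear(dim, proj_dim)
        for p in list(self.momentum_encoder.parameters()) + list(self.momentum_proj.parameters()):
            p.requires_grad = False
```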
The key design choices are:
- Feature Decoder: Provides recoverable features for masked tokens in the online branch, reducing the semantic gap between masked and unmasked representations, and enabling meaningful contrastive learning at the global feature level.
- Augmentation Strategy: Rather than the strong augmentations standard in contrastive frameworks (e.g., large random crops), CMAE employs pixel shifting (sketched after this list), a controlled spatial translation over a shared master crop, to ensure that positive contrastive pairs retain sufficient semantic overlap even after aggressive token masking.
- Loss Functions: Utilizes MSE loss for pixel reconstruction over masked patches and InfoNCE loss between projected global representations from the online feature decoder and the momentum encoder. The overall objective is a weighted sum of these losses.
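Spelled out in the notation used here (the symbols are ours, not necessarily the paper's), the objective takes the form:

```latex
% Overall CMAE pretraining objective: masked-patch reconstruction plus a weighted InfoNCE term.
\mathcal{L} = \mathcal{L}_{\mathrm{rec}} + \lambda \, \mathcal{L}_{\mathrm{con}}, \qquad
\mathcal{L}_{\mathrm{rec}} = \frac{1}{|\mathcal{M}|} \sum_{i \in \mathcal{M}} \lVert \hat{x}_i - x_i \rVert_2^2, \qquad
\mathcal{L}_{\mathrm{con}} = -\log \frac{\exp\!\big(\mathrm{sim}(z^{o}, z^{t}_{+}) / \tau\big)}
                                         {\sum_{j} \exp\!\big(\mathrm{sim}(z^{o}, z^{t}_{j}) / \tau\big)}
```

Here M is the set of masked patches, x_i and x̂_i are the original and reconstructed patch pixels, z^o is the projected online representation, z^t_j are momentum-branch projections (with z^t_+ the positive from the same image), sim is cosine similarity, τ is a temperature, and λ weights the contrastive term.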
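The pixel-shifting view generation described in the list above can likewise be illustrated in a few lines. The master-crop size, maximum shift, and the helper name pixel_shift_views are assumptions for illustration, not the authors' exact recipe.

```python
# Illustrative re-creation of pixel-shifting view generation (not the authors' code):
# both views come from one shared master crop and differ only by a small translation,
# so they stay semantically aligned even after heavy token masking.
import random
from PIL import Image
from torchvision import transforms
import torchvision.transforms.functional as TF

def pixel_shift_views(img: Image.Image, out_size: int = 224, max_shift: int = 31):
    # Master crop slightly larger than the target so a shifted window still fits inside it.
    master = transforms.RandomResizedCrop(out_size + max_shift)(img)
    # Online view: window anchored at the origin of the master crop (masked later).
    online_view = TF.crop(master, top=0, left=0, height=out_size, width=out_size)
    # Target view: same-sized window offset by a few pixels, fed to the momentum encoder.
    dy, dx = random.randint(0, max_shift), random.randint(0, max_shift)
    target_view = TF.crop(master, top=dy, left=dx, height=out_size, width=out_size)
    to_tensor = transforms.ToTensor()
    return to_tensor(online_view), to_tensor(target_view)
```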
Architectural and Training Details
- The encoders are based on ViT (Vision Transformer); the hybrid variant employs convolutional stem tokenization.
- The momentum encoder parameters are updated via an exponential moving average of the online encoder weights (see the sketch after this list).
- Only the online encoder is retained for downstream finetuning; the momentum branch is discarded post-pretraining.
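The momentum update itself is a simple parameter interpolation. A minimal sketch follows; the decay value 0.996 is an assumed setting, not necessarily the one used in the paper.

```python
# Momentum (EMA) update of the target branch parameters.
import torch

@torch.no_grad()
def momentum_update(online: torch.nn.Module, momentum: torch.nn.Module, m: float = 0.996):
    # theta_momentum <- m * theta_momentum + (1 - m) * theta_online
    for p_o, p_m in zip(online.parameters(), momentum.parameters()):
        p_m.mul_(m).add_(p_o.detach(), alpha=1.0 - m)
```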
Experimental Results
ImageNet and Beyond
CMAE achieves 84.7% top-1 accuracy on ImageNet-1K with ViT-B at 1600 epochs, outperforming state-of-the-art MIM (MAE, SimMIM) and contrastive (MoCo-v3, DINO) baselines by substantial margins. Incorporating convolutional stems further increases accuracy to 85.3%, a 0.7% absolute gain over previous bests.
In semantic segmentation on ADE20K, CMAE reaches 52.5 mIoU, surpassing previous MIM and hybrid approaches by up to 1.8%. On COCO detection and instance segmentation, CMAE delivers consistent AP improvements (up to +0.4 box AP and +0.5 mask AP over ConvMAE/MAE). Transfer experiments on iNaturalist and Places also show systematic accuracy gains of 1–1.7% over MAE.
Ablation and Analysis
- Contrastive Augmentation: Pixel shifting augmentation consistently outperforms standard augmentation, validating the need for maintaining view alignment under heavy input corruption from masking.
- Feature Decoder: Adding a lightweight, non-shared feature decoder is crucial; even shallow decoders yield notable gains in downstream accuracy and representation compactness.
- Loss Balancing: InfoNCE loss complementing the pixel prediction loss delivers optimal results when balanced; excessive weight on the contrastive objective degrades representation quality.
- Input to Momentum Encoder: Full images yield the strongest contrastive supervision; masking in this branch yields worse performance due to semantic incompleteness.
- Efficiency and Transfer: CMAE representations converge faster during downstream finetuning and achieve higher accuracy under both linear probing and partial finetuning, indicating robust feature separability.
Quantitative feature analysis demonstrates that CMAE induces lower intra-class dispersion and higher inter-class separation in feature space compared to MAE, substantiating the hypothesized improvements in discriminative capacity.
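One simple way to quantify such structure, shown below as a sketch rather than the paper's exact analysis protocol, is to compare the mean distance of features to their class centroid (intra-class dispersion) with the mean pairwise distance between class centroids (inter-class separation).

```python
# Illustrative feature-structure metrics on L2-normalized features (assumed protocol,
# not necessarily the one used in the paper's analysis).
import torch

def dispersion_and_separation(features: torch.Tensor, labels: torch.Tensor):
    feats = torch.nn.functional.normalize(features, dim=1)
    classes = labels.unique()
    # Class centroids in the normalized feature space.
    centroids = torch.stack([feats[labels == c].mean(dim=0) for c in classes])
    # Intra-class dispersion: mean distance of samples to their own class centroid.
    intra = torch.stack([
        (feats[labels == c] - centroids[i]).norm(dim=1).mean()
        for i, c in enumerate(classes)
    ]).mean()
    # Inter-class separation: mean pairwise distance between class centroids.
    inter = torch.pdist(centroids).mean()
    return intra.item(), inter.item()
```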
Model Scaling
CMAE exhibits monotonic gains from model size scaling (ViT-S/B/L), matching expected trends from MIM and contrastive literature and highlighting its extensibility.
Theoretical and Practical Implications
CMAE's methodologically rigorous integration of contrastive and reconstruction paradigms demonstrates that instance discrimination and spatial/holistic feature learning can be co-optimized without negative interference, provided that view alignment and semantic correspondence are maintained through architectural and augmentation innovations.
Practically, this approach improves representation learning for a wide variety of vision backbones (pure ViT and hybrid), requiring only minor architectural additions and no reliance on external tokenizers. The method's robustness with respect to pretraining duration and downstream adaptation protocols makes it an attractive candidate for standardization in self-supervised visual pretraining pipelines.
On the theoretical front, the success of designing a feature-aligned decoder and semantically calibrated augmentations informs future efforts in multiview SSL, suggesting that explicit bridging of task-specific semantic gaps may be necessary for further gains.
Future Directions
Areas for future exploration include scaling CMAE to longer pretraining schedules and larger datasets (e.g., billion-scale web data), and extending the architecture to incorporate cross-modal contrastive views (e.g., dense image-caption pairs) for joint vision-language modeling. Furthermore, investigations into integrating other forms of positive/negative pair mining, or using more elaborate decoders (e.g., hierarchical or multi-resolution), present natural extensions.
Conclusion
Contrastive Masked Autoencoders set a new high-water mark for self-supervised vision pretraining by unifying discriminative and reconstructive learning within a refined augmentation and architectural scheme. This work demonstrates not only measurable gains across standard vision benchmarks but also provides empirical support for principled combinations of MIM and CL. CMAE represents a substantial advance for both the theoretical understanding and practical realization of scalable, effective visual representation learning.