
Contrastive Masked Autoencoders are Stronger Vision Learners

Published 27 Jul 2022 in cs.CV (arXiv:2207.13532v3)

Abstract: Masked image modeling (MIM) has achieved promising results on various vision tasks. However, the limited discriminability of the learned representations indicates there is still room for building a stronger vision learner. Towards this goal, we propose Contrastive Masked Autoencoders (CMAE), a new self-supervised pre-training method for learning more comprehensive and capable vision representations. By carefully unifying contrastive learning (CL) and masked image modeling (MIM) through novel designs, CMAE leverages their respective advantages and learns representations with both strong instance discriminability and local perceptibility. Specifically, CMAE consists of two branches: the online branch is an asymmetric encoder-decoder, and the momentum branch is a momentum-updated encoder. During training, the online encoder reconstructs original images from latent representations of masked images to learn holistic features. The momentum encoder, fed with the full images, enhances the feature discriminability via contrastive learning with its online counterpart. To make CL compatible with MIM, CMAE introduces two new components: pixel shifting for generating plausible positive views and a feature decoder for complementing the features of contrastive pairs. Thanks to these novel designs, CMAE effectively improves representation quality and transfer performance over its MIM counterpart. CMAE achieves state-of-the-art performance on the highly competitive benchmarks of image classification, semantic segmentation and object detection. Notably, CMAE-Base achieves $85.3\%$ top-1 accuracy on ImageNet and $52.5\%$ mIoU on ADE20k, surpassing previous best results by $0.7\%$ and $1.8\%$ respectively. The source code is publicly accessible at \url{https://github.com/ZhichengHuang/CMAE}.

Citations (128)

Summary

  • The paper presents a unified framework that integrates masked image modeling and contrastive learning to enhance both holistic and discriminative representations.
  • It achieves state-of-the-art performance on ImageNet and other benchmarks with notable gains in accuracy and feature separability.
  • The methodology employs innovative pixel shifting augmentation and a dual-branch architecture, enabling efficient feature alignment and faster convergence.

Contrastive Masked Autoencoders: Advancing Self-Supervised Vision Representations

Introduction

The paper "Contrastive Masked Autoencoders are Stronger Vision Learners" (2207.13532) introduces Contrastive Masked Autoencoders (CMAE), a self-supervised learning framework that unifies masked image modeling (MIM) and contrastive learning (CL) to address limitations of current vision representation learning. MIM, exemplified by methods such as MAE and SimMIM, excels at learning holistic, context-sensitive representations but often exhibits suboptimal discriminative power across instances. In contrast, CL methods are effective at producing discriminative features but frequently lack spatial sensitivity. Existing efforts to combine these paradigms have resulted in marginal gains, largely due to incompatibilities in augmentation strategies, model architectures, and loss formulations.

This work proposes an integrated framework that carefully resolves these incompatibilities. CMAE demonstrates clear empirical superiority across major vision benchmarks, achieving state-of-the-art performance in classification, segmentation, and detection, and providing insights on representation transferability, model scaling, and the internal feature structure induced by joint contrastive and reconstruction objectives.

Methodology

Joint Contrastive and Reconstruction Pretraining

CMAE consists of a dual-branch pretraining structure:

  • Online Branch: An asymmetric encoder-decoder, with the encoder operating on a randomly masked input (as in MIM) and feeding both a pixel reconstruction decoder and a feature decoder.
  • Momentum Branch: A momentum-updated encoder, consuming an augmented, unmasked version of the same image to provide a stable contrastive target.
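
The routing between the two branches can be sketched as follows. This is a minimal NumPy outline with stand-in callables for the encoders and decoders; names such as `random_mask` and `training_step` are illustrative, not from the released code:

```python
import numpy as np

def random_mask(num_patches, mask_ratio=0.75, rng=None):
    # MAE-style random masking: keep a small subset of patch indices,
    # mask out the rest.
    rng = rng or np.random.default_rng(0)
    num_keep = int(num_patches * (1 - mask_ratio))
    perm = rng.permutation(num_patches)
    return np.sort(perm[:num_keep]), np.sort(perm[num_keep:])

def training_step(online_patches, momentum_patches, encoders, decoders):
    # Online branch: encode only the visible patches, then decode both
    # pixels (reconstruction target) and features (contrastive input).
    keep_idx, masked_idx = random_mask(online_patches.shape[0])
    latent = encoders["online"](online_patches[keep_idx])
    pixel_pred = decoders["pixel"](latent)
    online_feat = decoders["feature"](latent)
    # Momentum branch: the full (unmasked) view provides a stable
    # contrastive target; in practice no gradients flow through it.
    target_feat = encoders["momentum"](momentum_patches)
    return pixel_pred, online_feat, target_feat, masked_idx
```

In the actual model the stand-ins would be a ViT encoder, a transformer pixel decoder, and the lightweight feature decoder described below; the sketch only shows how data is routed through the two branches.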

The key design choices are:

  • Feature Decoder: Provides recoverable features for masked tokens in the online branch, reducing the semantic gap between masked and unmasked representations, and enabling meaningful contrastive learning at the global feature level.
  • Augmentation Strategy: Rather than standard strong augmentations used in contrastive frameworks (e.g., large random crops), CMAE employs pixel shifting—a controlled spatial translation over a shared master crop—to ensure that positive contrastive pairs maintain sufficient semantic overlap even after aggressive token masking.
  • Loss Functions: Utilizes MSE loss for pixel reconstruction over masked patches and InfoNCE loss between projected global representations from the online feature decoder and the momentum encoder. The overall objective is a weighted sum of these losses.
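
The pixel-shifting idea above can be sketched as follows: both views come from one shared master region, with the second view's crop window translated by a few pixels. The function name, default view size, and shift range here are illustrative, not the paper's exact settings:

```python
import numpy as np

def pixel_shift_views(image, view_size=224, max_shift=31, rng=None):
    # Pick one master crop origin, then derive the second view by
    # translating the crop window a few pixels, so the two positive
    # views share almost all of their content (unlike two independent
    # random crops).
    rng = rng or np.random.default_rng(0)
    h, w = image.shape[:2]
    y0 = rng.integers(0, h - view_size - max_shift + 1)
    x0 = rng.integers(0, w - view_size - max_shift + 1)
    online_view = image[y0:y0 + view_size, x0:x0 + view_size]
    dy, dx = rng.integers(0, max_shift + 1, size=2)
    target_view = image[y0 + dy:y0 + dy + view_size,
                        x0 + dx:x0 + dx + view_size]
    return online_view, target_view
```

Because the shift is small relative to the view size, the two views remain semantically aligned even after 75% of the online view's tokens are masked.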

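The overall objective is a weighted sum L = L_rec + λ·L_con. A NumPy sketch of the two terms follows; the temperature and weight λ are illustrative defaults, not the paper's tuned values:

```python
import numpy as np

def mse_reconstruction_loss(pred, target, mask):
    # Mean squared error computed only over masked patches (mask == 1),
    # as in MAE.
    per_patch = ((pred - target) ** 2).mean(axis=-1)
    return (per_patch * mask).sum() / mask.sum()

def info_nce_loss(online, momentum, temperature=0.07):
    # L2-normalize projected features, then contrast each online feature
    # against all momentum features in the batch; positives lie on the
    # diagonal.
    z1 = online / np.linalg.norm(online, axis=-1, keepdims=True)
    z2 = momentum / np.linalg.norm(momentum, axis=-1, keepdims=True)
    logits = z1 @ z2.T / temperature
    logits -= logits.max(axis=-1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -np.diag(log_prob).mean()

def cmae_loss(pred, target, mask, online_feat, momentum_feat, lam=0.1):
    # Weighted sum of the reconstruction and contrastive objectives.
    return (mse_reconstruction_loss(pred, target, mask)
            + lam * info_nce_loss(online_feat, momentum_feat))
```
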
Architectural and Training Details

  • The encoders are based on ViT (Vision Transformer); the hybrid variant employs convolutional stem tokenization.
  • The momentum encoder parameters are updated via exponential moving average.
  • Only the online encoder is retained for downstream finetuning; the momentum branch is discarded post-pretraining.
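
The exponential moving average update is a one-liner; the momentum coefficient `m=0.996` below is a typical value for MoCo-style methods, used here only for illustration:

```python
import numpy as np

def momentum_update(online_params, momentum_params, m=0.996):
    # EMA: the momentum encoder is a slowly trailing copy of the online
    # encoder and is never updated by gradient descent.
    return [m * p_m + (1.0 - m) * p_o
            for p_o, p_m in zip(online_params, momentum_params)]
```
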

Experimental Results

ImageNet and Beyond

CMAE achieves 84.7% top-1 accuracy on ImageNet-1K with ViT-B at 1600 epochs, outperforming state-of-the-art MIM (MAE, SimMIM) and contrastive (MoCo-v3, DINO) baselines by substantial margins. Incorporating convolutional stems further increases accuracy to 85.3%, a 0.7% absolute gain over previous bests.

In semantic segmentation on ADE20K, CMAE reaches 52.5% mIoU, outstripping previous MIM and hybrid approaches by up to 1.8%. On COCO detection/segmentation tasks, CMAE delivers consistent AP improvements (up to +0.4 box AP and +0.5 mask AP over ConvMAE/MAE). Transfer experiments on iNaturalist and Places also demonstrate systematic accuracy improvements of 1-1.7% over MAE.

Ablation and Analysis

  • Contrastive Augmentation: Pixel shifting augmentation consistently outperforms standard augmentation, validating the need for maintaining view alignment under heavy input corruption from masking.
  • Feature Decoder: Adding a lightweight, non-shared feature decoder is crucial; even shallow decoders yield notable gains in downstream accuracy and representation compactness.
  • Loss Balancing: InfoNCE loss complementing the pixel prediction loss delivers optimal results when balanced; excessive weight on the contrastive objective degrades representation quality.
  • Input to Momentum Encoder: Full images yield the strongest contrastive supervision; masking in this branch yields worse performance due to semantic incompleteness.
  • Efficiency and Transfer: CMAE representations converge faster in downstream finetuning and yield higher linear and partial finetuning probe accuracy, indicating robust feature separability.

Quantitative feature analysis demonstrates that CMAE induces lower intra-class dispersion and higher inter-class separation in feature space compared to MAE, substantiating the hypothesized improvements in discriminative capacity.
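
Intra-class dispersion and inter-class separation can be measured with centroid-based statistics like the sketch below (one common definition, assumed here; the paper's exact metric may differ):

```python
import numpy as np

def dispersion_stats(features, labels):
    # Intra-class dispersion: mean distance of samples to their class
    # centroid. Inter-class separation: mean pairwise distance between
    # class centroids. Lower intra and higher inter indicate more
    # discriminative features.
    classes = np.unique(labels)
    centroids = np.stack([features[labels == c].mean(axis=0)
                          for c in classes])
    intra = np.mean([
        np.linalg.norm(features[labels == c] - centroids[i], axis=1).mean()
        for i, c in enumerate(classes)
    ])
    diffs = centroids[:, None, :] - centroids[None, :, :]
    pair_dists = np.linalg.norm(diffs, axis=-1)
    iu = np.triu_indices(len(classes), k=1)
    inter = pair_dists[iu].mean()
    return intra, inter
```
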

Model Scaling

CMAE exhibits monotonic gains from model size scaling (ViT-S/B/L), matching expected trends from MIM and contrastive literature and highlighting its extensibility.

Theoretical and Practical Implications

CMAE's methodologically rigorous integration of contrastive and reconstruction paradigms demonstrates that instance discrimination and spatial/holistic feature learning can be co-optimized without negative interference, provided that view alignment and semantic correspondence are maintained through architectural and augmentation innovations.

Practically, this approach improves representation learning for a wide variety of vision backbones (pure ViT and hybrid), requiring only minor architectural additions and no reliance on external tokenizers. The method's robustness with respect to pretraining duration and downstream adaptation protocols makes it an attractive candidate for standardization in self-supervised visual pretraining pipelines.

On the theoretical front, the success of designing a feature-aligned decoder and semantically calibrated augmentations informs future efforts in multiview SSL, suggesting that explicit bridging of task-specific semantic gaps may be necessary for further gains.

Future Directions

Areas for future exploration include scaling CMAE to longer pretraining schedules and larger datasets (e.g., billion-scale web data), and extending the architecture to incorporate cross-modal contrastive views (e.g., dense image-caption pairs) for joint vision-language modeling. Furthermore, investigations into integrating other forms of positive/negative pair mining, or using more elaborate decoders (e.g., hierarchical or multi-resolution), present natural extensions.

Conclusion

Contrastive Masked Autoencoders set a new high-water mark for self-supervised vision pretraining by unifying discriminative and reconstructive learning within a refined augmentation and architectural scheme. This work demonstrates not only measurable gains across standard vision benchmarks but also provides empirical support for principled combinations of MIM and CL. CMAE represents a substantial advance for both the theoretical understanding and practical realization of scalable, effective visual representation learning.
