
Understanding Masked Autoencoders From a Local Contrastive Perspective

Published 3 Oct 2023 in cs.CV (arXiv:2310.01994v2)

Abstract: Masked AutoEncoder (MAE) has revolutionized the field of self-supervised learning with its simple yet effective masking and reconstruction strategies. However, despite achieving state-of-the-art performance across various downstream vision tasks, the underlying mechanisms that drive MAE's efficacy are less well explored compared to the canonical contrastive learning paradigm. In this paper, we first propose a local perspective to explicitly extract a local contrastive form from MAE's reconstructive objective at the patch level. We then introduce a new empirical framework, called Local Contrastive MAE (LC-MAE), to analyze both the reconstructive and contrastive aspects of MAE. LC-MAE reveals that MAE learns invariance to random masking and ensures distribution consistency between the learned token embeddings and the original images. Furthermore, we dissect the contribution of the decoder and random masking to MAE's success, revealing both the decoder's learning mechanism and the dual role of random masking as data augmentation and effective receptive field restriction. Our experimental analysis sheds light on the intricacies of MAE and summarizes some useful design methodologies, which can inspire more powerful visual self-supervised methods.


Summary

  • The paper introduces LC-MAE, a framework that decomposes MAE training into reconstruction, cross-view, and in-view contrastive losses.
  • It demonstrates that decoder depth and random masking significantly influence semantic representation learning and downstream task performance.
  • Ablation studies show that the contrastive objectives alone, even without reconstruction, still support high-fidelity feature learning in MAE.


Introduction

The paper "Understanding Masked Autoencoders From a Local Contrastive Perspective" (2310.01994) provides a detailed analysis of the Masked AutoEncoder (MAE), a self-supervised learning (SSL) method that has significantly impacted computer vision by achieving state-of-the-art performance across a range of vision tasks. Despite this success, the mechanisms underlying MAE's effectiveness remain less well understood than those of contrastive learning. The authors address this gap by examining MAE from a local contrastive perspective and introducing a novel framework, Local Contrastive MAE (LC-MAE), to systematically explore both the reconstructive and contrastive aspects of MAE.

Local Contrastive Framework for MAE

The authors propose an empirical framework, LC-MAE, to analyze MAE's mechanics by reformulating its reconstructive training objective so that an equivalent local contrastive form is made explicit at the image patch level. Under this reformulation, MAE's training decomposes into three explicit loss components: a reconstruction loss, a cross-view contrastive loss, and an in-view contrastive loss. This decomposition allows a more granular investigation into how MAE learns semantic representations and ensures distribution consistency between the learned token embeddings and the original images.
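
Schematically, the decomposed objective can be written as follows, where the loss names follow the paper but the weighting coefficients are illustrative placeholders rather than values reported by the authors:

L_LC-MAE = L_rec + λ_cv · L_cross-view + λ_in · L_in-view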

The cross-view loss promotes similarity between token features of different masked views of the same image, fostering invariance to random masking. The in-view loss keeps the output distributions consistent with the input image patches, which prevents feature collapse, a crucial requirement for effective representation learning in MAE.
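
To make the decomposition concrete, below is a minimal PyTorch sketch of the two contrastive terms at the patch level. The InfoNCE form of the cross-view term, the normalized-MSE stand-in for the in-view term, the function names, and the temperature value are all illustrative assumptions, not the paper's exact formulation.

```python
# Illustrative patch-level contrastive losses in the spirit of LC-MAE.
# The exact loss forms and hyperparameters in the paper may differ.
import torch
import torch.nn.functional as F

def cross_view_loss(tokens_a, tokens_b, temperature=0.1):
    """InfoNCE-style loss pulling together embeddings of the SAME patch seen
    under two different random masks. tokens_a, tokens_b: (N, D) embeddings
    of N corresponding patches from views a and b."""
    a = F.normalize(tokens_a, dim=-1)
    b = F.normalize(tokens_b, dim=-1)
    logits = a @ b.t() / temperature                    # (N, N) similarities
    targets = torch.arange(a.size(0), device=a.device)  # positives on diagonal
    return F.cross_entropy(logits, targets)

def in_view_loss(pred_patches, input_patches):
    """Consistency between predicted tokens and the raw input patches within a
    single view, discouraging feature collapse. A normalized-MSE stand-in for
    the paper's in-view formulation."""
    pred = F.normalize(pred_patches, dim=-1)
    target = F.normalize(input_patches, dim=-1)
    return (pred - target).pow(2).sum(dim=-1).mean()
```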

Decoder and Masking Contributions

The paper dissects the respective contributions of the decoder and of random masking to MAE's success. An in-depth analysis reveals that the decoder relies primarily on positional information in its shallow layers and gradually shifts to semantic information in deeper layers. This transition underscores the importance of a sufficiently deep decoder for acquiring rich semantic representations.
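
One way to probe this positional-to-semantic transition is to ablate the semantic content of the decoder's input while keeping positional embeddings fixed, then track how much each layer's activations change. The sketch below is a hypothetical probe under that assumption; `decoder_blocks`, `mask_tokens`, and `pos_embed` are illustrative handles to a pretrained MAE decoder, and this is not the paper's exact measurement protocol.

```python
# Hypothetical probe: how strongly does each decoder layer depend on
# positional (vs. semantic) information?
import torch
import torch.nn.functional as F

@torch.no_grad()
def layerwise_positional_reliance(decoder_blocks, latent, mask_tokens, pos_embed):
    """decoder_blocks: list of transformer blocks; latent: (B, N, D) decoder
    input features; mask_tokens: learned (1, 1, D) mask token; pos_embed:
    (1, N, D) positional embeddings. Returns per-layer cosine similarity
    between activations with and without semantic content; high similarity
    suggests the layer is driven mostly by positional information."""
    full = latent + pos_embed                             # semantic + positional
    pos_only = mask_tokens.expand_as(latent) + pos_embed  # positional only
    sims = []
    for block in decoder_blocks:
        full, pos_only = block(full), block(pos_only)
        sims.append(F.cosine_similarity(full, pos_only, dim=-1).mean().item())
    return sims  # expected, per the paper's finding, to decay with depth
```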

Random masking, a central component of MAE, serves as both a data augmentation technique and a mechanism to restrict the Vision Transformer's effective receptive field. Empirical investigations show that this restriction is vital for enhancing downstream task performance, as it controls the extent of locality considered during MAE's pretraining.
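
To illustrate masking acting as data augmentation, the sketch below generates two independently masked views of the same image using MAE-style random patch masking. The 75% mask ratio is MAE's default; pairing two views for the cross-view loss is our illustrative framing rather than code from the paper.

```python
# Two independent random patch maskings of one image, MAE-style.
import torch

def random_masked_views(num_patches, mask_ratio=0.75, device="cpu"):
    """Return boolean keep-masks for two independent random maskings. Each
    view keeps a different random subset of patches, so the same patch is
    predicted from different visible contexts."""
    num_keep = int(num_patches * (1 - mask_ratio))
    def one_view():
        noise = torch.rand(num_patches, device=device)  # random score per patch
        keep_idx = noise.argsort()[:num_keep]           # keep lowest-scored patches
        keep = torch.zeros(num_patches, dtype=torch.bool, device=device)
        keep[keep_idx] = True
        return keep
    return one_view(), one_view()

keep_a, keep_b = random_masked_views(num_patches=196)  # 14x14 patches, 224px ViT
```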

Experimental Results and Analysis

LC-MAE retains MAE's strong performance on downstream tasks while offering deeper insight into MAE's workings. The experiments validate the importance of random masking and demonstrate that an appropriately restricted effective receptive field, achievable through either parametric or non-parametric decoder designs, is a key factor in finetuning success. Furthermore, the ablation studies confirm that both the reconstruction and cross-view contrastive losses contribute significantly to finetuning performance.

Additionally, the empirical results reveal that purely contrastive learning, even without the assistance of reconstruction, retains substantial potential for high-fidelity feature learning, further supporting the view that implicit contrastive learning takes place within MAE.

Conclusion

This paper delivers a comprehensive examination of MAE through a local contrastive lens, offering insights that unify the understanding of the reconstructive and contrastive learning paradigms in SSL for visual representation learning. The elucidation of key design principles, such as decoder depth and appropriate receptive field size, provides a foundation for designing more effective self-supervised methods. The findings aim to inspire future research in SSL, potentially leading to more powerful and unified frameworks that exploit the strengths of both generative and discriminative learning in visual tasks.
