Masked Autoencoders Are Scalable Vision Learners (2111.06377v3)

Published 11 Nov 2021 in cs.CV

Abstract: This paper shows that masked autoencoders (MAE) are scalable self-supervised learners for computer vision. Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels. It is based on two core designs. First, we develop an asymmetric encoder-decoder architecture, with an encoder that operates only on the visible subset of patches (without mask tokens), along with a lightweight decoder that reconstructs the original image from the latent representation and mask tokens. Second, we find that masking a high proportion of the input image, e.g., 75%, yields a nontrivial and meaningful self-supervisory task. Coupling these two designs enables us to train large models efficiently and effectively: we accelerate training (by 3x or more) and improve accuracy. Our scalable approach allows for learning high-capacity models that generalize well: e.g., a vanilla ViT-Huge model achieves the best accuracy (87.8%) among methods that use only ImageNet-1K data. Transfer performance in downstream tasks outperforms supervised pre-training and shows promising scaling behavior.

Exploring the Efficiency and Effectiveness of Masked Autoencoders in Vision

Introduction

Recent advances in self-supervised learning have produced powerful and efficient learning methods, particularly in NLP. Inspired by these successes, there is growing interest in adapting self-supervised frameworks to computer vision. In this context, Masked Autoencoders (MAEs) offer a promising direction, as detailed in this paper by Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick of Facebook AI Research (FAIR).

MAE Architecture

The crux of the MAE methodology lies in its simplicity and effectiveness. A high proportion of the input image's patches (75% in the default setting) is masked out, and the model is tasked with reconstructing the missing pixels. This setup hinges on an asymmetric encoder-decoder architecture: the encoder processes only the visible patches, while a lightweight decoder reconstructs the original image from the encoded representations together with learned mask tokens. Because the full-size encoder never operates on mask tokens, this design accelerates training substantially (by 3x or more), reduces memory consumption, and makes scaling to larger models practical, as the sketch below illustrates.
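To ground this description, here is a minimal PyTorch sketch of the core mechanism: per-sample random masking via an argsort shuffle, an encoder that runs only on the visible ~25% of patch tokens, and a small decoder that inserts learned mask tokens before predicting pixels. The module and dimension choices (`TinyMAE`, `random_masking`, the layer counts) are illustrative assumptions rather than the authors' released implementation, and positional embeddings are omitted for brevity.

```python
import torch
import torch.nn as nn


def random_masking(patches: torch.Tensor, mask_ratio: float = 0.75):
    """Keep a random subset of patches per sample via an argsort shuffle.

    patches: (B, N, D) sequence of patch embeddings.
    Returns the kept patches, a binary mask in original patch order
    (1 = masked/removed), and the permutation that restores that order.
    """
    B, N, D = patches.shape
    n_keep = int(N * (1 - mask_ratio))

    noise = torch.rand(B, N, device=patches.device)   # per-patch random scores
    ids_shuffle = torch.argsort(noise, dim=1)         # random permutation
    ids_restore = torch.argsort(ids_shuffle, dim=1)   # inverse permutation

    ids_keep = ids_shuffle[:, :n_keep]
    visible = torch.gather(patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))

    mask = torch.ones(B, N, device=patches.device)    # 1 = masked
    mask[:, :n_keep] = 0
    mask = torch.gather(mask, 1, ids_restore)         # back to input order
    return visible, mask, ids_restore


class TinyMAE(nn.Module):
    """Asymmetric encoder-decoder: the encoder sees only visible patches."""

    def __init__(self, dim=256, dec_dim=128, patch_pixels=768):
        super().__init__()
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True),
            num_layers=4)
        self.enc_to_dec = nn.Linear(dim, dec_dim)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dec_dim))
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dec_dim, nhead=4, batch_first=True),
            num_layers=2)
        self.to_pixels = nn.Linear(dec_dim, patch_pixels)

    def forward(self, patches, target_pixels, mask_ratio=0.75):
        visible, mask, ids_restore = random_masking(patches, mask_ratio)
        latent = self.encoder(visible)                # only ~25% of the tokens

        # Decoder input: projected latents plus mask tokens, un-shuffled
        # back into the original patch order.
        x = self.enc_to_dec(latent)
        B, N = ids_restore.shape
        mask_tokens = self.mask_token.expand(B, N - x.shape[1], -1)
        x = torch.cat([x, mask_tokens], dim=1)
        x = torch.gather(x, 1, ids_restore.unsqueeze(-1).expand(-1, -1, x.shape[-1]))

        pred = self.to_pixels(self.decoder(x))        # per-patch pixel predictions

        # MSE computed on the masked patches only, as in the paper's objective.
        loss = ((pred - target_pixels) ** 2).mean(dim=-1)
        return (loss * mask).sum() / mask.sum()


# Toy usage: 196 patches (14x14 grid), 768 = 16x16x3 raw pixels per patch.
model = TinyMAE()
loss = model(torch.randn(2, 196, 256), torch.randn(2, 196, 768))
```

The efficiency lever is visible in the forward pass: the large encoder's sequence length shrinks in proportion to the mask ratio, so its compute and memory drop accordingly, while the decoder, which does see the full-length sequence, is deliberately kept small.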

Empirical Validation

The effectiveness of MAEs is supported by strong empirical evidence. On ImageNet-1K, a vanilla ViT-Huge model pre-trained with MAE achieves 87.8% top-1 accuracy after fine-tuning, the best result among methods that use only ImageNet-1K data. This highlights the ability of MAE pre-training to scale to high-capacity models without additional data or elaborate, compute-intensive training schemes.
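For concreteness, here is a hedged sketch of the usual transfer recipe, continuing the `TinyMAE` example above; the checkpoint path and the 1000-way head are hypothetical, not the authors' fine-tuning code. The decoder is discarded after pre-training, and the encoder is fine-tuned without any masking.

```python
# Continuing the TinyMAE sketch: downstream recognition keeps only the encoder.
mae = TinyMAE()
# mae.load_state_dict(torch.load("mae_pretrained.pt"))  # hypothetical checkpoint

encoder = mae.encoder                   # decoder was needed only for pre-training
head = nn.Linear(256, 1000)             # e.g., a 1000-way ImageNet-1K classifier

def classify(patch_embeddings):
    tokens = encoder(patch_embeddings)  # no masking at fine-tuning/inference time
    return head(tokens.mean(dim=1))     # average-pool patch tokens, then classify
```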

Furthermore, the paper examines transfer learning performance across a variety of tasks, including object detection, instance segmentation, and semantic segmentation on commonly used datasets like COCO and ADE20K. The findings demonstrate that models pre-trained using the MAE approach consistently outperform their supervised learning counterparts, emphasizing the quality and generalizability of the learned representations.

Theoretical Implications and Future Perspectives

The success of MAEs in computer vision points towards the broader applicability and potential of self-supervised learning paradigms beyond NLP. The ability of MAEs to efficiently learn from partially visible data could inspire new learning algorithms that better mimic human learning processes, which are inherently efficient and capable of learning from incomplete information.

The scalable nature of the MAE framework, coupled with its simplicity, opens up new avenues for research into more efficient and effective training methodologies that could further bridge the gap between AI systems and human-like learning capabilities. Future work could explore the adaptation of MAE frameworks across different modalities, the integration of multi-sensory data, or the development of more complex reasoning tasks that leverage the foundational representations learned by MAEs.

Conclusion

This paper demonstrates the potential of Masked Autoencoders as a scalable and effective self-supervised learning framework for computer vision. The simplicity of the MAE architecture, combined with its notable performance on various benchmarks, sets a new direction for future research in self-supervised learning. As the field continues to evolve, the principles underlying MAEs will likely play a crucial role in developing more efficient and generalizable AI systems across a broad range of applications.

Authors (6)
  1. Kaiming He (71 papers)
  2. Xinlei Chen (106 papers)
  3. Saining Xie (60 papers)
  4. Yanghao Li (43 papers)
  5. Piotr Dollár (49 papers)
  6. Ross Girshick (75 papers)
Citations (6,449)