Overview of MixMAE for Vision Transformers
This essay provides an expert analysis of the paper "MixMAE: Mixed and Masked Autoencoder for Efficient Pretraining of Hierarchical Vision Transformers." The paper introduces MixMAE, a pretraining approach designed to enhance hierarchical Vision Transformers, addressing inefficiencies observed in existing masked image modeling (MIM) techniques. Specifically, it tackles issues such as slow training speeds due to [MASK] token usage and the resulting pretraining-finetuning discrepancies.
Key Contributions
The paper identifies two central problems in traditional MIM models: processing large numbers of [MASK] tokens wastes computation, and those tokens appear during pretraining but never during finetuning, creating a pretraining-finetuning discrepancy. MixMAE addresses both by replacing the masked tokens of one image with the visible tokens of another image and reconstructing both images from the mixed input, which removes [MASK] symbols from the encoder entirely and improves training efficiency.
Design and Methodology
- Mixed Input Creation: MixMAE replaces the masked tokens of one training image with the visible tokens of a second image, forming a single mixed input. Because the encoder only ever processes real image tokens, the computational overhead of carrying [MASK] tokens through the encoder disappears (see the sketch after this list).
- Dual Reconstruction: The mixed input is encoded once and then unmixed in the decoder so that both original images are reconstructed. Every token in the mixed input therefore contributes supervision, making fuller use of the computation spent on encoding.
- Hierarchical Vision Transformers: MixMAE is designed for hierarchical Vision Transformers and is instantiated with the Swin Transformer using larger window sizes, which broadens the attention context and strengthens representation learning.
- Pretraining Efficiency: Because it is compatible with hierarchical ViTs and keeps [MASK] tokens out of the encoder, MixMAE pretrains more efficiently, a claim the paper backs with empirical results.
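To make the mixing and dual-reconstruction steps concrete, the following is a minimal PyTorch-style sketch. It assumes a fixed 50% mixing ratio, stands in a toy MLP for the paper's hierarchical Swin-style encoder, and uses illustrative names (TinyMixMAE, random_mask, and so on) that do not come from the authors' code; it is a sketch of the idea, not the reference implementation.

```python
import torch
import torch.nn as nn


class TinyMixMAE(nn.Module):
    """Toy illustration of MixMAE-style token mixing and dual reconstruction."""

    def __init__(self, dim=128, patch_pixels=768, mix_ratio=0.5):
        super().__init__()
        self.mix_ratio = mix_ratio
        self.embed = nn.Linear(patch_pixels, dim)            # patch embedding
        self.encoder = nn.Sequential(                        # stand-in for a Swin-style encoder
            nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.decoder = nn.Linear(dim, patch_pixels)          # predicts raw patch pixels
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))  # used only in the decoder

    def random_mask(self, batch, num_tokens, device):
        # Boolean mask: True where image A's token is replaced by image B's token.
        noise = torch.rand(batch, num_tokens, device=device)
        num_swap = int(num_tokens * self.mix_ratio)
        ids = noise.argsort(dim=1)[:, :num_swap]
        mask = torch.zeros(batch, num_tokens, device=device)
        mask.scatter_(1, ids, 1.0)
        return mask.bool()

    def forward(self, patches_a, patches_b):
        # patches_*: [B, N, patch_pixels], flattened patches of two different images.
        tok_a, tok_b = self.embed(patches_a), self.embed(patches_b)
        mask = self.random_mask(tok_a.size(0), tok_a.size(1), tok_a.device)  # [B, N]
        m = mask.unsqueeze(-1)

        # Mixed input: masked positions of A are filled with B's visible tokens,
        # so the encoder never sees a [MASK] symbol.
        mixed = torch.where(m, tok_b, tok_a)
        latent = self.encoder(mixed)

        # Unmix before decoding: each image keeps its own encoded tokens and
        # receives mask tokens at the positions occupied by the other image.
        mask_tok = self.mask_token.expand_as(latent)
        rec_a = self.decoder(torch.where(m, mask_tok, latent))
        rec_b = self.decoder(torch.where(m, latent, mask_tok))

        # Dual reconstruction: each image is supervised on its own missing patches.
        loss_a = ((rec_a - patches_a) ** 2).mean(dim=-1)[mask].mean()
        loss_b = ((rec_b - patches_b) ** 2).mean(dim=-1)[~mask].mean()
        return loss_a + loss_b


# Dummy usage: two batches of 14x14 patches, each patch 16x16x3 = 768 pixels.
model = TinyMixMAE()
loss = model(torch.randn(2, 196, 768), torch.randn(2, 196, 768))
loss.backward()
```

The point of the sketch is the structure of the objective: one encoder pass over a fully visible mixed input yields two reconstruction losses, each computed only on the patches that a given image is missing.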
Numerical Performance
Empirical evaluations demonstrate MixMAE's efficiency in learning high-quality visual representations. Notably, MixMAE reaches 85.1% top-1 accuracy on ImageNet-1K after 600 epochs of pretraining. It also achieves a better FLOPs-versus-accuracy trade-off than existing methods such as SimMIM, underscoring its advantages in computational efficiency and in transfer learning across several downstream datasets.
Implications and Future Directions
The implications of MixMAE are twofold: practically, it offers a more streamlined and efficient method for pretraining Vision Transformers, and theoretically, it opens avenues for reduced reliance on [MASK] token strategies in masked image modeling. This change not only alleviates pretraining-finetuning inconsistencies but also enhances processing efficiency, making it viable for larger and more complex hierarchical ViTs.
The authors suggest that while this research focuses on vision, similar principles may extend to other modalities such as text and audio. Future work could explore mixed-input strategies in domains beyond image classification, including multimodal settings where hierarchical representations must be learned and processed efficiently.
Conclusion
The paper offers a compelling improvement over traditional MIM techniques by addressing key inefficiencies in Vision Transformer pretraining. By keeping [MASK] tokens out of the encoder and adopting a hierarchical architecture, MixMAE improves the quality of learned visual representations while delivering notable gains in computational efficiency. This marks a step forward toward more efficient and scalable models for complex vision applications.