SimMIM: A Simple Framework for Masked Image Modeling
The paper, "SimMIM: a Simple Framework for Masked Image Modeling," introduces an efficient and powerful method for learning visual representations through masked image modeling without the need for complex architectural designs. This work methodically explores critical components of the masked image modeling task to understand which elements most contribute to effective representation learning.
Core Contributions and Findings
- Random Masking of Image Patches:
- The authors demonstrate that randomly masking image patches of a moderately large size (e.g., 32×32 pixels) serves as an effective pretext task for learning robust image representations (a minimal masking sketch follows this list). This approach diverges from more complex, previously proposed strategies such as block-wise masking and tokenization via a discrete VAE or clustering.
- A notable finding is that the masking ratios that work well for vision differ from those in language modeling: with large masked patches, ratios ranging from 10% to 70% were all found to be effective, in contrast to the 15% masking ratio commonly used in NLP tasks.
- Regression Over Raw Pixel Values:
- Instead of employing complex classification targets derived from tokenization or clustering, the authors use a direct regression task to predict the raw RGB values of the masked patches (see the loss sketch after this list). This simplifies the framework and improves computational efficiency without sacrificing performance.
- Lightweight Prediction Head:
- A key design choice in SimMIM is an extremely lightweight prediction head, as small as a single linear layer. This contrasts with heavier prediction heads such as multi-layer perceptrons (MLPs) or inverse transformer decoders used in other approaches. The linear head maintains competitive performance while significantly reducing pre-training cost (a sketch of such a head appears after this list).
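As a rough illustration of the masking strategy described above, the sketch below generates a patch-aligned random mask in PyTorch. The function name and defaults (192×192 input, 32×32 mask patches, 0.6 masking ratio) are illustrative assumptions, not the paper's released code:

```python
import torch

def random_patch_mask(img_size=192, mask_patch_size=32, mask_ratio=0.6):
    """Patch-aligned random mask: 1 = masked, 0 = visible.

    Hypothetical helper; SimMIM's released code may differ in detail.
    """
    grid = img_size // mask_patch_size          # e.g. 192 // 32 = 6
    num_patches = grid * grid                   # 36 candidate patches
    num_mask = int(num_patches * mask_ratio)    # how many patches to hide
    perm = torch.randperm(num_patches)          # random patch order
    mask = torch.zeros(num_patches)
    mask[perm[:num_mask]] = 1                   # mark the masked patches
    return mask.reshape(grid, grid)             # patch-level 2D mask

patch_mask = random_patch_mask()  # 6x6 binary mask over 32x32-pixel patches
```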
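The regression target is similarly simple: an ℓ1 penalty on the predicted pixels, averaged over masked pixels only. The sketch below assumes image-shaped tensors and a per-pixel binary mask; the function name, shapes, and normalization details are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def masked_l1_loss(pred, target, pixel_mask):
    """l1 regression on masked pixels only.

    pred, target: (B, C, H, W) predicted / original RGB images.
    pixel_mask:   (B, 1, H, W), 1 where pixels were masked out.
    Illustrative sketch; normalization may differ from the paper.
    """
    per_pixel = F.l1_loss(pred, target, reduction="none")  # (B, C, H, W)
    masked = per_pixel * pixel_mask                        # zero out visible pixels
    # Average over masked pixel-channels only
    return masked.sum() / (pixel_mask.sum() * pred.size(1) + 1e-8)
```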
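And a sketch of the one-layer prediction head: a single linear map from each encoder token back to the raw pixels of the patch it covers. The class name and dimensions are assumptions (e.g., a Swin-style encoder with 32× down-sampled features), not the paper's implementation:

```python
import torch
import torch.nn as nn

class LinearPredictionHead(nn.Module):
    """Single linear layer from encoder features to raw patch pixels.

    Illustrative sketch under assumed shapes, not the paper's code.
    """

    def __init__(self, encoder_dim=1024, patch_size=32, channels=3):
        super().__init__()
        self.patch_size, self.channels = patch_size, channels
        # One weight matrix: encoder_dim -> patch_size * patch_size * channels
        self.proj = nn.Linear(encoder_dim, patch_size * patch_size * channels)

    def forward(self, feats):
        # feats: (B, L, encoder_dim), one feature vector per patch
        b, l, _ = feats.shape
        pixels = self.proj(feats)  # (B, L, patch_size * patch_size * channels)
        return pixels.view(b, l, self.channels, self.patch_size, self.patch_size)
```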
Numerical Results and Performance
The proposed SimMIM framework achieves compelling numerical results:
- With a ViT-B backbone pre-trained and fine-tuned on ImageNet-1K, SimMIM attains 83.8% top-1 fine-tuning accuracy, surpassing the previous best approach (BEiT) by +0.6%.
- Demonstrating strong scalability, SimMIM reaches 87.1% top-1 accuracy on ImageNet-1K with a SwinV2-H model of 658 million parameters, showing its effectiveness on large-scale models.
- The paper also highlights the data efficiency of SimMIM: a 3-billion-parameter SwinV2-G model is trained to state-of-the-art accuracy on four representative vision benchmarks using roughly 40× less labeled data than previous practice with the JFT-3B dataset.
Implications and Future Directions
Practical Implications:
The simplicity and efficiency of SimMIM make it highly practical for widespread adoption in various computer vision tasks. Its ability to learn strong visual features without the need for complex architectural innovations means it can be easily integrated into existing pipelines or applied to new domains with minimal adaptation.
Theoretical Implications:
The insights from SimMIM about prediction distance (quantified by the proposed AvgDist metric, the average Euclidean distance from each masked pixel to its nearest visible pixel) and about keeping the prediction task simple challenge the need for overly complex methodologies in self-supervised learning. The authors posit that fundamental differences between the image and language modalities call for distinct self-supervised learning strategies.
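A sketch of how such a metric could be computed for a binary pixel mask, using SciPy's Euclidean distance transform. The function name is an assumption, and the paper's exact computation may differ:

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def avg_dist(pixel_mask):
    """Mean Euclidean distance from each masked pixel to the nearest
    visible pixel. pixel_mask: 2D array, 1 = masked, 0 = visible.
    Illustrative sketch of the AvgDist idea, not the paper's code."""
    # For each nonzero (masked) cell, distance to the nearest zero
    # (visible) cell.
    dist = distance_transform_edt(pixel_mask)
    return dist[pixel_mask.astype(bool)].mean()

# Example: a 4x4 mask with the central 2x2 block masked
mask = np.zeros((4, 4)); mask[1:3, 1:3] = 1
print(avg_dist(mask))  # 1.0: every masked pixel touches a visible one
```

Intuitively, larger mask patches and higher mask ratios increase AvgDist, forcing the model to predict pixels far from any visible evidence and thus to learn stronger representations.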
Future Directions:
- Further exploration may involve applying SimMIM to different architectures beyond vision transformers, such as convolutional neural networks (CNNs) or hybrid models combining CNNs and transformers.
- Given its data efficiency, additional research could investigate the use of SimMIM in low-resource settings or with even less pre-processed data.
- Extending the framework to other modalities or multimodal tasks remains an enticing avenue for achieving more generalized and robust AI models.
In conclusion, SimMIM stands out as an exemplar of how simplicity can rival and even outperform complexity in the field of self-supervised learning, offering a potent tool for advancing the state-of-the-art in computer vision.