
ConvMAE: Masked Convolution Meets Masked Autoencoders (2205.03892v2)

Published 8 May 2022 in cs.CV

Abstract: Vision Transformers (ViT) become widely-adopted architectures for various vision tasks. Masked auto-encoding for feature pretraining and multi-scale hybrid convolution-transformer architectures can further unleash the potentials of ViT, leading to state-of-the-art performances on image classification, detection and semantic segmentation. In this paper, our ConvMAE framework demonstrates that multi-scale hybrid convolution-transformer can learn more discriminative representations via the mask auto-encoding scheme. However, directly using the original masking strategy leads to the heavy computational cost and pretraining-finetuning discrepancy. To tackle the issue, we adopt the masked convolution to prevent information leakage in the convolution blocks. A simple block-wise masking strategy is proposed to ensure computational efficiency. We also propose to more directly supervise the multi-scale features of the encoder to boost multi-scale features. Based on our pretrained ConvMAE models, ConvMAE-Base improves ImageNet-1K finetuning accuracy by 1.4% compared with MAE-Base. On object detection, ConvMAE-Base finetuned for only 25 epochs surpasses MAE-Base fine-tuned for 100 epochs by 2.9% box AP and 2.2% mask AP respectively. Code and pretrained models are available at https://github.com/Alpha-VL/ConvMAE.

Authors (6)
  1. Peng Gao (402 papers)
  2. Teli Ma (22 papers)
  3. Hongsheng Li (340 papers)
  4. Ziyi Lin (12 papers)
  5. Jifeng Dai (131 papers)
  6. Yu Qiao (563 papers)
Citations (114)

Summary

Overview of ConvMAE: A Hybrid Convolution-Transformer Encoder

The paper presents a novel architecture, ConvMAE, a hybrid convolution-transformer encoder aimed at enhancing image representation learning. The ConvMAE architecture leverages the strengths of both convolutional and transformer blocks to achieve superior performance compared to its predecessor, MAE, across various model scales. This paper provides detailed architectural insights and experimental results that support the efficacy of the proposed design.

Architectural Details

ConvMAE consists of three distinct stages, the first two convolutional and the last transformer-based, each designed to optimize a different part of the encoding process (a minimal code sketch follows the list below):

  1. Stage 1 and Stage 2: These stages focus on local feature extraction using convolutional blocks. Stage 1 generates high-resolution token embeddings via non-overlapping 4×4 strided convolution, followed by repeated application of convolutional blocks. Stage 2 continues with further downsampling, utilizing 2×2 strided convolution. The purpose here is to effectively capture fine-grained details at high resolution.
  2. Stage 3: This stage involves a transition to global feature fusion using transformer blocks. The feature map is downsampled to a lower resolution and projected into token embeddings, which are then processed through a pure transformer block to enable global reasoning. This stage is particularly important for enlarging the field-of-view (FOV), which is beneficial for a wide range of downstream tasks.
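
To make the stage layout concrete, the following PyTorch sketch outlines a three-stage hybrid encoder of this kind. It is a minimal illustration under stated assumptions: the convolutional block design, channel widths, depths, and head counts are placeholders rather than the paper's exact configuration, and the masked-convolution and block-wise masking machinery used during pre-training is omitted.

```python
# Minimal sketch of a three-stage hybrid convolution-transformer encoder.
# Widths/depths are illustrative placeholders, not the paper's settings.
import torch
import torch.nn as nn


class ConvBlock(nn.Module):
    """Generic residual conv block standing in for the paper's convolutional blocks."""
    def __init__(self, dim):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.norm = nn.BatchNorm2d(dim)
        self.pwconv = nn.Conv2d(dim, dim, kernel_size=1)
        self.act = nn.GELU()

    def forward(self, x):
        return x + self.pwconv(self.act(self.norm(self.dwconv(x))))


class HybridEncoder(nn.Module):
    def __init__(self, dims=(128, 256, 512), depths=(2, 2, 6), num_heads=8):
        super().__init__()
        # Stage 1: non-overlapping 4x4 strided conv embedding + conv blocks (H/4).
        self.embed1 = nn.Conv2d(3, dims[0], kernel_size=4, stride=4)
        self.stage1 = nn.Sequential(*[ConvBlock(dims[0]) for _ in range(depths[0])])
        # Stage 2: 2x2 strided conv downsampling + conv blocks (H/8).
        self.embed2 = nn.Conv2d(dims[0], dims[1], kernel_size=2, stride=2)
        self.stage2 = nn.Sequential(*[ConvBlock(dims[1]) for _ in range(depths[1])])
        # Stage 3: downsample to H/16, flatten to tokens, apply transformer blocks
        # for global reasoning over the whole image.
        self.embed3 = nn.Conv2d(dims[1], dims[2], kernel_size=2, stride=2)
        layer = nn.TransformerEncoderLayer(d_model=dims[2], nhead=num_heads,
                                           dim_feedforward=4 * dims[2],
                                           batch_first=True)
        self.stage3 = nn.TransformerEncoder(layer, num_layers=depths[2])

    def forward(self, x):
        x = self.stage1(self.embed1(x))        # (B, C1, H/4,  W/4)
        x = self.stage2(self.embed2(x))        # (B, C2, H/8,  W/8)
        x = self.embed3(x)                     # (B, C3, H/16, W/16)
        tokens = x.flatten(2).transpose(1, 2)  # (B, N, C3)
        return self.stage3(tokens)


x = torch.randn(1, 3, 224, 224)
print(HybridEncoder()(x).shape)  # torch.Size([1, 196, 512])
```

During masked pre-training, the paper additionally replaces standard convolutions with masked convolutions to prevent information leakage into masked regions and uses block-wise masking for efficiency; that logic is left out of the sketch for brevity.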

Model Variants

The ConvMAE architecture is scaled into various model sizes, namely small, base, large, and huge. These variants allow for flexibility in application across diverse computational environments. The architecture details, including channel dimensions, layer numbers, spatial resolutions, and MLP ratios, are meticulously tailored for each model size to ensure a balance between computational efficiency and learning capability.
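
As an illustration of how such variants are typically specified, the dictionary below pairs each size with stage-wise channel dimensions, depths, and attention-head counts, reusing the HybridEncoder class from the sketch above. The specific numbers are placeholders chosen to show the scaling pattern, not the paper's published configurations.

```python
# Illustrative variant table; values are placeholders, not the paper's settings.
CONVMAE_VARIANTS = {
    "small": dict(dims=(128, 256, 384),  depths=(2, 2, 11), num_heads=6),
    "base":  dict(dims=(256, 384, 768),  depths=(2, 2, 11), num_heads=12),
    "large": dict(dims=(384, 768, 1024), depths=(2, 2, 23), num_heads=16),
    "huge":  dict(dims=(512, 896, 1280), depths=(2, 2, 31), num_heads=16),
}


def build_convmae(variant: str):
    """Instantiate the HybridEncoder sketch above for the requested variant."""
    cfg = CONVMAE_VARIANTS[variant]
    return HybridEncoder(dims=cfg["dims"], depths=cfg["depths"],
                         num_heads=cfg["num_heads"])


model = build_convmae("base")  # larger variants scale width and Stage-3 depth
```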

Performance Evaluation

The paper provides comprehensive experimental results, showcasing ConvMAE's superiority in performance over the traditional MAE models across various scales. Specifically, ConvMAE demonstrates consistent improvements in ImageNet fine-tuning tasks, achieving notable performance gains with fewer pre-training epochs. For instance, the ConvMAE base model achieves an accuracy of 84.6% in contrast to the 83.6% of its MAE counterpart, indicating the architecture's efficiency and effectiveness.

Implications and Future Directions

The ConvMAE results underscore the potential of hybrid architectures to advance the capabilities of machine learning models. The approach combines the local feature extraction strengths of convolutional networks with the global context modeling of transformers. The implications of these findings are broad, paving the way for more adaptable and efficient architectures in computer vision.

Future research may explore further refinements to the transition between convolutional and transformer blocks, as well as extending the application of ConvMAE to other domains beyond image processing. Additionally, the exploration of different scaling strategies could yield insights into further optimizing the trade-off between model complexity and performance.

In conclusion, the introduction of the ConvMAE architecture furthers the discourse on hybrid model design, providing a foundational framework for continued innovation at the intersection of convolutional and transformer approaches in AI.
