Enhancing Masked Autoencoders with Feature Mimicking: MR-MAE
The paper presents MR-MAE, a framework that enhances Masked Autoencoders (MAE) by adding high-level feature mimicking before pixel reconstruction. Masked Autoencoders have become a prominent recipe for large-scale vision representation learning, yet they lack high-level semantic guidance during pre-training. MR-MAE addresses this by mimicking features from pre-trained models such as CLIP and DINO, allowing the encoder to learn both high-level semantics and low-level textures without the two objectives conflicting.
Key Contributions
- Mimic Before Reconstruct: MR-MAE introduces a straightforward yet effective strategy in which a mimic loss is applied to the visible tokens from the encoder, aligning their representations with high-level features from CLIP or DINO. This provides semantic guidance from the start of pre-training, unlike the original MAE, which relies solely on pixel-level reconstruction.
- Dual Target Approach: By applying a reconstruction loss to masked tokens and a mimic loss to visible ones, MR-MAE avoids the conflict that arises when high-level and low-level objectives compete on the same tokens (a minimal sketch of this combined objective follows the list).
- Significant Improvement in Performance: On ImageNet-1K, MR-MAE reaches 85.8% top-1 accuracy after only 400 pre-training epochs, surpassing the original MAE by 2.2% and the state-of-the-art BEiT V2 by 0.3%.
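The dual objective can be written as a reconstruction loss over masked tokens plus a weighted mimic loss over visible tokens. Below is a minimal PyTorch-style sketch of that idea; the tensor shapes, the normalized-L2 mimic distance, and the mimic_weight factor are illustrative assumptions, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def mr_mae_loss(visible_feats, teacher_feats, pred_pixels, target_pixels, mask,
                mimic_weight=1.0):
    """Dual-target objective sketch: mimic loss on visible tokens,
    pixel reconstruction on masked tokens (shapes and weighting assumed).

    visible_feats, teacher_feats: (B, N_vis, D) encoder outputs and
        high-level teacher features (e.g. CLIP or DINO) for visible tokens.
    pred_pixels, target_pixels:   (B, N, P) decoder predictions and
        ground-truth patch pixels; mask: (B, N) with 1 at masked positions.
    """
    # Mimic loss: pull visible-token features toward the teacher's
    # high-level features, here via L2 distance on normalized features.
    mimic_loss = F.mse_loss(F.normalize(visible_feats, dim=-1),
                            F.normalize(teacher_feats, dim=-1))

    # Reconstruction loss: MAE-style per-patch pixel regression,
    # averaged only over the masked positions.
    per_patch = ((pred_pixels - target_pixels) ** 2).mean(dim=-1)   # (B, N)
    recon_loss = (per_patch * mask).sum() / mask.sum().clamp(min=1)

    # The two losses act on disjoint token sets, so the high-level and
    # low-level targets do not compete on the same outputs.
    return recon_loss + mimic_weight * mimic_loss
```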
Experimental Validation
The efficacy of MR-MAE is evidenced through extensive experiments in image classification and object detection:
- ImageNet-1K Fine-tuning: With only 400 pre-training epochs, MR-MAE reaches higher fine-tuning accuracy than MAE variants that typically require 1600 epochs, demonstrating faster convergence and more efficient pre-training.
- COCO Object Detection: A Mask R-CNN detector with the MR-MAE backbone reaches 53.4 box AP with only 25 fine-tuning epochs, indicating strong transferability of the learned representations.
Technical Insights
MR-MAE combines several design components:
- Focused Mimicking and Multi-layer Fusion: Rather than mimicking every visible token, the model selects the most salient tokens for mimicry and fuses features from multiple encoder layers before applying the mimic loss, strengthening the encoder’s ability to capture high-level semantics (see the sketch after this list).
- Incorporation of Multi-scale Architectures: By adopting masked convolution stages, MR-MAE captures hierarchical representations that further improve downstream performance.
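The sketch below shows one plausible way to wire focused mimicking and multi-layer fusion together: a learned weighted sum fuses several encoder layers, and saliency scores (for example, derived from attention) select the top-k visible tokens that receive the mimic loss. The function names, keep_ratio, and fusion weights are hypothetical choices, not the paper's exact implementation:

```python
import torch
import torch.nn.functional as F

def fuse_layers(layer_feats, fusion_weights):
    """Weighted sum of features from several encoder layers.
    layer_feats: list of (B, N, D) tensors; fusion_weights: (L,) tensor."""
    stacked = torch.stack(layer_feats, dim=0)                    # (L, B, N, D)
    w = torch.softmax(fusion_weights, dim=0).view(-1, 1, 1, 1)   # normalized weights
    return (w * stacked).sum(dim=0)                              # (B, N, D)

def focused_mimic_loss(fused_feats, teacher_feats, saliency, keep_ratio=0.5):
    """Apply the mimic loss only to the most salient visible tokens.
    saliency: (B, N) scores (e.g. attention-based); keep_ratio is assumed."""
    B, N, D = fused_feats.shape
    k = max(1, int(N * keep_ratio))

    # Select the top-k most salient visible tokens per image.
    idx = saliency.topk(k, dim=1).indices                # (B, k)
    gather_idx = idx.unsqueeze(-1).expand(-1, -1, D)     # (B, k, D)
    student = torch.gather(fused_feats, 1, gather_idx)
    teacher = torch.gather(teacher_feats, 1, gather_idx)

    # Mimic only the selected tokens, on normalized features.
    return F.mse_loss(F.normalize(student, dim=-1),
                      F.normalize(teacher, dim=-1))
```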
Implications and Future Directions
The approach shows how high-level feature guidance can substantially improve generative pre-training. MR-MAE’s framework enables shorter pre-training schedules and more efficient scaling, and it suggests broader applications to nuanced visual tasks. It also invites future work on richer integration of high-level features, for example by drawing on multiple pre-trained teachers for a broader semantic signal.
Conclusion
MR-MAE represents a meaningful advancement in the domain of vision transformers, effectively merging low-level and high-level training targets to derive a more comprehensive understanding of visual data. By harnessing pre-existing high-level information encoded in models like CLIP and DINO, MR-MAE sets a precedent for future innovations in feature distillation and representation learning.