Bootstrapped Masked Autoencoders for Vision BERT Pretraining (2207.07116v1)

Published 14 Jul 2022 in cs.CV and cs.LG

Abstract: We propose bootstrapped masked autoencoders (BootMAE), a new approach for vision BERT pretraining. BootMAE improves the original masked autoencoders (MAE) with two core designs: 1) momentum encoder that provides online feature as extra BERT prediction targets; 2) target-aware decoder that tries to reduce the pressure on the encoder to memorize target-specific information in BERT pretraining. The first design is motivated by the observation that using a pretrained MAE to extract the features as the BERT prediction target for masked tokens can achieve better pretraining performance. Therefore, we add a momentum encoder in parallel with the original MAE encoder, which bootstraps the pretraining performance by using its own representation as the BERT prediction target. In the second design, we introduce target-specific information (e.g., pixel values of unmasked patches) from the encoder directly to the decoder to reduce the pressure on the encoder of memorizing the target-specific information. Thus, the encoder focuses on semantic modeling, which is the goal of BERT pretraining, and does not need to waste its capacity in memorizing the information of unmasked tokens related to the prediction target. Through extensive experiments, our BootMAE achieves $84.2\%$ Top-1 accuracy on ImageNet-1K with ViT-B backbone, outperforming MAE by $+0.8\%$ under the same pre-training epochs. BootMAE also gets $+1.0$ mIoU improvements on semantic segmentation on ADE20K and $+1.3$ box AP, $+1.4$ mask AP improvement on object detection and segmentation on COCO dataset. Code is released at https://github.com/LightDXY/BootMAE.

Bootstrapped Masked Autoencoders for Vision BERT Pretraining

The research presented in "Bootstrapped Masked Autoencoders for Vision BERT Pretraining" contributes to the field of self-supervised representation learning by advancing masked autoencoder (MAE) techniques for vision transformer models. The paper introduces Bootstrapped Masked Autoencoders (BootMAE), a novel framework designed to enhance BERT pretraining for visual tasks with two pivotal innovations: the use of a momentum encoder to bootstrap the pretraining process and a target-aware decoder to optimize target-specific information handling.

Key Contributions

  1. Momentum Encoder for Bootstrapping: Motivated by the observation that features extracted by a pretrained MAE make better prediction targets for masked tokens than raw pixels alone, BootMAE runs a momentum encoder in parallel with the online MAE encoder and uses its output features as additional BERT-style prediction targets. The momentum encoder's weights are an exponential moving average (EMA) of the online encoder's weights, so the feature targets become progressively richer as training proceeds (a minimal sketch of this update follows the list).
  2. Target-Aware Decoder: The target-aware decoder supplies target-specific information (e.g., pixel values of the unmasked patches) directly to the decoder, so the encoder no longer has to memorize low-level details that are needed only for reconstruction. This decoupling lets the encoder devote its capacity to semantic modeling, which is the actual goal of BERT-style pretraining.
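
To make the first design concrete, the sketch below illustrates the EMA update of a momentum encoder and the regression of its features for masked patches. It is a minimal, hypothetical example: the TinyEncoder class, the masking scheme, and all hyperparameters are placeholders rather than the actual BootMAE implementation, and the target-aware decoder and pixel-reconstruction branch are omitted.

```python
# Minimal sketch of the bootstrapped feature target, assuming a generic
# ViT-style encoder; names and hyperparameters are illustrative, not the
# BootMAE codebase.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyEncoder(nn.Module):
    """Stand-in for a ViT encoder: maps patch tokens to feature tokens."""
    def __init__(self, dim=192):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, tokens):            # tokens: (B, N, dim)
        return self.net(tokens)


online = TinyEncoder()
momentum = copy.deepcopy(online)          # momentum encoder starts as a copy
for p in momentum.parameters():
    p.requires_grad_(False)               # updated only via EMA, never by gradients


@torch.no_grad()
def ema_update(online, momentum, m=0.999):
    """Exponential moving average of the online weights."""
    for po, pm in zip(online.parameters(), momentum.parameters()):
        pm.mul_(m).add_(po, alpha=1.0 - m)


def feature_target_loss(tokens, mask):
    """Regress momentum-encoder features of the masked tokens.

    tokens: (B, N, dim) patch embeddings of the full image
    mask:   (B, N) boolean, True where a patch is masked
    """
    with torch.no_grad():
        target = momentum(tokens)         # bootstrapped feature target
    # In BootMAE the online encoder sees only visible patches; here we simply
    # zero out masked tokens as a stand-in for that asymmetry.
    visible = tokens * (~mask).unsqueeze(-1)
    pred = online(visible)
    return F.mse_loss(pred[mask], target[mask])


# One illustrative training step (optimizer step omitted).
tokens = torch.randn(2, 196, 192)         # batch of 2 images, 14x14 patch grid
mask = torch.rand(2, 196) < 0.75          # 75% masking ratio
loss = feature_target_loss(tokens, mask)
loss.backward()
ema_update(online, momentum)
```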

Experimental Results and Analysis

BootMAE demonstrates its efficacy through extensive experiments on standard benchmarks, consistently outperforming prior methods. It reaches 84.2% top-1 accuracy on ImageNet-1K with a ViT-B backbone after 800 pretraining epochs, a +0.8% improvement over MAE under the same pretraining schedule. It also improves semantic segmentation on ADE20K by +1.0 mIoU and object detection and instance segmentation on COCO by +1.3 box AP and +1.4 mask AP.

Crucial to these outcomes are finer architectural choices: the masking strategy and the injection of low-level context directly into the decoder. The paper shows that different prediction targets (pixel values versus latent features) favor different masking strategies, underscoring the importance of matching the masking scheme to the prediction target; a short sketch of the two common strategies follows this paragraph.
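
The sketch below contrasts the two masking strategies most commonly used in this line of work: uniform random patch masking (as in MAE) and block-wise masking (as in BEiT). Grid size, masking ratios, and block sizes are illustrative placeholders, not the exact sampling code used in the paper.

```python
# Two common masking strategies for a ViT patch grid; parameters are
# illustrative, not the BootMAE sampling code.
import random
import torch


def random_mask(grid=14, ratio=0.75):
    """Mask a fixed fraction of patches uniformly at random."""
    n = grid * grid
    num_mask = int(n * ratio)
    idx = torch.randperm(n)[:num_mask]
    mask = torch.zeros(n, dtype=torch.bool)
    mask[idx] = True
    return mask.view(grid, grid)


def blockwise_mask(grid=14, ratio=0.4, min_block=4):
    """Mask contiguous rectangular blocks until the target ratio is reached."""
    mask = torch.zeros(grid, grid, dtype=torch.bool)
    target = int(grid * grid * ratio)
    while mask.sum() < target:
        h = random.randint(min_block, grid // 2)
        w = random.randint(min_block, grid // 2)
        top = random.randint(0, grid - h)
        left = random.randint(0, grid - w)
        mask[top:top + h, left:left + w] = True
    return mask


print(random_mask().float().mean())     # ~0.75
print(blockwise_mask().float().mean())  # at least ~0.4
```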

Implications and Future Directions

The implications of BootMAE are expansive, offering potential enhancements in downstream vision tasks such as image classification, object detection, and semantic segmentation. By effectively decoupling the responsibilities of the encoder and decoder, BootMAE presents a more efficient paradigm for semantic reasoning, relevant to both academia and industry.

The bootstrapping mechanism employed in BootMAE can stimulate further exploration into dynamic target adaptation during model training. Additionally, future work could investigate the scalability of BootMAE across larger models and more complex datasets, alongside exploring its intersection with cross-modal tasks that integrate visual and linguistic information.

In conclusion, the introduction of BootMAE provides significant insights into optimizing pretraining frameworks for vision transformers, leveraging self-supervised learning principles to advance performance metrics across critical visual benchmarks. This work can lay the foundation for subsequent innovations in the development of robust, semantically aware vision models.

Authors (9)
  1. Xiaoyi Dong (73 papers)
  2. Jianmin Bao (65 papers)
  3. Ting Zhang (174 papers)
  4. Dongdong Chen (164 papers)
  5. Weiming Zhang (135 papers)
  6. Lu Yuan (130 papers)
  7. Dong Chen (218 papers)
  8. Fang Wen (42 papers)
  9. Nenghai Yu (173 papers)
Citations (67)