BEiT: BERT Pre-Training of Image Transformers (2106.08254v2)

Published 15 Jun 2021 in cs.CV and cs.LG

Abstract: We introduce a self-supervised vision representation model BEiT, which stands for Bidirectional Encoder representation from Image Transformers. Following BERT developed in the natural language processing area, we propose a masked image modeling task to pretrain vision Transformers. Specifically, each image has two views in our pre-training, i.e., image patches (such as 16x16 pixels), and visual tokens (i.e., discrete tokens). We first "tokenize" the original image into visual tokens. Then we randomly mask some image patches and feed them into the backbone Transformer. The pre-training objective is to recover the original visual tokens based on the corrupted image patches. After pre-training BEiT, we directly fine-tune the model parameters on downstream tasks by appending task layers upon the pretrained encoder. Experimental results on image classification and semantic segmentation show that our model achieves competitive results with previous pre-training methods. For example, base-size BEiT achieves 83.2% top-1 accuracy on ImageNet-1K, significantly outperforming from-scratch DeiT training (81.8%) with the same setup. Moreover, large-size BEiT obtains 86.3% only using ImageNet-1K, even outperforming ViT-L with supervised pre-training on ImageNet-22K (85.2%). The code and pretrained models are available at https://aka.ms/beit.

BEiT: BERT Pre-Training of Image Transformers

In this paper, the authors propose BEiT, a self-supervised vision representation model whose pre-training task is inspired by BERT. The model addresses the data-hungry nature of vision Transformers relative to Convolutional Neural Networks (CNNs) by employing a masked image modeling (MIM) task, following the success of BERT's masked language modeling objective in NLP. Specifically, BEiT pre-trains vision Transformers by recovering discrete visual tokens for masked patches, rather than naively regressing raw pixel values.

Methodology

Image Representations

The core of BEiT involves using two views of each image during pre-training: image patches and visual tokens. The image is split into non-overlapping patches of a fixed size (e.g., 16x16 pixels), which serve as the input to the vision Transformer. Meanwhile, visual tokens are generated by a "tokenizer" learned via a discrete variational autoencoder (dVAE). During pre-training, roughly 40% of the image patches are masked and the corrupted sequence is fed to the Transformer backbone; the objective is to predict the original visual tokens corresponding to the masked patches.
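
The sketch below illustrates the two views under assumed settings (224x224 inputs, 16x16 patches, giving a 14x14 patch grid); the `tokenizer` object standing in for the dVAE is hypothetical.

```python
# Minimal sketch of BEiT's two image views, assuming 224x224 inputs and 16x16 patches.
import torch

def to_patches(images, patch_size=16):
    """Split a batch of images (B, C, H, W) into flattened patches (B, N, C*P*P)."""
    B, C, H, W = images.shape
    x = images.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
    # x: (B, C, H/P, W/P, P, P) -> (B, N, C*P*P) with N = (H/P) * (W/P)
    return x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * patch_size * patch_size)

images = torch.randn(2, 3, 224, 224)
patches = to_patches(images)            # (2, 196, 768): the Transformer's input view
# Second view: discrete visual tokens from the dVAE tokenizer
# (`tokenizer` is a stand-in; the paper reuses a publicly available image tokenizer).
# visual_tokens = tokenizer(images)     # (2, 196) integer ids from the codebook
```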

Backbone Network

The backbone of BEiT is a standard Transformer architecture consistent with the design of ViT. An image’s patches are linearly projected into a sequence, and positional embeddings are added to maintain spatial information. These sequences are then processed through Transformer layers, which enable the self-attention mechanism to capture long-range dependencies and contextual information.
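
As a rough illustration, a backbone of this kind can be sketched with standard PyTorch modules. The hyperparameters below (embedding dimension 768, 12 layers, 12 heads) mirror a Base-size configuration and are illustrative assumptions, not the exact BEiT implementation.

```python
import torch
import torch.nn as nn

class ViTBackbone(nn.Module):
    """ViT-style encoder: linear patch projection + positional embeddings + Transformer layers."""
    def __init__(self, patch_dim=768, dim=768, depth=12, heads=12, num_patches=196):
        super().__init__()
        self.proj = nn.Linear(patch_dim, dim)                  # linear projection of patches
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))  # prepended [CLS]-style token
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, patches):            # patches: (B, N, patch_dim)
        x = self.proj(patches)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        return self.encoder(x)             # (B, N + 1, dim) contextualized representations
```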

Pre-Training Objective

The pre-training objective for BEiT, formulated as a masked image modeling task, is to predict visual tokens rather than the raw pixel values of masked patches. This choice is crucial: it encourages the model to learn high-level abstractions instead of short-range dependencies and high-frequency details, pushing it toward a more holistic, semantic understanding of the image content.
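
A minimal sketch of this objective is given below: a linear head over the visual-token vocabulary, with cross-entropy computed only at masked positions. The `backbone`, `visual_tokens`, and 8192-entry vocabulary are assumptions carried over from the sketches above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, dim = 8192, 768              # assumed visual-token vocabulary and hidden size
mim_head = nn.Linear(dim, vocab_size)    # predicts a visual token for each patch position

def mim_loss(backbone, corrupted_patches, visual_tokens, mask):
    """corrupted_patches: (B, N, D) with masked positions already replaced by a learnable
    [MASK] embedding (omitted here for brevity); visual_tokens: (B, N) dVAE token ids;
    mask: (B, N) boolean indicating which patches were masked."""
    hidden = backbone(corrupted_patches)[:, 1:, :]   # drop the [CLS] position -> (B, N, dim)
    logits = mim_head(hidden)                        # (B, N, vocab_size)
    # As in BERT's masked language modeling, only masked positions contribute to the loss.
    return F.cross_entropy(logits[mask], visual_tokens[mask])
```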

Experimental Validation

The efficacy of BEiT is validated through extensive experiments on image classification and semantic segmentation tasks. The results are benchmarked against traditional supervised and self-supervised methods.

Image Classification

On the ImageNet-1K dataset, BEiT demonstrates comparable, and in some configurations superior, performance relative to models pre-trained with labeled data. It also outperforms previous self-supervised vision Transformer models such as MoCo v3 and DINO. The experiments further show that BEiT benefits from fine-tuning at higher input resolutions and scales well from base- to large-size models.
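
As a rough sketch of how task layers are appended for classification fine-tuning, the hypothetical head below average-pools the final patch representations of the pretrained encoder (reusing the `ViTBackbone` sketch from above) and applies a linear classifier; the pooling choice and class count are illustrative assumptions.

```python
import torch
import torch.nn as nn

class BEiTClassifier(nn.Module):
    """Fine-tuning head sketch: mean-pool patch representations, then a linear classifier."""
    def __init__(self, backbone, dim=768, num_classes=1000):
        super().__init__()
        self.backbone = backbone                 # pretrained encoder, e.g. ViTBackbone above
        self.head = nn.Linear(dim, num_classes)  # task layer appended for fine-tuning

    def forward(self, patches):                  # patches: (B, N, patch_dim)
        hidden = self.backbone(patches)          # (B, N + 1, dim)
        pooled = hidden[:, 1:, :].mean(dim=1)    # average-pool patch positions (skip [CLS])
        return self.head(pooled)                 # (B, num_classes) logits
```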

Semantic Segmentation

For semantic segmentation, the model is evaluated on the ADE20K dataset, where it achieves higher mean Intersection over Union (mIoU) than traditional supervised pre-training methods. BEiT's ability to capture semantically meaningful regions without any task-specific supervision demonstrates its effectiveness on downstream vision tasks, particularly those that require fine-grained understanding.

Ablation Studies

Multiple ablation studies are conducted to validate the importance of different components of BEiT:

  1. Blockwise Masking: Masking contiguous blocks of image patches, rather than individual patches independently, leads to better performance, especially on tasks requiring semantic understanding such as segmentation (a simplified masking sketch follows this list).
  2. Visual Token Prediction: Predicting visual tokens rather than raw pixel values of masked patches significantly enhances the model's performance.
  3. Extended Pre-Training: Longer pre-training duration consistently yields performance gains, implying the potential benefits of large-scale and extended self-supervised training.
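
The sketch below gives a simplified version of blockwise masking: rectangular blocks of patches are masked on the 14x14 patch grid until roughly 40% of positions are covered. The exact block-size and aspect-ratio ranges are illustrative assumptions rather than the paper's precise algorithm.

```python
import math
import random
import torch

def blockwise_mask(grid=14, mask_ratio=0.4, min_block=16):
    """Mask rectangular blocks of patches on a grid x grid layout until ~mask_ratio is covered."""
    mask = torch.zeros(grid, grid, dtype=torch.bool)
    target = int(grid * grid * mask_ratio)
    while int(mask.sum()) < target:
        s = random.randint(min_block, max(min_block, target - int(mask.sum())))
        r = random.uniform(0.3, 1 / 0.3)                # block aspect ratio
        h = min(grid, max(1, round(math.sqrt(s * r))))
        w = min(grid, max(1, round(math.sqrt(s / r))))
        top = random.randint(0, grid - h)
        left = random.randint(0, grid - w)
        mask[top:top + h, left:left + w] = True         # blocks may overlap; that is fine
    return mask.flatten()                                # (N,) boolean mask over patch positions
```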

Implications and Future Work

The BEiT model offers a promising direction for self-supervised learning in computer vision. Its approach to pre-training vision Transformers using a masked image modeling task not only provides competitive performance in comparison to both supervised and self-supervised methods but also emphasizes the effectiveness of learning discrete representations in visual contexts.

The implications are substantial for the practical deployment of vision Transformers, particularly in scenarios with limited labeled datasets. In the future, scaling the BEiT model and extending its application to multimodal settings with unified architectures for textual and visual data will be valuable areas of exploration. Additionally, the potential of BEiT in real-time and resource-constrained environments warrants further research.

In conclusion, BEiT's methodology and results highlight the importance of advanced pre-training techniques and pave the way for further innovations in self-supervised vision Transformers. This framework has the potential to redefine pre-training paradigms for a wide range of computer vision applications.

Authors (4)
  1. Hangbo Bao (17 papers)
  2. Li Dong (154 papers)
  3. Songhao Piao (9 papers)
  4. Furu Wei (291 papers)
Citations (2,456)