An Analysis of "BEiT v2: Masked Image Modeling with Vector-Quantized Visual Tokenizers"
In the paper "BEiT v2: Masked Image Modeling with Vector-Quantized Visual Tokenizers", the authors propose an innovative approach to improve self-supervised representation learning in computer vision, leveraging masked image modeling (MIM) techniques. The central innovation in this paper is the development of a semantic-rich visual tokenizer that shifts the reconstruction target of MIM from pixel-level to semantic-level, thereby promoting higher-level feature learning in vision Transformers.
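To make the shift in reconstruction target concrete, below is a minimal sketch (not the authors' released code) of a BEiT-style masked-prediction objective in which masked patches are trained to predict discrete token indices produced by a frozen tokenizer. The names `vision_transformer`, `tokenizer`, and `mask` are placeholders assumed for illustration.

```python
import torch
import torch.nn.functional as F

def mim_loss(vision_transformer, tokenizer, images, mask):
    """Cross-entropy over codebook indices at masked patch positions.

    images: (B, 3, H, W) input batch
    mask:   (B, N) boolean, True where a patch is masked out
    """
    with torch.no_grad():
        # The frozen tokenizer maps each image patch to a discrete code index,
        # so the target is a semantic token rather than raw pixel values.
        targets = tokenizer(images)               # (B, N) int64 code ids

    # The backbone sees the corrupted (masked) image and predicts a
    # distribution over the codebook for every patch position.
    logits = vision_transformer(images, mask)     # (B, N, codebook_size)

    # Only masked positions contribute to the loss, as in BEiT-style MIM.
    return F.cross_entropy(logits[mask], targets[mask])
```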
The proposed methodology involves two components. First, the authors introduce vector-quantized knowledge distillation (VQ-KD), which trains a visual tokenizer that discretizes a continuous semantic space into compact codes, turning high-dimensional image features into discrete tokens. Second, they propose a patch aggregation strategy that associates discrete image patches to enrich the global semantic representation, encouraging the model to form image-level rather than purely patch-level features.
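The following is a minimal sketch, under stated assumptions, of the VQ-KD step described above: patch features from a ViT encoder are snapped to their nearest codebook entry by cosine similarity, a light decoder reconstructs the features of a frozen semantic teacher, and training is driven by a cosine distillation loss. The names `encoder`, `decoder`, `teacher`, and the loss weighting are illustrative, not the paper's exact configuration, and the codebook update rule (e.g. EMA) is omitted.

```python
import torch
import torch.nn.functional as F

def vq_kd_step(encoder, decoder, teacher, codebook, images, beta=1.0):
    z = F.normalize(encoder(images), dim=-1)          # (B, N, D) patch features
    codes = F.normalize(codebook, dim=-1)             # (K, D) codebook entries

    # Nearest-neighbour lookup: each patch feature picks the most similar code.
    sim = z @ codes.t()                               # (B, N, K) cosine similarity
    indices = sim.argmax(dim=-1)                      # (B, N) discrete visual tokens
    z_q = codes[indices]                              # (B, N, D) quantized features

    # Straight-through estimator so gradients flow back to the encoder.
    z_q = z + (z_q - z).detach()

    # Distillation target: features from a frozen semantic teacher model.
    with torch.no_grad():
        t = F.normalize(teacher(images), dim=-1)      # (B, N, D)

    recon = F.normalize(decoder(z_q), dim=-1)
    distill_loss = (1 - (recon * t).sum(-1)).mean()   # cosine distance

    # Commitment term keeps encoder outputs close to their assigned codes.
    commit_loss = F.mse_loss(z, z_q.detach())
    return distill_loss + beta * commit_loss, indices
```

Once trained, the encoder and codebook are frozen and serve as the tokenizer in the masked image modeling sketch above.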
Experiments demonstrate the effectiveness of BEiT v2 across a range of tasks, including image classification and semantic segmentation. On ImageNet-1K, the base model reaches 85.5% top-1 accuracy with fine-tuning and 80.1% with linear probing. The large model reaches 87.3% top-1 accuracy on ImageNet-1K fine-tuning and 56.7 mIoU on ADE20K semantic segmentation. Across these benchmarks, BEiT v2 consistently outperforms prior MIM methods.
The paper's quantitative results underscore the strength of BEiT v2 in moving beyond traditional pixel-centric MIM. By adopting a semantically aware visual tokenizer, the framework turns its prediction targets into high-level semantic units, analogous to the word targets of masked language modeling. This suggests that BEiT v2 learns richer and more globally coherent visual representations that can be transferred to a variety of downstream computer vision tasks.
Looking forward, the proposed research could catalyze further exploration of vector-quantization techniques within MIM schemes, particularly their role in compressing continuous feature spaces into compact discrete pretraining targets. It also opens possibilities for bridging visual and language modalities, a growing area of interest in AI research; the idea of a universal tokenizer for cross-modal learning could reshape approaches to unified vision-language pretraining architectures.
In conclusion, "BEiT v2: Masked Image Modeling with Vector-Quantized Visual Tokenizers" presents a significant advancement in semantic-level pretraining for visual Transformers, pushing the boundaries of self-supervised learning in computer vision. The integration of vector-quantization and enhanced patch aggregation strategies marks a compelling step towards more efficient and semantically enriched model pretraining paradigms.