Overview of "MST: Masked Self-Supervised Transformer for Visual Representation"
The paper "MST: Masked Self-Supervised Transformer for Visual Representation" discusses a novel approach for enhancing visual self-supervised learning through transformers. The authors introduce a masked self-supervised transformer (MST) methodology inspired by the masked LLMing (MLM) approach from NLP, adapting it to the visual domain to capture local image context while preserving global semantic structures.
Core Contributions
- Masked Token Strategy: The paper proposes a masked token strategy guided by the multi-head self-attention maps. Unlike conventional random masking, this approach dynamically masks only low-response local patch tokens, leaving the crucial structures of the image intact and preserving the semantics needed for self-supervised learning (a minimal masking sketch follows this list).
- Global Image Decoder: MST employs a global image decoder that reconstructs the image from both masked and unmasked tokens, forcing spatial detail to be retained in the learned representation (see the decoder sketch after this list). This property is particularly advantageous for downstream dense prediction tasks, such as object detection and semantic segmentation, where spatial accuracy is paramount.
- Empirical Validation: Through extensive experiments, MST demonstrates strong performance across a range of benchmarks. Notably, it achieves 76.9% Top-1 accuracy on ImageNet under linear evaluation with DeiT-S after only 300 epochs of pre-training, surpassing DINO and the supervised baseline trained for the same number of epochs. MST also excels in dense prediction, reaching 42.7% mAP on MS COCO object detection and 74.04% mIoU on Cityscapes segmentation with only 100-epoch pre-training.
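To make the attention-guided masking idea concrete, the PyTorch-style sketch below ranks patch tokens by the attention the [CLS] token pays them and masks a random subset of the least-attended ones; the masked positions would then be replaced by a learnable [MASK] embedding before entering the encoder. The function name, the mask_ratio/threshold hyperparameters, and the shape conventions are illustrative assumptions, not the paper's exact implementation.

```python
import torch

def attention_guided_mask(attn_cls, mask_ratio=0.3, threshold=0.5):
    """Sketch of attention-guided token masking (hypothetical helper).

    attn_cls: (B, N) attention weights from the [CLS] token to the N patch
              tokens, averaged over heads (assumed to come from the teacher).
    Returns a boolean mask of shape (B, N); True marks tokens to be replaced
    by the learnable [MASK] embedding. Assumes mask_ratio <= threshold.
    """
    B, N = attn_cls.shape
    num_candidates = int(N * threshold)   # least-attended fraction of patches
    num_masked = int(N * mask_ratio)      # how many of them to actually mask

    # Sort patches by attention (ascending): low-attention patches are the
    # masking candidates, so patches crucial to the global structure stay visible.
    order = attn_cls.argsort(dim=1)
    candidates = order[:, :num_candidates]

    # Randomly choose num_masked of the low-attention candidates per image.
    pick = torch.rand(B, num_candidates, device=attn_cls.device).argsort(dim=1)[:, :num_masked]
    chosen = torch.gather(candidates, 1, pick)

    mask = torch.zeros(B, N, dtype=torch.bool, device=attn_cls.device)
    mask.scatter_(1, chosen, True)
    return mask
```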
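The global image decoder can likewise be sketched as a lightweight head that maps the encoder's patch tokens back to pixels, so that a reconstruction loss forces spatial detail to survive in the representation. The class name, the single linear projection, and the layer sizes below are assumptions chosen for brevity; the paper's actual decoder design may differ.

```python
import torch
import torch.nn as nn

class GlobalImageDecoder(nn.Module):
    """Hypothetical minimal decoder: patch tokens -> reconstructed image."""

    def __init__(self, embed_dim=384, patch_size=16, in_chans=3):
        super().__init__()
        self.patch_size = patch_size
        # Project each token to the pixel values of its patch (p * p * C).
        self.proj = nn.Linear(embed_dim, patch_size * patch_size * in_chans)

    def forward(self, tokens, grid_size):
        # tokens: (B, N, D) patch tokens from the encoder (no [CLS]),
        # where N = grid_size * grid_size.
        B, N, _ = tokens.shape
        h = w = grid_size
        p = self.patch_size
        pixels = self.proj(tokens)                     # (B, N, p*p*C)
        pixels = pixels.view(B, h, w, p, p, -1)        # split patch grid
        pixels = pixels.permute(0, 5, 1, 3, 2, 4)      # (B, C, h, p, w, p)
        return pixels.reshape(B, -1, h * p, w * p)     # (B, C, H, W)

# Training would pair this with a pixel-level reconstruction loss, e.g.
# loss = torch.nn.functional.l1_loss(decoder(tokens, grid_size), images)
```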
Implications and Future Directions
The MST approach extends the utility of transformers into visual self-supervised learning by balancing global semantic capture with localized feature extraction. This methodology bridges the gap between self-supervised learning representations and the requirements of pixel-level prediction tasks. The capacity to learn robust visual representations without requiring extensive labeled data makes MST particularly relevant in scenarios involving large-scale datasets where label acquisition is impractical.
In contrast to mainstream self-supervised strategies, which tend to overfit to high-level global features that transfer poorly to dense tasks, MST's pairing of its global self-supervised objective with a pixel-level reconstruction task promises better generalization. The method points to a path for future research in which attention-guided masking is refined further to optimize feature learning, potentially informing advances in both architectural design and training efficiency.
Future investigations may apply MST principles to other model architectures and explore variations in pre-training task complexity. Additionally, analyzing how different attention-driven masking strategies affect model robustness and adaptability could yield insights that refine the approach and expand its application scope.
In summary, MST represents a significant step forward in using transformer architectures for self-supervised visual representation learning, emphasizing the importance of retaining spatial structure and context to overcome the limitations of earlier methodologies.