An Empirical Study of Training Self-Supervised Vision Transformers
The paper "An Empirical Study of Training Self-Supervised Vision Transformers," by Xinlei Chen, Saining Xie, and Kaiming He, investigates the complexities and nuances of training Vision Transformers (ViT) in self-supervised settings. Rather than proposing a novel method, the work refines and benchmarks existing frameworks, providing the detailed insights needed to advance future work in the domain.
Background and Motivation
Self-supervised learning has achieved significant milestones in NLP, where models pre-trained with masked auto-encoding or autoregressive objectives have rapidly gained prominence. Computer vision (CV), however, poses its own challenges: its prevalent architectures have differed fundamentally, typically relying on convolutional neural networks (CNNs) rather than transformer-based models. Given the recent success of Vision Transformers (ViT), it becomes important to explore their training dynamics under self-supervision and potentially harmonize the methodologies of NLP and CV.
Key Findings and Contributions
Training Stability as a Major Issue
The paper highlights a critical challenge in training self-supervised ViTs: instability. The researchers observed that instability often manifests not as catastrophic divergence but as a subtle degradation in accuracy, which can mask genuine differences between training methods and lead to misleading comparisons. As a practical remedy, the paper proposes freezing the patch projection layer, i.e., training the transformer on top of a fixed random patch embedding. This trick consistently improved stability and, consequently, accuracy across the frameworks studied.
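As a concrete illustration, the sketch below freezes the patch projection of a ViT so that the network trains on a fixed, randomly initialized patch embedding. It assumes a timm-style ViT in PyTorch where the projection lives at `patch_embed.proj`; the model name and optimizer settings are illustrative assumptions, not the authors' released code.

```python
import torch
import timm  # assumed dependency; any ViT whose patch embedding is a conv layer works similarly

# Build a ViT-B/16 backbone (timm model name assumed).
model = timm.create_model("vit_base_patch16_224", pretrained=False)

# Freeze the patch projection: keep its random initialization and stop its gradients,
# so the transformer blocks are trained on top of a fixed random patch embedding.
for param in model.patch_embed.proj.parameters():
    param.requires_grad = False

# Hand only the trainable parameters to the optimizer.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1.5e-4, weight_decay=0.1)
```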
Benchmarking Frameworks
The paper benchmarks ViT within several self-supervised learning frameworks: MoCo v3, SimCLR, BYOL, and SwAV. MoCo v3 exhibited the best performance when paired with ViT, suggesting it is a robust and efficient framework for self-supervised Vision Transformers.
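For reference, the core of MoCo v3 is an InfoNCE contrastive loss computed over in-batch negatives, with keys produced by a momentum encoder. The following is a simplified sketch under those assumptions; function and variable names are illustrative, not the authors' code.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(q, k, temperature=0.2):
    """InfoNCE loss over in-batch negatives, in the spirit of MoCo v3 (no memory queue).

    q: (N, D) outputs of the query encoder for one augmented view.
    k: (N, D) outputs of the momentum encoder for the other view.
    The positive pair is (q[i], k[i]); every other k[j] in the batch serves as a negative.
    """
    q = F.normalize(q, dim=1)
    k = F.normalize(k, dim=1)
    logits = q @ k.t() / temperature                   # (N, N) cosine-similarity logits
    labels = torch.arange(q.size(0), device=q.device)  # positives lie on the diagonal
    return F.cross_entropy(logits, labels)

# The loss is symmetrized over the two augmented views:
# loss = contrastive_loss(q1, k2) + contrastive_loss(q2, k1)
```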
Empirical Results and Ablations
Key empirical findings from the benchmarking are summarized as follows:
- ViT-B trained under MoCo v3 for 300 epochs achieved a linear-probing accuracy of 76.5% on ImageNet, significantly surpassing the other self-supervised frameworks evaluated (see the linear-probing sketch after this list).
- A detailed study of hyperparameters such as batch size, learning rate, and optimizer choice revealed nuanced impacts on training stability and final accuracy.
- The use of a fixed random patch projection significantly improved stability and performance, with accuracy gains of up to 1.7% in specific configurations.
- Ablation studies on position embeddings, the class token, and BatchNorm layers in the MLP heads yielded further insights, with each design decision affecting training efficiency and model performance differently.
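For context, linear probing evaluates frozen self-supervised features by training only a supervised linear classifier on top of them; the reported number is that classifier's accuracy. A minimal sketch follows, assuming `backbone(images)` returns one feature vector per image (all names are illustrative):

```python
import torch
import torch.nn.functional as F

def linear_probe_step(backbone, classifier, optimizer, images, labels):
    """One step of the linear-probing protocol: the self-supervised backbone stays
    frozen and only a linear classifier on its features is trained with labels."""
    backbone.eval()                  # frozen pretrained encoder
    with torch.no_grad():
        feats = backbone(images)     # (N, D) features; no gradients flow into the backbone
    logits = classifier(feats)       # classifier is, e.g., torch.nn.Linear(D, num_classes)
    loss = F.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```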
Implications and Future Directions
The findings pave the way for more nuanced approaches to training self-supervised ViTs. The successful use of random patch projection to stabilize training is a particularly notable contribution, as it may be broadly applicable to other variants of Transformers beyond ViT.
Theoretically, the findings suggest a shift toward architectures that balance the inductive biases inherent in CNNs against the less rigid structure of Transformers. The empirical analysis also underscores the need for larger datasets and stronger augmentation to fully realize the potential of ViTs in self-supervised learning.
Future developments could include:
- Exploring advanced pretext tasks that push the boundaries of instance-based learning, as current tasks may not provide sufficient complexity for large models.
- Scaling up models and datasets concurrently, similar to the trajectory observed in NLP, to observe if larger-scale self-supervised learning can lead to more generalized and robust ViT representations.
- Further research on positional information, given indications that current models may not be exploiting positional embeddings effectively.
Conclusion
This paper's comprehensive empirical investigation into training self-supervised Vision Transformers offers valuable insights and practical benchmarks for the research community. By highlighting the challenges, particularly around stability, and suggesting actionable solutions, the paper contributes meaningfully to the advancement of self-supervised learning in computer vision, bridging some gaps between current practices in NLP and CV. This foundational work sets the stage for more refined and scalable training methodologies, reflecting a critical step forward in the maturation of Vision Transformers.