Three Things Everyone Should Know about Vision Transformers
The paper "Three things everyone should know about Vision Transformers," authored by Hugo Touvron et al., addresses the application and optimization of transformer models in the domain of computer vision. It delineates three specific insights that offer improvements in efficiency, adaptability, and performance for Vision Transformers (ViTs).
Transformers, a model architecture initially developed for natural language processing, have shown significant promise in computer vision tasks, such as image classification, object detection, and video analysis. Despite their success, there remains potential for enhancing their design and training methods. This paper proposes:
- Parallel Vision Transformers: The authors suggest reorganizing the typically sequential residual blocks of a ViT so that pairs (or larger groups) of blocks operate in parallel on the same input. The premise is that flattening the network in this way alleviates depth-related optimization challenges and reduces computational latency while maintaining, and sometimes improving, accuracy. Experiments on ImageNet-1k show that the parallel architecture yields efficiency gains, especially for deeper models, when implemented appropriately: the parallel variant matches its sequential counterpart in parameter count and FLOPs while offering a meaningful reduction in latency, particularly on GPU accelerators. A minimal sketch contrasting the two layouts follows this list.
- Fine-tuning Attention Layers: The paper shows that fine-tuning the self-attention layers alone is often sufficient when adapting a ViT to a higher resolution or transferring it to a new task. This offers substantial benefits: it reduces the compute and memory needed during fine-tuning and leaves most of the model's parameters unchanged, making it easier to share a single backbone across tasks or datasets. Evaluations on multiple datasets show that this restricted fine-tuning remains competitive with full fine-tuning, particularly for larger models and for target datasets with limited data. A sketch of freezing everything but the attention weights also follows the list.
- Patch Preprocessing with a Hierarchical MLP (hMLP) Stem for Self-Supervised Learning: The authors propose an MLP-based patch-preprocessing stem that is compatible with BERT-style masked self-supervised training. Because the hierarchical MLP stem processes each 16×16 patch independently, masking patches before or after the stem is equivalent, which is essential for masked-image-modeling approaches. This design outperforms conventional convolutional stems when integrated with the BEiT self-supervised learning framework, as evidenced by experimental results on ImageNet. A sketch of such a patch-local stem closes the examples after this list.
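The following is a minimal sketch, not the authors' implementation, contrasting a pair of sequential ViT blocks with the parallel regrouping described above. The layer widths, head counts, and the use of torch.nn.MultiheadAttention are illustrative assumptions; the point is that the parallel pair keeps the same parameters and FLOPs while halving the sequential depth.

```python
# Sketch only: sequential vs. parallel grouping of ViT residual blocks.
import torch
import torch.nn as nn


class Attention(nn.Module):
    """Pre-norm multi-head self-attention sub-layer."""

    def __init__(self, dim: int, num_heads: int = 6):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.norm(x)
        out, _ = self.attn(y, y, y, need_weights=False)
        return out


class FFN(nn.Module):
    """Pre-norm feed-forward (MLP) sub-layer of a ViT block."""

    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.mlp(self.norm(x))


class SequentialPair(nn.Module):
    """Two standard ViT blocks applied one after the other (four sequential sub-layers)."""

    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.attn1, self.ffn1 = Attention(dim), FFN(dim, hidden)
        self.attn2, self.ffn2 = Attention(dim), FFN(dim, hidden)

    def forward(self, x):
        x = x + self.attn1(x)
        x = x + self.ffn1(x)
        x = x + self.attn2(x)
        x = x + self.ffn2(x)
        return x


class ParallelPair(nn.Module):
    """Same parameter count, but the two attentions (and the two FFNs)
    read the same input and their outputs are summed, halving the depth."""

    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.attn1, self.ffn1 = Attention(dim), FFN(dim, hidden)
        self.attn2, self.ffn2 = Attention(dim), FFN(dim, hidden)

    def forward(self, x):
        x = x + self.attn1(x) + self.attn2(x)   # branches share one input,
        x = x + self.ffn1(x) + self.ffn2(x)     # so they can run concurrently
        return x


if __name__ == "__main__":
    tokens = torch.randn(2, 197, 384)           # (batch, patches + cls token, dim)
    print(SequentialPair(384, 1536)(tokens).shape)
    print(ParallelPair(384, 1536)(tokens).shape)
```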
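Next, a small sketch of attention-only fine-tuning, assuming a pretrained timm ViT whose attention sub-modules are named `attn` (as in `blocks.0.attn.qkv`). The model name, the name-matching rule, and the choice to keep the classification head trainable are assumptions for illustration, not the paper's exact recipe.

```python
# Sketch only: freeze everything except the self-attention weights (and the new head).
import timm
import torch

model = timm.create_model("vit_small_patch16_224", pretrained=True, num_classes=100)

for name, param in model.named_parameters():
    # Train only the attention sub-modules and the classification head;
    # FFN, patch-embedding, and norm parameters stay frozen.
    param.requires_grad = (".attn." in name) or name.startswith("head.")

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4, weight_decay=0.05)

n_total = sum(p.numel() for p in model.parameters())
n_train = sum(p.numel() for p in trainable)
print(f"training {n_train / n_total:.1%} of the parameters")
```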
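Finally, a sketch of a hierarchical-MLP-style stem, using the common trick of expressing patch-wise linear layers as convolutions whose kernel equals their stride, so no information crosses 16×16 patch boundaries and masking commutes with the stem. Channel widths and the BatchNorm/GELU choices are illustrative assumptions rather than the exact configuration from the paper.

```python
# Sketch only: a patch-local hierarchical stem compatible with patch masking.
import torch
import torch.nn as nn


class HMLPStem(nn.Module):
    def __init__(self, in_chans: int = 3, embed_dim: int = 384):
        super().__init__()
        self.proj = nn.Sequential(
            # 4x4 pixels -> one token; kernel == stride, so strictly patch-local
            nn.Conv2d(in_chans, embed_dim // 4, kernel_size=4, stride=4),
            nn.BatchNorm2d(embed_dim // 4),
            nn.GELU(),
            # merge 2x2 tokens -> receptive field 8x8
            nn.Conv2d(embed_dim // 4, embed_dim // 4, kernel_size=2, stride=2),
            nn.BatchNorm2d(embed_dim // 4),
            nn.GELU(),
            # merge 2x2 tokens -> receptive field 16x16, the final patch size
            nn.Conv2d(embed_dim // 4, embed_dim, kernel_size=2, stride=2),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.proj(x)                      # (B, embed_dim, H/16, W/16)
        return x.flatten(2).transpose(1, 2)   # (B, num_patches, embed_dim)


if __name__ == "__main__":
    imgs = torch.randn(2, 3, 224, 224)
    print(HMLPStem()(imgs).shape)             # torch.Size([2, 196, 384])
```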
The paper's findings matter for both applied and theoretical work on ViTs. Parallelizing layers can lead to more efficient training, lower latency in computation-heavy applications, and potentially more straightforward optimization. The fine-tuning strategy offers a more resource-efficient path to deploying ViTs in transfer-learning scenarios, accommodating resource-constrained settings without significant sacrifices in performance. Lastly, the proposed hMLP stem has implications for self-supervised learning strategies, hinting at broader applicability across domains that benefit from masked prediction models.
Looking forward, these insights invite further architectural exploration of transformer models for vision tasks. The proposed strategies may also extend beyond vision as they carry over to multimodal settings, broadening the impact of transformer architectures as a general-purpose backbone across artificial intelligence research.