- The paper establishes a protocol for using a standard ViT as the backbone of Mask R-CNN, enabling reliable transfer-learning comparisons for object detection.
- Non-overlapping windowed attention in the backbone keeps memory and computational overhead manageable during fine-tuning.
- Empirical results show that masking-based unsupervised pre-training (BEiT, MAE) boosts detection accuracy and accelerates convergence.
The paper "Benchmarking Detection Transfer Learning with Vision Transformers" by Yanghao Li et al. addresses the complexities and challenges involved in using Vision Transformers (ViT) for transfer learning in object detection tasks. Object detection is a pivotal downstream task used to evaluate if pre-trained network parameters enhance accuracy or training efficiency. With the advent of Vision Transformer architectures, benchmarking these models for detection tasks presents unique obstacles due to architectural incompatibility, computational overhead, and increased memory requirements.
Overview and Motivations
Traditional convolutional neural networks (CNNs) have an established framework for object detection, with well-understood methods for transferring pre-trained parameters. However, integrating Vision Transformers, introduced by Dosovitskiy et al., into object detection models like Mask R-CNN requires new transfer learning protocols. This paper takes on that task by establishing foundational training techniques that allow standard ViT models to serve as the Mask R-CNN backbone. The authors provide methods to mitigate the large memory demands and the single-scale, non-hierarchical feature maps inherent in plain ViTs.
Methodological Contributions
- ViT as Backbone in Mask R-CNN: The authors adapt a standard Vision Transformer to serve as the backbone of Mask R-CNN, a prevalent model in transfer learning research. To keep memory and compute manageable, most blocks use non-overlapping windowed self-attention, with a small number of global self-attention blocks retained so information can still propagate across windows (a minimal sketch follows this list).
- Transfer Learning Protocol: The paper introduces an evaluation protocol for leveraging pre-trained ViT models in object detection on the COCO dataset. By modernizing Mask R-CNN components and adding simple upsampling and downsampling modules on top of the ViT's single-scale output, they make the backbone compatible with a feature pyramid network (FPN); see the second sketch below.
- Training Formula and Hyperparameter Tuning: A standardized training formula is presented that supports both training from scratch and fine-tuning pre-trained variants. It relies on large-scale jitter (LSJ) augmentation and the AdamW optimizer, with hyperparameters such as learning rate, weight decay, and drop path rate tuned for each setting (see the final sketch below).
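To make the windowed-attention idea concrete, here is a minimal PyTorch sketch of non-overlapping window partitioning wrapped around a standard multi-head attention layer. It assumes the patch grid divides evenly by the window size; the class and argument names are illustrative, not taken from the authors' released code.

```python
import torch
import torch.nn as nn


class WindowedAttention(nn.Module):
    """Self-attention computed independently within non-overlapping windows."""

    def __init__(self, dim: int, num_heads: int, window_size: int):
        super().__init__()
        self.window_size = window_size
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W, C) patch tokens laid out on their 2D grid,
        # with H and W divisible by the window size.
        B, H, W, C = x.shape
        ws = self.window_size
        # Partition the grid into non-overlapping ws x ws windows.
        x = x.reshape(B, H // ws, ws, W // ws, ws, C)
        windows = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)
        # Attention runs per window, so cost grows with the number of
        # windows rather than quadratically with the full token count.
        out, _ = self.attn(windows, windows, windows)
        # Reverse the partition back to the (B, H, W, C) grid.
        out = out.reshape(B, H // ws, W // ws, ws, ws, C)
        out = out.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)
        return out


# Example: a 64 x 64 patch grid (e.g. a 1024 x 1024 image with 16 x 16 patches)
# processed with 8 x 8 windows.
tokens = torch.randn(2, 64, 64, 768)
out = WindowedAttention(dim=768, num_heads=12, window_size=8)(tokens)
```

Because each window has a fixed size, memory use no longer explodes with the high-resolution inputs used for detection; the handful of global blocks mentioned above compensates for the loss of cross-window communication.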
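The second ingredient is converting the single-scale (stride-16) ViT output into a multi-scale pyramid that an FPN can consume. The sketch below shows one plausible way to do this with transposed convolutions for upsampling and pooling for downsampling; the specific layer choices are an assumption for illustration, not a transcription of the paper's modules.

```python
import torch
import torch.nn as nn


class SimpleFeaturePyramid(nn.Module):
    """Produce stride-4/8/16/32 maps from a single stride-16 ViT feature map."""

    def __init__(self, dim: int):
        super().__init__()
        # stride 16 -> 4: two 2x upsampling steps
        self.up4 = nn.Sequential(
            nn.ConvTranspose2d(dim, dim, kernel_size=2, stride=2),
            nn.GELU(),
            nn.ConvTranspose2d(dim, dim, kernel_size=2, stride=2),
        )
        # stride 16 -> 8: one 2x upsampling step
        self.up8 = nn.ConvTranspose2d(dim, dim, kernel_size=2, stride=2)
        # stride 16 -> 32: one 2x downsampling step
        self.down32 = nn.MaxPool2d(kernel_size=2, stride=2)

    def forward(self, x: torch.Tensor) -> dict:
        # x: (B, C, H/16, W/16) final ViT feature map reshaped to a 2D grid.
        return {
            "p2": self.up4(x),     # stride 4
            "p3": self.up8(x),     # stride 8
            "p4": x,               # stride 16
            "p5": self.down32(x),  # stride 32
        }
```

These four maps would then be fed to the usual FPN lateral and output convolutions inside Mask R-CNN.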
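Finally, the training recipe itself amounts to AdamW with a tuned learning rate and weight decay, plus large-scale jitter: resize the image by a random factor, then pad or crop to a fixed canvas. The sketch below shows the shape of such a setup; the hyperparameter values and helper names are placeholders, not the paper's tuned settings.

```python
import random

import torch
import torch.nn.functional as F


def build_optimizer(model: torch.nn.Module,
                    lr: float = 1e-4,
                    weight_decay: float = 0.1) -> torch.optim.Optimizer:
    # AdamW: Adam with decoupled weight decay, as used in the recipe.
    return torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)


def large_scale_jitter(image: torch.Tensor,
                       out_size: int = 1024,
                       scale_range: tuple = (0.1, 2.0)) -> torch.Tensor:
    # image: float tensor (C, H, W). Resize by a random factor, then place
    # the result on a fixed-size zero canvas (cropping if it overflows).
    scale = random.uniform(*scale_range)
    c, h, w = image.shape
    new_h, new_w = max(1, int(h * scale)), max(1, int(w * scale))
    resized = F.interpolate(image[None], size=(new_h, new_w),
                            mode="bilinear", align_corners=False)[0]
    canvas = torch.zeros(c, out_size, out_size, dtype=image.dtype)
    canvas[:, :min(new_h, out_size), :min(new_w, out_size)] = \
        resized[:, :out_size, :out_size]
    # A real detection pipeline would rescale boxes and masks accordingly.
    return canvas
```

Drop path regularization, also mentioned above, lives inside the backbone blocks rather than the optimizer and is omitted here.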
Results and Implications
The empirical analysis reveals that masking-based unsupervised learning methods, namely BEiT and MAE, yield substantial improvements in object detection tasks compared to conventional supervised pre-training on ImageNet or MoCo v3, a contrastive learning framework. Notably, with larger model sizes like ViT-L, these methods drive even greater performance gains, supporting the hypothesis of better scaling properties for masking-based approaches.
Furthermore, the paper highlights that pre-trained Vision Transformers converge significantly faster during fine-tuning, reaching strong results with much shorter training schedules than random initialization, although training from scratch can approach these results when run over substantially longer schedules.
Future Directions
This research lays the groundwork for extending the use of Vision Transformers to even more advanced architectures, such as Swin Transformers and Multiscale Vision Transformers (MViT). By releasing the techniques and code via Detectron2, the authors seek to enable the community to further explore and enhance transfer learning applications utilizing ViT and its derivatives.
Conclusion
The paper bridges a critical gap in transfer learning research by providing robust methodologies for benchmarking and integrating Vision Transformers into object detection frameworks. Through these advancements, it sets a precedent for future investigations using transformers across a variety of vision tasks, extending their applicability beyond traditional CNN-based pipelines.