Exploring Plain Vision Transformer Backbones for Object Detection
The paper, "Exploring Plain Vision Transformer Backbones for Object Detection," presents research on leveraging the vanilla Vision Transformer (ViT) architecture as a backbone for object detection tasks. This work diverges from the prevalent hierarchical designs, notably ConvNets and their modern Transformer counterparts like Swin and MViT, by adopting a plain, non-hierarchical approach.
Key Insights
The paper challenges the necessity of hierarchical structures in vision backbones for object detection by demonstrating that simple adaptations can achieve competitive performance. Notably, two surprising findings emerged:
- Simple Feature Pyramid: The researchers show that it is sufficient to build a simple feature pyramid from a single-scale feature map (the ViT's final output), without the multi-scale, top-down feature fusion of the classic FPN design used with hierarchical backbones; a sketch of this construction follows this list.
- Window Attention: They further show that window attention without shifting, aided by a small number of cross-window propagation blocks, suffices for effective feature extraction at the high resolutions used in detection.
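To make the first finding concrete, below is a minimal PyTorch sketch of such a pyramid, assuming a ViT backbone that emits a single feature map at 1/16 scale. The module name `SimpleFeaturePyramid`, the channel widths, and the exact layer choices are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class SimpleFeaturePyramid(nn.Module):
    """Builds multi-scale maps (strides 4, 8, 16, 32) from the single
    stride-16 feature map of a plain ViT. Layer choices are illustrative."""

    def __init__(self, dim: int, out_dim: int = 256):
        super().__init__()
        # stride 16 -> 4: two stride-2 deconvolutions (4x upsampling)
        self.to_s4 = nn.Sequential(
            nn.ConvTranspose2d(dim, dim // 2, kernel_size=2, stride=2),
            nn.GELU(),
            nn.ConvTranspose2d(dim // 2, dim // 4, kernel_size=2, stride=2),
        )
        # stride 16 -> 8: one stride-2 deconvolution (2x upsampling)
        self.to_s8 = nn.ConvTranspose2d(dim, dim // 2, kernel_size=2, stride=2)
        # stride 16 -> 16: identity
        self.to_s16 = nn.Identity()
        # stride 16 -> 32: stride-2 pooling (2x downsampling)
        self.to_s32 = nn.MaxPool2d(kernel_size=2, stride=2)
        # 1x1 convolutions projecting every level to a common channel width
        self.proj = nn.ModuleList(
            nn.Conv2d(c, out_dim, kernel_size=1)
            for c in (dim // 4, dim // 2, dim, dim)
        )

    def forward(self, x: torch.Tensor):
        # x: (B, dim, H/16, W/16), the ViT's last feature map
        feats = [self.to_s4(x), self.to_s8(x), self.to_s16(x), self.to_s32(x)]
        return [proj(f) for proj, f in zip(self.proj, feats)]
```

For a ViT-B backbone (`dim=768`), this yields four maps at strides 4, 8, 16, and 32 that can feed a standard detection head such as Mask R-CNN's.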
Methodological Approach
The core focus of the research lies in adapting the plain ViT architecture for object detection while leaving its pretraining design untouched; all adaptations are applied only during fine-tuning. This involves:
- Feature Pyramid Construction: The paper builds a simple feature pyramid by applying a few convolutions and deconvolutions (transposed convolutions) to the last feature map of the ViT, bypassing the need for a hierarchically organized backbone.
- Backbone Adaptation: During fine-tuning, most blocks use non-overlapping windowed attention, with a few evenly spaced blocks that propagate information across windows via global attention or convolutions, keeping computation tractable at large input sizes; see the sketch after this list.
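The following is a minimal sketch of this adaptation, again under stated assumptions: the block below is a schematic stand-in for a full pre-norm ViT block (the MLP sub-block and positional details are omitted), and the name `ViTDetBlock` is hypothetical.

```python
import torch
import torch.nn as nn

def window_partition(x: torch.Tensor, ws: int) -> torch.Tensor:
    """Split (B, H, W, C) into non-overlapping (B*num_windows, ws*ws, C)
    windows; assumes H and W are divisible by the window size ws."""
    B, H, W, C = x.shape
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)

def window_unpartition(w: torch.Tensor, ws: int, H: int, W: int) -> torch.Tensor:
    """Inverse of window_partition, back to (B, H, W, C)."""
    B = w.shape[0] // ((H // ws) * (W // ws))
    x = w.reshape(B, H // ws, W // ws, ws, ws, -1)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, -1)

class ViTDetBlock(nn.Module):
    """One attention block: windowed by default, global if requested.
    A schematic stand-in for a full pre-norm ViT block (the MLP
    sub-block is omitted for brevity)."""

    def __init__(self, dim: int, heads: int, window_size: int, global_attn: bool = False):
        super().__init__()
        self.ws, self.global_attn = window_size, global_attn
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W, C) feature map in channels-last layout
        B, H, W, C = x.shape
        t = self.norm(x)
        if self.global_attn:
            t = t.reshape(B, H * W, C)           # attend across the whole map
            t, _ = self.attn(t, t, t)
            t = t.reshape(B, H, W, C)
        else:
            t = window_partition(t, self.ws)     # windows are never shifted
            t, _ = self.attn(t, t, t)
            t = window_unpartition(t, self.ws, H, W)
        return x + t                             # residual connection
```

The paper's default for ViT-B places 4 evenly spaced global-attention blocks among the 12; under that scheme one might construct `[ViTDetBlock(768, 12, 14, global_attn=(i + 1) % 3 == 0) for i in range(12)]`, where window size 14 matches the 14×14 patch grid of 224-pixel pretraining.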
Numerical Results
The plain-backbone detector, termed ViTDet, is competitive with state-of-the-art hierarchical-backbone detectors. On the COCO dataset, ViTDet reaches up to 61.3 box AP using only ImageNet-1K pretraining with no labels (via masked autoencoder pretraining). These results underscore the effectiveness of a plain-backbone approach in a domain traditionally dominated by hierarchical designs.
Implications and Future Directions
This research opens avenues for further exploration into plain-backbone architectures, suggesting a shift toward backbone designs with fewer inductive biases. By decoupling pretraining from downstream applications, the model architecture can remain largely unchanged across problem domains, mirroring trends in NLP pretraining frameworks such as BERT and GPT.
Furthermore, the paper reports that masked autoencoder (MAE) pretraining substantially boosts performance, particularly for plain backbones. This could motivate further inquiry into self-supervised learning approaches applicable to both plain and hierarchical models; a toy sketch of the MAE objective follows.
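As an illustration of the idea (not the paper's implementation), here is a toy MAE-style objective in PyTorch: mask a large fraction of patch tokens, encode only the visible ones, and regress the pixels of the masked patches. All sizes, the `TinyMAE` name, and the omission of positional embeddings are simplifying assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMAE(nn.Module):
    """Toy MAE-style objective: encode only visible patch tokens, then
    reconstruct the pixels of the masked patches. Positional embeddings
    and token unshuffling are omitted to keep the sketch short."""

    def __init__(self, dim: int = 768, mask_ratio: float = 0.75, patch_pixels: int = 768):
        super().__init__()
        self.mask_ratio = mask_ratio
        layer = nn.TransformerEncoderLayer(dim, nhead=12, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)      # sees visible tokens only
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        dec_layer = nn.TransformerEncoderLayer(dim, nhead=12, batch_first=True)
        self.decoder = nn.TransformerEncoder(dec_layer, num_layers=1)  # lightweight decoder
        self.head = nn.Linear(dim, patch_pixels)                       # predict raw patch pixels

    def forward(self, tokens: torch.Tensor, pixels: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, dim) patch embeddings; pixels: (B, N, patch_pixels) targets
        B, N, D = tokens.shape
        keep = int(N * (1 - self.mask_ratio))
        perm = torch.rand(B, N, device=tokens.device).argsort(dim=1)   # random mask per image
        vis_idx, msk_idx = perm[:, :keep], perm[:, keep:]
        visible = torch.gather(tokens, 1, vis_idx.unsqueeze(-1).expand(-1, -1, D))
        latent = self.encoder(visible)
        mask_tokens = self.mask_token.expand(B, N - keep, D)
        decoded = self.decoder(torch.cat([latent, mask_tokens], dim=1))
        pred = self.head(decoded[:, keep:])                            # masked positions only
        target = torch.gather(pixels, 1, msk_idx.unsqueeze(-1).expand(-1, -1, pixels.shape[-1]))
        return F.mse_loss(pred, target)                                # loss on masked patches
```

The key asymmetry, an encoder that sees only visible patches paired with a lightweight decoder, is what keeps MAE pretraining cheap at high mask ratios.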
Conclusion
In essence, this research encourages a reevaluation of whether hierarchical architectures are necessary for vision tasks, presenting plain ViT backbones as a viable alternative that achieves competitive performance with simple modifications. Future research could investigate improved block designs and the selective integration of inductive biases tailored for detection. Such developments could extend the applicability of plain-backbone detection across diverse visual recognition challenges.