Exploring Plain Vision Transformer Backbones for Object Detection
The paper, "Exploring Plain Vision Transformer Backbones for Object Detection," presents research on leveraging the vanilla Vision Transformer (ViT) architecture as a backbone for object detection tasks. This work diverges from the prevalent hierarchical designs, notably ConvNets and their modern Transformer counterparts like Swin and MViT, by adopting a plain, non-hierarchical approach.
Key Insights
The paper challenges the necessity of hierarchical structures in vision backbones for object detection by demonstrating that simple adaptations can achieve competitive performance. Notably, two surprising findings emerged:
- Simple Feature Pyramid: The researchers show that it is sufficient to build a simple feature pyramid from a single-scale feature map (the ViT's final output), without the multi-scale, top-down feature fusion of the classic FPN design used with hierarchical backbones; a sketch of this construction follows this list.
- Window Attention: They further show that window attention without shifting, aided by a small number of cross-window propagation blocks, suffices for effective feature extraction at the high resolutions used in detection.
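To make the first finding concrete, below is a minimal PyTorch sketch of such a pyramid, assuming a ViT backbone that emits a single feature map at 1/16 scale. The module name `SimpleFeaturePyramid`, the channel widths, and the exact layer choices are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class SimpleFeaturePyramid(nn.Module):
    """Builds multi-scale maps (strides 4, 8, 16, 32) from the single
    stride-16 feature map of a plain ViT. Layer choices are illustrative."""

    def __init__(self, dim: int, out_dim: int = 256):
        super().__init__()
        # stride 16 -> 4: two stride-2 deconvolutions (4x upsampling)
        self.to_s4 = nn.Sequential(
            nn.ConvTranspose2d(dim, dim // 2, kernel_size=2, stride=2),
            nn.GELU(),
            nn.ConvTranspose2d(dim // 2, dim // 4, kernel_size=2, stride=2),
        )
        # stride 16 -> 8: one stride-2 deconvolution (2x upsampling)
        self.to_s8 = nn.ConvTranspose2d(dim, dim // 2, kernel_size=2, stride=2)
        # stride 16 -> 16: identity
        self.to_s16 = nn.Identity()
        # stride 16 -> 32: stride-2 pooling (2x downsampling)
        self.to_s32 = nn.MaxPool2d(kernel_size=2, stride=2)
        # 1x1 convolutions projecting every level to a common channel width
        self.proj = nn.ModuleList(
            nn.Conv2d(c, out_dim, kernel_size=1)
            for c in (dim // 4, dim // 2, dim, dim)
        )

    def forward(self, x: torch.Tensor):
        # x: (B, dim, H/16, W/16), the ViT's last feature map
        feats = [self.to_s4(x), self.to_s8(x), self.to_s16(x), self.to_s32(x)]
        return [proj(f) for proj, f in zip(self.proj, feats)]
```

For a ViT-B backbone (`dim=768`), this yields four maps at strides 4, 8, 16, and 32 that can feed a standard detection head such as Mask R-CNN's.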
Methodological Approach
The core focus of the research lies in adapting the plain ViT architecture for object detection while leaving its pretraining design untouched; all adaptations are applied only during fine-tuning. This involves:
- Feature Pyramid Construction: The paper builds a simple feature pyramid by applying a few convolutions and deconvolutions (transposed convolutions) to the last feature map of the ViT, bypassing the need for a hierarchically organized backbone.
- Backbone Adaptation: During fine-tuning, most blocks use non-overlapping windowed attention, with a few evenly spaced blocks that propagate information across windows via global attention or convolutions, keeping computation tractable at large input sizes; see the sketch after this list.
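The following is a minimal sketch of this adaptation, again under stated assumptions: the block below is a schematic stand-in for a full pre-norm ViT block (the MLP sub-block and positional details are omitted), and the name `ViTDetBlock` is hypothetical.

```python
import torch
import torch.nn as nn

def window_partition(x: torch.Tensor, ws: int) -> torch.Tensor:
    """Split (B, H, W, C) into non-overlapping (B*num_windows, ws*ws, C)
    windows; assumes H and W are divisible by the window size ws."""
    B, H, W, C = x.shape
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)

def window_unpartition(w: torch.Tensor, ws: int, H: int, W: int) -> torch.Tensor:
    """Inverse of window_partition, back to (B, H, W, C)."""
    B = w.shape[0] // ((H // ws) * (W // ws))
    x = w.reshape(B, H // ws, W // ws, ws, ws, -1)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, -1)

class ViTDetBlock(nn.Module):
    """One attention block: windowed by default, global if requested.
    A schematic stand-in for a full pre-norm ViT block (the MLP
    sub-block is omitted for brevity)."""

    def __init__(self, dim: int, heads: int, window_size: int, global_attn: bool = False):
        super().__init__()
        self.ws, self.global_attn = window_size, global_attn
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W, C) feature map in channels-last layout
        B, H, W, C = x.shape
        t = self.norm(x)
        if self.global_attn:
            t = t.reshape(B, H * W, C)           # attend across the whole map
            t, _ = self.attn(t, t, t)
            t = t.reshape(B, H, W, C)
        else:
            t = window_partition(t, self.ws)     # windows are never shifted
            t, _ = self.attn(t, t, t)
            t = window_unpartition(t, self.ws, H, W)
        return x + t                             # residual connection
```

The paper's default for ViT-B places 4 evenly spaced global-attention blocks among the 12; under that scheme one might construct `[ViTDetBlock(768, 12, 14, global_attn=(i + 1) % 3 == 0) for i in range(12)]`, where window size 14 matches the 14×14 patch grid of 224-pixel pretraining.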
Numerical Results
The plain-backbone detector, termed ViTDet, is competitive with state-of-the-art hierarchical-backbone detectors. On the COCO dataset, ViTDet reaches up to 61.3 box AP using only ImageNet-1K pretraining with no labels (via masked autoencoder pretraining). These results underscore the effectiveness of a plain-backbone approach in a domain traditionally dominated by hierarchical designs.
Implications and Future Directions
This research opens avenues for further exploration into plain-backbone architectures, suggesting a shift toward backbone designs with fewer inductive biases. By decoupling pretraining from downstream applications, the model architecture can remain largely unchanged across problem domains, mirroring trends in NLP pretraining frameworks such as BERT and GPT.
Furthermore, the paper reports that masked autoencoder (MAE) pretraining substantially boosts performance, particularly for plain backbones. This could motivate further inquiry into self-supervised learning approaches applicable to both plain and hierarchical models; a toy sketch of the MAE objective follows.
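As an illustration of the idea (not the paper's implementation), here is a toy MAE-style objective in PyTorch: mask a large fraction of patch tokens, encode only the visible ones, and regress the pixels of the masked patches. All sizes, the `TinyMAE` name, and the omission of positional embeddings are simplifying assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMAE(nn.Module):
    """Toy MAE-style objective: encode only visible patch tokens, then
    reconstruct the pixels of the masked patches. Positional embeddings
    and token unshuffling are omitted to keep the sketch short."""

    def __init__(self, dim: int = 768, mask_ratio: float = 0.75, patch_pixels: int = 768):
        super().__init__()
        self.mask_ratio = mask_ratio
        layer = nn.TransformerEncoderLayer(dim, nhead=12, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)      # sees visible tokens only
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        dec_layer = nn.TransformerEncoderLayer(dim, nhead=12, batch_first=True)
        self.decoder = nn.TransformerEncoder(dec_layer, num_layers=1)  # lightweight decoder
        self.head = nn.Linear(dim, patch_pixels)                       # predict raw patch pixels

    def forward(self, tokens: torch.Tensor, pixels: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, dim) patch embeddings; pixels: (B, N, patch_pixels) targets
        B, N, D = tokens.shape
        keep = int(N * (1 - self.mask_ratio))
        perm = torch.rand(B, N, device=tokens.device).argsort(dim=1)   # random mask per image
        vis_idx, msk_idx = perm[:, :keep], perm[:, keep:]
        visible = torch.gather(tokens, 1, vis_idx.unsqueeze(-1).expand(-1, -1, D))
        latent = self.encoder(visible)
        mask_tokens = self.mask_token.expand(B, N - keep, D)
        decoded = self.decoder(torch.cat([latent, mask_tokens], dim=1))
        pred = self.head(decoded[:, keep:])                            # masked positions only
        target = torch.gather(pixels, 1, msk_idx.unsqueeze(-1).expand(-1, -1, pixels.shape[-1]))
        return F.mse_loss(pred, target)                                # loss on masked patches
```

The key asymmetry, an encoder that sees only visible patches paired with a lightweight decoder, is what keeps MAE pretraining cheap at high mask ratios.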
Conclusion
In essence, this research encourages a reevaluation of whether hierarchical architectures are necessary for vision tasks, presenting plain ViT backbones as a viable alternative that achieves competitive performance with simple modifications. Future research could investigate improved block designs and the selective integration of inductive biases tailored for detection. Such developments could extend the applicability of plain-backbone detection across diverse visual recognition challenges.