Toward Transformer-Based Object Detection: A Comprehensive Assessment
The paper "Toward Transformer-Based Object Detection" explores the adaptation of transformer architectures, which have been prolific in NLP, for the task of object detection in computer vision. While the Vision Transformer (ViT) has shown promise with image classification, its efficacy in more spatially-dependent tasks such as object detection remains under investigation.
Vision Transformer for Object Detection
The authors propose ViT-FRCNN, a model that pairs a transformer-based backbone with detection-specific task heads, demonstrating the feasibility of transformers for object detection. The model reinterprets the outputs of the Vision Transformer's final layer as spatial feature maps and attaches a region proposal network and detection heads in the style of Faster R-CNN. ViT-FRCNN achieves competitive performance on the COCO detection benchmark, making it a noteworthy step toward pure-transformer models for complex vision tasks.
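To make this arrangement concrete, the sketch below (not the authors' code) shows the core idea in PyTorch: a ViT-style backbone whose final-layer patch tokens are reshaped into a 2D feature map and handed to torchvision's Faster R-CNN region proposal network and detection heads. The layer sizes, patch size, anchor settings, and class count are illustrative assumptions, and the learned positional embeddings of a real ViT are omitted for brevity.

import torch
from torch import nn
from torchvision.models.detection import FasterRCNN
from torchvision.models.detection.rpn import AnchorGenerator
from torchvision.ops import MultiScaleRoIAlign


class ViTBackbone(nn.Module):
    """ViT-style encoder that returns a 2D feature map instead of a class token."""

    def __init__(self, patch_size=16, embed_dim=384, depth=6, num_heads=6):
        super().__init__()
        # Patch embedding: a strided convolution splits the image into patches.
        self.patch_embed = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.out_channels = embed_dim  # required by torchvision's FasterRCNN wrapper

    def forward(self, x):
        feat = self.patch_embed(x)                # [B, C, H/p, W/p]
        b, c, h, w = feat.shape
        tokens = feat.flatten(2).transpose(1, 2)  # [B, num_patches, C]
        tokens = self.encoder(tokens)             # global self-attention over all patches
        # Key step described in the paper: reinterpret the final-layer patch
        # outputs as a spatial feature map for the detection heads.
        return tokens.transpose(1, 2).reshape(b, c, h, w)


backbone = ViTBackbone()
anchor_gen = AnchorGenerator(sizes=((64, 128, 256, 512),),
                             aspect_ratios=((0.5, 1.0, 2.0),))
roi_pool = MultiScaleRoIAlign(featmap_names=["0"], output_size=7, sampling_ratio=2)
model = FasterRCNN(backbone, num_classes=91, min_size=512, max_size=512,
                   rpn_anchor_generator=anchor_gen, box_roi_pool=roi_pool)

model.eval()
with torch.no_grad():
    detections = model([torch.rand(3, 512, 512)])  # list of dicts: boxes, labels, scores

This is only a structural sketch; the paper's actual model starts from a pretrained ViT backbone rather than training the encoder from scratch.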
Methodology and Experimental Insights
The methodology constructs spatial feature maps from the patch-wise outputs of the Vision Transformer, relying on the observation that these outputs retain spatial correspondence with their input patches despite the global attention mechanism. ViT-FRCNN benefits from increased input resolution, and the analysis shows that finer patches (e.g., 16×16 pixels) yield better detection results than larger ones (32×32 pixels), underscoring the importance of preserving fine spatial detail for object detection.
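As a rough illustration of why patch size matters (the input resolution here is chosen for illustration, not taken from the paper's tables), halving the patch edge quadruples the density of the resulting feature map:

def grid_size(height, width, patch):
    # Number of patch tokens along each axis for a given patch size.
    return height // patch, width // patch

for patch in (16, 32):
    h, w = grid_size(800, 1280, patch)
    print(f"{patch}x{patch} patches -> {h}x{w} feature map ({h * w} patch tokens)")
# 16x16 patches -> 50x80 feature map (4000 patch tokens)
# 32x32 patches -> 25x40 feature map (1000 patch tokens)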
Pretraining and Model Efficiency
Pretraining on large-scale datasets yields substantial gains in detection accuracy. Moving from ImageNet-21k to the larger Annotations-1.3B dataset improves performance by 1.2-1.6 AP points, reflecting the value of large, diverse pretraining corpora for downstream tasks. Although ViT-FRCNN does not reach state-of-the-art results on COCO, it highlights the potential of transformer models to convert expansive pretraining into effective task-specific adaptation.
Implications and Future Directions
The evaluation extends to out-of-domain settings, where ViT-FRCNN outperforms a comparable CNN-based model (ResNet-FRCNN) on ObjectNet-D, a benchmark with controlled biases, suggesting improved generalization that stems in part from the model's pretraining.
Moreover, the transformer-based architecture shows a reduced incidence of spurious overdetections compared with its convolutional counterpart. This behavior may stem from the transformer's capacity for global feature contextualization: each patch attends to the whole image, which helps suppress duplicate detections of the same object.
Concluding Perspectives
Ultimately, ViT-FRCNN demonstrates the adaptability of transformers to computer vision. While refinements to the architecture and pretraining strategy are needed to realize its full potential, the findings encourage further exploration of pure-transformer solutions for complex, spatially demanding vision tasks and suggest that attention-based methods can increasingly challenge traditional convolutional approaches.