- The paper introduces a novel sequence-to-sequence formulation for object detection that uses a pure Vision Transformer architecture.
- The model, YOLOS, dispenses with CNN backbones entirely: it replaces ViT's classification token with learnable detection tokens and trains with a bipartite matching loss, achieving 42.0 box AP on the COCO benchmark.
- The approach highlights the transferability of pre-trained Vision Transformers, suggesting a streamlined alternative to traditional CNN-based detectors.
In the paper titled "You Only Look at One Sequence: Rethinking Transformer in Vision through Object Detection," the authors propose a novel approach to object detection, leveraging the capabilities of the Vision Transformer (ViT). This research challenges the traditional reliance on convolutional neural networks (CNNs) for dense prediction tasks, offering a streamlined Transformer-based model for object detection, termed YOLOS.
Key Contributions
The paper introduces several pivotal contributions to the computer vision field:
- Pure Sequence-to-Sequence Problem Formulation: The authors explore whether 2D object detection can be performed as a pure sequence-to-sequence task, injecting as few of the 2D spatial priors typically embedded in CNN architectures as possible. The image is treated as a sequence of patch tokens, carrying the Transformer design that originated in natural language processing over to visual data.
- YOLOS Model Architecture: YOLOS follows the canonical Vision Transformer architecture with deliberate simplicity. It replaces the single classification ([CLS]) token used in ViT with a set of learnable detection ([DET]) tokens and swaps the image classification loss for a bipartite matching loss over set predictions. This setup avoids reinterpreting the Transformer output as 2D feature maps, maintaining a purely sequence-based processing paradigm (a minimal sketch of the detection-token forward pass and of the matching step follows this list).
- Pre-training and Transferability: The research demonstrates that pre-training on the mid-sized ImageNet-1k dataset alone is sufficient to achieve competitive results on the COCO benchmark. Fine-tuned from an ImageNet-1k pre-trained ViT, YOLOS reaches 42.0 box AP, showing that the architecture can handle complex object detection with minimal task-specific adjustments.
- Benchmarking Pre-training Strategies: YOLOS serves as a testbed for comparing supervised and self-supervised pre-training strategies, highlighting their impact on downstream vision tasks, particularly object detection. The findings reveal clear quantitative and qualitative differences in transfer performance across pre-training configurations.
- Scaling Analysis: The paper investigates different scaling methodologies for YOLOS, such as width scaling and compound scaling, to evaluate their effects on both pre-training and transfer learning performance. The analysis provides insights into how traditional CNN scaling strategies translate to Transformer-based models.
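To make the detection-token idea concrete, the sketch below shows a minimal, simplified YOLOS-style forward pass in PyTorch. It illustrates the mechanism described in the paper rather than the authors' implementation; the module names, embedding width, number of detection tokens, and head sizes are assumptions chosen for readability.

```python
import torch
import torch.nn as nn

class MiniYOLOS(nn.Module):
    """Illustrative sketch: ViT encoder + learnable [DET] tokens + prediction heads."""
    def __init__(self, img_size=224, patch=16, dim=192, depth=6, heads=3,
                 num_det_tokens=100, num_classes=91):
        super().__init__()
        num_patches = (img_size // patch) ** 2
        # Patch embedding: flatten non-overlapping patches into a token sequence.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        # Learnable detection tokens take the place of ViT's single [CLS] token.
        self.det_tokens = nn.Parameter(torch.zeros(1, num_det_tokens, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + num_det_tokens, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        # Each [DET] token is decoded into a class distribution and a box (cx, cy, w, h).
        self.class_head = nn.Linear(dim, num_classes + 1)   # +1 for "no object"
        self.box_head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                      nn.Linear(dim, 4), nn.Sigmoid())

    def forward(self, images):                                    # images: (B, 3, H, W)
        x = self.patch_embed(images).flatten(2).transpose(1, 2)   # (B, N, dim)
        det = self.det_tokens.expand(x.size(0), -1, -1)
        x = torch.cat([x, det], dim=1) + self.pos_embed           # patch + [DET] sequence
        x = self.encoder(x)
        det_out = x[:, -self.det_tokens.size(1):]                 # keep only [DET] outputs
        return self.class_head(det_out), self.box_head(det_out)

logits, boxes = MiniYOLOS()(torch.randn(2, 3, 224, 224))
print(logits.shape, boxes.shape)  # (2, 100, 92), (2, 100, 4)
```

The key point is that detection emerges from extra tokens in the input sequence: no feature pyramid, region proposals, or 2D reinterpretation of the output is involved.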
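The bipartite matching loss pairs each ground-truth object with exactly one [DET] prediction before classification and box losses are computed, in the spirit of DETR. The snippet below is a simplified illustration of that matching step using scipy's Hungarian solver; the cost weights and the L1-only box cost are assumptions kept for brevity (the full set-prediction loss also involves a generalized IoU term).

```python
import torch
from scipy.optimize import linear_sum_assignment

@torch.no_grad()
def hungarian_match(pred_logits, pred_boxes, gt_labels, gt_boxes,
                    cls_weight=1.0, box_weight=5.0):
    """Match ground-truth objects to predictions for a single image.

    pred_logits: (num_queries, num_classes + 1), pred_boxes: (num_queries, 4)
    gt_labels:   (num_objects,),                 gt_boxes:  (num_objects, 4)
    Returns (pred_indices, gt_indices) defining the optimal one-to-one assignment.
    """
    prob = pred_logits.softmax(-1)                     # per-query class probabilities
    # Classification cost: negative probability assigned to each true class.
    cost_cls = -prob[:, gt_labels]                     # (num_queries, num_objects)
    # Box cost: L1 distance between predicted and ground-truth boxes.
    cost_box = torch.cdist(pred_boxes, gt_boxes, p=1)  # (num_queries, num_objects)
    cost = cls_weight * cost_cls + box_weight * cost_box
    pred_idx, gt_idx = linear_sum_assignment(cost.cpu().numpy())
    return torch.as_tensor(pred_idx), torch.as_tensor(gt_idx)

# Toy usage: 100 [DET] predictions, 3 ground-truth objects.
logits, boxes = torch.randn(100, 92), torch.rand(100, 4)
labels, gts = torch.tensor([3, 17, 42]), torch.rand(3, 4)
print(hungarian_match(logits, boxes, labels, gts))
```

Unmatched predictions are supervised toward the "no object" class, which is what lets the model produce a variable number of detections from a fixed-length token set.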
Results and Implications
The results from the YOLOS experiments indicate a promising alternative to CNN-based detectors: even the smaller YOLOS variants are competitive with well-established, highly optimized CNN detectors of comparable size, suggesting a shift in how vision tasks might be approached in future research.
Furthermore, YOLOS highlights the adaptability of pre-trained Vision Transformers to object-level recognition tasks without heavy reliance on 2D spatial inductive biases. This flexibility underscores the versatility of the Transformer architecture across diverse task domains, aligning more closely with NLP methodologies.
Future Directions
The paper suggests several future research trajectories:
- Enhanced Pre-training Strategies: Further exploration into self-supervised methods could enhance the transferability of ViT to complex visual tasks. This aligns with broader trends in AI towards reducing the dependency on large labeled datasets.
- Exploration of Diverse Task-Specific Modifications: While YOLOS advocates for minimal modifications, examining task-specific adaptations could potentially improve performance without compromising the sequence-to-sequence processing advantage.
- Model Scaling and Efficiency Improvements: Understanding the trade-offs in model scaling, especially concerning computational efficiency and performance balance, remains critical for practical deployments of YOLOS.
In sum, the paper presents an insightful rethinking of Transformer application in vision, emphasizing a streamlined approach that leverages the transfer learning potential and adaptability of pre-trained models. Through YOLOS, the authors advocate for more generalized architectures that bridge NLP achievements with vision-centric challenges.