ViDT: An Efficient and Effective Fully Transformer-based Object Detector (2110.03921v2)

Published 8 Oct 2021 in cs.CV and cs.LG

Abstract: Transformers are transforming the landscape of computer vision, especially for recognition tasks. Detection transformers are the first fully end-to-end learning systems for object detection, while vision transformers are the first fully transformer-based architecture for image classification. In this paper, we integrate Vision and Detection Transformers (ViDT) to build an effective and efficient object detector. ViDT introduces a reconfigured attention module to extend the recent Swin Transformer to be a standalone object detector, followed by a computationally efficient transformer decoder that exploits multi-scale features and auxiliary techniques essential to boost the detection performance without much increase in computational load. Extensive evaluation results on the Microsoft COCO benchmark dataset demonstrate that ViDT obtains the best AP and latency trade-off among existing fully transformer-based object detectors, and achieves 49.2 AP owing to its high scalability for large models. We will release the code and trained models at https://github.com/naver-ai/vidt
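
The architecture described in the abstract can be illustrated with a small, self-contained sketch: learnable [DET] tokens travel through the transformer body together with patch tokens (standing in for the paper's reconfigured attention module on Swin), and a light decoder then refines the [DET] tokens by cross-attending to the features produced at each stage. Everything below is an illustrative assumption rather than the authors' released implementation: the class name, sizes, and the use of plain self-attention in place of Swin's windowed attention are all simplifications.

```python
# Toy sketch (assumptions, not the authors' code): [DET] tokens are processed
# jointly with patch tokens in the body, then a light decoder cross-attends
# the [DET] tokens to per-stage features to predict classes and boxes.
import torch
import torch.nn as nn

class ToyViDT(nn.Module):
    def __init__(self, dim=96, num_det_tokens=100, num_classes=91, num_stages=3):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=4, stride=4)
        self.det_tokens = nn.Parameter(torch.randn(1, num_det_tokens, dim))
        # Body: plain self-attention blocks stand in for Swin blocks with the
        # paper's reconfigured attention; patch and [DET] tokens attend jointly.
        self.body = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
             for _ in range(num_stages)]
        )
        # Light decoder: one cross-attention per stage. The real model uses
        # multi-scale (downsampled) features; resolution is kept constant here
        # only to keep the sketch short.
        self.decoder = nn.ModuleList(
            [nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
             for _ in range(num_stages)]
        )
        self.class_head = nn.Linear(dim, num_classes)
        self.box_head = nn.Linear(dim, 4)

    def forward(self, images):                                   # (B, 3, H, W)
        x = self.patch_embed(images).flatten(2).transpose(1, 2)  # (B, N, dim)
        det = self.det_tokens.expand(images.size(0), -1, -1)     # (B, D, dim)
        stage_feats = []
        for block in self.body:
            tokens = block(torch.cat([x, det], dim=1))           # joint attention
            x, det = tokens[:, : x.size(1)], tokens[:, x.size(1):]
            stage_feats.append(x)
        for attn, feats in zip(self.decoder, stage_feats):
            det = det + attn(det, feats, feats)[0]               # cross-attention
        return self.class_head(det), self.box_head(det).sigmoid()

logits, boxes = ToyViDT()(torch.randn(2, 3, 64, 64))
print(logits.shape, boxes.shape)  # (2, 100, 91) and (2, 100, 4)
```

The intent conveyed by the abstract is that most computation stays in the backbone-turned-detector while the decoder remains cheap, which is where the claimed AP/latency trade-off comes from.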

Authors (8)
  1. Hwanjun Song (44 papers)
  2. Deqing Sun (68 papers)
  3. Sanghyuk Chun (49 papers)
  4. Varun Jampani (125 papers)
  5. Dongyoon Han (49 papers)
  6. Byeongho Heo (33 papers)
  7. Wonjae Kim (25 papers)
  8. Ming-Hsuan Yang (377 papers)
Citations (68)