A Simple Single-Scale Vision Transformer for Object Localization and Instance Segmentation (2112.09747v3)

Published 17 Dec 2021 in cs.CV

Abstract: This work presents a simple vision transformer design as a strong baseline for object localization and instance segmentation tasks. Transformers have recently demonstrated competitive performance in image classification tasks. To adapt ViT to object detection and dense prediction tasks, many works inherit the multistage design from convolutional networks and use highly customized ViT architectures. Behind this design, the goal is to pursue a better trade-off between computational cost and effective aggregation of multiscale global contexts. However, existing works adopt the multistage architectural design as a black-box solution without a clear understanding of its true benefits. In this paper, we comprehensively study three architecture design choices on ViT -- spatial reduction, doubled channels, and multiscale features -- and demonstrate that a vanilla ViT architecture can fulfill this goal without handcrafting multiscale features, maintaining the original ViT design philosophy. We further derive a scaling rule to optimize our model's trade-off between accuracy and computation cost / model size. By leveraging a constant feature resolution and hidden size throughout the encoder blocks, we propose a simple and compact ViT architecture called Universal Vision Transformer (UViT) that achieves strong performance on COCO object detection and instance segmentation tasks.
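The key architectural property the abstract describes is that a vanilla, single-scale ViT keeps the token count (feature resolution) and hidden size constant through every encoder block, rather than downsampling and doubling channels stage by stage. The sketch below illustrates that shape-preserving property with a minimal single-head transformer encoder in plain NumPy; all dimensions and initialization are hypothetical choices for illustration, not the paper's actual UViT configuration.

```python
# Minimal sketch of a single-scale ViT encoder stack (hypothetical sizes,
# single-head attention, no positional embeddings) -- not the authors' code.
import numpy as np

def layer_norm(x, eps=1e-6):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def attention(x, wq, wk, wv, wo):
    # Self-attention over all tokens; output shape equals input shape.
    q, k, v = x @ wq, x @ wk, x @ wv
    a = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return (a @ v) @ wo

def mlp(x, w1, w2):
    return np.maximum(x @ w1, 0) @ w2  # ReLU MLP for simplicity

def encoder_block(x, p):
    x = x + attention(layer_norm(x), *p["attn"])
    x = x + mlp(layer_norm(x), *p["mlp"])
    return x  # shape unchanged: constant resolution and hidden size

rng = np.random.default_rng(0)
d, n_tokens, depth = 64, 196, 4  # e.g. 14x14 patch grid (hypothetical)
params = [{"attn": [rng.standard_normal((d, d)) * 0.02 for _ in range(4)],
           "mlp": [rng.standard_normal((d, 4 * d)) * 0.02,
                   rng.standard_normal((4 * d, d)) * 0.02]}
          for _ in range(depth)]

x = rng.standard_normal((n_tokens, d))
for p in params:
    x = encoder_block(x, p)
print(x.shape)  # every block preserves (n_tokens, d)
```

Because every block maps `(n_tokens, d)` to `(n_tokens, d)`, a detection or segmentation head can read the same single-scale feature map from any depth, which is the design philosophy the paper contrasts with multistage, channel-doubling pyramids.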

Authors (11)
  1. Wuyang Chen
  2. Xianzhi Du
  3. Fan Yang
  4. Lucas Beyer
  5. Xiaohua Zhai
  6. Tsung-Yi Lin
  7. Huizhong Chen
  8. Jing Li
  9. Xiaodan Song
  10. Zhangyang Wang
  11. Denny Zhou
Citations (16)