
ScalableViT: Rethinking the Context-oriented Generalization of Vision Transformer (2203.10790v2)

Published 21 Mar 2022 in cs.CV and cs.AI

Abstract: The vanilla self-attention mechanism inherently relies on pre-defined and fixed computational dimensions. Such inflexibility prevents it from achieving context-oriented generalization that could bring more contextual cues and global representations. To mitigate this issue, we propose a Scalable Self-Attention (SSA) mechanism that leverages two scaling factors to release the dimensions of the query, key, and value matrices while decoupling them from the input. This scalability yields context-oriented generalization and enhances object sensitivity, pushing the whole network into a more effective trade-off between accuracy and cost. Furthermore, we propose an Interactive Window-based Self-Attention (IWSA), which establishes interaction between non-overlapping regions by re-merging independent value tokens and aggregating spatial information from adjacent windows. By stacking SSA and IWSA alternately, the Scalable Vision Transformer (ScalableViT) achieves state-of-the-art performance in general-purpose vision tasks. For example, ScalableViT-S outperforms Twins-SVT-S by 1.4% and Swin-T by 1.8% on ImageNet-1K classification.
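The SSA idea described above — scaling the spatial dimension of the key/value tokens and the channel dimension of the value matrix independently of the input — can be sketched as follows. This is a minimal illustration, not the authors' implementation: the module name, the use of average pooling for spatial reduction, and the factor names `r_n` (spatial) and `r_c` (channel) are assumptions for clarity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScalableSelfAttentionSketch(nn.Module):
    """Illustrative sketch of Scalable Self-Attention (SSA).

    r_n: assumed spatial scaling factor on key/value tokens.
    r_c: assumed channel scaling factor on the value matrix.
    Average pooling stands in for whatever reduction the paper uses.
    """

    def __init__(self, dim, num_heads=8, r_n=0.25, r_c=0.5):
        super().__init__()
        self.num_heads = num_heads
        # Scaled value/channel dimension, kept divisible by the head count.
        self.inner_c = max(num_heads, int(dim * r_c) // num_heads * num_heads)
        self.r_n = r_n
        self.scale = (dim // num_heads) ** -0.5
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, self.inner_c)   # value channels are "released"
        self.proj = nn.Linear(self.inner_c, dim)

    def forward(self, x, H, W):
        # x: (B, N, C) with N = H * W flattened spatial tokens.
        B, N, C = x.shape
        q = self.q(x).reshape(B, N, self.num_heads, C // self.num_heads).transpose(1, 2)

        # Spatially downsample the tokens feeding k and v by factor r_n.
        xs = x.transpose(1, 2).reshape(B, C, H, W)
        stride = max(1, int(round((1.0 / self.r_n) ** 0.5)))
        xs = F.avg_pool2d(xs, kernel_size=stride, stride=stride)
        Ns = xs.shape[2] * xs.shape[3]
        xs = xs.reshape(B, C, Ns).transpose(1, 2)

        k = self.k(xs).reshape(B, Ns, self.num_heads, C // self.num_heads).transpose(1, 2)
        v = self.v(xs).reshape(B, Ns, self.num_heads, self.inner_c // self.num_heads).transpose(1, 2)

        # Standard scaled dot-product attention over the reduced key/value set.
        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, self.inner_c)
        return self.proj(out)  # project back to the input dimension
```

With `r_n = 0.25` the attention matrix shrinks from N x N to N x (N/4), which is where the accuracy/cost trade-off the abstract mentions comes from; the output still has the input shape, so the block drops into a standard transformer stack.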

Authors (7)
  1. Rui Yang (221 papers)
  2. Hailong Ma (8 papers)
  3. Jie Wu (231 papers)
  4. Yansong Tang (82 papers)
  5. Xuefeng Xiao (51 papers)
  6. Min Zheng (32 papers)
  7. Xiu Li (166 papers)
Citations (47)
