ScalableViT: Rethinking the Context-oriented Generalization of Vision Transformer (2203.10790v2)

Published 21 Mar 2022 in cs.CV and cs.AI

Abstract: The vanilla self-attention mechanism inherently relies on pre-defined and steadfast computational dimensions. Such inflexibility restricts it from possessing context-oriented generalization that can bring more contextual cues and global representations. To mitigate this issue, we propose a Scalable Self-Attention (SSA) mechanism that leverages two scaling factors to release dimensions of query, key, and value matrices while unbinding them with the input. This scalability fetches context-oriented generalization and enhances object sensitivity, which pushes the whole network into a more effective trade-off state between accuracy and cost. Furthermore, we propose an Interactive Window-based Self-Attention (IWSA), which establishes interaction between non-overlapping regions by re-merging independent value tokens and aggregating spatial information from adjacent windows. By stacking the SSA and IWSA alternately, the Scalable Vision Transformer (ScalableViT) achieves state-of-the-art performance in general-purpose vision tasks. For example, ScalableViT-S outperforms Twins-SVT-S by 1.4% and Swin-T by 1.8% on ImageNet-1K classification.
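The abstract's description of SSA (two scaling factors that decouple the query, key, and value dimensions from the input) can be illustrated with a rough single-head NumPy sketch. This is not the authors' implementation. The names `r_n` (token-count scale for keys and values) and `r_c` (channel scale), the random matrices standing in for learned weights, and the linear token-mixing matrix `w_n` used to shrink the token count are all assumptions made for illustration only.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def scalable_self_attention(x, r_n, r_c, rng):
    """Single-head attention with scaled token and channel dimensions.

    x   : (N, C) input tokens
    r_n : assumed scale on the number of key/value tokens
    r_c : assumed scale on the channel width of Q, K, V
    rng : np.random.Generator used for the stand-in weights
    """
    n, c = x.shape
    n_s = max(1, int(n * r_n))  # scaled token count for keys/values
    c_s = max(1, int(c * r_c))  # scaled channel width for Q, K, V

    # Hypothetical random projections in place of learned parameters.
    w_q = rng.standard_normal((c, c_s)) / np.sqrt(c)
    w_k = rng.standard_normal((c, c_s)) / np.sqrt(c)
    w_v = rng.standard_normal((c, c_s)) / np.sqrt(c)
    w_n = rng.standard_normal((n_s, n)) / np.sqrt(n)  # assumed N -> n_s token mixing
    w_o = rng.standard_normal((c_s, c)) / np.sqrt(c_s)

    q = x @ w_q        # (N, c_s): queries keep every input token
    x_red = w_n @ x    # (n_s, C): fewer tokens for keys and values
    k = x_red @ w_k    # (n_s, c_s)
    v = x_red @ w_v    # (n_s, c_s)

    attn = softmax(q @ k.T / np.sqrt(c_s), axis=-1)  # (N, n_s)
    out = (attn @ v) @ w_o                           # back to (N, C)
    return out, attn
```

With `x` of shape `(64, 32)`, `r_n = 0.25`, and `r_c = 1.25`, the attention map in this sketch has shape `(64, 16)` rather than `(64, 64)`, while the output keeps the input shape. That shows how the intermediate dimensions are no longer tied to the input size, which is the trade-off between accuracy and cost that the abstract refers to.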

Authors (7)

Rui Yang (221 papers)
Hailong Ma (8 papers)
Jie Wu (231 papers)
Yansong Tang (82 papers)
Xuefeng Xiao (51 papers)
Min Zheng (32 papers)
Xiu Li (166 papers)

Citations (47)
