Shunted Self-Attention via Multi-Scale Token Aggregation: A Discussion
The paper under consideration proposes a novel modification to the Vision Transformer (ViT) architecture, termed Shunted Self-Attention (SSA), aimed at addressing limitations in handling multi-scale objects within images. Specifically, the paper critiques existing ViT models for applying self-attention uniformly, with a single receptive-field size per layer, which can impede performance on images containing objects of varying scales. This issue is particularly salient in tasks where recognizing the fine-grained details of small objects alongside larger structures is crucial.
Key Contributions
The primary innovation is the Shunted Self-Attention mechanism, which injects heterogeneity in receptive-field size within each layer of the transformer. By merging tokens at different rates for different attention heads before the keys and values are computed, SSA lets some heads attend over coarsely aggregated tokens that capture large-object context while other heads retain fine-grained tokens that preserve the details of smaller objects. This is posited to improve the model's ability to learn object relationships at varied scales efficiently, as the sketch below illustrates.
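To make the idea concrete, the following PyTorch sketch shows one way per-head token merging could be wired up. It is a minimal illustration, not the authors' implementation: the class name `ShuntedAttentionSketch`, the merge rates, and the head grouping are assumptions made here for demonstration, and the paper's full design includes additional components omitted from this toy.

```python
import torch
import torch.nn as nn


class ShuntedAttentionSketch(nn.Module):
    """Toy shunted self-attention: groups of heads attend over keys/values
    that have been merged (downsampled) at different rates, giving
    heterogeneous receptive fields inside a single layer."""

    def __init__(self, dim=64, num_heads=4, rates=(2, 4)):
        super().__init__()
        assert dim % num_heads == 0 and num_heads % len(rates) == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.rates = rates
        self.heads_per_rate = num_heads // len(rates)
        group_dim = self.heads_per_rate * self.head_dim

        self.q = nn.Linear(dim, dim)
        # One token-merging convolution and one key/value projection per rate.
        self.merge = nn.ModuleList(
            [nn.Conv2d(dim, dim, kernel_size=r, stride=r) for r in rates]
        )
        self.kv = nn.ModuleList([nn.Linear(dim, 2 * group_dim) for _ in rates])
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, H, W):
        # x: (B, N, C) with N = H * W spatial tokens
        B, N, C = x.shape
        q = self.q(x).reshape(B, N, self.num_heads, self.head_dim).permute(0, 2, 1, 3)

        outputs = []
        for i, r in enumerate(self.rates):
            # Merge tokens: view them as an H x W map and downsample by r,
            # shrinking the key/value sequence by a factor of r^2.
            feat = x.transpose(1, 2).reshape(B, C, H, W)
            feat = self.merge[i](feat).flatten(2).transpose(1, 2)   # (B, N/r^2, C)
            kv = self.kv[i](feat)                                   # (B, N/r^2, 2*group_dim)
            kv = kv.reshape(B, -1, 2, self.heads_per_rate, self.head_dim).permute(2, 0, 3, 1, 4)
            k, v = kv[0], kv[1]                                     # (B, heads_per_rate, M, head_dim)

            # This rate's share of the (full-resolution) query heads attends
            # over its own merged keys and values.
            q_i = q[:, i * self.heads_per_rate:(i + 1) * self.heads_per_rate]
            attn = (q_i @ k.transpose(-2, -1)) * self.scale
            attn = attn.softmax(dim=-1)
            outputs.append((attn @ v).transpose(1, 2).reshape(B, N, -1))

        return self.proj(torch.cat(outputs, dim=-1))


if __name__ == "__main__":
    tokens = torch.randn(2, 16 * 16, 64)            # 2 images, 16x16 tokens, 64 channels
    out = ShuntedAttentionSketch()(tokens, 16, 16)
    print(out.shape)                                 # torch.Size([2, 256, 64])
```

The key point the sketch conveys is that the query sequence stays at full resolution while each group of heads sees keys and values merged at its own rate, so a single attention layer mixes coarse and fine receptive fields.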
A detailed empirical analysis demonstrates the efficacy of the proposed SSA, showing that SSA-based transformers outperform state-of-the-art models such as the Focal Transformer. The reported results highlight significant gains: 84.0% Top-1 accuracy on ImageNet with roughly half the model size and computational cost of leading alternatives. SSA-based models also improve on the COCO and ADE20K benchmarks, surpassing the Focal Transformer by 1.3 mAP and 2.9 mIoU, respectively.
Practical and Theoretical Implications
Practically, the proposed model shows promise for deployment in settings where computational resources are constrained but high accuracy remains paramount. Its ability to match or outperform current advanced models with a smaller footprint makes it particularly attractive for edge computing and mobile applications.
Theoretically, the introduction of SSA opens new avenues for structuring attention mechanisms more flexibly so that they adapt to input variations within a single layer. The notion of per-head receptive fields tuned to different scales could inspire further modifications not only in vision transformers but also in other multi-layer models dealing with inputs of varying granularity.
Speculation on Future AI Developments
Looking ahead, SSA could see expanded use in domains beyond traditional vision tasks. The token-aggregation approach might benefit natural language processing tasks where segmenting text at multiple granularities similarly requires flexibility. As AI models trend toward multi-task and multi-domain capabilities, the adaptability of SSA could play a role in cross-disciplinary model architectures.
Moreover, as models continue to scale and integrate multiple data types, the principle of shunting attention across different scales without increasing computational burden aligns with the broader objective of building efficient yet powerful general-purpose models.
Conclusion
Shunted Self-Attention via multi-scale token aggregation represents a significant step in refining transformer models for complex visual recognition tasks. By challenging the status quo of uniform attention mechanisms, the authors provide a compelling argument, backed by experimental validation, for reconsidering how receptive fields are structured within transformer architectures. The innovation is a noteworthy contribution to advancing the efficiency and capability of vision transformers in modeling real-world complexity.