Overview of SG-Former: Self-guided Transformer with Evolving Token Reallocation
Vision Transformers (ViTs) have gained significant traction in computer vision, largely due to their capacity to model long-range dependencies through self-attention. Despite their success, the computational cost of self-attention scales quadratically with the token sequence length, posing challenges for large feature maps. Many existing solutions restrict self-attention to local regions or perform coarse global attention over shortened sequences, often at the cost of model efficacy. The paper introduces SG-Former, a self-guided Transformer that addresses these challenges through evolving token reallocation, achieving efficient global self-attention with adaptive fine granularity.
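To get a rough sense of that quadratic scaling, the short sketch below counts pairwise attention scores for a token grid; the input resolution and patch stride are illustrative assumptions, not figures from the paper.

```python
# Back-of-the-envelope illustration: the number of pairwise attention scores
# per head grows quadratically with the number of tokens.
def attention_scores_per_head(height, width):
    n_tokens = height * width
    return n_tokens ** 2

# A 224x224 input patchified at stride 4 yields a 56x56 token grid.
print(attention_scores_per_head(56, 56))    # 9,834,496 scores per head
# Doubling the input resolution quadruples the token count
# and multiplies the attention cost by 16.
print(attention_scores_per_head(112, 112))  # 157,351,936 scores per head
```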
Key Contributions
SG-Former introduces a novel approach in which a significance map guides token reallocation: more tokens are allocated to salient regions for fine-grained attention and fewer to minor regions, maintaining a global receptive field at modest computational cost. The main contributions of the paper include:
- Hybrid-Scale Self-Attention: A hybrid-scale self-attention mechanism lets SG-Former capture both fine-grained local and coarse-grained global information within a single attention layer, and it supplies the significance information that drives token reallocation.
- Self-Guided Attention: Rather than relying on predefined token-aggregation strategies, SG-Former reallocates tokens for each input individually, guided by the learned significance map. The model thus retains detailed information in crucial regions while spending less computation on minor ones (a simplified sketch of this mechanism follows the list).
- Empirical Results: SG-Former surpasses state-of-the-art Transformer models across multiple computer vision tasks, achieving 84.7% Top-1 accuracy on ImageNet-1K, 51.2 mAP on COCO object detection, and 52.7 mIoU on ADE20K semantic segmentation, outperforming the Swin Transformer by notable margins.
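To make the self-guided reallocation concrete, here is a minimal PyTorch sketch of the idea. It is not the paper's implementation: the function name, the top-k split by significance, the fixed `keep_ratio`, and the average pooling of minor tokens are illustrative assumptions, and the query/key/value projections and multi-head splitting are omitted for brevity. SG-Former instead derives region-wise aggregation rates from the significance map learned by its hybrid-scale attention.

```python
import torch
import torch.nn.functional as F

def self_guided_attention_sketch(x, significance, keep_ratio=0.25, pool_groups=16):
    """Toy sketch: keys/values are reallocated according to token significance.

    x:            (B, N, C) token features
    significance: (B, N) per-token importance scores, assumed to come from a
                  significance map learned elsewhere (e.g. hybrid-scale attention)
    keep_ratio:   fraction of tokens treated as salient and kept at full resolution
    pool_groups:  number of coarse tokens the remaining minor tokens are pooled into
    """
    B, N, C = x.shape
    n_keep = max(1, int(N * keep_ratio))

    # Rank tokens by significance: salient tokens keep their fine granularity,
    # minor tokens will be aggressively aggregated.
    order = significance.argsort(dim=1, descending=True)

    def gather_tokens(idx):
        return x.gather(1, idx.unsqueeze(-1).expand(-1, -1, C))

    salient = gather_tokens(order[:, :n_keep])        # (B, n_keep, C)
    minor = gather_tokens(order[:, n_keep:])          # (B, N - n_keep, C)

    # Pool minor tokens into a handful of coarse tokens so they still provide
    # global context at a fraction of the cost.
    minor = F.adaptive_avg_pool1d(minor.transpose(1, 2), pool_groups).transpose(1, 2)

    # Queries stay at full resolution; keys/values use the reallocated token set,
    # so attention remains global while its cost drops from O(N^2) to O(N * M).
    kv = torch.cat([salient, minor], dim=1)           # (B, M, C) with M << N
    attn = (x @ kv.transpose(1, 2)) / C ** 0.5        # (B, N, M)
    return attn.softmax(dim=-1) @ kv                  # (B, N, C)

# Example: a 56x56 token grid with 96 channels and random significance scores.
tokens = torch.randn(2, 56 * 56, 96)
scores = torch.rand(2, 56 * 56)
out = self_guided_attention_sketch(tokens, scores)    # -> (2, 3136, 96)
```

The design choice the sketch mirrors is that queries are never reduced, so every spatial location can still attend globally; only the key/value side is compressed, and it is compressed non-uniformly according to significance.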
Implications and Future Directions
The results demonstrated by SG-Former have several significant implications. They highlight the potential of a self-evolving token-reallocation mechanism to preserve both computational efficiency and model performance. The success of SG-Former also underlines the importance of balancing local and global attention mechanisms, particularly when dealing with increasingly large datasets and feature maps.
Practically, the introduction of SG-Former could pave the way for more efficient and adaptive models that handle high-resolution images and complex scene-parsing tasks effectively. Theoretically, SG-Former presents a compelling case for dynamic attention mechanisms in which models self-configure based on task and data characteristics rather than relying on static architectural constraints.
Looking ahead, further research can explore extending the SG-Former framework to other domains that benefit from hierarchical attention structures and dynamic token processing, such as video analysis and multimodal datasets. Additionally, advancements in reducing the overall complexity of ViTs without compromising accuracy will remain a valuable area for exploration, potentially integrating novel ideas from sparse attention mechanisms or neural architecture search.
In conclusion, SG-Former represents a meaningful step towards the efficient deployment of Transformer-based architectures in vision tasks, offering an adaptable and performant solution that thoughtfully balances computational demands with high-quality attention across complex feature landscapes.