CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows
The paper "CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows" introduces the CSWin Transformer, a novel architecture designed to enhance transformer efficiency and effectiveness in general vision tasks. The core innovation of this paper is the Cross-Shaped Window (CSWin) self-attention mechanism, which computes self-attention in horizontal and vertical stripes in parallel, addressing the computational inefficiency of global self-attention and the locality constraints of traditional windowed self-attention.
Cross-Shaped Window Self-Attention
The CSWin self-attention mechanism avoids the quadratic cost of full self-attention by dividing the input into horizontal and vertical stripes and computing attention within each stripe; the two orientations are processed in parallel (a minimal code sketch follows this list):
- Parallel Multi-Head Grouping: The attention heads are split into two groups: the first computes self-attention within horizontal stripes, the second within vertical stripes. Because the two groups run in parallel, each token obtains a larger attention area within a single Transformer block than in methods that apply horizontal and vertical attention sequentially.
- Dynamic Stripe Widths: The stripe width (the number of rows or columns in each stripe) is adjusted across network stages: narrow stripes in the shallow, high-resolution layers keep computation low, while wider stripes in the deeper layers enlarge the attention area, balancing modeling capability against computational cost.
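Below is a minimal PyTorch sketch of this two-group stripe attention. The class name, tensor layout, and the assumption that H and W divide evenly by the stripe width `sw` are illustrative choices for exposition, not the authors' implementation (which, among other things, fuses LePE into the same module):

```python
import torch
import torch.nn as nn


class CSWinAttentionSketch(nn.Module):
    """Cross-shaped window self-attention, minimal sketch (LePE omitted).

    Heads are split into two equal groups: one attends within horizontal
    stripes of height `sw`, the other within vertical stripes of width
    `sw`, and the outputs are concatenated.
    """

    def __init__(self, dim, num_heads, sw):
        super().__init__()
        assert num_heads % 2 == 0 and dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.sw = sw
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def _stripe_attn(self, q, k, v, H, W, horizontal):
        # q, k, v: (B, heads, H*W, d). Group tokens into stripes so that
        # attention is computed independently inside each stripe.
        B, h, _, d = q.shape
        sw = self.sw

        def to_stripes(t):
            t = t.reshape(B, h, H, W, d)
            if horizontal:  # stripes of shape (sw, W), stacked along H
                return t.reshape(B, h, H // sw, sw * W, d)
            # vertical: stripes of shape (H, sw), stacked along W
            t = t.reshape(B, h, H, W // sw, sw, d).permute(0, 1, 3, 2, 4, 5)
            return t.reshape(B, h, W // sw, H * sw, d)

        q, k, v = map(to_stripes, (q, k, v))
        attn = (q @ k.transpose(-2, -1)) * d ** -0.5   # attention within each stripe
        out = attn.softmax(dim=-1) @ v                 # (B, h, n_stripes, tokens, d)
        if horizontal:
            out = out.reshape(B, h, H, W, d)
        else:
            out = out.reshape(B, h, W // sw, H, sw, d).permute(0, 1, 3, 2, 4, 5)
            out = out.reshape(B, h, H, W, d)
        return out.reshape(B, h, H * W, d)

    def forward(self, x, H, W):
        # x: (B, H*W, C); H and W are assumed divisible by the stripe width.
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)  # each (B, heads, N, head_dim)
        half = self.num_heads // 2            # first half: horizontal stripes
        out_h = self._stripe_attn(q[:, :half], k[:, :half], v[:, :half], H, W, True)
        out_v = self._stripe_attn(q[:, half:], k[:, half:], v[:, half:], H, W, False)
        out = torch.cat([out_h, out_v], dim=1)  # rejoin the two head groups
        return self.proj(out.transpose(1, 2).reshape(B, N, C))
```

Because the two head groups operate on disjoint halves of the channels, the horizontal and vertical branches together cost no more than a single axial-attention pass, yet each token's attention area becomes cross-shaped within one block.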
Locally-Enhanced Positional Encoding (LePE)
The authors also introduce a Locally-Enhanced Positional Encoding (LePE) scheme that captures local positional information more effectively than existing positional encoding methods (a sketch follows this list):
- Local Awareness: Unlike absolute or relative positional encodings, which are added to the input tokens or the attention weights, LePE operates as a parallel module to self-attention: a depthwise convolution applied to the projected values (V), whose output is added to the attention result within each Transformer block.
- Resolution Adaptability: LePE naturally supports arbitrary input resolutions, making it especially suitable for downstream tasks such as object detection and segmentation.
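The sketch below illustrates the idea. It assumes LePE is realized as a depthwise 3x3 convolution on the values V, added in parallel to the attention output; plain full attention is used instead of stripe attention to keep the example short, and names and shapes are illustrative:

```python
import torch
import torch.nn as nn


class LePEAttentionSketch(nn.Module):
    """Self-attention with Locally-Enhanced Positional Encoding, minimal sketch.

    A depthwise 3x3 convolution over the projected values V runs in
    parallel with attention, and its output is added to the attention
    result. (In the paper LePE is applied within each stripe; full
    attention is used here for brevity.)
    """

    def __init__(self, dim, num_heads):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        # Depthwise conv that injects local positional information into V.
        self.get_lepe = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)

    def forward(self, q, k, v, H, W):
        # q, k, v: (B, heads, H*W, head_dim)
        B, h, N, d = v.shape
        # LePE branch: restore V's spatial layout and convolve it, so the
        # encoding adapts to any input resolution (no positional table).
        v_sp = v.transpose(1, 2).reshape(B, N, h * d).transpose(1, 2).reshape(B, h * d, H, W)
        lepe = self.get_lepe(v_sp).reshape(B, h * d, N).transpose(1, 2)
        lepe = lepe.reshape(B, N, h, d).transpose(1, 2)  # back to (B, heads, N, d)
        attn = (q @ k.transpose(-2, -1)) * d ** -0.5
        out = attn.softmax(dim=-1) @ v + lepe            # LePE added in parallel
        return out.transpose(1, 2).reshape(B, N, h * d)
```

Because the positional term is computed directly from V's spatial layout, no fixed-size positional table has to be learned or interpolated, which is what makes LePE resolution-agnostic.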
Empirical Evaluation
The CSWin Transformer demonstrates superior performance across various computer vision tasks:
- Image Classification: On ImageNet-1K, the base model (CSWin-B) achieves 85.4% Top-1 accuracy, surpassing the previous state-of-the-art Swin Transformer by 1.2 points under a similar FLOPs setting. With ImageNet-21K pretraining followed by fine-tuning on ImageNet-1K, CSWin-B reaches 87.5% Top-1 accuracy.
- Object Detection and Instance Segmentation: With Cascade Mask R-CNN on COCO, CSWin-B attains 53.9 box AP and 46.4 mask AP, surpassing Swin Transformer by 2.0 and 1.4 points, respectively.
- Semantic Segmentation: On ADE20K, CSWin-B reaches 52.2 mIoU, outperforming Swin Transformer by 2.0 points.
Implications and Future Directions
The CSWin self-attention framework's ability to enlarge the receptive field at modest computational cost has meaningful implications for a range of vision applications, from image classification to more complex tasks such as object detection and segmentation. The hierarchical structure and adaptive stripe width confer robustness and scalability, suggesting that CSWin Transformers can be scaled up effectively for larger and more complex datasets.
Future research could explore further optimization of the CSWin architecture, for example through network pruning, quantization, or combination with other types of neural architectures. Extending the model to video understanding, 3D vision, or broader multimodal applications could also be a fruitful direction.
In conclusion, the CSWin Transformer represents a significant step forward in vision transformer design, offering a scalable, efficient, and high-performing backbone for a variety of vision tasks. The advancements introduced in this paper, particularly regarding the CSWin self-attention mechanism and LePE, pave the way for future research and development in the field of computer vision.