Shifted Window Self-Attention
- SW-MSA is a self-attention mechanism that partitions the input feature map into fixed-size, non-overlapping windows and periodically shifts the window grid to enable cross-window context propagation.
- It alternates between standard and shifted window attention in a hierarchical architecture, ensuring linear computational scaling with high-resolution inputs.
- Empirical results demonstrate that SW-MSA improves accuracy over non-shifted local attention at negligible latency cost, yielding strong speed-accuracy trade-offs in image classification, object detection, and semantic segmentation.
Shifted Window Multi-head Self-Attention (SW-MSA) is a self-attention mechanism foundational to the Swin Transformer architecture, introduced in "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows" (Liu et al., 2021). SW-MSA replaces standard global self-attention, which is computationally prohibitive for high-resolution images, with a window-based variant that alternates between a regular and a shifted non-overlapping window partition of the input tokens. This approach enables linear computational scaling with respect to image size and facilitates both local and cross-window interaction, addressing core challenges in adapting transformers to visual data.
1. Window-based Self-Attention and the Shifted Windowing Scheme
Traditional self-attention operations in vision transformers entail pairwise computation across all input tokens, resulting in $\mathcal{O}((hw)^2 C)$ complexity for an $h \times w$ patch map and channel dimension $C$. SW-MSA circumvents this cost by segmenting the input into non-overlapping windows of fixed size $M \times M$ and applying multi-head self-attention locally: $\text{Attention}(Q, K, V) = \text{SoftMax}\!\left(QK^{\top}/\sqrt{d} + B\right)V$, where $Q, K, V \in \mathbb{R}^{M^2 \times d}$, $d$ is the head dimension, and $B$ is a learnable relative position bias matrix of shape $M^2 \times M^2$ (parameterized by a smaller table $\hat{B} \in \mathbb{R}^{(2M-1) \times (2M-1)}$). This windowing yields computational cost $4hwC^2 + 2M^2hwC$, linear in image size (given constant $M$).
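To make the windowing concrete, here is a minimal PyTorch sketch of attention restricted to $M \times M$ windows, under toy assumptions: a single head with identity $Q/K/V$ projections, an $8 \times 8$ feature map, $M = 4$, and a placeholder relative-position index of zeros standing in for the usual pairwise coordinate-difference indexing. It illustrates the structure of the computation, not the reference implementation.

```python
# Minimal sketch of window-based attention with a relative position bias.
# Toy shapes and placeholder indexing for illustration; not the reference code.
import torch
import torch.nn.functional as F

def window_partition(x, M):
    """Split a (B, H, W, C) feature map into non-overlapping M x M windows."""
    B, H, W, C = x.shape
    x = x.view(B, H // M, M, W // M, M, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, M * M, C)  # (num_windows*B, M*M, C)

def window_attention(x_windows, bias_table, rel_index):
    """Single-head attention inside each window: softmax(Q K^T / sqrt(d) + B) V."""
    nW, N, d = x_windows.shape                          # N = M*M tokens per window
    q = k = v = x_windows                               # identity projections for brevity
    attn = (q @ k.transpose(-2, -1)) / d ** 0.5         # (nW, N, N) scores per window
    B_rel = bias_table[rel_index.view(-1)].view(N, N)   # gather the bias matrix B
    attn = F.softmax(attn + B_rel, dim=-1)
    return attn @ v

# Toy usage: 8x8 feature map, window size M = 4, channel dimension C = 16.
M, C = 4, 16
x = torch.randn(1, 8, 8, C)
bias_table = torch.zeros((2 * M - 1) ** 2)                   # (2M-1)^2 learnable entries
rel_index = torch.zeros(M * M, M * M, dtype=torch.long)      # placeholder index map
out = window_attention(window_partition(x, M), bias_table, rel_index)
print(out.shape)  # torch.Size([4, 16, 16]): 4 windows of 16 tokens each
```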
A key innovation in SW-MSA is the introduction of shifted windows. In consecutive transformer blocks, the window partitioning is offset by $(\lfloor M/2 \rfloor, \lfloor M/2 \rfloor)$ pixels. This shift causes patches initially at window boundaries to become window centers in the following layer, establishing inter-window connections and promoting context propagation.
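As a toy illustration of why the shift creates cross-window connections, the sketch below (hypothetical $8 \times 8$ token grid, $M = 4$) labels every token with its window index under the regular and the shifted partition: four tokens that meet at a corner of four different regular windows end up sharing a single shifted window.

```python
# Toy demonstration: window membership before and after shifting the grid by M/2.
import torch

M, H, W = 4, 8, 8
coords = torch.stack(torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij"), dim=-1)

def window_id(coords, shift):
    """Assign each token a window index under a grid offset of `shift` positions."""
    shifted = (coords - shift) % torch.tensor([H, W])
    return (shifted[..., 0] // M) * (W // M) + shifted[..., 1] // M

regular = window_id(coords, 0)       # W-MSA partition
shifted = window_id(coords, M // 2)  # SW-MSA partition, offset by (M/2, M/2)

print(regular[3:5, 3:5])  # tensor([[0, 1], [2, 3]]): four different windows meet here
print(shifted[3:5, 3:5])  # tensor([[0, 0], [0, 0]]): the same tokens now share a window
```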
2. Formal SW-MSA Block Alternation and Architecture
The Swin Transformer alternates between standard window multi-head self-attention (W-MSA) and shifted window multi-head self-attention (SW-MSA), formalized for two consecutive blocks $l$ and $l+1$ as:

$$\hat{z}^{l} = \text{W-MSA}(\text{LN}(z^{l-1})) + z^{l-1}, \qquad z^{l} = \text{MLP}(\text{LN}(\hat{z}^{l})) + \hat{z}^{l},$$
$$\hat{z}^{l+1} = \text{SW-MSA}(\text{LN}(z^{l})) + z^{l}, \qquad z^{l+1} = \text{MLP}(\text{LN}(\hat{z}^{l+1})) + \hat{z}^{l+1}.$$

This alternating scheme ensures that each block pair captures local structure (W-MSA) as well as inter-window dependencies (SW-MSA), enhancing representational richness. Layer normalization (LN) precedes each attention and MLP sub-layer, with residual connections around both, as in standard pre-norm transformer architectures.
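Shown below is a compact, self-contained sketch of this two-block alternation, using `torch.nn.MultiheadAttention` applied per window as a stand-in attention (no relative position bias and no shift masking), purely to exhibit the pre-norm, attention, residual, and MLP pattern; `ToySwinBlock` and its sizes are illustrative, not the paper's implementation.

```python
# Sketch of the W-MSA / SW-MSA block alternation; stand-in attention, toy sizes.
import torch
import torch.nn as nn

class ToySwinBlock(nn.Module):
    def __init__(self, dim, window, shift):
        super().__init__()
        self.window, self.shift = window, shift
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):                              # x: (B, H, W, C)
        B, H, W, C = x.shape
        M, s = self.window, self.window // 2
        shortcut = x
        x = self.norm1(x)
        if self.shift:                                 # SW-MSA: cyclic shift of the grid
            x = torch.roll(x, shifts=(-s, -s), dims=(1, 2))
        # partition into (num_windows*B, M*M, C) and attend within each window
        w = x.view(B, H // M, M, W // M, M, C).permute(0, 1, 3, 2, 4, 5).reshape(-1, M * M, C)
        w, _ = self.attn(w, w, w)
        # reverse the window partition
        x = w.reshape(B, H // M, W // M, M, M, C).permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)
        if self.shift:                                 # undo the cyclic shift
            x = torch.roll(x, shifts=(s, s), dims=(1, 2))
        x = shortcut + x                               # residual around (S)W-MSA
        return x + self.mlp(self.norm2(x))             # residual around MLP

# Consecutive blocks alternate: regular windows (W-MSA), then shifted windows (SW-MSA).
blocks = nn.Sequential(ToySwinBlock(32, 4, shift=False), ToySwinBlock(32, 4, shift=True))
print(blocks(torch.randn(2, 8, 8, 32)).shape)          # torch.Size([2, 8, 8, 32])
```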
3. Hierarchical Architecture and Patch Merging
SW-MSA is embedded within a hierarchical transformer powered by patch merging operations. Initially, images are split into non-overlapping $4 \times 4$ patches (tokens). As depth increases, neighboring $2 \times 2$ groups of patches are merged (their features concatenated and linearly projected, halving the spatial resolution and doubling the channel dimension), forming a feature pyramid. This hierarchical design supports multi-scale modeling critical for dense prediction tasks (object detection, semantic segmentation) and ensures that model complexity grows linearly with image size.
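The patch-merging step itself is small; a minimal sketch, assuming channel-last $(B, H, W, C)$ tensors and illustrative sizes, is given below: each $2 \times 2$ neighborhood is concatenated to $4C$ channels, normalized, and linearly projected to $2C$.

```python
# Minimal sketch of 2x2 patch merging: concatenate, normalize, project 4C -> 2C.
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x):                        # x: (B, H, W, C)
        x0 = x[:, 0::2, 0::2, :]                 # the four interleaved 2x2 sub-grids
        x1 = x[:, 1::2, 0::2, :]
        x2 = x[:, 0::2, 1::2, :]
        x3 = x[:, 1::2, 1::2, :]
        x = torch.cat([x0, x1, x2, x3], dim=-1)  # (B, H/2, W/2, 4C)
        return self.reduction(self.norm(x))      # (B, H/2, W/2, 2C)

x = torch.randn(1, 8, 8, 96)
print(PatchMerging(96)(x).shape)  # torch.Size([1, 4, 4, 192])
```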
4. Efficiency and Throughput Advantages
By restricting attention computation to local windows and using the shift only for cross-window mixing, SW-MSA achieves throughput superior to both global attention and sliding-window approaches. For example, Swin-T registers 755 images/sec on a V100 GPU, compared to significantly lower throughput for ViT-style global attention or naive sliding-window models at comparable accuracy. Cyclic shifting toward the top-left, paired with masked attention, implements the shifted partition efficiently: the number of windows remains the same as in the regular partition, avoiding padding and extra memory overhead.
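The sketch below illustrates the masking idea that accompanies the cyclic shift, with toy sizes ($H = W = 8$, $M = 4$, shift of 2); it mirrors the concept rather than the exact reference code. Tokens that wrap around from the opposite border receive a $-\infty$ bias toward their new window-mates, so the shifted partition is computed with the same number of windows as the regular one.

```python
# Sketch of the attention mask used with the cyclic shift (toy sizes).
import torch

H = W = 8
M, shift = 4, 2

# Label each token by the region it falls into after the cyclic shift.
img = torch.zeros(1, H, W, 1)
regions = (slice(0, -M), slice(-M, -shift), slice(-shift, None))
cnt = 0
for h in regions:
    for w in regions:
        img[:, h, w, :] = cnt
        cnt += 1

# Partition the label map into windows and compare labels pairwise.
win = img.view(1, H // M, M, W // M, M, 1).permute(0, 1, 3, 2, 4, 5).reshape(-1, M * M)
mask = win.unsqueeze(1) - win.unsqueeze(2)           # (num_windows, M*M, M*M)
mask = mask.masked_fill(mask != 0, float("-inf"))    # block pairs from different regions

# `mask` is added to each shifted window's attention logits before the softmax.
print(mask.shape)  # torch.Size([4, 16, 16])
```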
SW-MSA thus offers a balance between local computation and effective contextual aggregation, resulting in favorable scalability for high-resolution inputs.
5. Empirical Performance Across Vision Benchmarks
Ablation studies reported in (Liu et al., 2021) established the effectiveness of shifted windows:
- Swin-T with shifted windows demonstrates $+1.1\%$ top-1 accuracy on ImageNet-1K versus non-shifted local attention
- On COCO, object detection box AP and instance segmentation mask AP improve by $+2.8$ and $+2.2$ points, respectively
- Semantic segmentation on ADE20K gains $+2.8$ mIoU
General-purpose backbone capability is validated by state-of-the-art results across image classification (up to $87.3\%$ top-1 accuracy on ImageNet-1K), detection ($58.7$ box AP on COCO), and segmentation ($53.5$ mIoU on ADE20K), improving prior benchmarks by substantial margins (+2.7 box AP, +2.6 mask AP, +3.2 mIoU).
6. Contextual Innovations and Extensions
The paper reports that the shifted window mechanism extends naturally to architectures beyond canonical transformers, such as all-MLP models. SW-MSA provides a parameter-efficient route for multiscale modeling, competitive with convolutional and global attention-based designs. Notably, the inclusion of relative position bias in the attention scores encodes spatial relationships vital for vision tasks.
A plausible implication is that shifted window mechanisms could be generalized further, e.g., via adaptive or multi-scale windowing (see dynamic window strategies (Ren et al., 2022)), or through context-injected variants in medical imaging (as in CSW-SA (Imran et al., 23 Jan 2024)).
7. Summary and Significance
SW-MSA is distinguished by its combination of local computational efficiency, cross-window contextual connectivity via window shifting, incorporation of hierarchical patch merging, and superior empirical results. The approach marks a substantive advance in vision transformer design, reconciling the demands of spatial locality, scalability, and high performance across a diverse range of visual tasks, as extensively validated in (Liu et al., 2021).