Swin Transformer Architecture Overview
- Swin Transformer is a hierarchical vision transformer that uses window-based self-attention with shifted windows to efficiently capture local and global context in images.
- Its design features patch partitioning, linear embedding, and patch merging to build a multi-scale feature pyramid, ensuring compatibility with dense prediction frameworks.
- By restricting self-attention to fixed-size windows, the architecture achieves linear computational complexity while delivering high accuracy across image classification, segmentation, and video tasks.
The Swin Transformer is a hierarchical vision Transformer utilizing a shifted windowing scheme for efficient, scalable, and high-performing modeling across computer vision tasks—image classification, object detection, semantic segmentation, and video recognition. Unlike conventional global-attention-based Vision Transformers (ViTs), Swin Transformer restricts self-attention to local non-overlapping windows, introducing a window shift between blocks to facilitate cross-window information exchange. This architecture achieves linear computational complexity in image size and produces multi-scale feature representations, which makes it compatible with standard dense vision pipelines and extensible to video and generative models (Liu et al., 2021).
1. Hierarchical Architecture and Feature Pyramid
Swin Transformer processes input images of size $H \times W \times 3$ through a four-stage hierarchy:
- Patch Partition & Embedding: The image is partitioned into non-overlapping $4 \times 4$ patches (yielding tokens of $4 \times 4 \times 3 = 48$ dimensions), followed by a linear embedding to a feature dimension $C$.
- Four-Stage Pipeline:
- Stage 1: $\frac{H}{4} \times \frac{W}{4}$ resolution, $C$ channels, $L_1$ blocks.
- Stage 2: $\frac{H}{8} \times \frac{W}{8}$ resolution, $2C$ channels, $L_2$ blocks.
- Stage 3: $\frac{H}{16} \times \frac{W}{16}$ resolution, $4C$ channels, $L_3$ blocks.
- Stage 4: $\frac{H}{32} \times \frac{W}{32}$ resolution, $8C$ channels, $L_4$ blocks.
- Patch Merging: Between stages, each group of $2 \times 2$ adjacent patches is concatenated and linearly projected, halving spatial resolution and doubling the channel dimension, thereby constructing a feature pyramid analogous to CNNs and ensuring compatibility with existing dense prediction frameworks (Liu et al., 2021, Pinasthika et al., 2023).
This hierarchical design supports direct application to object detection and segmentation, facilitating multi-scale feature extraction intrinsic to these tasks.
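The patch-merging step can be sketched in a few lines of NumPy. The function below performs only the $2 \times 2$ spatial regrouping described above; the subsequent linear layer that reduces $4C \to 2C$ is omitted:

```python
import numpy as np

def patch_merge(x):
    """Merge each 2x2 neighborhood of tokens: (H, W, C) -> (H/2, W/2, 4C).

    In Swin this concatenation is followed by a linear projection
    reducing 4C -> 2C; only the spatial regrouping is shown here.
    """
    H, W, C = x.shape
    assert H % 2 == 0 and W % 2 == 0
    # Gather the four sub-grids at even/odd row and column offsets.
    x0 = x[0::2, 0::2, :]   # top-left of each 2x2 group
    x1 = x[1::2, 0::2, :]   # bottom-left
    x2 = x[0::2, 1::2, :]   # top-right
    x3 = x[1::2, 1::2, :]   # bottom-right
    return np.concatenate([x0, x1, x2, x3], axis=-1)  # (H/2, W/2, 4C)

# Stage 1 -> Stage 2 geometry for a 224x224 input with C=96:
merged = patch_merge(np.zeros((56, 56, 96)))  # (28, 28, 384)
```

Applied between stages, this halves each spatial dimension and quadruples channels before the linear reduction to $2C$.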
2. Window-Based and Shifted Window Multi-Head Self-Attention
Central to Swin Transformer is window-based multi-head self-attention (W-MSA):
- Local Windows: Tokens are partitioned into non-overlapping $M \times M$ windows. Within each window, for queries, keys, and values $Q, K, V \in \mathbb{R}^{M^2 \times d}$ with per-head dimension $d$, attention is computed as:
$$\text{Attention}(Q, K, V) = \text{SoftMax}\!\left(\frac{QK^{\top}}{\sqrt{d}} + B\right)V,$$
where $B \in \mathbb{R}^{M^2 \times M^2}$ is a learnable relative position bias indexed over the $(2M-1) \times (2M-1)$ possible window offsets.
- Shifted Window Scheme (SW-MSA): Consecutive blocks alternate between the regular partition and one shifted by $(\lfloor M/2 \rfloor, \lfloor M/2 \rfloor)$ pixels, so windows in one layer straddle the boundaries of the previous layer's windows, enabling inter-window connections and efficient context aggregation. In the efficient implementation, a cyclic shift with attention masking (rather than padding) handles the discontiguous sub-windows produced by the shift (Liu et al., 2021).
Block update equations for two consecutive blocks (W-MSA followed by SW-MSA):
$$\hat{z}^{l} = \text{W-MSA}(\text{LN}(z^{l-1})) + z^{l-1}, \qquad z^{l} = \text{MLP}(\text{LN}(\hat{z}^{l})) + \hat{z}^{l},$$
$$\hat{z}^{l+1} = \text{SW-MSA}(\text{LN}(z^{l})) + z^{l}, \qquad z^{l+1} = \text{MLP}(\text{LN}(\hat{z}^{l+1})) + \hat{z}^{l+1}.$$
Local windows reduce FLOPs and memory requirements without sacrificing network expressivity.
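The regular and shifted partitions can be sketched as follows. The cyclic roll implements the shift; the attention mask that separates discontiguous sub-windows is omitted for brevity:

```python
import numpy as np

def window_partition(x, M):
    """Split an (H, W, C) token map into non-overlapping MxM windows,
    returning (num_windows, M*M, C)."""
    H, W, C = x.shape
    x = x.reshape(H // M, M, W // M, M, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, M * M, C)

def shifted_windows(x, M):
    """Cyclically roll the map by (-M//2, -M//2) before partitioning,
    as in SW-MSA; the mask for cross-boundary tokens is not shown."""
    shifted = np.roll(x, shift=(-(M // 2), -(M // 2)), axis=(0, 1))
    return window_partition(shifted, M)

# A 14x14 token map with M=7 yields 4 windows of 49 tokens each.
x = np.arange(14 * 14).reshape(14, 14, 1).astype(float)
regular = window_partition(x, 7)
shifted = shifted_windows(x, 7)
```

Because the roll is cyclic, the shifted layout keeps exactly the same number of windows, which is what makes the batched implementation efficient.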
3. Computational Efficiency and Complexity Analysis
Swin Transformer achieves linear complexity in image size by restricting attention to fixed-size windows. For an $h \times w$ token map with channel dimension $C$ and window size $M$:
- Global Attention: $O((hw)^2)$ complexity due to all-to-all token interactions.
- Window-Based Attention: $O(M^2 \cdot hw)$, since each token interacts only with the $M^2$ tokens in its window (with $M$ fixed), not the entire sequence.
Including the projection cost, for W-MSA:
$$\Omega(\text{W-MSA}) = 4hwC^2 + 2M^2hwC,$$
versus global MSA:
$$\Omega(\text{MSA}) = 4hwC^2 + 2(hw)^2C.$$
This scalability enables use on high-resolution images and dense prediction benchmarks with practical resource budgets (Liu et al., 2021, Pinasthika et al., 2023).
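Plugging Stage-1 Swin-T numbers into these two formulas (a $56 \times 56$ token map, $C = 96$, $M = 7$, all taken from the configurations in this article) makes the gap concrete:

```python
def msa_flops(h, w, C):
    """Global MSA cost: 4*h*w*C^2 (projections) + 2*(h*w)^2*C (attention)."""
    return 4 * h * w * C**2 + 2 * (h * w) ** 2 * C

def wmsa_flops(h, w, C, M):
    """Window MSA cost: the attention term scales with M^2, not h*w."""
    return 4 * h * w * C**2 + 2 * M**2 * h * w * C

# Stage-1 Swin-T geometry for a 224x224 input: 56x56 tokens, C=96, M=7.
g = msa_flops(56, 56, 96)
wloc = wmsa_flops(56, 56, 96, 7)
ratio = g / wloc  # roughly an order of magnitude at this resolution
```

At higher resolutions the gap widens further, since the quadratic $(hw)^2$ term dominates global MSA while the window term stays linear in $hw$.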
4. Architectural Hyperparameters and Model Variants
Swin Transformer is parameterized for multiple capacity regimes: Tiny, Small, Base, and Large. For $224 \times 224$ input, the typical configuration uses window size $M = 7$, MLP expansion ratio $\alpha = 4$, and head dimension $d = 32$:
| Variant | Stage 1 | Stage 2 | Stage 3 | Stage 4 |
|---|---|---|---|---|
| Swin-T | C=96, L₁=2, H=3 | 2C=192, L₂=2, H=6 | 4C=384, L₃=6, H=12 | 8C=768, L₄=2, H=24 |
| Swin-S | C=96, L₁=2, H=3 | 2C=192, L₂=2, H=6 | 4C=384, L₃=18, H=12 | 8C=768, L₄=2, H=24 |
| Swin-B | C=128, L₁=2, H=4 | 2C=256, L₂=2, H=8 | 4C=512, L₃=18, H=16 | 8C=1024, L₄=2, H=32 |
| Swin-L | C=192, L₁=2, H=6 | 2C=384, L₂=2, H=12 | 4C=768, L₃=18, H=24 | 8C=1536, L₄=2, H=48 |
Within each block: LayerNorm → (W-MSA/SW-MSA) → residual → LayerNorm → 2-layer MLP → residual.
These hyperparameters confer flexibility to trade off accuracy, memory, and computation (Liu et al., 2021).
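The pre-norm, double-residual block wiring described above can be sketched with identity stand-ins for the attention and MLP sub-modules (the real block uses (S)W-MSA and a 2-layer MLP with GELU and hidden dimension $\alpha C$):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize over the channel axis, as LN does per token."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def swin_block(x, attn, mlp):
    """Pre-norm block: x + Attn(LN(x)), then x + MLP(LN(x))."""
    x = x + attn(layer_norm(x))
    x = x + mlp(layer_norm(x))
    return x

# Identity stand-ins expose the residual wiring on one 7x7 window
# of C=96 tokens; real sub-modules carry learned parameters.
out = swin_block(np.ones((49, 96)), attn=lambda z: z, mlp=lambda z: z)
```

The residual paths mean each block refines rather than replaces its input, which is what allows the deep Stage-3 stacks ($L_3 = 18$ in Swin-S/B/L) to train stably.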
5. Extensions: SparseSwin, Video Swin Transformer, Swin DiT
SparseSwin and SparTa Block
SparseSwin modifies the Swin-T backbone, replacing Stage 4 with “SparTa” blocks that employ a sparse token converter to reduce the number of tokens. The converter maps the stage's feature map into a fixed, smaller set of tokens via convolution and a linear projection, followed by standard self-attention and MLP steps. This yields a substantial parameter reduction (~36%) and state-of-the-art accuracy on ImageNet100, CIFAR10, and CIFAR100 at 17.58 M parameters (Pinasthika et al., 2023).
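A minimal NumPy sketch of the converter idea follows. Strided average pooling stands in for the convolution, and the token budget of 49 and output dimension of 512 are illustrative assumptions, not the paper's exact SparTa configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def sparse_token_converter(x, n_tokens, dim):
    """Reduce an (H, W, C) map to a fixed (n_tokens, dim) sequence.

    Stand-in for SparTa's conv + linear projection: strided average
    pooling forms n_tokens spatial groups, then a (hypothetical)
    random linear map projects channels to dim.
    """
    H, W, C = x.shape
    k = int(np.sqrt(n_tokens))          # tokens laid out on a k x k grid
    pooled = x.reshape(k, H // k, k, W // k, C).mean(axis=(1, 3))
    W_proj = rng.standard_normal((C, dim)) / np.sqrt(C)
    return pooled.reshape(k * k, C) @ W_proj

# Illustrative geometry only: a 14x14, 384-channel map to 49 tokens.
tokens = sparse_token_converter(np.ones((14, 14, 384)), n_tokens=49, dim=512)
```

Because subsequent self-attention now runs over a fixed, small token set, its cost is decoupled from the input resolution.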
Video Swin Transformer
Video Swin Transformer extends the architecture to spatiotemporal inputs using non-overlapping 3D windows of size $P \times M \times M$ ($P$ frames, $M \times M$ spatial extent) and shifted windowing along both time and space. This maintains hierarchical locality and efficiency, extending the relative position bias tables to three axes. Initializing from Swin weights pre-trained on images offers parameter and data efficiency. On Kinetics-400 and Kinetics-600, Video Swin achieves 84.9% and 86.1% top-1 accuracy, respectively, outperforming global-attention-based video Transformers with lower FLOPs and fewer parameters (Liu et al., 2021).
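The 3D windowing generalizes the 2D partition directly. A sketch, where the $2 \times 7 \times 7$ window size is assumed for illustration:

```python
import numpy as np

def window_partition_3d(x, P, M):
    """Split a (T, H, W, C) video token volume into non-overlapping
    P x M x M spatiotemporal windows: (num_windows, P*M*M, C)."""
    T, H, W, C = x.shape
    x = x.reshape(T // P, P, H // M, M, W // M, M, C)
    return x.transpose(0, 2, 4, 1, 3, 5, 6).reshape(-1, P * M * M, C)

# 8 frames of a 14x14, 96-channel token map with assumed 2x7x7 windows:
wins = window_partition_3d(np.zeros((8, 14, 14, 96)), P=2, M=7)
```

Shifted 3D windows then roll the volume along all three axes before partitioning, exactly as the 2D cyclic shift rolls rows and columns.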
Swin DiT and Pseudo-Shifted Window Attention
Swin DiT enhances diffusion-based generative models by adopting “Pseudo-Shifted Window Attention” (PSWA) in place of full global or shifted-window attention:
- PSWA: Splits channels into a window-attention branch and a high-frequency bridging branch; the latter uses depthwise convolutions to exchange local and cross-window information efficiently, simulating the effects of window shifts.
- Progressive Coverage Channel Allocation (PCCA): Gradually reassigns channels from bridging to window branches across layers, enabling high-order and global information flow with reduced computational cost.
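The channel-split idea behind PSWA can be sketched as follows, with an identity stand-in for the window-attention branch and a uniform depthwise kernel (both illustrative assumptions, not the Swin DiT implementation):

```python
import numpy as np

def depthwise_conv3x3(x, kernels):
    """Per-channel 3x3 convolution with zero padding on an (H, W, C) map;
    kernels has shape (3, 3, C)."""
    H, W, C = x.shape
    padded = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros_like(x)
    for i in range(3):
        for j in range(3):
            out += padded[i:i + H, j:j + W, :] * kernels[i, j, :]
    return out

def pswa_sketch(x, split, window_attn, kernels):
    """Split channels: the first `split` pass through window attention,
    the rest through a depthwise-conv bridging branch (the PSWA idea)."""
    attn_out = window_attn(x[:, :, :split])
    bridge_out = depthwise_conv3x3(x[:, :, split:], kernels)
    return np.concatenate([attn_out, bridge_out], axis=-1)

# Identity attention and an averaging kernel, on a 64-channel map
# split 48 / 16 between the two branches (assumed proportions).
out = pswa_sketch(np.ones((14, 14, 64)), split=48,
                  window_attn=lambda z: z,
                  kernels=np.full((3, 3, 16), 1 / 9))
```

PCCA then shifts the `split` point layer by layer, handing channels from the bridging branch to the attention branch as depth increases.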
Swin DiT achieves a 54% FID improvement over DiT-XL/2 and increases throughput while reducing FLOPs and memory usage in generative modeling (Wu et al., 19 May 2025).
6. Comparative Analysis and Empirical Results
Swin Transformer frequently surpasses the state of the art established by CNN backbones and global-attention ViTs:
- ImageNet-1K Image Classification: Swin achieves 87.3% top-1 accuracy, exceeding contemporary methods (Liu et al., 2021).
- COCO Detection and ADE20K Segmentation: Swin surpasses existing models by +2.7 box AP, +2.6 mask AP, and +3.2 mIoU, enabling high-throughput, high-resolution dense prediction.
- SparseSwin Results: Outperforms Swin-T, ViT-B, and numerous CNNs on ImageNet100 (86.96%), CIFAR-10 (97.43%), and CIFAR-100 (85.35%) at dramatically reduced parameter count (Pinasthika et al., 2023).
- Video Swin: Matches or outperforms global attention models on Kinetics and SSv2 with up to 20× less pre-training data and ~3× fewer parameters (Liu et al., 2021).
- Swin DiT in Diffusion Generation: Reduces FID relative to U-DiT and DiT, with highest gains for large models on ImageNet-256 (Wu et al., 19 May 2025).
7. Design Justifications and Limitations
The combination of hierarchical structure, local window attention, shifted windowing, and learnable relative position biases is justified by empirical performance and computational scaling:
- Hierarchical pyramid design supports dense vision tasks and FPN integration.
- Window-based attention exploits image locality, mitigating quadratic scaling.
- Shifted windows provide cross-window context at low computational cost.
- Relative position bias enhances translation invariance, crucial for dense prediction.
ViT and DeiT, which use global attention, suffer from quadratic scaling and single-resolution feature maps, making them ill-suited to high-resolution, dense prediction tasks. Swin remedies these issues, delivering superior or comparable accuracy at significantly lower computational and memory budgets (Liu et al., 2021).
Extensions (SparseSwin, Swin DiT) demonstrate the adaptability of the shifted-window concept to parameter- and memory-constrained regimes and to domains beyond static images, including video and generative modeling (Pinasthika et al., 2023, Wu et al., 19 May 2025). A plausible implication is that windowed locality and hierarchical representations will continue to underpin scalable, general-purpose vision Transformer architectures.