Swin Transformer Architecture Overview
- Swin Transformer is a hierarchical vision transformer that uses window-based self-attention with shifted windows to efficiently capture local and global context in images.
- Its design features patch partitioning, linear embedding, and patch merging to build a multi-scale feature pyramid, ensuring compatibility with dense prediction frameworks.
- By restricting self-attention to fixed-size windows, the architecture achieves linear computational complexity while delivering high accuracy across image classification, segmentation, and video tasks.
The Swin Transformer is a hierarchical vision Transformer utilizing a shifted windowing scheme for efficient, scalable, and high-performing modeling across computer vision tasks—image classification, object detection, semantic segmentation, and video recognition. Unlike conventional global-attention-based Vision Transformers (ViTs), Swin Transformer restricts self-attention to local non-overlapping windows, introducing a window shift between blocks to facilitate cross-window information exchange. This architecture achieves linear computational complexity in image size and produces multi-scale feature representations, which makes it compatible with standard dense vision pipelines and extensible to video and generative models (Liu et al., 2021).
1. Hierarchical Architecture and Feature Pyramid
Swin Transformer processes input images of size $H \times W \times 3$ through a four-stage hierarchy:
- Patch Partition & Embedding: The image is partitioned into non-overlapping $4 \times 4$ patches (yielding tokens of $4 \times 4 \times 3 = 48$ dimensions), followed by a linear embedding to a feature dimension $C$.
- Four-Stage Pipeline:
- Stage 1: $\frac{H}{4} \times \frac{W}{4}$ resolution, $C$ channels, $L_1$ blocks.
- Stage 2: $\frac{H}{8} \times \frac{W}{8}$ resolution, $2C$ channels, $L_2$ blocks.
- Stage 3: $\frac{H}{16} \times \frac{W}{16}$ resolution, $4C$ channels, $L_3$ blocks.
- Stage 4: $\frac{H}{32} \times \frac{W}{32}$ resolution, $8C$ channels, $L_4$ blocks.
- Patch Merging: Between stages, each group of $2 \times 2$ adjacent patches is concatenated and linearly projected, halving spatial resolution and doubling the channel dimension, thereby constructing a feature pyramid analogous to CNNs and ensuring compatibility with existing dense prediction frameworks (Liu et al., 2021, Pinasthika et al., 2023).
This hierarchical design supports direct application to object detection and segmentation, facilitating multi-scale feature extraction intrinsic to these tasks.
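The patch-merging step can be sketched in a few lines of NumPy. The function below performs only the $2 \times 2$ spatial regrouping described above; the subsequent linear layer that reduces $4C \to 2C$ is omitted:

```python
import numpy as np

def patch_merge(x):
    """Merge each 2x2 neighborhood of tokens: (H, W, C) -> (H/2, W/2, 4C).

    In Swin this concatenation is followed by a linear projection
    reducing 4C -> 2C; only the spatial regrouping is shown here.
    """
    H, W, C = x.shape
    assert H % 2 == 0 and W % 2 == 0
    # Gather the four sub-grids at even/odd row and column offsets.
    x0 = x[0::2, 0::2, :]   # top-left of each 2x2 group
    x1 = x[1::2, 0::2, :]   # bottom-left
    x2 = x[0::2, 1::2, :]   # top-right
    x3 = x[1::2, 1::2, :]   # bottom-right
    return np.concatenate([x0, x1, x2, x3], axis=-1)  # (H/2, W/2, 4C)

# Stage 1 -> Stage 2 geometry for a 224x224 input with C=96:
merged = patch_merge(np.zeros((56, 56, 96)))  # (28, 28, 384)
```

Applied between stages, this halves each spatial dimension and quadruples channels before the linear reduction to $2C$.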
2. Window-Based and Shifted Window Multi-Head Self-Attention
Central to Swin Transformer is window-based multi-head self-attention (W-MSA):
- Local Windows: Tokens are partitioned into non-overlapping $M \times M$ windows. Within each window, for queries, keys, and values $Q, K, V \in \mathbb{R}^{M^2 \times d}$ with per-head dimension $d$, attention is computed as:
$$\text{Attention}(Q, K, V) = \text{SoftMax}\!\left(\frac{QK^{\top}}{\sqrt{d}} + B\right)V,$$
where $B \in \mathbb{R}^{M^2 \times M^2}$ is a learnable relative position bias indexed over the $(2M-1) \times (2M-1)$ possible window offsets.
- Shifted Window Scheme (SW-MSA): Consecutive blocks alternate between the regular partition and one shifted by $(\lfloor M/2 \rfloor, \lfloor M/2 \rfloor)$ pixels, so windows in one layer straddle the boundaries of the previous layer's windows, enabling inter-window connections and efficient context aggregation. In the efficient implementation, a cyclic shift with attention masking (rather than padding) handles the discontiguous sub-windows produced by the shift (Liu et al., 2021).
Block update equations for two consecutive blocks (W-MSA followed by SW-MSA):
$$\hat{z}^{l} = \text{W-MSA}(\text{LN}(z^{l-1})) + z^{l-1}, \qquad z^{l} = \text{MLP}(\text{LN}(\hat{z}^{l})) + \hat{z}^{l},$$
$$\hat{z}^{l+1} = \text{SW-MSA}(\text{LN}(z^{l})) + z^{l}, \qquad z^{l+1} = \text{MLP}(\text{LN}(\hat{z}^{l+1})) + \hat{z}^{l+1}.$$
Local windows reduce FLOPs and memory requirements without sacrificing network expressivity.
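The regular and shifted partitions can be sketched as follows. The cyclic roll implements the shift; the attention mask that separates discontiguous sub-windows is omitted for brevity:

```python
import numpy as np

def window_partition(x, M):
    """Split an (H, W, C) token map into non-overlapping MxM windows,
    returning (num_windows, M*M, C)."""
    H, W, C = x.shape
    x = x.reshape(H // M, M, W // M, M, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, M * M, C)

def shifted_windows(x, M):
    """Cyclically roll the map by (-M//2, -M//2) before partitioning,
    as in SW-MSA; the mask for cross-boundary tokens is not shown."""
    shifted = np.roll(x, shift=(-(M // 2), -(M // 2)), axis=(0, 1))
    return window_partition(shifted, M)

# A 14x14 token map with M=7 yields 4 windows of 49 tokens each.
x = np.arange(14 * 14).reshape(14, 14, 1).astype(float)
regular = window_partition(x, 7)
shifted = shifted_windows(x, 7)
```

Because the roll is cyclic, the shifted layout keeps exactly the same number of windows, which is what makes the batched implementation efficient.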
3. Computational Efficiency and Complexity Analysis
Swin Transformer achieves linear complexity in image size by restricting attention to fixed-size windows. For an $h \times w$ token map with channel dimension $C$ and window size $M$:
- Global Attention: $O((hw)^2)$ complexity due to all-to-all token interactions.
- Window-Based Attention: $O(M^2 \cdot hw)$, since each token interacts only with the $M^2$ tokens in its window (with $M$ fixed), not the entire sequence.
Including the projection cost, for W-MSA:
$$\Omega(\text{W-MSA}) = 4hwC^2 + 2M^2hwC,$$
versus global MSA:
$$\Omega(\text{MSA}) = 4hwC^2 + 2(hw)^2C.$$
This scalability enables use on high-resolution images and dense prediction benchmarks with practical resource budgets (Liu et al., 2021, Pinasthika et al., 2023).
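Plugging Stage-1 Swin-T numbers into these two formulas (a $56 \times 56$ token map, $C = 96$, $M = 7$, all taken from the configurations in this article) makes the gap concrete:

```python
def msa_flops(h, w, C):
    """Global MSA cost: 4*h*w*C^2 (projections) + 2*(h*w)^2*C (attention)."""
    return 4 * h * w * C**2 + 2 * (h * w) ** 2 * C

def wmsa_flops(h, w, C, M):
    """Window MSA cost: the attention term scales with M^2, not h*w."""
    return 4 * h * w * C**2 + 2 * M**2 * h * w * C

# Stage-1 Swin-T geometry for a 224x224 input: 56x56 tokens, C=96, M=7.
g = msa_flops(56, 56, 96)
wloc = wmsa_flops(56, 56, 96, 7)
ratio = g / wloc  # roughly an order of magnitude at this resolution
```

At higher resolutions the gap widens further, since the quadratic $(hw)^2$ term dominates global MSA while the window term stays linear in $hw$.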
4. Architectural Hyperparameters and Model Variants
Swin Transformer is parameterized for multiple capacity regimes: Tiny, Small, Base, and Large. For $224 \times 224$ input, the typical configuration uses window size $M = 7$, MLP expansion ratio $\alpha = 4$, and head dimension $d = 32$:
| Variant | Stage 1 | Stage 2 | Stage 3 | Stage 4 |
|---|---|---|---|---|
| Swin-T | C=96, L₁=2, H=3 | 2C=192, L₂=2, H=6 | 4C=384, L₃=6, H=12 | 8C=768, L₄=2, H=24 |
| Swin-S | C=96, L₁=2, H=3 | 2C=192, L₂=2, H=6 | 4C=384, L₃=18, H=12 | 8C=768, L₄=2, H=24 |
| Swin-B | C=128, L₁=2, H=4 | 2C=256, L₂=2, H=8 | 4C=512, L₃=18, H=16 | 8C=1024, L₄=2, H=32 |
| Swin-L | C=192, L₁=2, H=6 | 2C=384, L₂=2, H=12 | 4C=768, L₃=18, H=24 | 8C=1536, L₄=2, H=48 |
Within each block: LayerNorm → (W-MSA/SW-MSA) → residual → LayerNorm → 2-layer MLP → residual.
These hyperparameters confer flexibility to trade off accuracy, memory, and computation (Liu et al., 2021).
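The pre-norm, double-residual block wiring described above can be sketched with identity stand-ins for the attention and MLP sub-modules (the real block uses (S)W-MSA and a 2-layer MLP with GELU and hidden dimension $\alpha C$):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize over the channel axis, as LN does per token."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def swin_block(x, attn, mlp):
    """Pre-norm block: x + Attn(LN(x)), then x + MLP(LN(x))."""
    x = x + attn(layer_norm(x))
    x = x + mlp(layer_norm(x))
    return x

# Identity stand-ins expose the residual wiring on one 7x7 window
# of C=96 tokens; real sub-modules carry learned parameters.
out = swin_block(np.ones((49, 96)), attn=lambda z: z, mlp=lambda z: z)
```

The residual paths mean each block refines rather than replaces its input, which is what allows the deep Stage-3 stacks ($L_3 = 18$ in Swin-S/B/L) to train stably.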
5. Extensions: SparseSwin, Video Swin Transformer, Swin DiT
SparseSwin and SparTa Block
SparseSwin modifies the Swin-T backbone, replacing Stage 4 with “SparTa” blocks that employ a sparse token converter to reduce the number of tokens. The converter maps the stage's feature map into a fixed, smaller set of tokens via convolution and a linear projection, followed by standard self-attention and MLP steps. This yields a substantial parameter reduction (~36%) and state-of-the-art accuracy on ImageNet100, CIFAR10, and CIFAR100 at 17.58 M parameters (Pinasthika et al., 2023).
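A minimal NumPy sketch of the converter idea follows. Strided average pooling stands in for the convolution, and the token budget of 49 and output dimension of 512 are illustrative assumptions, not the paper's exact SparTa configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def sparse_token_converter(x, n_tokens, dim):
    """Reduce an (H, W, C) map to a fixed (n_tokens, dim) sequence.

    Stand-in for SparTa's conv + linear projection: strided average
    pooling forms n_tokens spatial groups, then a (hypothetical)
    random linear map projects channels to dim.
    """
    H, W, C = x.shape
    k = int(np.sqrt(n_tokens))          # tokens laid out on a k x k grid
    pooled = x.reshape(k, H // k, k, W // k, C).mean(axis=(1, 3))
    W_proj = rng.standard_normal((C, dim)) / np.sqrt(C)
    return pooled.reshape(k * k, C) @ W_proj

# Illustrative geometry only: a 14x14, 384-channel map to 49 tokens.
tokens = sparse_token_converter(np.ones((14, 14, 384)), n_tokens=49, dim=512)
```

Because subsequent self-attention now runs over a fixed, small token set, its cost is decoupled from the input resolution.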
Video Swin Transformer
Video Swin Transformer extends the architecture to spatiotemporal inputs using non-overlapping 3D windows of size $P \times M \times M$ ($P$ frames, $M \times M$ spatial extent) and shifted windowing along both time and space. This maintains hierarchical locality and efficiency, extending the relative position bias tables to three axes. Initializing from Swin weights pre-trained on images offers parameter and data efficiency. On Kinetics-400 and Kinetics-600, Video Swin achieves 84.9% and 86.1% top-1 accuracy, respectively, outperforming global-attention-based video Transformers with lower FLOPs and fewer parameters (Liu et al., 2021).
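The 3D windowing generalizes the 2D partition directly. A sketch, where the $2 \times 7 \times 7$ window size is assumed for illustration:

```python
import numpy as np

def window_partition_3d(x, P, M):
    """Split a (T, H, W, C) video token volume into non-overlapping
    P x M x M spatiotemporal windows: (num_windows, P*M*M, C)."""
    T, H, W, C = x.shape
    x = x.reshape(T // P, P, H // M, M, W // M, M, C)
    return x.transpose(0, 2, 4, 1, 3, 5, 6).reshape(-1, P * M * M, C)

# 8 frames of a 14x14, 96-channel token map with assumed 2x7x7 windows:
wins = window_partition_3d(np.zeros((8, 14, 14, 96)), P=2, M=7)
```

Shifted 3D windows then roll the volume along all three axes before partitioning, exactly as the 2D cyclic shift rolls rows and columns.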
Swin DiT and Pseudo-Shifted Window Attention
Swin DiT enhances diffusion-based generative models by adopting “Pseudo-Shifted Window Attention” (PSWA) in place of full global or shifted-window attention:
- PSWA: Splits channels into a window-attention branch and a high-frequency bridging branch; the latter uses depthwise convolutions to exchange local and cross-window information efficiently, simulating the effects of window shifts.
- Progressive Coverage Channel Allocation (PCCA): Gradually reassigns channels from bridging to window branches across layers, enabling high-order and global information flow with reduced computational cost.
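The channel-split idea behind PSWA can be sketched as follows, with an identity stand-in for the window-attention branch and a uniform depthwise kernel (both illustrative assumptions, not the Swin DiT implementation):

```python
import numpy as np

def depthwise_conv3x3(x, kernels):
    """Per-channel 3x3 convolution with zero padding on an (H, W, C) map;
    kernels has shape (3, 3, C)."""
    H, W, C = x.shape
    padded = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros_like(x)
    for i in range(3):
        for j in range(3):
            out += padded[i:i + H, j:j + W, :] * kernels[i, j, :]
    return out

def pswa_sketch(x, split, window_attn, kernels):
    """Split channels: the first `split` pass through window attention,
    the rest through a depthwise-conv bridging branch (the PSWA idea)."""
    attn_out = window_attn(x[:, :, :split])
    bridge_out = depthwise_conv3x3(x[:, :, split:], kernels)
    return np.concatenate([attn_out, bridge_out], axis=-1)

# Identity attention and an averaging kernel, on a 64-channel map
# split 48 / 16 between the two branches (assumed proportions).
out = pswa_sketch(np.ones((14, 14, 64)), split=48,
                  window_attn=lambda z: z,
                  kernels=np.full((3, 3, 16), 1 / 9))
```

PCCA then shifts the `split` point layer by layer, handing channels from the bridging branch to the attention branch as depth increases.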
Swin DiT achieves a 54% FID improvement over DiT-XL/2 and increases throughput while reducing FLOPs and memory usage in generative modeling (Wu et al., 19 May 2025).
6. Comparative Analysis and Empirical Results
Swin Transformer frequently surpasses the state of the art established by CNN backbones and global-attention ViTs:
- ImageNet-1K Image Classification: Swin achieves 87.3% top-1 accuracy, exceeding contemporary methods (Liu et al., 2021).
- COCO Detection and ADE20K Segmentation: Swin surpasses existing models by +2.7 box AP, +2.6 mask AP, and +3.2 mIoU, enabling high-throughput, high-resolution dense prediction.
- SparseSwin Results: Outperforms Swin-T, ViT-B, and numerous CNNs on ImageNet100 (86.96%), CIFAR-10 (97.43%), and CIFAR-100 (85.35%) at dramatically reduced parameter count (Pinasthika et al., 2023).
- Video Swin: Matches or outperforms global attention models on Kinetics and SSv2 with up to 20× less pre-training data and ~3× fewer parameters (Liu et al., 2021).
- Swin DiT in Diffusion Generation: Reduces FID relative to U-DiT and DiT, with highest gains for large models on ImageNet-256 (Wu et al., 19 May 2025).
7. Design Justifications and Limitations
The combination of hierarchical structure, local window attention, shifted windowing, and learnable relative position biases is justified by empirical performance and computational scaling:
- Hierarchical pyramid design supports dense vision tasks and FPN integration.
- Window-based attention exploits image locality, mitigating quadratic scaling.
- Shifted windows provide cross-window context at low computational cost.
- Relative position bias enhances translation invariance, crucial for dense prediction.
ViT and DeiT, which use global attention, suffer from quadratic scaling and single-resolution feature maps, making them ill-suited to high-resolution, dense prediction tasks. Swin remedies these issues, delivering superior or comparable accuracy at significantly lower computational and memory budgets (Liu et al., 2021).
Extensions (SparseSwin, Swin DiT) demonstrate the adaptability of the shifted-window concept to parameter- and memory-constrained regimes and to domains beyond static images, including video and generative modeling (Pinasthika et al., 2023, Wu et al., 19 May 2025). A plausible implication is that windowed locality and hierarchical representations will continue to underpin scalable, general-purpose vision Transformer architectures.