
Swin Transformer: Hierarchical Vision Model

Updated 12 December 2025
  • Swin Transformer is a hierarchical model that uses window-based and shifted self-attention to capture long-range dependencies with linear complexity.
  • It employs patch merging and shifted window strategies to achieve scalable feature extraction, leading to state-of-the-art results in image and video tasks.
  • The architecture extends to video, speech, and volumetric data, with variants like Swin V2 enhancing stability and performance on high-resolution inputs.

The Swin Transformer is a hierarchical Transformer model employing local self-attention within shifted, non-overlapping windows. It achieves scalability and competitive accuracy in vision domains by reducing the quadratic computational complexity of traditional global self-attention to linear complexity relative to input size, while retaining the ability to capture long-range dependencies via cross-window communication. The architecture has been widely adopted as a general-purpose backbone for image and video understanding, and has been further extended to speech, volumetric, and spatiotemporal data modalities.

1. Core Architectural Principles

The Swin Transformer operates by partitioning an input (e.g., image, video, spectrogram, or tensor) into fixed-size non-overlapping patches, projecting each to an embedding, and hierarchically processing representations across four stages. Each stage contains multiple Swin Transformer blocks, which comprise:

  • Window-based Multi-Head Self-Attention (W-MSA): Self-attention is restricted to windows of size $M \times M$ (or $P \times M \times M$ for video, or higher-dimensional analogues for general tensors), computed independently within each window. This yields computational cost $O(M^2 H W d)$ for an input feature map of shape $H \times W \times d$ with window size $M$, offering linear scaling.
  • Shifted Window MSA (SW-MSA): To propagate information between windows and alleviate locality bias, every alternate Transformer block shifts the window partitioning by half the window size (along each spatial or spatiotemporal axis), applies W-MSA, then inverts the shift. This design ensures global connectivity in deep networks.
  • Hierarchical Feature Extraction: Patch merging operations downsample the spatial resolution by grouping $2 \times 2$ (or analogous higher-dimensional) patches, concatenating their representations, and linearly projecting to increase the channel dimension, analogous to the feature pyramid in CNNs (a minimal sketch of this step follows below).

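As a concrete illustration of the patch-merging step, the following is a minimal PyTorch-style sketch (a simplification under stated assumptions, not the reference implementation): each $2 \times 2$ neighborhood is gathered, its four $C$-dimensional features are concatenated into $4C$ channels, normalized, and linearly projected to $2C$.

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """Sketch of Swin-style patch merging: (B, H, W, C) -> (B, H/2, W/2, 2C)."""
    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x):                        # x: (B, H, W, C), H and W assumed even
        x0 = x[:, 0::2, 0::2, :]                 # top-left token of each 2x2 group
        x1 = x[:, 1::2, 0::2, :]                 # bottom-left
        x2 = x[:, 0::2, 1::2, :]                 # top-right
        x3 = x[:, 1::2, 1::2, :]                 # bottom-right
        x = torch.cat([x0, x1, x2, x3], dim=-1)  # (B, H/2, W/2, 4C)
        return self.reduction(self.norm(x))      # (B, H/2, W/2, 2C)
```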
The standard block sequence within each stage is:

$$
\begin{aligned}
\hat{\mathbf{z}}^{\ell} &= \text{W-MSA}(\text{LN}(\mathbf{z}^{\ell-1})) + \mathbf{z}^{\ell-1}, \\
\mathbf{z}^{\ell} &= \text{MLP}(\text{LN}(\hat{\mathbf{z}}^{\ell})) + \hat{\mathbf{z}}^{\ell}, \\
\hat{\mathbf{z}}^{\ell+1} &= \text{SW-MSA}(\text{LN}(\mathbf{z}^{\ell})) + \mathbf{z}^{\ell}, \\
\mathbf{z}^{\ell+1} &= \text{MLP}(\text{LN}(\hat{\mathbf{z}}^{\ell+1})) + \hat{\mathbf{z}}^{\ell+1},
\end{aligned}
$$

where each MLP has expansion ratio $\alpha = 4$ with GELU activations.
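The two-block pattern above maps directly onto code. The sketch below is a simplified rendition (the `wmsa` and `sw_msa` arguments are assumed window-attention modules operating on token sequences; this is not the reference implementation):

```python
import torch.nn as nn

def mlp(dim: int, ratio: int = 4) -> nn.Module:
    """Two-layer MLP with GELU and expansion ratio 4, as in the equations above."""
    return nn.Sequential(nn.Linear(dim, ratio * dim), nn.GELU(), nn.Linear(ratio * dim, dim))

class SwinBlockPair(nn.Module):
    """Sketch of one W-MSA + SW-MSA block pair with pre-norm residual connections."""
    def __init__(self, dim: int, wmsa: nn.Module, sw_msa: nn.Module):
        super().__init__()
        self.wmsa, self.sw_msa = wmsa, sw_msa      # regular and shifted window attention
        self.mlp1, self.mlp2 = mlp(dim), mlp(dim)
        self.norms = nn.ModuleList(nn.LayerNorm(dim) for _ in range(4))

    def forward(self, z):                          # z: (B, num_tokens, dim)
        z = z + self.wmsa(self.norms[0](z))        # z_hat^l     = W-MSA(LN(z^{l-1})) + z^{l-1}
        z = z + self.mlp1(self.norms[1](z))        # z^l         = MLP(LN(z_hat^l)) + z_hat^l
        z = z + self.sw_msa(self.norms[2](z))      # z_hat^{l+1} = SW-MSA(LN(z^l)) + z^l
        z = z + self.mlp2(self.norms[3](z))        # z^{l+1}     = MLP(LN(z_hat^{l+1})) + z_hat^{l+1}
        return z
```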

A learnable relative positional bias $B \in \mathbb{R}^{M^2 \times M^2}$ is added to the attention logits, with lookup indices generated from coordinate offsets within each window (Liu et al., 2021).
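A common way to realize this lookup, consistent with the description above (the helper name and exact construction are illustrative), is to enumerate pairwise coordinate offsets within an $M \times M$ window, shift them to be non-negative, and flatten them into indices over a table of $(2M-1)^2$ learnable bias entries:

```python
import torch

def relative_position_index(M: int) -> torch.Tensor:
    """Indices into a (2M-1)^2 bias table for every token pair in an MxM window (sketch)."""
    coords = torch.stack(torch.meshgrid(torch.arange(M), torch.arange(M), indexing="ij"))
    coords = coords.flatten(1)                     # (2, M*M) token coordinates
    rel = coords[:, :, None] - coords[:, None, :]  # (2, M*M, M*M) pairwise offsets
    rel = rel.permute(1, 2, 0)                     # (M*M, M*M, 2)
    rel[..., 0] += M - 1                           # shift row offsets into [0, 2M-2]
    rel[..., 1] += M - 1                           # shift column offsets into [0, 2M-2]
    rel[..., 0] *= 2 * M - 1                       # row-major flattening of the 2D offset
    return rel.sum(-1)                             # (M*M, M*M), values in [0, (2M-1)^2)
```

The $M^2 \times M^2$ bias $B$ is then gathered from the learnable table using these indices and added to the attention logits of every head.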

2. Model Scaling and Performance

Swin Transformer variants are parametrized by the embedding dimension $C$ and the number of blocks per stage $\{L_1, L_2, L_3, L_4\}$:

  • Swin-T (Tiny): $C = 96$, $\{2, 2, 6, 2\}$
  • Swin-S (Small): $C = 96$, $\{2, 2, 18, 2\}$
  • Swin-B (Base): $C = 128$, $\{2, 2, 18, 2\}$
  • Swin-L (Large): $C = 192$, $\{2, 2, 18, 2\}$
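For illustration, these variants can be summarized in a small configuration sketch (the field names are assumptions; per-stage channel widths follow $C, 2C, 4C, 8C$ through patch merging):

```python
# Illustrative configuration sketch of the Swin variants listed above.
SWIN_VARIANTS = {
    #  name         embedding dim C   blocks per stage (L1, L2, L3, L4)
    "swin_tiny":  {"embed_dim": 96,  "depths": (2, 2, 6, 2)},
    "swin_small": {"embed_dim": 96,  "depths": (2, 2, 18, 2)},
    "swin_base":  {"embed_dim": 128, "depths": (2, 2, 18, 2)},
    "swin_large": {"embed_dim": 192, "depths": (2, 2, 18, 2)},
}
```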

Swin Transformer has achieved:

  • 87.3% top-1 accuracy (Swin-L) on ImageNet-1K with 384×384 inputs (pretrained on ImageNet-22K).
  • 58.7 box AP and 51.1 mask AP on COCO object detection and instance segmentation (test-dev).
  • 53.5 mIoU on ADE20K semantic segmentation (UperNet, Swin-L, pre-trained on ImageNet-22K) (Liu et al., 2021).

The hierarchy enables direct integration with architectures requiring multi-scale features, such as U-Net-based models for restoration (Fan et al., 2022) or detection pipelines.

3. Window Attention: Mathematical Formalism and Efficiency

Given input tokens $X \in \mathbb{R}^{N \times d}$ within a window (with $N = M^2$), for each attention head $h$:

  • $Q_h = X W^Q_h$, $K_h = X W^K_h$, $V_h = X W^V_h$, with $W^Q_h, W^K_h, W^V_h \in \mathbb{R}^{d \times d_h}$
  • Attention per head: $A_h = \mathrm{Softmax}\!\left(\frac{Q_h K_h^\top}{\sqrt{d_h}} + B\right)$

$$\text{head}_h = A_h V_h$$

$$\text{MSA}(X) = \text{Concat}(\text{head}_1, \dots, \text{head}_H)\, W^O$$

Complexity of global attention: $O((HW)^2 d)$; windowed attention: $O(M^2 H W d)$ with $M$ fixed, allowing scalable training and inference on high-resolution data (Liu et al., 2021).
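The linear scaling follows because the feature map is split into $HW/M^2$ independent windows of $M^2$ tokens each, so the total attention cost is $(HW/M^2) \cdot O(M^4 d) = O(M^2 H W d)$. A minimal partition/reverse sketch (hypothetical helper names, not the reference code):

```python
import torch

def window_partition(x: torch.Tensor, M: int) -> torch.Tensor:
    """(B, H, W, C) -> (B*H*W/M^2, M*M, C): non-overlapping MxM windows (sketch)."""
    B, H, W, C = x.shape
    x = x.view(B, H // M, M, W // M, M, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, M * M, C)

def window_reverse(windows: torch.Tensor, M: int, H: int, W: int) -> torch.Tensor:
    """Inverse of window_partition: (num_windows*B, M*M, C) -> (B, H, W, C)."""
    B = windows.shape[0] // (H * W // (M * M))
    x = windows.view(B, H // M, W // M, M, M, -1)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, -1)
```

Attention is then applied to the resulting $(B \cdot HW/M^2, M^2, C)$ batch of windows in parallel.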

Shifted windows elegantly introduce cross-window connections and expand the network’s effective receptive field without reverting to global computation, maintaining linear complexity.
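In implementations, the shifted partition is typically realized as a cyclic shift of the feature map, so the number of windows stays unchanged; attention between tokens that wrap around is suppressed with a mask, and the shift is undone afterwards. A simplified sketch (mask construction omitted; reuses the helpers sketched above):

```python
import torch

def shifted_window_attention(x: torch.Tensor, window_attn, M: int) -> torch.Tensor:
    """Sketch of SW-MSA via cyclic shift; window_attn applies (masked) W-MSA per window."""
    B, H, W, C = x.shape
    shift = M // 2
    x = torch.roll(x, shifts=(-shift, -shift), dims=(1, 2))   # cyclic shift by half a window
    windows = window_partition(x, M)                          # (num_windows*B, M*M, C)
    windows = window_attn(windows)                            # attention inside each shifted window
    x = window_reverse(windows, M, H, W)
    return torch.roll(x, shifts=(shift, shift), dims=(1, 2))  # undo the shift
```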

4. Extensions: Video, Speech, and Spatiotemporal Domains

The Swin paradigm generalizes across data types:

  • Video Swin Transformer introduces 3D (space–time) windowed and shifted attention: the input is partitioned into $P \times M \times M$ spatiotemporal windows and self-attention is applied within each window (a 3D partition sketch follows this list). Patch merging/downsampling is performed on the spatial axes only, preserving temporal resolution. Joint attention over local spatiotemporal volumes balances efficiency and accuracy, achieving 84.9% top-1 on Kinetics-400 (Swin-L) (Liu et al., 2021).
  • Speech Swin Transformer operates on log-Mel spectrograms, organizing them into segment-level patches along the time axis, and applies local (frame-scale) and shifted window attention to capture multi-scale emotional features. Patch merging aggregates and expands representations from frame to segment scale (Wang et al., 2024).
  • 4D Swin (SwiFT) extends the technique to 4D fMRI, employing local 4D windows and both temporal and spatial absolute positional embeddings. Efficient handling of high-dimensional volumetric data and contrastive self-supervised pre-training are key features (Kim et al., 2023).
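As a hypothetical illustration of the video case, the 2D partition helper sketched earlier extends directly to $P \times M \times M$ spatiotemporal windows:

```python
import torch

def window_partition_3d(x: torch.Tensor, P: int, M: int) -> torch.Tensor:
    """(B, T, H, W, C) -> (num_windows*B, P*M*M, C): non-overlapping PxMxM windows (sketch)."""
    B, T, H, W, C = x.shape
    x = x.view(B, T // P, P, H // M, M, W // M, M, C)
    x = x.permute(0, 1, 3, 5, 2, 4, 6, 7)        # (B, T/P, H/M, W/M, P, M, M, C)
    return x.reshape(-1, P * M * M, C)
```

Patch merging in the video model then operates only on the $H$ and $W$ axes, leaving the temporal resolution untouched.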

A simplified (non-shifted) Swin structure can be advantageous in trajectory modeling, where the locality of transitions renders global or cross-window connections less critical (Wang et al., 2023).

5. Swin Transformer V2: Stability, Position Bias, and Scaling

Swin Transformer V2 introduces core improvements for large-scale, high-resolution vision models:

  • Residual-Post-Norm: Moves the LayerNorm from the start of each residual branch to its output, so that the normalized branch output is added back to the residual stream, stabilizing activations in very deep networks:

$$y = x + \text{LN}\big(\text{Attn}(x)\big), \qquad z = y + \text{LN}\big(\text{MLP}(y)\big)$$

  • Scaled Cosine Attention: Replaces $q k^\top/\sqrt{d}$ with $\cos(q, k)/\tau + B$, where $\tau$ is a learnable scalar, promoting insensitivity to feature magnitudes and further stabilizing optimization for 600M–3B parameter networks (see the sketch after this list).
  • Log-Continuous Position Bias (Log-CPB): A small MLP computes the bias $B(\Delta x, \Delta y)$ from log-spaced patch offsets, facilitating extrapolation to larger images at fine-tuning. For example:

$$\widehat{\Delta x} = \mathrm{sign}(\Delta x)\,\log\!\left(1 + |\Delta x|\right)$$

  • Self-supervised Pre-training (SimMIM): Masked image modeling drives label efficiency, enabling SwinV2-G (3.0B parameters) to match or exceed earlier large ViTs with roughly 40× less labelled data and computation, e.g., 84.0% top-1 on ImageNet-V2 using 70M labelled images (Liu et al., 2021).
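The following sketch illustrates scaled cosine attention for a single head and the log-spaced coordinates fed to the Log-CPB bias MLP (shapes, names, and the per-head simplification are assumptions, not the reference code):

```python
import torch
import torch.nn.functional as F

def scaled_cosine_attention(q, k, v, tau, bias):
    """Swin V2-style attention sketch for one head.
    q, k, v: (num_windows, N, d_h); tau: learnable positive scalar; bias: (N, N)."""
    qn = F.normalize(q, dim=-1)                  # unit-norm queries/keys -> dot product = cosine
    kn = F.normalize(k, dim=-1)
    attn = qn @ kn.transpose(-2, -1) / tau + bias
    return attn.softmax(dim=-1) @ v

def log_spaced_offsets(M: int) -> torch.Tensor:
    """Log-spaced (dy, dx) offsets within an MxM window, inputs to the Log-CPB MLP (sketch)."""
    coords = torch.arange(-(M - 1), M, dtype=torch.float32)
    dy, dx = torch.meshgrid(coords, coords, indexing="ij")
    offsets = torch.stack([dy, dx], dim=-1)      # all relative offsets in [-(M-1), M-1]^2
    return torch.sign(offsets) * torch.log1p(offsets.abs())
```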

Swin V2 thus demonstrates stable scaling to 1,536×1,536 inputs, state-of-the-art results across classification, detection, segmentation, and video recognition, and reduced label and computation requirements.

6. SUNet and Practical Task Integration

Swin Transformers can replace CNNs or vanilla Transformer blocks in U-Net–style encoder–decoders for restoration or reconstruction. For image denoising (SUNet) (Fan et al., 2022):

  • Every UNet convolutional block is replaced by an 8-layer Swin Transformer Block (W-MSA/SW-MSA alternation).
  • Patch merging performs down-sampling in the encoder; a dual up-sample module in the decoder (bilinear + sub-pixel convolution) mitigates up-sampling artifacts (sketched after this list).
  • Results: On CBSD68 (σ = 10/30/50), SUNet achieves PSNR = 35.94/30.28/27.85 and SSIM = 0.958/0.870/0.799, superior or competitive with state-of-the-art CNN-based (DnCNN, IRCNN, FFDNet) and U-Net models at lower parameter and FLOP counts.
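A dual up-sample block of the kind described above might look as follows (a hypothetical sketch; SUNet's exact kernel sizes and fusion may differ):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualUpsample(nn.Module):
    """Sketch of a 2x dual up-sample: bilinear branch + sub-pixel (PixelShuffle) branch."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.bilinear_proj = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        self.subpixel = nn.Sequential(
            nn.Conv2d(in_ch, out_ch * 4, kernel_size=1),   # 4x channels feed a 2x PixelShuffle
            nn.PixelShuffle(2),
        )
        self.fuse = nn.Conv2d(2 * out_ch, out_ch, kernel_size=1)

    def forward(self, x):                                   # x: (B, in_ch, H, W)
        up1 = self.bilinear_proj(
            F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False))
        up2 = self.subpixel(x)
        return self.fuse(torch.cat([up1, up2], dim=1))      # (B, out_ch, 2H, 2W)
```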

Ablation studies indicate that the dual up-sample module reduces artifacts relative to transposed convolutions, underscoring the flexibility of the Swin block hierarchy as a drop-in component in encoder–decoder designs.

7. Variants, Limitations, and Research Directions

  • Variants: Simpler Swin variants remove shifted windows entirely when data properties (e.g., contiguous taxi trajectories) render cross-window mixing unnecessary, yielding reduced runtime with minimal accuracy loss or slight gains (Wang et al., 2023).
  • Limitations: While shifted windows maintain linear complexity and facilitate global communication, their benefit is data-dependent. For modalities with strict locality, non-shifted variants may be preferable. The design also locks in window size and hierarchy depth as key hyperparameters, requiring domain-aligned tuning.
  • Open Directions: Scaling to higher dimensions (4D+), advancing continuous position encoding, and integrating domain-specific self-supervised objectives represent ongoing efforts. The architecture's modularity facilitates transfer to non-vision modalities, evident in downstream advances in speech emotion recognition (Wang et al., 2024) and functional neuroimaging (Kim et al., 2023).
