
Swin Transformer V2: Scalable Vision Model

Updated 6 January 2026
  • Swin Transformer V2 is a hierarchical vision transformer that integrates residual post-normalization, scaled cosine attention, and log-spaced continuous position bias to tackle training instability and resolution mismatch.
  • Its training methodology, combining SimMIM pre-training with large-scale supervised fine-tuning, significantly reduces labeled data requirements and training cost.
  • The model achieves state-of-the-art performance on benchmarks like ImageNet and COCO and serves as an effective backbone for tasks such as image restoration and super-resolution.

Swin Transformer V2 is a hierarchical vision transformer architecture that introduces crucial modifications over its predecessor to achieve efficient and stable scaling to billions of parameters and ultra-high input resolutions. Swin Transformer V2 (“Swin V2”) addresses three core challenges endemic to large-scale vision models: training instability, resolution mismatch between pre-training and fine-tuning, and large labeled data requirements. Central innovations include a residual post-normalization backbone, scaled cosine attention with learned temperature, and a log-spaced continuous position bias mechanism. Swin V2 established state-of-the-art results on multiple computer vision benchmarks, and its backbone is leveraged in specialized models such as Swin2SR for compressed image super-resolution and restoration (Liu et al., 2021, Conde et al., 2022).

1. Architectural Innovations

Swin Transformer V2 retains the four-stage, spatially hierarchical (“pyramid”) structure of Swin V1, wherein successive stages use patch merging to reduce resolution and increase feature dimensionality. The three defining architectural innovations of Swin V2 are:

  • Residual Post-Normalization: Swin V2 moves the LayerNorm from the input of each residual branch (pre-norm) to its output (post-norm), stabilizing training at scale. For an input $x_{l-1}$ and block $B_l(\cdot)$, the update is:

$x_l = x_{l-1} + \mathrm{LN}\big(B_l(x_{l-1})\big)$

This approach suppresses unbounded increases in activation amplitudes with depth and width—an issue documented in very large vision transformers.

  • Scaled Cosine Attention: The standard dot-product attention logit is replaced with a cosine similarity divided by a learned temperature $\tau$, maintained per layer and per head:

$S_{ij} = \frac{q_i \cdot k_j}{\|q_i\|\,\|k_j\|\,\tau} + B_{ij}$

$A = \mathrm{Softmax}(S)\,V$

Because the cosine similarity is bounded within $[-1, 1]$, it curbs logit explosion and smooths gradients.

  • Log-Spaced Continuous Position Bias (Log-CPB): Instead of a learned bias table indexed by discrete offsets, relative offsets $(\Delta x, \Delta y)$ are mapped via a log transformation:

$\hat{\Delta x} = \mathrm{sign}(\Delta x)\cdot\log(1+|\Delta x|), \quad \hat{\Delta y} = \mathrm{sign}(\Delta y)\cdot\log(1+|\Delta y|)$

A small MLP $G(\cdot)$ computes the bias:

$B(\Delta x, \Delta y) = G(\hat{\Delta x}, \hat{\Delta y})$

This continuous bias generalizes to arbitrary window sizes, allowing seamless transfer from low-resolution pre-training to high-resolution fine-tuning; a minimal sketch combining the three components follows below.
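
The following PyTorch sketch shows how these three components can fit together in a single window-attention block. It is a minimal illustration, not the official implementation: it assumes a single non-shifted window of tokens, omits attention masking and patch merging, and the module and parameter names (ScaledCosineWindowAttention, ResPostNormBlock, cpb_mlp, tau) are chosen here for exposition.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ScaledCosineWindowAttention(nn.Module):
    """Window attention with cosine-similarity logits and a log-spaced continuous position bias."""

    def __init__(self, dim, num_heads, window_size):
        super().__init__()
        self.num_heads = num_heads
        self.window_size = window_size
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # Learned temperature tau, one per head, kept above a small floor so logits stay bounded.
        self.tau = nn.Parameter(0.1 * torch.ones(num_heads, 1, 1))
        # Small MLP G(.) mapping log-spaced relative offsets to one bias value per head.
        self.cpb_mlp = nn.Sequential(nn.Linear(2, 512), nn.ReLU(), nn.Linear(512, num_heads))

    def _relative_bias(self):
        ws, dev = self.window_size, self.tau.device
        coords = torch.stack(torch.meshgrid(
            torch.arange(ws, device=dev), torch.arange(ws, device=dev), indexing="ij"),
            dim=-1).reshape(-1, 2)                                  # (N, 2) token coordinates
        rel = (coords[:, None, :] - coords[None, :, :]).float()     # (N, N, 2) relative offsets
        rel_log = torch.sign(rel) * torch.log1p(rel.abs())          # log-spaced coordinates
        return self.cpb_mlp(rel_log).permute(2, 0, 1)               # (num_heads, N, N) bias B

    def forward(self, x):                                           # x: (B, N, C), N = window tokens
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)                        # each (B, heads, N, head_dim)
        # Cosine-similarity logits divided by the learned temperature, plus the continuous bias.
        sim = F.normalize(q, dim=-1) @ F.normalize(k, dim=-1).transpose(-2, -1)
        attn = sim / self.tau.clamp(min=0.01) + self._relative_bias()
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)


class ResPostNormBlock(nn.Module):
    """Residual post-norm: LayerNorm is applied to each branch output before the residual add."""

    def __init__(self, dim, num_heads, window_size, mlp_ratio=4):
        super().__init__()
        self.attn = ScaledCosineWindowAttention(dim, num_heads, window_size)
        self.norm1 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, mlp_ratio * dim), nn.GELU(),
                                 nn.Linear(mlp_ratio * dim, dim))
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):
        x = x + self.norm1(self.attn(x))   # x_l = x_{l-1} + LN(Attention(x_{l-1}))
        x = x + self.norm2(self.mlp(x))    # same pattern for the MLP branch
        return x


# Example: a batch of two 8x8 windows of 96-dimensional tokens.
block = ResPostNormBlock(dim=96, num_heads=3, window_size=8)
tokens = torch.randn(2, 64, 96)
print(block(tokens).shape)  # torch.Size([2, 64, 96])
```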

2. Training Methodology and Scaling

Swin V2 is designed for efficient scaling to extreme model sizes and input resolutions. The flagship “Swin V2-G” model features approximately 3 billion parameters, achieved with a stage-1 channel dimensionality of $C = 512$ and a depth configuration of $\{2, 2, 42, 4\}$ blocks across the four stages.
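
As a rough sanity check on the 3-billion-parameter figure, the back-of-the-envelope estimate below counts only the attention and MLP weight matrices of each block, assuming the usual channel doubling at every patch-merging step and an MLP expansion ratio of 4, and ignoring biases, patch embedding/merging layers, and the position-bias MLPs.

```python
# Rough parameter estimate for a Swin V2-G-like configuration (illustrative, not exact).
C = 512                      # stage-1 channel dimension
depths = [2, 2, 42, 4]       # blocks per stage
mlp_ratio = 4

total = 0
for stage, blocks in enumerate(depths):
    dim = C * 2 ** stage                   # channels double after each patch-merging step
    attn = 4 * dim * dim                   # qkv projection (3*dim^2) + output projection (dim^2)
    mlp = 2 * mlp_ratio * dim * dim        # two linear layers: dim -> 4*dim and 4*dim -> dim
    total += blocks * (attn + mlp)

print(f"~{total / 1e9:.2f}B parameters")   # ~2.95B, consistent with the reported ~3B
```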

Key training strategies include:

  • Pre-training with SimMIM: A masked image modeling objective in which a set of image patches $\Omega$ is masked out and only the masked patches are reconstructed (a minimal sketch of this loss appears after this list):

$\mathcal{L} = \frac{1}{|\Omega|} \sum_{i \in \Omega} \big\| x_i - f_\theta(\tilde{x})_i \big\|_1$

Here $\tilde{x}$ is the masked input and $f_\theta$ the model with a lightweight prediction head; 20 epochs of SimMIM on 70M images provide a strong initialization without requiring labels at this stage.

  • Large-scale Supervised Training: After SimMIM, supervised fine-tuning on ImageNet-22K-ext (30 epochs) and then on ImageNet-1K at $640^2$ resolution (10 epochs) further optimizes the model.
  • Memory and Optimization Techniques: ZeRO Stage 1 optimizer, activation checkpointing, and sequential window-wise attention computation manage memory footprint and permit high batch sizes.
  • Data Efficiency: The training regime uses roughly 1/40th of the labeled data and training time of Google’s ViT-G on JFT-3B, requiring 70M labeled images and approximately 500 A100 core-hours (Liu et al., 2021).
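
A minimal sketch of the masked-patch reconstruction loss described above is given below. It assumes per-patch targets and predictions are already flattened into tensors, uses an $\ell_1$ penalty averaged over the masked set, and is purely illustrative rather than the SimMIM reference implementation.

```python
import torch


def masked_reconstruction_loss(pred, target, mask):
    """Reconstruction loss over masked patches only.

    pred, target: (B, N, P) tensors of predicted / original patch pixel values.
    mask:         (B, N) binary tensor, 1 where a patch was masked out.
    """
    per_patch = (pred - target).abs().mean(dim=-1)     # l1 error per patch
    masked = per_patch * mask                          # keep only masked patches
    return masked.sum() / mask.sum().clamp(min=1)      # average over |Omega|


# Toy usage: 4 images, 196 patches of 16x16x3 = 768 pixel values, 60% masking ratio.
pred = torch.randn(4, 196, 768)
target = torch.randn(4, 196, 768)
mask = (torch.rand(4, 196) < 0.6).float()
print(masked_reconstruction_loss(pred, target, mask))
```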

3. Empirical Evaluation and Performance

Swin V2 establishes new state-of-the-art results across a diverse suite of computer vision benchmarks:

| Task | Metric | Swin V2-G | Prior best / margin |
| --- | --- | --- | --- |
| ImageNet-1K V2 | Top-1 accuracy (%) | 84.0 | 83.3 |
| COCO (HTC++) | Box / mask AP | 63.1 / 54.4 | +1.8 / +1.4 over prior |
| ADE20K (UperNet) | mIoU | 59.9 | +1.5 over prior |
| Kinetics-400 (4×5 views) | Top-1 accuracy (%) | 86.8 | +1.4 over prior |

In ImageNet-V1 classification, Swin V2-G achieves 90.17% top-1 accuracy at $640^2$ input, comparable to CoAtNet-7 (90.88%) but with roughly 40× lower pre-training cost (Liu et al., 2021).

Ablations indicate that:

  • Post-norm delivers +0.2 to 0.5% absolute accuracy gain and prevents collapse in wide/deep settings.
  • Scaled cosine attention adds +0.1 to 0.3%.
  • Log-CPB imparts +3 to 10% gain when transferring from low to high resolution in downstream tasks.

4. Implications for Vision Model Pre-training

Swin V2 demonstrates that masked image modeling (SimMIM) combined with these architectural refinements supports effective pre-training with significantly reduced labeled data. Empirically, with only 70M images and a brief SimMIM phase, the 3B-parameter Swin V2 model achieves competitive results relative to models trained on far larger labeled datasets (e.g., ViT-G on the roughly 3-billion-image JFT-3B). This suggests that architectural normalization, attention stabilization, and position bias generalization substantially mitigate large data requirements (Liu et al., 2021).

Moreover, the use of log-spaced continuous position bias obviates the need for complex pre-training/fine-tuning recipes to address resolution gaps, facilitating transfer from coarse to fine downstream tasks.

5. Application to Image Restoration and Super-Resolution (Swin2SR)

The Swin2SR model extends Swin V2 as a backbone for compressed image restoration and super-resolution (Conde et al., 2022). In this context, Swin V2’s post-norm, scaled cosine attention, and log-CPB confer:

  • Enhanced training convergence and stability, permitting deeper and more effective transformers.
  • Removal of the need for multi-stage pre-training tricks (e.g., training on ×2 super-resolution and then fine-tuning on ×4).
  • Efficient handling of various resolution scenarios (“dynamic SR”), with a single model accommodating multiple upsampling factors.

Quantitative gains reported in Swin2SR include:

  • For JPEG artifact removal (LIVE1, quality=10): Swin2SR achieves 29.98 dB, exceeding SwinIR (29.86 dB).
  • For classical ×2 super-resolution on Set5: Swin2SR yields 38.43 dB (vs. 38.42 dB for SwinIR), with similar incremental improvements on Set14 and Urban100.
  • In the AIM 2022 Compressed SR Challenge (JPEG $q = 10$, ×4 upscaling): Swin2SR reaches up to 23.616 dB (with self-ensemble), ranking in the top 5 with inference of ≈1.4 s/image on an A100 GPU.

Loss functions in Swin2SR include a pixel-wise $\ell_1$ reconstruction term, an auxiliary low-resolution consistency term (an $\ell_1$ loss computed after downsampling the output back to the input resolution), and a high-frequency sharpening term (an $\ell_1$ loss on the high-frequency residual obtained by subtracting a Gaussian-blurred version of the image) (Conde et al., 2022).
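
The sketch below assembles such a composite objective in PyTorch. The weighting coefficients, the Gaussian-blur kernel, and the use of average-pool downsampling are assumptions for illustration; only the general three-term structure follows the description above.

```python
import torch
import torch.nn.functional as F


def gaussian_blur(img, kernel_size=5, sigma=1.0):
    """Depthwise Gaussian blur used to split off high-frequency content."""
    half = kernel_size // 2
    coords = torch.arange(kernel_size, dtype=img.dtype, device=img.device) - half
    g = torch.exp(-coords ** 2 / (2 * sigma ** 2))
    kernel2d = torch.outer(g / g.sum(), g / g.sum())
    c = img.shape[1]
    kernel = kernel2d.expand(c, 1, kernel_size, kernel_size)
    return F.conv2d(img, kernel, padding=half, groups=c)


def composite_sr_loss(sr, hr, lr, scale=4, w_lr=0.1, w_hf=0.1):
    """Pixel l1 + low-resolution consistency + high-frequency sharpening (illustrative weights)."""
    pixel = F.l1_loss(sr, hr)
    # Consistency: the downsampled prediction should match the low-resolution input.
    lr_consistency = F.l1_loss(F.avg_pool2d(sr, scale), lr)
    # Sharpening: compare high-frequency residuals (image minus its Gaussian-blurred version).
    hf = F.l1_loss(sr - gaussian_blur(sr), hr - gaussian_blur(hr))
    return pixel + w_lr * lr_consistency + w_hf * hf


# Toy usage with a x4 model output.
hr = torch.rand(2, 3, 64, 64)
lr = F.avg_pool2d(hr, 4)
sr = torch.rand(2, 3, 64, 64, requires_grad=True)
loss = composite_sr_loss(sr, hr, lr, scale=4)
loss.backward()
print(loss.item())
```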

6. Underlying Mechanisms and Contributions to Stability

Each Swin V2 architectural adjustment is empirically linked to specific training and generalization benefits (Liu et al., 2021, Conde et al., 2022):

  • Post-norm curtails feature variance explosion in deeper layers, avoiding gradient instability and collapse in wide/deep networks.
  • Scaled cosine attention ensures smooth, bounded logits and prevents “spiky” activations from dominating attention, contributing to faster and more stable convergence.
  • Log-CPB provides a parameter-efficient, generalizable bias representation that seamlessly accommodates changing window sizes, circumventing extrapolation errors faced by discrete bias tables.

Collectively, these design elements reduce total training iterations required for convergence (by ≈30% in Swin2SR) and enforce robust behavior across diverse input resolutions.

7. Summary and Outlook

Swin Transformer V2 demonstrates that carefully selected architectural modifications (residual post-normalization, scaled cosine attention, and log-spaced continuous position bias), alongside efficient masked image modeling pre-training, facilitate the scaling of dense vision transformers to 3 billion parameters and input resolutions up to $1{,}536 \times 1{,}536$. These models attain state-of-the-art results across classical computer vision benchmarks as well as restoration and super-resolution tasks, with marked advantages in optimization stability, data efficiency, and downstream transferability (Liu et al., 2021, Conde et al., 2022). For both foundational vision modeling and specialized tasks such as compressed image restoration, Swin V2’s innovations provide measurable improvements in performance and training efficiency.

References (2)

  • Liu, Z., Hu, H., Lin, Y., Yao, Z., Xie, Z., Wei, Y., Ning, J., Cao, Y., Zhang, Z., Dong, L., Wei, F., and Guo, B. (2021). Swin Transformer V2: Scaling Up Capacity and Resolution.
  • Conde, M. V., Choi, U.-J., Burchi, M., and Timofte, R. (2022). Swin2SR: SwinV2 Transformer for Compressed Image Super-Resolution and Restoration.