
SwinV2-Tiny: Efficient Vision Transformer

Updated 4 December 2025
  • The paper introduces a residual-post-norm design and Log-CPB technique, ensuring stable training and improved resolution transfer for high-resolution tasks.
  • It employs a scaled cosine self-attention mechanism with a learnable scaling factor to maintain bounded outputs and robust performance.
  • Experimental results show incremental accuracy gains and enhanced transferability compared to SwinV1, highlighting its efficiency in visual representation learning.

SwinV2-Tiny (SwinV2-T) is a small-scale variant within the Swin Transformer V2 family, designed for efficient and stable high-resolution visual representation learning. Building upon the hierarchical window-based Transformer framework of its predecessor, SwinV1, SwinV2-Tiny introduces refinements to residual structure, attention, and positional encoding, resulting in improved training stability, more robust transfer to high-resolution inputs, and modest gains in image recognition accuracy (Liu et al., 2021).

1. Architecture and Structural Innovations

SwinV2-Tiny retains the multi-scale design of the Swin Transformer, employing four processing stages. The input image (typically 256×256) undergoes patch partitioning into 4×4 non-overlapping regions, followed by linear embedding with C = 96 channels. Processing then proceeds through four Transformer stages with per-stage channel dimensions, block counts, and head counts as follows:

| Stage | Channels | Blocks | Heads (32-dim each) |
|-------|----------|--------|---------------------|
| 1     | 96       | 2      | 3                   |
| 2     | 192      | 2      | 6                   |
| 3     | 384      | 6      | 12                  |
| 4     | 768      | 2      | 24                  |

The window size for local self-attention is typically 8×8 on 256×256 images but may be adapted for larger resolutions (e.g., 12×12 for 384×384 inputs) to ensure even partitioning.

A major design revision over SwinV1 is the shift from the “pre-norm” to the “residual-post-norm” (“res-post-norm”) structure in each Transformer block. Rather than normalizing the block input, LayerNorm is applied to the output of each sub-module before it is added back to the residual stream:

\hat{y} = x + \text{LN}\big(\text{MHA}(x)\big), \qquad x_{\text{next}} = \hat{y} + \text{LN}\big(\text{MLP}(\hat{y})\big)

This ordering keeps the shortcut branch free of repeated normalization while preventing uncontrolled growth of activation magnitudes, ensuring stable training especially as model depth and size scale up.
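As an illustration, a minimal PyTorch sketch of a res-post-norm block is given below. The attention module, dimensions, and MLP ratio are placeholders chosen for clarity, not the exact SwinV2 implementation (which uses windowed scaled-cosine attention).

```python
import torch
import torch.nn as nn

class ResPostNormBlock(nn.Module):
    """Illustrative res-post-norm block: LayerNorm is applied to each
    sub-module's output *before* the residual addition (in contrast to
    pre-norm, which normalizes the sub-module's input)."""

    def __init__(self, dim: int = 96, num_heads: int = 3, mlp_ratio: float = 4.0):
        super().__init__()
        # Placeholder attention; SwinV2 uses windowed scaled-cosine attention.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)),
            nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim),
        )
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim)
        attn_out, _ = self.attn(x, x, x, need_weights=False)
        x = x + self.norm1(attn_out)      # normalize, then add to the residual
        x = x + self.norm2(self.mlp(x))   # same pattern for the MLP branch
        return x
```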

2. Scaled Cosine Self-Attention Mechanism

SwinV2-T replaces the conventional dot-product similarity in self-attention with a scaled cosine similarity, combined with a learnable positive scaling factor τ and a relative position bias B. Formally, attention is computed as

\text{Attention}(Q, K, V) = \text{Softmax}\!\left(\frac{\text{cosine}(Q, K)}{\tau} + B\right)V

where

\text{cosine}(q_i, k_j) = \frac{q_i \cdot k_j}{\|q_i\|\,\|k_j\|}

Each head and layer learns an independent τ, constrained to remain above 0.01, safeguarding against unbounded logits and contributing to training stability in deep architectures. The bounded range of cosine similarity ([−1, 1]) further assists numerical control.
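The following is a minimal sketch of scaled cosine attention (single head, no windowing, and no relative position bias), assuming a log-parameterized learnable τ clamped from below; it is illustrative rather than the reference SwinV2 code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CosineAttention(nn.Module):
    """Illustrative single-head scaled cosine attention (no windowing,
    no relative position bias). tau is learnable and kept above 0.01."""

    def __init__(self, dim: int = 96):
        super().__init__()
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # Parameterize tau in log space; exp(.) keeps it positive.
        self.log_tau = nn.Parameter(torch.zeros(1))  # tau initialized to 1.0

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Cosine similarity = dot product of L2-normalized vectors.
        q = F.normalize(q, dim=-1)
        k = F.normalize(k, dim=-1)
        tau = torch.clamp(self.log_tau.exp(), min=0.01)
        logits = (q @ k.transpose(-2, -1)) / tau  # bounded by 1/tau in magnitude
        attn = logits.softmax(dim=-1)
        return self.proj(attn @ v)
```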

3. Log-Spaced Continuous Relative Position Bias (Log-CPB)

To enable robust transfer from low-resolution pre-training to high-resolution fine-tuning, SwinV2 introduces a log-spaced continuous bias function for relative positional encoding. Rather than using a fixed parameter table, SwinV2 computes the position bias for an offset (Δx, Δy) as

B(\Delta x, \Delta y) = G\!\left(\widehat{\Delta x}, \widehat{\Delta y}\right)

where

\widehat{\Delta x} = \text{sign}(\Delta x)\,\log\!\left(1 + |\Delta x|\right), \qquad \widehat{\Delta y} = \text{sign}(\Delta y)\,\log\!\left(1 + |\Delta y|\right)

G is a lightweight fully-connected MLP (2 layers with a ReLU activation in between), evaluated across all relative offsets at runtime, yielding a bias matrix that adapts continuously to window sizes unseen during pre-training. Empirical results show that Log-CPB substantially reduces accuracy degradation when transferring SwinV2-Tiny from 8×8 to 24×24 windows, compared to bicubic interpolation of parameterized bias tables.
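A simplified sketch of a Log-CPB meta-network is shown below. The hidden width and per-head output shape are assumptions for illustration, and the additional coordinate normalization used in the paper is omitted.

```python
import torch
import torch.nn as nn

class LogCPB(nn.Module):
    """Illustrative log-spaced continuous relative position bias: a small
    MLP maps log-scaled relative offsets to a per-head bias, so the bias
    generalizes to window sizes unseen during pre-training."""

    def __init__(self, num_heads: int = 3, hidden: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2, hidden), nn.ReLU(inplace=True), nn.Linear(hidden, num_heads)
        )

    def forward(self, window_size: int) -> torch.Tensor:
        # Pairwise relative offsets between all positions in the window.
        coords = torch.arange(window_size)
        dy, dx = torch.meshgrid(coords, coords, indexing="ij")
        rel = torch.stack(
            [dy.flatten()[:, None] - dy.flatten()[None, :],
             dx.flatten()[:, None] - dx.flatten()[None, :]], dim=-1
        ).float()                                           # (N, N, 2), N = window_size**2
        rel_log = torch.sign(rel) * torch.log1p(rel.abs())  # log-spaced coordinates
        bias = self.mlp(rel_log)                            # (N, N, num_heads)
        return bias.permute(2, 0, 1)                        # (num_heads, N, N)
```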

4. Training Procedure and Hyperparameters

SwinV2-Tiny is pretrained on ImageNet-1K for 300 epochs using the AdamW optimizer, with the following hyperparameter regimen:

| Parameter           | Value                                       |
|---------------------|---------------------------------------------|
| Input size / window | 256×256 (W = 8)                             |
| Initial LR          | 1×10⁻³ (cosine decay, 20-epoch warmup)      |
| Weight decay        | 0.05                                        |
| Batch size          | 1,024                                       |
| Stochastic depth    | 0.2                                         |
| Gradient clipping   | max-norm 5.0                                |
| Augmentations       | RandAugment, Mixup, CutMix, random erasing  |
| Fine-tuning LR      | 4×10⁻⁵ (cosine decay)                       |
| Fine-tuning epochs  | 30                                          |

For fine-tuning on higher resolution images, identical augmentation and optimizer settings are applied with appropriate adjustments to window size and learning rate.
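For reference, a minimal PyTorch sketch of this optimizer and schedule setup is given below, assuming a step-based linear warmup followed by cosine decay; the original training code uses its own scheduler utilities, so this is only an approximation of the listed regimen.

```python
import torch
from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR

def build_optimizer_and_scheduler(model, epochs=300, warmup_epochs=20,
                                  steps_per_epoch=1000):
    # Hyperparameters follow the values listed in the table above.
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.05)
    warmup = LinearLR(optimizer, start_factor=1e-3,
                      total_iters=warmup_epochs * steps_per_epoch)
    cosine = CosineAnnealingLR(optimizer,
                               T_max=(epochs - warmup_epochs) * steps_per_epoch)
    scheduler = SequentialLR(optimizer, [warmup, cosine],
                             milestones=[warmup_epochs * steps_per_epoch])
    return optimizer, scheduler

# In the training loop, gradients are clipped before each optimizer step:
#   torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
#   optimizer.step(); scheduler.step()
```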

5. SimMIM Self-Supervised Pre-Training Modality

Although SwinV2-Tiny is primarily reported with supervised pre-training, the SwinV2 framework leverages SimMIM (Simple Masked Image Modeling) to reduce dependence on labeled data. In SimMIM, random block-wise masking (typically ~60% of patches) is applied, and the network regresses the raw pixel values of the masked patches under a simple ℓ1 loss, without additional weighting factors. For larger SwinV2 models (e.g., the 3B-parameter SwinV2-G), a two-stage pre-training regime is described: SimMIM on ImageNet-22K-ext, followed by supervised classification.
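A highly simplified sketch of the SimMIM objective (random patch masking plus an ℓ1 reconstruction loss on masked pixels) is shown below; the encoder, prediction head, and masking granularity are placeholders, and the learnable mask token of the actual method is omitted for brevity.

```python
import torch

def simmim_loss(encoder, head, images, patch_size=4, mask_ratio=0.6):
    """Illustrative masked-image-modeling step: mask ~60% of patches,
    encode the corrupted image, predict raw pixels, and penalize the
    prediction only at masked locations with an l1 loss."""
    B, C, H, W = images.shape
    num_patches = (H // patch_size) * (W // patch_size)
    # Random per-sample patch mask (True = masked).
    mask = torch.rand(B, num_patches, device=images.device) < mask_ratio
    pixel_mask = mask.view(B, 1, H // patch_size, W // patch_size).float()
    pixel_mask = pixel_mask.repeat_interleave(patch_size, dim=2)
    pixel_mask = pixel_mask.repeat_interleave(patch_size, dim=3)
    corrupted = images * (1.0 - pixel_mask)   # zero out masked pixels (mask token omitted)
    pred = head(encoder(corrupted))           # predicted pixels, same shape as images
    l1 = (pred - images).abs() * pixel_mask   # loss only on masked pixels
    return l1.sum() / pixel_mask.sum().clamp(min=1.0) / C
```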

6. Performance Evaluation and Transfer Behavior

On ImageNet-1K, SwinV2-Tiny demonstrates incremental gains over SwinV1-Tiny. When trained from scratch at 256×256 resolution with 8×8 windows, top-1 accuracies are as follows:

| Configuration                        | Top-1 Accuracy (%) |
|--------------------------------------|--------------------|
| SwinV1-T (pre-norm, dot-product)     | 81.5               |
| + Res-post-norm only                 | 81.6               |
| + Res-post-norm & cosine attention   | 81.7               |
| + Log-CPB (full SwinV2-T)            | 81.8               |

The cumulative improvements (+0.2–0.3%) are attributable to the successive architectural enhancements. Notably, when transferring to a larger input and window size without fine-tuning (384×384 input, 12×12 window), SwinV2-Tiny retains approximately 81.8% accuracy, whereas SwinV1-Tiny drops to 79.4%. No results for downstream tasks (e.g., object detection, semantic segmentation) are reported for the Tiny variant in the source.

7. Implementation Considerations in Practice

SwinV2-Tiny benefits from several practical engineering recommendations:

  • For distributed training, DeepSpeed’s ZeRO-1 is used to shard optimizer states and model weights across GPUs.
  • Activation checkpointing may be applied to Transformer layers (noting a roughly 30% slowdown for large models); see the sketch after this section.
  • For very large windows, attention is computed sequentially rather than in batch for efficiency.
  • Maintain the learned τ parameter above 0.01 (typically initialized to 1.0 and then trained).
  • Window sizes should be even numbers to facilitate correct window shifting.
  • For models larger than approximately 200M parameters, additional LayerNorm is inserted on the main branch every six blocks to enhance stability.

These techniques ensure robust scaling, computational efficiency, and stable convergence in high-capacity visual Transformer training.
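As referenced in the list above, a minimal sketch of applying activation checkpointing to a stack of Transformer blocks with torch.utils.checkpoint is given below; the block class and stage structure are placeholders, not the SwinV2 training code.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class CheckpointedStage(nn.Module):
    """Wraps a stack of blocks so intermediate activations are recomputed
    during the backward pass, trading ~30% extra compute for lower memory."""

    def __init__(self, blocks: nn.ModuleList, use_checkpoint: bool = True):
        super().__init__()
        self.blocks = blocks
        self.use_checkpoint = use_checkpoint

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for block in self.blocks:
            if self.use_checkpoint and self.training:
                x = checkpoint(block, x, use_reentrant=False)
            else:
                x = block(x)
        return x
```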

8. Relationship to Larger Models and Scalability

SwinV2-Tiny represents the foundational scaled-down design within the SwinV2 family, sharing its core innovations with larger models such as SwinV2-Small, SwinV2-Base, SwinV2-Large, and SwinV2-Giant. The architectural improvements introduced in the Tiny variant are critical for enabling the training of much larger vision Transformers (up to 3 billion parameters, e.g., SwinV2-G). All models benefit from stabilized deep-network training dynamics and resolution transferability; in large-scale settings, SwinV2 additionally reports roughly an order-of-magnitude reduction in labeled data and training time compared to prior billion-parameter visual models (Liu et al., 2021).
