SwinV2-Tiny: Efficient Vision Transformer
- The paper introduces a residual-post-norm design and Log-CPB technique, ensuring stable training and improved resolution transfer for high-resolution tasks.
- It employs a scaled cosine self-attention mechanism with a learnable scaling factor to maintain bounded outputs and robust performance.
- Experimental results show incremental accuracy gains and enhanced transferability compared to SwinV1, highlighting its efficiency in visual representation learning.
SwinV2-Tiny (SwinV2-T) is a small-scale variant within the Swin Transformer V2 family, designed for efficient and stable high-resolution visual representation learning. Building upon the hierarchical window-based Transformer framework of its predecessor, SwinV1, SwinV2-Tiny introduces refinements to residual structure, attention, and positional encoding, resulting in improved training stability, more robust transfer to high-resolution inputs, and modest gains in image recognition accuracy (Liu et al., 2021).
1. Architecture and Structural Innovations
SwinV2-Tiny retains the multi-scale design of Swin Transformer, employing four processing stages. The input image (typically 256×256) is partitioned into non-overlapping 4×4 patches, each linearly embedded into C = 96 channels. The resulting token sequence then passes through the four Transformer stages, with channel dimensions and block counts as follows:
| Stage | Channels | Blocks | Heads (32-dim each) |
|---|---|---|---|
| 1 | 96 | 2 | 3 |
| 2 | 192 | 2 | 6 |
| 3 | 384 | 6 | 12 |
| 4 | 768 | 2 | 24 |
The window size for local self-attention is 8×8 at the default pre-training resolution but is enlarged for higher-resolution inputs so that the token grid at every stage remains evenly partitioned into windows.
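As a quick reference, the stage layout above can be condensed into a small configuration sketch. The class and field names below are illustrative rather than the API of any particular library, and the 256×256 input size follows the pre-training setting described later in this article.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class SwinV2TinyConfig:
    """Illustrative stage layout for SwinV2-T; names and defaults are a sketch."""
    img_size: int = 256      # typical pre-training resolution
    patch_size: int = 4      # 4x4 non-overlapping patches
    embed_dim: int = 96      # channels after the linear embedding (stage 1)
    depths: List[int] = field(default_factory=lambda: [2, 2, 6, 2])
    num_heads: List[int] = field(default_factory=lambda: [3, 6, 12, 24])
    window_size: int = 8     # must evenly divide each stage's token grid

cfg = SwinV2TinyConfig()
for i in range(4):
    side = cfg.img_size // cfg.patch_size // (2 ** i)
    print(f"stage {i + 1}: {side}x{side} tokens, "
          f"{cfg.embed_dim * 2 ** i} channels, {cfg.num_heads[i]} heads")
```

With these values, the token grid shrinks from 64×64 (96 channels) in stage 1 to 8×8 (768 channels) in stage 4, and the 8×8 window divides every stage's grid evenly.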
A major design revision over SwinV1 is the shift from the "pre-norm" to a "residual-post-norm" ("res-post-norm") structure in each Transformer block. Instead of normalizing the block input, the output of each residual branch is normalized before being added back to the main path:

$$\hat{x} = x + \operatorname{LN}\!\big(\operatorname{Attn}(x)\big), \qquad x' = \hat{x} + \operatorname{LN}\!\big(\operatorname{MLP}(\hat{x})\big)$$

This ordering mitigates uncontrolled growth of activation magnitudes on the main branch, keeping training stable as model depth and size scale up.
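The following is a minimal PyTorch sketch of the res-post-norm ordering. The attention module is a plain `nn.MultiheadAttention` stand-in for SwinV2's (shifted-)window attention, and drop-path and other regularization details are omitted.

```python
import torch
import torch.nn as nn

class ResPostNormBlock(nn.Module):
    """Sketch of SwinV2's res-post-norm block ordering (module internals are placeholders)."""
    def __init__(self, dim: int, mlp_ratio: float = 4.0):
        super().__init__()
        # Stand-in for (shifted-)window attention; 32-dim heads as in the table above.
        self.attn = nn.MultiheadAttention(dim, num_heads=dim // 32, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)),
            nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim),
        )
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Res-post-norm: normalize each branch's output *before* adding it back,
        # rather than normalizing the block input as in pre-norm.
        x = x + self.norm1(self.attn(x, x, x, need_weights=False)[0])
        x = x + self.norm2(self.mlp(x))
        return x
```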
2. Scaled Cosine Self-Attention Mechanism
SwinV2-T substitutes the conventional dot-product similarity in self-attention with a scaled cosine similarity, combined with a learnable positive scaling factor τ and a relative position bias B_ij. Formally, the attention logit between query q_i and key k_j is

$$\operatorname{Sim}(q_i, k_j) = \frac{\cos(q_i, k_j)}{\tau} + B_{ij}$$

where cos(·,·) denotes cosine similarity and B_ij is the relative position bias between positions i and j. Each head of each layer learns an independent τ (kept above 0.01), safeguarding against unbounded logits and contributing to training stability in deep architectures. The bounded range of cosine similarity ([-1, 1]) further assists numerical control.
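A compact PyTorch sketch of this mechanism is shown below. It assumes a [batch, heads, tokens, head_dim] tensor layout, a per-head τ parameterized in log space, and a relative position bias supplied externally (e.g., by a Log-CPB module); projection layers and masking are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaledCosineAttention(nn.Module):
    """Sketch of scaled cosine attention with a learnable, clamped per-head scale tau."""
    def __init__(self, num_heads: int):
        super().__init__()
        # tau is learned in log space so it stays positive; one value per head (tau starts at 1.0).
        self.log_tau = nn.Parameter(torch.zeros(num_heads, 1, 1))

    def forward(self, q, k, v, rel_pos_bias):
        # Cosine similarity between queries and keys: L2-normalize, then dot-product.
        q = F.normalize(q, dim=-1)
        k = F.normalize(k, dim=-1)
        # Clamp tau so it never drops below 0.01, as recommended for stability.
        tau = torch.clamp(self.log_tau.exp(), min=0.01)
        logits = (q @ k.transpose(-2, -1)) / tau + rel_pos_bias
        return logits.softmax(dim=-1) @ v

# Toy shapes: batch 2, 3 heads of 32 dims, 64 tokens per window.
B, H, N, d = 2, 3, 64, 32
attn = ScaledCosineAttention(num_heads=H)
q, k, v = (torch.randn(B, H, N, d) for _ in range(3))
out = attn(q, k, v, rel_pos_bias=torch.zeros(H, N, N))
print(out.shape)  # torch.Size([2, 3, 64, 32])
```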
3. Log-Spaced Continuous Relative Position Bias (Log-CPB)
To enable robust transfer from low-resolution pre-training to high-resolution fine-tuning, SwinV2 introduces a log-spaced continuous bias function for relative positional encoding. Rather than indexing a fixed parameter table, SwinV2 computes the position bias for a relative offset (Δx, Δy) as

$$B(\Delta x, \Delta y) = G\big(\widehat{\Delta x}, \widehat{\Delta y}\big), \qquad \widehat{\Delta x} = \operatorname{sign}(\Delta x) \cdot \log\!\big(1 + |\Delta x|\big)$$

(and analogously for Δy), where G is a lightweight fully-connected MLP (2 layers with a ReLU activation) evaluated over all offsets at runtime, yielding a bias matrix that adapts continuously to window sizes unseen during pre-training. Empirical results show that Log-CPB substantially reduces the accuracy degradation incurred when transferring SwinV2-Tiny from its 8×8 pre-training windows to larger windows, compared with bicubic interpolation of a parameterized bias table.
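The sketch below shows the core idea: a small MLP maps log-spaced relative coordinates to a per-head bias table that can be regenerated for any window size. The hidden width, the omission of coordinate normalization, and the fact that the per-offset table is later gathered per token pair inside the attention layer are simplifications relative to a full implementation.

```python
import torch
import torch.nn as nn

class LogCPB(nn.Module):
    """Sketch of a log-spaced continuous relative position bias generator."""
    def __init__(self, num_heads: int, hidden: int = 512):
        super().__init__()
        # 2-layer MLP with ReLU: (dx_hat, dy_hat) -> one bias value per head.
        self.mlp = nn.Sequential(
            nn.Linear(2, hidden), nn.ReLU(inplace=True), nn.Linear(hidden, num_heads)
        )

    def forward(self, window_size: int) -> torch.Tensor:
        # All relative offsets within one window, in [-(W-1), W-1].
        coords = torch.arange(-(window_size - 1), window_size, dtype=torch.float32)
        dy, dx = torch.meshgrid(coords, coords, indexing="ij")
        offsets = torch.stack([dy, dx], dim=-1)              # [2W-1, 2W-1, 2]
        # Log-spaced mapping: sign(d) * log(1 + |d|).
        offsets = torch.sign(offsets) * torch.log1p(offsets.abs())
        bias = self.mlp(offsets)                             # [2W-1, 2W-1, heads]
        return bias.permute(2, 0, 1)                         # [heads, 2W-1, 2W-1]

cpb = LogCPB(num_heads=3)
print(cpb(window_size=8).shape)   # bias table for the pre-training window
print(cpb(window_size=16).shape)  # larger window at test time, same MLP weights
```

Because the bias is generated by the MLP at runtime rather than stored as a table, the same weights can be evaluated for window sizes never seen during pre-training.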
4. Training Procedure and Hyperparameters
SwinV2-Tiny is pretrained on ImageNet-1K for 300 epochs using the AdamW optimizer, with the following hyperparameter regimen:
| Parameter | Value |
|---|---|
| Input size / window | 256×256 / 8×8 |
| Initial LR | 1×10⁻³ (cosine decay, 20-epoch linear warmup) |
| Weight decay | 0.05 |
| Batch size | 1,024 |
| Stochastic depth | 0.2 |
| Gradient clipping | max-norm 5.0 |
| Augmentations | RandAugment, Mixup, CutMix, random erasing |
| Fine-tuning LR schedule | cosine decay |
| Fine-tuning epochs | 30 |
For fine-tuning on higher resolution images, identical augmentation and optimizer settings are applied with appropriate adjustments to window size and learning rate.
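Translating the table above into standard PyTorch pieces gives the following minimal sketch. The stand-in model, the per-epoch (rather than per-iteration) scheduler stepping, and the omitted data/augmentation pipeline are simplifications, not the authors' released training script.

```python
import torch

# Hypothetical stand-in model; only the optimizer/schedule wiring is the point here.
model = torch.nn.Linear(768, 1000)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.05)

warmup_epochs, total_epochs = 20, 300
warmup = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=1e-2, total_iters=warmup_epochs)
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=total_epochs - warmup_epochs)
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer, schedulers=[warmup, cosine], milestones=[warmup_epochs])

for epoch in range(total_epochs):
    # ... forward/backward over 1,024-sample batches would go here ...
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()
```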
5. SimMIM Self-Supervised Pre-Training Modality
Although SwinV2-Tiny is primarily reported with supervised pre-training, the SwinV2 framework leverages SimMIM (Simple Masked Image Modeling) to reduce dependence on labeled data. In SimMIM, random block-wise masking (typically 60% of patches) is applied, and the network regresses the raw pixel values of the masked patches under an ℓ1 loss, without additional weighting factors. For larger SwinV2 models (e.g., the 3B-parameter SwinV2-G), a two-stage pre-training regime is described: self-supervised SimMIM on ImageNet-22K-ext, followed by supervised classification.
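The objective can be sketched as follows, assuming a pixel-space decoder output. The 32×32 masking block size and the tensor layout are illustrative choices here, not values stated in this article.

```python
import torch

def simmim_loss(pixels: torch.Tensor, prediction: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """L1 reconstruction loss computed on masked patches only (sketch of the SimMIM objective).

    pixels, prediction: [B, C, H, W] raw image and per-pixel prediction.
    mask: [B, 1, H, W], 1 where a block was masked out, 0 where it stays visible.
    """
    loss = (prediction - pixels).abs() * mask
    # Average over masked pixels only; the small epsilon guards the all-visible edge case.
    return loss.sum() / (mask.sum() * pixels.size(1) + 1e-5)

# Block-wise masking of ~60% of a 256x256 input (32x32 block size is an assumption).
B, C, H, W, block = 2, 3, 256, 256, 32
masked_blocks = (torch.rand(B, 1, H // block, W // block) < 0.6).float()
mask = masked_blocks.repeat_interleave(block, dim=2).repeat_interleave(block, dim=3)
images, recon = torch.rand(B, C, H, W), torch.rand(B, C, H, W)  # recon stands in for the decoder output
print(simmim_loss(images, recon, mask))
```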
6. Performance Evaluation and Transfer Behavior
On ImageNet-1K, SwinV2-Tiny demonstrates incremental gains over SwinV1-Tiny. When trained from scratch at 256×256 resolution with 8×8 windows (the pre-training configuration above), top-1 accuracies are as follows:
| Configuration | Top-1 Accuracy (%) |
|---|---|
| SwinV1-T (pre-norm, dot-product) | 81.5 |
| + Res-post-norm only | 81.6 |
| + Res-post-norm & cosine-attention | 81.7 |
| + Log-CPB (full SwinV2-T) | 81.8 |
The cumulative improvement of +0.3% accrues in roughly +0.1% steps from each architectural enhancement. Notably, when transferred to a larger input resolution and window size without fine-tuning, SwinV2-Tiny retains approximately 81.8% accuracy, whereas SwinV1-Tiny drops to 79.4%. No results for downstream tasks (e.g., object detection, semantic segmentation) are reported for the Tiny variant in the source.
7. Implementation Considerations in Practice
SwinV2-Tiny benefits from several practical engineering recommendations:
- For distributed training, DeepSpeed’s ZeRO stage-1 is used to shard optimizer states across GPUs.
- Activation checkpointing may be applied to Transformer layers to reduce activation memory, at roughly a 30% training slowdown for large models (see the sketch below).
- For very large windows and resolutions, self-attention is computed sequentially rather than in a single large batch, reducing peak memory at little cost in speed.
- The learned scaling factor τ should be kept above 0.01 (it is typically initialized to 1.0 and then trained).
- Window sizes should be even numbers to facilitate correct window shifting.
- For models larger than approximately 200M parameters, additional LayerNorm is inserted on the main branch every six blocks to enhance stability.
These techniques ensure robust scaling, computational efficiency, and stable convergence in high-capacity visual Transformer training.
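To illustrate the activation-checkpointing recommendation above, the following minimal `torch.utils.checkpoint` sketch wraps a stage's blocks; the stand-in blocks and the `use_reentrant=False` setting are illustrative choices, not the exact configuration used in the paper.

```python
import torch
from torch.utils.checkpoint import checkpoint

class CheckpointedStage(torch.nn.Module):
    """Wrap a stage's blocks so activations are recomputed during the backward pass."""
    def __init__(self, blocks):
        super().__init__()
        self.blocks = torch.nn.ModuleList(blocks)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for block in self.blocks:
            # Trades extra compute in backward for much lower activation memory.
            x = checkpoint(block, x, use_reentrant=False)
        return x

# Example with two stand-in blocks (real usage would wrap SwinV2 Transformer blocks).
stage = CheckpointedStage(
    [torch.nn.Sequential(torch.nn.Linear(96, 96), torch.nn.GELU()) for _ in range(2)])
out = stage(torch.randn(4, 96, requires_grad=True))
out.sum().backward()
```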
8. Relationship to Larger Models and Scalability
SwinV2-Tiny represents the foundational scaled-down design within the SwinV2 family, sharing its core innovations with larger models such as SwinV2-Small, SwinV2-Base, SwinV2-Large, and the huge and giant variants. The architectural improvements introduced in the Tiny model are the same ones that enable training much larger vision Transformers (up to 3 billion parameters, e.g., SwinV2-G). All models benefit from stabilized training dynamics and resolution transferability; in the large-scale settings, the self-supervised regime additionally reduces the labeled data and training time required compared with prior billion-parameter visual models, with reported savings of roughly an order of magnitude or more (Liu et al., 2021).