SwinV2-Tiny: Efficient Vision Transformer
- The paper introduces a residual-post-norm design and Log-CPB technique, ensuring stable training and improved resolution transfer for high-resolution tasks.
- It employs a scaled cosine self-attention mechanism with a learnable scaling factor to maintain bounded outputs and robust performance.
- Experimental results show incremental accuracy gains and enhanced transferability compared to SwinV1, highlighting its efficiency in visual representation learning.
SwinV2-Tiny (SwinV2-T) is a small-scale variant within the Swin Transformer V2 family, designed for efficient and stable high-resolution visual representation learning. Building upon the hierarchical window-based Transformer framework of its predecessor, SwinV1, SwinV2-Tiny introduces refinements to residual structure, attention, and positional encoding, resulting in improved training stability, more robust transfer to high-resolution inputs, and modest gains in image recognition accuracy (Liu et al., 2021).
1. Architecture and Structural Innovations
SwinV2-Tiny retains the multi-scale design of Swin Transformer, employing four processing stages. The input image (typically 256×256) is partitioned into non-overlapping 4×4 patches, each linearly embedded into C = 96 channels. The resulting token sequence then passes through the four Transformer stages, with channel dimensions and block counts as follows:
| Stage | Channels | Blocks | Heads (32-dim each) |
|---|---|---|---|
| 1 | 96 | 2 | 3 |
| 2 | 192 | 2 | 6 |
| 3 | 384 | 6 | 12 |
| 4 | 768 | 2 | 24 |
The window size for local self-attention is 8×8 at the default pre-training resolution but is enlarged for higher-resolution inputs so that the token grid at every stage remains evenly partitioned into windows.
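As a quick reference, the stage layout above can be condensed into a small configuration sketch. The class and field names below are illustrative rather than the API of any particular library, and the 256×256 input size follows the pre-training setting described later in this article.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class SwinV2TinyConfig:
    """Illustrative stage layout for SwinV2-T; names and defaults are a sketch."""
    img_size: int = 256      # typical pre-training resolution
    patch_size: int = 4      # 4x4 non-overlapping patches
    embed_dim: int = 96      # channels after the linear embedding (stage 1)
    depths: List[int] = field(default_factory=lambda: [2, 2, 6, 2])
    num_heads: List[int] = field(default_factory=lambda: [3, 6, 12, 24])
    window_size: int = 8     # must evenly divide each stage's token grid

cfg = SwinV2TinyConfig()
for i in range(4):
    side = cfg.img_size // cfg.patch_size // (2 ** i)
    print(f"stage {i + 1}: {side}x{side} tokens, "
          f"{cfg.embed_dim * 2 ** i} channels, {cfg.num_heads[i]} heads")
```

With these values, the token grid shrinks from 64×64 (96 channels) in stage 1 to 8×8 (768 channels) in stage 4, and the 8×8 window divides every stage's grid evenly.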
A major design revision over SwinV1 is the shift from the "pre-norm" to a "residual-post-norm" ("res-post-norm") structure in each Transformer block. Instead of normalizing the block input, the output of each residual branch is normalized before being added back to the main path:

$$\hat{x} = x + \operatorname{LN}\!\big(\operatorname{Attn}(x)\big), \qquad x' = \hat{x} + \operatorname{LN}\!\big(\operatorname{MLP}(\hat{x})\big)$$

This ordering mitigates uncontrolled growth of activation magnitudes on the main branch, keeping training stable as model depth and size scale up.
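The following is a minimal PyTorch sketch of the res-post-norm ordering. The attention module is a plain `nn.MultiheadAttention` stand-in for SwinV2's (shifted-)window attention, and drop-path and other regularization details are omitted.

```python
import torch
import torch.nn as nn

class ResPostNormBlock(nn.Module):
    """Sketch of SwinV2's res-post-norm block ordering (module internals are placeholders)."""
    def __init__(self, dim: int, mlp_ratio: float = 4.0):
        super().__init__()
        # Stand-in for (shifted-)window attention; 32-dim heads as in the table above.
        self.attn = nn.MultiheadAttention(dim, num_heads=dim // 32, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)),
            nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim),
        )
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Res-post-norm: normalize each branch's output *before* adding it back,
        # rather than normalizing the block input as in pre-norm.
        x = x + self.norm1(self.attn(x, x, x, need_weights=False)[0])
        x = x + self.norm2(self.mlp(x))
        return x
```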
2. Scaled Cosine Self-Attention Mechanism
SwinV2-T substitutes the conventional dot-product similarity in self-attention with a scaled cosine similarity, combined with a learnable positive scaling factor τ and a relative position bias B_ij. Formally, the attention logit between query q_i and key k_j is

$$\operatorname{Sim}(q_i, k_j) = \frac{\cos(q_i, k_j)}{\tau} + B_{ij}$$

where cos(·,·) denotes cosine similarity and B_ij is the relative position bias between positions i and j. Each head of each layer learns an independent τ (kept above 0.01), safeguarding against unbounded logits and contributing to training stability in deep architectures. The bounded range of cosine similarity ([-1, 1]) further assists numerical control.
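A compact PyTorch sketch of this mechanism is shown below. It assumes a [batch, heads, tokens, head_dim] tensor layout, a per-head τ parameterized in log space, and a relative position bias supplied externally (e.g., by a Log-CPB module); projection layers and masking are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaledCosineAttention(nn.Module):
    """Sketch of scaled cosine attention with a learnable, clamped per-head scale tau."""
    def __init__(self, num_heads: int):
        super().__init__()
        # tau is learned in log space so it stays positive; one value per head (tau starts at 1.0).
        self.log_tau = nn.Parameter(torch.zeros(num_heads, 1, 1))

    def forward(self, q, k, v, rel_pos_bias):
        # Cosine similarity between queries and keys: L2-normalize, then dot-product.
        q = F.normalize(q, dim=-1)
        k = F.normalize(k, dim=-1)
        # Clamp tau so it never drops below 0.01, as recommended for stability.
        tau = torch.clamp(self.log_tau.exp(), min=0.01)
        logits = (q @ k.transpose(-2, -1)) / tau + rel_pos_bias
        return logits.softmax(dim=-1) @ v

# Toy shapes: batch 2, 3 heads of 32 dims, 64 tokens per window.
B, H, N, d = 2, 3, 64, 32
attn = ScaledCosineAttention(num_heads=H)
q, k, v = (torch.randn(B, H, N, d) for _ in range(3))
out = attn(q, k, v, rel_pos_bias=torch.zeros(H, N, N))
print(out.shape)  # torch.Size([2, 3, 64, 32])
```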
3. Log-Spaced Continuous Relative Position Bias (Log-CPB)
To enable robust transfer from low-resolution pre-training to high-resolution fine-tuning, SwinV2 introduces a log-spaced continuous bias function for relative positional encoding. Rather than indexing a fixed parameter table, SwinV2 computes the position bias for a relative offset (Δx, Δy) as

$$B(\Delta x, \Delta y) = G\big(\widehat{\Delta x}, \widehat{\Delta y}\big), \qquad \widehat{\Delta x} = \operatorname{sign}(\Delta x) \cdot \log\!\big(1 + |\Delta x|\big)$$

(and analogously for Δy), where G is a lightweight fully-connected MLP (2 layers with a ReLU activation) evaluated over all offsets at runtime, yielding a bias matrix that adapts continuously to window sizes unseen during pre-training. Empirical results show that Log-CPB substantially reduces the accuracy degradation incurred when transferring SwinV2-Tiny from its 8×8 pre-training windows to larger windows, compared with bicubic interpolation of a parameterized bias table.
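The sketch below shows the core idea: a small MLP maps log-spaced relative coordinates to a per-head bias table that can be regenerated for any window size. The hidden width, the omission of coordinate normalization, and the fact that the per-offset table is later gathered per token pair inside the attention layer are simplifications relative to a full implementation.

```python
import torch
import torch.nn as nn

class LogCPB(nn.Module):
    """Sketch of a log-spaced continuous relative position bias generator."""
    def __init__(self, num_heads: int, hidden: int = 512):
        super().__init__()
        # 2-layer MLP with ReLU: (dx_hat, dy_hat) -> one bias value per head.
        self.mlp = nn.Sequential(
            nn.Linear(2, hidden), nn.ReLU(inplace=True), nn.Linear(hidden, num_heads)
        )

    def forward(self, window_size: int) -> torch.Tensor:
        # All relative offsets within one window, in [-(W-1), W-1].
        coords = torch.arange(-(window_size - 1), window_size, dtype=torch.float32)
        dy, dx = torch.meshgrid(coords, coords, indexing="ij")
        offsets = torch.stack([dy, dx], dim=-1)              # [2W-1, 2W-1, 2]
        # Log-spaced mapping: sign(d) * log(1 + |d|).
        offsets = torch.sign(offsets) * torch.log1p(offsets.abs())
        bias = self.mlp(offsets)                             # [2W-1, 2W-1, heads]
        return bias.permute(2, 0, 1)                         # [heads, 2W-1, 2W-1]

cpb = LogCPB(num_heads=3)
print(cpb(window_size=8).shape)   # bias table for the pre-training window
print(cpb(window_size=16).shape)  # larger window at test time, same MLP weights
```

Because the bias is generated by the MLP at runtime rather than stored as a table, the same weights can be evaluated for window sizes never seen during pre-training.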
4. Training Procedure and Hyperparameters
SwinV2-Tiny is pretrained on ImageNet-1K for 300 epochs using the AdamW optimizer, with the following hyperparameter regimen:
| Parameter | Value |
|---|---|
| Input size / window | 256×256 / 8×8 |
| Initial LR | 1×10⁻³ (cosine decay, 20-epoch linear warmup) |
| Weight decay | 0.05 |
| Batch size | 1,024 |
| Stochastic depth | 0.2 |
| Gradient clipping | max-norm 5.0 |
| Augmentations | RandAugment, Mixup, CutMix, random erasing |
| Fine-tuning LR schedule | cosine decay |
| Fine-tuning epochs | 30 |
For fine-tuning on higher resolution images, identical augmentation and optimizer settings are applied with appropriate adjustments to window size and learning rate.
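Translating the table above into standard PyTorch pieces gives the following minimal sketch. The stand-in model, the per-epoch (rather than per-iteration) scheduler stepping, and the omitted data/augmentation pipeline are simplifications, not the authors' released training script.

```python
import torch

# Hypothetical stand-in model; only the optimizer/schedule wiring is the point here.
model = torch.nn.Linear(768, 1000)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.05)

warmup_epochs, total_epochs = 20, 300
warmup = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=1e-2, total_iters=warmup_epochs)
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=total_epochs - warmup_epochs)
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer, schedulers=[warmup, cosine], milestones=[warmup_epochs])

for epoch in range(total_epochs):
    # ... forward/backward over 1,024-sample batches would go here ...
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()
```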
5. SimMIM Self-Supervised Pre-Training Modality
Although SwinV2-Tiny is primarily reported with supervised pre-training, the SwinV2 framework leverages SimMIM (Simple Masked Image Modeling) to reduce dependence on labeled data. In SimMIM, random block-wise masking (typically 60% of patches) is applied, and the network regresses the raw pixel values of the masked patches under an ℓ1 loss, without additional weighting factors. For larger SwinV2 models (e.g., the 3B-parameter SwinV2-G), a two-stage pre-training regime is described: self-supervised SimMIM on ImageNet-22K-ext, followed by supervised classification.
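The objective can be sketched as follows, assuming a pixel-space decoder output. The 32×32 masking block size and the tensor layout are illustrative choices here, not values stated in this article.

```python
import torch

def simmim_loss(pixels: torch.Tensor, prediction: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """L1 reconstruction loss computed on masked patches only (sketch of the SimMIM objective).

    pixels, prediction: [B, C, H, W] raw image and per-pixel prediction.
    mask: [B, 1, H, W], 1 where a block was masked out, 0 where it stays visible.
    """
    loss = (prediction - pixels).abs() * mask
    # Average over masked pixels only; the small epsilon guards the all-visible edge case.
    return loss.sum() / (mask.sum() * pixels.size(1) + 1e-5)

# Block-wise masking of ~60% of a 256x256 input (32x32 block size is an assumption).
B, C, H, W, block = 2, 3, 256, 256, 32
masked_blocks = (torch.rand(B, 1, H // block, W // block) < 0.6).float()
mask = masked_blocks.repeat_interleave(block, dim=2).repeat_interleave(block, dim=3)
images, recon = torch.rand(B, C, H, W), torch.rand(B, C, H, W)  # recon stands in for the decoder output
print(simmim_loss(images, recon, mask))
```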
6. Performance Evaluation and Transfer Behavior
On ImageNet-1K, SwinV2-Tiny demonstrates incremental gains over SwinV1-Tiny. When trained from scratch at 256×256 resolution with 8×8 windows (the pre-training configuration above), top-1 accuracies are as follows:
| Configuration | Top-1 Accuracy (%) |
|---|---|
| SwinV1-T (pre-norm, dot-product) | 81.5 |
| + Res-post-norm only | 81.6 |
| + Res-post-norm & cosine-attention | 81.7 |
| + Log-CPB (full SwinV2-T) | 81.8 |
The cumulative improvement of +0.3% accrues in roughly +0.1% steps from each architectural enhancement. Notably, when transferred to a larger input resolution and window size without fine-tuning, SwinV2-Tiny retains approximately 81.8% accuracy, whereas SwinV1-Tiny drops to 79.4%. No results for downstream tasks (e.g., object detection, semantic segmentation) are reported for the Tiny variant in the source.
7. Implementation Considerations in Practice
SwinV2-Tiny benefits from several practical engineering recommendations:
- For distributed training, DeepSpeed’s ZeRO stage-1 is used to shard optimizer states across GPUs.
- Activation checkpointing may be applied to Transformer layers to reduce activation memory, at roughly a 30% training slowdown for large models (see the sketch below).
- For very large windows and resolutions, self-attention is computed sequentially rather than in a single large batch, reducing peak memory at little cost in speed.
- The learned scaling factor τ should be kept above 0.01 (it is typically initialized to 1.0 and then trained).
- Window sizes should be even numbers to facilitate correct window shifting.
- For models larger than approximately 200M parameters, additional LayerNorm is inserted on the main branch every six blocks to enhance stability.
These techniques ensure robust scaling, computational efficiency, and stable convergence in high-capacity visual Transformer training.
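To illustrate the activation-checkpointing recommendation above, the following minimal `torch.utils.checkpoint` sketch wraps a stage's blocks; the stand-in blocks and the `use_reentrant=False` setting are illustrative choices, not the exact configuration used in the paper.

```python
import torch
from torch.utils.checkpoint import checkpoint

class CheckpointedStage(torch.nn.Module):
    """Wrap a stage's blocks so activations are recomputed during the backward pass."""
    def __init__(self, blocks):
        super().__init__()
        self.blocks = torch.nn.ModuleList(blocks)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for block in self.blocks:
            # Trades extra compute in backward for much lower activation memory.
            x = checkpoint(block, x, use_reentrant=False)
        return x

# Example with two stand-in blocks (real usage would wrap SwinV2 Transformer blocks).
stage = CheckpointedStage(
    [torch.nn.Sequential(torch.nn.Linear(96, 96), torch.nn.GELU()) for _ in range(2)])
out = stage(torch.randn(4, 96, requires_grad=True))
out.sum().backward()
```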
8. Relationship to Larger Models and Scalability
SwinV2-Tiny represents the foundational scaled-down design within the SwinV2 family, sharing its core innovations with larger models such as SwinV2-Small, SwinV2-Base, SwinV2-Large, and the huge and giant variants. The architectural improvements introduced in the Tiny model are the same ones that enable training much larger vision Transformers (up to 3 billion parameters, e.g., SwinV2-G). All models benefit from stabilized training dynamics and resolution transferability; in the large-scale settings, the self-supervised regime additionally reduces the labeled data and training time required compared with prior billion-parameter visual models, with reported savings of roughly an order of magnitude or more (Liu et al., 2021).