SparseSwin: Efficient Vision Transformer
- The paper introduces the SparTa Block, which embeds a sparse token converter in the Swin hierarchy to enable global self-attention over compressed tokens.
- It achieves a 98.4% token reduction by converting high-dimensional feature maps into a small latent token set, significantly lowering computational complexity.
- Benchmark results show that SparseSwin outperforms traditional vision transformers in accuracy while reducing parameters by approximately 36.3%.
SparseSwin is a vision transformer architecture that embeds a sparse token selection (the SparTa Block) within the Swin Transformer hierarchy to achieve parameter efficiency and strong classification performance. The design incorporates a sparse token converter that compresses feature maps into a small set of latent tokens, on which global self-attention and subsequent processing are performed. This architectural innovation delivers significant parameter reduction while improving accuracy across canonical image recognition benchmarks (Pinasthika et al., 2023).
1. SparTa (Sparse Transformer) Block Architecture
The SparTa Block is the centerpiece of SparseSwin. It consists of two primary components: the sparse token converter and a stack of standard transformer sub-blocks.
1.1 Sparse Token Converter
Given the output feature map from Stage 3 of the Swin-T backbone, X ∈ ℝ^{B×C×h×w} (for a 224×224 input, h = w = 14 and C = 384),
SparseSwin applies a convolution to mix channels and project to the embedding dimension, producing X′ ∈ ℝ^{B×e×h×w}.
Flattening the spatial dimensions and projecting them via a linear layer yields a token sequence S_T ∈ ℝ^{B×t×e}. Here, n is the input token count, t is the compressed token count, and the sparsity rate is 1 − t/n. With t = 49 and e = 768, the SparTa Block attends over only 49 latent tokens, i.e., a 98.4% reduction relative to the initial token count after patch partition (n = 56 × 56 = 3136, since 1 − 49/3136 ≈ 0.984).
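A minimal PyTorch sketch of the converter, assuming a 1×1 convolution for the channel projection and a single linear layer for token reduction (the class name, kernel size, and defaults are illustrative, not taken from the paper):

```python
import torch
import torch.nn as nn

class SparseTokenConverter(nn.Module):
    """Sketch of a sparse token converter: channel projection C -> e,
    then token reduction n -> t along the flattened spatial axis."""
    def __init__(self, in_channels=384, embed_dim=768,
                 num_tokens_in=196, num_tokens_out=49):
        super().__init__()
        self.proj = nn.Conv2d(in_channels, embed_dim, kernel_size=1)  # C -> e
        self.reduce = nn.Linear(num_tokens_in, num_tokens_out)        # n -> t

    def forward(self, x):            # x: (B, C, h, w)
        x = self.proj(x)             # (B, e, h, w)
        x = x.flatten(2)             # (B, e, h*w) = (B, e, n)
        x = self.reduce(x)           # (B, e, t)
        return x.transpose(1, 2)     # (B, t, e)

# Usage: Stage-3 Swin-T feature map for a 224x224 input
feat = torch.randn(2, 384, 14, 14)
tokens = SparseTokenConverter()(feat)
print(tokens.shape)                  # torch.Size([2, 49, 768])
```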
1.2 Transformer Sub-block Stack
The compressed sequence S_T undergoes L transformer layers, each comprising pre-norm multi-head self-attention (MSA) and an MLP, both with residual connections:

Z′_ℓ = MSA(LN(Z_{ℓ−1})) + Z_{ℓ−1}
Z_ℓ = MLP(LN(Z′_ℓ)) + Z′_ℓ,  ℓ = 1, …, L,  with Z₀ = S_T.

A final normalization is applied to yield the output representation Z_L.
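The two update equations can be sketched as a single PyTorch module; the head count and MLP expansion ratio are illustrative defaults, not values from the paper:

```python
import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    """One pre-norm transformer sub-block of the SparTa stack:
    MSA and MLP, each preceded by LayerNorm, each with a residual."""
    def __init__(self, dim=768, heads=8, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, z):
        h = self.norm1(z)
        z = z + self.attn(h, h, h, need_weights=False)[0]  # Z' = MSA(LN(Z)) + Z
        z = z + self.mlp(self.norm2(z))                    # Z  = MLP(LN(Z')) + Z'
        return z

tokens = torch.randn(2, 49, 768)
out = PreNormBlock()(tokens)
print(out.shape)  # torch.Size([2, 49, 768])
```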
2. Integration within the Swin Transformer Hierarchy
SparseSwin retains the four-stage, progressively downsampling hierarchy of Swin-T through Stage 3. The full workflow is:
- Stage 1: Patch partition and linear embedding, Swin blocks ×2, output shape 56 × 56 × 96.
- Stage 2: Patch merging, Swin blocks ×2, output 28 × 28 × 192.
- Stage 3: Patch merging, Swin blocks ×6, output 14 × 14 × 384.
- Stage 4: Replaced by SparTa Block (sparse token conversion and transformer stack).
Pseudocode outline of the forward path:
```
Input: x ∈ ℝ^{B×3×H×W}
f₁ = Stage1(x)                    # ℝ^{B×96×56×56}
f₂ = Stage2(f₁)                   # ℝ^{B×192×28×28}
f₃ = Stage3(f₂)                   # ℝ^{B×384×14×14}
S_T = SparseTokenConverter(f₃)    # ℝ^{B×t×e}
Z = SparTaBlock(S_T)              # ℝ^{B×t×e}
y = Classifier(Pool(Z))
```
This design leverages Swin's early spatial reduction to produce a feature map that, once compressed, is amenable to global attention at a dramatically reduced token count.
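A shape-level sketch of this forward path; strided convolutions stand in for the Swin stages purely to make the tensor shapes concrete (a real implementation uses Swin blocks with patch merging, and the SparTa transformer stack is omitted here):

```python
import torch
import torch.nn as nn

# Stand-ins that reproduce only the shapes of the pseudocode above.
stage1 = nn.Conv2d(3, 96, 4, stride=4)        # (B,3,224,224) -> (B,96,56,56)
stage2 = nn.Conv2d(96, 192, 2, stride=2)      # -> (B,192,28,28)
stage3 = nn.Conv2d(192, 384, 2, stride=2)     # -> (B,384,14,14)
to_embed = nn.Conv2d(384, 768, 1)             # channel projection C -> e
to_tokens = nn.Linear(196, 49)                # token reduction n -> t
classifier = nn.Linear(768, 100)              # 100-class head (ImageNet100)

x = torch.randn(2, 3, 224, 224)
f3 = stage3(stage2(stage1(x)))
s = to_tokens(to_embed(f3).flatten(2)).transpose(1, 2)  # (B, 49, 768)
# SparTa transformer stack would process s here; then pool and classify.
y = classifier(s.mean(dim=1))                 # (B, 100)
print(y.shape)  # torch.Size([2, 100])
```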
3. Computational Complexity and Parameter Efficiency
SparseSwin achieves lower complexity by shifting quadratic self-attention to act on a reduced number of tokens.
For a transformer block operating on a sequence of length n with embedding dimension e (split across h attention heads), multi-head self-attention costs

Ω(MSA) = 4ne² + 2n²e,

where the 4ne² term covers the QKV and output projections and the 2n²e term the attention itself (the head count h partitions e but does not change the total).
- In Swin-T, windowed attention partitions the feature map into M × M windows (M = 7), each processed independently, reducing the quadratic term to Ω(W-MSA) = 4ne² + 2M²ne.
- SparseSwin applies global self-attention to the t compressed tokens after conversion, yielding Ω(SparTa-MSA) = 4te² + 2t²e, plus the one-time converter cost of the channel-mixing convolution (scaling with k², the convolution kernel size) and the n → t linear projection.
With t ≪ n (e.g., t = 49 vs. n = 3136), the dominant quadratic term is sharply curtailed.
Parameter count comparison:
- Swin-T: 27.6 M
- SparseSwin: 17.58 M
- Relative reduction: ≈36.3% (1 − 17.58/27.6). Despite this, expressive capacity is maintained, as sparsification targets only the final transformer stage.
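The quoted reductions follow directly from the numbers above:

```python
# Quick arithmetic check of the reductions quoted in this section.
n, t, e = 3136, 49, 768

# Quadratic attention term 2*n^2*e vs. 2*t^2*e
full = 2 * n**2 * e
sparse = 2 * t**2 * e
print(f"attention term shrinks by {1 - sparse / full:.4%}")   # 99.9756%

# Token reduction 1 - t/n
print(f"token reduction: {1 - t / n:.1%}")                    # 98.4%

# Parameter reduction 1 - 17.58/27.6
print(f"parameter reduction: {1 - 17.58 / 27.6:.1%}")         # 36.3%
```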
4. Training Regimen and Regularization
The experimental protocol is as follows:
- Datasets: ImageNet100 (a 100-class subset of ImageNet1K, 224×224 resolution), CIFAR-10, and CIFAR-100 (upscaled to 224×224).
- ImageNet100: Adam optimizer, batch size 128, 100 epochs; the first two stages are frozen and initialized from ImageNet1K-pretrained weights. Key SparTa hyperparameters: t = 49 tokens, e = 768.
- CIFAR-10/100: AdamW optimizer, weight decay 0.01, with standard data augmentation (random crop, horizontal flip, resize-crop).
- Attention regularization: the training objective augments cross-entropy with a norm penalty on the attention scores,

  L = L_CE + λ‖A‖_p,

  where A gathers the attention scores in the SparTa Block, L_CE is the cross-entropy loss, and p ∈ {1, 2}; the best-reported variant uses the L2 norm.
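A sketch of the regularized objective; the λ value, function name, and exact norm handling are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def regularized_loss(logits, targets, attn_scores, lam=1e-4, p=2):
    """Cross-entropy plus an Lp penalty on the SparTa attention scores.
    lam and the norm handling are illustrative; the reported best model
    corresponds to the L2 variant (p=2)."""
    ce = F.cross_entropy(logits, targets)
    reg = attn_scores.norm(p=p)
    return ce + lam * reg

logits = torch.randn(4, 100)
targets = torch.randint(0, 100, (4,))
attn = torch.rand(4, 8, 49, 49)   # (B, heads, t, t) attention map
loss = regularized_loss(logits, targets, attn)
print(loss.item())
```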
5. Benchmarking and Comparative Evaluation
Empirical results on standard image classification datasets highlight SparseSwin's parameter efficiency and accuracy improvements.
ImageNet100 (224×224):
| Model | Parameters (M) | Type | Accuracy (%) |
|---|---|---|---|
| Swin-T | 27.6 | Transformer | 85.22 |
| ViT-B | 85.9 | Transformer | 80.90 |
| DLME (ResNet-50) | – | ConvNet | 79.3 |
| SparseSwin (L2 reg.) | 17.58 | Transformer | 86.96 |
CIFAR-10:
| Model | Parameters (M) | Type | Accuracy (%) |
|---|---|---|---|
| DenseNet-BC-190+Mixup | 25.6 | ConvNet | 97.3 |
| ResNet-XnIDR | 23.86 | ConvNet | 96.87 |
| NesT-B | 97.2 | Transformer | 97.2 |
| CRATE-S/B/L | 13.1/22.8/77.6 | Transformer | 96.0/96.8/97.2 |
| SparseSwin | 17.58 | Transformer | 97.43 |
CIFAR-100:
| Model | Parameters (M) | Type | Accuracy (%) |
|---|---|---|---|
| ResNeXt-50 | 25.03 | ConvNet | 84.42 |
| NesT-B | 97.2 | Transformer | 82.56 |
| CRATE-S/B/L | 13.12/22.8/77.6 | Transformer | 81.0/82.7/83.6 |
| SparseSwin | 17.58 | Transformer | 85.35 |
Accuracy per million parameters (ImageNet100): SparseSwin, 86.96/17.58 ≈ 4.95; Swin-T, 85.22/27.6 ≈ 3.09.
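These ratios follow from the ImageNet100 table:

```python
# Accuracy per million parameters, computed from the ImageNet100 table above.
models = {"SparseSwin": (86.96, 17.58), "Swin-T": (85.22, 27.6)}
for name, (acc, params) in models.items():
    print(f"{name}: {acc / params:.2f} accuracy points per M params")
# SparseSwin: 4.95, Swin-T: 3.09
```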
A plausible implication is that the focus of heavy transformer computation on a small, optimized set of latent tokens preserves or even enhances information modeling, while yielding substantial gains in compute and parameter efficiency.
6. Significance within Vision Transformer Research
SparseSwin introduces a modular sparse token selection mechanism that can be integrated seamlessly with hierarchical transformer backbones. By removing the quadratic bottleneck associated with attention over large spatial grids, it demonstrates that locality in earlier stages combined with global reasoning over compressed tokens is highly effective. SparseSwin establishes new accuracy baselines in the low-parameter regime for transformer-based image recognizers, revealing a promising direction for the design of efficient, scalable vision transformers (Pinasthika et al., 2023).