The paper introduces the SparTa Block, which embeds a sparse token converter in the Swin hierarchy to enable global self-attention over compressed tokens.
It achieves a 98.4% token reduction by converting high-dimensional feature maps into a small latent token set, significantly lowering computational complexity.
Benchmark results show that SparseSwin outperforms traditional vision transformers in accuracy while reducing parameters by approximately 36.3% relative to Swin-T (17.58M vs. 27.6M).
SparseSwin is a vision transformer architecture that embeds a sparse token selection (the SparTa Block) within the Swin Transformer hierarchy to achieve parameter efficiency and strong classification performance. The design incorporates a sparse token converter that compresses feature maps into a small set of latent tokens, on which global self-attention and subsequent processing are performed. This architectural innovation delivers significant parameter reduction while improving accuracy across canonical image recognition benchmarks (Pinasthika et al., 2023).
1. SparTa (Sparse Transformer) Block Architecture
The SparTa Block is the centerpiece of SparseSwin. It consists of two primary components: the sparse token converter and a stack of L standard transformer sub-blocks.
1.1 Sparse Token Converter
Given the output feature map from Stage 3 of the Swin-T backbone, $X_3 \in \mathbb{R}^{B \times C \times H_3 \times W_3}$ with $H_3 = H/32$ and $W_3 = W/32$, SparseSwin applies a 3×3 convolution to mix channels, producing

$$\tilde{X}_3 = \mathrm{Conv}_{3\times 3}(X_3) \in \mathbb{R}^{B \times e \times H_3 \times W_3}.$$

Flattening the spatial dimensions and projecting along the token axis via a linear layer yields a token sequence:

$$S_T = \mathrm{Linear}_{H_3 W_3 \to t}\big(\mathrm{Flatten}(\tilde{X}_3)\big) \in \mathbb{R}^{B \times t \times e}.$$

Here, $N := H_3 W_3$ is the input token count and $N' := t$ the compressed token count. With $(H, W) = (224, 224)$, $H_3 = W_3 = 7 \Rightarrow N = 49$; the converter fixes the latent token count at $t = 49$, so global self-attention always operates on 49 tokens regardless of input resolution. Relative to the initial patch grid of $N_0 = (H/4)(W/4) = 3136$ tokens, this is a sparsity rate of $1 - t/N_0 \approx 0.984$, i.e., the 98.4% token reduction cited above.
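As a sketch, the sparse token converter can be written in PyTorch; the embedding width `e = 512`, the batch size, and the module names here are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class SparseTokenConverter(nn.Module):
    """Sketch of the SparTa sparse token converter.

    Compresses a B x C x H3 x W3 feature map into B x t x e latent tokens:
    a 3x3 convolution mixes channels (C -> e), then a linear layer maps the
    flattened H3*W3 token axis down to t tokens.
    """
    def __init__(self, in_channels: int, embed_dim: int,
                 num_spatial: int, num_tokens: int):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, embed_dim, kernel_size=3, padding=1)
        self.to_tokens = nn.Linear(num_spatial, num_tokens)  # acts on the token axis

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.conv(x)          # B x e x H3 x W3
        x = x.flatten(2)          # B x e x (H3*W3)
        x = self.to_tokens(x)     # B x e x t
        return x.transpose(1, 2)  # B x t x e

# Example with the 224x224 setting from the text: H3 = W3 = 7, t = 49.
conv = SparseTokenConverter(in_channels=768, embed_dim=512,
                            num_spatial=49, num_tokens=49)
tokens = conv(torch.randn(2, 768, 7, 7))
print(tokens.shape)  # torch.Size([2, 49, 512])
```

Applying the linear layer to the flattened spatial axis (rather than the channel axis) is what makes the token count, not the channel width, the compressed dimension.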
1.2 Transformer Sub-block Stack
The compressed sequence $S_T \in \mathbb{R}^{B \times t \times e}$ undergoes $L$ transformer layers, each comprising pre-norm multi-head self-attention (MSA) and MLP with residuals:

$$Z^{0} = S_T, \qquad \hat{Z}^{\ell} = Z^{\ell-1} + \mathrm{MSA}\big(\mathrm{LN}(Z^{\ell-1})\big), \qquad Z^{\ell} = \hat{Z}^{\ell} + \mathrm{MLP}\big(\mathrm{LN}(\hat{Z}^{\ell})\big), \quad \ell = 1, \dots, L.$$

Final normalization is applied to yield the output tokens $Z = \mathrm{LN}(Z^{L})$.
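The sub-block stack can be sketched with standard pre-norm layers; the depth, head count, and MLP ratio below are illustrative, not the paper's exact settings:

```python
import torch
import torch.nn as nn

class TransformerSubBlock(nn.Module):
    """One pre-norm transformer layer: residual MSA followed by residual MLP."""
    def __init__(self, dim: int, heads: int, mlp_ratio: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim), nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim),
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        h = self.norm1(z)
        z = z + self.attn(h, h, h, need_weights=False)[0]  # residual MSA
        z = z + self.mlp(self.norm2(z))                    # residual MLP
        return z

# Stack L layers over the compressed sequence S_T, then apply the final norm.
L_layers, dim = 2, 512
stack = nn.Sequential(*[TransformerSubBlock(dim, heads=8) for _ in range(L_layers)])
final_norm = nn.LayerNorm(dim)
out = final_norm(stack(torch.randn(2, 49, dim)))
print(out.shape)  # torch.Size([2, 49, 512])
```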
2. Integration within the Swin Transformer Hierarchy
SparseSwin retains the four-stage, progressively downsampling hierarchy of Swin-T through Stage 3. The full workflow is:

1. Patch partition and embedding of the input image into an (H/4)×(W/4) token grid.
2. Swin Transformer Stages 1–3: window and shifted-window attention blocks interleaved with patch-merging downsampling.
3. SparTa Block: the sparse token converter compresses the final feature map into t latent tokens, followed by L transformer sub-blocks with global self-attention.
4. Token pooling and a linear classification head.
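This workflow can be sketched end-to-end with stand-in modules; the strided convolution standing in for the Swin stages, the identity standing in for the sub-block stack, and all names are placeholders, not the actual architecture:

```python
import torch
import torch.nn as nn

class FlattenTokens(nn.Module):
    """Stand-in for the sparse token converter: B x e x H x W -> B x (H*W) x e."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x.flatten(2).transpose(1, 2)

class SparseSwinSketch(nn.Module):
    """Pipeline skeleton: backbone -> token converter -> transformer -> head."""
    def __init__(self, backbone, converter, transformer, dim, num_classes):
        super().__init__()
        self.backbone = backbone
        self.converter = converter
        self.transformer = transformer
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.backbone(x)             # B x C x H3 x W3 feature map
        z = self.converter(x)            # B x t x e latent tokens
        z = self.transformer(z)          # L transformer sub-blocks
        return self.head(z.mean(dim=1))  # pool over tokens, then classify

model = SparseSwinSketch(
    backbone=nn.Conv2d(3, 512, kernel_size=32, stride=32),  # stand-in for Stages 1-3
    converter=FlattenTokens(),
    transformer=nn.Identity(),                              # stand-in for SparTa layers
    dim=512, num_classes=100,
)
logits = model(torch.randn(2, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 100])
```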
3. Computational Complexity

Self-attention cost scales quadratically with sequence length, so compressing the token set directly attacks the dominant term. With $t \ll N_0$ (e.g., $t = 49$ vs. $N_0 = 3136$ initial patch tokens), the quadratic term is sharply curtailed.
Relative reduction of the quadratic attention term: $(t/N_0)^2 = (49/3136)^2 = 1/4096 \approx 2.4 \times 10^{-4}$.
Despite this, expressive capacity is maintained as sparsification targets only the final transformer processing.
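A quick arithmetic check of these figures, using the token counts stated in the text:

```python
# Attention cost scales with the square of the sequence length, so compare
# the latent token count against the initial patch grid.
initial_tokens = (224 // 4) * (224 // 4)  # 56 x 56 = 3136 patch tokens
latent_tokens = 49                        # t latent tokens in SparTa

reduction = 1 - latent_tokens / initial_tokens
quadratic_ratio = (latent_tokens / initial_tokens) ** 2

print(f"token reduction: {reduction:.1%}")                        # 98.4%
print(f"quadratic term shrinks by ~{1 / quadratic_ratio:,.0f}x")  # ~4,096x
```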
4. Training Regimen and Regularization
The following experimental protocols are utilized:
Datasets: ImageNet100 (a 100-class subset of ImageNet1K, 224×224 resolution), CIFAR-10, and CIFAR-100 (resized to 224×224).
ImageNet100: Adam optimizer, batch size 128, 100 epochs, with the first two stages frozen and pretrained on ImageNet1K. The key SparTa hyperparameter is the latent token count, t = 49.
CIFAR-10/100: AdamW optimizer with weight decay, plus standard data augmentation (random crop, horizontal flip, resized crop).
The ImageNet100 variant additionally applies L2 regularization to the SparTa attention scores, giving the objective $\mathcal{L} = \mathcal{L}_{\mathrm{CE}} + \lambda \lVert A \rVert_2^2$, where $A$ gathers the attention scores in SparTa and $\mathcal{L}_{\mathrm{CE}}$ is the cross-entropy loss.
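A sketch of one such regularized objective; the penalty weight `lam`, the shape of `attn_scores`, and the function name are assumptions for illustration, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def regularized_loss(logits, targets, attn_scores, lam=1e-4):
    """Cross-entropy plus an L2 penalty on collected attention scores.

    `attn_scores` stands in for whatever attention tensors are gathered
    from the SparTa layers during the forward pass.
    """
    ce = F.cross_entropy(logits, targets)
    l2 = attn_scores.pow(2).sum()
    return ce + lam * l2

logits = torch.randn(8, 100)
targets = torch.randint(0, 100, (8,))
attn = torch.rand(8, 16, 49, 49)  # e.g. B x heads x t x t attention maps
loss = regularized_loss(logits, targets, attn)
print(loss.item() > 0)
```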
5. Benchmarking and Comparative Evaluation
Empirical results on standard image classification datasets highlight SparseSwin's parameter efficiency and accuracy improvements.
ImageNet100 (224×224):
| Model | Parameters (M) | Type | Accuracy (%) |
|---|---|---|---|
| Swin-T | 27.6 | Transformer | 85.22 |
| ViT-B | 85.9 | Transformer | 80.90 |
| DLME (ResNet-50) | – | ConvNet | 79.3 |
| SparseSwin (L2 reg.) | 17.58 | Transformer | 86.96 |
CIFAR-10:
| Model | Parameters (M) | Type | Accuracy (%) |
|---|---|---|---|
| DenseNet-BC-190 + Mixup | 25.6 | ConvNet | 97.3 |
| ResNet-XnIDR | 23.86 | ConvNet | 96.87 |
| NesT-B | 97.2 | Transformer | 97.2 |
| CRATE-S/B/L | 13.1 / 22.8 / 77.6 | Transformer | 96.0 / 96.8 / 97.2 |
| SparseSwin | 17.58 | Transformer | 97.43 |
CIFAR-100:
| Model | Parameters (M) | Type | Accuracy (%) |
|---|---|---|---|
| ResNeXt-50 | 25.03 | ConvNet | 84.42 |
| NesT-B | 97.2 | Transformer | 82.56 |
| CRATE-S/B/L | 13.12 / 22.8 / 77.6 | Transformer | 81.0 / 82.7 / 83.6 |
| SparseSwin | 17.58 | Transformer | 85.35 |
Accuracy per million parameters (ImageNet100): SparseSwin, 86.96 / 17.58 ≈ 4.95; Swin-T, 85.22 / 27.6 ≈ 3.09.
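These efficiency ratios follow directly from the ImageNet100 table above:

```python
# Accuracy points per million parameters, from the benchmark table.
models = {
    "SparseSwin": (86.96, 17.58),  # (accuracy %, parameters in millions)
    "Swin-T": (85.22, 27.6),
}
for name, (acc, params_m) in models.items():
    print(f"{name}: {acc / params_m:.2f} accuracy points per M params")
```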
A plausible implication is that the focus of heavy transformer computation on a small, optimized set of latent tokens preserves or even enhances information modeling, while yielding substantial gains in compute and parameter efficiency.
6. Significance within Vision Transformer Research
SparseSwin introduces a modular sparse token selection mechanism that can be integrated seamlessly with hierarchical transformer backbones. By removing the quadratic bottleneck associated with attention over large spatial grids, it demonstrates that locality in earlier stages combined with global reasoning over compressed tokens is highly effective. SparseSwin establishes new accuracy baselines in the low-parameter regime for transformer-based image recognizers, revealing a promising direction for the design of efficient, scalable vision transformers (Pinasthika et al., 2023).