
SparseSwin: Efficient Vision Transformer

Updated 22 February 2026
  • The paper introduces the SparTa Block, which embeds a sparse token converter in the Swin hierarchy to enable global self-attention over compressed tokens.
  • It achieves a 98.4% token reduction by converting high-dimensional feature maps into a small latent token set, significantly lowering computational complexity.
  • Benchmark results show that SparseSwin outperforms traditional vision transformers in accuracy while reducing parameters by approximately 36.3%.

SparseSwin is a vision transformer architecture that embeds a sparse token selection (the SparTa Block) within the Swin Transformer hierarchy to achieve parameter efficiency and strong classification performance. The design incorporates a sparse token converter that compresses feature maps into a small set of latent tokens, on which global self-attention and subsequent processing are performed. This architectural innovation delivers significant parameter reduction while improving accuracy across canonical image recognition benchmarks (Pinasthika et al., 2023).

1. SparTa (Sparse Transformer) Block Architecture

The SparTa Block is the centerpiece of SparseSwin. It consists of two primary components: the sparse token converter and a stack of $L$ standard transformer sub-blocks.

1.1 Sparse Token Converter

Given the output feature map from Stage 3 of the Swin-T backbone, with

$$\mathbf{X}_3 \in \mathbb{R}^{B\times C \times H_3 \times W_3}, \quad H_3 = \frac{H}{16},\; W_3 = \frac{W}{16},$$

SparseSwin applies a $3\times 3$ convolution to mix channels, producing

$$\widetilde{\mathbf{X}}_3 = \mathrm{Conv}_{3\times 3}(\mathbf{X}_3) \in \mathbb{R}^{B\times e \times H_3 \times W_3}.$$

Flattening the spatial dimensions and projecting them via a linear layer yields the token sequence

$$\mathbf{S}_T = \mathrm{Linear}_{\,H_3W_3 \to t}\left(\mathrm{Flatten}(\widetilde{\mathbf{X}}_3)\right) \in \mathbb{R}^{B\times t \times e}.$$

Here, $N := H_3 W_3$ is the token count entering the SparTa Block, $N' := t$ is the size of the compressed token set, and the sparsity rate relative to the initial patch-token count $N_0 = \frac{H}{4}\cdot\frac{W}{4}$ is $r = 1 - \frac{N'}{N_0}$. With $(H, W) = (224, 224)$, $H_3 = W_3 = 14 \Rightarrow N = 196$ and $t = 49$, giving $r = 1 - \frac{49}{3136} \approx 0.9844$, i.e., a 98.4% token reduction from the initial count ($3136 \to 49$).
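A minimal PyTorch sketch of this converter follows; the module name, the `nn.Conv2d`/`nn.Linear` composition, and the 384 input channels are illustrative assumptions, while the 14×14 spatial size corresponds to Swin-T's Stage-3 output at 224×224 input.

```python
import torch
import torch.nn as nn

class SparseTokenConverter(nn.Module):
    """Compress a spatial feature map into t latent tokens (sketch)."""
    def __init__(self, in_channels: int, embed_dim: int, num_tokens: int, spatial: int):
        super().__init__()
        # 3x3 convolution mixes channels: C -> e
        self.conv = nn.Conv2d(in_channels, embed_dim, kernel_size=3, padding=1)
        # Linear projection over the flattened spatial axis: H3*W3 -> t
        self.project = nn.Linear(spatial * spatial, num_tokens)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.conv(x)          # (B, e, H3, W3)
        x = x.flatten(2)          # (B, e, H3*W3)
        x = self.project(x)       # (B, e, t)
        return x.transpose(1, 2)  # (B, t, e)

converter = SparseTokenConverter(in_channels=384, embed_dim=512, num_tokens=49, spatial=14)
tokens = converter(torch.randn(2, 384, 14, 14))
print(tokens.shape)  # torch.Size([2, 49, 512])
```

Note that the compression acts on the flattened spatial axis, so the number of output tokens is decoupled from the input resolution only through the `spatial` argument.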

1.2 Transformer Sub-block Stack

The compressed sequence $\mathbf{S}_T$ undergoes $L = 2$ transformer layers, each comprising pre-norm multi-head self-attention (MSA) and an MLP with residual connections:

$$\begin{aligned} \hat{\mathbf{Z}}^{(\ell)} &= \mathbf{Z}^{(\ell-1)} + \mathrm{MSA}\bigl(\mathrm{LN}(\mathbf{Z}^{(\ell-1)})\bigr) \\ \mathbf{Z}^{(\ell)} &= \hat{\mathbf{Z}}^{(\ell)} + \mathrm{MLP}\bigl(\mathrm{LN}(\hat{\mathbf{Z}}^{(\ell)})\bigr) \end{aligned}$$

with $\mathbf{Z}^{(0)} = \mathbf{S}_T$. A final normalization yields $\mathbf{Z} = \mathrm{LN}(\mathbf{Z}^{(L)}) \in \mathbb{R}^{B \times t \times e}$.
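The update equations above can be sketched in PyTorch as a pre-norm block; the `PreNormBlock` class and the MLP expansion ratio of 4 are assumptions, while the dimensions ($e=512$, $h=16$, $t=49$) and the depth $L=2$ follow the hyperparameters reported in Section 4.

```python
import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    """One pre-norm transformer sub-block: residual MSA followed by residual MLP."""
    def __init__(self, dim: int, heads: int, mlp_ratio: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim),
            nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # residual MSA
        x = x + self.mlp(self.norm2(x))                    # residual MLP
        return x

# Stack L = 2 sub-blocks and a final LayerNorm, as in the SparTa Block
blocks = nn.Sequential(PreNormBlock(512, 16), PreNormBlock(512, 16), nn.LayerNorm(512))
z = blocks(torch.randn(2, 49, 512))
print(z.shape)  # torch.Size([2, 49, 512])
```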

2. Integration within the Swin Transformer Hierarchy

SparseSwin retains the four-stage, progressively downsampling hierarchy of Swin-T through Stage 3. The full workflow is:

  1. Stage 1: Patch partition and linear embedding, Swin block $\times 2$, output resolution $\frac{H}{4} \times \frac{W}{4}$.
  2. Stage 2: Patch merging, Swin block $\times 2$, output $\frac{H}{8} \times \frac{W}{8}$.
  3. Stage 3: Patch merging, Swin block $\times 6$, output $\frac{H}{16} \times \frac{W}{16}$.
  4. Stage 4: Replaced by the SparTa Block (sparse token conversion and transformer stack).

Pseudocode outline of the forward path:

Input: x ∈ ℝ^{B×3×H×W}
f₁ = Stage1(x)
f₂ = Stage2(f₁)
f₃ = Stage3(f₂)                # ℝ^{B×C×14×14}
S_T = SparseTokenConverter(f₃) # ℝ^{B×t×e}
Z = SparTaBlock(S_T)           # ℝ^{B×t×e}
y = Classifier(Pool(Z))

This approach leverages Swin's early spatial reduction to deliver a token set amenable to global attention, but at dramatically reduced token count.

3. Computational Complexity and Parameter Efficiency

SparseSwin achieves lower complexity by shifting quadratic self-attention to act on a reduced number of tokens.

For a transformer block, the multi-head self-attention complexity is

$$4Nd^2 + 2N^2 d/h,$$

where $N$ is the sequence length, $d$ the embedding dimension, and $h$ the number of heads.

  • In Swin-T, windowed attention partitions the feature map into $M \times M$ windows, each processed independently.
  • SparseSwin applies global self-attention to the $t$ tokens after conversion, yielding

$$4t d^2 + 2 t^2 d/h + O(N C k^2) + O(N t e),$$

where $k$ is the convolution kernel size.

With $t \ll N$ (e.g., $t = 49$ vs. $3136$ initial patch tokens), the dominant quadratic term is sharply curtailed.
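Plugging the reported numbers into the attention cost formula above makes the saving concrete; the `msa_flops` helper is a hypothetical utility, and $d = 512$, $h = 16$ follow the SparTa hyperparameters.

```python
def msa_flops(n: int, d: int, h: int) -> int:
    """Multi-head self-attention cost 4*n*d^2 + 2*n^2*d/h from the formula above."""
    return 4 * n * d * d + 2 * n * n * d // h

# Global attention over all 3136 initial tokens vs. the t = 49 compressed tokens
full = msa_flops(3136, 512, 16)
sparse = msa_flops(49, 512, 16)
print(f"reduction factor: {full / sparse:.1f}x")  # reduction factor: 76.0x
```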

Parameter count comparison:

  • Swin-T: $\approx 27.6$ M
  • SparseSwin: $\approx 17.58$ M
  • Relative reduction: $\frac{27.6 - 17.58}{27.6} \times 100\% \approx 36.3\%$

Despite this reduction, expressive capacity is maintained because sparsification targets only the final transformer stage.

4. Training Regimen and Regularization

The following experimental protocols are utilized:

  • Datasets: ImageNet100 (a 100-class subset of ImageNet1K, $224^2$ resolution), CIFAR-10, and CIFAR-100 (resized to $224^2$).
  • ImageNet100: Adam optimizer, learning rate $1 \times 10^{-4}$, batch size 128, 100 epochs, with the first two stages frozen and pretrained on ImageNet1K. Key SparTa hyperparameters: $t = 49,\ e = 512,\ h = 16,\ L = 2$.
  • CIFAR-10/100: AdamW, learning rate $1 \times 10^{-5}$, weight decay $0.01$, with standard data augmentation (random crop, flip, resize-crop).
  • Attention Regularization: The loss is

$$\mathcal{L} = \mathcal{L}_\mathrm{CE}(\hat y, y) + \lambda \|\mathbf{A}\|_p,\quad p \in \{1, 2\},\; \lambda \in \{10^{-4},\, 10^{-5}\},$$

where $\mathbf{A}$ gathers the attention scores in the SparTa Block and $\mathcal{L}_\mathrm{CE}$ is the cross-entropy loss.
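A PyTorch sketch of this regularized objective; the `regularized_loss` helper and the `(batch, heads, t, t)` shape assumed for the collected attention maps are illustrative, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def regularized_loss(logits, targets, attn_scores, p=2, lam=1e-4):
    """Cross-entropy plus a lambda-weighted Lp penalty on SparTa attention scores."""
    ce = F.cross_entropy(logits, targets)
    reg = attn_scores.norm(p=p)  # ||A||_p over all collected attention scores
    return ce + lam * reg

logits = torch.randn(8, 100)              # batch of 8, 100 classes (ImageNet100)
targets = torch.randint(0, 100, (8,))
attn = torch.rand(8, 16, 49, 49)          # assumed (batch, heads, t, t) attention maps
loss = regularized_loss(logits, targets, attn)
print(loss.shape)  # torch.Size([]) — a scalar
```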

5. Benchmarking and Comparative Evaluation

Empirical results on standard image classification datasets highlight SparseSwin's parameter efficiency and accuracy improvements.

ImageNet100 (224×224):

| Model | Parameters (M) | Type | Accuracy (%) |
|---|---|---|---|
| Swin-T | 27.6 | Transformer | 85.22 |
| ViT-B | 85.9 | Transformer | 80.90 |
| DLME (ResNet-50) | – | ConvNet | 79.3 |
| SparseSwin (L2 reg.) | 17.58 | Transformer | 86.96 |

CIFAR-10:

| Model | Parameters (M) | Type | Accuracy (%) |
|---|---|---|---|
| DenseNet-BC-190 + Mixup | 25.6 | ConvNet | 97.3 |
| ResNet-XnIDR | 23.86 | ConvNet | 96.87 |
| NesT-B | 97.2 | Transformer | 97.2 |
| CRATE-S/B/L | 13.1/22.8/77.6 | Transformer | 96.0/96.8/97.2 |
| SparseSwin | 17.58 | Transformer | 97.43 |

CIFAR-100:

| Model | Parameters (M) | Type | Accuracy (%) |
|---|---|---|---|
| ResNeXt-50 | 25.03 | ConvNet | 84.42 |
| NesT-B | 97.2 | Transformer | 82.56 |
| CRATE-S/B/L | 13.12/22.8/77.6 | Transformer | 81.0/82.7/83.6 |
| SparseSwin | 17.58 | Transformer | 85.35 |

Accuracy per million parameters (ImageNet100): SparseSwin, $86.96/17.58 \approx 4.95$; Swin-T, $85.22/27.6 \approx 3.09$.
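The efficiency figures above can be reproduced directly from the ImageNet100 table:

```python
results = {  # (accuracy %, parameters in millions) from the ImageNet100 table above
    "SparseSwin": (86.96, 17.58),
    "Swin-T": (85.22, 27.6),
}
for name, (acc, params) in results.items():
    # SparseSwin -> 4.95, Swin-T -> 3.09 accuracy points per million parameters
    print(f"{name}: {acc / params:.2f}")
```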

A plausible implication is that the focus of heavy transformer computation on a small, optimized set of latent tokens preserves or even enhances information modeling, while yielding substantial gains in compute and parameter efficiency.

6. Significance within Vision Transformer Research

SparseSwin introduces a modular sparse token selection mechanism that can be integrated seamlessly with hierarchical transformer backbones. By removing the quadratic bottleneck associated with attention over large spatial grids, it demonstrates that locality in earlier stages combined with global reasoning over compressed tokens is highly effective. SparseSwin establishes new accuracy baselines in the low-parameter regime for transformer-based image recognizers, revealing a promising direction for the design of efficient, scalable vision transformers (Pinasthika et al., 2023).

References (1): Pinasthika et al., 2023.
