
SparseSwin: Efficient Vision Transformer

Updated 22 February 2026
  • The paper introduces the SparTa Block, which embeds a sparse token converter in the Swin hierarchy to enable global self-attention over compressed tokens.
  • It achieves a 98.4% token reduction by converting high-dimensional feature maps into a small latent token set, significantly lowering computational complexity.
  • Benchmark results show that SparseSwin outperforms traditional vision transformers in accuracy while reducing parameters by approximately 36.3%.

SparseSwin is a vision transformer architecture that embeds a sparse token selection mechanism (the SparTa Block) within the Swin Transformer hierarchy to achieve parameter efficiency and strong classification performance. The design incorporates a sparse token converter that compresses feature maps into a small set of latent tokens, on which global self-attention and subsequent processing are performed. This architectural innovation delivers significant parameter reduction while improving accuracy across canonical image recognition benchmarks (Pinasthika et al., 2023).

1. SparTa (Sparse Transformer) Block Architecture

The SparTa Block is the centerpiece of SparseSwin. It consists of two primary components: a sparse token converter and a stack of $L$ standard transformer sub-blocks.

1.1 Sparse Token Converter

Given the output feature map from Stage 3 of the Swin-T backbone,

$$\mathbf{X}_3 \in \mathbb{R}^{B\times C \times H_3 \times W_3}, \qquad H_3 = \frac{H}{16},\; W_3 = \frac{W}{16},$$

SparseSwin applies a $3\times 3$ convolution to mix channels, producing

$$\widetilde{\mathbf{X}}_3 = \mathrm{Conv}_{3\times3}(\mathbf{X}_3) \in \mathbb{R}^{B\times e \times H_3 \times W_3}.$$

Flattening the spatial dimensions and projecting them with a linear layer yields the token sequence

$$\mathbf{S}_T = \mathrm{Linear}_{\,H_3W_3 \to t}\big(\mathrm{Flatten}(\widetilde{\mathbf{X}}_3)\big) \in \mathbb{R}^{B\times t \times e}.$$

Here $N := H_3W_3$ is the input token count, $N' := t$ is the compressed token count, and the sparsity rate is $r = 1 - \frac{N'}{N}$. With $(H, W) = (224, 224)$: $H_3 = W_3 = 14 \Rightarrow N = 196$ and $t = 49$, giving $r = 75\%$ at the converter and, relative to the initial patch count of $56 \times 56 = 3136$, a 98.4% token reduction.
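At shape level, the converter can be sketched as follows. This is a minimal NumPy sketch with random weights, assuming a stage-3 output of 384 channels at $14\times14$ and the values $e = 512$, $t = 49$; the $3\times 3$ convolution is approximated by a per-position channel mixing for brevity:

```python
import numpy as np

B, C, H3, W3 = 2, 384, 14, 14   # assumed stage-3 output shape (Swin-T, 224x224 input)
e, t = 512, 49                  # assumed embedding dim and compressed token count

X3 = np.random.randn(B, C, H3, W3)

# Channel mixing (stand-in for the 3x3 convolution): C -> e at each spatial position.
W_mix = np.random.randn(e, C) / np.sqrt(C)
X3_tilde = np.einsum('ec,bchw->behw', W_mix, X3)           # (B, e, H3, W3)

# Flatten spatial dims, then a learned linear map reduces N = H3*W3 tokens to t.
N = H3 * W3                                                # 196 input tokens
flat = X3_tilde.reshape(B, e, N)                           # (B, e, N)
W_tok = np.random.randn(N, t) / np.sqrt(N)
S_T = (flat @ W_tok).transpose(0, 2, 1)                    # (B, t, e)

print(S_T.shape)          # (2, 49, 512)
print(1 - t / (56 * 56))  # 0.984375 -> the 98.4% figure vs. the initial patch count
```

Note that the token reduction is a single learned linear map over all spatial positions, which is what lets every compressed token draw on the whole feature map.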

1.2 Transformer Sub-block Stack

The compressed sequence $\mathbf{S}_T \in \mathbb{R}^{B\times t\times e}$ passes through $L$ transformer layers, each comprising pre-norm multi-head self-attention (MSA) and an MLP, both with residual connections:

$$\mathbf{Z}'_{\ell} = \mathbf{Z}_{\ell-1} + \mathrm{MSA}\big(\mathrm{LN}(\mathbf{Z}_{\ell-1})\big), \qquad \mathbf{Z}_{\ell} = \mathbf{Z}'_{\ell} + \mathrm{MLP}\big(\mathrm{LN}(\mathbf{Z}'_{\ell})\big),$$

with $\mathbf{Z}_0 = \mathbf{S}_T$. A final layer normalization yields the output representation.
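A minimal single-head NumPy sketch of one such pre-norm layer follows; the actual SparTa sub-blocks use multi-head attention and GELU, and the dimensions here are illustrative only:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    x = x - x.max(-1, keepdims=True)
    ex = np.exp(x)
    return ex / ex.sum(-1, keepdims=True)

def prenorm_block(Z, Wq, Wk, Wv, Wo, W1, W2):
    # Pre-norm single-head self-attention with residual connection.
    X = layer_norm(Z)
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(Q.shape[-1]))
    Z = Z + (A @ V) @ Wo
    # Pre-norm MLP with residual connection (ReLU stands in for GELU).
    H = np.maximum(layer_norm(Z) @ W1, 0.0)
    return Z + H @ W2

B, t, e = 2, 49, 64
rng = np.random.default_rng(0)
Z = rng.standard_normal((B, t, e))
Wq, Wk, Wv, Wo = (rng.standard_normal((e, e)) / np.sqrt(e) for _ in range(4))
W1 = rng.standard_normal((e, 4 * e)) / np.sqrt(e)
W2 = rng.standard_normal((4 * e, e)) / np.sqrt(4 * e)
out = prenorm_block(Z, Wq, Wk, Wv, Wo, W1, W2)
print(out.shape)  # (2, 49, 64)
```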

2. Integration within the Swin Transformer Hierarchy

SparseSwin retains the four-stage, progressively downsampling hierarchy of Swin-T through Stage 3. The full workflow is:

  1. Stage 1: Patch partition and linear embedding, Swin blocks $\times 2$, output shape $B \times 96 \times 56 \times 56$.
  2. Stage 2: Patch merging, Swin blocks $\times 2$, output $B \times 192 \times 28 \times 28$.
  3. Stage 3: Patch merging, Swin blocks $\times 6$, output $B \times 384 \times 14 \times 14$.
  4. Stage 4: Replaced by SparTa Block (sparse token conversion and transformer stack).

In the forward path, an input image flows through Stages 1–3 of the Swin-T backbone, then through the SparTa Block; the resulting $t$ tokens are pooled and passed to a linear classification head.
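The staged pipeline can be sketched at shape level as follows. This is an illustrative NumPy stand-in, with average pooling plus random channel maps in place of real Swin blocks, not the actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def stage(x, c_out, down=2):
    """Stand-in for a Swin stage: spatial downsampling + random channel map."""
    B, C, H, W = x.shape
    x = x.reshape(B, C, H // down, down, W // down, down).mean(axis=(3, 5))
    W_ch = rng.standard_normal((c_out, C)) / np.sqrt(C)
    return np.einsum('oc,bchw->bohw', W_ch, x)

def sparta(x, t=49, e=512):
    """Stand-in for the SparTa Block: channel mix + token reduction to t."""
    B, C, H, W = x.shape
    W_ch = rng.standard_normal((e, C)) / np.sqrt(C)
    x = np.einsum('ec,bchw->behw', W_ch, x).reshape(B, e, H * W)
    W_tok = rng.standard_normal((H * W, t)) / np.sqrt(H * W)
    return (x @ W_tok).transpose(0, 2, 1)              # (B, t, e)

img = rng.standard_normal((2, 3, 224, 224))
x = stage(img, 96, down=4)      # Stage 1: (2, 96, 56, 56)
x = stage(x, 192)               # Stage 2: (2, 192, 28, 28)
x = stage(x, 384)               # Stage 3: (2, 384, 14, 14)
tok = sparta(x)                 # SparTa:  (2, 49, 512)
logits = tok.mean(axis=1) @ (rng.standard_normal((512, 100)) / np.sqrt(512))
print(logits.shape)             # (2, 100) -- ImageNet100 head
```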

This approach leverages Swin's early spatial reduction to deliver a token set amenable to global attention at a dramatically reduced token count.

3. Computational Complexity and Parameter Efficiency

SparseSwin achieves lower complexity by shifting quadratic self-attention to act on a reduced number of tokens.

For a transformer block over a sequence of length $N$ with embedding dimension $d$, multi-head self-attention costs

$$\Omega(\mathrm{MSA}) = 4Nd^2 + 2N^2d,$$

where the first term covers the Q/K/V and output projections and the second the attention map itself; the head count $h$ partitions $d$ across heads without changing the total.

  • In Swin-T, windowed attention partitions the feature map into non-overlapping $M \times M$ windows, each processed independently, reducing the attention term to $2M^2Nd$.
  • SparseSwin applies global self-attention to $t = 49$ tokens after conversion, yielding approximately

$$\Omega(\mathrm{SparTa}) \approx k^2\,C\,e\,H_3W_3 \;+\; H_3W_3\,t\,e \;+\; 4te^2 + 2t^2e$$

for the converter convolution, the token-reducing linear layer, and the global attention, respectively ($k$: convolution kernel size).

With $t \ll N$ (49 tokens vs. 196 at the SparTa input), the dominant quadratic term is sharply curtailed.
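To make the scaling concrete, here is a back-of-envelope comparison in plain Python, using the standard MSA cost $4Nd^2 + 2N^2d$ (projections plus attention) with assumed dimensions ($N = 196$, $d = 384$ at stage 3; $t = 49$, $e = 512$ in SparTa), so the absolute numbers are only indicative:

```python
def msa_flops(N, d):
    # Approximate MSA cost: projections (4*N*d^2) + attention map (2*N^2*d).
    return 4 * N * d**2 + 2 * N**2 * d

# Hypothetical global attention over the full stage-3 grid vs. SparTa tokens.
full = msa_flops(196, 384)      # 14x14 = 196 tokens at stage 3
spar = msa_flops(49, 512)       # 49 compressed tokens in SparTa

print(f"stage-3 global MSA  : {full / 1e6:.1f} MFLOPs")
print(f"SparTa MSA          : {spar / 1e6:.1f} MFLOPs")
print(f"quadratic-term ratio: {2 * 196**2 * 384 / (2 * 49**2 * 512):.1f}x")
```

The quadratic attention term shrinks by roughly an order of magnitude even though SparTa uses a wider embedding.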

Parameter count comparison:

  • Swin-T: 27.6 M
  • SparseSwin: 17.58 M
  • Relative reduction: $1 - \frac{17.58}{27.6} \approx 36.3\%$

Despite this, expressive capacity is maintained, as sparsification targets only the final transformer stage.

4. Training Regimen and Regularization

The following experimental protocols are utilized:

  • Datasets: ImageNet100 (a 100-class subset of ImageNet1K, $224\times224$ resolution), CIFAR-10, and CIFAR-100 (resized to $224\times224$).
  • ImageNet100: Adam optimizer, batch size 128, 100 epochs, with the first two stages frozen at ImageNet1K-pretrained weights. The key SparTa hyperparameter is the compressed token count $t = 49$.
  • CIFAR-10/100: AdamW optimizer with weight decay and standard data augmentation (random crop, horizontal flip, resized crop).
  • Attention Regularization: The loss is

$$\mathcal{L} \;=\; \mathcal{L}_{\mathrm{CE}} \;+\; \lambda\,\lVert \mathbf{A} \rVert_2$$

where $\mathbf{A}$ gathers the attention scores in the SparTa Block, $\lambda$ weights the penalty, and $\mathcal{L}_{\mathrm{CE}}$ is the cross-entropy loss.
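A NumPy sketch of the regularized objective follows; the penalty weight $\lambda$ and the exact reduction over the gathered attention scores are assumptions here, as the paper may aggregate over heads and layers differently:

```python
import numpy as np

def softmax(x):
    x = x - x.max(-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(-1, keepdims=True)

def cross_entropy(logits, labels):
    p = softmax(logits)
    return -np.log(p[np.arange(len(labels)), labels]).mean()

def regularized_loss(logits, labels, attn_scores, lam=1e-4):
    # L2 penalty on gathered SparTa attention scores (assumed: flat L2 norm).
    l2 = np.sqrt((attn_scores ** 2).sum())
    return cross_entropy(logits, labels) + lam * l2

rng = np.random.default_rng(0)
logits = rng.standard_normal((8, 100))
labels = rng.integers(0, 100, size=8)
A = softmax(rng.standard_normal((8, 49, 49)))   # toy attention maps (B, t, t)
print(regularized_loss(logits, labels, A))
```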

5. Benchmarking and Comparative Evaluation

Empirical results on standard image classification datasets highlight SparseSwin's parameter efficiency and accuracy improvements.

ImageNet100 (224×224):

| Model | Parameters (M) | Type | Accuracy (%) |
| --- | --- | --- | --- |
| Swin-T | 27.6 | Transformer | 85.22 |
| ViT-B | 85.9 | Transformer | 80.90 |
| DLME (ResNet-50) | – | ConvNet | 79.3 |
| SparseSwin (L2 reg.) | 17.58 | Transformer | 86.96 |

CIFAR-10:

| Model | Parameters (M) | Type | Accuracy (%) |
| --- | --- | --- | --- |
| DenseNet-BC-190 + Mixup | 25.6 | ConvNet | 97.3 |
| ResNet-XnIDR | 23.86 | ConvNet | 96.87 |
| NesT-B | 97.2 | Transformer | 97.2 |
| CRATE-S/B/L | 13.1 / 22.8 / 77.6 | Transformer | 96.0 / 96.8 / 97.2 |
| SparseSwin | 17.58 | Transformer | 97.43 |

CIFAR-100:

| Model | Parameters (M) | Type | Accuracy (%) |
| --- | --- | --- | --- |
| ResNeXt-50 | 25.03 | ConvNet | 84.42 |
| NesT-B | 97.2 | Transformer | 82.56 |
| CRATE-S/B/L | 13.12 / 22.8 / 77.6 | Transformer | 81.0 / 82.7 / 83.6 |
| SparseSwin | 17.58 | Transformer | 85.35 |

Accuracy per million parameters (ImageNet100): SparseSwin, $86.96 / 17.58 \approx 4.95$; Swin-T, $85.22 / 27.6 \approx 3.09$.
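These efficiency figures follow directly from the ImageNet100 table:

```python
# Accuracy per million parameters, computed from the ImageNet100 results.
models = {
    "SparseSwin": (86.96, 17.58),
    "Swin-T": (85.22, 27.6),
    "ViT-B": (80.90, 85.9),
}
for name, (acc, params) in models.items():
    print(f"{name:10s}: {acc / params:.2f} accuracy points per M params")
```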

A plausible implication is that concentrating heavy transformer computation on a small, optimized set of latent tokens preserves, or even enhances, information modeling while yielding substantial gains in compute and parameter efficiency.

6. Significance within Vision Transformer Research

SparseSwin introduces a modular sparse token selection mechanism that can be integrated seamlessly with hierarchical transformer backbones. By removing the quadratic bottleneck associated with attention over large spatial grids, it demonstrates that locality in earlier stages combined with global reasoning over compressed tokens is highly effective. SparseSwin establishes new accuracy baselines in the low-parameter regime for transformer-based image recognizers, revealing a promising direction for the design of efficient, scalable vision transformers (Pinasthika et al., 2023).

References

Pinasthika, K., et al. (2023). SparseSwin: Swin Transformer with Sparse Transformer Block.
