The paper introduces the SparTa Block, which embeds a sparse token converter in the Swin hierarchy to enable global self-attention over compressed tokens.
It achieves a 98.4% token reduction by converting high-dimensional feature maps into a small latent token set, significantly lowering computational complexity.
Benchmark results show that SparseSwin outperforms traditional vision transformers in accuracy while reducing parameters by approximately 36.3% relative to Swin-T (17.58M vs. 27.6M).
SparseSwin is a vision transformer architecture that embeds a sparse token selection (the SparTa Block) within the Swin Transformer hierarchy to achieve parameter efficiency and strong classification performance. The design incorporates a sparse token converter that compresses feature maps into a small set of latent tokens, on which global self-attention and subsequent processing are performed. This architectural innovation delivers significant parameter reduction while improving accuracy across canonical image recognition benchmarks (Pinasthika et al., 2023).
1. SparTa (Sparse Transformer) Block Architecture
The SparTa Block is the centerpiece of SparseSwin. It consists of two primary components: the sparse token converter and a stack of L standard transformer sub-blocks.
1.1 Sparse Token Converter
Given the output feature map from Stage 3 of the Swin-T backbone, $X_3 \in \mathbb{R}^{B \times C \times H_3 \times W_3}$ with $H_3 = H/32$ and $W_3 = W/32$, SparseSwin applies a 3×3 convolution to mix channels, producing

$$\tilde{X}_3 = \mathrm{Conv}_{3\times 3}(X_3) \in \mathbb{R}^{B \times e \times H_3 \times W_3}.$$

Flattening the spatial dimensions and projecting along the token axis via a linear layer yields a token sequence:

$$S_T = \mathrm{Linear}_{H_3 W_3 \to t}\big(\mathrm{Flatten}(\tilde{X}_3)\big) \in \mathbb{R}^{B \times t \times e}.$$

Here, $N := H_3 W_3$ is the input token count and $N' := t$ the compressed token count. With $(H, W) = (224, 224)$, $H_3 = W_3 = 7 \Rightarrow N = 49$; the converter fixes the latent token count at $t = 49$, so global self-attention always operates on 49 tokens regardless of input resolution. Relative to the initial patch grid of $N_0 = (H/4)(W/4) = 3136$ tokens, this is a sparsity rate of $1 - t/N_0 \approx 0.984$, i.e., the 98.4% token reduction cited above.
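As a sketch, the sparse token converter can be written in PyTorch; the embedding width `e = 512`, the batch size, and the module names here are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class SparseTokenConverter(nn.Module):
    """Sketch of the SparTa sparse token converter.

    Compresses a B x C x H3 x W3 feature map into B x t x e latent tokens:
    a 3x3 convolution mixes channels (C -> e), then a linear layer maps the
    flattened H3*W3 token axis down to t tokens.
    """
    def __init__(self, in_channels: int, embed_dim: int,
                 num_spatial: int, num_tokens: int):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, embed_dim, kernel_size=3, padding=1)
        self.to_tokens = nn.Linear(num_spatial, num_tokens)  # acts on the token axis

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.conv(x)          # B x e x H3 x W3
        x = x.flatten(2)          # B x e x (H3*W3)
        x = self.to_tokens(x)     # B x e x t
        return x.transpose(1, 2)  # B x t x e

# Example with the 224x224 setting from the text: H3 = W3 = 7, t = 49.
conv = SparseTokenConverter(in_channels=768, embed_dim=512,
                            num_spatial=49, num_tokens=49)
tokens = conv(torch.randn(2, 768, 7, 7))
print(tokens.shape)  # torch.Size([2, 49, 512])
```

Applying the linear layer to the flattened spatial axis (rather than the channel axis) is what makes the token count, not the channel width, the compressed dimension.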
1.2 Transformer Sub-block Stack
The compressed sequence $S_T \in \mathbb{R}^{B \times t \times e}$ undergoes $L$ transformer layers, each comprising pre-norm multi-head self-attention (MSA) and MLP with residuals:

$$Z^{0} = S_T, \qquad \hat{Z}^{\ell} = Z^{\ell-1} + \mathrm{MSA}\big(\mathrm{LN}(Z^{\ell-1})\big), \qquad Z^{\ell} = \hat{Z}^{\ell} + \mathrm{MLP}\big(\mathrm{LN}(\hat{Z}^{\ell})\big), \quad \ell = 1, \dots, L.$$

Final normalization is applied to yield the output tokens $Z = \mathrm{LN}(Z^{L})$.
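The sub-block stack can be sketched with standard pre-norm layers; the depth, head count, and MLP ratio below are illustrative, not the paper's exact settings:

```python
import torch
import torch.nn as nn

class TransformerSubBlock(nn.Module):
    """One pre-norm transformer layer: residual MSA followed by residual MLP."""
    def __init__(self, dim: int, heads: int, mlp_ratio: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim), nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim),
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        h = self.norm1(z)
        z = z + self.attn(h, h, h, need_weights=False)[0]  # residual MSA
        z = z + self.mlp(self.norm2(z))                    # residual MLP
        return z

# Stack L layers over the compressed sequence S_T, then apply the final norm.
L_layers, dim = 2, 512
stack = nn.Sequential(*[TransformerSubBlock(dim, heads=8) for _ in range(L_layers)])
final_norm = nn.LayerNorm(dim)
out = final_norm(stack(torch.randn(2, 49, dim)))
print(out.shape)  # torch.Size([2, 49, 512])
```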
2. Integration within the Swin Transformer Hierarchy
SparseSwin retains the four-stage, progressively downsampling hierarchy of Swin-T through Stage 3. The full workflow is:

1. Patch partition and embedding of the input image into an (H/4)×(W/4) token grid.
2. Swin Transformer Stages 1–3: window and shifted-window attention blocks interleaved with patch-merging downsampling.
3. SparTa Block: the sparse token converter compresses the final feature map into t latent tokens, followed by L transformer sub-blocks with global self-attention.
4. Token pooling and a linear classification head.
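This workflow can be sketched end-to-end with stand-in modules; the strided convolution standing in for the Swin stages, the identity standing in for the sub-block stack, and all names are placeholders, not the actual architecture:

```python
import torch
import torch.nn as nn

class FlattenTokens(nn.Module):
    """Stand-in for the sparse token converter: B x e x H x W -> B x (H*W) x e."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x.flatten(2).transpose(1, 2)

class SparseSwinSketch(nn.Module):
    """Pipeline skeleton: backbone -> token converter -> transformer -> head."""
    def __init__(self, backbone, converter, transformer, dim, num_classes):
        super().__init__()
        self.backbone = backbone
        self.converter = converter
        self.transformer = transformer
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.backbone(x)             # B x C x H3 x W3 feature map
        z = self.converter(x)            # B x t x e latent tokens
        z = self.transformer(z)          # L transformer sub-blocks
        return self.head(z.mean(dim=1))  # pool over tokens, then classify

model = SparseSwinSketch(
    backbone=nn.Conv2d(3, 512, kernel_size=32, stride=32),  # stand-in for Stages 1-3
    converter=FlattenTokens(),
    transformer=nn.Identity(),                              # stand-in for SparTa layers
    dim=512, num_classes=100,
)
logits = model(torch.randn(2, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 100])
```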
3. Computational Complexity

Self-attention cost scales quadratically with sequence length, so compressing the token set directly attacks the dominant term. With $t \ll N_0$ (e.g., $t = 49$ vs. $N_0 = 3136$ initial patch tokens), the quadratic term is sharply curtailed.
Relative reduction of the quadratic attention term: $(t/N_0)^2 = (49/3136)^2 = 1/4096 \approx 2.4 \times 10^{-4}$.
Despite this, expressive capacity is maintained as sparsification targets only the final transformer processing.
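A quick arithmetic check of these figures, using the token counts stated in the text:

```python
# Attention cost scales with the square of the sequence length, so compare
# the latent token count against the initial patch grid.
initial_tokens = (224 // 4) * (224 // 4)  # 56 x 56 = 3136 patch tokens
latent_tokens = 49                        # t latent tokens in SparTa

reduction = 1 - latent_tokens / initial_tokens
quadratic_ratio = (latent_tokens / initial_tokens) ** 2

print(f"token reduction: {reduction:.1%}")                        # 98.4%
print(f"quadratic term shrinks by ~{1 / quadratic_ratio:,.0f}x")  # ~4,096x
```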
4. Training Regimen and Regularization
The following experimental protocols are utilized:
Datasets: ImageNet100 (a 100-class subset of ImageNet1K, 224×224 resolution), CIFAR-10, and CIFAR-100 (resized to 224×224).
ImageNet100: Adam optimizer, batch size 128, 100 epochs, with the first two stages frozen and pretrained on ImageNet1K. The key SparTa hyperparameter is the latent token count, t = 49.
CIFAR-10/100: AdamW optimizer with weight decay, plus standard data augmentation (random crop, horizontal flip, resized crop).
The ImageNet100 variant additionally applies L2 regularization to the SparTa attention scores, giving the objective $\mathcal{L} = \mathcal{L}_{\mathrm{CE}} + \lambda \lVert A \rVert_2^2$, where $A$ gathers the attention scores in SparTa and $\mathcal{L}_{\mathrm{CE}}$ is the cross-entropy loss.
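A sketch of one such regularized objective; the penalty weight `lam`, the shape of `attn_scores`, and the function name are assumptions for illustration, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def regularized_loss(logits, targets, attn_scores, lam=1e-4):
    """Cross-entropy plus an L2 penalty on collected attention scores.

    `attn_scores` stands in for whatever attention tensors are gathered
    from the SparTa layers during the forward pass.
    """
    ce = F.cross_entropy(logits, targets)
    l2 = attn_scores.pow(2).sum()
    return ce + lam * l2

logits = torch.randn(8, 100)
targets = torch.randint(0, 100, (8,))
attn = torch.rand(8, 16, 49, 49)  # e.g. B x heads x t x t attention maps
loss = regularized_loss(logits, targets, attn)
print(loss.item() > 0)
```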
5. Benchmarking and Comparative Evaluation
Empirical results on standard image classification datasets highlight SparseSwin's parameter efficiency and accuracy improvements.
ImageNet100 (224×224):
| Model | Parameters (M) | Type | Accuracy (%) |
|---|---|---|---|
| Swin-T | 27.6 | Transformer | 85.22 |
| ViT-B | 85.9 | Transformer | 80.90 |
| DLME (ResNet-50) | – | ConvNet | 79.3 |
| SparseSwin (L2 reg.) | 17.58 | Transformer | 86.96 |
CIFAR-10:
| Model | Parameters (M) | Type | Accuracy (%) |
|---|---|---|---|
| DenseNet-BC-190 + Mixup | 25.6 | ConvNet | 97.3 |
| ResNet-XnIDR | 23.86 | ConvNet | 96.87 |
| NesT-B | 97.2 | Transformer | 97.2 |
| CRATE-S/B/L | 13.1 / 22.8 / 77.6 | Transformer | 96.0 / 96.8 / 97.2 |
| SparseSwin | 17.58 | Transformer | 97.43 |
CIFAR-100:
| Model | Parameters (M) | Type | Accuracy (%) |
|---|---|---|---|
| ResNeXt-50 | 25.03 | ConvNet | 84.42 |
| NesT-B | 97.2 | Transformer | 82.56 |
| CRATE-S/B/L | 13.12 / 22.8 / 77.6 | Transformer | 81.0 / 82.7 / 83.6 |
| SparseSwin | 17.58 | Transformer | 85.35 |
Accuracy per million parameters (ImageNet100): SparseSwin, 86.96 / 17.58 ≈ 4.95; Swin-T, 85.22 / 27.6 ≈ 3.09.
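These efficiency ratios follow directly from the ImageNet100 table above:

```python
# Accuracy points per million parameters, from the benchmark table.
models = {
    "SparseSwin": (86.96, 17.58),  # (accuracy %, parameters in millions)
    "Swin-T": (85.22, 27.6),
}
for name, (acc, params_m) in models.items():
    print(f"{name}: {acc / params_m:.2f} accuracy points per M params")
```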
A plausible implication is that the focus of heavy transformer computation on a small, optimized set of latent tokens preserves or even enhances information modeling, while yielding substantial gains in compute and parameter efficiency.
6. Significance within Vision Transformer Research
SparseSwin introduces a modular sparse token selection mechanism that can be integrated seamlessly with hierarchical transformer backbones. By removing the quadratic bottleneck associated with attention over large spatial grids, it demonstrates that locality in earlier stages combined with global reasoning over compressed tokens is highly effective. SparseSwin establishes new accuracy baselines in the low-parameter regime for transformer-based image recognizers, revealing a promising direction for the design of efficient, scalable vision transformers (Pinasthika et al., 2023).