
SimViT: Efficient Local-Global Vision Transformer

Updated 2 April 2026
  • SimViT is a vision transformer that uses multi-head central self-attention within sliding windows to preserve native 2D local structure.
  • It employs a four-stage hierarchical feature extraction pipeline without explicit positional encodings, ensuring translation invariance.
  • SimViT achieves competitive accuracy on classification, detection, and segmentation tasks while reducing model complexity and parameters.

SimViT is a vision transformer architecture designed to address the limitations of standard Vision Transformers (ViTs) in capturing 2D spatial and local correlations among image patches. Unlike ViT, PVT, or Swin, which typically flatten an $H \times W$ image into $HW$ tokens and rely on global or non-overlapping windowed self-attention (often augmented with learned positional embeddings), SimViT introduces architectural changes that enable translation-invariant, parameter-efficient, and locally biased visual representations. Central to SimViT is the Multi-head Central Self-Attention (MCSA) mechanism, combined with a sliding-window approach to preserve spatial context and integrated within a four-stage hierarchical feature-extraction pipeline. Across multiple visual benchmarks, including ImageNet-1K classification, COCO object detection, and ADE20K semantic segmentation, SimViT demonstrates competitive or superior accuracy with markedly reduced model complexity relative to comparable transformer-based models (Li et al., 2021).

1. Motivation and Key Architectural Innovations

SimViT is motivated by two central deficiencies in prevailing ViT approaches: (1) the destruction of native 2D local structure through flattening, which obscures meaningful neighborhood relationships; and (2) the computational cost and lack of inductive bias in global Multi-head Self-Attention (MSA), which treats all pairwise token interactions equally, leading to quadratic complexity in token count and suboptimal local modeling. Traditional models also rely on explicit positional encodings, compromising translation invariance. SimViT addresses these concerns as follows:

  • Employs “central” self-attention within localized sliding windows, updating only the central token per window.
  • Eliminates explicit positional encodings by construction, ensuring translation invariance.
  • Incorporates both local (via MCSA in early stages) and global (via MSA in the final stage) attention within a hierarchical transformer backbone.
  • Structures the model to facilitate dense prediction via multi-scale feature pyramids.

2. Multi-head Central Self-Attention (MCSA)

The MCSA is a novel form of local self-attention. In standard MSA, for $X \in \mathbb{R}^{HW \times d}$, all tokens interact globally:

$$\mathrm{SA}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^T}{\sqrt{d}} + B\right)V, \qquad Q = XW^Q,\; K = XW^K,\; V = XW^V,$$

with $B$ a learnable positional bias.

In SimViT, Central Self-Attention (CSA) computes attention separately for each spatial location $(u, v)$ in the feature map $x \in \mathbb{R}^{H_i \times W_i \times d}$, restricting it to the $k \times k$ local neighborhood

$$\Omega_{u,v} = \left\{(m, n) \,\middle|\, |m - u| \le \tfrac{k-1}{2},\; |n - v| \le \tfrac{k-1}{2}\right\}.$$

Only the central token's representation is updated: the query is taken from the central token, while keys and values come from the whole window,

$$q_{u,v} = x_{u,v} W^Q, \qquad K_{u,v} = X_{\Omega_{u,v}} W^K, \qquad V_{u,v} = X_{\Omega_{u,v}} W^V,$$

$$\mathrm{CSA}(x)_{u,v} = \mathrm{softmax}\!\left(\frac{q_{u,v} K_{u,v}^T}{\sqrt{d}}\right) V_{u,v}.$$

The outputs from $h$ heads are concatenated and projected:

$$\mathrm{MCSA}(x) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^O.$$

The computational complexity becomes $O(HW \cdot k^2 d)$ instead of $O((HW)^2 d)$, significantly improving efficiency at fixed resolution.
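The central-attention update above can be sketched in a few lines of NumPy. This is a minimal single-head illustration under simplifying assumptions (zero padding, square weight matrices, no output projection), not the authors' implementation:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def central_self_attention(x, Wq, Wk, Wv, k=3):
    """Single-head CSA sketch: each location (u, v) attends only to its
    k x k neighborhood, and only the central token's output is computed."""
    H, W, d = x.shape
    p = (k - 1) // 2
    xp = np.pad(x, ((p, p), (p, p), (0, 0)))           # zero-pad the borders
    out = np.empty_like(x)
    for u in range(H):
        for v in range(W):
            win = xp[u:u + k, v:v + k].reshape(k * k, d)  # neighborhood Omega_{u,v}
            q = x[u, v] @ Wq                              # query from central token
            K, V = win @ Wk, win @ Wv                     # keys/values from window
            a = softmax(q @ K.T / np.sqrt(d))             # (k*k,) attention weights
            out[u, v] = a @ V
    return out

rng = np.random.default_rng(0)
d = 8
x = rng.standard_normal((6, 6, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
y = central_self_attention(x, Wq, Wk, Wv)
print(y.shape)  # (6, 6, 8)
```

Because the weights are shared across all window positions and no positional term is added, shifting the input simply shifts the output, which is the translation-invariance property the architecture relies on.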

3. Sliding-Window Attention Mechanism

SimViT eschews non-overlapping window partitioning in favor of overlapping sliding windows of size $k \times k$ with stride $s$ and padding $p$, giving an output resolution of $\lfloor (H + 2p - k)/s \rfloor + 1$ per spatial dimension. For all experiments, $k = 3$, $s = 1$, $p = 1$ are used, ensuring each spatial location acts as the center of one window and as a neighbor in eight surrounding windows. This design enables each token to contribute locally multiple times per stage, fostering smooth inter-window feature propagation and enhancing local feature coherence.
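The "one center plus eight neighbors" property can be verified by counting, for each location, how many sliding windows contain it. A small stdlib-only check (grid size 6 is an arbitrary choice for illustration):

```python
# Count window membership for k = 3, s = 1, p = 1: every location is the
# center of one window and a neighbor in up to eight surrounding windows.
H = W = 6
cover = [[0] * W for _ in range(H)]
for u in range(H):                      # (u, v) enumerates window centers
    for v in range(W):
        for m in range(u - 1, u + 2):   # 3 x 3 neighborhood of the center
            for n in range(v - 1, v + 2):
                if 0 <= m < H and 0 <= n < W:
                    cover[m][n] += 1
print(cover[2][2])  # 9: one window as center, eight as a neighbor
print(cover[0][0])  # 4: corner locations participate in fewer windows
```

Interior tokens therefore contribute to nine overlapping attention computations per layer, which is the inter-window feature propagation described above.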

4. Four-Stage Hierarchical Feature Extraction

SimViT adopts a four-stage structure analogous to CNN backbones, with progressive spatial downsampling and channel scaling:

  • Stage 1: Non-overlapping $4 \times 4$ patch embedding reduces the input to $\tfrac{H}{4} \times \tfrac{W}{4}$ ($C_1$ channels).
  • Stages 2–4: Each uses $2 \times 2$ patch embedding, further reducing the resolution to $\tfrac{H}{8} \times \tfrac{W}{8}$ ($C_2$), $\tfrac{H}{16} \times \tfrac{W}{16}$ ($C_3$), and $\tfrac{H}{32} \times \tfrac{W}{32}$ ($C_4$), respectively.

Each stage comprises:

  • Linear patch embedding and LayerNorm.
  • $L_i$ consecutive transformer blocks: MCSA for stages 1–3, global MSA for stage 4.
  • Depthwise convolutional feed-forward network (as in PVT v2) within each block.

The network produces a feature pyramid $\{x_1, x_2, x_3, x_4\}$ at strides $\{4, 8, 16, 32\}$ relative to the input. This facilitates direct use in standard detection or segmentation heads.
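The pyramid shapes follow directly from the stage strides. A quick sketch for a standard $224 \times 224$ input (channel widths $C_1$–$C_4$ are left out since they vary per variant):

```python
# Feature-pyramid spatial shapes implied by the four stage strides above.
H = W = 224
strides = [4, 8, 16, 32]
pyramid = [(H // s, W // s) for s in strides]
print(pyramid)  # [(56, 56), (28, 28), (14, 14), (7, 7)]
```

These are the same multi-scale resolutions a CNN backbone like ResNet feeds to FPN-style heads, which is why the features drop into standard detection and segmentation pipelines without modification.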

5. Model Variants and Complexity

SimViT is provided in five model sizes, each defined by per-stage channel widths $C_i$, number of MCSA heads $h_i$, FFN expansion ratio $r_i$, and block count $L_i$, as detailed in the originating paper. Key complexities on ImageNet-1K ($224 \times 224$ input) are summarized below:

Variant Parameters (M) FLOPs (G)
Micro 3.3 0.7
Tiny 13.0 2.5
Small 29.4 6.2
Medium 51.3 10.9
Large 62.9 12.2

The SimViT-Micro variant achieves the smallest parameter footprint (3.3M).

6. Empirical Results

Image Classification (ImageNet-1K, 300 epochs)

Model Params (M) FLOPs (G) Top-1 (%)
PVTv2-B0 3.4 0.6 70.5
SimViT-Micro 3.3 0.7 71.1
PVT-Tiny 13.2 1.9 75.1
SimViT-Tiny 13.0 2.5 79.3
PVT-Small 24.5 3.8 79.8
SimViT-Small 29.4 6.2 82.6
PVT-Medium 44.2 6.7 81.2
SimViT-Medium 51.3 10.9 83.3
PVT-Large 61.4 9.8 81.7
SimViT-Large 62.9 12.2 83.4

Detection and Segmentation

  • Object Detection (COCO 2017 val, SimViT-Small):
    • RetinaNet: AP improves from 45.0 (Swin-T) to 46.3
    • ATSS: AP improves from 47.2 to 49.6
    • GFL: AP improves from 47.6 to 49.9
  • Semantic Segmentation (ADE20K val, Semantic FPN):
Backbone Params (M) FLOPs (G) mIoU (%)
PVT-Tiny 17.0 33.2 35.7
SimViT-Tiny 16.8 36.1 42.7
PVT-Small 28.2 44.5 39.8
SimViT-Small 33.2 54.0 47.2

7. Design Characteristics and Efficiency–Accuracy Trade-Offs

SimViT achieves notable performance with minimal parameter count and computational cost. SimViT-Micro (3.3M, 0.7G) surpasses PVTv2-B0 (3.4M, 0.6G) in Top-1 accuracy (71.1% vs. 70.5%). Ablation studies indicate that the model's translation invariance renders absolute position encodings unnecessary, and that increasing the window size beyond $3 \times 3$ does not yield accuracy gains. In controlled-depth comparisons (e.g., a 2-2-2-2 block configuration), SimViT-Tiny (11.1M) exceeds PVT-Tiny (13.2M) by +2.7% Top-1. The efficiency stems directly from the overlapping local attention of MCSA.

In summary, SimViT unifies local sliding-window self-attention (MCSA), a global MSA stage, and a four-stage hierarchical structure without explicit positional encodings. This combination delivers strong accuracy for both image-level and dense prediction tasks, with model sizes between 3.3M and 63M parameters (Li et al., 2021).
