
GC ViT: Global Context Self-Attention

Updated 25 February 2026
  • GC ViT is a vision transformer architecture that integrates global context self-attention with local attention to capture both long- and short-range spatial dependencies.
  • It employs specialized Fused-MBConv blocks that introduce CNN-like inductive biases, boosting parameter efficiency and computational performance.
  • GC ViT achieves state-of-the-art results on image classification, object detection, and semantic segmentation benchmarks such as ImageNet-1K and MS COCO.

Global Context Self-Attention (GC-SA) defines a self-attention paradigm central to the Global Context Vision Transformer (GC ViT) architecture, designed to simultaneously capture long-range and short-range spatial dependencies within visual data. GC ViT improves parameter and computational efficiency relative to prior Vision Transformers (ViTs) by integrating GC-SA modules with standard local self-attention, obviating the need for computationally expensive attention masks or window-shifting operations. This framework additionally introduces CNN-inspired inductive biases using specialized Fused-MBConv blocks, contributing to state-of-the-art performance in image classification, object detection, and semantic segmentation tasks (Hatamizadeh et al., 2022).

1. Architectural Composition of GC ViT

GC ViT is structured as a hierarchical model consisting of four main stages, each containing alternating local and global context self-attention blocks:

  • Input Stem: The input is initially processed by a 3×3 convolution with stride 2, followed by LayerNorm and a Fused-MBConv block. This establishes a feature map of spatial size H/2 × W/2 with C_0 channels.
  • Hierarchical Stages (s = 1…4):
    • Each stage starts with a downsampler, composed of a strided 3×3 convolution, LayerNorm, and a Fused-MBConv block.
    • After downsampling, each stage contains J_s blocks that alternate between local windowed self-attention (LG-SA) and global context self-attention (GC-SA), operating on non-overlapping windows of size h_s × w_s (typically 7×7).
  • Head: After the final stage, global average pooling and a linear classifier are applied.

Within each stage, the alternation pattern is, for instance, [LG-SA → GC-SA → LG-SA → GC-SA] when J_s = 4. LG-SA modules implement standard window-based multi-head self-attention (MHSA), while GC-SA modules deploy the specialized global-to-local attention described below.
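To make the stage geometry concrete, the following sketch traces feature-map shapes for a 224×224 input under the description above. Per-stage channel doubling is a typical hierarchical-ViT convention assumed here for illustration, not stated in the text:

```python
def gcvit_shapes(H, W, C0, num_stages=4):
    """Return the (H, W, C) feature-map shape after the stem and each stage."""
    shapes = [(H // 2, W // 2, C0)]          # stride-2 stem output
    h, w, c = shapes[0]
    for _ in range(num_stages):
        h, w, c = h // 2, w // 2, c * 2      # per-stage downsampler (channels assumed to double)
        shapes.append((h, w, c))
    return shapes

print(gcvit_shapes(224, 224, 64))
# stem 112x112, stages 56 -> 28 -> 14 -> 7
```

Note that the final stage reaches a 7×7 resolution, matching the typical window size, so the last stage's "windows" span the whole feature map.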

2. Global Context Self-Attention Formulation

GC-SA enables global queries to attend to local keys and values within windows, facilitating efficient modeling of cross-window relations:

  • Window Partitioning: Given a feature map X ∈ R^{B×H×W×C} at a stage, it is partitioned into N = (H/h)·(W/w) non-overlapping windows, denoted X_i ∈ R^{B×(h·w)×C} for i = 1…N.
  • Global Query Generator: A small CNN repeatedly applies downsampling and Fused-MBConv blocks to X until the output spatial dimensions match h×w. This yields G(X) ∈ R^{B×h×w×C}. The reshaped global queries become Q_g := reshape(G(X), [B, h·w, C]).
  • Local Keys and Values: For each window i, X_i is linearly projected to obtain K_i = X_i W_k and V_i = X_i W_v, with W_k, W_v ∈ R^{C×C}.
  • Global Self-Attention Operation: All queries, keys, and values are split into F heads of dimension d = C/F, with a relative position bias b ∈ R^{(2p−1)×(2p−1)} added:

A_i = \mathrm{softmax}\left( (Q_g W_q) K_i^\top / \sqrt{d} + b \right), \quad Y_i = A_i V_i

Output heads are concatenated and passed through a linear projection.

  • Computational Complexity: The overall cost per stage is O(2·H·W·C² + H·W·h·w·C), which maintains efficiency on par with windowed-attention approaches while introducing scalable long-range interactions.
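A minimal single-head NumPy sketch of the GC-SA computation above. Simple 2× average pooling stands in for the Fused-MBConv query generator G, and the relative position bias is omitted; both simplifications are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Toy sizes: one image, 14x14 feature map, 7x7 windows, C channels, one head.
H = W = 14; h = w = 7; C = 16
N = (H // h) * (W // w)                          # number of windows (4)

X = rng.standard_normal((H, W, C))
Wq, Wk, Wv = (rng.standard_normal((C, C)) * 0.1 for _ in range(3))

# Stand-in for the global query generator G(X): 2x average pooling down to
# h x w (the paper uses a small downsampling Fused-MBConv CNN instead).
G = X.reshape(h, 2, w, 2, C).mean(axis=(1, 3))   # (h, w, C)
Q_g = G.reshape(h * w, C)                        # shared global queries

# Partition X into non-overlapping h x w windows, flattened to (N, h*w, C).
Xw = (X.reshape(H // h, h, W // w, w, C)
        .transpose(0, 2, 1, 3, 4)
        .reshape(N, h * w, C))

# GC-SA: the SAME global queries attend to every window's local keys/values.
Q = Q_g @ Wq                                     # computed once per stage
Y = np.empty_like(Xw)
for i in range(N):
    K, V = Xw[i] @ Wk, Xw[i] @ Wv                # window-local keys/values
    A = softmax(Q @ K.T / np.sqrt(C))            # (h*w, h*w) attention map
    Y[i] = A @ V

print(Y.shape)  # (4, 49, 16)
```

Because Q_g is produced once per stage and reused across all N windows, each window's attention stays O((h·w)²) while still mixing in global context.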

3. Fused-MBConv Blocks and Inductive Bias

To embed spatial locality and cross-channel bias, GC ViT deploys modified Fused-MBConv blocks at each stage:

  • Structure: Each block implements the following pipeline for input x ∈ R^{B×H×W×C}:

    1. Depth-wise 3×3 convolution
    2. GELU activation
    3. Squeeze-and-Excite operation
    4. 1×1 convolution
    5. Residual addition to the input.
  • Equations:

\hat{x} = \mathrm{DWConv}_{3\times3}(x);\quad \hat{x} = \mathrm{GELU}(\hat{x});\quad \hat{x} = \mathrm{SE}(\hat{x});\quad x_{\mathrm{out}} = \mathrm{Conv}_{1\times1}(\hat{x}) + x

  • Functional Roles: These blocks act as (i) injectors of spatial and channel-wise inductive biases (analogous to CNNs); (ii) downsamplers when used in strided mode; (iii) CNN-style token generators forming the global queries for GC-SA.
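The five steps above can be sketched in NumPy as follows (NHWC layout). The SE bottleneck ratio and its ReLU/sigmoid gating are conventional choices assumed here, and the 1×1 convolution is written as a per-pixel linear map:

```python
import numpy as np

def gelu(x):  # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def dwconv3x3(x, w):
    """Depth-wise 3x3 convolution with zero padding. x: (H, W, C), w: (3, 3, C)."""
    H, W, C = x.shape
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros_like(x)
    for di in range(3):
        for dj in range(3):
            out += xp[di:di + H, dj:dj + W, :] * w[di, dj]
    return out

def squeeze_excite(x, w1, w2):
    """Channel gating: global average pool -> bottleneck MLP -> sigmoid scale."""
    s = x.mean(axis=(0, 1))                              # (C,) squeeze
    g = 1 / (1 + np.exp(-(np.maximum(s @ w1, 0) @ w2)))  # ReLU bottleneck, sigmoid gate
    return x * g                                         # excite

def fused_mbconv(x, p):
    h = dwconv3x3(x, p["dw"])                  # 1. depth-wise 3x3 conv
    h = gelu(h)                                # 2. GELU
    h = squeeze_excite(h, p["se1"], p["se2"])  # 3. Squeeze-and-Excite
    h = h @ p["pw"]                            # 4. 1x1 conv == per-pixel linear map
    return x + h                               # 5. residual addition

rng = np.random.default_rng(0)
C, r = 16, 4                                   # channels, assumed SE reduction ratio
p = {"dw":  rng.standard_normal((3, 3, C)) * 0.1,
     "se1": rng.standard_normal((C, C // r)) * 0.1,
     "se2": rng.standard_normal((C // r, C)) * 0.1,
     "pw":  rng.standard_normal((C, C)) * 0.1}
x = rng.standard_normal((8, 8, C))
print(fused_mbconv(x, p).shape)  # (8, 8, 16)
```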

4. Blockwise Operations and Data Pipeline

A high-level GC ViT block performs an alternation between LG-SA and GC-SA, as governed by stage configuration. The forward function can be outlined as:

def GC_ViT_Block(X, Q_g, is_global, params):
    # params holds the learned weights: Wq, Wk, Wv, Wo, the MLP, and the
    # relative position biases b_local / b_global.
    Z = LayerNorm(X)                    # pre-norm, as in standard ViT blocks
    if not is_global:
        # Local windowed self-attention: queries come from the window itself
        Q = Z @ Wq
        b = b_local
    else:
        # Global context self-attention: shared global queries attend to
        # window-local keys and values
        Q = Q_g @ Wq
        b = b_global
    K = Z @ Wk
    V = Z @ Wv
    A = softmax(Q @ K.T / sqrt(d) + b)
    Y = A @ V

    X = X + (Y @ Wo)                    # attention residual
    X = X + MLP(LayerNorm(X))           # MLP residual
    return X
A full stage processes all windows and alternates is_global between False (LG-SA) and True (GC-SA) for J_s blocks.
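The alternation schedule itself is trivial to generate; a sketch assuming the even-local / odd-global convention described above:

```python
def stage_schedule(J_s):
    """Alternate LG-SA / GC-SA within a stage: even indices local, odd global."""
    return [("GC-SA" if j % 2 else "LG-SA") for j in range(J_s)]

print(stage_schedule(4))  # ['LG-SA', 'GC-SA', 'LG-SA', 'GC-SA']
```

Driving the block sketch above then amounts to `for kind in stage_schedule(J_s): X = GC_ViT_Block(X, Q_g, kind == "GC-SA", params)`.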

5. Training Protocols and Empirical Benchmarks

GC ViT is evaluated on standard computer vision tasks, demonstrating high efficacy:

  • Image Classification (ImageNet-1K, 224×224 input):
    • Optimizer: AdamW (lr = 1e-3, weight decay 0.05)
    • 300 epochs, cosine decay, 20 warm-up/cool-down epochs, batch size 1024 or 4096
    • No extra data/distillation
    • Top-1 accuracy:
Model Params (M) FLOPs (G) Top-1 (%)
GC ViT-S 51 8.5 84.3
GC ViT-B 90 14.8 85.0
GC ViT-L 201 32.6 85.7

These results surpass comparably-sized Swin, ConvNeXt, MaxViT, and other ViT derivatives.
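A plausible reading of the stated schedule as a per-epoch learning-rate function (the exact warm-up/cool-down handling is an assumption; the authors' training code may differ):

```python
import math

def lr_at_epoch(epoch, base_lr=1e-3, total=300, warmup=20, cooldown=20):
    """Linear warm-up to base_lr, then cosine decay; final epochs held at the minimum."""
    if epoch < warmup:
        return base_lr * (epoch + 1) / warmup
    t = (epoch - warmup) / max(1, total - warmup - cooldown)
    t = min(t, 1.0)                           # cool-down: hold at the end value
    return 0.5 * base_lr * (1 + math.cos(math.pi * t))

print(lr_at_epoch(19), lr_at_epoch(160))
```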

  • Object Detection & Instance Segmentation (MS COCO):
    • Backbones: ImageNet-1K-pretrained GC ViT
    • Heads: Mask-RCNN / Cascade Mask-RCNN (3× schedule), lr=1e-4, weight decay=0.05, batch 16
Model + Head                       Params (M)  Box AP  Mask AP  Comparison
GC ViT-T + Mask-RCNN               28          47.9    43.2     Swin-T: 46.0/41.6
GC ViT-S + Cascade Mask-RCNN       —           52.4    45.4     ConvNeXt-S: 51.9/45.0
GC ViT-L + DINO (IN-21K)           —           58.3    —        Swin-L: 58.0
  • Semantic Segmentation (ADE20K):
    • Backbone: ImageNet-1K-pretrained GC ViT, UPerNet head, crop 512×512
Model Params (M) mIoU (%) Comparison
GC ViT-T 58 47.0 Twins-SVT-S: 46.2
GC ViT-S 84 48.3 Twins-SVT-B: 47.7
GC ViT-B 125 49.2 Twins-SVT-L: 48.8

GC ViT consistently outperforms prior models of comparable or greater complexity across these tasks.

6. Significance and Research Context

GC ViT addresses two primary limitations in existing ViT-inspired architectures: the inability to efficiently model long-range dependencies without costly global attention, and the loss of spatial-inductive bias found in CNNs. By alternating between local and global context self-attention—where GC-SA efficiently couples global query tokens with window-local keys and values—GC ViT achieves efficient cross-window reasoning. The introduction of CNN-like Fused-MBConv blocks further closes the inductive bias gap.

A plausible implication is that the GC-SA mechanism may be adaptable to other dense prediction tasks and modalities, provided efficient synthesis of global context and local structure is needed.

7. Implementation Considerations and Reproducibility

The self-contained mathematical descriptions of GC-SA, blockwise alternation, and Fused-MBConv design offered in (Hatamizadeh et al., 2022) are sufficient for implementation. The architecture removes the necessity for computationally expensive global attention, attention masks, or window shifting. Global queries are computed once per stage using the specified CNN, and alternate with local attention blocks according to the stage’s JsJ_s parameter. Training schedules, optimizer settings, and data augmentations are specified to reproduce reported results on ImageNet-1K, MS COCO, and ADE20K. All empirical benchmarks compare directly and favorably to prior art, establishing a consistent state-of-the-art across vision benchmarks.

References (1)

Hatamizadeh, A., Yin, H., Heinrich, G., Kautz, J., & Molchanov, P. (2022). Global Context Vision Transformers. arXiv:2206.09959.
