GC ViT: Global Context Self-Attention
- GC ViT is a vision transformer architecture that integrates global context self-attention with local attention to capture both long- and short-range spatial dependencies.
- It employs specialized Fused-MBConv blocks that introduce CNN-like inductive biases, boosting parameter efficiency and computational performance.
- GC ViT achieves state-of-the-art results on image classification, object detection, and semantic segmentation benchmarks such as ImageNet-1K and MS COCO.
Global Context Self-Attention (GC-SA) defines a self-attention paradigm central to the Global Context Vision Transformer (GC ViT) architecture, designed to simultaneously capture long-range and short-range spatial dependencies within visual data. GC ViT improves parameter and computational efficiency relative to prior Vision Transformers (ViTs) by integrating GC-SA modules with standard local self-attention, obviating the need for computationally expensive attention masks or window-shifting operations. This framework additionally introduces CNN-inspired inductive biases using specialized Fused-MBConv blocks, contributing to state-of-the-art performance in image classification, object detection, and semantic segmentation tasks (Hatamizadeh et al., 2022).
1. Architectural Composition of GC ViT
GC ViT is structured as a hierarchical model consisting of four main stages, each containing alternating local and global context self-attention blocks:
- Input Stem: The input is initially processed by two successive convolutions with stride 2, followed by LayerNorm and a Fused-MBConv block. This establishes a feature map of spatial size H/4 × W/4 with C channels.
- Hierarchical Stages (s = 1...4):
- Each stage after the first starts with a downsampler—composed of a strided convolution, LayerNorm, and a Fused-MBConv block—that halves the spatial resolution and doubles the channel count.
- After downsampling, each stage contains blocks that alternate between local windowed self-attention (LG-SA) and global context self-attention (GC-SA) operating on non-overlapping windows of size h × h—typically 7 × 7.
- After all stages, global average pooling and a linear classifier are applied.
Within each stage, the alternation pattern is, for instance, [LG-SA → GC-SA → LG-SA → GC-SA] when the stage depth is four. LG-SA modules implement standard window-based multi-head self-attention (MHSA), while GC-SA modules deploy the specialized global-to-local attention described below.
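As an illustration of the hierarchy described above, the following sketch traces spatial resolution, channel width, and block alternation across the four stages. The uniform depth, embedding dimension, and downsampling placement are hypothetical defaults for a 224 × 224 input, not a published configuration:

```python
def stage_plan(img_size=224, embed_dim=64, depth=4, num_stages=4):
    """Trace a GC ViT-style hierarchy: the stem halves the input twice
    (two stride-2 convolutions -> H/4), later stages halve resolution
    and double channels, and blocks inside a stage alternate local
    (LG-SA) and global (GC-SA) attention. Illustrative sizes only."""
    res = img_size // 4            # after the two stride-2 stem convolutions
    ch = embed_dim
    plan = []
    for s in range(num_stages):
        if s > 0:                  # downsampler: strided conv halves H and W
            res //= 2
            ch *= 2
        blocks = ["LG-SA" if i % 2 == 0 else "GC-SA" for i in range(depth)]
        plan.append({"stage": s + 1, "resolution": res, "channels": ch,
                     "blocks": blocks})
    return plan

for stage in stage_plan():
    print(stage)   # resolutions 56, 28, 14, 7 for a 224x224 input
```

The final-stage resolution (7 × 7) matching the typical window size is what lets a single window cover the whole feature map at the deepest stage.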
2. Global Context Self-Attention Formulation
GC-SA enables global queries to attend to local keys and values within windows, facilitating efficient modeling of cross-window relations:
- Window Partitioning: Given a feature map x ∈ R^(H×W×C) at a stage, it is partitioned into M = HW/h² non-overlapping windows of size h × h, denoted x_i for i = 1, …, M.
- Global Query Generator: A small CNN repeatedly applies stride-2 downsampling and Fused-MBConv blocks to the full stage feature map until its spatial dimensions match the window size h × h. This yields global query tokens q_g ∈ R^(h×h×C), computed once per stage and shared across all windows. Reshaped, the global queries become q_g ∈ R^(h²×C).
- Local Keys and Values: For each window x_i, linear projections yield keys k_i and values v_i, with k_i, v_i ∈ R^(h²×C).
- Global Self-Attention Operation: All queries, keys, and values are split into heads of dimension d (C divided by the number of heads), and a relative position bias b is added:

  Attention(q_g, k_i, v_i) = Softmax(q_g k_iᵀ / √d + b) v_i

  Output heads are concatenated and passed through a linear projection.
- Computational Complexity: The per-stage attention cost is O(h² · HW · C), which maintains efficiency on par with windowed attention approaches while introducing scalable long-range interactions.
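A single-head NumPy sketch may clarify the data flow above: the same precomputed global query attends to keys and values taken from every local window. Projection matrices and the relative position bias are omitted (identity projections), so this illustrates the attention pattern rather than the full module:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def global_context_attention(x, q_global, h=7):
    """Single-head GC-SA sketch.
    x: (H, W, C) feature map; q_global: (h*h, C) global query tokens,
    assumed precomputed once per stage by the CNN query generator.
    The same global query attends to keys/values from every window."""
    H, W, C = x.shape
    M = (H // h) * (W // h)                      # number of windows
    # partition into non-overlapping windows: (M, h*h, C)
    windows = (x.reshape(H // h, h, W // h, h, C)
                .transpose(0, 2, 1, 3, 4)
                .reshape(M, h * h, C))
    k = v = windows                              # identity W_k, W_v for the sketch
    scores = q_global @ k.transpose(0, 2, 1) / np.sqrt(C)   # (M, h*h, h*h)
    out = softmax(scores) @ v                    # (M, h*h, C)
    # merge windows back to (H, W, C)
    return (out.reshape(H // h, W // h, h, h, C)
               .transpose(0, 2, 1, 3, 4)
               .reshape(H, W, C))

x = np.random.randn(14, 14, 8)
q_g = np.random.randn(49, 8)   # 7x7 window of global query tokens
y = global_context_attention(x, q_g, h=7)
print(y.shape)  # (14, 14, 8)
```

For a stage at resolution H, the query generator needs log2(H/h) stride-2 steps to reach the window size (here one step, 14 → 7).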
3. Fused-MBConv Blocks and Inductive Bias
To embed spatial locality and cross-channel bias, GC ViT deploys modified Fused-MBConv blocks at each stage:
- Structure: Each block implements the following pipeline for input x:
- 3 × 3 depth-wise convolution
- GELU activation
- Squeeze-and-Excite operation
- 1 × 1 convolution
- Residual addition to the input.
Equations:

x̂ = DW-Conv3×3(x)
x̂ = GELU(x̂)
x̂ = SE(x̂)
x = Conv1×1(x̂) + x
- Functional Roles: These blocks act as (i) injectors of spatial and channel-wise inductive biases (analogous to CNNs); (ii) downsamplers when used in strided mode; (iii) CNN-style token generators forming the global queries for GC-SA.
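Following the pipeline and equations above, a pure-NumPy sketch of one Fused-MBConv block might look as follows. The weight shapes and the SE bottleneck ratio (4) are assumptions for illustration, not trained parameters:

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def fused_mbconv(x, w_dw, w_se1, w_se2, w_pw):
    """Sketch of the modified Fused-MBConv pipeline: depth-wise 3x3 conv
    -> GELU -> Squeeze-and-Excite -> 1x1 conv -> residual addition.
    x: (H, W, C). Hypothetical weight shapes: w_dw (3, 3, C) depthwise
    kernel; w_se1 (C, C//4) and w_se2 (C//4, C) SE bottleneck; w_pw (C, C)
    pointwise (1x1) kernel."""
    H, W, C = x.shape
    # depth-wise 3x3 convolution: zero padding, one filter per channel
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    dw = np.zeros_like(x)
    for i in range(3):
        for j in range(3):
            dw += xp[i:i + H, j:j + W, :] * w_dw[i, j, :]
    h = gelu(dw)
    # squeeze-and-excite: global pool -> bottleneck MLP -> sigmoid gate
    s = h.mean(axis=(0, 1))                                  # (C,)
    gate = 1 / (1 + np.exp(-(np.maximum(s @ w_se1, 0) @ w_se2)))
    h = h * gate                                             # channel reweighting
    # 1x1 convolution and residual addition
    return x + h @ w_pw

C = 8
x = np.random.randn(6, 6, C)
y = fused_mbconv(x, np.random.randn(3, 3, C) * 0.1,
                 np.random.randn(C, C // 4), np.random.randn(C // 4, C),
                 np.random.randn(C, C) * 0.1)
print(y.shape)  # (6, 6, 8)
```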
4. Blockwise Operations and Data Pipeline
A high-level GC ViT block performs an alternation between LG-SA and GC-SA, as governed by stage configuration. The forward function can be outlined as:
```python
def GC_ViT_Block(X, Q_g, is_global, params):
    # X: window-partitioned tokens; Q_g: precomputed per-stage global query
    Wq, Wk, Wv, Wo = params.Wq, params.Wk, params.Wv, params.Wo
    Xn = LayerNorm(X)                        # pre-norm before attention
    if not is_global:
        # Local windowed self-attention: queries come from the window itself
        Q = Xn @ Wq
        b = params.b_local
    else:
        # Global context self-attention: queries come from the global tokens
        Q = Q_g @ Wq
        b = params.b_global
    K = Xn @ Wk
    V = Xn @ Wv
    A = softmax(Q @ K.T / sqrt(d) + b)       # b: relative position bias
    Y = A @ V
    X = X + (Y @ Wo)                         # attention residual
    X = X + MLP(LayerNorm(X))                # MLP residual
    return X
```
5. Training Protocols and Empirical Benchmarks
GC ViT is evaluated on standard computer vision tasks, demonstrating high efficacy:
- Image Classification (ImageNet-1K, 224 × 224 input):
- Optimizer: AdamW (weight decay 0.05)
- 300 epochs, cosine decay, 20 warm-up/cool-down epochs, batch size 1024 or 4096
- No extra data/distillation
- Top-1 accuracy:
| Model | Params (M) | FLOPs (G) | Top-1 (%) |
|---|---|---|---|
| GC ViT-S | 51 | 8.5 | 84.3 |
| GC ViT-B | 90 | 14.8 | 85.0 |
| GC ViT-L | 201 | 32.6 | 85.7 |
These results surpass comparably-sized Swin, ConvNeXt, MaxViT, and other ViT derivatives.
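The learning-rate schedule named in the protocol above (linear warm-up, then cosine decay over 300 epochs) can be sketched as follows. The peak learning rate is left as a parameter since it is not stated here, and the cool-down phase is omitted:

```python
import math

def lr_at_epoch(epoch, base_lr, total_epochs=300, warmup=20):
    """Linear warm-up for `warmup` epochs, then cosine decay to zero
    over the remaining epochs. Sketch only: the cool-down epochs and
    the actual peak learning rate are not modeled here."""
    if epoch < warmup:
        return base_lr * (epoch + 1) / warmup          # linear ramp-up
    t = (epoch - warmup) / (total_epochs - warmup)     # decay progress in [0, 1]
    return base_lr * 0.5 * (1 + math.cos(math.pi * t))

print(lr_at_epoch(0, 1.0))    # 0.05 at the first warm-up epoch
print(lr_at_epoch(19, 1.0))   # 1.0 at the end of warm-up
print(lr_at_epoch(299, 1.0))  # near zero at the final epoch
```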
- Object Detection & Instance Segmentation (MS COCO):
- Backbones: ImageNet-1K-pretrained GC ViT
- Heads: Mask-RCNN / Cascade Mask-RCNN (3× schedule), lr=1e-4, weight decay=0.05, batch 16
| Model + Head | Params (M) | Box AP | Mask AP | Comparison (Box / Mask) |
|---|---|---|---|---|
| GC ViT-T + Mask-RCNN | 28 | 47.9 | 43.2 | Swin-T: 46.0 / 41.6 |
| GC ViT-S + Cascade Mask-RCNN | | 52.4 | 45.4 | ConvNeXt-S: 51.9 / 45.0 |
| GC ViT-L + DINO (IN-21K) | | 58.3 | n/a | Swin-L: 58.0 |
- Semantic Segmentation (ADE20K):
- Backbone: ImageNet-1K-pretrained GC ViT, UPerNet head, 512 × 512 crop
| Model | Params (M) | mIoU (%) | Comparison |
|---|---|---|---|
| GC ViT-T | 58 | 47.0 | Twins-SVT-S: 46.2 |
| GC ViT-S | 84 | 48.3 | Twins-SVT-B: 47.7 |
| GC ViT-B | 125 | 49.2 | Twins-SVT-L: 48.8 |
GC ViT consistently outperforms prior models of comparable or greater complexity across these tasks.
6. Significance and Research Context
GC ViT addresses two primary limitations in existing ViT-inspired architectures: the inability to efficiently model long-range dependencies without costly global attention, and the loss of spatial-inductive bias found in CNNs. By alternating between local and global context self-attention—where GC-SA efficiently couples global query tokens with window-local keys and values—GC ViT achieves efficient cross-window reasoning. The introduction of CNN-like Fused-MBConv blocks further closes the inductive bias gap.
A plausible implication is that the GC-SA mechanism may be adaptable to other dense prediction tasks and modalities, provided efficient synthesis of global context and local structure is needed.
7. Implementation Considerations and Reproducibility
The self-contained mathematical descriptions of GC-SA, blockwise alternation, and Fused-MBConv design offered in Hatamizadeh et al. (2022) are sufficient for implementation. The architecture removes the need for computationally expensive global attention, attention masks, or window shifting. Global queries are computed once per stage by the specified CNN and alternate with local attention blocks according to the stage's depth parameter. Training schedules, optimizer settings, and data augmentations are specified to reproduce the reported results on ImageNet-1K, MS COCO, and ADE20K, and the reported benchmarks compare directly and favorably to prior art across these tasks.