GC ViT: Global Context Self-Attention
- GC ViT is a vision transformer architecture that integrates global context self-attention with local attention to capture both long- and short-range spatial dependencies.
- It employs specialized Fused-MBConv blocks that introduce CNN-like inductive biases, boosting parameter efficiency and computational performance.
- GC ViT achieves state-of-the-art results on image classification, object detection, and semantic segmentation benchmarks such as ImageNet-1K and MS COCO.
Global Context Self-Attention (GC-SA) defines a self-attention paradigm central to the Global Context Vision Transformer (GC ViT) architecture, designed to simultaneously capture long-range and short-range spatial dependencies within visual data. GC ViT improves parameter and computational efficiency relative to prior Vision Transformers (ViTs) by integrating GC-SA modules with standard local self-attention, obviating the need for computationally expensive attention masks or window-shifting operations. This framework additionally introduces CNN-inspired inductive biases using specialized Fused-MBConv blocks, contributing to state-of-the-art performance in image classification, object detection, and semantic segmentation tasks (Hatamizadeh et al., 2022).
1. Architectural Composition of GC ViT
GC ViT is structured as a hierarchical model consisting of four main stages, each containing alternating local and global context self-attention blocks:
- Input Stem: The input is initially processed by two successive convolutions with stride 2, followed by LayerNorm and a Fused-MBConv block. This establishes a feature map of spatial size H/4 × W/4 with C channels.
- Hierarchical Stages (s = 1...4):
- Each stage after the first starts with a downsampler—composed of a strided convolution, LayerNorm, and a Fused-MBConv block—that halves the spatial resolution and doubles the channel count.
- After downsampling, each stage contains blocks that alternate between local windowed self-attention (LG-SA) and global context self-attention (GC-SA) operating on non-overlapping windows of size h × h—typically 7 × 7.
- After all stages, global average pooling and a linear classifier are applied.
Within each stage, the alternation pattern is, for instance, [LG-SA → GC-SA → LG-SA → GC-SA] when the stage depth is four. LG-SA modules implement standard window-based multi-head self-attention (MHSA), while GC-SA modules deploy the specialized global-to-local attention described below.
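As an illustration of the hierarchy described above, the following sketch traces spatial resolution, channel width, and block alternation across the four stages. The uniform depth, embedding dimension, and downsampling placement are hypothetical defaults for a 224 × 224 input, not a published configuration:

```python
def stage_plan(img_size=224, embed_dim=64, depth=4, num_stages=4):
    """Trace a GC ViT-style hierarchy: the stem halves the input twice
    (two stride-2 convolutions -> H/4), later stages halve resolution
    and double channels, and blocks inside a stage alternate local
    (LG-SA) and global (GC-SA) attention. Illustrative sizes only."""
    res = img_size // 4            # after the two stride-2 stem convolutions
    ch = embed_dim
    plan = []
    for s in range(num_stages):
        if s > 0:                  # downsampler: strided conv halves H and W
            res //= 2
            ch *= 2
        blocks = ["LG-SA" if i % 2 == 0 else "GC-SA" for i in range(depth)]
        plan.append({"stage": s + 1, "resolution": res, "channels": ch,
                     "blocks": blocks})
    return plan

for stage in stage_plan():
    print(stage)   # resolutions 56, 28, 14, 7 for a 224x224 input
```

The final-stage resolution (7 × 7) matching the typical window size is what lets a single window cover the whole feature map at the deepest stage.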
2. Global Context Self-Attention Formulation
GC-SA enables global queries to attend to local keys and values within windows, facilitating efficient modeling of cross-window relations:
- Window Partitioning: Given a feature map x ∈ R^(H×W×C) at a stage, it is partitioned into M = HW/h² non-overlapping windows of size h × h, denoted x_i for i = 1, …, M.
- Global Query Generator: A small CNN repeatedly applies stride-2 downsampling and Fused-MBConv blocks to the full stage feature map until its spatial dimensions match the window size h × h. This yields global query tokens q_g ∈ R^(h×h×C), computed once per stage and shared across all windows. Reshaped, the global queries become q_g ∈ R^(h²×C).
- Local Keys and Values: For each window x_i, linear projections yield keys k_i and values v_i, with k_i, v_i ∈ R^(h²×C).
- Global Self-Attention Operation: All queries, keys, and values are split into heads of dimension d (C divided by the number of heads), and a relative position bias b is added:

  Attention(q_g, k_i, v_i) = Softmax(q_g k_iᵀ / √d + b) v_i

  Output heads are concatenated and passed through a linear projection.
- Computational Complexity: The per-stage attention cost is O(h² · HW · C), which maintains efficiency on par with windowed attention approaches while introducing scalable long-range interactions.
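A single-head NumPy sketch may clarify the data flow above: the same precomputed global query attends to keys and values taken from every local window. Projection matrices and the relative position bias are omitted (identity projections), so this illustrates the attention pattern rather than the full module:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def global_context_attention(x, q_global, h=7):
    """Single-head GC-SA sketch.
    x: (H, W, C) feature map; q_global: (h*h, C) global query tokens,
    assumed precomputed once per stage by the CNN query generator.
    The same global query attends to keys/values from every window."""
    H, W, C = x.shape
    M = (H // h) * (W // h)                      # number of windows
    # partition into non-overlapping windows: (M, h*h, C)
    windows = (x.reshape(H // h, h, W // h, h, C)
                .transpose(0, 2, 1, 3, 4)
                .reshape(M, h * h, C))
    k = v = windows                              # identity W_k, W_v for the sketch
    scores = q_global @ k.transpose(0, 2, 1) / np.sqrt(C)   # (M, h*h, h*h)
    out = softmax(scores) @ v                    # (M, h*h, C)
    # merge windows back to (H, W, C)
    return (out.reshape(H // h, W // h, h, h, C)
               .transpose(0, 2, 1, 3, 4)
               .reshape(H, W, C))

x = np.random.randn(14, 14, 8)
q_g = np.random.randn(49, 8)   # 7x7 window of global query tokens
y = global_context_attention(x, q_g, h=7)
print(y.shape)  # (14, 14, 8)
```

For a stage at resolution H, the query generator needs log2(H/h) stride-2 steps to reach the window size (here one step, 14 → 7).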
3. Fused-MBConv Blocks and Inductive Bias
To embed spatial locality and cross-channel bias, GC ViT deploys modified Fused-MBConv blocks at each stage:
- Structure: Each block implements the following pipeline for input x:
- 3 × 3 depth-wise convolution
- GELU activation
- Squeeze-and-Excite operation
- 1 × 1 convolution
- Residual addition to the input.
Equations:

x̂ = DW-Conv3×3(x)
x̂ = GELU(x̂)
x̂ = SE(x̂)
x = Conv1×1(x̂) + x
- Functional Roles: These blocks act as (i) injectors of spatial and channel-wise inductive biases (analogous to CNNs); (ii) downsamplers when used in strided mode; (iii) CNN-style token generators forming the global queries for GC-SA.
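Following the pipeline and equations above, a pure-NumPy sketch of one Fused-MBConv block might look as follows. The weight shapes and the SE bottleneck ratio (4) are assumptions for illustration, not trained parameters:

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def fused_mbconv(x, w_dw, w_se1, w_se2, w_pw):
    """Sketch of the modified Fused-MBConv pipeline: depth-wise 3x3 conv
    -> GELU -> Squeeze-and-Excite -> 1x1 conv -> residual addition.
    x: (H, W, C). Hypothetical weight shapes: w_dw (3, 3, C) depthwise
    kernel; w_se1 (C, C//4) and w_se2 (C//4, C) SE bottleneck; w_pw (C, C)
    pointwise (1x1) kernel."""
    H, W, C = x.shape
    # depth-wise 3x3 convolution: zero padding, one filter per channel
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    dw = np.zeros_like(x)
    for i in range(3):
        for j in range(3):
            dw += xp[i:i + H, j:j + W, :] * w_dw[i, j, :]
    h = gelu(dw)
    # squeeze-and-excite: global pool -> bottleneck MLP -> sigmoid gate
    s = h.mean(axis=(0, 1))                                  # (C,)
    gate = 1 / (1 + np.exp(-(np.maximum(s @ w_se1, 0) @ w_se2)))
    h = h * gate                                             # channel reweighting
    # 1x1 convolution and residual addition
    return x + h @ w_pw

C = 8
x = np.random.randn(6, 6, C)
y = fused_mbconv(x, np.random.randn(3, 3, C) * 0.1,
                 np.random.randn(C, C // 4), np.random.randn(C // 4, C),
                 np.random.randn(C, C) * 0.1)
print(y.shape)  # (6, 6, 8)
```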
4. Blockwise Operations and Data Pipeline
A high-level GC ViT block performs an alternation between LG-SA and GC-SA, as governed by stage configuration. The forward function can be outlined as:
```python
def GC_ViT_Block(X, Q_g, is_global, params):
    # X: window-partitioned tokens; Q_g: precomputed per-stage global query
    Wq, Wk, Wv, Wo = params.Wq, params.Wk, params.Wv, params.Wo
    Xn = LayerNorm(X)                        # pre-norm before attention
    if not is_global:
        # Local windowed self-attention: queries come from the window itself
        Q = Xn @ Wq
        b = params.b_local
    else:
        # Global context self-attention: queries come from the global tokens
        Q = Q_g @ Wq
        b = params.b_global
    K = Xn @ Wk
    V = Xn @ Wv
    A = softmax(Q @ K.T / sqrt(d) + b)       # b: relative position bias
    Y = A @ V
    X = X + (Y @ Wo)                         # attention residual
    X = X + MLP(LayerNorm(X))                # MLP residual
    return X
```
5. Training Protocols and Empirical Benchmarks
GC ViT is evaluated on standard computer vision tasks, demonstrating high efficacy:
- Image Classification (ImageNet-1K, 224 × 224 input):
- Optimizer: AdamW (weight decay 0.05)
- 300 epochs, cosine decay, 20 warm-up/cool-down epochs, batch size 1024 or 4096
- No extra data/distillation
- Top-1 accuracy:
| Model | Params (M) | FLOPs (G) | Top-1 (%) |
|---|---|---|---|
| GC ViT-S | 51 | 8.5 | 84.3 |
| GC ViT-B | 90 | 14.8 | 85.0 |
| GC ViT-L | 201 | 32.6 | 85.7 |
These results surpass comparably-sized Swin, ConvNeXt, MaxViT, and other ViT derivatives.
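The learning-rate schedule named in the protocol above (linear warm-up, then cosine decay over 300 epochs) can be sketched as follows. The peak learning rate is left as a parameter since it is not stated here, and the cool-down phase is omitted:

```python
import math

def lr_at_epoch(epoch, base_lr, total_epochs=300, warmup=20):
    """Linear warm-up for `warmup` epochs, then cosine decay to zero
    over the remaining epochs. Sketch only: the cool-down epochs and
    the actual peak learning rate are not modeled here."""
    if epoch < warmup:
        return base_lr * (epoch + 1) / warmup          # linear ramp-up
    t = (epoch - warmup) / (total_epochs - warmup)     # decay progress in [0, 1]
    return base_lr * 0.5 * (1 + math.cos(math.pi * t))

print(lr_at_epoch(0, 1.0))    # 0.05 at the first warm-up epoch
print(lr_at_epoch(19, 1.0))   # 1.0 at the end of warm-up
print(lr_at_epoch(299, 1.0))  # near zero at the final epoch
```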
- Object Detection & Instance Segmentation (MS COCO):
- Backbones: ImageNet-1K-pretrained GC ViT
- Heads: Mask-RCNN / Cascade Mask-RCNN (3× schedule), lr=1e-4, weight decay=0.05, batch 16
| Model + Head | Params (M) | Box AP | Mask AP | Comparison (Box / Mask) |
|---|---|---|---|---|
| GC ViT-T + Mask-RCNN | 28 | 47.9 | 43.2 | Swin-T: 46.0 / 41.6 |
| GC ViT-S + Cascade Mask-RCNN | | 52.4 | 45.4 | ConvNeXt-S: 51.9 / 45.0 |
| GC ViT-L + DINO (IN-21K) | | 58.3 | n/a | Swin-L: 58.0 |
- Semantic Segmentation (ADE20K):
- Backbone: ImageNet-1K-pretrained GC ViT, UPerNet head, 512 × 512 crop
| Model | Params (M) | mIoU (%) | Comparison |
|---|---|---|---|
| GC ViT-T | 58 | 47.0 | Twins-SVT-S: 46.2 |
| GC ViT-S | 84 | 48.3 | Twins-SVT-B: 47.7 |
| GC ViT-B | 125 | 49.2 | Twins-SVT-L: 48.8 |
GC ViT consistently outperforms prior models of comparable or greater complexity across these tasks.
6. Significance and Research Context
GC ViT addresses two primary limitations in existing ViT-inspired architectures: the inability to efficiently model long-range dependencies without costly global attention, and the loss of spatial-inductive bias found in CNNs. By alternating between local and global context self-attention—where GC-SA efficiently couples global query tokens with window-local keys and values—GC ViT achieves efficient cross-window reasoning. The introduction of CNN-like Fused-MBConv blocks further closes the inductive bias gap.
A plausible implication is that the GC-SA mechanism may be adaptable to other dense prediction tasks and modalities, provided efficient synthesis of global context and local structure is needed.
7. Implementation Considerations and Reproducibility
The self-contained mathematical descriptions of GC-SA, blockwise alternation, and Fused-MBConv design offered in Hatamizadeh et al. (2022) are sufficient for implementation. The architecture removes the need for computationally expensive global attention, attention masks, or window shifting. Global queries are computed once per stage by the specified CNN and alternate with local attention blocks according to the stage's depth parameter. Training schedules, optimizer settings, and data augmentations are specified to reproduce the reported results on ImageNet-1K, MS COCO, and ADE20K, and the reported benchmarks compare directly and favorably to prior art across these tasks.