
GPSA: Integrating Convolutional Bias in Vision Transformers

Updated 1 April 2026
  • Gated Positional Self-Attention (GPSA) is a mechanism that integrates a learnable positional bias with content-based self-attention to balance local and global feature extraction.
  • It employs a per-head gating parameter to dynamically interpolate between convolution-like positional focus and data-driven content attention, enhancing sample efficiency.
  • Empirical results on ImageNet show GPSA yields higher accuracy and improved interpretability compared to standard transformer self-attention methods.

Gated Positional Self-Attention (GPSA) is a self-attention mechanism introduced as part of the ConViT architecture to fuse the sample efficiency advantages of convolutional inductive biases with the expressivity of Transformer models for vision tasks. GPSA augments the standard content-based self-attention in Vision Transformers (ViT) with a learnable, soft positional bias and a gating mechanism. This design enables GPSA to interpolate between convolution-like locality and data-driven content attention, with a per-head gate that can dynamically adapt during training. The resulting framework achieves improved sample efficiency, higher ImageNet classification accuracy, and provides an interpretable transition from local to global contexts in deep networks (d'Ascoli et al., 2021).

1. Foundation and Motivation

In standard Vision Transformers, each multi-head self-attention (SA) layer computes attention based solely on patch content, remaining invariant to spatial position. That is, for each attention head $h$, the content-based attention matrix is defined as

$$A^h_{\text{content},ij} = \operatorname{softmax}\left(Q_i^h (K_j^h)^\top\right)$$

where $Q^h, K^h \in \mathbb{R}^{L \times D_h}$ and $L$ is the number of image patches. While flexible and powerful for large-scale learning, such attention mechanisms lack explicit locality priors, which have been shown to yield superior sample efficiency in convolutional networks.

GPSA addresses this by introducing a soft convolutional inductive bias: it adds a learnable positional attention score per head and interpolates between positional and content attention via a sigmoid-gated parameter $g_h$. The principal aims are (i) to bias early layers toward locality, improving sample efficiency particularly on small datasets, and (ii) to allow the network to “escape” this bias as training progresses, if long-range or semantic dependencies require it.

2. Mathematical Formulation

GPSA defines, for each attention head $h$:

  • Content attention:

$$A^h_{\text{content},ij} = \operatorname{softmax}\left(Q_i^h (K_j^h)^\top\right)$$

  • Positional attention:

$$A^h_{\text{pos},ij} = \operatorname{softmax}\left((v^h_{\text{pos}})^\top r_{ij}\right)$$

where $v^h_{\text{pos}} \in \mathbb{R}^{D_{\text{pos}}}$ is a learned parameter, and $r_{ij}$ encodes the relative 2-D offset between patch positions $i$ and $j$.

  • Gate:

$$g_h = \sigma(\lambda_h)$$

where $\lambda_h$ is a learnable scalar per head and $\sigma$ is the sigmoid function.

The mixed attention matrix is constructed as

$$A^h_{ij} = (1 - g_h)\, A^h_{\text{content},ij} + g_h\, A^h_{\text{pos},ij}$$

followed by row-wise normalization

$$\hat{A}^h_{ij} = A^h_{ij} \Big/ \sum_{k} A^h_{ik}$$

The GPSA operation for each head then applies this normalized matrix to the values:

$$\mathrm{GPSA}_h(X) = \hat{A}^h \, X W^h_{\text{val}}$$

The relative position encoding $r_{ij}$ is fixed and low-dimensional, typically

$$r_{ij} = \left(\lVert \delta_{ij} \rVert^2,\; \delta^x_{ij},\; \delta^y_{ij}\right)^\top \in \mathbb{R}^3$$

where $\delta_{ij}$ is the 2-D position offset vector between patches $i$ and $j$.
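The per-head computation above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the reference ConViT implementation: function and variable names are ours, and the $1/\sqrt{D_h}$ scaling and multi-head concatenation are omitted, mirroring the simplified equations in this article.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gpsa_head(X, W_qry, W_key, W_val, v_pos, r, lam):
    """One GPSA head (illustrative sketch).

    X: (L, D) patch embeddings; W_qry, W_key: (D, D_h); W_val: (D, D);
    v_pos: (3,) learned positional projection; r: (L, L, 3) fixed relative
    encodings; lam: scalar gating parameter lambda_h.
    """
    Q, K = X @ W_qry, X @ W_key
    A_content = softmax(Q @ K.T)            # content-based attention, (L, L)
    A_pos = softmax(r @ v_pos)              # positional attention, (L, L)
    g = 1.0 / (1.0 + np.exp(-lam))          # gate g_h = sigmoid(lambda_h)
    A = (1.0 - g) * A_content + g * A_pos   # convex mix of the two maps
    A = A / A.sum(axis=1, keepdims=True)    # row-wise normalization
    return A @ (X @ W_val), A
```

Each row of the mixed matrix sums to one. Driving `lam` strongly negative recovers pure content attention; driving it strongly positive recovers the convolution-like positional map.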

3. Convolutional-Biased Initialization

A key aspect of GPSA is its initialization to mimic locality akin to convolution kernels, facilitating rapid sample-efficient learning.

  • Each head’s positional weight is set to focus on a specific offset $\Delta_h$ (e.g., the elements of a $\sqrt{N_h} \times \sqrt{N_h}$ grid of offsets), with

$$v^h_{\text{pos}} = -\alpha_h \left(1,\; -2\Delta^1_h,\; -2\Delta^2_h\right)^\top$$

This results in a positional score

$$(v^h_{\text{pos}})^\top r_{ij} = -\alpha_h \left( \lVert \delta_{ij} - \Delta_h \rVert^2 - \lVert \Delta_h \rVert^2 \right)$$

Positional attention thus initially mimics convolution, peaking at offset $\delta_{ij} = \Delta_h$, with window size regulated by the “locality strength” parameter $\alpha_h$.

  • The gating parameter is initialized as $\lambda_h = 1$, yielding $g_h = \sigma(1) \approx 0.73$. Thus, heads are biased toward positional attention early in training but retain the flexibility to attend to content.
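A minimal sketch of this convolutional initialization, assuming the quadratic encoding $r_{ij} = (\lVert \delta_{ij} \rVert^2, \delta^x_{ij}, \delta^y_{ij})$ and $v^h_{\text{pos}} = -\alpha_h(1, -2\Delta^1_h, -2\Delta^2_h)$ as above (helper names are hypothetical, not from the ConViT codebase):

```python
import numpy as np

def relative_encodings(side):
    """Fixed r_ij for a side x side patch grid: shape (L, L, 3)."""
    coords = np.array([(x, y) for x in range(side) for y in range(side)])
    delta = coords[None, :, :] - coords[:, None, :]   # delta_ij = pos_j - pos_i
    sq = (delta ** 2).sum(-1, keepdims=True)          # ||delta_ij||^2
    return np.concatenate([sq, delta], axis=-1).astype(float)

def conv_init_v_pos(center, alpha):
    """v_pos focusing a head on offset `center` with locality strength alpha."""
    d1, d2 = center
    return -alpha * np.array([1.0, -2.0 * d1, -2.0 * d2])

side = 5
r = relative_encodings(side)
v = conv_init_v_pos((1, 0), alpha=5.0)   # head focused one patch along x
scores = r @ v                           # (L, L) positional scores
A_pos = np.exp(scores - scores.max(-1, keepdims=True))
A_pos /= A_pos.sum(-1, keepdims=True)    # softmax row by row
```

For the query patch at grid position (2, 2), the resulting attention row peaks at the patch at (3, 2), i.e., exactly the offset $\Delta_h = (1, 0)$ the head was initialized to attend to; larger `alpha` sharpens the peak.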

4. Gate Dynamics and Locality Escape

During training, each $\lambda_h$ is learned by gradient descent, with no regularization beyond standard weight decay. Empirically:

  • Lower GPSA layers largely maintain a high gate value $g_h$, and thus strong locality.
  • As depth increases, $g_h$ in higher GPSA layers drifts toward zero, increasing reliance on learned content attention and recovering standard ViT expressivity for complex, long-range interactions.

This mechanism enables GPSA to provide a continuous, data-driven transition from convolutional to transformer-like behavior at the head level, allowing the network to become less local only as dictated by the task.

5. Implementation and Architectural Integration

ConViT implements GPSA by replacing the first 10 of the 12 self-attention layers of a ViT backbone with GPSA layers; the final 2 layers retain standard content-based SA. The architecture otherwise mirrors ViT, including standard feedforward and normalization layers. Model sizes and corresponding head parameters are chosen as:

Model      N_h (heads)   D_h   Locality (conv. kernel analog)
ConViT-Ti  4             48    2×2
ConViT-S   9             64    3×3
ConViT-B   16            64    4×4

Relative positional encodings $r_{ij}$ are fixed and computationally inexpensive, and absolute positional embeddings (as in ViT) are included but downweighted. The gating and positional bias improve parameter efficiency without adding significant computational cost (d'Ascoli et al., 2021).
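Since the head counts in the table are perfect squares, each model’s heads can be assigned distinct offsets $\Delta_h$ on a $\sqrt{N_h} \times \sqrt{N_h}$ grid at initialization, the convolutional-kernel analog listed above. A hedged sketch of one such assignment (the exact centering convention in the reference code may differ):

```python
import math

def head_centers(num_heads):
    """Assign each head a distinct 2-D offset on a sqrt(N_h) x sqrt(N_h) grid.

    E.g., 9 heads cover the offsets {-1, 0, 1}^2, mimicking a 3x3 kernel.
    """
    side = math.isqrt(num_heads)
    assert side * side == num_heads, "head count must be a perfect square"
    half = side // 2
    return [(i - half, j - half) for i in range(side) for j in range(side)]
```

Each center would then seed one head’s $v^h_{\text{pos}}$ via the convolutional initialization of Section 3.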

6. Empirical Outcomes and Ablation Results

On ImageNet-1k with DeiT hyperparameters and from-scratch training (300 epochs), ConViT substantially outperforms the comparable DeiT baseline:

  • Top-1 accuracy: ConViT-S: 81.3% vs. DeiT-S: 79.8%; ConViT-B: 82.4% vs. DeiT-B: 81.8%.
  • Sample-efficiency experiments (ConViT-S vs. DeiT-S, trained on subsampled ImageNet): the advantage over DeiT-S widens monotonically as the training fraction shrinks, from a gap of roughly 1.5 points on the full dataset to relative improvements of tens of percent when only 5–10% of the training images are used.

Ablation studies demonstrate:

  • Removing the gating mechanism degrades top-1 accuracy, with a larger drop when training on a small subset of ImageNet.
  • Removing the convolutional initialization likewise hurts accuracy on both full and subsampled data.
  • Removing both yields the largest drop, indicating the compounded benefit of the two components.

7. Locality Metrics and Hyperparameter Implications

To quantify locality, a nonlocality metric is defined per layer $\ell$ as the attention-weighted mean distance between query and key patches, averaged over heads: $D^{\ell}_{\text{loc}} = \frac{1}{N_h L} \sum_{h,i,j} A^{h\ell}_{ij}\, \lVert \delta_{ij} \rVert$.
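Assuming the metric is the attention-weighted mean query–key distance averaged over heads, as written above, it can be computed directly from an attention tensor (names are ours):

```python
import numpy as np

def nonlocality(A, coords):
    """Attention-weighted mean query-key distance, averaged over heads.

    A: (N_h, L, L) row-stochastic attention tensor for one layer.
    coords: (L, 2) 2-D grid positions of the patches.
    A perfectly local layer (identity attention) scores 0; attention spread
    over distant patches scores high.
    """
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    return (A * dist[None]).sum(axis=(1, 2)).mean() / A.shape[1]
```

Tracking this quantity per layer over training reproduces the qualitative curves described below: it starts near zero for convolutionally initialized GPSA layers and grows as the gates open.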

  • In vanilla SA (DeiT), nonlocality drops rapidly during early epochs (heads become more local), then increases in deeper layers as global structure forms.
  • In ConViT, the initial GPSA layers are maximally local at initialization; nonlocality grows with training but remains below that of DeiT overall. Lower GPSA layers typically remain more local, while higher layers increasingly attend globally.

This analysis suggests that locality is particularly advantageous in shallow layers, and that the hyperparameters $\alpha_h$ (locality strength) and the number of GPSA layers offer direct trade-offs between sample efficiency and the eventual accuracy ceiling.

8. Significance and Broader Implications

GPSA provides a soft, interpretable, and learnable convolutional bias embedded directly within the Transformer paradigm. The mechanism enables Vision Transformers to leverage both the sample efficiency of convolutional priors and the high capacity for global interactions afforded by self-attention. This yields networks with improved learning efficiency and competitive or superior accuracy compared to previous architectures, as evidenced by empirical results. The gating design offers insights into the transition between local and global information integration, supporting further research into adaptive architectural priors in deep models (d'Ascoli et al., 2021).

References

d'Ascoli, S., Touvron, H., Leavitt, M. L., Morcos, A. S., Biroli, G., & Sagun, L. (2021). ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases. Proceedings of the 38th International Conference on Machine Learning (ICML).