GPSA: Integrating Convolutional Bias in Vision Transformers
- Gated Positional Self-Attention (GPSA) is a mechanism that integrates a learnable positional bias with content-based self-attention to balance local and global feature extraction.
- It employs a per-head gating parameter to dynamically interpolate between convolution-like positional focus and data-driven content attention, enhancing sample efficiency.
- Empirical results on ImageNet show GPSA yields higher accuracy and improved interpretability compared to standard transformer self-attention methods.
Gated Positional Self-Attention (GPSA) is a self-attention mechanism introduced as part of the ConViT architecture to fuse the sample efficiency advantages of convolutional inductive biases with the expressivity of Transformer models for vision tasks. GPSA augments the standard content-based self-attention in Vision Transformers (ViT) with a learnable, soft positional bias and a gating mechanism. This design enables GPSA to interpolate between convolution-like locality and data-driven content attention, with a per-head gate that can dynamically adapt during training. The resulting framework achieves improved sample efficiency, higher ImageNet classification accuracy, and provides an interpretable transition from local to global contexts in deep networks (d'Ascoli et al., 2021).
1. Foundation and Motivation
In standard Vision Transformers, each multi-head self-attention (SA) layer computes attention based solely on patch content, remaining invariant to spatial position. That is, for each attention head $h$, the content-based attention matrix is defined as

$$A^h_{ij} = \operatorname{softmax}_j\!\left(\frac{Q^h_i K^{h\top}_j}{\sqrt{D_h}}\right),$$

where $Q^h = X W^h_{qry}$, $K^h = X W^h_{key}$, $D_h$ is the head dimension, and $i, j \in \{1, \dots, N\}$ index the $N$ image patches. While flexible and powerful for large-scale learning, such attention mechanisms lack explicit locality priors, which have been shown to yield superior sample efficiency in convolutional networks.
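As a concrete reference for the formula above, here is a minimal sketch of a single content-attention head in PyTorch (a framework chosen for illustration; names such as `W_qry` simply follow the notation above):

```python
import torch

def content_attention(X, W_qry, W_key):
    """Content-based attention for one head.
    X: (N, D) patch embeddings; W_qry, W_key: (D, D_h) projections."""
    Q = X @ W_qry                           # queries, shape (N, D_h)
    K = X @ W_key                           # keys,    shape (N, D_h)
    scores = Q @ K.T / K.shape[-1] ** 0.5   # scaled dot products, (N, N)
    return torch.softmax(scores, dim=-1)    # row-wise softmax over patches j
```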
GPSA addresses this by introducing a soft convolutional inductive bias: it adds a learnable positional attention score per head and interpolates between positional and content attention via a sigmoid gate $\sigma(\lambda_h)$. The principal aims are (i) to bias early layers toward locality, improving sample efficiency particularly on small datasets, and (ii) to allow the network to “escape” this bias as training progresses, if long-range or semantic dependencies require it.
2. Mathematical Formulation
GPSA defines, for each attention head $h$:
- Content attention: $A^h_{\mathrm{content},ij} = \operatorname{softmax}_j\!\left(Q^h_i K^{h\top}_j / \sqrt{D_h}\right)$
- Positional attention: $A^h_{\mathrm{pos},ij} = \operatorname{softmax}_j\!\left(v^{h\top}_{pos}\, r_{ij}\right)$, where $v^h_{pos}$ is a learned parameter and $r_{ij}$ encodes the relative 2-D offset between patch positions $i$ and $j$.
- Gate: $\sigma(\lambda_h)$, where $\lambda_h$ is a learnable scalar per head and $\sigma$ is the sigmoid function.

The mixed attention matrix is constructed as

$$A^h_{ij} = \left(1 - \sigma(\lambda_h)\right) \operatorname{softmax}_j\!\left(\frac{Q^h_i K^{h\top}_j}{\sqrt{D_h}}\right) + \sigma(\lambda_h)\, \operatorname{softmax}_j\!\left(v^{h\top}_{pos}\, r_{ij}\right),$$

followed by row-wise normalization $\hat{A}^h_{ij} = A^h_{ij} / \sum_k A^h_{ik}$. The GPSA operation for each head then applies this normalized matrix to the values:

$$\mathrm{GPSA}^h(X) = \hat{A}^h\, X W^h_{val}.$$

The relative position encoding $r_{ij}$ is fixed and low-dimensional, typically

$$r_{ij} = \left(\|\delta_{ij}\|^2,\; \delta^1_{ij},\; \delta^2_{ij}\right)^\top \in \mathbb{R}^3,$$

where $\delta_{ij}$ is the position offset vector between patches $i$ and $j$.
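To make the formulation concrete, the following is a minimal single-head GPSA sketch in PyTorch, re-derived from the equations above rather than taken from the ConViT reference code. The explicit row-wise normalization is retained for fidelity to the formulation, even though a convex combination of two row-stochastic matrices already has unit row sums.

```python
import torch

def relative_encodings(grid_size):
    """Fixed r_ij = (||delta_ij||^2, delta^1_ij, delta^2_ij) for an H x W
    patch grid; returns a (N, N, 3) tensor with N = H * W."""
    H, W = grid_size
    coords = torch.stack(torch.meshgrid(torch.arange(H), torch.arange(W),
                                        indexing="ij"), dim=-1)
    coords = coords.reshape(-1, 2).float()            # (N, 2) patch positions
    delta = coords[None, :, :] - coords[:, None, :]   # (N, N, 2) offsets
    return torch.cat([(delta ** 2).sum(-1, keepdim=True), delta], dim=-1)

def gpsa_head(X, W_qry, W_key, W_val, v_pos, lam, r):
    """One GPSA head. X: (N, D); W_*: (D, D_h); v_pos: (3,);
    lam: scalar gating parameter (0-dim tensor); r: (N, N, 3)."""
    Q, K, V = X @ W_qry, X @ W_key, X @ W_val
    A_content = torch.softmax(Q @ K.T / K.shape[-1] ** 0.5, dim=-1)
    A_pos = torch.softmax(r @ v_pos, dim=-1)   # positional attention, (N, N)
    gate = torch.sigmoid(lam)                  # sigma(lambda_h) in (0, 1)
    A = (1 - gate) * A_content + gate * A_pos
    A = A / A.sum(dim=-1, keepdim=True)        # explicit row-wise normalization
    return A @ V                               # (N, D_h) head output
```

Since $r_{ij}$ depends only on the patch grid geometry, it can be precomputed once per input resolution and shared across heads and layers.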
3. Convolutional-Biased Initialization
A key aspect of GPSA is its initialization to mimic locality akin to convolution kernels, facilitating rapid sample-efficient learning.
- Each head’s positional weight is set to focus on a specific offset $\Delta^h$ (e.g., the elements of a $\sqrt{N_h} \times \sqrt{N_h}$ grid of offsets, analogous to the taps of a convolutional kernel), with

$$v^h_{pos} = -\alpha_h \left(1,\; -2\Delta^h_1,\; -2\Delta^h_2\right)^\top.$$

This results in a positional score

$$v^{h\top}_{pos}\, r_{ij} = -\alpha_h \left\|\delta_{ij} - \Delta^h\right\|^2 + \mathrm{const},$$

so that $A^h_{\mathrm{pos},ij} \propto \exp\!\left(-\alpha_h \|\delta_{ij} - \Delta^h\|^2\right)$. Positional attention thus initially mimics convolution, peaking at offset $\Delta^h$, with window size regulated by the “locality strength” parameter $\alpha_h$.
- The gating parameter is initialized at a positive value ($\lambda_h = 1$), yielding $\sigma(\lambda_h) \approx 0.73$. Thus, heads are biased toward positional attention early in training but retain flexibility to attend to content. A minimal sketch of this initialization follows.
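The sketch below assumes the conventions above: head centers $\Delta^h$ drawn from a $\sqrt{N_h} \times \sqrt{N_h}$ grid of offsets, and gates started at $\lambda_h = 1$.

```python
import math
import torch

def init_gpsa_head_params(num_heads, locality_strength=1.0):
    """Convolution-like initialization of (v_pos, lambda) for all heads."""
    k = int(math.sqrt(num_heads))     # kernel side; assumes num_heads is square
    assert k * k == num_heads
    # Offsets Delta^h arranged on a k x k grid, like the taps of a conv kernel.
    centers = [(i - k // 2, j - k // 2) for i in range(k) for j in range(k)]
    v_pos = torch.zeros(num_heads, 3)
    for h, (d1, d2) in enumerate(centers):
        # v_pos^h = -alpha * (1, -2*Delta_1, -2*Delta_2), so that
        # v_pos^h . r_ij = -alpha * ||delta_ij - Delta^h||^2 + const.
        v_pos[h] = -locality_strength * torch.tensor([1.0, -2.0 * d1, -2.0 * d2])
    lam = torch.ones(num_heads)       # sigma(1) ~ 0.73: mostly positional at start
    return v_pos, lam
```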
4. Gate Dynamics and Locality Escape
During training, each $\lambda_h$ is learned by gradient descent without further regularization beyond standard weight decay. Empirically:
- Lower GPSA layers largely maintain high $\sigma(\lambda_h)$, and thus strong locality.
- As depth increases, $\sigma(\lambda_h)$ in higher GPSA layers drifts toward zero, increasing reliance on learned content attention and recovering standard ViT expressivity for complex, long-range interactions.
This mechanism enables GPSA to provide a continuous, data-driven transition from convolutional to transformer-like behavior at the head level, allowing the network to become less local only as dictated by the task.
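Because each gate is a plain scalar, this transition can be monitored directly during training. A small sketch follows; the attribute names `gpsa_layers` and `gating_param` are hypothetical, not the ConViT reference API.

```python
import torch

def gate_values(model):
    """Return sigma(lambda_h) for every head of every GPSA layer; values near 1
    indicate positional (local) attention, values near 0 content attention."""
    return {f"layer_{i}": torch.sigmoid(layer.gating_param).tolist()
            for i, layer in enumerate(model.gpsa_layers)}  # hypothetical names
```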
5. Implementation and Architectural Integration
ConViT implements GPSA by replacing the self-attention layers in the first 10 of its 12 Transformer blocks with GPSA layers; the remaining blocks use standard SA. The architecture otherwise mirrors ViT, including standard feedforward and normalization layers. Model sizes and corresponding head parameters are chosen as:
| Model | $N_h$ (heads) | $D_h$ (head dim) | Locality (conv. kernel analog) |
|---|---|---|---|
| ConViT-Ti | 4 | 48 | $2 \times 2$ |
| ConViT-S | 9 | 48 | $3 \times 3$ |
| ConViT-B | 16 | 48 | $4 \times 4$ |
Relative positional encodings $r_{ij}$ are fixed and computationally inexpensive, and absolute positional embeddings (as in ViT) are included but downweighted. The gating and positional bias add only a few parameters per head and negligible computational cost (d'Ascoli et al., 2021).
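Schematically, the block arrangement reads as below; `GPSABlock` and `SABlock` are placeholder names standing in for full Transformer blocks, not the ConViT reference API.

```python
import torch.nn as nn

class GPSABlock(nn.Module):
    """Placeholder for a Transformer block whose attention is GPSA."""
    def __init__(self, dim):
        super().__init__()
        self.dim = dim

class SABlock(nn.Module):
    """Placeholder for a standard ViT self-attention block."""
    def __init__(self, dim):
        super().__init__()
        self.dim = dim

def make_convit_blocks(dim, depth=12, num_gpsa=10):
    # First `num_gpsa` blocks use GPSA; the remainder are vanilla SA blocks.
    return nn.ModuleList([GPSABlock(dim) if i < num_gpsa else SABlock(dim)
                          for i in range(depth)])
```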
6. Empirical Outcomes and Ablation Results
On ImageNet-1k with DeiT hyperparameters and from-scratch training (300 epochs), ConViT substantially outperforms the comparable DeiT baseline:
- Top-1 accuracy: ConViT-S (27M): 81.3% vs. DeiT-S (22M): 79.8%; ConViT-B (86M): 82.4% vs. DeiT-B (86M): 81.8%.
- Sample-efficiency experiments (ConViT-S vs. DeiT-S): when both models are trained on subsampled fractions of ImageNet (e.g., 5%, 10%, and 30% of the data), ConViT-S outperforms DeiT-S at every fraction, with the relative improvement growing as the data fraction shrinks (d'Ascoli et al., 2021).
Ablation studies demonstrate:
- Removing gating reduces top-1 accuracy on the full dataset, with a larger drop on small data subsets.
- Removing the convolutional initialization likewise degrades accuracy.
- Removing both yields the largest drop, indicating the compounded benefit of the two components.
7. Locality Metrics and Hyperparameter Implications
To quantify locality, a nonlocality metric is defined per layer $\ell$ as the attention-weighted average distance between query and key patches:

$$D^{\ell}_{loc} = \frac{1}{N_h N} \sum_{h} \sum_{i,j} A^{\ell h}_{ij}\, \|\delta_{ij}\|.$$
- In vanilla SA (DeiT), $D_{loc}$ drops rapidly during early epochs (implying heads become more local), then increases in deeper layers as global structure forms.
- In ConViT, the initial GPSA layers are maximally local due to initialization; $D_{loc}$ grows with training but remains below that of DeiT overall. Lower GPSA layers typically remain more local, while higher layers increasingly attend globally.
This analysis suggests that locality is particularly advantageous in shallow layers, and that the hyperparameters $\alpha$ (locality strength) and the number of GPSA layers offer direct trade-offs between sample efficiency and the eventual accuracy ceiling.
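A sketch of this metric for a single attention matrix, assuming the averaging convention given above; the distance matrix can be derived from the first component of the $r_{ij}$ encodings built earlier (it holds $\|\delta_{ij}\|^2$):

```python
import torch

def nonlocality(A, delta_sq):
    """Attention-weighted mean query-key distance for one head.
    A: (N, N) row-stochastic attention; delta_sq: (N, N) with ||delta_ij||^2."""
    return (A * delta_sq.sqrt()).sum(dim=-1).mean().item()
```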
8. Significance and Broader Implications
GPSA provides a soft, interpretable, and learnable convolutional bias embedded directly within the Transformer paradigm. The mechanism enables Vision Transformers to leverage both the sample efficiency of convolutional priors and the high capacity for global interactions afforded by self-attention. This yields networks with improved learning efficiency and competitive or superior accuracy compared to previous architectures, as evidenced by empirical results. The gating design offers insights into the transition between local and global information integration, supporting further research into adaptive architectural priors in deep models (d'Ascoli et al., 2021).