Patchwise Self-Attention
- Patchwise self-attention is a neural mechanism that computes attention over localized image patches, enabling content-adaptive weighting for enhanced feature extraction.
- It generalizes standard convolution by replacing fixed kernels with dynamic, per-channel weight vectors, resulting in improved accuracy and efficiency as shown in benchmark comparisons.
- Its modular design integrates seamlessly into diverse architectures, offering increased robustness against geometric variations and adversarial attacks in visual recognition tasks.
Patchwise self-attention is a class of neural attention mechanisms designed for image recognition and related tasks, in which attention is computed over local spatial patches, allowing the network to model adaptive, content-dependent weighting patterns within regions of the input feature map. As developed in multiple works, notably the patchwise module of the Self-Attention Network (SAN), patchwise attention generalizes local convolution by replacing translation-invariant weighting with content-adaptive, per-channel vector attention. This mechanism substantially increases the expressive power and flexibility of local aggregation in visual models, as evidenced by empirical and theoretical analyses (Zhao et al., 2020; Barkan, 2019).
1. Mathematical Definition and Mechanism
Patchwise self-attention operates on a feature map $x \in \mathbb{R}^{H \times W \times C}$, where $H$, $W$, and $C$ are the spatial height, width, and channel dimensions respectively. For each spatial location $i$, a local patch (footprint) $R(i)$ is selected (e.g., a $7 \times 7$ window), yielding the patch tensor $x_{R(i)}$. The new feature at position $i$ is computed as

$$y_i = \sum_{j \in R(i)} \alpha(x_{R(i)})_j \odot \beta(x_j),$$

where:
- $\beta$ is a linear projection of $x_j$ reducing dimensionality (bottleneck factor $r_2$), and $\odot$ denotes the Hadamard (element-wise) product.
- $\alpha(x_{R(i)})$ comprises attention weight vectors (one per patch location $j$).
- $\alpha$ is parameterized as $\alpha(x_{R(i)}) = \gamma(\delta(x_{R(i)}))$, where:
  - $\delta$ aggregates the feature vectors of the patch into a single relation vector.
  - $\gamma$ unfolds this vector to yield attention weights, one vector per position $j \in R(i)$.
Three forms for $\delta$ are employed:
- Star-product: $\delta(x_{R(i)}) = \left[\varphi(x_i)^{\top}\psi(x_j)\right]_{\forall j \in R(i)}$
- Clique-product: $\delta(x_{R(i)}) = \left[\varphi(x_j)^{\top}\psi(x_k)\right]_{\forall j,k \in R(i)}$
- Concatenation: $\delta(x_{R(i)}) = \left[\varphi(x_i),\, \left[\psi(x_j)\right]_{\forall j \in R(i)}\right]$
Here, $\varphi$ and $\psi$ are trainable linear projections with a bottleneck factor $r_1$.
After attention, a final Linear layer restores the channel dimension, followed by batch normalization, ReLU activation, and a residual skip connection to the block input (Zhao et al., 2020).
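To make the aggregation concrete, here is a minimal PyTorch sketch of the core patchwise operation, assuming a generic value projection `beta` and a generic scoring network `score`; the function and variable names are illustrative and do not reproduce the reference SAN implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def patchwise_attention(x, beta, score, k=7):
    """Core aggregation y_i = sum_{j in R(i)} alpha(x_R(i))_j (Hadamard) beta(x_j).

    x     : (B, C, H, W) input feature map
    beta  : 1x1 conv producing value features with Cv channels
    score : maps each flattened k*k*C patch to k*k*Cv attention weights
    """
    B, C, H, W = x.shape
    pad = k // 2
    v = beta(x)                                       # (B, Cv, H, W)
    Cv = v.shape[1]
    # Gather the k*k footprint around every position i.
    patches = F.unfold(x, k, padding=pad)             # (B, C*k*k, H*W)
    values = F.unfold(v, k, padding=pad)              # (B, Cv*k*k, H*W)
    values = values.reshape(B, Cv, k * k, H * W)
    # Content-adaptive weights: one Cv-dimensional vector per footprint position j.
    alpha = score(patches.transpose(1, 2))            # (B, H*W, k*k*Cv)
    alpha = alpha.transpose(1, 2).reshape(B, Cv, k * k, H * W)
    # Hadamard product with the values, then sum over the footprint.
    y = (alpha * values).sum(dim=2)                   # (B, Cv, H*W)
    return y.reshape(B, Cv, H, W)

# Toy usage: 32 input channels, values bottlenecked to 16, 5x5 footprint.
x = torch.randn(2, 32, 14, 14)
beta = nn.Conv2d(32, 16, kernel_size=1)
score = nn.Linear(32 * 5 * 5, 16 * 5 * 5)  # simplest content-adaptive scoring
print(patchwise_attention(x, beta, score, k=5).shape)  # torch.Size([2, 16, 14, 14])
```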
2. The Expressive Power of Patchwise Self-Attention versus Convolution
Whereas standard convolution aggregates features using fixed, translation-invariant kernels, with weights applied solely as a function of the relative spatial offset, patchwise self-attention determines aggregation weights as an adaptive function of the local content $x_{R(i)}$. The result is:
- Local weighting adapts to structure such as edges, textures, or object parts.
- Attention weights are per-channel vectors (one weight per channel of the aggregated feature), not scalars shared across all channels.
- No translation invariance: the same offset can have different weights in different contexts.
Any given convolution can be exactly reproduced by patchwise attention by using fixed attention weights $\alpha$ that depend only on the relative offset $j - i$ rather than on the patch content. However, the converse is not true; patchwise attention can realize operators that are content-conditional, e.g., modulating or zeroing out specific channel groups based on patch context, which is beyond the scope of standard convolutional operations (Zhao et al., 2020).
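To make this containment explicit, freezing the attention weights so that they depend only on the relative offset (and, purely for illustration, taking $\beta$ as the identity) reduces the patchwise operator to a convolution with a per-channel kernel:

```latex
y_i \;=\; \sum_{j \in R(i)} \alpha\!\left(x_{R(i)}\right)_j \odot \beta(x_j)
\;\;\overset{\alpha(x_{R(i)})_j \,\equiv\, w_{j-i},\ \ \beta \,=\, \mathrm{id}}{\longrightarrow}\;\;
\sum_{j \in R(i)} w_{j-i} \odot x_j ,
```

which is a convolution with content-independent, per-channel kernel weights $\{w_{j-i}\}$.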
3. Network Architecture and Integration Strategies
A standard patchwise SAN block comprises two primary computational streams:
- Attention-weight computation: $1 \times 1$ convolutions (bottlenecked by a reduction factor $r_1$) compute $\varphi(x)$ and $\psi(x)$, which are aggregated by the selected relation function $\delta$ and mapped by $\gamma$ (two linear layers with an intervening ReLU, i.e. Linear→ReLU→Linear) to the attention weights.
- Value projection: a $1 \times 1$ convolution implementing $\beta$, reducing channels by a factor $r_2$.
Fusion is executed via a Hadamard product of the attention weights $\alpha(x_{R(i)})_j$ with the projected values $\beta(x_j)$, summed over each footprint, followed by batch normalization, ReLU, channel expansion, and residual addition.
The patchwise SAN architecture, exemplified by SAN15, comprises five stages, with spatial resolution reduced and channel width expanded from stage to stage. The patch (footprint) size is $3 \times 3$ in the initial stage and $7 \times 7$ afterwards. No explicit multihead splitting is applied; instead, each attention weight vector is shared across groups of 8 channels (Zhao et al., 2020). In the Self Attentive Convolution (SAC) formulation, the approach generalizes to arbitrary kernel sizes with overlapping patches that slide across the image without explicit partitioning, and supports both single- and multi-head extensions (Barkan, 2019).
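Below is a sketch of one patchwise block wired as described above, assuming PyTorch and the concatenation relation; the reduction factors `r1`/`r2`, the module names, and the toy dimensions are illustrative defaults rather than the reference implementation (only the 8-channel sharing group follows the text).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchwiseSAMBlock(nn.Module):
    """Sketch of a patchwise self-attention block using the concatenation relation.

    r1, r2 (bottleneck factors) and k (footprint) are illustrative defaults;
    share=8 follows the 8-channel weight sharing described in the text.
    """
    def __init__(self, channels, k=7, r1=8, r2=4, share=8):
        super().__init__()
        self.k, self.share = k, share
        rel = channels // r1          # channels of phi / psi
        mid = channels // r2          # channels of beta (values)
        assert mid % share == 0
        self.phi = nn.Conv2d(channels, rel, 1)
        self.psi = nn.Conv2d(channels, rel, 1)
        self.beta = nn.Conv2d(channels, mid, 1)
        # gamma: Linear -> ReLU -> Linear, mapping the concatenated relation
        # vector to one weight per footprint position and shared channel group.
        self.gamma = nn.Sequential(
            nn.Linear(rel * (k * k + 1), rel),
            nn.ReLU(inplace=True),
            nn.Linear(rel, k * k * (mid // share)),
        )
        self.post = nn.Sequential(
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, 1),   # channel expansion back to the input width
        )

    def forward(self, x):
        B, C, H, W = x.shape
        k, pad, L = self.k, self.k // 2, H * W
        phi, psi, val = self.phi(x), self.psi(x), self.beta(x)
        rel, mid = phi.shape[1], val.shape[1]
        # delta (concatenation): [phi(x_i), psi(x_j) for all j in R(i)]
        psi_patch = F.unfold(psi, k, padding=pad)                   # (B, rel*k*k, L)
        delta = torch.cat([phi.view(B, rel, L), psi_patch], dim=1)  # (B, rel*(k*k+1), L)
        # gamma: per-position MLP producing footprint x group weights
        alpha = self.gamma(delta.transpose(1, 2))                   # (B, L, k*k*mid/share)
        alpha = alpha.view(B, L, k * k, mid // self.share, 1)
        # values gathered over the footprint, grouped for weight sharing
        v = F.unfold(val, k, padding=pad)                           # (B, mid*k*k, L)
        v = v.view(B, mid // self.share, self.share, k * k, L)
        v = v.permute(0, 4, 3, 1, 2)                                # (B, L, k*k, groups, share)
        # Hadamard product + sum over the footprint
        y = (alpha * v).sum(dim=2)                                  # (B, L, groups, share)
        y = y.reshape(B, L, mid).transpose(1, 2).reshape(B, mid, H, W)
        return x + self.post(y)                                     # residual connection

# Usage on a toy feature map
block = PatchwiseSAMBlock(channels=64, k=7)
print(block(torch.randn(2, 64, 28, 28)).shape)  # torch.Size([2, 64, 28, 28])
```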
4. Implementation Hyper-parameters and Complexity Analysis
The principal hyper-parameters for patchwise SAN include training for 100 epochs with batch size 256 (across 8 GPUs), an SGD optimizer with momentum $0.9$ and weight-decay regularization, a cosine-decayed learning rate (base $0.1$), label smoothing $0.1$, and standard data augmentations (random resized crop, random horizontal flip, channel-wise normalization). By default, bottleneck factors $r_1$ (attention branch) and $r_2$ (value branch) control dimension reduction. Eight channels share the same attention weight vector (Zhao et al., 2020).
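A sketch of this training configuration in PyTorch/torchvision; the backbone stand-in and the weight-decay value are placeholders, while the remaining settings follow the description above.

```python
import torch
from torch import nn, optim
from torchvision import transforms

# Data augmentation: random resized crop, horizontal flip, channel-wise normalization.
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # standard ImageNet statistics
                         std=[0.229, 0.224, 0.225]),
])

model = nn.Conv2d(3, 1000, kernel_size=1)   # stand-in for a patchwise SAN backbone
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)     # label smoothing 0.1
optimizer = optim.SGD(model.parameters(), lr=0.1,        # base learning rate 0.1
                      momentum=0.9, weight_decay=1e-4)   # weight decay: placeholder value
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)  # cosine decay, 100 epochs
# Batch size 256 would be split across 8 GPUs, e.g. via DistributedDataParallel.
```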
Computational complexity for patchwise operators is dominated by the convolutional projections and attention-score computation. For a feature map of spatial size $H \times W$ with $C$ channels and a $k \times k$ footprint (a worked example follows this list):
- $O(HWC^2)$ for the $1 \times 1$ projections $\varphi$, $\psi$, and $\beta$ (scaled down by the bottleneck factors),
- $O(HWk^2C)$ for attention-weight computation and the weighted sum over each footprint,
- Memory on the order of $O(HWk^2C)$ for the unfolded patches and attention maps. Stride or dilation can reduce the number of attended patch positions (Barkan, 2019).
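For intuition, a back-of-the-envelope count of the dominant terms for a single block at assumed example dimensions (28×28 feature map, 256 channels, 7×7 footprint, reduction factor 4); the numbers are illustrative only.

```python
# Rough multiply-add counts for one patchwise attention block.
H, W, C, k, r = 28, 28, 256, 7, 4   # example dimensions (illustrative)

proj_ops = H * W * C * (C // r) * 3        # 1x1 projections phi, psi, beta: O(H*W*C^2)
attn_ops = H * W * k * k * (C // r) * 2    # weighting + footprint sum: O(H*W*k^2*C)
patch_mem = H * W * k * k * (C // r)       # unfolded patches kept in memory: O(H*W*k^2*C)
# (the small gamma MLP is omitted from this rough count)

print(f"projection MACs   ~ {proj_ops / 1e6:.1f} M")        # ~38.5 M
print(f"attention MACs    ~ {attn_ops / 1e6:.1f} M")        # ~4.9 M
print(f"patch activations ~ {patch_mem / 1e6:.2f} M values")  # ~2.46 M
```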
5. Benchmark Results and Empirical Insights
Direct comparison of patchwise SANs with convolutional ResNets on ImageNet single-crop validation indicates:
| Method | Top-1 (%) | Top-5 (%) | Params (M) | FLOPs (G) |
|---|---|---|---|---|
| ResNet26 | 73.6 | 91.7 | 13.7 | 2.4 |
| SAN10 (patchwise) | 77.1 | 93.5 | 11.8 | 1.9 |
| ResNet38 | 76.0 | 93.0 | 19.6 | 3.2 |
| SAN15 (patchwise) | 78.0 | 93.9 | 16.2 | 2.6 |
| ResNet50 | 76.9 | 93.5 | 25.6 | 4.1 |
| SAN19 (patchwise) | 78.2 | 93.9 | 20.5 | 3.3 |
Across these pairings, the patchwise SAN models achieve 1.3–3.5 points higher top-1 accuracy while using roughly 15–20% fewer parameters and FLOPs than the comparable convolutional baselines.
Patchwise SANs also exhibit improved robustness: under 180° input rotation, SAN15 accuracy drops from 78.0% to 56.0% (−22.0 pp) versus ResNet38's drop from 76.0% to 52.2% (−23.8 pp); under a white-box PGD adversarial attack (4 steps), ResNet50 top-1 falls to 11.8% (attack success rate 82.5%) while SAN19 falls to 24.8% (success rate 62.0%). These results suggest that patchwise self-attention confers additional robustness to geometric transformations and white-box adversarial attacks (Zhao et al., 2020).
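For reference, a generic sketch of an untargeted $L_\infty$ PGD evaluation in PyTorch; the 4-step count follows the text, while the perturbation budget and step size are placeholders since the exact attack settings are not restated here.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=4 / 255, step=1 / 255, n_steps=4):
    """Generic untargeted L-inf PGD; eps and step are placeholder values, 4 steps as in the text."""
    x_adv = x.clone().detach()
    for _ in range(n_steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + step * grad.sign()            # ascend the loss
            x_adv = x + (x_adv - x).clamp(-eps, eps)      # project onto the L-inf ball
            x_adv = x_adv.clamp(0, 1)                     # keep a valid image range
    return x_adv.detach()

# Example with a stand-in model on random data
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
x, y = torch.rand(4, 3, 32, 32), torch.randint(0, 10, (4,))
x_adv = pgd_attack(model, x, y)
print((x_adv - x).abs().max() <= 4 / 255 + 1e-6)  # perturbation stays within the budget
```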
6. Ablations, Parameterizations, and Limitations
Multiple ablations reveal:
- Among the relation functions for $\delta$, concatenation yields the highest validation accuracy (79.3% top-1) compared with the star-product (78.7%) and clique-product (79.1%); the three forms are sketched after this list.
- Two layers for $\gamma$ (Linear→ReLU→Linear) provide the optimal depth.
- Distinct $\varphi$, $\psi$, and $\beta$ transformations perform better than tied (shared) versions.
- Increasing the patch (footprint) size improves performance up to a point and then saturates, with limited additional FLOPs.
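A compact sketch of the three relation functions for a single footprint, assuming PyTorch tensors standing in for $\varphi(x_i)$, $\varphi(x_j)$, and $\psi(x_j)$; shapes and names are illustrative.

```python
import torch

def relations(phi_i, phi_patch, psi_patch):
    """The three relation functions delta for one footprint R(i) at a single position i.

    phi_i:     (C_rel,)       phi(x_i) at the center position
    phi_patch: (k*k, C_rel)   phi(x_j) for all j in R(i)
    psi_patch: (k*k, C_rel)   psi(x_j) for all j in R(i)
    """
    star = phi_i @ psi_patch.T                            # phi(x_i)^T psi(x_j), one scalar per j
    clique = phi_patch @ psi_patch.T                      # phi(x_j)^T psi(x_k), all pairs (j, k)
    concat = torch.cat([phi_i, psi_patch.reshape(-1)])    # [phi(x_i), psi(x_j) for all j]
    return star, clique, concat                           # each is then fed to gamma

k, c_rel = 7, 8
star, clique, concat = relations(torch.randn(c_rel),
                                 torch.randn(k * k, c_rel),
                                 torch.randn(k * k, c_rel))
print(star.shape, clique.shape, concat.shape)  # torch.Size([49]) torch.Size([49, 49]) torch.Size([400])
```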
A limitation is the additional implementation complexity and memory overhead incurred by the dense local attention maps computed at every position, though this overhead is modest compared to global pixelwise attention. The module does not employ explicit multihead splitting, relying instead on group sharing of attention weights (Zhao et al., 2020).
7. Connections and Extensions in Related Literature
Patchwise attention subsumes classical convolution as a strict generalization, fully capturing stationary kernels as a limiting case while supporting content-conditioned adaptation. Self Attentive Convolutions (SAC) extend the paradigm further, showing that standard global self-attention corresponds to the $1 \times 1$ (pixelwise) case and that patchwise attention generalizes it to larger localities. Multiscale SAC (MSAC) computes parallel patchwise attentions over varying scales, laterally concatenating their outputs and fusing them via convolutions (a minimal sketch of this wiring follows). This approach enables simultaneous modeling of local and non-local dependencies without explicit patch partitioning and can be integrated within ResNet- or DenseNet-style backbones. Preliminary experiments, though unpublished in detail, indicate that replacing convolutional layers with SAC/MSAC consistently improves classification and segmentation at comparable parameter budgets (Barkan, 2019).
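A minimal sketch of the MSAC-style concatenate-and-fuse wiring in PyTorch; the branch modules below are stand-in convolutions purely to keep the example self-contained, whereas in MSAC each branch would be a patchwise attention operator at a different footprint, and all names are illustrative.

```python
import torch
import torch.nn as nn

class MultiscaleFusion(nn.Module):
    """Run parallel branches (one per footprint scale), concatenate laterally, fuse with a 1x1 conv."""
    def __init__(self, branches, channels_per_branch, out_channels):
        super().__init__()
        self.branches = nn.ModuleList(branches)
        self.fuse = nn.Conv2d(channels_per_branch * len(branches), out_channels, 1)

    def forward(self, x):
        outs = [b(x) for b in self.branches]          # parallel per-scale attention branches
        return self.fuse(torch.cat(outs, dim=1))      # lateral concatenation + fusion

# Stand-in branches; in an MSAC layer each would be a patchwise attention module
# (e.g., the block sketched in Section 3) with footprints such as 3x3, 5x5, 7x7.
branches = [nn.Conv2d(64, 64, 3, padding=1) for _ in range(3)]
msac = MultiscaleFusion(branches, channels_per_branch=64, out_channels=64)
print(msac(torch.randn(1, 64, 28, 28)).shape)  # torch.Size([1, 64, 28, 28])
```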