Multi-Head Locality-Self-Attention
- Multi-Head Locality-Self-Attention is a self-attention mechanism that restricts computations to local neighborhoods using multi-head projections to model local dependencies.
- It employs techniques such as local window partitioning and relative positional encoding to efficiently balance spatial locality with the expressive power of standard transformers.
- Innovations like Hadamard attention and Ghost-Head modules enhance model performance with minimal computation increase, as demonstrated by improved empirical benchmarks.
Multi-head Locality-Self-Attention (LSA) is a family of self-attention mechanisms that incorporate both spatial locality constraints and multi-head representations, primarily designed to improve the modeling of local dependencies in neural network architectures such as Transformers. LSA modifications exploit local windowing, structured partitioning of the attention field, and explicit head-wise or spatial mechanisms to address the limitations of global self-attention, especially for vision and sequence modeling tasks where local patterns are highly informative.
1. Foundations of Multi-Head Locality-Self-Attention
Standard multi-head self-attention (MHSA), as originally described in the Transformer architecture, allows each input position to attend to all others, combining information via multiple representation subspaces (“heads”). Given an input tensor $X \in \mathbb{R}^{N \times C}$ of $N$ tokens with embedding dimension $C$, queries, keys, and values are projected for each head, and attention is computed globally.
LSA restricts the attention operation to a local neighborhood, typically by confining each query’s attention to a window of spatially or sequentially adjacent positions. This can be formalized either as masking out non-local key-value pairs or as directly computing windowed attention. Multi-head decomposition is inherited from MHSA: for each head $h$, the projections and local attention are computed in a subspace of the embedding dimension.
A representative instantiation is as follows (Zhou et al., 2021):
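- Local window partitioning: The input feature map $X \in \mathbb{R}^{H \times W \times C}$ is partitioned into non-overlapping windows of size $M \times M$. For each head $h$,
$$Q_h = X W_h^Q, \quad K_h = X W_h^K, \quad V_h = X W_h^V,$$
where $W_h^Q, W_h^K, W_h^V \in \mathbb{R}^{C \times d}$, $d = C / N_h$, and $N_h$ is the number of heads.
- Attention computation within window: For token $i$ with query $q_i^h$ (row $i$ of $Q_h$), and keys $k_j^h$ and values $v_j^h$ in its window $\Omega_i$,
$$\alpha_{ij}^h = \frac{\exp\left(q_i^h \cdot k_j^h / \sqrt{d}\right)}{\sum_{l \in \Omega_i} \exp\left(q_i^h \cdot k_l^h / \sqrt{d}\right)}, \quad j \in \Omega_i,$$
$$y_i^h = \sum_{j \in \Omega_i} \alpha_{ij}^h \, v_j^h.$$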
The outputs of all heads are concatenated and an output projection is performed.
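The partition, per-head attention, and output projection described above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the implementation from any cited paper: the function names, the use of fused `(C, C)` projection matrices that are then split into heads, and the omission of biases and relative position terms are all choices made here for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def window_partition(x, m):
    """Split an (H, W, C) map into (num_windows, m*m, C) non-overlapping windows."""
    H, W, C = x.shape
    x = x.reshape(H // m, m, W // m, m, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, m * m, C)

def window_merge(windows, m, H, W):
    """Inverse of window_partition: (num_windows, m*m, C) -> (H, W, C)."""
    C = windows.shape[-1]
    x = windows.reshape(H // m, W // m, m, m, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(H, W, C)

def local_window_attention(x, Wq, Wk, Wv, Wo, num_heads, m):
    """Multi-head attention restricted to m x m windows (assumes m divides H and W)."""
    H, W, C = x.shape
    d = C // num_heads                          # per-head dimension
    win = window_partition(x, m)                # (nW, m*m, C)

    def split_heads(t):                         # (nW, m*m, C) -> (nW, heads, m*m, d)
        return t.reshape(t.shape[0], t.shape[1], num_heads, d).transpose(0, 2, 1, 3)

    q, k, v = (split_heads(win @ P) for P in (Wq, Wk, Wv))
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(d)   # (nW, heads, m*m, m*m)
    out = softmax(scores, axis=-1) @ v                   # (nW, heads, m*m, d)
    out = out.transpose(0, 2, 1, 3).reshape(-1, m * m, C)  # concatenate heads
    return window_merge(out @ Wo, m, H, W)               # output projection
```

Because each window is processed independently, changing a token only affects outputs inside its own window, which is the locality property the section describes.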
2. Locality Mechanisms and Variants
Locality can be introduced along spatial (token) or channel axes and can be implemented through various forms of partitioning and masking.
- Spatial locality: Windows are organized around each query position. In “Convolutional Self-Attention Networks” (Csans), this is formalized as masking attention to a window of $M+1$ positions centered on the query (Yang et al., 2019). The local scope is parameter-free: no new weights are introduced.
- Inter-head locality: Csans introduces 2D locality by allowing heads to interact across both spatial and head dimensions. For each head $h$ and position $i$, a patch of keys/values over heads $h - N/2, \dots, h + N/2$ and positions $i - M/2, \dots, i + M/2$ is extracted; dynamic attention is then computed over this $(N+1) \times (M+1)$ patch.
- Channel-wise locality: In Local Multi-Head Channel Self-Attention (LHC), attention is applied along the channel dimension. The input is divided into $n$ contiguous spatial blocks (“heads”); each head attends only to the channel-channel relations in its spatial block. Queries, keys, and values are constructed via pooling and shared projections for each block (Pecoraro et al., 2021).
| Variant/Formulation | Locality Axis | Partitioning | Notable Properties |
|---|---|---|---|
| Csans (1D, 2D) (Yang et al., 2019) | Spatial, Head | Window, Head patches | Parameter-free, efficient, 2D convolution over heads/tokens |
| LHC (Pecoraro et al., 2021) | Channel | Spatial blocks (n heads) | Channel-wise, block-local, efficient in FLOPs/params |
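
The parameter-free spatial masking described for the 1D case can be illustrated with a banded mask over an ordinary attention score matrix. The sketch below is a single-head NumPy example written for this article, not code from Yang et al.; the helper names and the convention that `m` even gives a window of `m + 1` positions (`m // 2` on each side of the query) are assumptions.

```python
import numpy as np

def band_mask(n, m):
    """Boolean (n, n) mask that is True where |i - j| <= m // 2."""
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= m // 2

def masked_local_attention(q, k, v, m):
    """Single-head attention in which query i only sees keys i - m//2 ... i + m//2.

    No weights are introduced: locality comes entirely from the fixed mask.
    Returns the attended values and the attention weights.
    """
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    scores = np.where(band_mask(n, m), scores, -np.inf)   # drop non-local pairs
    scores -= scores.max(axis=-1, keepdims=True)           # diagonal is always kept
    w = np.exp(scores)                                      # exp(-inf) == 0
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v, w
```

Positions near the sequence boundary simply see a truncated window. Masking a dense score matrix like this is easy to read but still computes all pairwise scores; an efficient implementation would gather only the keys inside each window.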
3. Relative Position and Local Bias
Relative positional encoding is integral to many LSA formulations since spatial relationships are not globally available within local windows.
- In local attention frameworks (e.g., Swin Transformer, Csans, ELSA), attention score computation incorporates a learnable relative bias term, $B$, typically indexed by the displacement $(\Delta x, \Delta y)$ between query and key within the window. Relative biases are stored in a table and added to the attention scores so that the spatially constrained attention is sensitive to local structure (Zhou et al., 2021).
- In dynamic channel local attention (LHC), a learned gating or scaling function modulates the attention score per channel following computation of the channel–channel dot-product (Pecoraro et al., 2021).
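One common way to realize a displacement-indexed bias table is the indexing scheme popularized by Swin-style window attention, sketched below in NumPy. The function names, the table layout of `(2m - 1)**2` entries per head, and the row-major flattening of window positions are assumptions made for this illustration.

```python
import numpy as np

def relative_position_index(m):
    """(m*m, m*m) integer index into a table of (2m - 1)**2 displacement entries."""
    coords = np.stack(np.meshgrid(np.arange(m), np.arange(m), indexing="ij"))  # (2, m, m)
    flat = coords.reshape(2, -1)                   # (2, m*m): row and column of each token
    rel = flat[:, :, None] - flat[:, None, :]      # (2, m*m, m*m): query minus key position
    rel += m - 1                                   # shift each axis from [-(m-1), m-1] to [0, 2m-2]
    return rel[0] * (2 * m - 1) + rel[1]           # flatten the 2D displacement to one index

def add_relative_bias(scores, table, m):
    """Add per-head biases to window attention scores.

    scores: (heads, m*m, m*m) pre-softmax scores for one window.
    table:  (heads, (2m - 1)**2) learnable bias values, one per displacement.
    """
    idx = relative_position_index(m)
    return scores + table[:, idx]                  # fancy indexing -> (heads, m*m, m*m)
```

All query-key pairs with the same displacement share one table entry, so the number of bias parameters depends on the window size and head count rather than on the image size.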
4. Advancements: Hadamard Attention and Ghost-Head Module
ELSA (Enhanced Local Self-Attention) proposes two key augmentations to standard LSA (Zhou et al., 2021):
- Hadamard (elementwise) attention: The expensive dot-product attention $QK^\top$ is replaced with efficient elementwise Hadamard products, introducing third-order mappings while remaining suitable for sliding-neighborhood operation. The per-location Hadamard attention is computed from the elementwise product of queries and keys using learnable convolutional kernels and a bias term.
- Ghost-Head module: Instead of increasing the explicit head count to boost channel capacity, “ghost heads” expand output channels by linearly combining each learned attention map with two static matrices, followed by a scaling and summing operation whose coefficients are learned parameters. This process enhances output diversity without substantial computational cost.
Both innovations retain the favorable complexity of local attention, which is linear in the number of tokens for a fixed window size, add few parameters, and outperform vanilla LSA empirically.
5. Computational Complexity and Efficiency
Different LSA variants offer favorable computational and parametric tradeoffs relative to global MHSA:
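- Csans (1D or 2D): Introduces no parameters beyond the baseline Transformer (the window mask is not learned) and requires $O(N \cdot M \cdot d)$ FLOPs for the attention products when the local window size $M \ll N$, where $N$ is the sequence length and $d$ the hidden size (Yang et al., 2019).
- Local windowed LSA: $O(HW \cdot M^2 \cdot C)$ FLOPs for the attention products over an $H \times W$ feature map with $M \times M$ windows, plus $O(C^2)$ projection parameters per layer. The additional cost from mechanisms such as relative bias or ghost heads is marginal (Zhou et al., 2021).
- LHC: Scales as $O(HW)$ in spatial size and $O(C^2)$ in channel dimension, so the cost is linear in input size and avoids the $O((HW)^2)$ scaling of global self-attention (Pecoraro et al., 2021).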
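The gap between quadratic and linear scaling is easy to see with a back-of-the-envelope count. The sketch below counts only the multiply-accumulates of the two attention products (scores and weighted sum), ignoring projections and softmax; the function names and the example sizes (a 56 x 56 map, 96 channels, 7 x 7 windows) are illustrative assumptions rather than figures from the cited papers.

```python
def global_attention_flops(n_tokens, channels):
    """Multiply-accumulates for QK^T and attention-times-V over all token pairs."""
    return 2 * n_tokens**2 * channels

def window_attention_flops(n_tokens, channels, window_tokens):
    """Same two products, but each token only attends to `window_tokens` keys."""
    return 2 * n_tokens * window_tokens * channels

n = 56 * 56                                   # tokens in a 56 x 56 feature map
ratio = global_attention_flops(n, 96) / window_attention_flops(n, 96, 7 * 7)
print(ratio)                                  # 64.0
```

The ratio equals the number of tokens divided by the window size (3136 / 49), so the savings grow with resolution while the window cost per token stays fixed.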
6. Empirical Performance and Ablation Studies
- Csans improvement: On WMT’14 En→De, restricting attention to a 1D local window in the lowest 3 layers of Transformer-Base improved BLEU by +0.55; using a 2D patch over heads × tokens yielded +0.87 BLEU. These gains add neither new parameters nor significant computational burden (Yang et al., 2019).
- ELSA: Replacing Swin-T’s local attention with ELSA (Hadamard + Ghost-Head) increases ImageNet-1K top-1 accuracy from 81.3% to 82.7% (+1.4%), and on COCO detection yields +1.9 box AP over the LSA baseline. The method is implemented as a drop-in replacement, incurring only 0.2G extra FLOPs and ~0.8M additional parameters over 28M (Zhou et al., 2021).
- LHC: In facial expression recognition (FER2013), integration of LHC modules in ResNet34v2 improves baseline accuracy from 72.81% to 73.39% (no TTA), and up to 74.42% with test-time augmentation. Ablations indicate benefit with increasing number of heads, and show reduced inter-head redundancy compared to conventional transformer blocks (Pecoraro et al., 2021).
7. Comparative Analysis and Integration in Architectures
LSA can be contrasted with alternative local modeling strategies:
- Convolutions: Static, translation-invariant, parameter-efficient, but limited contextual adaptation.
- Global self-attention: Rich context, but quadratic cost in sequence/image size.
- Dynamic or sliding-filter approaches: More expressive local modeling but may introduce significant memory/computation overhead.
Integration of LSA in modern architectures is typically performed at earlier or intermediate layers, leveraging local inductive bias for low-level feature extraction while allowing upper layers to apply global context aggregation. In both convolutional self-attention (Yang et al., 2019) and channel-local attention (Pecoraro et al., 2021), such designs are shown to be directly compatible with standard backbones (e.g., ResNet, hierarchical transformers) and incur minor parameter or computation increases while providing measurable accuracy improvements.
A plausible implication is that multi-head locality-aware attention mechanisms provide an effective avenue for bridging the strengths of convolutional networks and global transformers, combining computational efficiency with enhanced capacity for modeling critical local structures.