Multi-Head Locality-Self-Attention

Updated 17 April 2026
  • Multi-Head Locality-Self-Attention is a self-attention mechanism that restricts computations to local neighborhoods using multi-head projections to model local dependencies.
  • It employs techniques such as local window partitioning and relative positional encoding to efficiently balance spatial locality with the expressive power of standard transformers.
  • Innovations like Hadamard attention and Ghost-Head modules enhance model performance with minimal computation increase, as demonstrated by improved empirical benchmarks.

Multi-Head Locality-Self-Attention (LSA) is a family of self-attention mechanisms that combine spatial locality constraints with multi-head representations, primarily designed to improve the modeling of local dependencies in neural network architectures such as Transformers. LSA variants exploit local windowing, structured partitioning of the attention field, and explicit head-wise or spatial mechanisms to address the limitations of global self-attention, especially for vision and sequence modeling tasks where local patterns are highly informative.

1. Foundations of Multi-Head Locality-Self-Attention

Standard multi-head self-attention (MHSA), as originally described in the Transformer architecture, allows each input position to attend to all others, combining information via multiple representation subspaces (“heads”). Given an input tensor $\mathbf{X}\in\mathbb{R}^{N\times C}$, queries, keys, and values are projected for each head, and attention is computed globally.

LSA restricts the attention operation to a local neighborhood, typically by confining each query’s attention to a window of spatially or sequentially adjacent positions. This can be formalized as masking out non-local key-value pairs or by directly computing windowed attention. Multi-head decomposition is inherited from MHSA: for each head $h$, the projections and local attention are computed in subspaces of the embedding dimension.

A representative instantiation is as follows (Zhou et al., 2021):

  • Local window partitioning: The input feature map $X\in\mathbb{R}^{H\times W\times C}$ is partitioned into non-overlapping windows of size $k\times k$. For each head,

$$Q^h = X W_q^h, \quad K^h = X W_k^h, \quad V^h = X W_v^h, \quad Q^h,K^h,V^h \in \mathbb{R}^{N \times d}$$

where $N=H \cdot W$ and $d=C/H$ is the per-head dimension (in $d=C/H$, the symbol $H$ appears to denote the number of heads rather than the feature-map height).

  • Attention computation within window: For token $i$ with window $\Omega_i$,

$$\text{score}_{ij}^h = (Q_i^h) (K_j^h)^\top + b_{j-i}^{\text{rel}}$$

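$$\alpha_{ij}^h = \operatorname{softmax}_{j\in\Omega_i}\left(\text{score}_{ij}^h\right)$$

$$O_i^h = \sum_{j\in\Omega_i} \alpha_{ij}^h V_j^h$$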

The outputs of all heads are concatenated and an output projection is performed.
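
As a rough illustration of the windowed computation described above, the following NumPy sketch partitions a feature map into non-overlapping windows and applies per-head attention with an additive relative bias inside each window. The function names (`window_partition`, `local_window_attention`), the square `(C, C)` projection matrices, and the precomputed `rel_bias` array are assumptions made for this example rather than details taken from Zhou et al. (2021). The scores are left unscaled to match the score formula in the text.

```python
import numpy as np

def window_partition(x, k):
    """Split an (H, W, C) map into non-overlapping k x k windows: (num_windows, k*k, C)."""
    H, W, C = x.shape
    assert H % k == 0 and W % k == 0, "this sketch assumes H and W are divisible by k"
    x = x.reshape(H // k, k, W // k, k, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, k * k, C)

def local_window_attention(x, Wq, Wk, Wv, Wo, rel_bias, k, num_heads):
    """Multi-head attention restricted to non-overlapping k x k windows.

    x: (H, W, C); Wq, Wk, Wv, Wo: (C, C); rel_bias: (num_heads, k*k, k*k).
    Returns per-window outputs of shape (num_windows, k*k, C).
    """
    C = x.shape[-1]
    d = C // num_heads                                  # per-head dimension
    win = window_partition(x, k)                        # (nW, T, C) with T = k*k
    nW, T, _ = win.shape

    def split_heads(t):                                 # (nW, T, C) -> (nW, heads, T, d)
        return t.reshape(nW, T, num_heads, d).transpose(0, 2, 1, 3)

    q, key, v = (split_heads(win @ P) for P in (Wq, Wk, Wv))
    scores = q @ key.transpose(0, 1, 3, 2) + rel_bias   # (nW, heads, T, T)
    scores -= scores.max(axis=-1, keepdims=True)        # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)            # softmax over keys in the window
    out = attn @ v                                      # (nW, heads, T, d)
    out = out.transpose(0, 2, 1, 3).reshape(nW, T, C)   # concatenate heads
    return out @ Wo                                     # output projection
```

Because each window is processed independently, changing a token only affects outputs inside its own window, which is the locality property the section describes.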

2. Locality Mechanisms and Variants

Locality can be introduced along spatial (token) or channel axes and can be implemented through various forms of partitioning and masking.

  • Spatial locality: Windows are organized around each query position. In “Convolutional Self-Attention Networks” (Csans), this is formalized as masking attention to a fixed-size window of neighboring positions (Yang et al., 2019). The local scope is parameter-free: no new weights are introduced.
  • Inter-head locality: Csans introduces 2D locality by allowing heads to interact across both spatial and head dimensions. For each head $h$ and position $i$, a patch of keys/values spanning neighboring heads and neighboring positions is extracted; dynamic attention is then computed over this 2D (heads × positions) patch.
  • Channel-wise locality: In Local Multi-Head Channel Self-Attention (LHC), attention is applied along the channel dimension. The input is divided into $n$ contiguous spatial blocks (“heads”); each head attends only to the channel–channel relations in its spatial block. Queries, keys, and values are constructed via pooling and shared projections for each block (Pecoraro et al., 2021).

| Variant/Formulation | Locality Axis | Partitioning | Notable Properties |
|---|---|---|---|
| Csans (1D, 2D) (Yang et al., 2019) | Spatial, Head | Window, Head patches | Parameter-free, efficient, 2D convolution over heads/tokens |
| LHC (Pecoraro et al., 2021) | Channel | Spatial blocks (n heads) | Channel-wise, block-local, efficient in FLOPs/params |
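
The masking form of spatial locality can be sketched for a 1D sequence as follows. This is a minimal single-head example, assuming a centered window of odd size and the usual $1/\sqrt{d}$ score scaling; the helper names `local_band_mask` and `masked_local_attention` are hypothetical and not taken from Yang et al. (2019).

```python
import numpy as np

def local_band_mask(n, window):
    """Boolean (n, n) mask that is True where |i - j| <= window // 2."""
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= window // 2

def masked_local_attention(q, k, v, window):
    """Single-head attention where each query sees only a centered window of keys.

    q, k, v: (n, d). No parameters are added; locality comes from the mask alone.
    Returns the attended values (n, d) and the attention weights (n, n).
    """
    n, d = q.shape
    scores = (q @ k.T) / np.sqrt(d)
    # Positions outside the window get -inf, so they receive zero weight.
    scores = np.where(local_band_mask(n, window), scores, -np.inf)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights
```

Note that masking alone still computes the full score matrix, so it illustrates the parameter-free property rather than any compute saving; a practical implementation would gather only the keys inside each window.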

3. Relative Position and Local Bias

Relative positional encoding is integral to many LSA formulations since spatial relationships are not globally available within local windows.

  • In local attention frameworks (e.g., Swin Transformer, Csans, ELSA), attention score computation incorporates a learnable relative bias term, $b_{j-i}^{\text{rel}}$, typically indexed by the displacement $j-i$ within the window. Relative biases are tabulated and injected to make the spatially-constrained attention sensitive to local structure (Zhou et al., 2021).
  • In dynamic channel local attention (LHC), a learned gating or scaling function modulates the attention score per channel following computation of the channel–channel dot-product (Pecoraro et al., 2021).
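
One common way to tabulate such biases for a $k\times k$ window is to map every 2D displacement to a row of a small learnable table. The sketch below follows that general indexing idea; the function names and the `(num_heads, (2k-1)**2)` table layout are assumptions for illustration, not an API from any of the cited works.

```python
import numpy as np

def relative_position_index(k):
    """Map each (query, key) pair in a k x k window to an entry of a bias table.

    Returns an int array of shape (k*k, k*k) with values in [0, (2k-1)**2).
    """
    coords = np.stack(np.meshgrid(np.arange(k), np.arange(k), indexing="ij"))  # (2, k, k)
    coords = coords.reshape(2, -1)                                             # (2, k*k)
    rel = coords[:, None, :] - coords[:, :, None]   # rel[:, i, j] = pos_j - pos_i
    rel = rel + (k - 1)                             # shift displacements into [0, 2k-2]
    return rel[0] * (2 * k - 1) + rel[1]            # flatten (dy, dx) to one index

def gather_relative_bias(table, k):
    """table: (num_heads, (2k-1)**2) learnable biases -> (num_heads, k*k, k*k)."""
    return table[:, relative_position_index(k)]
```

Pairs of tokens with the same displacement share one table entry, so the number of bias parameters per head is $(2k-1)^2$ rather than $k^4$.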

4. Advancements: Hadamard Attention and Ghost-Head Module

ELSA (Enhanced Local Self-Attention) proposes two key augmentations to standard LSA (Zhou et al., 2021):

  • Hadamard (elementwise) attention: The expensive dot-product attention is replaced with efficient elementwise Hadamard products, introducing third-order mappings while remaining suitable for sliding-neighborhood operation. The per-location Hadamard attention is computed with learnable convolutional kernels and a bias term.

  • Ghost-Head module: Instead of increasing the explicit head count to boost channel capacity, “ghost heads” expand the output channels by linearly combining each learned attention map with two static matrices, followed by a scaling and summing operation. The terms used in this combination are learned parameters, and the process enhances output diversity without substantial computational cost.

Both innovations keep computational complexity favorable, add few parameters, and yield better empirical performance than vanilla LSA.

5. Computational Complexity and Efficiency

Different LSA variants offer favorable computational and parametric tradeoffs relative to global MHSA:

  • Csans (1D or 2D): Parameter-free relative to the baseline Transformer (only window masking is added), with attention FLOPs governed by the local window size rather than the full sequence length when the window is small (Yang et al., 2019).
  • Local windowed LSA: For a fixed window size $k\times k$, attention cost grows linearly with the number of tokens $N$. The additional cost from mechanisms such as relative bias or ghost-head is marginal (Zhou et al., 2021).
  • LHC: Scales linearly in input size, circumventing the quadratic scaling of global self-attention (Pecoraro et al., 2021).
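
To make the linear-versus-quadratic contrast concrete, the following back-of-the-envelope sketch counts only the two attention matrix products (scores and weighted sum) and ignores projections. The feature-map size, channel count, and window size are illustrative values chosen for the example, not figures reported in the cited papers.

```python
def global_attention_flops(n_tokens, channels):
    """Approximate multiply-adds for the score and weighted-sum products over all token pairs."""
    return 2 * n_tokens * n_tokens * channels

def windowed_attention_flops(n_tokens, channels, k):
    """Same two products, but each token attends to only k*k keys in its window."""
    return 2 * n_tokens * (k * k) * channels

# Illustrative setting: a 56 x 56 token grid, 96 channels, 7 x 7 windows.
n, c, k = 56 * 56, 96, 7
ratio = global_attention_flops(n, c) / windowed_attention_flops(n, c, k)
print(ratio)  # 64.0
```

The ratio reduces to $N/k^2$, so the saving grows with resolution while the windowed cost itself stays linear in $N$.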

6. Empirical Performance and Ablation Studies

  • Csans improvement: On WMT’14 En→De, injecting a local window in the lowest 3 layers of Transformer-Base improved BLEU by +0.55; using a 2D patch over heads × tokens yielded +0.87 BLEU. These gains add neither new parameters nor significant computational burden (Yang et al., 2019).
  • ELSA: Replacing Swin-T’s local attention with ELSA (Hadamard + Ghost-Head) increases ImageNet-1K top-1 accuracy from 81.3% to 82.7% (+1.4%), and on COCO detection yields +1.9 box AP over the LSA baseline. The method is implemented as a drop-in replacement, incurring only 0.2G extra FLOPs and ~0.8M parameters over 28M (Zhou et al., 2021).
  • LHC: In facial expression recognition (FER2013), integration of LHC modules in ResNet34v2 improves baseline accuracy from 72.81% to 73.39% (no TTA), and up to 74.42% with test-time augmentation. Ablations indicate benefit with increasing number of heads, and show reduced inter-head redundancy compared to conventional transformer blocks (Pecoraro et al., 2021).

7. Comparative Analysis and Integration in Architectures

LSA can be contrasted with alternative local modeling strategies:

  • Convolutions: Static, translation-invariant, parameter-efficient, but limited contextual adaptation.
  • Global self-attention: Rich context, but quadratic cost in sequence/image size.
  • Dynamic or sliding-filter approaches: More expressive local modeling but may introduce significant memory/computation overhead.

Integration of LSA in modern architectures is typically performed at earlier or intermediate layers, leveraging local inductive bias for low-level feature extraction while allowing upper layers to apply global context aggregation. In both convolutional self-attention (Yang et al., 2019) and channel-local attention (Pecoraro et al., 2021), such designs are shown to be directly compatible with standard backbones (e.g., ResNet, hierarchical transformers) and incur minor parameter or computation increases while providing measurable accuracy improvements.

A plausible implication is that multi-head locality-aware attention mechanisms provide an effective avenue for bridging the strengths of convolutional networks and global transformers, combining computational efficiency with enhanced capacity for modeling critical local structures.
