Locality Self-Attention (LSA)

Updated 15 March 2026

Locality Self-Attention is a method that biases self-attention to local regions using fixed or adaptive masking, Gaussian kernels, or convolutional mechanisms.
It reduces computational complexity and improves generalization by restricting interactions to nearby tokens, easing data and resource requirements.
Reported gains include +0.82 BLEU on WMT14 En→De translation and +6.17 mIoU on ADE20K segmentation, with applications spanning language, vision, and speech tasks.

Locality Self-Attention (LSA) refers to a family of architectural modifications and masking strategies for self-attention networks that explicitly incorporate an inductive bias favoring local dependencies. Unlike vanilla multi-head self-attention, which permits each query position to interact with any key across the entire sequence or data grid, LSA restricts or biases attention to a local neighborhood—either via hard masking, learned Gaussian/frequency kernels, or hybrid convolutional-attentional mechanisms. This paradigm has been instantiated across natural language processing, speech enhancement, and vision domains; it encompasses fixed-window, data-adaptive, and convolution-augmented approaches, each with distinct mathematical formulations and empirical trade-offs.

1. Formal Definitions and Mechanisms

Hard Window Masking

Canonical LSA, as introduced in direct sequence modeling and machine translation (Fonollosa et al., 2019), constrains the receptive field of each attention head by imposing an additive mask $M \in \{0, -\infty\}^{n \times n}$: $M_{ij} = \begin{cases} 0 & \text{if } |i-j| \leq w \\ -\infty & \text{otherwise} \end{cases}$ The mask is added to the attention scores before the softmax: $A = \text{softmax}\left(\frac{QK^\top + M}{\sqrt{d_k}}\right)$ where the window size $w$ may be constant, layer-dependent, or dynamically selected per position. Entries with $M_{ij} = -\infty$ receive exactly zero weight after the softmax, so each query attends to at most $2w + 1$ keys.
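The masking rule above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the implementation of any cited paper: the function names (`window_mask`, `softmax`, `local_attention_weights`) and the toy values of `Q` and `K` are hypothetical, and a dense $n \times n$ mask is built for clarity even though practical implementations would avoid materializing it.

```python
import math

def window_mask(n, w):
    """Additive mask: 0.0 where |i - j| <= w, -inf elsewhere."""
    return [[0.0 if abs(i - j) <= w else -math.inf for j in range(n)]
            for i in range(n)]

def softmax(row):
    """Numerically stable softmax; masked (-inf) entries come out as exactly 0.0."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def local_attention_weights(Q, K, w):
    """Compute softmax((Q K^T + M) / sqrt(d_k)) row by row."""
    n, d_k = len(Q), len(Q[0])
    M = window_mask(n, w)
    weights = []
    for i in range(n):
        logits = []
        for j in range(n):
            score = sum(q * k for q, k in zip(Q[i], K[j]))
            logits.append((score + M[i][j]) / math.sqrt(d_k))
        weights.append(softmax(logits))
    return weights

# Toy input (hypothetical values): 5 positions, d_k = 2.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5], [-1.0, 0.5]]
K = [[0.5, 0.5], [1.0, -1.0], [0.0, 1.0], [1.0, 0.0], [-0.5, 0.5]]
A = local_attention_weights(Q, K, w=1)
```

With `w=1`, every row of `A` still sums to one, but only the diagonal and its two neighbours carry non-zero weight.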

Learnable Gaussian Locality Biases

An alternative adaptive approach inserts data-dependent, learnable Gaussian kernels in logit space. In this case

$G_{ij} = -\frac{(j - P_i)^2}{2\sigma_i^2}$

where the center $P_i$ and the width $\sigma_i$ (a standard deviation, so $\sigma_i^2$ is the variance) are query-dependent and predicted via simple MLPs or linear heads over the queries (Yang et al., 2018, Hajimiri et al., 5 Mar 2026). The resulting attention weights are: $\text{Attention}^{LSA}_{ij} = \text{softmax}_j \left( \frac{Q_i K_j^\top}{\sqrt{d_k}} + G_{ij} \right)$ In applications such as vision transformers, position is extended to two-dimensional spatial coordinates and $\sigma_i$ may be a vector controlling anisotropy (Hajimiri et al., 5 Mar 2026).
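To make the effect of the Gaussian bias concrete, the following sketch adds $G_{ij}$ to scaled dot-product logits for a one-dimensional sequence. It rests on several assumptions: the centers and widths are passed in as plain lists rather than predicted from the queries, and all function and variable names are illustrative rather than taken from the cited work.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of logits."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def gaussian_bias(n, centers, sigmas):
    """G[i][j] = -(j - P_i)^2 / (2 * sigma_i^2) for n key positions."""
    return [[-((j - p) ** 2) / (2.0 * s ** 2) for j in range(n)]
            for p, s in zip(centers, sigmas)]

def gaussian_attention_weights(Q, K, centers, sigmas):
    """softmax_j(Q_i . K_j / sqrt(d_k) + G_ij), one row per query."""
    n, d_k = len(K), len(K[0])
    G = gaussian_bias(n, centers, sigmas)
    weights = []
    for i, q in enumerate(Q):
        logits = [sum(a * b for a, b in zip(q, K[j])) / math.sqrt(d_k) + G[i][j]
                  for j in range(n)]
        weights.append(softmax(logits))
    return weights

# With all-zero queries the content term vanishes, so the bias alone shapes the weights.
n = 7
Q = [[0.0, 0.0] for _ in range(n)]
K = [[1.0, 2.0] for _ in range(n)]
centers = [3.0] * n  # hypothetical P_i, fixed here instead of predicted
narrow = gaussian_attention_weights(Q, K, centers, [0.5] * n)
wide = gaussian_attention_weights(Q, K, centers, [3.0] * n)
```

A small $\sigma_i$ concentrates almost all weight on the center position, while a large $\sigma_i$ approaches ordinary unbiased attention. Unlike the hard window, no key is ever assigned exactly zero weight.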

Local Spectral and Axial Attention

In spectral and grid-structured data (speech, images), LSA is applied along frequency or spatial dimensions using masks or Gaussian/local kernels, e.g. limiting attention to a fixed band on the frequency axis (Hou et al., 2023), or over non-overlapping/spatially shifted windows in vision transformers (ViTs) (Zhou et al., 2021). In some cases, LSA is formulated as a convolutional operation with a large kernel, which makes it a strictly translation-equivariant variant of attention (Li et al., 2024).
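A fixed band on the frequency axis is the one-dimensional hard-window mask applied along frequency bins. The windowed case on a 2D grid can be sketched as below, assuming row-major token order and a simple cyclic shift. The helper names are hypothetical, and the sketch groups cells that become neighbours only through wrap-around, which a fuller implementation would typically mask out as well.

```python
def window_ids(height, width, win, shift=0):
    """Map each cell of a height x width grid (row-major) to a window index.

    Cells are cyclically shifted by `shift` before being binned into
    non-overlapping win x win windows.
    """
    windows_per_row = -(-width // win)  # ceiling division
    ids = []
    for r in range(height):
        for c in range(width):
            rr, cc = (r + shift) % height, (c + shift) % width
            ids.append((rr // win) * windows_per_row + (cc // win))
    return ids

def same_window_mask(height, width, win, shift=0):
    """Boolean mask: True where the query and key tokens share a window."""
    ids = window_ids(height, width, win, shift)
    return [[a == b for b in ids] for a in ids]

# 4 x 4 grid of tokens split into 2 x 2 windows, without and with a shift of 1.
plain = same_window_mask(4, 4, win=2)
shifted = same_window_mask(4, 4, win=2, shift=1)
```

In the unshifted layout, tokens (1, 1) and (2, 2) sit in different windows and cannot attend to each other. After the shift they share a window, which is how alternating the two layouts lets information cross window boundaries.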

Convolutional-Augmented and Dynamic Local Attention

LESA (Locally Enhanced Self-Attention) augments the “unary” term of self-attention (i.e. a token’s attention to itself) with a $k\times k$ grouped convolution, combined with a “context” (global) term via an input-dependent gating mechanism (Yang et al., 2021). LSA can further be hybridized with dynamic filters, Hadamard products, or ghost heads to expand its channel capacity and local modeling expressivity (Zhou et al., 2021).

2. Data-Driven and Heuristic Scheduling

Layer-wise and head-wise window scheduling is common. Empirical analysis often indicates that early layers tolerate, or even benefit from, tight locality, while deeper layers require wider attention receptive fields. For example, Fonollosa et al. employed a schedule in which the window size $w$ grows from shallow to deep layers (Fonollosa et al., 2019). In speech translation, data-driven “contribution analysis” is used to select the minimum window width $w$ per layer such that off-diagonal contributions are negligible, yielding significant FLOPs and memory reduction without accuracy loss (Alastruey et al., 2022).
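The idea of choosing the smallest window that loses a negligible amount of attention can be illustrated with a simple proxy. The sketch below is an assumption-laden stand-in, not the contribution analysis of the cited work: it uses raw attention weights as the contribution measure, and the tolerance `tol` and the function names are hypothetical.

```python
def off_band_mass(attn, w):
    """Average attention mass per query that falls outside |i - j| <= w."""
    n = len(attn)
    outside = sum(attn[i][j] for i in range(n) for j in range(n) if abs(i - j) > w)
    return outside / n

def min_window(attn, tol=0.01):
    """Smallest half-width w whose off-band mass is at most tol."""
    n = len(attn)
    for w in range(n):
        if off_band_mass(attn, w) <= tol:
            return w
    return max(n - 1, 0)  # only reached for an empty matrix

# Toy attention maps (hypothetical): one strongly local layer, one fully global layer.
banded = [
    [0.50, 0.50, 0.00, 0.00],
    [0.25, 0.50, 0.25, 0.00],
    [0.00, 0.25, 0.50, 0.25],
    [0.00, 0.00, 0.50, 0.50],
]
uniform = [[0.25] * 4 for _ in range(4)]
schedule = [min_window(layer) for layer in (banded, uniform)]
```

Applied to attention maps averaged over a validation set, the same loop would yield one window width per layer. Here the local layer needs only its immediate neighbours, while the uniform layer needs the full sequence.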

Multi-head variants allow head-specific locality granularity: heads may be allocated with different fixed or learned window sizes, or attention masks (Yang et al., 2018, Pande et al., 2020). Some approaches share projection parameters across local heads to reduce parameter count (Pande et al., 2020).
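Head-specific granularity amounts to giving each head its own mask. The sketch below assumes fixed half-widths per head; the particular allocation and the function names are hypothetical, and learned window sizes or shared projections are not modelled.

```python
def band_mask(n, w):
    """Boolean mask: True where |i - j| <= w."""
    return [[abs(i - j) <= w for j in range(n)] for i in range(n)]

def per_head_masks(n, head_windows):
    """One band mask per head; head_windows[h] is that head's half-width."""
    return [band_mask(n, w) for w in head_windows]

# Hypothetical allocation: two narrow heads, one medium head, one global head.
n = 6
head_windows = [1, 1, 2, n - 1]
masks = per_head_masks(n, head_windows)

# Number of keys visible to the first query position in each head.
keys_seen_by_query_0 = [sum(mask[0]) for mask in masks]
print(keys_seen_by_query_0)  # [2, 2, 3, 6]
```

Setting a head's half-width to `n - 1` recovers unrestricted attention for that head, so a mixed allocation keeps some global context alongside the local heads.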

3. Empirical Impact and Benchmark Results

LSA techniques consistently demonstrate:

Improved sample efficiency and generalization under limited data, due to regularization via locality bias (Fonollosa et al., 2019, Lee et al., 2021).
Computational and memory savings, especially for long sequences or high-dimensional inputs; e.g., replacing $O(n^2)$ complexity with $O(nw)$ for sequence length $n$ and window size $w$ (Alastruey et al., 2022, Zhou et al., 2021).
Enhanced local feature modeling in vision, leading to segmentation and dense prediction gains over global-attention-only baselines (Hajimiri et al., 5 Mar 2026, Li et al., 2024, Zhou et al., 2021).
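The size of the computational saving can be estimated by counting how many query-key pairs a hard window actually scores. The sketch assumes a one-dimensional sequence of length $n \geq 1$ and a symmetric window of half-width $w$, for which the count is $n(2w+1) - w(w+1)$ once edge truncation is accounted for. The function names and the example values are illustrative.

```python
def scored_pairs(n, w):
    """Number of (query, key) pairs with |i - j| <= w in a length-n sequence (n >= 1)."""
    w = min(w, n - 1)  # a window wider than the sequence is just full attention
    return n * (2 * w + 1) - w * (w + 1)

def cost_ratio(n, w):
    """Windowed pair count relative to the n * n pairs of full attention."""
    return scored_pairs(n, w) / (n * n)

n, w = 1000, 16
print(scored_pairs(n, w))          # 32728
print(round(cost_ratio(n, w), 4))  # 0.0327
```

For this example, the window scores roughly 3% of the pairs that full attention would, and the count grows linearly in $n$ for fixed $w$ rather than quadratically.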

Representative findings include:

| Model/Task | Baseline | + LSA | Improvement | Reference |
|---|---|---|---|---|
| ViT Tiny, ADE20K | 17.30 mIoU | 23.47 mIoU | +6.17 | (Hajimiri et al., 5 Mar 2026) |
| WMT14 En$\rightarrow$De | 30.01 BLEU | 30.83 BLEU | +0.82 | (Pande et al., 2020) |
| MTFAA, PESQ | 3.13 | 3.16 | +0.03 | (Hou et al., 2023) |

In addition to segmentation, LSA raises BLEU by 0.4–0.6 points on small-scale MT benchmarks and typically yields 1–4% accuracy improvement in small-data Vision Transformer setups, especially when coupled with techniques like shifted patch tokenization (Lee et al., 2021, Hajimiri et al., 5 Mar 2026).

4. Comparative Analysis and Theoretical Considerations

Extensive analytical studies have demonstrated that even unconstrained global self-attention implicitly develops a strong locality bias: output representations are far more sensitive to nearby tokens than distant or even syntactic ones, across both language modeling and classification tasks (Pande et al., 2020). Enforcing locality via masking or kernel bias not only preserves performance on GLUE and MT tasks but also allows parameter tying and head sharing, further compressing the model (Pande et al., 2020, Alastruey et al., 2022).

However, “vanilla” LSA (e.g., static windowed or Swin-style non-overlapping LSA) is often outperformed by dynamic filters, neighbor/sliding window mechanisms, and attention modules with relative positional encoding (Zhou et al., 2021). Enhanced variants such as ELSA integrate Hadamard attention and “ghost head” channel expansion to bridge these gaps.

5. Applications Across Modalities

Natural Language Processing: LSA is applied in both machine translation (Fonollosa et al., 2019, Yang et al., 2018, Pande et al., 2020) and speech translation (Alastruey et al., 2022), predominantly to improve efficiency and regularization.
Computer Vision: LSA, windowed or Gaussian-augmented, is a building block in modern vision transformers for small and medium datasets, segmentation, and object detection (Lee et al., 2021, Zhou et al., 2021, Hajimiri et al., 5 Mar 2026), sometimes providing critical improvements in spatial precision without sacrificing global context.
Speech Enhancement: Local spectral attention restricts frequency-domain interactions, suppressing noise leakage and improving perceptual metrics at lower computational cost (Hou et al., 2023).
Image Captioning: LSA modules on spatial grids allow local feature fusion and recovery of fine object details otherwise fragmented by global attention (Ma et al., 2023).

6. Implementation Notes and Limitations

Masking and kernel selection: LSA is typically implemented by adding binary or continuous masks to the raw attention logits before the softmax. Gaussian kernels require explicit computation of pairwise coordinate differences and per-query parameter prediction (Yang et al., 2018, Hajimiri et al., 5 Mar 2026).
Parameter Overhead: Most LSA variants introduce minimal parameters beyond standard self-attention, with learned Gaussian kernel approaches requiring only lightweight projections per head (Hajimiri et al., 5 Mar 2026).
Limitations: Rigid locality can obstruct the capture of long-range dependencies if not progressively widened in deeper layers (Fonollosa et al., 2019, Alastruey et al., 2022). Overly narrow or fixed kernel windows lead to performance degradation, as shown in ablation studies. Channel capacity and filter application details substantially affect results in vision; relative positional embeddings and sliding-window operations are pivotal for maximizing fine-grained spatial learning (Zhou et al., 2021).
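The pairwise coordinate differences mentioned above can be illustrated for a 2D token grid. This sketch assumes an axis-aligned anisotropic Gaussian with one width per axis, which is only one possible reading of a vector-valued $\sigma_i$. The centers and widths are supplied by hand rather than predicted, and all names are hypothetical.

```python
def gaussian_bias_2d(height, width, centers, sigmas):
    """Logit bias for tokens on a height x width grid (row-major key order).

    centers[i] = (row, col) center for query i.
    sigmas[i]  = (sigma_row, sigma_col) widths for query i.
    """
    coords = [(r, c) for r in range(height) for c in range(width)]
    bias = []
    for (pr, pc), (sr, sc) in zip(centers, sigmas):
        bias.append([
            -((r - pr) ** 2) / (2.0 * sr ** 2) - ((c - pc) ** 2) / (2.0 * sc ** 2)
            for r, c in coords
        ])
    return bias

# One query centered on the middle of a 3 x 3 grid, narrow vertically and wide horizontally.
G = gaussian_bias_2d(3, 3, centers=[(1.0, 1.0)], sigmas=[(0.5, 2.0)])
above = G[0][1]    # key at (0, 1): one step along the row axis
left = G[0][3]     # key at (1, 0): one step along the column axis
center = G[0][4]   # key at (1, 1): the center itself
```

With these widths, a one-step vertical offset is penalized far more heavily than a one-step horizontal offset, so the query's receptive field is stretched along the row. The bias is added to the logits before the softmax, exactly as in the one-dimensional case.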

7. Theoretical and Practical Significance

LSA mechanisms build strong spatial, temporal, or spectral priors into self-attention networks, reducing their data requirements and improving robustness in data- and resource-constrained regimes. They expose an explicit trade-off between local inductive bias, parameter and computation savings, and global context modeling. Empirical results indicate that carefully selected and scheduled locality constraints can match or surpass the performance of unconstrained attention on both language and vision tasks, and they underpin current practice for ViTs, speech enhancement models, and sequence-to-sequence translation architectures (Fonollosa et al., 2019, Alastruey et al., 2022, Hajimiri et al., 5 Mar 2026).