Locally Shifted Attention Mechanisms

Updated 27 April 2026

Locally shifted attention denotes attention mechanisms that constrain each query to a local spatial or temporal neighborhood whose position is shifted, either on a fixed schedule or through learned parameters.
This approach underpins efficient transformer architectures for tasks such as speech recognition and vision, reducing complexity while maintaining cross-window context.
Reported results include gains in accuracy and memory efficiency, with shifted-window and group shifted attention among the main mechanisms.

Locally shifted attention refers to a broad family of neural attention mechanisms that constrain each query’s context to a spatially or temporally local neighborhood, with a learnable or adaptive “shift” or bias of the receptive field. This design aims to mitigate the inefficiencies and inductive limitations of global attention by emphasizing locality while preserving a receptive field that can shift, grow, or propagate over layers. Locally shifted attention is a core ingredient in high-efficiency transformer architectures, sequence-to-sequence models for monotonic tasks, and scalable vision networks, manifesting in variants such as monotonic attention, shifted window attention, group shifted attention, and virtual-patch shifted attention. Common goals are efficient computation, improved inductive bias for locality or monotonicity, and enhanced feature continuity across partition boundaries.

1. Monotonic Local Attention for Sequence Models

The first formalization of locally shifted attention appeared in the context of monotonic, sequence-to-sequence models such as end-to-end speech recognition and grapheme-to-phoneme conversion (Tjandra et al., 2017). In global attention, each output step $t$ computes alignment over all input states $h^e_s$ :

$c_t = \sum_{s=1}^S \alpha_{t,s} h^e_s,\quad \alpha_{t,s} = \frac{\exp(\text{Score}(h^e_s, h^d_t))}{\sum_{s'} \exp(\text{Score}(h^e_{s'}, h^d_t))}$

This incurs $O(S \cdot T)$ complexity for $S$ input states and $T$ output steps, and it permits arbitrary jumps in the alignment, which is unsuitable for monotonic mappings.

Local monotonic attention instead:

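Predicts a center $p_t \geq p_{t-1}$, with the increment $\Delta p_t \geq 0$ computed via an MLP, which enforces monotonicity.
Attends only within a window of size $2\sigma$ around $p_t$, reducing context to $O(\sigma \cdot T)$.
Computes a Gaussian prior $a_t^{\mathcal N}(s)$ centered at $p_t$, combined multiplicatively with a local scorer $a_t^{\mathcal S}(s)$ over the window.

Key formulas:

$p_t = p_{t-1} + \Delta p_t, \quad \Delta p_t \geq 0$

$a_t^{\mathcal N}(s) \propto \exp\left(-\frac{(s - p_t)^2}{2\sigma^2}\right)$

$c_t = \sum_{s \in \mathcal{W}_t} a_t^{\mathcal N}(s)\, a_t^{\mathcal S}(s)\, h^e_s$, where $\mathcal{W}_t$ is the window around $p_t$.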

This yields sharper, step-wise alignments well-suited for speech and mapping tasks, achieving both a 12% relative error reduction in TIMIT PER and reduced decoding complexity (Tjandra et al., 2017).
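As an illustration of the three ingredients above (monotonic center update, Gaussian prior, windowed scorer), the following NumPy sketch performs a single decoder step. It is not the authors' implementation: the increment `delta_p` is passed in as a plain number instead of coming from an MLP, the scores are given rather than learned, the Gaussian prior is left unscaled, and the window is assumed to extend $2\sigma$ on each side of the center.

```python
import numpy as np

def local_monotonic_step(h_enc, scores, p_prev, delta_p, sigma):
    """One decoder step of local monotonic attention (illustrative sketch).

    h_enc   : (S, D) encoder states h^e_s
    scores  : (S,) unnormalized Score(h^e_s, h^d_t) for the current step
    p_prev  : previous center p_{t-1}
    delta_p : non-negative increment (an MLP output in the text; a number here)
    sigma   : Gaussian width; the window half-width 2*sigma is an assumption
    """
    assert delta_p >= 0, "monotonicity requires a non-negative increment"
    S = h_enc.shape[0]
    p_t = min(p_prev + delta_p, S - 1.0)        # clamp so the window is never empty
    lo = max(0, int(np.floor(p_t - 2 * sigma)))
    hi = min(S, int(np.ceil(p_t + 2 * sigma)) + 1)
    idx = np.arange(lo, hi)                     # encoder positions inside the window

    # Unscaled Gaussian prior centered at p_t
    prior = np.exp(-((idx - p_t) ** 2) / (2.0 * sigma ** 2))

    # Local scorer: softmax restricted to the window
    z = scores[idx] - scores[idx].max()
    local = np.exp(z) / np.exp(z).sum()

    weights = prior * local                     # multiplicative combination
    context = weights @ h_enc[idx]              # c_t uses window states only
    return p_t, idx, weights, context

rng = np.random.default_rng(0)
h_enc = rng.normal(size=(50, 8))
scores = rng.normal(size=50)
p_t, idx, weights, context = local_monotonic_step(
    h_enc, scores, p_prev=10.0, delta_p=1.5, sigma=2.0)
```

Each step touches only the states indexed by `idx`, so the cost per step depends on $\sigma$ rather than on the input length.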

2. Shifted-Window Attention in Vision Transformers

Locally shifted attention achieved widespread impact through shifted window architectures in vision transformers, notably Swin Transformer and its descendants. The key technique partitions an $H \times W$ feature map into non-overlapping $M \times M$ windows, then alternates layers with regular and half-window cyclically shifted partitions (Li et al., 2023, Boulaabi et al., 20 Apr 2025, Gu et al., 29 Jul 2025, Khadka et al., 10 Sep 2025, Cai et al., 2024). This strategy achieves:

Intra-window locality: Within each window, self-attention is computed only among the $M^2$ spatial tokens, drastically reducing per-block cost from $O((HW)^2)$ to $O(M^2 \cdot HW)$.
Window shifting: Every second layer, the window grid is shifted by $(\lfloor M/2 \rfloor, \lfloor M/2 \rfloor)$ so that window boundaries move. Tokens at window boundaries in one layer are included in the center of a window in the next, enabling cross-window information flow.
Hierarchical stacking: Multiple layers (or stages) alternate regular and shifted windows, progressively expanding the effective receptive field.

The attention within each window is computed as: $\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d}} + B\right)V$ where $d$ is the query/key dimension and $B \in \mathbb{R}^{M^2 \times M^2}$ is a learnable relative position bias over the $M^2$ tokens of a window.
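One common way to realize a relative position bias for a window of side $M$ is to keep a table with one entry per relative offset, $(2M-1)^2$ entries in total, and gather from it with a precomputed index. The sketch below assumes that scheme; the names `relative_position_index` and `bias_table` are illustrative, the table is filled with random numbers in place of learned parameters, and the per-head dimension is omitted.

```python
import numpy as np

def relative_position_index(M):
    """(M*M, M*M) integer index: token pair -> entry of a (2M-1)^2 bias table."""
    coords = np.stack(np.meshgrid(np.arange(M), np.arange(M), indexing="ij"))
    coords = coords.reshape(2, -1)                      # (2, M*M) row/col of each token
    rel = coords[:, :, None] - coords[:, None, :]       # (2, M*M, M*M) pairwise offsets
    rel = rel + (M - 1)                                 # move offsets into [0, 2M-2]
    return rel[0] * (2 * M - 1) + rel[1]                # flatten (dy, dx) to one index

M = 4
rng = np.random.default_rng(0)
# Stand-in for learned parameters (assumption: one scalar per relative offset).
bias_table = rng.normal(scale=0.02, size=(2 * M - 1) ** 2)
B = bias_table[relative_position_index(M)]              # (M*M, M*M), added to the logits

def window_logits(q, k, B):
    """Scaled dot-product logits for one window plus the position bias."""
    return q @ k.T / np.sqrt(q.shape[-1]) + B
```

Because the index depends only on the offset between two tokens, every pair with the same displacement shares one bias value.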

Complexity comparison, for an $H \times W$ feature map and window size $M$:

Attention Type	Complexity per Layer	Long-Range Propagation
Global (ViT)	$O((HW)^2)$	Single pass
Windowed (no shift)	$O(M^2 \cdot HW)$	None across windows
Shifted-window (locally shifted)	$O(M^2 \cdot HW)$	Yes, with cross-window flow
Group shifted (AgileIR)	$O(M^2 \cdot HW)$	Same, with memory savings

Alternating shifted windows yields nearly global receptive fields after a small number of blocks, while retaining linear cost in the number of tokens $HW$ (Li et al., 2023, Gu et al., 29 Jul 2025, Cai et al., 2024).
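The growth of the receptive field can be checked with a small connectivity experiment. The sketch below tracks, for the top-left token of a grid, which fraction of tokens can influence it after a number of blocks, with and without shifting. It is a toy model under stated assumptions: window side `M` is even, the shift is `M // 2`, and shifted windows are modeled as a displaced grid with smaller border windows rather than as a cyclic roll.

```python
import numpy as np

def window_ids(H, W, M, shift=0):
    """Window id of every token when the window grid is displaced by `shift`."""
    rows = (np.arange(H) + shift) // M
    cols = (np.arange(W) + shift) // M
    n_cols = (W + shift) // M + 1                # upper bound on windows per row
    return (rows[:, None] * n_cols + cols[None, :]).reshape(-1)

def reach_after_blocks(H, W, M, num_blocks, use_shift=True):
    """Fraction of tokens that can influence token 0 after each block."""
    N = H * W
    reach = np.eye(N, dtype=bool)                # reach[i, j]: token j influences token i
    fractions = []
    for b in range(num_blocks):
        shift = M // 2 if (use_shift and b % 2 == 1) else 0
        ids = window_ids(H, W, M, shift)
        same = ids[:, None] == ids[None, :]      # tokens in one window attend to each other
        reach = (same.astype(np.int64) @ reach.astype(np.int64)) > 0
        fractions.append(float(reach[0].mean()))
    return fractions

with_shift = reach_after_blocks(16, 16, 4, num_blocks=8)
without_shift = reach_after_blocks(16, 16, 4, num_blocks=8, use_shift=False)
```

Without shifting, the fraction stays at $M^2 / (HW)$ no matter how many blocks are stacked; with alternating shifts it grows until it covers the whole grid.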

3. Mechanisms for Locality and Shifted Context

Variants of locally shifted attention deploy different mechanisms for shifting or overlapping locality:

Cyclically shifted windows (Swin, R3D-SWIN, CoSwin, AgileIR): Cyclically roll the feature map spatially by $(-s, -s)$, where $s = \lfloor M/2 \rfloor$ and $M$ is the window size, then re-window, compute per-window attention, and invert the shift.
Multiple shifted and overlapped windows (Sow-Attention, EleGANt): Partition the feature map multiple times with different half-window offsets (e.g., four schemes with offsets $(\delta_y, \delta_x) \in \{0, M/2\}^2$), and merge the outcomes for smooth, block-artifact-free continuity (Yang et al., 2022).
Group-wise head decomposition (AgileIR): Split multi-head self-attention into $G$ groups, applying windowing and a shift per group, yielding further memory reduction at the same locality scale (Cai et al., 2024).
Learned local bias (LocAtViT): Imposes a learnable Gaussian prior centered at each query patch, biasing attention logits toward spatial proximity without explicit windowing (Hajimiri et al., 5 Mar 2026).

In all cases, locality is defined by window size or Gaussian scale, while the shift (either deterministic or parameterized) enforces dynamic spatial coverage and aggregation.

4. Representative Architectures and Algorithmic Patterns

Swin Transformer and Derivatives

Swin Transformer alternates block types:

Window MSA (W-MSA): Attention within non-overlapping $M \times M$ windows.
Shifted Window MSA (SW-MSA): Cyclically shift by $\lfloor M/2 \rfloor$ spatially, re-partition and attend in new windows, then invert the shift.

This alternating sequence enables propagation of information beyond window boundaries. Each block includes attention, MLP, normalization, and skip connections. In SwinECAT and R3D-SWIN, shifted window attention is foundational for high-resolution fundus image and 3D voxel tasks (Li et al., 2023, Gu et al., 29 Jul 2025).

CoSwin

Adds a parallel learnable local convolutional enhancement module at every block, fusing conv and attention features, preserving translation equivariance and achieving gains especially in low-data/low-resolution tasks (Khadka et al., 10 Sep 2025).

EleGANt Sow-Attention

Partitions into four overlapping, half-shifted window schemes and merges outputs by geometric bilinear weights, achieving fine-grained, artifact-free attention for high-frequency image manipulations (Yang et al., 2022).
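A merge of this kind can be sketched with separable triangular ("tent") weights: each window scheme weights a token by its closeness to that scheme's window center, and the four weights sum to one at every position. The code below is an assumption-laden illustration, not the EleGANt implementation: the function names are hypothetical, window side `M` must be even, and windows are treated as wrapping cyclically at the borders.

```python
import numpy as np

def tent_1d(L, M, offset):
    """Triangular weight of each position relative to the center of its window."""
    u = (np.arange(L) + offset) % M              # position inside the offset window
    return 1.0 - np.abs(u + 0.5 - M / 2) / (M / 2)

def sow_merge_weights(H, W, M):
    """One (H, W) weight map per window scheme; offsets are 0 or M/2 per axis."""
    s = M // 2
    return {(dy, dx): np.outer(tent_1d(H, M, dy), tent_1d(W, M, dx))
            for dy in (0, s) for dx in (0, s)}

def merge(outputs, weights):
    """Blend per-scheme results; outputs[(dy, dx)] has shape (H, W, C)."""
    return sum(weights[key][..., None] * outputs[key] for key in weights)

weights = sow_merge_weights(8, 8, 4)
total = sum(weights.values())                    # equals 1 at every position
```

A token near the edge of a window in one scheme lies near the center of a window in another, so the blend favors the scheme in which the token has the most complete local context.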

LocAtViT

Adds a learnable Gaussian bias to global attention logits, resulting in a “softly shifted” locality that preserves global receptive field but encourages strong local focus—well-suited for segmentation tasks where spatial detail is critical (Hajimiri et al., 5 Mar 2026).
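The idea of biasing global logits toward nearby patches can be sketched as follows. This is a simplified stand-in rather than LocAtViT's actual parametrization, which the text does not spell out: the bias is an isotropic Gaussian log-prior with a fixed `sigma` (it would be learnable in practice), and a single head without projections is used.

```python
import numpy as np

def gaussian_locality_bias(H, W, sigma):
    """(N, N) additive bias -d^2 / (2 sigma^2) between patch centers, N = H*W."""
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    pos = np.stack([ys.ravel(), xs.ravel()], axis=1).astype(float)   # (N, 2)
    d2 = ((pos[:, None, :] - pos[None, :, :]) ** 2).sum(-1)          # squared distances
    return -d2 / (2.0 * sigma ** 2)

def biased_global_attention(q, k, v, bias):
    """Global softmax attention with an additive locality bias on the logits."""
    logits = q @ k.T / np.sqrt(q.shape[-1]) + bias
    logits = logits - logits.max(axis=-1, keepdims=True)             # numerical stability
    w = np.exp(logits)
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ v, w

rng = np.random.default_rng(0)
H, W, D = 4, 4, 8
q, k, v = (rng.normal(size=(H * W, D)) for _ in range(3))
out, attn = biased_global_attention(q, k, v, gaussian_locality_bias(H, W, sigma=1.0))
```

Every attention weight stays strictly positive, so the receptive field remains global while nearby patches are favored.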

Local Shifted Attention With Early Global Integration

Constructs per-patch locality by soft-attending across multiple circularly shifted patch neighborhoods, aggregating them via per-query attention, and then applying early-stage global self-attention to obtain a full receptive field (Sheynin et al., 2021).

5. Mathematical Formulation and Complexity

The core shared structure is:

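Window partitioning: $X \in \mathbb{R}^{H \times W \times C} \rightarrow$ batch of $M \times M \times C$ (or $M \times M \times M \times C$ for 3D) tensors.
Optional shift: $\tilde{X} = \text{roll}(X, (-s, -s))$, with $s = \lfloor M/2 \rfloor$
Attention within window, for the tokens $X_w$ of one window:

$Q = X_w W_Q, \quad K = X_w W_K, \quad V = X_w W_V$

$A = \text{softmax}\left(\frac{QK^\top}{\sqrt{d}} + B\right)$

$Y_w = A V$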

Multi-shifting/overlap (if used): repeat above for each offset, then aggregate.
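The steps above (partition, optional cyclic shift, per-window attention, inverse shift) can be put together in a few lines of NumPy. This is a minimal sketch under simplifying assumptions: one head, identity Q/K/V projections, `H` and `W` divisible by the window side `M`, and no attention mask for tokens that become neighbors only through the cyclic wrap-around.

```python
import numpy as np

def window_partition(x, M):
    """(H, W, C) -> (num_windows, M*M, C)."""
    H, W, C = x.shape
    x = x.reshape(H // M, M, W // M, M, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, M * M, C)

def window_reverse(windows, M, H, W):
    """Inverse of window_partition: (num_windows, M*M, C) -> (H, W, C)."""
    C = windows.shape[-1]
    x = windows.reshape(H // M, W // M, M, M, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(H, W, C)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def window_attention(x, M, shift=0, bias=None):
    """Windowed self-attention with an optional cyclic shift of the feature map."""
    H, W, C = x.shape
    if shift:
        x = np.roll(x, (-shift, -shift), axis=(0, 1))      # move the window boundaries
    win = window_partition(x, M)                           # (nW, M*M, C)
    logits = win @ win.transpose(0, 2, 1) / np.sqrt(C)     # identity projections (assumption)
    if bias is not None:
        logits = logits + bias                             # e.g. relative position bias
    out = softmax(logits) @ win
    x = window_reverse(out, M, H, W)
    if shift:
        x = np.roll(x, (shift, shift), axis=(0, 1))        # undo the shift
    return x

rng = np.random.default_rng(0)
x = rng.uniform(size=(8, 8, 3))
y_regular = window_attention(x, M=4)
y_shifted = window_attention(x, M=4, shift=2)
```

With `shift=0`, tokens on opposite sides of a window boundary never interact; with `shift=M // 2` they fall into a common window, which is the cross-window path that alternating blocks exploit.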

Computational characteristics, for an $H \times W$ feature map and window size $M$:

Standard global attention: $O((HW)^2)$
Windowed or locally shifted: $O(M^2 \cdot HW)$, $M^2 \ll HW$
Grouped heads (AgileIR): $O(M^2 \cdot HW)$ in total, computed over $G$ head groups
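A quick count of query-key pairs makes the gap concrete. The grid and window sizes below are illustrative values chosen for the example, not figures from the cited papers, and constant factors such as channel width and number of heads are ignored.

```python
def attention_pair_counts(H, W, M):
    """Number of query-key pairs per layer for global vs. windowed attention."""
    N = H * W
    return {"global": N * N, "windowed": N * M * M}

counts = attention_pair_counts(H=56, W=56, M=7)
ratio = counts["global"] / counts["windowed"]   # equals H*W / M**2
```

For a 56 x 56 grid with 7 x 7 windows, global attention scores 9,834,496 pairs against 153,664 for windowed attention, a factor of 64. Doubling the grid side quadruples the windowed count but multiplies the global count by sixteen.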

Empirical impact: In AgileIR, group shifted window attention (GSWA) cuts training memory overhead by more than 50% at large batch size, with negligible performance drop (0.1 dB in PSNR for restoration) (Cai et al., 2024).

6. Empirical Performance and Application Domains

Locally shifted attention has demonstrated consistent efficiency and accuracy gains across domains:

Speech and Sequence Modeling: Up to 12% relative error reduction and 3 BLEU improvement over global attention for monotonic alignment (Tjandra et al., 2017).
Vision (2D/3D): State-of-the-art results in single-view 3D voxel reconstruction (Li et al., 2023), 2–5% accuracy gain in small-scale image classification (Khadka et al., 10 Sep 2025), +1.7 pp in fundus disease classification (Gu et al., 29 Jul 2025), and substantial gains in segmentation mIoU on ADE20K and others (Hajimiri et al., 5 Mar 2026).
Image Generation and Restoration: Linear-cost Sow-Attention achieves artifact-free, high-resolution makeup transfer with up to 16$\times$ less compute than global attention (Yang et al., 2022); GSWA provides scalable super-resolution/restoration with >50% memory reduction (Cai et al., 2024).

7. Trade-offs, Limitations, and Design Considerations

Locally shifted attention offers substantial efficiency and inductive bias advantages, but with specific trade-offs:

Window size vs. context: Small $M$ yields better locality but slower global context propagation; too large $M$ dilutes locality.
Cross-window propagation: Shifting is critical for breaking window isolation; overlap or group decomposition further ensures information continuity.
Task dependence: For strictly monotonic or strongly local tasks, local shifting is ideal; for global or dense prediction, hybrid approaches (e.g., LocAtViT’s learnable bias) retain best-of-both performance.
Implementation complexity: Overlapping windows, group decomposition, and bilinear aggregation (for smoothing) increase code complexity, but provide substantial memory and artifact improvements (Yang et al., 2022, Cai et al., 2024).
Limitations: Gains are task- and backbone-specific; in backbones like Swin, LocAtViT’s explicit Gaussian bias provides limited additional benefit (Hajimiri et al., 5 Mar 2026).

In summary, locally shifted attention mechanisms unify a spectrum of strategies for balancing locality, efficiency, and global context in attention models. By leveraging spatial or temporal shifts—either explicitly through window realignment, overlap, or implicitly through learned priors—they achieve scalable, artifact-resistant modeling across sequence, vision, and generative domains, and now represent a foundational building block for modern scalable neural architectures (Tjandra et al., 2017, Li et al., 2023, Gu et al., 29 Jul 2025, Hajimiri et al., 5 Mar 2026, Yang et al., 2022, Cai et al., 2024, Khadka et al., 10 Sep 2025, Sheynin et al., 2021).