LSRA: Long-Short Range Attention

Updated 30 November 2025
  • LSRA is a method that decomposes Transformer attention into local and global components, enhancing efficiency and stability in handling varied dependencies.
  • It employs techniques such as head decomposition, dynamic projection, and branch-wise convolution to mitigate logit explosion and optimize computational cost.
  • Empirical studies demonstrate improved BLEU, ROUGE, and ImageNet accuracy while reducing GPU time and resource demands in language, translation, and vision tasks.

Long-Short Range Attention (LSRA) encompasses a family of architectural approaches for decomposing the attention mechanism in Transformer models into distinct local (short-range) and global (long-range) components. This decomposition addresses challenges in both learning efficiency and computational complexity that arise from the standard multi-head self-attention (MHSA) paradigm, particularly when modeling sequences where both immediate context and distant dependencies are critical. LSRA has been realized through several technical designs, including specialized head assignment (Hajra, 21 May 2025), dynamic projection methods (Zhu et al., 2021), and branch-wise decomposition with convolutional modeling (Wu et al., 2020), each targeting stability, efficiency, or resource-constrained deployment.

1. Core Formulations of LSRA

Three implementations of LSRA have been proposed:

  • Head Decomposition (LS-attention): The self-attention heads are split into $H_{\text{local}}$ short-range heads, which attend only within a local window via banded attention masks, and $H_{\text{global}}$ long-range heads, which use standard global attention (Hajra, 21 May 2025). The full output is a concatenation of all head results, projected to the model dimension (a code sketch of this variant is given below).

$$
\begin{aligned}
A^{(l,i)} &= Q^{(l,i)} (K^{(l,i)})^{\top} / \sqrt{d_k} + M_{\mathrm{local}} \\
A^{(g,j)} &= Q^{(g,j)} (K^{(g,j)})^{\top} / \sqrt{d_k} + M_{\mathrm{global}} \\
\operatorname{LSAttn}(X) &= \operatorname{Concat}\!\left([O^{(l,0)}, \ldots, O^{(l,s-1)}, O^{(g,0)}, \ldots, O^{(g,l-1)}]\right) W_O
\end{aligned}
$$

where $s$ and $l$ denote the numbers of short-range and long-range heads, respectively.

  • Parallel Streams With Dynamic Projection: In Long-Short Transformer (Transformer-LS), each attention head is decomposed into a local stream (windowed short-term attention) and a global stream leveraging dynamic low-rank projection of all keys/values (Zhu et al., 2021). The two streams are normalized independently and concatenated before attention is applied.
  • Branch-wise Split With Convolution: In Lite Transformer, the hidden representation is split along the channel axis into a local branch (modeled using lightweight convolution) and a global branch (standard self-attention on a reduced subspace) (Wu et al., 2020). Outputs are concatenated and processed by a feed-forward network.

These formulations retain compatibility with the classic Transformer block, and typically replace or augment standard MHSA modules.
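
To make the head-decomposition formulation concrete, the following sketch reproduces the computation in the equations above for a single layer. It is an illustrative reconstruction rather than the reference implementation of (Hajra, 21 May 2025); the function names (banded_mask, ls_attention), the omission of the output projection $W_O$, and the absence of causal masking are simplifications.

import torch

def banded_mask(n, span):
    # Additive mask M_local: 0 inside the band |i - j| <= span, -inf outside.
    idx = torch.arange(n)
    outside = (idx[None, :] - idx[:, None]).abs() > span
    return torch.zeros(n, n).masked_fill(outside, float("-inf"))

def ls_attention(Q, K, V, num_local, span):
    # Q, K, V: (H, n, d_k). The first `num_local` heads are short-range heads
    # restricted to a local band; the remaining heads attend globally.
    H, n, d_k = Q.shape
    scores = Q @ K.transpose(-1, -2) / d_k ** 0.5      # (H, n, n) logits
    masks = torch.zeros(H, n, n)
    masks[:num_local] = banded_mask(n, span)           # M_local for local heads; M_global = 0 here
    out = torch.softmax(scores + masks, dim=-1) @ V    # per-head outputs O^(.,.)
    return out.transpose(0, 1).reshape(n, H * d_k)     # concatenate heads (W_O projection omitted)

# Example: 7 local heads with span 50 plus 1 global head on a length-512 sequence.
H, n, d_k = 8, 512, 32
Q, K, V = (torch.randn(H, n, d_k) for _ in range(3))
y = ls_attention(Q, K, V, num_local=H - 1, span=50)    # shape (512, 256)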

2. Theoretical Motivation and Instability Analysis

The principal motivation for LSRA is the limitation of vanilla MHSA in modeling dense local dependencies when the sequence length $n$ becomes large. In autoregressive and language modeling tasks, the true attention matrix should be densely banded, reflecting the fact that tokens rely primarily on their neighbors. Standard MHSA, with $O(n \cdot d)$ degrees of freedom, is inefficient at modeling a banded $O(n \cdot l)$ dependency pattern and instead drives the logits $QK^\top$ to unmanageably high magnitudes ("logit explosion") to approximate this with a global softmax. This effect is empirically observed as baseline max-logits reaching 20–100× those seen in LS-attention for $n = 2048$ (Hajra, 21 May 2025).

This instability leads to frequent divergence or loss spikes when training conventional global self-attention on long sequences. By directly allocating computation to explicit local and global streams or heads, LSRA controls the distribution of dependency modeling, resulting in much more stable logit distributions and smoother optimization dynamics.

3. Computational Complexity and Efficiency

A comparative summary of computational and memory costs across LSRA designs:

| Architecture | Time Complexity | Memory per Head | Comments |
|---|---|---|---|
| Vanilla MHSA | $O(H n^2 d_k)$ | $O(n^2)$ | All-pairs attention |
| Head-Decomp. LS-attn | $O(l n^2 d_k + s n p d_k)$ | $O(l n)$ | $p$ is the local span |
| Transformer-LS LSRA | $O(h n d_k (w + r))$ | $O(n (w + r))$ | $w + r \ll n$ |
| Lite Transformer LSRA | $O(1.5 N d^2 + 0.5 N^2 d)$ | — | Branch-wise split |

This decomposition leads to linear or subquadratic scaling in sequence length for $l \ll H$ and $w + r \ll n$, and supports substantial improvements in both hardware wall-clock time and deployment cost. For example, in LS-attention (Hajra, 21 May 2025), inference speedups of up to 36% and >20× savings in GPU-hours for stable training are reported.
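
As a rough numerical illustration of how these expressions scale, the snippet below evaluates the first three rows of the table for a few sequence lengths. The head counts, spans, and projection sizes ($H = 8$, $d_k = 64$, $l = 1$, $s = 7$, $p = 50$, $w = r = 128$) are illustrative placeholders, not values from the cited papers, and constant factors are ignored.

def attention_cost(n, H=8, d_k=64, l=1, s=7, p=50, w=128, r=128):
    # Operation counts following the asymptotic expressions in the table above.
    return {
        "vanilla_mhsa":   H * n**2 * d_k,
        "head_decomp_ls": l * n**2 * d_k + s * n * p * d_k,
        "transformer_ls": H * n * d_k * (w + r),
    }

for n in (2048, 8192, 32768):
    print(n, attention_cost(n))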

4. Empirical Performance

Language Modeling

On PG-19 with auto-regressive loss, full-range Flash-attention exhibits loss divergence for long sequences ($n = 2048$ or $n = 8192$), while LS-attention with only one global head and the remainder as local heads maintains perfect stability. Even minimal splits (1 global + 1 local head) suffice for stability (Hajra, 21 May 2025). Comparable patterns are reported on the enwik8 benchmark (Zhu et al., 2021).

Machine Translation and Summarization

Lite Transformer (using LSRA) demonstrates consistent BLEU improvements over the vanilla Transformer under equal MAdd budgets. On WMT’14 En→Fr, LSRA delivers +1.7 BLEU over the baseline Transformer at ≈100M MAdds; for CNN–DailyMail summarization, it cuts parameters by 2.5× and FLOPs by 2.4× with no meaningful drop in ROUGE (Wu et al., 2020).

Vision Tasks

On ImageNet, architectures such as CvT-LS and ViL-LS with LSRA achieve top-1 accuracy of 84.1% (ViL-LS-Base, 56M params) and outperform full-attention variants while running at half or less of the FLOPs (Zhu et al., 2021).

5. Architectural Integration and Implementation Techniques

Head Assignment and Masking

In the head decomposition approach, practical setups assign $H - 1$ heads to local attention and a single head to global attention; the local span $p$ is set according to the expected dependency length (e.g., $p = 50$ for $n \leq 2048$, $p = 100$ for longer sequences) (Hajra, 21 May 2025).
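
A minimal helper capturing this recipe might look as follows; the function name and the returned keys are hypothetical, and the span thresholds simply restate the values quoted above.

def ls_head_config(n, H):
    # H - 1 short-range heads, one long-range head; local span chosen by sequence length.
    span = 50 if n <= 2048 else 100
    return {"num_local": H - 1, "num_global": 1, "span": span}

print(ls_head_config(n=2048, H=8))   # {'num_local': 7, 'num_global': 1, 'span': 50}
print(ls_head_config(n=8192, H=8))   # {'num_local': 7, 'num_global': 1, 'span': 100}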

Branch Normalization

Transformer-LS employs a Dual LayerNorm (DualLN), with independent $\gamma, \beta$ parameters for the local and global streams, to address initialization scale mismatches and ensure unbiased gradient flow (Zhu et al., 2021).
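
A minimal sketch of such a dual normalization, assuming the local and global streams each produce an (n, d)-shaped output that is concatenated along the feature axis; the class name and forward signature are assumptions, not the Transformer-LS reference code.

import torch
import torch.nn as nn

class DualLN(nn.Module):
    # Two independent LayerNorms (separate gamma/beta) for the local and global
    # streams, applied before the streams are concatenated.
    def __init__(self, d):
        super().__init__()
        self.ln_local = nn.LayerNorm(d)
        self.ln_global = nn.LayerNorm(d)

    def forward(self, local_out, global_out):
        return torch.cat([self.ln_local(local_out), self.ln_global(global_out)], dim=-1)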

Pseudocode Structure

A representative block for LSRA with local/global split:

def TransformerBlock(X):
    # Pre-norm residual block with long-short attention in place of MHSA.
    Y1 = LayerNorm(X)
    A = LS_Attn(Y1)        # local + global heads/streams, concatenated and projected
    X2 = X + A             # residual connection around the attention sublayer
    Y2 = LayerNorm(X2)
    Z = FeedForward(Y2)
    return X2 + Z          # residual connection around the feed-forward sublayer

The LS_Attn function handles both head splitting and stream/mask selection. For Lite Transformer, the main block equally partitions inputs and applies convolution and self-attention on separate branches (Wu et al., 2020).
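
The branch-wise variant can be sketched as below. This is an approximation under stated assumptions: Lite Transformer pairs the attention branch with lightweight/dynamic convolutions and flattened feed-forward layers, whereas here a plain depthwise convolution stands in for the local branch, residual connections are omitted, and the class and parameter names are placeholders.

import torch
import torch.nn as nn

class LSRABranchBlock(nn.Module):
    # Branch-wise long-short range attention: half the channels go through a
    # depthwise convolution (local branch), the other half through multi-head
    # self-attention (global branch); outputs are concatenated and fed to an FFN.
    def __init__(self, d_model=256, nhead=4, kernel_size=7):
        super().__init__()
        half = d_model // 2
        self.conv = nn.Conv1d(half, half, kernel_size, padding=kernel_size // 2, groups=half)
        self.attn = nn.MultiheadAttention(half, nhead, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, x):                        # x: (batch, seq, d_model)
        local, global_ = x.chunk(2, dim=-1)      # split along the channel axis
        local = self.conv(local.transpose(1, 2)).transpose(1, 2)
        global_, _ = self.attn(global_, global_, global_)
        return self.ffn(torch.cat([local, global_], dim=-1))

block = LSRABranchBlock()
out = block(torch.randn(2, 64, 256))             # (batch=2, seq=64, d_model=256)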

Compression and Hardware Adaptation

Pruning and quantization can be applied efficiently to LSRA-based models, yielding up to 18.2× size compression with minimal performance loss, especially suited for edge deployment (Wu et al., 2020).
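
As a generic illustration of this kind of post-training compression (not the pipeline of Wu et al., 2020, and with no claim to the 18.2× figure), standard PyTorch utilities can magnitude-prune and dynamically quantize a model's linear layers; the stand-in model below is purely illustrative.

import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Stand-in for the linear layers of an LSRA-based model (illustrative only).
model = nn.Sequential(nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 512))

# Unstructured magnitude pruning of 50% of each weight matrix.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.5)
        prune.remove(module, "weight")           # bake the pruning mask into the weights

# Post-training dynamic quantization of the linear layers to int8.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)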

6. Practical Recommendations and Hyperparameter Selection

  • Head split: One global head, with the remaining heads local (see the configuration sketch after this list).
  • Local span $p$: Determined by the expected dependency length (e.g., 50–100).
  • Precision: Mixed BF16 suffices for LS-attention; full FP32 not required.
  • Optimization: AdamW with cosine LR decay, typical warmup, and large batch sizes maintain stability (Hajra, 21 May 2025).
  • Branch normalization: Independent LayerNorm for streams (DualLN) is critical for Transformer-LS convergence.
  • Deployment: Replace MHSA with LSRA at drop-in level, keeping the rest of the architecture and hyperparameters unchanged.
  • Compression: Prune and quantize both convolutional and linear layers for mobile/edge applications (Wu et al., 2020).
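
These recommendations can be captured in a single configuration sketch; the key names below are hypothetical and only restate the values listed above.

ls_attention_config = {
    "num_global_heads": 1,        # one global head, the rest local
    "num_local_heads": 7,         # e.g., H = 8 total heads (illustrative)
    "local_span": 50,             # p = 50 for n <= 2048, p = 100 for longer sequences
    "precision": "bf16",          # mixed BF16 suffices; full FP32 not required
    "optimizer": "adamw",         # AdamW with cosine LR decay and warmup
    "dual_layernorm": True,       # independent LayerNorms for local/global streams
}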

7. Empirical Comparisons and Limitations

LSRA designs consistently outperform or match full-attention baselines in both accuracy and resource demands across language and vision domains. Exclusive reliance on global or local mechanisms underperforms joint models by 1–2 percentage points on long-range tasks (Zhu et al., 2021). For extremely long sequences, LSRA enables training with sequence lengths 2–3× those feasible with vanilla attention on typical hardware (Zhu et al., 2021). A plausible implication is that neglecting either local or global context introduces significant performance degradation, highlighting the necessity of LSRA decomposition for robust scaling.
