Long Short Distance Attention (LSDA)

Updated 22 February 2026
  • LSDA is a family of Transformer attention mechanisms dividing interactions into local (short-range) and global (long-range) components.
  • It lowers the $O(n^2)$ cost of self-attention while preserving context modeling across both sequences and grid-based modalities.
  • LSDA methods are applied in NLP and CV, boosting Transformer efficiency and accuracy across varied tasks.

Long Short Distance Attention (LSDA) encompasses a family of Transformer attention mechanisms that explicitly decompose interactions into short-range (local, neighborhood) and long-range (global, cross-block) components. This architectural paradigm aims to address the prohibitive $O(n^2)$ cost of full self-attention while maintaining the ability to model both fine-grained dependencies and global context, in both sequence and grid-based modalities. LSDA has emerged independently in NLP and computer vision (CV) settings, with instantiations including LSG Attention (Condevaux et al., 2022), CrossFormer/CrossFormer++ LSDA (Wang et al., 2021, Wang et al., 2023), Long Short-attention (LS-attention) (Hajra, 21 May 2025), and Long-Short Range Attention (LSRA) (Wu et al., 2020). Common to all is the structural separation—or alternation—of dense local context modeling from sparser global routing, implemented with different algorithmic schemes depending on modality and use case.

1. Mathematical Foundations and Variants

Several principal formalizations of LSDA exist, distinguished by the granularity of the decomposition and the mechanism for mixing local and global signals.

Block-based Local/Sparse/Global (LSG):

The LSG architecture (Condevaux et al., 2022) operates over a sequence of length $n$ by partitioning tokens into $n_b = n / b_t$ non-overlapping blocks of size $b_t$. LSG merges three components per attention head:

  • Local: Each token in block $i$ attends densely to all tokens in blocks $i-1$, $i$, $i+1$, comprising its immediate neighborhood. The local context size per query is at most $3 b_t$.
  • Sparse: Two additional windows per block sample $b_t / f$ tokens each, using head-specific strategies (strided, block-strided, average-pooling, max-norm, or one-round LSH clustering).
  • Global: $g$ global tokens, learned and prepended, attend to and are attended by all $n$ sequence tokens.

Combined, the query in block $i$ attends to up to $3 b_t + 2(b_t / f) + g$ keys—a count constant in $n$. Complexity per layer is $O(n d_h)$.
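To make this bound concrete, the per-query key budget can be computed directly. A minimal sketch, using the document's $b_t = 128$, $f = 4$ and an illustrative choice of $g = 1$ (the number of global tokens is a tunable, not a fixed default):

```python
# Per-query key budget for LSG attention: 3*b_t + 2*(b_t / f) + g.
# b_t = 128 and f = 4 follow the settings cited in this document;
# g = 1 is an illustrative choice (the global-token count is configurable).
def lsg_keys_per_query(b_t: int, f: int, g: int) -> int:
    local = 3 * b_t          # dense window over blocks i-1, i, i+1
    sparse = 2 * (b_t // f)  # two subsampled windows of b_t / f tokens
    return local + sparse + g

for n in (1024, 4096, 16384):
    # The budget is independent of sequence length n, hence O(n d_h) per layer.
    print(n, lsg_keys_per_query(b_t=128, f=4, g=1))
```

Because the key count per query is fixed, total cost grows linearly with the number of queries $n$.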

Head-based Local/Global (LS-attention, LSRA):

Long Short-attention (LS-attention) (Hajra, 21 May 2025) and Long-Short Range Attention (LSRA) (Wu et al., 2020) decompose self-attention across heads:

  • $H_s$ local heads attend to a window of $p \ll n$ neighbors ($|i - j| \le p$), using banded or convolutional operations.
  • $H_\ell$ global heads attend unrestrictedly (full or causal mask).
  • Each head uses standard query/key/value projections, with local masks ensuring each only computes attention over its targeted region.
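The head split above can be sketched with explicit masks. This is a minimal single-layer illustration; the shapes, the banded-mask construction, and the head ordering are illustrative choices, not a specific paper's implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def ls_attention(q, k, v, n_local_heads, p):
    """Head-based local/global attention sketch.
    q, k, v: (heads, n, d_head). The first n_local_heads heads see only
    a banded neighborhood |i - j| <= p; the remaining heads attend globally."""
    H, n, d = q.shape
    idx = np.arange(n)
    band = np.abs(idx[:, None] - idx[None, :]) <= p   # (n, n) local mask
    out = np.empty_like(q)
    for h in range(H):
        logits = q[h] @ k[h].T / np.sqrt(d)
        if h < n_local_heads:                          # local head: mask out distant keys
            logits = np.where(band, logits, -np.inf)
        out[h] = softmax(logits) @ v[h]
    return out

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((6, 32, 8)) for _ in range(3))
y = ls_attention(q, k, v, n_local_heads=5, p=4)  # H_s = 5 local, H_l = 1 global
print(y.shape)  # (6, 32, 8)
```

In a real layer the per-head outputs would be concatenated and linearly projected; banded heads can also be computed without materializing the full $n \times n$ mask.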

Grid-based Alternating Local/Long (Vision LSDA):

In vision, input tokens from spatial grids ($S \times S$) are grouped:

  • SDA (Short-Distance Attention): Each group consists of contiguous $G \times G$ spatial neighbors, with attention computed internally per group ($O(S^2 G^2)$ cost).
  • LDA (Long-Distance Attention): Groups are formed by strided sampling with interval $I$; tokens that are spatially distant become group neighbors. Each group undergoes standard attention, then outputs are scattered back to their native positions.
  • In CrossFormer/CrossFormer++ (Wang et al., 2021, Wang et al., 2023), SDA and LDA alternate across blocks, leveraging cross-scale embedding (CEL) and dynamic position bias (DPB).
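The strided regrouping behind LDA can be expressed as a reshape/transpose. A minimal sketch on a small grid (the sizes are illustrative, not a model's actual settings):

```python
import numpy as np

# Sketch of LDA-style strided grouping on an S x S token grid.
# With interval I, token (y, x) joins the group indexed by (y % I, x % I),
# so group members sit I positions apart and spatially distant tokens
# become attention neighbors.
S, I = 8, 4
G = S // I                      # each group holds G x G tokens
tokens = np.arange(S * S).reshape(S, S)

# reshape (S, S) -> (G, I, G, I), then move the strided offsets to the
# leading axes: groups[b * I + d] lists the G*G tokens of group (b, d).
groups = tokens.reshape(G, I, G, I).transpose(1, 3, 0, 2).reshape(I * I, G * G)

print(groups[0])   # group 0 gathers tokens with stride I in both axes
```

Attention then runs independently within each row of `groups`, and the outputs are scattered back via the inverse permutation; SDA is the same construction with contiguous rather than strided grouping.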

2. Architectural Integration

LSDA mechanisms are incorporated at the multi-head attention (MHA) layer or block level, with the following strategies:

  • LSG (Condevaux et al., 2022): All heads use the local+sparse+global mask for their attention calculation; the number of keys is bounded for each query. Transformers can be converted post hoc by replacing full self-attention with LSG attention and expanding the positional embedding table.
  • LS-attention (Hajra, 21 May 2025): The heads are statically divided (e.g., $H_s = 5$ local, $H_\ell = 1$ global). Per-head projections are computed, attention is run with a distinct mask per head, and the outputs are concatenated across heads, then linearly projected.
  • LSRA (Wu et al., 2020): Inputs are split along channels, executing global (MHSA) on half and local (light conv or dynamic conv) on the other, fusing outputs and following with feed-forward layers.
  • Vision LSDA (Wang et al., 2021, Wang et al., 2023): Each Transformer block alternates between SDA and LDA; group sizes ($G$) and strides ($I$) are stage-dependent and can be made progressive (PGS), growing the receptive field deeper in the network.
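The LSRA split/fuse structure can be sketched as follows. Projections and normalization are omitted, and single-head attention plus a depthwise convolution stand in for the full global/local branches, so this illustrates only the channel-split architecture, not the actual block:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def lsra_block(x, conv_kernel):
    """LSRA-style channel split (sketch). x: (n, d); conv_kernel: (k, d // 2).
    Half the channels go through global self-attention, half through a
    local depthwise 1D convolution; the branch outputs are concatenated."""
    n, d = x.shape
    g, l = x[:, : d // 2], x[:, d // 2 :]          # channel split

    # Global branch: plain scaled dot-product self-attention on half the channels.
    att = softmax(g @ g.T / np.sqrt(d // 2)) @ g

    # Local branch: depthwise convolution, window k, 'same' padding.
    k = conv_kernel.shape[0]
    pad = np.pad(l, ((k // 2, k - 1 - k // 2), (0, 0)))
    loc = np.stack([(pad[i : i + k] * conv_kernel).sum(axis=0) for i in range(n)])

    return np.concatenate([att, loc], axis=1)      # fuse branches

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 8))
y = lsra_block(x, conv_kernel=rng.standard_normal((3, 4)))
print(y.shape)  # (16, 8)
```

The design point is that each branch runs on half the width, so the quadratic attention term only pays for $d/2$ channels while the convolution handles local structure cheaply.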

3. Computational Complexity and Efficiency

LSDA instantiates linear or near-linear attention cost in both sequence and image domains, replacing the quadratic $O(n^2)$ or $O(S^4)$ cost with $O(n)$ or $O(S^2 G^2)$:

| Variant | Complexity per Layer | Local Context | Global Context |
| --- | --- | --- | --- |
| LSG (Condevaux et al., 2022) | $O(n d_h)$ | $3 b_t$ | $g$ tokens, sparse |
| LS-attention (Hajra, 21 May 2025) | $O(H_s n p + H_\ell n^2)$ | $p$ | Quadratic in $H_\ell$ |
| Vision LSDA (Wang et al., 2021) | $O(S^2 G^2 d)$ | $G \times G$ | Strided $G \times G$ |
| LSRA (Wu et al., 2020) | $O(N^2 d / 2) + O(N k d / 2)$ | Conv window $k$ | Global (full) |

Empirically, LSG attention achieves a training step time of $\approx 1.5$ s/step and memory use of $\approx 32$ GB for 4096 tokens, outperforming Longformer and BigBird in speed with similar or lower memory use. CrossFormer-S (with LSDA) achieves attention cost reductions of up to $1/64$ compared to global attention at high spatial resolutions (Condevaux et al., 2022, Wang et al., 2021, Wang et al., 2023).
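The $1/64$ figure follows from the cost ratio of grouped to global attention, $(S^2 G^2 d) / (S^4 d) = (G/S)^2$. A quick arithmetic check under illustrative stage settings ($S = 56$, $G = 7$, plausible for an early stage on a 224×224 input with 4× patch embedding; exact settings vary by model and stage):

```python
# Ratio of grouped (SDA/LDA) attention cost to global attention cost
# on an S x S grid: (S^2 * G^2 * d) / (S^4 * d) = (G / S)^2.
# S = 56, G = 7 are illustrative early-stage values, not fixed defaults.
def cost_ratio(S: int, G: int) -> float:
    grouped = S * S * G * G   # G^2 cost per token over S^2 tokens
    global_ = S ** 4          # every token attends to all S^2 tokens
    return grouped / global_

print(cost_ratio(56, 7))   # 0.015625 == 1/64
```

The saving shrinks at deeper stages where the grid is smaller and $G/S$ is larger, which is why the reduction is quoted as "up to" $1/64$.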

4. Practical Adaptation and Implementation

LSDA is compatible with a wide range of Transformer architectures and can frequently be integrated with minimal changes:

  • Conversion Tools: Scripts exist to convert HuggingFace BERT, RoBERTa, DistilBERT, and BART checkpoints to LSG attention, simply by replacing attention modules and expanding positional embeddings (Condevaux et al., 2022).
  • Hyperparameters: Key tunables include block size $b_t$, sparsity factor $f$, number of global tokens $g$, group size $G$, stride $I$, and the local window $p$. For head-based variants, a typical split is $H_\ell = 1$ global head and $H_s = H - 1$ local heads (Hajra, 21 May 2025).
  • Mixed-precision and Acceleration: Fast attention kernels (e.g., FlashAttention-2/3) are applicable for both local and global heads. LS-attention and LSRA can reduce inference latency by up to 36% at sequence length 8192 (Hajra, 21 May 2025).
  • Progressive Design: In vision, group size is adapted per stage, shallow layers prioritizing smaller groups (local attention), deep layers growing global context (PGS) (Wang et al., 2023).

5. Empirical Evaluation and Comparative Performance

LSDA-based models consistently outperform or match standard full-attention and windowed/sparse baselines across domains:

  • NLP (LSG, LS-attention, LSRA):
    • LSG (block size $b_t = 128$, $f = 4$) matches or exceeds Longformer/BigBird in classification and summarization on long documents. For BERT-class models adapted with LSG, masked LM accuracy degrades minimally—unlike the catastrophic collapse of vanilla models extrapolated to longer sequences (Condevaux et al., 2022).
    • LS-attention reduces LM perplexity to about $2/5$ of that obtained with QK-norm stabilizers and matches FlashAttention at $1/20$ of the GPU-hours (Hajra, 21 May 2025).
    • Lite Transformer with LSRA exceeds baselines by 1–2 BLEU on translation and improves language modeling by 1.8 PPL under strict compute constraints (Wu et al., 2020).
  • Vision (LSDA, CrossFormer/++):
    • LSDA improves Top-1 ImageNet-1K accuracy by 0.6–1.2% over Swin/PVT-style baselines; removal of either SDA or LDA degrades accuracy by over 1%. On COCO, CrossFormer++-S (LSDA backbone) attains +0.7 AP over CrossFormer-S (Wang et al., 2023, Wang et al., 2021).
    • Ablation replacing LSDA with alternative sparse attentions in CrossFormer++ confirms the unique gain (+0.4–1.4% Top-1) of the local/global alternation (Wang et al., 2023).

6. Theoretical and Empirical Rationale

LSDA schemes are motivated by the inadequacies of global MHSA alone, especially for:

  • Short-range dependency binding: Standard global MHSA is rank-deficient for expressing dense local banded dependencies when $n \gg d$, leading to logit explosion and instability (Hajra, 21 May 2025).
  • Computation-accuracy tradeoff: Local-only attention (e.g. windows, convolution) sacrifices global context, while global-only is inefficient and numerically brittle. By decomposing, each submodule operates in its optimal regime.
  • Adaptability: LSDA variants improve robustness to input length/distribution shift (e.g., LSG adaptation allows direct extrapolation to longer sequences without pretraining, provided positional embeddings are extended appropriately (Condevaux et al., 2022)).
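The rank-deficiency point can be checked numerically: attention logits $Q K^\top$ have rank at most $d$, while a dense banded dependency pattern over $n$ tokens generically has rank close to $n$, so for $n \gg d$ global attention cannot represent it exactly. A small sketch (sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, p = 64, 8, 2

# Attention logits Q @ K.T factor through d dimensions: rank <= d.
Q, K = rng.standard_normal((n, d)), rng.standard_normal((n, d))
logits_rank = np.linalg.matrix_rank(Q @ K.T)

# A banded "attend to |i - j| <= p" pattern has rank close to n.
idx = np.arange(n)
band = (np.abs(idx[:, None] - idx[None, :]) <= p).astype(float)
band_rank = np.linalg.matrix_rank(band)

print(logits_rank, band_rank)  # logits_rank <= d; band_rank close to n
```

Restricting a head to the band (as local heads do) removes the need to approximate this high-rank pattern through a rank-$d$ factorization.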

7. Limitations and Prospects

LSDA approaches are not without tradeoffs:

  • Hyperparameter dependence: Optimal block/group/window sizes and head splits are task- and distribution-specific; there is no universally superior setting. Performance can be sensitive to these parameters.
  • Scalability: While asymptotic cost is reduced, absolute compute may remain significant at extreme sequence/grid sizes due to constants (e.g., $b_t$, $G$).
  • Heterogeneous architectures: Adapting conversion tools and LSDA modules across nonstandard Transformer architectures may require per-variant engineering effort (Condevaux et al., 2022).
  • Dynamic allocation: Static assignment (fixed local/global splits) can be suboptimal. Future directions propose learnable or mixture-of-expert-based splits, continuous routing, or block-wise relative position embedding (Condevaux et al., 2022, Wang et al., 2023).

Extensions being investigated include mixing block-local/sparse/global scaffolding with dynamic convolution or kernel attention methods, integrating relative positional encoding within blocks, and dynamic or learned allocation of global attention capacity.


The LSDA paradigm—across LSG attention, head-based local/global decomposition, and vision block alternation—demonstrates consistent benefits in efficiency, stability, and accuracy for both language and vision transformers, establishing itself as a universal principle for scalable attention architectures (Condevaux et al., 2022, Wang et al., 2021, Wang et al., 2023, Hajra, 21 May 2025, Wu et al., 2020).
