Long Short Distance Attention (LSDA)
- LSDA is a family of Transformer attention mechanisms dividing interactions into local (short-range) and global (long-range) components.
- It lowers the $O(n^2)$ cost of self-attention while preserving context modeling across both sequences and grid-based modalities.
- LSDA methods are applied in NLP and CV, boosting Transformer efficiency and accuracy across varied tasks.
Long Short Distance Attention (LSDA) encompasses a family of Transformer attention mechanisms that explicitly decompose interactions into short-range (local, neighborhood) and long-range (global, cross-block) components. This architectural paradigm aims to address the prohibitive cost of full self-attention while maintaining the ability to model both fine-grained dependencies and global context, in both sequence and grid-based modalities. LSDA has emerged independently in NLP and vision (CV) settings, with instantiations including LSG Attention (Condevaux et al., 2022), CrossFormer/CrossFormer++ LSDA (Wang et al., 2021, Wang et al., 2023), Long Short-attention (LS-attention) (Hajra, 21 May 2025), and Long-Short Range Attention (LSRA) (Wu et al., 2020). Common to all is the structural separation—or alternation—of dense local context modeling from sparser global routing, implemented with different algorithmic schemes depending on modality and use case.
1. Mathematical Foundations and Variants
Several principal formalizations of LSDA exist, distinguished by the granularity of the decomposition and the mechanism for mixing local and global signals.
Block-based Local/Sparse/Global (LSG):
The LSG architecture (Condevaux et al., 2022) operates over a sequence of length $n$ by partitioning tokens into non-overlapping blocks of size $b$. LSG merges three components per attention head:
- Local: Each token in block $i$ attends densely to all tokens in blocks $i-1$, $i$, $i+1$, comprising its immediate neighborhood. The local context size per query is at most $3b$.
- Sparse: Two additional windows per block contribute $b$ sampled tokens each, selected by a head-specific strategy (strided, block-strided, average-pooling, max-norm, or one-round LSH clustering) governed by a sparsity factor $f$.
- Global: $g$ global tokens, learned and prepended, attend to and are attended by all sequence tokens.
Combined, a query in block $i$ attends to at most $3b + 2b + g$ keys—a count constant in $n$. Complexity per layer is $O(n(b+g))$, i.e., linear in $n$.
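The per-block bookkeeping above can be sketched in plain Python. The exact window layout and the strided sparse scheme below are illustrative assumptions, not the reference LSG implementation; the point is that the number of keys per query block is independent of $n$:

```python
def lsg_key_indices(n, b, f, g, block_idx):
    """Illustrative sketch of which keys one query block attends to under
    LSG-style local + sparse + global attention.
    n: sequence length, b: block size, f: sparsity factor, g: global tokens."""
    lo, hi = block_idx * b, (block_idx + 1) * b
    # Local: the block plus its immediate left/right neighbors -> at most 3b keys.
    local = [t for t in range(lo - b, hi + b) if 0 <= t < n]
    # Sparse (strided variant, one of several head-specific strategies):
    # every f-th token from an extended window of b*f positions on each side
    # of the local neighborhood -> up to 2b keys, regardless of n.
    left = range(max(0, lo - b - b * f), lo - b, f)
    right = range(hi + b, min(n, hi + b + b * f), f)
    sparse = list(left) + list(right)
    # Global: g learned tokens prepended to the sequence, visible to every
    # query (represented here as negative indices for clarity).
    global_ = [-(j + 1) for j in range(g)]
    return local, sparse, global_

local, sparse, global_ = lsg_key_indices(n=4096, b=128, f=4, g=1, block_idx=16)
```

Doubling $n$ leaves the key count of an interior block unchanged, which is exactly the constant-per-query property the complexity bound relies on.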
Head-based Local/Global (LS-attention, LSRA):
Long Short-attention (LS-attention) (Hajra, 21 May 2025) and Long-Short Range Attention (LSRA) (Wu et al., 2020) decompose self-attention across heads:
- $h_\ell$ local heads attend to a window of $w$ neighbors ($w \ll n$), using banded or convolutional operations.
- $h_g$ global heads attend unrestrictedly (full or causal mask).
- Each head uses standard query/key/value projections, with local masks ensuring each local head computes attention only over its targeted region.
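The head split can be illustrated with a minimal NumPy sketch. The per-head projections are simplified to channel slices (no learned weights), so this shows only the masking scheme, not a trainable layer:

```python
import numpy as np

def ls_attention(x, n_heads, n_global, w):
    """Head-based long/short attention sketch: the first n_global heads
    attend everywhere; the remaining heads are restricted to a banded
    local window of half-width w."""
    n, d = x.shape
    d_h = d // n_heads
    outs = []
    for h in range(n_heads):
        q = k = v = x[:, h * d_h:(h + 1) * d_h]   # stand-in projections
        logits = q @ k.T / np.sqrt(d_h)
        if h >= n_global:                          # local head: banded mask
            i = np.arange(n)
            band = np.abs(i[:, None] - i[None, :]) <= w
            logits = np.where(band, logits, -np.inf)
        p = np.exp(logits - logits.max(-1, keepdims=True))
        p /= p.sum(-1, keepdims=True)
        outs.append(p @ v)
    # Concatenate heads; a real layer would apply an output projection here.
    return np.concatenate(outs, axis=-1)

y = ls_attention(np.random.randn(16, 8), n_heads=4, n_global=1, w=2)
```

Only the global heads incur $O(n^2)$ work; the banded heads can be computed in $O(nw)$ with a dedicated kernel.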
Grid-based Alternating Local/Long (Vision LSDA):
In vision, input tokens from an $H \times W$ spatial grid are grouped:
- SDA (Short-Distance Attention): Each group consists of $G \times G$ contiguous spatial neighbors, with attention computed within each group ($O(HWG^2)$ total cost).
- LDA (Long-Distance Attention): Groups are formed by strided sampling with interval $I$; tokens that are spatially distant become group neighbors. Each group undergoes standard attention, then outputs are scattered back to their native positions.
- In CrossFormer/CrossFormer++ (Wang et al., 2021, Wang et al., 2023), SDA and LDA alternate across blocks, leveraging cross-scale embedding layers (CEL) and dynamic position bias (DPB).
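The LDA grouping can be implemented as a reshape/transpose over the token grid. The trick below is one possible implementation, shown for token indices only; it demonstrates how stride-$I$ sampling places spatially distant positions into the same attention group:

```python
import numpy as np

def lda_groups(H, W, I):
    """Sketch of long-distance grouping: tokens sampled with stride I along
    each axis land in the same group, so each group mixes distant positions."""
    idx = np.arange(H * W).reshape(H, W)
    # Split each axis into (offset within stride, strided coordinate), then
    # move the offsets to the front: group (a, b) collects tokens at rows
    # a, a+I, a+2I, ... and columns b, b+I, b+2I, ...
    g = idx.reshape(H // I, I, W // I, I).transpose(1, 3, 0, 2)
    # I*I groups, each containing (H//I) * (W//I) tokens spaced I apart.
    return g.reshape(I * I, (H // I) * (W // I))

groups = lda_groups(H=8, W=8, I=4)
```

For an $8 \times 8$ grid with interval $I=4$, the first group contains tokens at positions $(0,0)$, $(0,4)$, $(4,0)$, $(4,4)$: maximally spread across the grid. SDA is the complementary reshape that keeps each $G \times G$ patch of contiguous tokens together.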
2. Architectural Integration
LSDA mechanisms are incorporated at the multi-head attention (MHA) layer or block level, with the following strategies:
- LSG (Condevaux et al., 2022): All heads use the local+sparse+global mask for their attention calculation; the number of keys is bounded for each query. Transformers can be converted post hoc by replacing full self-attention with LSG attention and expanding the positional embedding table.
- LS-attention (Hajra, 21 May 2025): The heads are statically divided into a fixed number of local and global heads. Per-head projections are computed, attention is run with a distinct mask per head, outputs are concatenated across heads, then linearly projected.
- LSRA (Wu et al., 2020): Inputs are split along channels, executing global (MHSA) on half and local (light conv or dynamic conv) on the other, fusing outputs and following with feed-forward layers.
- Vision LSDA (Wang et al., 2021, Wang et al., 2023): Each Transformer block alternates between SDA and LDA; group sizes ($G$) and strides ($I$) are stage-dependent and can be made progressive (PGS), growing the receptive field deeper in the network.
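The LSRA-style channel split above can be sketched as follows. Weights are fixed (untrained) and the local branch uses a plain window average as a stand-in for a learned light/dynamic convolution, so this illustrates only the data flow, not the actual module:

```python
import numpy as np

def lsra_block(x, k):
    """LSRA-style split: half the channels go through full self-attention
    (global context), the other half through a width-k local operation."""
    n, d = x.shape
    xg, xl = x[:, :d // 2], x[:, d // 2:]
    # Global branch: plain softmax attention over all positions.
    logits = xg @ xg.T / np.sqrt(d // 2)
    p = np.exp(logits - logits.max(-1, keepdims=True))
    p /= p.sum(-1, keepdims=True)
    out_g = p @ xg
    # Local branch: window average of width k (stand-in for light/dynamic conv).
    pad = k // 2
    xp = np.pad(xl, ((pad, pad), (0, 0)))
    out_l = np.stack([xp[i:i + k].mean(0) for i in range(n)])
    # Fuse the two halves; a feed-forward layer would follow in the real block.
    return np.concatenate([out_g, out_l], axis=-1)

y = lsra_block(np.random.randn(10, 8), k=3)
```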
3. Computational Complexity and Efficiency
LSDA instantiates linear or near-linear attention cost in both sequence and image domains, replacing the quadratic $O(n^2)$ or $O((HW)^2)$ cost with $O(n(b+g))$ or $O(HWG^2)$:
| Variant | Complexity per Layer | Local Context | Global Context |
|---|---|---|---|
| LSG (Condevaux et al., 2022) | $O(n(b+g))$ | $3b$ tokens + sparse | $g$ global tokens |
| LS-attention (Hajra, 21 May 2025) | $O(nwh_\ell + n^2 h_g)$ | Window $w$ | Full (quadratic in $n$) |
| Vision LSDA (Wang et al., 2021) | $O(HWG^2)$ | $G \times G$ group | Strided (interval $I$) |
| LSRA (Wu et al., 2020) | $O(n^2 + nk)$ on split channels | Conv window $k$ | Global (full) |
Empirically, LSG attention outperforms Longformer and BigBird in training speed at 4096-token sequences, with similar or lower memory. CrossFormer-S (with LSDA) achieves attention cost reductions up to $1/64$ compared to global attention at high spatial resolutions (Condevaux et al., 2022, Wang et al., 2021, Wang et al., 2023).
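The $1/64$ figure follows directly from the cost ratio of group-local versus global attention. The concrete resolution and group size below are illustrative assumptions chosen to reproduce that ratio, not values quoted from the papers:

```python
# Group-local (SDA) attention costs ~ (H*W) * G*G per layer, while global
# attention costs ~ (H*W)**2, so the cost ratio is G*G / (H*W).
H = W = 56   # illustrative high-resolution early stage
G = 7        # illustrative group size
ratio = (G * G) / (H * W)
# At this resolution the ratio works out to exactly 1/64.
```

The ratio shrinks further as resolution grows, which is why the savings are largest in the early, high-resolution stages of a vision backbone.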
4. Practical Adaptation and Implementation
LSDA is compatible with a wide range of Transformer architectures and can frequently be integrated with minimal changes:
- Conversion Tools: Scripts exist to convert HuggingFace BERT, RoBERTa, DistilBERT, and BART checkpoints to LSG attention, simply by replacing attention modules and expanding positional embeddings (Condevaux et al., 2022).
- Hyperparameters: Key tunables include the block size $b$, sparsity factor $f$, number of global tokens $g$, group size $G$, stride $I$, and the local window $w$. For head-based variants, a typical split assigns a small number of heads to global attention and the remainder to local attention (Hajra, 21 May 2025).
- Mixed-precision and Acceleration: Fast attention kernels (e.g., FlashAttention-2/3) are applicable for both local and global heads. LS-attention and LSRA can reduce inference latency by up to 36% at sequence length 8192 (Hajra, 21 May 2025).
- Progressive Design: In vision, group size is adapted per stage: shallow layers use smaller groups (local attention), while deeper layers grow global context (PGS) (Wang et al., 2023).
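One concrete step in the conversion workflow above is extending the absolute positional-embedding table to the new maximum length. The tiling scheme below is a simple illustrative choice, not necessarily what the published conversion scripts do; it shows only the required shape change:

```python
import numpy as np

def extend_positional_embeddings(pe, new_len):
    """Extend a pretrained absolute positional-embedding table (old_len, d)
    to new_len rows by repeating it, then truncating to the target length.
    Tiling is one simple option; interpolation is another common choice."""
    old_len, d = pe.shape
    reps = -(-new_len // old_len)          # ceil(new_len / old_len)
    return np.tile(pe, (reps, 1))[:new_len]

# e.g. stretch a 512-position BERT-style table to 4096 positions.
pe_long = extend_positional_embeddings(np.random.randn(512, 64), 4096)
```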
5. Empirical Evaluation and Comparative Performance
LSDA-based models consistently outperform or match standard full-attention and windowed/sparse baselines across domains:
- NLP (LSG, LS-attention, LSRA):
- LSG (block size 128, sparsity factor 4) matches or exceeds Longformer/BigBird in classification and summarization on long documents. For BERT-class models adapted with LSG, masked-LM accuracy degrades minimally—unlike the catastrophic collapse of vanilla models extrapolated to longer sequences (Condevaux et al., 2022).
- LS-attention reduces LM perplexity to about $2/5$ of that obtained with QK-norm stabilizers and matches FlashAttention at $1/20$ the GPU-hours (Hajra, 21 May 2025).
- Lite Transformer with LSRA exceeds baselines by 1–2 BLEU for translation and 1.8 PPL for language modeling under strict compute constraints (Wu et al., 2020).
- Vision (LSDA, CrossFormer/++):
- LSDA improves Top-1 ImageNet-1K accuracy by 0.6–1.2% over Swin/PVT-style baselines; removal of either SDA or LDA degrades accuracy by over 1%. On COCO, CrossFormer++-S (LSDA backbone) attains +0.7 AP over CrossFormer-S (Wang et al., 2023, Wang et al., 2021).
- Ablation replacing LSDA with alternative sparse attentions in CrossFormer++ confirms the unique gain (+0.4–1.4% Top-1) of the local/global alternation (Wang et al., 2023).
6. Theoretical and Empirical Rationale
LSDA schemes are motivated by the inadequacies of global MHSA alone, especially for:
- Short-range dependency binding: Standard global MHSA is rank-deficient for expressing dense local banded dependencies when the per-head dimension $d_h \ll n$, leading to logit explosion and instability (Hajra, 21 May 2025).
- Computation-accuracy tradeoff: Local-only attention (e.g. windows, convolution) sacrifices global context, while global-only is inefficient and numerically brittle. By decomposing, each submodule operates in its optimal regime.
- Adaptability: LSDA variants improve robustness to input length/distribution shift (e.g., LSG adaptation allows direct extrapolation to longer sequences without pretraining, provided positional embeddings are extended appropriately (Condevaux et al., 2022)).
7. Limitations and Prospects
LSDA approaches are not without tradeoffs:
- Hyperparameter dependence: Optimal block/group/window sizes and head splits are task- and distribution-specific; there is no universally superior setting. Performance can be sensitive to these parameters.
- Scalability: While asymptotic cost is reduced, absolute compute may remain significant at extreme sequence/grid sizes due to constants (e.g., block size $b$, group size $G$).
- Heterogeneous architectures: Adapting conversion tools and LSDA modules across nonstandard Transformer architectures may require per-variant engineering effort (Condevaux et al., 2022).
- Dynamic allocation: Static assignment (fixed local/global splits) can be suboptimal. Future directions propose learnable or mixture-of-expert-based splits, continuous routing, or block-wise relative position embedding (Condevaux et al., 2022, Wang et al., 2023).
Extensions being investigated include mixing block-local/sparse/global scaffolding with dynamic convolution or kernel attention methods, integrating relative positional encoding within blocks, and dynamic or learned allocation of global attention capacity.
The LSDA paradigm—across LSG attention, head-based local/global decomposition, and vision block alternation—demonstrates consistent benefits in efficiency, stability, and accuracy for both language and vision transformers, establishing itself as a universal principle for scalable attention architectures (Condevaux et al., 2022, Wang et al., 2021, Wang et al., 2023, Hajra, 21 May 2025, Wu et al., 2020).