
Length-Scaled Attention in Transformers

Updated 13 May 2026
  • Length-Scaled Attention (LSA) is a family of techniques that adjust the standard attention mechanism to preserve entropy invariance as sequence lengths increase.
  • LSA employs methods such as InfoScale, CosScale, and log-length scaling, which introduce length-adaptive adjustments to counteract attention score dilution.
  • The approach enhances performance in large language models, speech recognition, and speaker verification by keeping attention distributions sharp over longer sequences.

Length-Scaled Attention (LSA) is a family of modifications to the attention mechanism in Transformer-based models, designed to maintain or enhance model performance as sequence lengths at inference grow far beyond those seen during training. The principal aim is to prevent the degradation of attention sharpness and model confidence due to entropy growth and “attention score dilution” that naturally arises with increased input length. LSA encompasses theoretically grounded techniques (e.g., InfoScale, CosScale, scale-invariant attention) as well as practical recipes (log-length scaling in Conformer, Softplus-based variants) for improving length extrapolation across domains including LLMs, speech recognition, and speaker verification.

1. Origins and Theoretical Motivation

When a Transformer’s attention operates on a much longer sequence than encountered during training, the softmax is computed over many more terms. In standard scaled-dot-product attention, this leads to two phenomena: (i) individual logits contribute less to the attention output (“score dilution”), and (ii) the output distribution’s entropy increases, making the model less confident. The resultant “flattening” of attention means the model may lose focus on semantically or contextually important tokens, especially for extremely long contexts (Li et al., 15 Jan 2025).

The central insight of LSA is entropy invariance: by controlling scaling parameters as a function of sequence length, it is possible to match the average entropy of the attention distribution to that seen at training time. This preserves the sharpness and selectivity of the softmax, allowing models to extrapolate more robustly to longer contexts (Li et al., 15 Jan 2025, Gao et al., 23 Jan 2025, Anson et al., 20 May 2025).
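The dilution effect described above can be illustrated numerically. The following is a minimal sketch, assuming i.i.d. standard-normal logits as a stand-in for real attention scores; that distributional choice and the helper names (`softmax`, `entropy`, `mean_attention_entropy`) are illustrative assumptions, not taken from the cited works.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_attention_entropy(n, scale=1.0, trials=50, seed=0):
    """Average softmax entropy over `trials` random logit vectors of length n.

    Assumption: logits are i.i.d. N(0, 1) draws multiplied by `scale`.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        logits = [scale * rng.gauss(0.0, 1.0) for _ in range(n)]
        total += entropy(softmax(logits))
    return total / trials

if __name__ == "__main__":
    # Entropy rises with n at a fixed scale; a larger scale pulls it back down.
    for n in (64, 512, 4096):
        print(n, mean_attention_entropy(n), mean_attention_entropy(n, scale=2.0))
```

Under this toy model the entropy at a fixed scale grows with the number of keys, while a larger logit scale lowers it, which is the lever that length-dependent scaling uses.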

2. Mathematical Formulations

LSA methods modify attention scoring and normalization via length-dependent scaling, either as a multiplicative factor or via affine position-dependent transforms. The key approaches include:

  • InfoScale (Dot-Product Attention): For queries $q_i$ and keys $K = [k_1, ..., k_n]$, the scaling factor $\lambda$ in standard attention (typically $1/\sqrt{d_k}$) is replaced by a length-adaptive version:

$$\text{InfoScale}(n) \approx \sqrt{ \frac{1-n^{-2/d_k}}{1-n_{\text{train}}^{-2/d_k}} }$$

The logits are then computed as:

$$\text{logits} = \frac{\text{InfoScale}(n)}{\sqrt{d_k}} QK^T$$

enforcing entropy invariance across $n$ (Li et al., 15 Jan 2025).

  • CosScale (Cosine Attention): Here, the logits use a temperature parameter $\alpha$:

$$a_{ij} = \exp\left(\alpha \cdot \cos \theta_{ij}\right)$$

Large $\alpha$ values focus attention but may cause numerical instability (Li et al., 15 Jan 2025).

  • Log-Length Scaling (Conformer): In Conformer architectures, attention scores are multiplied by a factor that grows with $\log n$, the logarithm of the sequence length, together with a learnable temperature parameter $s$ (Liao et al., 2022).

  • Scale-Invariant Attention (Position-Dependent Transform): Each logit is passed through an affine transform whose coefficients depend on the query-key offset $i-j$, ensuring scale-invariant total attention and sparsity (Anson et al., 20 May 2025).

  • Length-Scaled Softplus Attention (LSSA): Replaces the exponential nonlinearity in softmax with softplus and applies a length-dependent scale to the attention logits prior to normalization, optionally followed by power-based reweighting (LSSAR) (Gao et al., 23 Jan 2025).
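The InfoScale expression above can be sketched in a few lines. This is a minimal illustration of the approximate formula as stated in this section, not the authors' reference code; the function names `info_scale` and `scaled_logits` and the list-of-lists vector representation are assumptions made for the example.

```python
import math

def info_scale(n, n_train, d_k):
    """Length-adaptive multiplier from the approximate InfoScale expression:

    sqrt((1 - n^(-2/d_k)) / (1 - n_train^(-2/d_k)))
    """
    if n < 2 or n_train < 2:
        raise ValueError("sequence lengths must be at least 2")
    num = 1.0 - n ** (-2.0 / d_k)
    den = 1.0 - n_train ** (-2.0 / d_k)
    return math.sqrt(num / den)

def scaled_logits(q, keys, n_train):
    """Logits for a single query: InfoScale(n) / sqrt(d_k) * (q . k_j)."""
    d_k = len(q)
    factor = info_scale(len(keys), n_train, d_k) / math.sqrt(d_k)
    return [factor * sum(a * b for a, b in zip(q, k)) for k in keys]

if __name__ == "__main__":
    # The multiplier equals 1 at the training length and grows beyond it.
    for n in (512, 2048, 8192):
        print(n, info_scale(n, n_train=512, d_k=64))
```

At $n = n_{\text{train}}$ the multiplier is exactly 1, so the sketch reduces to standard scaled dot-product logits on training-length inputs.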

3. Implementation Variants and Integration

LSA techniques can be introduced at inference or fine-tuning without retraining from scratch. For dot-product attention, InfoScale is applied as a scalar multiplier on the logits, computed from the current sequence length and the training length. For cosine attention, a fixed or learnable temperature parameter (CosScale) is applied. In Conformer models, LSA is incorporated by multiplying logits by a $\log n$ factor and introducing a trainable temperature.
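As a rough illustration of the cosine and Conformer integrations described above, the sketch below computes attention weights for a single query. The cosine variant follows the $\exp(\alpha \cos\theta_{ij})$ form given earlier. The log-length variant assumes the logits are multiplied by $s \cdot \log n / \sqrt{d_k}$; that exact form, including the $\sqrt{d_k}$ division, is an assumption for illustration and may differ from the formulation in Liao et al. All function names are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax (subtracts the max before exponentiating)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def cos_scale_attention(q, keys, alpha):
    """Weights proportional to exp(alpha * cos(theta_ij)) for one query."""
    return softmax([alpha * cosine(q, k) for k in keys])

def log_length_attention(q, keys, s):
    """ASSUMED form: softmax over s * log(n) / sqrt(d_k) * (q . k_j)."""
    n, d_k = len(keys), len(q)
    factor = s * math.log(n) / math.sqrt(d_k)
    return softmax([factor * dot(q, k) for k in keys])

if __name__ == "__main__":
    q = [1.0, 0.0]
    keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
    print(cos_scale_attention(q, keys, alpha=1.0))
    print(cos_scale_attention(q, keys, alpha=10.0))
    print(log_length_attention(q, keys, s=1.0))
```

In a trainable setting, `alpha` and `s` would be parameters rather than constants; the max-subtraction in `softmax` is what keeps large `alpha` values from overflowing in this sketch.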

The table below outlines key implementation choices for major LSA variants:

| Variant | Logit Scaling | Entropy Goal | Extra Params |
|---|---|---|---|
| InfoScale | Length-adaptive λ | Fixed at train | None |
| CosScale | Temp. α on cosine | Sharpened | α, fixed or learnable |
| Log-LSA | log n multiplier | Invariant | s, per block (learned) |
| Scale-invariant | Position-dependent transform by i−j | | |
| LSSA/LSSAR | log d·log N_s, Softplus | Invariant | p (LSSAR), optional |

All methods have negligible computational overhead: mainly scalar multiplications, length-dependent calculations, or power-based post-processing.

4. Theoretical Foundations and Guarantees

The guiding principle is to sustain the entropy or the sparsity of attention distributions as sequence length grows. The theory shows that, when query-key dot products are modeled as random variables, length scaling of logits can enforce:

  • Scale-invariant total attention: Unnormalized attention summed over geometrically growing blocks of positions is Θ(1), keeping both local and global context influential (Anson et al., 20 May 2025).
  • Scale-invariant sparsity: The entropy of the attention distribution over each block grows at most sub-logarithmically (ideally remains O(1)), preventing the diluting effect of the softmax (Anson et al., 20 May 2025).

Entropy invariance is achieved by solving for the scaling factor that keeps the attention entropy at the reference value seen during training, which yields explicit expressions (as in InfoScale). In Softplus-based LSSA, an empirically justified length-dependent scale achieves a similar goal (Gao et al., 23 Jan 2025).
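The idea of solving for an entropy-matching scale can also be shown numerically rather than in closed form. The sketch below is an illustrative assumption, not the derivation used in the cited papers: it bisects on a scalar multiplier until the softmax entropy of a long logit vector matches a reference entropy measured on a shorter one. It relies on the fact that the entropy of softmax(βx) is non-increasing in β for β ≥ 0. The names `softmax_entropy` and `solve_scale` are hypothetical.

```python
import math
import random

def softmax_entropy(logits, scale):
    """Entropy in nats of softmax(scale * logits)."""
    scaled = [scale * x for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return -sum((e / total) * math.log(e / total) for e in exps if e > 0.0)

def solve_scale(logits, target_entropy, lo=0.0, hi=64.0, iters=60):
    """Bisection for the scale at which the entropy equals target_entropy.

    Assumes the target lies between the entropies at `hi` and `lo`.
    """
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if softmax_entropy(logits, mid) > target_entropy:
            lo = mid  # distribution still too flat: increase the scale
        else:
            hi = mid  # distribution too sharp: decrease the scale
    return 0.5 * (lo + hi)

if __name__ == "__main__":
    rng = random.Random(0)
    short = [rng.gauss(0.0, 1.0) for _ in range(256)]    # "training" length
    long_ = [rng.gauss(0.0, 1.0) for _ in range(4096)]   # "inference" length
    h_ref = softmax_entropy(short, 1.0)
    scale = solve_scale(long_, h_ref)
    print(scale, h_ref, softmax_entropy(long_, scale))
```

A closed-form factor such as InfoScale avoids this per-input search; the numerical version is only meant to make the entropy-matching objective concrete.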

5. Empirical Evaluation and Impact

Experiments in multiple domains demonstrate marked improvements in length extrapolation:

  • LLMs: On GAU-α, InfoScale and CosScale combined produce sharp reductions in perplexity (PPL drops from ≫500 to 6.36) and large boosts in accuracy (ACC increases up to 11×) at context lengths 64× longer than training (Li et al., 15 Jan 2025).
  • Language Modeling: Scale-invariant p-RoPE attention maintains validation loss nearly constant when extrapolating from 4k-token training to 16k or 64k inference, outperforming RoPE, LogN scaling, and ALiBi in both language modeling and retrieval tasks (Anson et al., 20 May 2025).
  • Speech and Speaker Verification: In Conformer-based ASV models, LSA reduces EER by ≈10% relative and delivers gains across different evaluation sets, demonstrating generalization to variable utterance lengths (Liao et al., 2022).
  • Numerical Stability and Efficiency: LSSA/LSSAR maintain constant validation loss up to 16× context expansion and exhibit robust numerical behavior due to the use of softplus and bounded gradients (Gao et al., 23 Jan 2025).

Table: Extracted results for select benchmarks (excerpted):

| Model / Task | Baseline (PPL/ACC/EER) | LSA Variant | Result |
|---|---|---|---|
| GAU-α, PPL | ≫500 | InfoScale+CosScale | 6.36 |
| GAU-α, ACC | <0.1 | InfoScale+CosScale | 0.63 |
| Conformer, VoxCeleb | 1.143% (EER) | LSA | 1.026% |
| GPT2-124M val. loss | 6.282 (@16k) | LSSAR | 3.317 |
| Retrieval (Acc@64k) | 0.000 (RoPE) | Scale-invariant p-RoPE | 0.969 |
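As a quick consistency check on the speaker-verification figures, the relative EER reduction implied by the Conformer row can be worked out directly from the two reported values:

```latex
\frac{1.143 - 1.026}{1.143} = \frac{0.117}{1.143} \approx 0.102
```

This corresponds to a relative reduction of about 10.2%, in line with the ≈10% figure quoted for Conformer-based ASV models.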

6. Comparison with Related Approaches

LSA approaches differ from prior heuristics (e.g., simply multiplying logits by log n or adding positional biases):

  • ALiBi: Adds position-dependent linear bias, but under-attends distant tokens and lacks explicit entropy control (Anson et al., 20 May 2025).
  • LogN / Scalable-Softmax: Scales all logits by log N, but does not differentiate among local/global or adapt sharpness as a function of offset (Anson et al., 20 May 2025).
  • Softmax Plus, Pre-softmax Scaling: Do not guarantee entropy invariance and often fail to preserve attention sharpness at long sequence lengths (Li et al., 15 Jan 2025).
  • Re-weighted Softplus (LSSAR): Further sharpens attention via a nonlinear post-processing power, amplifying peaks and suppressing weak entries for robust focus at large L (Gao et al., 23 Jan 2025).
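The sharpening effect of power-based reweighting can be sketched as follows. This is a simplified illustration under stated assumptions, not the exact formulation of Gao et al.: softplus replaces the exponential, weights are normalized by their sum, and an optional exponent `power` is applied before renormalizing. The function names and the default `scale` are hypothetical.

```python
import math

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def softplus_attention(logits, scale=1.0, power=1.0):
    """Softplus weights normalized by their sum, optionally sharpened.

    ASSUMPTION: reweighting raises each normalized weight to `power`
    and renormalizes; the cited work may define this step differently.
    """
    weights = [softplus(scale * x) for x in logits]
    total = sum(weights)
    probs = [w / total for w in weights]
    if power != 1.0:
        # Divide by the max first so the largest entry is exactly 1 and
        # the powered weights cannot all underflow to zero.
        peak = max(probs)
        powered = [(p / peak) ** power for p in probs]
        total = sum(powered)
        probs = [p / total for p in powered]
    return probs

if __name__ == "__main__":
    logits = [2.0, 0.0, -1.0]
    print(softplus_attention(logits))
    print(softplus_attention(logits, power=3.0))
```

With `power > 1`, the largest weight gains mass and the smallest loses it, which matches the qualitative description of amplifying peaks and suppressing weak entries.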

LSA, especially in the form of InfoScale, CosScale, and scale-invariant attention, mathematically guarantees invariance of key statistical properties (total attention, entropy), providing theoretical and empirical performance advantages over these alternatives.

7. Implementation Considerations and Stability

LSA modifications are lightweight, requiring only minor per-attention-head computational adjustments (scalar calculation, per-position scaling). Stability is generally excellent, but extremely large scaling parameters (e.g., CosScale α≫600) may induce numerical instability (Li et al., 15 Jan 2025). All proposed methods are compatible with standard attention architectures and position encodings (RoPE, p-RoPE), but some variants (e.g., scale-invariant transforms) perform best when paired with high-frequency spectral cutoffs on positional encodings (Anson et al., 20 May 2025).

LSA has been shown to generalize across transformer variants, domains (language, speech), and use cases (language modeling, ASR, ASV, retrieval), indicating broad applicability for extending model context windows efficiently. All code and benchmarks for the principal LSA variants are publicly available as referenced in the original works (Li et al., 15 Jan 2025, Liao et al., 2022, Gao et al., 23 Jan 2025, Anson et al., 20 May 2025).
