
ALiBi: Linear Biases in Transformer Attention

Updated 21 April 2026
  • ALiBi is a positional encoding mechanism for Transformers that adds fixed, head-specific linear biases to attention logits based on token or spatial distance.
  • It enables head-wise specialization by assigning different decay slopes, allowing some heads to focus on local details while others capture long-range dependencies.
  • The method is computationally efficient, generalizes across modalities, and enhances performance on out-of-distribution sequence lengths without extra learnable parameters.

Attention with Linear Biases (ALiBi) is a positional encoding mechanism for Transformer architectures that provides an efficient, parameter-free, extrapolation-friendly inductive bias for modeling sequence order. By adding a fixed, head-specific, distance-dependent linear penalty to the raw attention logits, ALiBi induces a recency bias across attention heads, avoids the context-length limits of absolute position embeddings, and extends naturally to modalities beyond natural language, including vision and molecular graphs.

1. Mathematical Definition and Core Formulation

ALiBi modifies the raw, pre-softmax attention logits between a query at position $i$ and a key at position $j$ in a Transformer attention head by adding a linear, distance-dependent bias:

$$L'_{i,j} = \frac{q_i^\top k_j}{\sqrt{d}} - s_h (i-j)$$

where $q_i, k_j \in \mathbb{R}^d$ are the query and key vectors, and $s_h$ is a fixed, head-specific nonnegative slope parameter. For causal attention, only $j \leq i$ are admitted. The term $-s_h (i-j)$ penalizes keys that are temporally distant from the query, decaying their contribution exponentially in $(i-j)$ after the softmax. This bias is applied in every attention head and at every attention layer.
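
To make the exponential decay explicit, the softmax over the biased logits can be written out. This is a short derivation from the definition above; the symbol $a_{i,j}$ for the resulting attention weight is notation introduced here, not taken from the source.

```latex
a_{i,j}
  = \frac{\exp\!\big(L'_{i,j}\big)}{\sum_{j' \le i} \exp\!\big(L'_{i,j'}\big)}
  = \frac{\exp\!\big(q_i^\top k_j / \sqrt{d}\big)\, e^{-s_h (i-j)}}
         {\sum_{j' \le i} \exp\!\big(q_i^\top k_{j'} / \sqrt{d}\big)\, e^{-s_h (i-j')}}
```

Each key's unnormalized weight is multiplied by $e^{-s_h (i-j)}$, so for equal content scores a key $n$ positions further back receives $e^{-s_h n}$ times the weight.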

In vision or graph contexts, the distance can be generalized, e.g., by using the Euclidean distance between 2D image patch coordinates, optionally scaled by a ground sample distance (GSD) for multi-resolution data, or by using the shortest-path length in molecular graphs (Kage et al., 11 Apr 2026, Pawlowsky et al., 17 Mar 2026, Gurev et al., 24 Jun 2025). The bias then takes the general form:

$$b_{i,j}^{(h)} = -s_h\,d(i,j)$$

where $d(i,j)$ is the selected metric.
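
As an illustration of the generalized form, the sketch below builds the bias for a grid of image patches using Euclidean distance between grid coordinates. The function name `patch_distance_bias`, the `gsd` argument (treated here as a plain multiplicative scale on the distance), and the example slopes are assumptions made for this example, not the cited authors' implementations.

```python
import numpy as np

def patch_distance_bias(grid_h, grid_w, slopes, gsd=1.0):
    """Return an (H, N, N) bias with b[h, i, j] = -slopes[h] * d(i, j).

    d(i, j) is the Euclidean distance between patch-grid coordinates,
    multiplied by `gsd` (an assumed per-patch physical scale factor).
    """
    ys, xs = np.meshgrid(np.arange(grid_h), np.arange(grid_w), indexing="ij")
    coords = np.stack([ys.ravel(), xs.ravel()], axis=-1).astype(np.float64)  # (N, 2)
    diff = coords[:, None, :] - coords[None, :, :]                            # (N, N, 2)
    dist = np.linalg.norm(diff, axis=-1) * gsd                                # (N, N)
    slopes = np.asarray(slopes, dtype=np.float64)
    return -slopes[:, None, None] * dist[None, :, :]

# Hypothetical 2x2 patch grid, two heads, 10 distance units per patch.
bias = patch_distance_bias(2, 2, slopes=[0.5, 0.25], gsd=10.0)
```

Because the distance is symmetric, the resulting bias is symmetric as well, which suits bidirectional (non-causal) attention over patches.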

No learnable parameters are introduced; the slopes $s_h$ are fixed before training using a geometric progression over heads, commonly $s_h = 2^{-8h/H}$ for head index $h = 1, \dots, H$, where $H$ is the number of heads (Press et al., 2021, Clark et al., 2024).
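
A minimal sketch of such a slope schedule and the causal bias it produces, assuming the geometric schedule $s_h = 2^{-8h/H}$ for a power-of-two head count $H$. The helper names `alibi_slopes` and `causal_alibi_bias` are illustrative, and masked (future) positions are represented as negative infinity.

```python
def alibi_slopes(num_heads):
    """Geometric slopes s_h = 2 ** (-8 * h / num_heads) for h = 1..num_heads."""
    return [2.0 ** (-8.0 * h / num_heads) for h in range(1, num_heads + 1)]

def causal_alibi_bias(slope, seq_len):
    """Row i, column j holds -slope * (i - j) for j <= i, and -inf for j > i."""
    neg_inf = float("-inf")
    return [
        [-slope * (i - j) if j <= i else neg_inf for j in range(seq_len)]
        for i in range(seq_len)
    ]

slopes = alibi_slopes(8)                      # 1/2, 1/4, ..., 1/256
bias_head0 = causal_alibi_bias(slopes[0], 4)  # bias matrix for the steepest head
```

The slopes are plain constants computed once, so nothing here is registered as a trainable parameter.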

2. Recency Bias and Head-wise Specialization

The distance-dependent bias induces an exponentially decaying weighting over past tokens, causing the attention distribution to be recency-biased. Large $s_h$ values cause strong decay, confining the head's receptive field to very recent context (short-range dependencies). Small $s_h$ values allow a head to attend to long-range dependencies with weak attenuation.
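
One way to quantify this is the distance at which the bias halves a key's unnormalized weight, $n = \ln 2 / s_h$. The sketch below computes it for an assumed 8-head geometric schedule; `half_weight_distance` is a name chosen for this example.

```python
import math

def half_weight_distance(slope):
    """Distance n at which the multiplicative factor exp(-slope * n) equals 1/2."""
    return math.log(2.0) / slope

slopes = [2.0 ** -k for k in range(1, 9)]  # assumed 8-head schedule: 1/2 ... 1/256
for s in slopes:
    print(f"slope={s:.5f}  half-weight distance={half_weight_distance(s):8.2f} tokens")
```

Under this schedule the half-weight distance doubles from one head to the next, which is what spreads the heads across short and long ranges.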

The mixture of slopes enables specialization among attention heads: heads with steeper slopes capture local syntactic relations, while those with shallow slopes track dependencies over longer ranges or wider spatial fields. Empirical studies on both language and vision benchmarks have shown that head specialization directly follows from this bias schedule (Clark et al., 2024, Press et al., 2021, Pawlowsky et al., 17 Mar 2026).

In cognitive modeling, the ALiBi recency profile closely matches memory decay in frameworks such as ACT-R, providing a graded, parameter-free approximation to human-like forgetting curves (Clark et al., 2024).

3. Integration, Efficiency, and Architectural Variants

ALiBi is integrated by:

  • Omitting explicit positional embeddings (e.g., sinusoidal, learned).
  • Adding the head-specific bias based on sequence (or other) distance to attention logits before softmax in every attention block.
  • Retaining all other Transformer hyperparameters and architecture.
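
The integration steps above can be sketched as a single causal attention function in NumPy. This is an illustrative sketch under assumed shapes (`(H, L, d)` per tensor) and assumed example slopes, not a reference implementation. No positional embedding is added to the inputs; position enters only through the bias.

```python
import numpy as np

def alibi_attention(q, k, v, slopes):
    """Causal multi-head attention with ALiBi biases.

    q, k, v: arrays of shape (H, L, d). Returns (output, attention weights).
    """
    H, L, d = q.shape
    logits = q @ k.transpose(0, 2, 1) / np.sqrt(d)         # (H, L, L) content scores
    pos = np.arange(L)
    dist = pos[:, None] - pos[None, :]                     # i - j
    slopes = np.asarray(slopes, dtype=logits.dtype)
    logits = logits - slopes[:, None, None] * dist         # add -s_h * (i - j)
    logits = np.where(dist >= 0, logits, -np.inf)          # causal mask: keep j <= i
    logits = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((2, 5, 4)) for _ in range(3))
out, w = alibi_attention(q, k, v, slopes=[0.5, 0.25])
```

With all-zero queries and keys the content scores vanish, and each row of `w` reduces to a pure exponential recency profile set by the head's slope.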

The mechanism is computationally efficient: the bias addition is a single elementwise operation per attention block that adds no parameters and only a negligible number of FLOPs. Memory overhead is at most a fixed H×L×L bias tensor (where H is the number of heads and L the sequence length), the same shape as the attention logits it is added to, and is small in typical configurations (Press et al., 2021).
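
The storage cost can be checked directly. The sketch below compares a materialized H×L×L bias against keeping only the H slopes and one shared L×L distance matrix, then broadcasting when needed. The sizes and the float32 dtype are assumptions chosen for the example.

```python
import numpy as np

H, L = 8, 512                                            # hypothetical head count and length
slopes = (2.0 ** -np.arange(1, H + 1)).astype(np.float32)  # assumed geometric schedule
pos = np.arange(L, dtype=np.float32)
dist = pos[:, None] - pos[None, :]                       # (L, L), shared by all heads

full_bias = -slopes[:, None, None] * dist                # materialized (H, L, L) tensor
print(full_bias.nbytes)                                  # H * L * L * 4 bytes
print(dist.nbytes + slopes.nbytes)                       # L * L * 4 + H * 4 bytes
```

Keeping the factored form cuts storage by roughly a factor of H, although the cost still grows quadratically with L.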

ALiBi generalizes to multidimensional input spaces by adapting the distance metric $d(i,j)$. For multi-modal and cross-resolution transformer architectures, e.g., in satellite imagery, the bias can be scaled using input-specific GSDs to encode true spatial distances, enabling multi-scale, multi-resolution attention while preserving extrapolation and physical interpretability (Kage et al., 11 Apr 2026).

4. Empirical Evaluation, Extrapolation, and Limitations

ALiBi-equipped Transformers exhibit strong sequence-length extrapolation. When evaluated on sequences longer than those seen at training time, models retain their perplexity and performance, and may even outperform larger absolute-positioned baselines at a fraction of training cost (Press et al., 2021).

Representative results include:

| Model | Train Length | Test Length | Perplexity (Test) | Relative Speed | Memory Savings |
|---|---|---|---|---|---|
| Sinusoidal | 2048 | 2048 | 18.67 | 100% | 0% |
| ALiBi | 1024 | 2048 | 17.96 | +11% | –11% |

This effect is robust for up to approximately 2× the training context length; beyond this, performance plateaus or slightly degrades, but still outperforms other static-bias schemes. ALiBi's extrapolation is further improved by dynamic position interpolation, where the slopes are rescaled at inference so the largest bias magnitude remains constant regardless of test length (Al-Khateeb et al., 2023).
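
One simple reading of the slope rescaling described above is to shrink every slope by the ratio of training length to test length, so the bias at the maximum distance stays roughly at its training-time value. The function `rescale_slopes` and the exact scaling rule are assumptions made for illustration and are not taken from the cited paper.

```python
def rescale_slopes(slopes, train_len, test_len):
    """Shrink slopes by train_len / test_len when the test sequence is longer.

    Assumed rule: sequences no longer than train_len leave the slopes unchanged.
    """
    scale = min(1.0, train_len / test_len)
    return [s * scale for s in slopes]

slopes = [2.0 ** -k for k in range(1, 9)]   # assumed 8-head schedule
scaled = rescale_slopes(slopes, train_len=1024, test_len=4096)
```

Under this rule, `scaled[h] * test_len` equals `slopes[h] * train_len` for every head, so the most distant key is penalized about as much as it was during training.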

However, the induced exponential decay attenuates attention to tokens beyond a head's effective window. For deep dependencies or truly global attention, ALiBi's per-head decay can act as a limitation: no head remains “fully global,” and very long-range dependencies may be underemphasized, leading to a restricted receptive field (Oka et al., 4 Feb 2025). This effect motivated developments of wavelet-based or multi-scale position representations.

Additional limitations include quadratic memory overhead for bias storage (for extreme sequence lengths), and, in certain architectures (notably BLOOM), a susceptibility to attention head collapse, where some heads attend almost exclusively to the beginning-of-sequence token under the steepest ALiBi slopes. Targeted reinitialization (“surgical repair”) can recover such heads (Schallon, 10 Mar 2026).
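
A rough diagnostic for the collapse described above is to measure how much attention each head places on the first (beginning-of-sequence) position. The sketch below is a hypothetical check: the function names and the 0.9 threshold are assumptions, and it is not the repair procedure from the cited work.

```python
import numpy as np

def bos_attention_mass(weights):
    """Mean attention each head places on key position 0.

    weights: (H, L, L) row-normalized causal attention weights.
    The first query row is skipped because it can only attend to position 0.
    """
    return weights[:, 1:, 0].mean(axis=1)

def flag_collapsed_heads(weights, threshold=0.9):
    """Indices of heads whose BOS mass exceeds `threshold` (a hypothetical cutoff)."""
    return np.flatnonzero(bos_attention_mass(weights) > threshold)

# Toy input: head 0 attends uniformly over the past, head 1 only to position 0.
L = 4
uniform = np.tril(np.ones((L, L)))
uniform /= uniform.sum(axis=1, keepdims=True)
collapsed = np.zeros((L, L))
collapsed[:, 0] = 1.0
weights = np.stack([uniform, collapsed])
flagged = flag_collapsed_heads(weights)
```

In the toy input only head 1 is flagged, since head 0 spreads its attention evenly over the past.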

5. Extensions, Variants, and Theoretical Unification

Significant variations and generalizations of ALiBi have appeared:

  • Graph- and Image-based Biases: In vision transformers, ALiBi is extended to 2D by using Euclidean distances between patch centroids, optionally normalized and toroidal. For molecular modeling, the shortest-path length in a molecular graph replaces the sequence distance, injecting relational inductive bias (Pawlowsky et al., 17 Mar 2026, Kage et al., 11 Apr 2026, Gurev et al., 24 Jun 2025).
  • Scale-ALiBi: In cross-resolution multi-modal transformers, GSD scaling aligns the bias with true physical distances across input types (Kage et al., 11 Apr 2026).
  • Low-rank and Group-Theoretic Generalizations: Group Representational Position Encoding (GRAPE) shows that ALiBi is a special case of additive positional encoding via unipotent group actions, with ALiBi realized as a rank-1 linear bias arising from a unipotent subgroup of a general linear group. This situates ALiBi alongside rotary embeddings in a unified geometric framework, and enables richer path-dependent and content-adaptive biases (Zhang et al., 8 Dec 2025).
  • Hyperbolic Biases (HyPE): HyPE generalizes ALiBi via hyperbolic sine functions to encode relative position at reduced memory overhead and full compatibility with FlashAttention-2 kernels. For small slopes, HyPE's sinh bias converges to ALiBi's linear form, but can be made learnable and non-linear for improved flexibility (Angelotti, 2023).
  • Wavelet-based and Multi-scale: Wavelet position representations remove the fixed-window limitation by providing multiple scales and unlimited attention span, overcoming ALiBi's constraint on deep dependency span (Oka et al., 4 Feb 2025).
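
For the graph-based variant in the list above, the sequence distance is replaced by a shortest-path length. The sketch below computes hop counts with breadth-first search and turns them into a bias. The adjacency-list input format, the function names, and the 4-atom chain are assumptions made for illustration.

```python
from collections import deque

def shortest_path_lengths(adjacency):
    """All-pairs hop counts for an unweighted graph given as adjacency lists.

    Unreachable pairs are left as float('inf').
    """
    n = len(adjacency)
    dist = [[float("inf")] * n for _ in range(n)]
    for src in range(n):
        dist[src][src] = 0
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adjacency[u]:
                if dist[src][v] == float("inf"):   # not visited yet
                    dist[src][v] = dist[src][u] + 1
                    queue.append(v)
    return dist

def graph_alibi_bias(adjacency, slope):
    """Bias b[i][j] = -slope * d(i, j), with d the shortest-path length."""
    return [[-slope * d for d in row] for row in shortest_path_lengths(adjacency)]

# Hypothetical 4-atom chain 0-1-2-3.
chain = [[1], [0, 2], [1, 3], [2]]
bias = graph_alibi_bias(chain, slope=0.5)
```

With a positive slope, disconnected pairs receive a bias of negative infinity, which acts as a mask after the softmax.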

6. Cross-domain Applications

ALiBi's parameter-free, extrapolatable structure has enabled adoption beyond conventional autoregressive language modeling:

  • Cognitive modeling: ALiBi-trained transformers more closely match human reading time data, as the effective memory decay reproduces psycholinguistic findings from ACT-R and similar theories (Clark et al., 2024).
  • Vision and medical imaging: Replacing learned or absolute positional encodings in self-supervised vision transformers (DINO, MAE) with 2D ALiBi reduces positional leakage, enhances zero- and low-shot generalization, and yields homogeneous segmentations in microscopy images. Translation-invariant applications benefit significantly from ALiBi-based encoding (Pawlowsky et al., 17 Mar 2026).
  • Remote sensing and geospatial analysis: Scale-ALiBi encodes real-world patch distances, outperforming established 2D ALiBi vision baselines in multi-modal satellite imagery benchmarks (Kage et al., 11 Apr 2026).
  • Molecular conformer generation: Use of shortest-path-based ALiBi allows smaller transformers to match or surpass much larger, non-equivariant baselines on molecular structure benchmarks (Gurev et al., 24 Jun 2025).

Despite its simplicity, ALiBi's induction of multi-scale, distance-aware attention is highly effective across modalities featuring strong spatial or relational inductive biases.

7. Implications, Practical Recommendations, and Future Directions

ALiBi is notable for adding no parameters, integrating easily into existing architectures, and performing robustly in long-context modeling and multi-span reasoning. Best practices emerging from the literature include using a geometric spread of per-head slopes, omitting competing positional embeddings, and, where needed, applying position interpolation for stronger out-of-distribution performance at test time (Al-Khateeb et al., 2023, Press et al., 2021). Where attention head collapse arises, surgical retraining methods can recover global head functionality (Schallon, 10 Mar 2026).

Open directions for research include:

  • Learned or adaptive bias slopes per head.
  • Power-law or non-linear decay profiles to better capture heavy-tailed dependencies.
  • Fusing ALiBi with partial erasure/interference models for enhanced cognitive plausibility.
  • Higher-rank or content-gated additive biases leveraging the GRAPE framework.
  • Application to further domains—structured data, protein sequences, spatiotemporal series—where relative positions encode crucial inductive biases.

The principle underlying ALiBi—a physically or semantically meaningful, relative, fixed bias added directly to attention scores—has generalized across language, vision, and structured data. It supplies a theoretically grounded, computationally lightweight bridge from positional geometry to robust, extrapolation-friendly sequence modeling (Press et al., 2021, Zhang et al., 8 Dec 2025, Clark et al., 2024).
