ALiBi: Linear Biases in Transformers

Updated 6 April 2026
  • ALiBi is a positional encoding scheme that adds a linear, distance-dependent bias to attention logits, enabling efficient handling of variable-length inputs.
  • It replaces standard absolute or rotary embeddings with a parameter-free additive bias, promoting recency and multi-scale focus across attention heads.
  • Empirical results show improved language modeling, vision segmentation, and graph performance by adapting recency penalties and enabling streaming cacheability.

Attention with Linear Biases (ALiBi) is a positional encoding scheme for Transformers that introduces a linear, head-specific bias to the attention logits, designed to enable parameter-free length extrapolation, multi-scale inductive bias, and efficient handling of variable-length inputs. ALiBi replaces standard absolute or rotary position embeddings with an additive term proportional to the key-query distance, conferring strong recency bias and streaming cacheability without additional parameters. It generalizes naturally across domains—language, graphs, and vision—by substituting the distance metric as appropriate.

1. Mathematical Formulation and Algorithmic Details

In a Transformer layer with $H$ attention heads, ALiBi injects a distance-dependent bias into each head's unnormalized attention logits. For a sequence of length $L$, the raw attention logits for head $h$ are

$$A_{i,j}^{(h)} = \frac{q_i^{(h)} \cdot k_j^{(h)}}{\sqrt{d}} + b_{i,j}^{(h)},$$

where $q_i^{(h)}$ and $k_j^{(h)}$ are the query at position $i$ and the key at position $j$, and $b_{i,j}^{(h)}$ is the positional bias.

For 1D data (text), the bias takes the form

$$b_{i,j}^{(h)} = -m_h |i-j|,$$

or, in the causal regime,

$$b_{i,j}^{(h)} = -m_h (i-j), \quad j \leq i,$$

where $m_h > 0$ is a fixed, non-learned slope assigned to each head. The slopes are typically set as a geometric progression,

$$m_h = 2^{-8h/H}, \quad h = 1, \dots, H,$$

as in Press et al. (2021). As a result, some heads apply steep recency penalties (local focus), while others decay more slowly (long-range focus).
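The geometric slope schedule can be sketched in a few lines of Python. This is a minimal illustration of the closed form $m_h = 2^{-8h/H}$; the function name `alibi_slopes` is hypothetical, and the sketch does not reproduce any special handling that reference implementations may apply when the head count is not a power of two.

```python
def alibi_slopes(num_heads: int) -> list[float]:
    """Closed-form geometric schedule m_h = 2^(-8h/H) for h = 1..H.

    Assumption: the plain closed form is used for every head count.
    """
    return [2.0 ** (-8.0 * h / num_heads) for h in range(1, num_heads + 1)]


# For 8 heads the slopes are 1/2, 1/4, ..., 1/256.
for head, slope in enumerate(alibi_slopes(8), start=1):
    print(f"head {head}: m = {slope}")
```

Early heads get large slopes and so attend almost exclusively to nearby tokens, while later heads get slopes close to zero and behave nearly like unbiased attention.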

Scaled-dot-product attention with ALiBi becomes

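$$\mathrm{Attn}^{(h)} = \mathrm{softmax}\!\left(\frac{Q^{(h)} {K^{(h)}}^{\top}}{\sqrt{d}} + B^{(h)}\right) V^{(h)},$$

where $B^{(h)}_{i,j} = b_{i,j}^{(h)}$.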
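To make the biased attention concrete, the following is a minimal single-head, causal sketch in pure Python (no tensor library). It assumes the causal bias $-m_h (i-j)$ for $j \leq i$; the names `softmax` and `alibi_causal_attention` are illustrative and not taken from any particular codebase.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def alibi_causal_attention(q, k, v, slope):
    """Single-head causal attention with an ALiBi bias.

    q, k, v: lists of L vectors (lists of floats); q and k share dimension d.
    slope:   the head's fixed slope m_h.
    """
    d = len(q[0])
    outputs = []
    for i, q_i in enumerate(q):
        # Logit for key j <= i: scaled dot product minus slope * distance.
        logits = [
            sum(a * b for a, b in zip(q_i, k[j])) / math.sqrt(d) - slope * (i - j)
            for j in range(i + 1)
        ]
        weights = softmax(logits)
        outputs.append(
            [sum(weights[j] * v[j][c] for j in range(i + 1)) for c in range(len(v[0]))]
        )
    return outputs
```

With all-zero queries the content term vanishes and the output depends on the bias alone, which makes the recency weighting easy to inspect by hand.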

The ALiBi mechanism has also been characterized mathematically as a rank-1 unipotent action in a general linear group, which satisfies an exact relative positional law and admits a cacheable implementation for autoregressive decoding (Zhang et al., 8 Dec 2025).

2. Motivations, Inductive Bias, and Theoretical Underpinnings

ALiBi was introduced to address the inability of standard and rotary position embeddings to extrapolate effectively beyond the training context length. Absolute embeddings cannot accommodate positions beyond their training limit, while rotary (RoPE) and sinusoidal embeddings empirically degrade with increasing distance, often leading to divergence of attention magnitudes (Press et al., 2021, Al-Khateeb et al., 2023).

The recency bias imposed by ALiBi translates, after the softmax, into an exponential memory decay: each unnormalized attention weight is multiplied by $\exp\big(b_{i,j}^{(h)}\big) = e^{-m_h |i-j|}$, which causes attention weights to decrease exponentially as key-query distance grows. Assigning geometrically spaced slopes across heads yields a mixture of decay rates, so different attention heads specialize in tracking dependencies at distinct length scales (Clark et al., 2024).
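One way to read the mixture of decay rates is through a per-head "half-life": the distance at which the factor $e^{-m_h \cdot \text{distance}}$ drops to one half, namely $\ln 2 / m_h$. The sketch below assumes 8 heads with the geometric slopes $2^{-1}, \dots, 2^{-8}$; the helper `decay_half_life` is a hypothetical name introduced here for illustration.

```python
import math


def decay_half_life(slope):
    """Distance at which the ALiBi factor exp(-slope * distance) falls to 1/2."""
    return math.log(2.0) / slope


# Assumed setting: 8 heads with slopes 1/2, 1/4, ..., 1/256.
slopes = [2.0 ** (-h) for h in range(1, 9)]
for head, m in enumerate(slopes, start=1):
    print(f"head {head}: slope={m:.5f}, half-life={decay_half_life(m):.1f} tokens")
```

Under these assumptions the steepest head halves its weight in roughly 1.4 tokens, whereas the flattest head needs roughly 177 tokens, which is the multi-scale behavior described above (content scores being equal).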

This multi-scale induction mechanism enhances the model’s capacity to allocate some heads to short-range (e.g., argument structure) and others to long-range (e.g., coreference) dependencies in language, and analogously for neighborhood sizes in graphs or spatial patch relationships in vision.

3. Practical Implementations and Extensions

ALiBi requires only minor code modifications in standard multi-head attention layers. The positional bias can be precomputed once for a given context length and added to the attention logits at each layer, incurring negligible runtime or memory overhead. No additional learnable parameters are introduced. A typical sequence is to compute the per-head slopes, precompute the bias for the context length, and add it to the logits before the softmax (Press et al., 2021, Al-Khateeb et al., 2023).
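The precompute-once pattern can be sketched as follows. This is a pure-Python illustration under stated assumptions: the causal form of the bias, `-inf` as the mask value for future positions, and hypothetical helper names `build_alibi_bias` and `slice_bias`.

```python
def build_alibi_bias(slopes, max_len):
    """Precompute bias[h][i][j] = m_h * (j - i) for j <= i, and -inf for j > i.

    The -inf entries double as the causal mask (an assumption of this sketch).
    """
    neg_inf = float("-inf")
    return [
        [
            [m * (j - i) if j <= i else neg_inf for j in range(max_len)]
            for i in range(max_len)
        ]
        for m in slopes
    ]


def slice_bias(bias, length):
    """Reuse the cached bias for a shorter sequence by taking its top-left block."""
    return [[row[:length] for row in head[:length]] for head in bias]


# Built once for the maximum context, then shared by every layer.
cached = build_alibi_bias([0.5, 0.25], max_len=4)
short = slice_bias(cached, 2)
```

Because the bias depends only on relative offsets, the top-left block of a larger matrix is exactly the bias for a shorter sequence, so a single cached tensor serves all shorter inputs.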

ALiBi generalizes to other domains by redefining the notion of distance:

  • Graphs: $b_{i,j}^{(h)} = -m_h\, d_G(i,j)$, with $d_G(i,j)$ as graph shortest-path distance. Using ALiBi in graph transformers yields gains in molecular conformer generation, outperforming both Laplacian-eigenvector and learnable-bias schemes with faster inference (Gurev et al., 24 Jun 2025).
  • Images: 2D ALiBi for vision: $b_{i,j}^{(h)} = -m_h\, d(p_i, p_j)$, with $d(p_i, p_j)$ the Euclidean or wrap-around distance between patches (Pawlowsky et al., 17 Mar 2026).
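For the graph case, a minimal sketch is to compute hop distances by breadth-first search and scale them by the head's slope. The adjacency-list input format and the function names are assumptions made for this illustration, not the setup of the cited work.

```python
from collections import deque


def shortest_path_lengths(adj):
    """All-pairs hop distances on an unweighted graph given as adjacency lists."""
    n = len(adj)
    inf = float("inf")
    dist = [[inf] * n for _ in range(n)]
    for source in range(n):
        dist[source][source] = 0
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if dist[source][w] == inf:  # first visit gives the shortest hop count
                    dist[source][w] = dist[source][u] + 1
                    queue.append(w)
    return dist


def graph_alibi_bias(adj, slope):
    """Bias matrix -slope * d_G(i, j); unreachable pairs get -inf (fully masked)."""
    return [[-slope * d for d in row] for row in shortest_path_lengths(adj)]


# Path graph 0 - 1 - 2 - 3.
bias = graph_alibi_bias([[1], [0, 2], [1, 3], [2]], slope=0.5)
```

The result is symmetric for undirected graphs and plays the same role as the $|i-j|$ penalty in text, with graph neighborhoods replacing token windows.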

ALiBi’s structure also admits content-gated extensions, low-rank bias composition, and nonconstant slopes within the GRAPE (Group Representational Position Encoding) framework, which subsumes ALiBi as a special case of additive group action (Zhang et al., 8 Dec 2025).

4. Empirical Performance and Applications

ALiBi has demonstrated strong input-length extrapolation, competitive or superior perplexity, and minimal computational cost.

Language Modeling: On WikiText-103, ALiBi-trained models with 1024-token context extrapolate to 2048 and 3072 tokens, matching or outperforming models trained at those longer lengths, while maintaining 100 MB mask overhead and no additional runtime (Press et al., 2021). In larger models (1.3B parameters), ALiBi achieves equal or lower perplexity than sinusoidal or rotary PE, with an 11% reduction in training memory.

Extended Contexts: Baseline ALiBi models begin to degrade beyond a certain multiple of the training context length, with a rapid rise in perplexity. Linear position interpolation (PI), which rescales the ALiBi slope $m_h$ by a length-dependent factor, preserves model behavior at longer contexts and sustains downstream performance in language modeling, summarization, and retrieval (Al-Khateeb et al., 2023).
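A hedged sketch of slope rescaling for position interpolation follows. It assumes the scaling factor is the ratio of training length to evaluation length, by analogy with linear position interpolation for rotary embeddings; that choice, and the name `interpolate_slopes`, are assumptions of this sketch, and the exact factor should be taken from Al-Khateeb et al. (2023).

```python
def interpolate_slopes(slopes, train_len, eval_len):
    """Shrink slopes when evaluating beyond the training length.

    Assumption: scale = train_len / eval_len, capped at 1 so that shorter
    inputs keep the original slopes.
    """
    scale = min(1.0, train_len / eval_len)
    return [m * scale for m in slopes]


# Doubling the context halves every slope under this assumption.
print(interpolate_slopes([0.5, 0.25], train_len=1024, eval_len=2048))
```

With this factor, the penalty at distance `eval_len` under the rescaled slope equals the penalty at distance `train_len` under the original slope, so the bias magnitudes stay within the range seen during training.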

Cognitive Modeling: ALiBi-trained LLMs yield increased fit (ΔLogLik ≈ +400) to human reading-time corpora, highlighting its ability to capture human-like memory decay via mixed recency biases in multiple heads (Clark et al., 2024).

Graphs & Molecules: ALiBi-style bias in molecular graphs allows small non-equivariant transformers to match the performance of much larger models, emphasizing the broad applicability of the method (Gurev et al., 24 Jun 2025).

Vision: In vision transformers, ALiBi removes linearly-decodable positional ramps, eliminates spurious spatial artifacts, and preserves semantic segmentation capabilities on challenging microscopy and material-science tasks, with no loss in mIoU compared to DINOv2 (Pawlowsky et al., 17 Mar 2026).

5. Pathologies, Limitations, and Surgical Correction

A known pathology with ALiBi in large-scale LLMs (e.g., BLOOM) is attention head collapse: 31–44% of heads may "sink" attention almost entirely to BOS due to the monotonic recency bias. Heads with steepest slopes are most prone to this collapse. Surgical reinitialization—a targeted Q/K/V reinit/zeroing of output projections plus frozen non-surgical parameters—recovers almost full operational head capacity and transiently improves in-domain perplexity by 25%, indicating suboptimality of the typical pretraining minima (Schallon, 10 Mar 2026).
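A simple diagnostic for this kind of collapse is to measure, per head, how much attention mass lands on the first (BOS) key. The sketch below is illustrative only: the 0.9 threshold and the function names are hypothetical and are not the criterion used in the cited study.

```python
def bos_mass_per_head(attn):
    """Mean attention weight on key 0 for each head.

    attn[h][i][j] holds softmax weights for head h, query i, key j.
    Query 0 is skipped because under causal masking it can only attend to BOS.
    """
    masses = []
    for head in attn:
        rows = head[1:]
        masses.append(sum(row[0] for row in rows) / len(rows))
    return masses


def collapsed_heads(attn, threshold=0.9):
    """Indices of heads whose mean BOS mass meets the (hypothetical) threshold."""
    return [h for h, mass in enumerate(bos_mass_per_head(attn)) if mass >= threshold]


example = [
    [[1.0, 0.0, 0.0], [0.95, 0.05, 0.0], [0.97, 0.02, 0.01]],  # sinks to BOS
    [[1.0, 0.0, 0.0], [0.5, 0.5, 0.0], [0.2, 0.3, 0.5]],       # healthy head
]
print(collapsed_heads(example))
```

Running such a check over a sample of inputs gives a rough per-head picture of which heads would be candidates for the targeted reinitialization described above.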

ALiBi does not extend true modeling capacity to arbitrarily long-range dependencies; its extrapolation remains robust only up to a limited multiple of the training length and degrades beyond this unless augmented with position interpolation or fine-tuning (Al-Khateeb et al., 2023). In sliding-window ablations, the model does not learn substantially novel long-distance structure (Press et al., 2021).

6. Generalizations: Group-Theoretic, Domain, and Schedule Extensions

The GRAPE framework unifies ALiBi with other positional encoding families via group actions. ALiBi specifically is a rank-1 unipotent action in a general linear group, ensuring exact relativity (the bias depends only on the offset $i-j$), streaming cacheability, and extensibility to content-gated or higher-rank biases (Zhang et al., 8 Dec 2025).
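An elementary way to see why the causal bias is friendly to streaming (this is not the GRAPE construction itself) is that $-m_h (i-j) = m_h j - m_h i$, and the $-m_h i$ term is constant across keys, so the softmax ignores it. A per-key term $m_h j$ can therefore be fixed when the key is cached and never recomputed. The sketch below checks this equivalence numerically; the function names are illustrative.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def row_with_relative_bias(scores, slope):
    """Attention row for the latest query i, using the bias -slope * (i - j)."""
    i = len(scores) - 1
    return softmax([s - slope * (i - j) for j, s in enumerate(scores)])


def row_with_key_side_bias(scores, slope):
    """Same row using only a per-key term slope * j, fixed at cache time."""
    return softmax([s + slope * j for j, s in enumerate(scores)])


scores = [0.3, -1.2, 0.7, 2.0]  # content logits q_i . k_j / sqrt(d) for one query
a = row_with_relative_bias(scores, 0.25)
b = row_with_key_side_bias(scores, 0.25)
print(max(abs(x - y) for x, y in zip(a, b)))
```

The two rows agree up to floating-point error, which is the practical content of "the bias depends only on the offset" in the autoregressive setting.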

Generalization across domains is achieved by redefining the distance metric: graph shortest-path distance for molecular graphs, 2D wrap-around distance for images, and linear offset for text. Slope schedules may be fixed (the standard geometric progression) or learned, and future work targets adaptive or nonlinear schedules for improved extrapolation and richer bias profiles (Al-Khateeb et al., 2023, Zhang et al., 8 Dec 2025).
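For images, one possible reading of "2D wrap-around distance" is the Euclidean distance on a torus, where each axis offset is taken modulo the grid size. The sketch below builds a patch-to-patch bias under that assumption; the interpretation and the function names are this sketch's own and may differ from the cited vision work.

```python
import math


def wraparound_distance(p, q, height, width):
    """Euclidean distance between grid cells p and q on a torus (assumed metric)."""
    d_row = abs(p[0] - q[0])
    d_col = abs(p[1] - q[1])
    d_row = min(d_row, height - d_row)  # going around the edge may be shorter
    d_col = min(d_col, width - d_col)
    return math.hypot(d_row, d_col)


def patch_alibi_bias(height, width, slope):
    """Bias matrix over row-major flattened patches: -slope * distance."""
    coords = [(r, c) for r in range(height) for c in range(width)]
    return [
        [-slope * wraparound_distance(p, q, height, width) for q in coords]
        for p in coords
    ]


bias_2d = patch_alibi_bias(4, 4, slope=0.5)
```

On a 4×4 grid, patches (0, 0) and (0, 3) are one step apart under wrap-around rather than three, so the bias treats opposite edges as neighbors.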

7. Outlook and Open Questions

ALiBi’s simplicity, zero-parameter nature, cache efficiency, and domain generality make it a standard baseline for length-extrapolating Transformers in language, graphs, and vision. Open research directions include the adaptive and nonlinear slope schedules noted in Section 6.

ALiBi remains a canonical example of positional encoding by additive bias, enabling robust length-handling with minimal implementation and strong empirical performance across modern Transformer deployments.
