ALiBi: Linear Biases in Transformers

Updated 6 April 2026

ALiBi is a positional encoding scheme that adds a linear, distance-dependent bias to attention logits, enabling efficient handling of variable-length inputs.
It replaces standard absolute or rotary embeddings with a parameter-free additive bias, promoting recency and multi-scale focus across attention heads.
Empirical results show improved language modeling, vision segmentation, and graph performance by adapting recency penalties and enabling streaming cacheability.

Attention with Linear Biases (ALiBi) is a positional encoding scheme for Transformers that introduces a linear, head-specific bias to the attention logits, designed to enable parameter-free length extrapolation, multi-scale inductive bias, and efficient handling of variable-length inputs. ALiBi replaces standard absolute or rotary position embeddings with an additive term proportional to the key-query distance, conferring strong recency bias and streaming cacheability without additional parameters. It generalizes naturally across domains—language, graphs, and vision—by substituting the distance metric as appropriate.

1. Mathematical Formulation and Algorithmic Details

In a Transformer layer with $H$ attention heads, ALiBi injects a distance-dependent bias into each head’s unnormalized attention logits. For a sequence of length $L$, the raw attention logits for head $h$ are

$A_{i,j}^{(h)} = \frac{q_i^{(h)} \cdot k_j^{(h)}}{\sqrt{d}} + b_{i,j}^{(h)},$

where $q_i^{(h)}$ is the query at position $i$, $k_j^{(h)}$ is the key at position $j$, $d$ is the per-head dimension, and $b_{i,j}^{(h)}$ is the positional bias.

For 1D data (text), the bias takes the form

$b_{i,j}^{(h)} = -m_h |i-j|,$

or, in the causal regime,

$b_{i,j}^{(h)} = -m_h (i-j), \quad j \leq i,$

where $m_h > 0$ is a fixed, non-learned slope assigned to each head. The slopes are typically set as a geometric progression,

$m_h = 2^{-8h/H}, \quad h = 1, \dots, H,$

as in Press et al. (2021). As a result, some heads apply steep recency penalties (local focus), while others decay more slowly (long-range focus).
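As a minimal sketch of the geometric slope schedule, the snippet below assumes the commonly used form $m_h = 2^{-8h/H}$ with a power-of-two head count. The function name `alibi_slopes` is illustrative and not taken from any particular library.

```python
def alibi_slopes(num_heads):
    """Return the geometric slope schedule m_h = 2^(-8h/H) for h = 1..H.

    Assumes num_heads is a power of two, the case this closed form covers.
    """
    return [2.0 ** (-8.0 * h / num_heads) for h in range(1, num_heads + 1)]


slopes = alibi_slopes(8)
# The first head gets the steepest slope (local focus);
# the last head gets the shallowest slope (long-range focus).
print(slopes)
```

With 8 heads the slopes halve from one head to the next, which is what spreads the heads across different length scales.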

Scaled-dot-product attention with ALiBi becomes

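$\mathrm{Attn}^{(h)}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q^{(h)} K^{(h)\top}}{\sqrt{d}} + B^{(h)}\right) V^{(h)},$

where $B^{(h)} = [b_{i,j}^{(h)}]_{i,j=1}^{L}$ is the matrix of positional biases for head $h$.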
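The following sketch illustrates causal scaled-dot-product attention with an ALiBi bias for a single head, written in plain Python for readability rather than speed. The function name and the toy inputs are assumptions made for this example.

```python
import math


def alibi_causal_attention(q, k, v, slope):
    """Single-head causal attention with an ALiBi bias.

    q, k, v: lists of L vectors (lists of floats); slope: the head's m_h.
    """
    d = len(q[0])
    out = []
    for i, q_i in enumerate(q):
        # Logits for visible keys j <= i: scaled dot product minus slope * (i - j).
        logits = [
            sum(a * b for a, b in zip(q_i, k[j])) / math.sqrt(d) - slope * (i - j)
            for j in range(i + 1)
        ]
        # Numerically stable softmax over the visible keys.
        peak = max(logits)
        weights = [math.exp(x - peak) for x in logits]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted sum of the value vectors.
        out.append([
            sum(w * v[j][c] for j, w in enumerate(weights))
            for c in range(len(v[0]))
        ])
    return out


# Toy input: 3 positions, head dimension 2.
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
result = alibi_causal_attention(q, k, v, slope=0.5)
```

The only change relative to ordinary causal attention is the `- slope * (i - j)` term added to each logit before the softmax.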

The ALiBi mechanism has been characterized mathematically as a rank-1 unipotent action in the general linear group GL. This formulation satisfies an exact relative positional law and supports streaming cacheability for autoregressive decoding (Zhang et al., 8 Dec 2025).

2. Motivations, Inductive Bias, and Theoretical Underpinnings

ALiBi was introduced to address the inability of standard and rotary position embeddings to extrapolate effectively beyond the training context length. Absolute embeddings cannot accommodate positions beyond their training limit, while rotary (RoPE) and sinusoidal embeddings empirically degrade with increasing distance, often leading to divergence of attention magnitudes (Press et al., 2021, Al-Khateeb et al., 2023).

The recency bias imposed by ALiBi translates, after the softmax, into an exponential memory decay: the attention weight satisfies $\alpha_{i,j}^{(h)} \propto \exp\!\left(\frac{q_i^{(h)} \cdot k_j^{(h)}}{\sqrt{d}}\right) e^{-m_h |i-j|},$ which causes attention weights to decrease exponentially as key-query distance grows. Assigning geometrically spaced slopes across heads yields a mixture of decay rates, so different attention heads specialize in tracking dependencies at distinct length scales (Clark et al., 2024).
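To make the decay concrete, the sketch below considers the idealized case in which all content logits are equal, so that only the bias shapes the attention distribution. Under that assumption the weight shrinks by a factor of $e^{-m_h}$ per step of distance, and it halves every $\ln 2 / m_h$ tokens. The helper name `recency_weights` is hypothetical.

```python
import math


def recency_weights(slope, num_keys):
    """Attention weights of one query when all content logits are equal.

    Index d of the result is the weight on the key at distance d.
    """
    # Only the bias -slope * distance remains inside the softmax.
    unnorm = [math.exp(-slope * dist) for dist in range(num_keys)]
    total = sum(unnorm)
    return [u / total for u in unnorm]


for slope in (0.5, 2.0 ** -8):
    w = recency_weights(slope, num_keys=64)
    half_life = math.log(2.0) / slope  # distance at which the weight halves
    print(f"slope={slope:.4f}  half-life={half_life:.1f} tokens  w[0]={w[0]:.3f}")
```

A steep slope concentrates almost all weight on the nearest few keys, while a shallow slope leaves the distribution close to uniform over the window.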

This multi-scale induction mechanism enhances the model’s capacity to allocate some heads to short-range (e.g., argument structure) and others to long-range (e.g., coreference) dependencies in language, and analogously for neighborhood sizes in graphs or spatial patch relationships in vision.

3. Practical Implementations and Extensions

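ALiBi requires only minor code modifications in standard multi-head attention layers. The positional bias can be precomputed once for a given context length and added to the attention logits at each layer, incurring negligible runtime or memory overhead. No additional learnable parameters are introduced. A typical sequence: compute the per-head slopes $m_h$; build the matrix of pairwise distances $|i-j|$; multiply it by $-m_h$ to obtain the bias $b_{i,j}^{(h)}$; add the bias (together with the causal mask, if any) to the scaled query-key logits; then apply the softmax as usual (Press et al., 2021, Al-Khateeb et al., 2023).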
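The precompute-once pattern described above might look like the following sketch. It assumes the geometric slope schedule $m_h = 2^{-8h/H}$ and folds the causal mask into the bias as $-\infty$ entries. The names `build_causal_bias` and `slice_bias` are hypothetical helpers, not library functions.

```python
def alibi_slopes(num_heads):
    """Geometric slope schedule m_h = 2^(-8h/H) (assumed form)."""
    return [2.0 ** (-8.0 * h / num_heads) for h in range(1, num_heads + 1)]


def build_causal_bias(num_heads, max_len):
    """Precompute bias[h][i][j] = -m_h * (i - j) for j <= i, else -inf."""
    neg_inf = float("-inf")
    return [
        [[-m * (i - j) if j <= i else neg_inf for j in range(max_len)]
         for i in range(max_len)]
        for m in alibi_slopes(num_heads)
    ]


def slice_bias(bias, length):
    """Reuse the cached bias for a shorter sequence via its top-left block."""
    return [[row[:length] for row in head[:length]] for head in bias]


bias = build_causal_bias(num_heads=4, max_len=8)  # computed once
short = slice_bias(bias, 3)                       # can be shared by every layer
```

Because the bias depends only on positions and not on the input, the same tensor can be added to the logits of every layer and every batch element.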

ALiBi generalizes to other domains by redefining the notion of distance:

Graphs: $b_{u,v}^{(h)} = -m_h\, d(u,v)$, with $d(u,v)$ as graph shortest-path distance. Using ALiBi in graph transformers yields gains in molecular conformer generation, outperforming both Laplacian-eigenvector and learnable-bias schemes with faster inference (Gurev et al., 24 Jun 2025).
Images: 2D ALiBi for vision: $b_{p,q}^{(h)} = -m_h\, d(p,q)$, with $d(p,q)$ the Euclidean or wrap-around distance between patches $p$ and $q$ (Pawlowsky et al., 17 Mar 2026).
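The two distance substitutions can be sketched as follows. This is an illustration under stated assumptions and not the implementation from the cited papers: the graph is unweighted and given as adjacency lists, unreachable node pairs receive a bias of $-\infty$, and the wrap-around distance is taken to be Euclidean distance on a torus of patches.

```python
from collections import deque


def shortest_path_lengths(adj, source):
    """BFS hop counts from `source` in an unweighted graph (adjacency lists)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def graph_alibi_bias(adj, slope):
    """bias[u][v] = -slope * d(u, v); unreachable pairs get -inf (assumption)."""
    n = len(adj)
    bias = []
    for u in range(n):
        dist = shortest_path_lengths(adj, u)
        bias.append([
            -slope * dist[v] if v in dist else float("-inf") for v in range(n)
        ])
    return bias


def wraparound_distance(p, q, height, width):
    """Euclidean distance between patches p=(row, col) and q on a torus."""
    dr = abs(p[0] - q[0])
    dc = abs(p[1] - q[1])
    dr = min(dr, height - dr)  # going around the edge may be shorter
    dc = min(dc, width - dc)
    return (dr * dr + dc * dc) ** 0.5


# Path graph 0 - 1 - 2 - 3.
adj = [[1], [0, 2], [1, 3], [2]]
bias = graph_alibi_bias(adj, slope=0.5)
```

In both cases the attention code itself is unchanged; only the distance fed into the bias differs from the 1D text case.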

ALiBi’s structure also admits content-gated extensions, low-rank bias composition, and nonconstant slopes within the GRAPE (Group Representational Position Encoding) framework, which subsumes ALiBi as a special case of additive group action (Zhang et al., 8 Dec 2025).

4. Empirical Performance and Applications

ALiBi has demonstrated strong input-length extrapolation, competitive or superior perplexity, and minimal computational cost.

Language Modeling: On WikiText-103, ALiBi-trained models with a 1024-token context extrapolate to 2048 and 3072 tokens, matching or outperforming models trained at those longer lengths, with a 100 MB bias-mask memory overhead and no additional runtime (Press et al., 2021). In larger models (1.3B parameters), ALiBi achieves equal or lower perplexity than sinusoidal or rotary positional encodings, with an 11% reduction in training memory.

Extended Contexts: Baseline ALiBi models begin to degrade slightly beyond $L$ 9 the training context, with a rapid rise in perplexity. Linear position interpolation (PI), which scales the ALiBi slope $h$ 0 by $h$ 1, preserves model behavior to $h$ 2 and sustains downstream performance in language modeling, summarization, and retrieval (Al-Khateeb et al., 2023).
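A hedged sketch of slope rescaling for position interpolation follows. The scale factor used here, the ratio of training length to inference length, is an assumption made for illustration and is not taken from the text above. With it, the largest bias at the longer length stays roughly where it was during training.

```python
def interpolated_slopes(slopes, train_len, eval_len):
    """Scale ALiBi slopes for inference at eval_len > train_len.

    Assumed scale factor: train_len / eval_len (an illustrative choice).
    """
    if eval_len <= train_len:
        return list(slopes)  # within the training range, leave slopes unchanged
    scale = train_len / eval_len
    return [m * scale for m in slopes]


# Geometric schedule for 8 heads (assumed form m_h = 2^(-8h/H)).
slopes = [2.0 ** (-8.0 * h / 8) for h in range(1, 9)]
scaled = interpolated_slopes(slopes, train_len=1024, eval_len=2048)
```

Since only the slopes change, the rescaling can be applied at inference time without touching any learned weights.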

Cognitive Modeling: ALiBi-trained LLMs yield increased fit (ΔLogLik ≈ +400) to human reading-time corpora, highlighting ALiBi's ability to capture human-like memory decay via mixed recency biases across multiple heads (Clark et al., 2024).

Graphs & Molecules: ALiBi-style bias in molecular graphs allows small non-equivariant transformers to match the performance of much larger models, emphasizing the broad applicability of the method (Gurev et al., 24 Jun 2025).

Vision: In vision transformers, ALiBi removes linearly-decodable positional ramps, eliminates spurious spatial artifacts, and preserves semantic segmentation capabilities on challenging microscopy and material-science tasks, with no loss in mIoU compared to DINOv2 (Pawlowsky et al., 17 Mar 2026).

5. Pathologies, Limitations, and Surgical Correction

A known pathology with ALiBi in large-scale LLMs (e.g., BLOOM) is attention head collapse: 31–44% of heads may "sink" attention almost entirely to BOS due to the monotonic recency bias. Heads with the steepest slopes are most prone to this collapse. Surgical reinitialization (targeted reinitialization of Q/K/V, zeroing of output projections, and freezing of all non-surgical parameters) recovers almost full operational head capacity and transiently improves in-domain perplexity by 25%, suggesting that typical pretraining minima are suboptimal (Schallon, 10 Mar 2026).
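One simple way to look for BOS-sinking heads is to measure how much attention each head places on position 0. The sketch below is a hypothetical diagnostic, not the procedure from the cited work. The 0.9 cutoff and the function names are assumptions chosen for illustration.

```python
def bos_mass(head_attn):
    """Mean attention weight on position 0 over queries i >= 1."""
    rows = head_attn[1:]  # query 0 can only attend to itself under causal masking
    return sum(row[0] for row in rows) / len(rows)


def collapsed_heads(attn, threshold=0.9):
    """Indices of heads whose mean BOS mass exceeds `threshold` (hypothetical cutoff)."""
    return [h for h, head in enumerate(attn) if bos_mass(head) > threshold]


# attn[h][i][j]: post-softmax weights for two toy heads over 3 positions.
attn = [
    [[1.0, 0.0, 0.0], [0.95, 0.05, 0.0], [0.97, 0.01, 0.02]],  # sinks to BOS
    [[1.0, 0.0, 0.0], [0.3, 0.7, 0.0], [0.1, 0.3, 0.6]],       # spreads attention
]
flagged = collapsed_heads(attn)
```

In practice the weights would be averaged over many sequences before applying any cutoff.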

ALiBi does not extend true modeling capacity to arbitrary long-range dependencies; its extrapolation remains robust out to approximately $h$ 3, but degrades beyond this unless augmented with position interpolation or fine-tuning (Al-Khateeb et al., 2023). The model does not, in sliding-window ablations, learn substantially novel long-distance structure (Press et al., 2021).

6. Generalizations: Group-Theoretic, Domain, and Schedule Extensions

The GRAPE framework unifies ALiBi with other positional encoding families via group actions. ALiBi specifically is a rank-1 unipotent action in the general linear group GL, ensuring exact relativity (the bias depends only on the offset $i-j$), streaming cacheability, and extensibility to content-gated or higher-rank biases (Zhang et al., 8 Dec 2025).
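The relativity and streaming properties can be illustrated without any group-theoretic machinery. The sketch below is not the GRAPE implementation. It only shows that the causal bias depends on the offset alone, so each decoding step needs nothing more than the offsets from the new query to the cached keys. The helper name `bias_row` is hypothetical.

```python
def bias_row(slope, t):
    """Bias for a new query at position t against cached keys 0..t."""
    return [-slope * (t - j) for j in range(t + 1)]


slope = 0.25

# Relativity: the bias at (i, j) equals the bias at (i + s, j + s).
row5 = bias_row(slope, 5)
row9 = bias_row(slope, 9)
assert row5[3] == row9[7]  # both pairs have offset 2

# Streaming: at each step only the new query's bias row is computed;
# cached keys and values are never re-encoded as the sequence grows.
for t in range(3):
    print(t, bias_row(slope, t))
```

This contrasts with absolute position embeddings, where position information is baked into the cached representations themselves.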

Generalization across domains is achieved by redefining the distance metric: molecular graphs use shortest-path distances, images use 2D wrap-around distances, and text uses the linear offset. Slope schedules may be fixed (the standard geometric progression) or learned, and future work targets adaptive or nonlinear schedules for improved extrapolation and richer bias profiles (Al-Khateeb et al., 2023, Zhang et al., 8 Dec 2025).

7. Outlook and Open Questions

ALiBi’s simplicity, zero-parameter nature, cache efficiency, and domain generality make it a standard baseline for length-extrapolating Transformers in language, graphs, and vision. Open research directions include:

Hybrid positional bias/interpolation plus lightweight fine-tuning for $h$ 6 extrapolation (Al-Khateeb et al., 2023).
Investigating non-geometric or learned per-head slope schedules (Zhang et al., 8 Dec 2025).
Analyzing interactions between ALiBi, self-supervised semantic representations, and emergent head specialization (Pawlowsky et al., 17 Mar 2026, Clark et al., 2024).
Theoretical characterization of collapse local minima and global redistribution phenomena in ALiBi models (Schallon, 10 Mar 2026).
Fully unifying multiplicative (RoPE) and additive (ALiBi) encodings for arbitrary continuous/graph domains (Zhang et al., 8 Dec 2025).

ALiBi remains a canonical example of positional encoding by additive bias, enabling robust length-handling with minimal implementation and strong empirical performance across modern Transformer deployments.