Scaled-dot-product Attention with ALiBi

Updated 21 April 2026

The paper introduces ALiBi, which incorporates a head-specific linear bias into self-attention to penalize distant tokens and enhance sequence extrapolation.
The method scales the bias slopes geometrically across attention heads, producing a strong recency bias without extra learned parameters and leading to faster convergence and lower memory usage.
Extensions include position interpolation and multi-modal adaptations, allowing effective application in longer sequences and diverse domains such as vision and remote sensing.

Scaled dot-product attention with ALiBi (Attention with Linear Biases) is a variant of self-attention in transformer models that replaces classical positional encodings with a fixed, head-specific linear bias on query-key attention scores. Unlike sinusoidal or learned positional embeddings, ALiBi directly penalizes attention to distant positions by subtracting a linear function of relative distance, yielding an intrinsic recency bias. This structural difference improves length extrapolation: models trained on short sequences can generalize to substantially longer sequences at inference, with lower computational and memory overhead. Recent work has extended ALiBi to vision and multi-modal contexts and has developed an inference-time “position interpolation” technique that roughly doubles the effective context range without retraining.

1. Scaled-dot-product Attention and ALiBi Modification

Standard scaled-dot-product attention computes, for queries $Q \in \mathbb{R}^{T \times d_k}$, keys $K \in \mathbb{R}^{T \times d_k}$, and values $V \in \mathbb{R}^{T \times d_v}$,

$\mathrm{Attention}(Q, K, V) = \mathrm{softmax} \left( \frac{Q K^\top}{\sqrt{d_k}} \right) V.$

The unnormalized score for query position $i$ and key position $j$ is $e_{ij} = \frac{q_i \cdot k_j}{\sqrt{d_k}}$.

ALiBi modifies this by introducing a per-head, distance-dependent bias applied to each score before the softmax. For attention head $h$ with fixed slope $\alpha^{(h)}$, the bias for query position $i$ and key position $j$ is

$b^{(h)}_{ij} = -\alpha^{(h)} \, |i - j|,$

yielding the modified attention: $\mathrm{Attention}^{(h)}(Q, K, V) = \mathrm{softmax} \left( \frac{Q K^\top}{\sqrt{d_k}} + B^{(h)} \right) V,$ where $B^{(h)} = [b^{(h)}_{ij}]$. No explicit positional embeddings are added to $Q$ or $K$; instead, the attention mechanism is directly biased. Computation is efficiently implemented as a single element-wise addition and requires no additional parameters beyond the fixed slope values (Press et al., 2021).
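The modification described above can be sketched for a single head in plain Python. This is a minimal illustration, not a reference implementation: it assumes the bias has the form "minus slope times absolute distance", applies causal masking by default (an assumption of this sketch), and uses nested lists instead of a tensor library for readability. The function name `alibi_attention` and its parameters are hypothetical.

```python
import math

def alibi_attention(Q, K, V, slope, causal=True):
    """Single-head scaled dot-product attention with an ALiBi-style bias.

    Q, K: lists of T vectors of length d_k; V: list of T vectors of length d_v.
    slope: the fixed (not learned) per-head slope.
    """
    T, d_k = len(Q), len(Q[0])
    out = []
    for i in range(T):
        # Keys visible to query i (causal masking is an assumption here).
        visible = range(i + 1) if causal else range(T)
        # Raw score q_i . k_j / sqrt(d_k), plus the bias -slope * |i - j|.
        scores = [
            sum(q * k for q, k in zip(Q[i], K[j])) / math.sqrt(d_k)
            - slope * abs(i - j)
            for j in visible
        ]
        # Numerically stable softmax over the visible keys.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        # Weighted sum of the value vectors.
        out.append([
            sum(w * V[j][c] for w, j in zip(weights, visible))
            for c in range(len(V[0]))
        ])
    return out

# Identical queries and keys, so every raw score ties and only the bias matters.
Q = K = [[1.0, 0.0]] * 4
V = [[0.0], [1.0], [2.0], [3.0]]
plain = alibi_attention(Q, K, V, slope=0.0)   # uniform weights over visible keys
biased = alibi_attention(Q, K, V, slope=1.0)  # weights shift toward recent keys
```

With a zero slope the last query averages all four values equally; with a positive slope its output moves toward the most recent value, which is the recency effect the bias is meant to produce.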

2. Choice of Slopes and Inductive Bias

The slopes $\alpha^{(h)}$ (or $m_h$ in alternative notation) are fixed before training and span a geometric range across heads, typically from a fast-decaying bias (strong recency) to a slow one. For $H$ heads, they are commonly set as

$\alpha^{(h)} = 2^{-8h/H}, \quad h = 1, \dots, H.$

For $H$ that is not a power of two, interleaved values covering a similar logarithmic scale are used.

This geometric arrangement ensures that different heads specialize to different degrees of recency bias. The result is that tokens attend more strongly to recent positions, as distant keys are penalized more heavily. This matches statistical properties of natural language and introduces a robust inductive bias favoring recent context (Press et al., 2021).
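As a worked example of the geometric arrangement, the sketch below generates slopes under the assumption that head $h$ of $H$ receives $2^{-8h/H}$, the schedule commonly attributed to Press et al. (2021). It covers only power-of-two head counts; the interleaving used for other head counts is not reproduced here. The helper name `alibi_slopes` is hypothetical.

```python
def alibi_slopes(num_heads):
    """Geometric slopes 2^(-8h/H) for h = 1..H (H assumed a power of two)."""
    if num_heads < 1 or num_heads & (num_heads - 1):
        raise ValueError("this sketch only covers power-of-two head counts")
    return [2.0 ** (-8.0 * h / num_heads) for h in range(1, num_heads + 1)]

slopes = alibi_slopes(8)
# slopes[0] is 1/2 (steepest decay, strongest recency bias);
# slopes[-1] is 1/256 (gentlest decay, widest effective context).
```

Each head's slope is a constant factor smaller than the previous one, so the heads cover recency horizons that are evenly spaced on a logarithmic scale.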

3. Sequence Length Extrapolation and Position Interpolation

The linear form $-\alpha^{(h)} |i - j|$ is defined for arbitrarily long sequences since it depends only on relative distance, not absolute position. Thus, an ALiBi-equipped transformer trained on context window $L_{\text{train}}$ can be applied to inputs of length $L_{\text{test}} > L_{\text{train}}$ at inference, by generating a correspondingly larger bias matrix. Empirically, this enables high-fidelity extrapolation:

On WikiText-103, a model trained with ALiBi on a shorter context outperforms sinusoidal baselines when evaluated at longer test lengths, with up to 1.8× faster training and lower perplexity (Press et al., 2021).
On CC100+RoBERTa, ALiBi models achieve equivalent or better perplexity with 11% less memory and 11% faster convergence (Press et al., 2021).
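The reason a larger bias matrix can simply be generated at inference is that each entry depends only on the distance between query and key. The short check below illustrates this under the assumption that the bias is "minus slope times absolute distance"; the lengths 4 and 8 and the slope 0.25 are illustrative values, and `alibi_bias` is a hypothetical helper.

```python
def alibi_bias(slope, length):
    """Bias matrix with entries -slope * |i - j| for one head."""
    return [[-slope * abs(i - j) for j in range(length)] for i in range(length)]

shorter = alibi_bias(0.25, 4)  # stands in for the training length
longer = alibi_bias(0.25, 8)   # stands in for a longer inference length

# The top-left 4x4 block of the longer matrix equals the shorter matrix,
# so positions seen in training receive exactly the same bias at inference.
prefix_matches = all(longer[i][:4] == shorter[i] for i in range(4))
```

Only the entries for distances beyond the training window are new, and they continue the same straight line.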

However, for much longer contexts ($L_{\text{test}} \gg L_{\text{train}}$), ALiBi’s unscaled bias can excessively penalize distant tokens, degrading performance. Position Interpolation (PI), introduced in "Position Interpolation Improves ALiBi Extrapolation," addresses this by linearly rescaling slopes at inference: $\tilde{\alpha}^{(h)} = \alpha^{(h)} \cdot \frac{L_{\text{train}}}{L_{\text{test}}},$ where $L_{\text{train}}$ is the maximum training context and $L_{\text{test}}$ is the test-time context (Al-Khateeb et al., 2023).

PI ensures that attention biases for long distances remain within the range seen during training, substantially delaying the onset of performance degradation and enabling extrapolation to approximately $2 L_{\text{train}}$ with minimal loss in language modeling, summarization, and retrieval tasks.
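The effect of the rescaling on the bias range can be seen with a small calculation. The sketch assumes the slopes are multiplied by the ratio of training length to test length whenever the test length exceeds the training length, and are left unchanged otherwise. The lengths 1024 and 2048 and the two slopes are illustrative values, and `interpolate_slopes` is a hypothetical helper.

```python
def interpolate_slopes(slopes, train_len, test_len):
    """Rescale slopes by train_len / test_len when test_len > train_len."""
    if test_len <= train_len:
        return list(slopes)  # within the training window: plain ALiBi
    scale = train_len / test_len
    return [s * scale for s in slopes]

slopes = [0.5, 0.25]  # illustrative slopes for two heads
scaled = interpolate_slopes(slopes, 1024, 2048)

# Largest penalty magnitude at the far end of each window, for the first head.
max_train = slopes[0] * (1024 - 1)       # 511.5, the largest penalty seen in training
max_test_plain = slopes[0] * (2048 - 1)  # 1023.5 without interpolation
max_test_pi = scaled[0] * (2048 - 1)     # 511.75 with interpolation
```

Without rescaling, the most distant token at the longer length is penalized about twice as heavily as anything seen in training; with rescaling, the largest penalty stays close to the training maximum.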

4. Implementation, Computational Efficiency, and Extensions

ALiBi incurs minimal computational overhead: the only additional operation is element-wise addition of a broadcasted $H \times T \times T$ bias tensor. No learned positional embedding parameters are required, and the memory cost of storing $H T^2$ floats is modest for typical sequence lengths and head counts. The method is fully compatible with standard multi-head attention workflows and can be implemented by pre-computing slope values and a distance matrix (Press et al., 2021).
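A rough sense of the storage involved comes from a back-of-the-envelope calculation. The figures below assume 16 heads, a sequence length of 2048, and 4-byte floats, all illustrative values rather than numbers from the cited work; `alibi_bias_bytes` is a hypothetical helper. The bias is identical for every sequence in a batch, so it does not grow with batch size.

```python
def alibi_bias_bytes(num_heads, seq_len, bytes_per_float=4):
    """Bytes needed to materialize a full (H, T, T) bias tensor."""
    return num_heads * seq_len * seq_len * bytes_per_float

# Fully materialized bias for 16 heads at length 2048 in 32-bit floats.
full = alibi_bias_bytes(num_heads=16, seq_len=2048)  # 268_435_456 bytes (256 MiB)

# Storing only the 16 slopes plus one shared (T, T) distance matrix,
# and forming the per-head bias by broadcasting at use time.
factored = 16 * 4 + 2048 * 2048 * 4                  # 16_777_280 bytes (about 16 MiB)
```

Keeping the slopes and the distance matrix separate, as the pre-computation described above suggests, cuts the stored size by roughly the number of heads.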

PI is an inference-only modification, requiring no retraining or fine-tuning when increasing sequence length. For $L_{\text{test}} \le L_{\text{train}}$, the model operates identically to plain ALiBi. A limitation is that extrapolation beyond $2 L_{\text{train}}$ still leads to quality degradation, especially for fine-grained retrieval (Al-Khateeb et al., 2023).

ALiBi also provides the architectural flexibility for application to modalities beyond language, as demonstrated in the vision domain.

5. Vision and Multi-Modal Extensions

Scale-ALiBi extends the ALiBi principle to multi-scale, multi-modal vision transformers by redefining the linear bias in terms of Euclidean spatial distance and ground sample distance (GSD), accommodating tokens originating from different spatial resolutions (e.g., satellite imagery) (Kage et al., 11 Apr 2026). For two tokens arising from imagery with possibly different GSD values, the spatial bias is a linear penalty on the Euclidean distance between their positions, converted to ground distance using the source GSD of the query stream.

In multi-stream architectures, such as those incorporating SAR and optical imagery at different resolutions, Scale-ALiBi enables cross-stream fusion without resampling or tiling, aligning attention biases with true ground distance. Implementations show that spatial ALiBi alone suffices for effective positional encoding, and preliminary benchmarks indicate competitive or superior performance relative to state-of-the-art baselines, especially in cross-scale retrieval tasks (Kage et al., 11 Apr 2026).

6. Empirical Performance and Benchmarks

Key benchmarks from the original and subsequent work demonstrate the following:

On WikiText-103 and BookCorpus, ALiBi matches or surpasses sinusoidal baselines in perplexity, even when extrapolating to 3× the training sequence length (Press et al., 2021).
On CC100+RoBERTa, ALiBi reduces memory by 6–11% and achieves faster convergence for equivalent perplexity (Press et al., 2021).
Position Interpolation (PI) effectively doubles context window applicability with no retraining, sharply improving downstream task metrics (ROUGE for summarization, retrieval accuracy) when exceeding training context by up to $2\times$ (Al-Khateeb et al., 2023).
In multi-modal remote sensing, Scale-ALiBi matches or outperforms CROMA on classification/segmentation and retrieval under both neural and non-parametric probing protocols (Kage et al., 11 Apr 2026).

7. Limitations and Prospective Research Directions

ALiBi does not require learned positional parameters and its bias tensor is non-adaptive, which simplifies implementation but may underutilize potential adaptivity in position encoding. When extrapolating far beyond double the training context, attention degradation reemerges. Existing analyses and results suggest combining PI with fine-tuning, using non-uniform interpolation schedules, or hybridizing with other position encodings (e.g., Rotary) as productive future pathways to extend robust extrapolation. In vision, further investigation into learned or adaptive bias functions, scaling to global satellite coverage, and incorporating temporal revisits represent natural next steps (Al-Khateeb et al., 2023, Kage et al., 11 Apr 2026).