BiPE-ALiBi: Bidirectional Linear Bias

Updated 15 March 2026

BiPE-ALiBi is a bidirectional attention bias mechanism that applies distinct linear slopes to forward and backward token offsets, enhancing asymmetry in relative positional encoding.
It leverages the Additive GRAPE framework to implement relative biases with linear memory scaling and supports streaming inference without an explicit O(L²) mask.
The approach enables flexible hyperparameter tuning and improved handling of asymmetric dependencies in domains such as language modeling, code analysis, and genomics.

BiPE-ALiBi refers to the class of bidirectional, linear distance-based attention biases where the offset-dependent bias applied to attention logits is allowed to differ in scale (or slope) for forward and backward directions. This construction has its roots in the original ALiBi (“Attention with Linear Biases”) mechanism and is further theoretically grounded and extended in the Additive GRAPE (Group Representational Position Encoding) framework. The BiPE-ALiBi form underpins a general design space for relative positional bias, supports streaming inference, and can be implemented efficiently without ever materializing an $O(L^2)$ bias mask.

1. Mathematical Formulation and Theoretical Foundations

ALiBi introduces a head-specific, linear logit bias that penalizes attention weights by token-to-token offset, most often $a_{i,j} = -m_h\,(i-j)$ for head $h$ and keys $j \le i$ under causal masking. In the Bidirectional Positional Encoding (“BiPE”-style) extension, one generalizes to

$a_{i,j} = \begin{cases} -m_+\,(j-i), & j > i \\ -m_-\,(i-j), & j < i \end{cases}$

where $m_+$ and $m_-$ may differ, permitting asymmetric treatment of forward and backward context. Both branches vanish as the offset goes to zero, so the diagonal $j = i$ carries no bias. This form explicitly appears as the canonical two-generator extension in the Additive GRAPE formalism, with each direction governed by a distinct nilpotent generator in $\mathrm{GL}(d+2)$. The additive logit bias thus remains a strict function of offset, preserving exact relative laws and streaming cacheability (Zhang et al., 8 Dec 2025).
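The piecewise definition above can be sketched directly in code. This is a minimal illustration rather than a reference implementation: the function names are hypothetical, the slope values are arbitrary, and the zero bias at $j = i$ is an assumption (the formula only specifies $j > i$ and $j < i$). The dense matrix is built purely for inspection on a short sequence.

```python
def bipe_alibi_bias(i, j, m_plus, m_minus):
    """Additive logit bias for query position i attending to key position j."""
    if j > i:
        return -m_plus * (j - i)   # forward (later) keys use slope m_plus
    if j < i:
        return -m_minus * (i - j)  # backward (earlier) keys use slope m_minus
    return 0.0                     # assumption: no bias at zero offset


def bias_matrix(length, m_plus, m_minus):
    """Dense L x L bias, only for inspecting short sequences."""
    return [[bipe_alibi_bias(i, j, m_plus, m_minus) for j in range(length)]
            for i in range(length)]


# Arbitrary example slopes: forward context is penalized twice as steeply.
B = bias_matrix(4, m_plus=0.5, m_minus=0.25)
for row in B:
    print(row)
```

Each row depends only on the offset $j - i$, so entries along any diagonal of `B` are equal; this is the "strict function of offset" property stated above.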

2. Implementation Without an Explicit $O(L^2)$ Mask

In conventional ALiBi-style implementations, the $L \times L$ bias matrix for sequence length $L$ is typically constructed and added to the scaled dot-product logits. Both Additive GRAPE and HyPE (Hyperbolic Positional Encoding) demonstrate that these biases can be implemented without such quadratic memory cost by:

Representing the bias mechanism as a low-rank or structured augmentation of query/key features;
Folding the resulting effect into the $QK^\top$ score via small extra features or homogeneous coordinates.

For example, HyPE injects a hyperbolic version of the bidirectional bias by concatenating per-token 2-dimensional features, so that computing $\hat Q \hat K^\top$ implicitly adds

$a^{\text{HyPE}}_{i,j} = -\tau\, \sinh(\mu (j-i))$

to each logit, with $\mu$ and $\tau$ chosen (or learned) to approximate the ALiBi linear term for small offsets (Angelotti, 2023). The associated memory overhead is $O(B \cdot H \cdot L \cdot 2)$ (batch size, heads, sequence length, and two extra features per token), which is linear in $L$ rather than $O(L^2)$. The same principle applies in Additive GRAPE, using minimal augmentation in homogeneous coordinates for $O(L)$ cost (Zhang et al., 8 Dec 2025).
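One way to see why two extra features per token suffice is the identity $\sinh(\mu(j-i)) = \sinh(\mu j)\cosh(\mu i) - \cosh(\mu j)\sinh(\mu i)$. The sketch below appends position-dependent features to a query and a key and checks that the plain dot product then contains the hyperbolic bias. The function names, the way $\tau$ is placed on the query side, and the omission of the usual $1/\sqrt{d}$ scaling are assumptions of this sketch, not details taken from the HyPE paper.

```python
import math


def query_features(i, mu, tau):
    """Two extra query features for position i (tau folded into the query side)."""
    return [-tau * math.cosh(mu * i), tau * math.sinh(mu * i)]


def key_features(j, mu):
    """Two extra key features for position j."""
    return [math.sinh(mu * j), math.cosh(mu * j)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def augmented_score(q, k, i, j, mu, tau):
    """Dot product of the augmented vectors; no L x L bias is ever formed."""
    q_hat = q + query_features(i, mu, tau)
    k_hat = k + key_features(j, mu)
    return dot(q_hat, k_hat)


# Arbitrary example values; tau * mu = 0.5 plays the role of a linear slope.
mu, tau = 0.01, 50.0
q, k = [0.3, -1.2], [0.7, 0.4]
i, j = 5, 9

score = augmented_score(q, k, i, j, mu, tau)
expected = dot(q, k) - tau * math.sinh(mu * (j - i))
linear = tau * mu * (j - i)  # small-offset linear approximation of the bias
print(score, expected, linear)
```

Note that the features themselves grow with the absolute positions $i$ and $j$ through $\cosh$ and $\sinh$, which is why keeping $\mu$ small matters in practice.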

3. Group-Theoretic Perspective and GRAPE Generalization

Additive GRAPE reformulates ALiBi and its bidirectional variants as special cases of unipotent group actions in $\mathrm{GL}(d+1)$ or $\mathrm{GL}(d+2)$ (Zhang et al., 8 Dec 2025). The key property is the existence of a one-parameter subgroup $\exp(n\,\omega \mathbf{A})$ with nilpotent generator $\mathbf{A}$, producing an additive logit offset that is:

Linear in the relative offset for each head/direction;
Streamable (i.e., supports online key caching with zero recomputation);
Relatively compositional: logit bias depends only on $j-i$ or $i-j$.

Bidirectional/“BiPE-ALiBi” style is realized by deploying two nilpotent generators, each handling one direction, and selecting the appropriate one based on $j-i>0$ or $j-i<0$.
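The relative law can be illustrated with a toy example. The sketch below assumes a $2 \times 2$ nilpotent generator instead of the $\mathrm{GL}(d+1)$ or $\mathrm{GL}(d+2)$ matrices of the paper, and all function names are hypothetical. Because $\mathbf{A}^2 = 0$, the exponential series stops after the linear term, and the product $\exp(-i\,\omega\mathbf{A})\exp(j\,\omega\mathbf{A})$ depends only on $j - i$.

```python
def matmul(X, Y):
    """Plain square-matrix product on nested lists."""
    n = len(X)
    return [[sum(X[r][t] * Y[t][c] for t in range(n)) for c in range(n)]
            for r in range(n)]


# Toy nilpotent generator: A @ A is the zero matrix.
A = [[0.0, 1.0], [0.0, 0.0]]


def unipotent(n, omega):
    """exp(n * omega * A) = I + n * omega * A, since A @ A = 0."""
    return [[1.0, n * omega], [0.0, 1.0]]


def relative_offset(i, j, omega):
    """Top-right entry of exp(-i*omega*A) @ exp(j*omega*A), equal to (j - i) * omega."""
    return matmul(unipotent(-i, omega), unipotent(j, omega))[0][1]


def two_generator_bias(i, j, omega_plus, omega_minus):
    """Pick one generator scale per direction, mirroring the bidirectional form."""
    if j > i:
        return -relative_offset(i, j, omega_plus)
    if j < i:
        return -relative_offset(j, i, omega_minus)
    return 0.0


print(relative_offset(2, 5, 0.5), relative_offset(10, 13, 0.5))
print(two_generator_bias(2, 5, 0.5, 0.25), two_generator_bias(5, 2, 0.5, 0.25))
```

Shifting both positions by the same amount leaves `relative_offset` unchanged, which is the property that makes cached keys reusable without recomputation.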

4. Practical Considerations: Hyperparameters, Learning, and Stability

Both ALiBi and its BiPE-style variants typically use fixed, precomputed slopes per head, for simplicity and to avoid training instability. However, the design space allows head-dependent or even content-gated slopes (as in the Additive GRAPE extension) and can, in principle, support learning these parameters end-to-end with appropriate regularization or clamping to ensure numerical stability (Angelotti, 2023; Zhang et al., 8 Dec 2025). HyPE further introduces amplitude and slope hyperparameters $(\tau, \mu)$, which may be fixed or learned and which control the exponential nonlinearity of the bias. Maintaining $|\mu (j-i)| < 1$ is recommended to prevent overflow.
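As a rough sketch of what fixed per-head slopes might look like in the bidirectional setting, the code below starts from the geometric sequence $2^{-8h/H}$ used by the original ALiBi for power-of-two head counts and derives a backward slope from each forward slope. The `backward_scale` and `max_slope` parameters are hypothetical knobs introduced here for illustration; neither cited paper prescribes them.

```python
def alibi_slopes(num_heads):
    """Geometric slopes 2^(-8h/H) for h = 1..H (power-of-two head counts)."""
    return [2.0 ** (-8.0 * h / num_heads) for h in range(1, num_heads + 1)]


def bidirectional_slopes(num_heads, backward_scale=0.5, max_slope=1.0):
    """Return one (m_plus, m_minus) pair per head.

    backward_scale (hypothetical) rescales the backward slope relative to the
    forward one; max_slope (hypothetical) clamps both for numerical stability.
    """
    pairs = []
    for m in alibi_slopes(num_heads):
        m_plus = min(m, max_slope)
        m_minus = min(m * backward_scale, max_slope)
        pairs.append((m_plus, m_minus))
    return pairs


for head, (m_plus, m_minus) in enumerate(bidirectional_slopes(8)):
    print(head, m_plus, m_minus)
```

If the slopes were learned instead, the same clamp could be applied after each update, which is one simple reading of the "clamping" mentioned above.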

5. Memory, Complexity, and Efficient Streaming

The primary advantage of purely additive, offset-dependent bias mechanisms is linear memory scaling with sequence length, versus the $O(L^2)$ cost of methods that explicitly store pairwise position matrices (e.g., standard relative position bias or bucketed schemes). All variants discussed here (ALiBi, BiPE-ALiBi, HyPE, Additive GRAPE) enable streaming key/value caches at inference time, as the required offset can be computed incrementally with $O(1)$ per-token overhead (Angelotti, 2023; Zhang et al., 8 Dec 2025). Exponentials (in HyPE) incur $O(BHL)$ additional compute, which is negligible compared to self-attention’s $O(BHL^2d)$ cost.
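The streaming behavior can be sketched for causal decoding, where the query at step $t$ only sees cached keys $j \le t$ and therefore only the backward slope $m_-$ is used. The function names and the input layout (one row of raw $q_t \cdot k_j$ scores per step) are assumptions of this sketch. Each step builds a bias row of length $t + 1$ on the fly, so no $L \times L$ matrix is stored.

```python
def causal_bias_row(t, m_minus):
    """Bias for the query at step t against cached keys 0..t (all j <= t)."""
    return [-m_minus * (t - j) for j in range(t + 1)]


def stream_biased_logits(raw_logit_rows, m_minus):
    """Yield biased logits one decoding step at a time.

    raw_logit_rows[t] is assumed to hold q_t . k_j for j = 0..t.
    """
    for t, row in enumerate(raw_logit_rows):
        bias = causal_bias_row(t, m_minus)
        yield [s + b for s, b in zip(row, bias)]


# Arbitrary raw scores for three decoding steps.
raw = [[1.0], [0.5, 2.0], [0.0, 1.0, 3.0]]
steps = list(stream_biased_logits(raw, m_minus=0.25))
for row in steps:
    print(row)
```

The cached keys are never modified between steps; only the offsets $t - j$ change, and they are recomputed from the step index alone.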

6. Applications and Experimental Directions

BiPE-ALiBi style biases are particularly suited to domains where asymmetric dependencies are relevant, such as bidirectional models or situations with known temporal/causal breaks. For example, in genomics or code modeling, distinct forward and backward attention slopes could be beneficial. Experimental work is recommended along axes including:

Evaluating perplexity or classification metrics on language or sequence datasets that require long context and asymmetric dependencies.
Testing generalization to sequence lengths longer than those seen in pretraining, leveraging the unbounded extrapolation of the additive (or hyperbolic, for HyPE) bias.
Ablation on fixed versus learnable, per-head, or direction-specific slopes (Angelotti, 2023).

Proposed milestones include benchmarking on standard language modeling corpora and tasks measuring local versus long-range dependency modeling capacity (Angelotti, 2023).

BiPE-ALiBi, as formalized in Additive GRAPE, stands in contrast to multiplicative (e.g., Rotary Position Embedding, RoPE) or path-integral (e.g., FoX) positional encoding methods. Its principal distinction is a purely additive, offset-only bias in logits, preserving both strict relativity and cacheability. Additive GRAPE subsumes ALiBi, BiPE-ALiBi, and FoX as special cases under a single group-theoretic umbrella, demonstrating that all such schemes derive from unipotent lifts in $\mathrm{GL}$ (Zhang et al., 8 Dec 2025). Extensions to multidimensional or multiaxis offsets (e.g., for images or graphs) are immediate by concatenating direction-specific features (Angelotti, 2023).

In summary, BiPE-ALiBi comprises the class of bidirectional, linear relative attention logit biases enabling efficient, scalable, and extensible positional encoding for long-context or asymmetric dependency modeling within the general family of Additive GRAPE positional mechanisms.

References

Zhang et al., Group Representational Position Encoding (8 Dec 2025)

Angelotti, HyPE: Attention with Hyperbolic Biases for Relative Positional Encoding (2023)
