
Hyperbolic Relative Bias (HyPE) in Transformers

Updated 31 January 2026
  • Hyperbolic Relative Bias (HyPE) is a relative positional encoding method that uses hyperbolic functions to dynamically inject biases into Transformer attention mechanisms.
  • It integrates auxiliary hyperbolic embeddings into queries and keys, ensuring memory efficiency and compatibility with advanced kernels like FlashAttention-2.
  • Its analytic formulation allows seamless long-context extrapolation, approximating linear biases such as ALiBi and generalizing beyond pretraining lengths.

Hyperbolic Relative Bias (HyPE) is a relative positional encoding scheme for Transformer-based architectures that leverages hyperbolic functions to encode token relationships, enabling efficient context-length extrapolation and seamless integration with advanced attention mechanisms. Unlike traditional methods for sequential order imposition—such as absolute positional embeddings or dense relative-bias masks—HyPE introduces a fully differentiable, memory-efficient formulation where the positional bias is indirectly injected into the attention computation using hyperbolic embeddings. HyPE is compatible with state-of-the-art fused attention kernels, supports gradient propagation for all learnable parameters, and analytically approximates linear biases such as the ALiBi method, facilitating extrapolation well beyond pretraining context lengths (Angelotti, 2023).

1. Mathematical Formulation of Hyperbolic Bias

HyPE constructs a relative-position bias matrix $B$ parameterized by the sequence length $L$ and the hidden dimension $d$ of each attention head. The matrix is defined by

B = [B_{i,j}]_{0 \leq i,j < L} \quad \text{where} \quad B_{i,j} = -\tau\,\sinh(\mu\,(j-i))

with $\mu \in \mathbb{R}$ as the slope and $\tau > 0$ as the amplitude. The hyperbolic identity

2\sinh(x) = e^{x} - e^{-x}

implies

B_{i,j} = -\frac{\tau}{2}\left(e^{\mu(j-i)} - e^{-\mu(j-i)}\right).

This bias function translates relative token distances $\Delta = j-i$ into soft positional biases, which can be tuned via the hyperparameters $\mu$ and $\tau$.
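As a sanity check, the definition and its exponential form can be sketched in a few lines of NumPy (a minimal illustration with hypothetical helper names; the point of HyPE is that this dense matrix need never be built in practice):

```python
import numpy as np

def hype_bias(L, mu, tau):
    """Dense HyPE bias B[i, j] = -tau * sinh(mu * (j - i)),
    materialized here only for illustration."""
    idx = np.arange(L)
    delta = idx[None, :] - idx[:, None]   # relative distance j - i
    return -tau * np.sinh(mu * delta)

L, mu, tau = 8, 0.1, 1.0
B = hype_bias(L, mu, tau)

# Equivalent exponential form via 2*sinh(x) = e^x - e^{-x}:
delta = np.arange(L)[None, :] - np.arange(L)[:, None]
B_exp = -(tau / 2) * (np.exp(mu * delta) - np.exp(-mu * delta))
assert np.allclose(B, B_exp)
```

Note that $B$ is antisymmetric ($B_{j,i} = -B_{i,j}$), so later tokens ($j > i$) receive a negative bias while earlier tokens receive a positive one.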

2. Integration into Transformer Attention

HyPE modifies the attention computation pipeline by introducing two hyperbolic embedding matrices per head,

\eta^Q \in \mathbb{R}^{L \times 2}, \quad \eta^K \in \mathbb{R}^{L \times 2}

with entries

\eta^Q_{i,j} = \frac{\tau\sqrt{d}}{2}\,\exp\bigl((-1)^{j+1}\mu i\bigr), \quad \eta^K_{i,j} = (-1)^{j+1}\exp\bigl((-1)^j \mu i\bigr)

for $j \in \{0,1\}$. These embeddings are concatenated as additional channels onto the usual query and key matrices,

\widehat Q = \mathrm{concat}(Q, \eta^Q) \in \mathbb{R}^{L \times (d+2)}, \quad \widehat K = \mathrm{concat}(K, \eta^K) \in \mathbb{R}^{L \times (d+2)}.

The attention logits are then

\widehat Q \widehat K^\top = Q K^\top + \sqrt{d}\, B

where the bias $B$ arises via the hyperbolic identity. The final output is computed as

O = \mathrm{softmax}\left( \frac{Q K^\top}{\sqrt{d}} + B \right) V = \mathrm{softmax}\left( \frac{\widehat Q \widehat K^\top}{\sqrt{d}} \right) V

and the explicit bias matrix $B$ is never materialized, ensuring efficient computation.
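The cancellation above is easy to verify numerically. A minimal NumPy sketch (random $Q$ and $K$; all variable names hypothetical) builds the side channels from the formulas for $\eta^Q$ and $\eta^K$ and checks that the concatenated product reproduces $QK^\top + \sqrt{d}\,B$:

```python
import numpy as np

rng = np.random.default_rng(0)
L, d = 16, 32
mu, tau = 0.05, 1.0
Q = rng.standard_normal((L, d))
K = rng.standard_normal((L, d))

i = np.arange(L)
signs = np.array([-1.0, 1.0])                  # (-1)^(j+1) for j = 0, 1
# eta^Q[i, j] = (tau * sqrt(d) / 2) * exp((-1)^(j+1) * mu * i)
eta_Q = (tau * np.sqrt(d) / 2) * np.exp(signs[None, :] * mu * i[:, None])
# eta^K[i, j] = (-1)^(j+1) * exp((-1)^j * mu * i)
eta_K = signs[None, :] * np.exp(-signs[None, :] * mu * i[:, None])

Q_hat = np.concatenate([Q, eta_Q], axis=1)     # shape (L, d+2)
K_hat = np.concatenate([K, eta_K], axis=1)     # shape (L, d+2)

delta = i[None, :] - i[:, None]
B = -tau * np.sinh(mu * delta)
# The side channels reproduce sqrt(d) * B without materializing it:
assert np.allclose(Q_hat @ K_hat.T, Q @ K.T + np.sqrt(d) * B)
```

Because the bias enters through two extra columns of $\widehat Q$ and $\widehat K$, any attention kernel that accepts query/key matrices of width $d+2$ computes the biased logits with no further changes.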

3. Computational Complexity and Memory Footprint

Standard relative-position bias approaches incur $O(L^2)$ memory cost for storing a dense bias mask. HyPE requires only two $L \times 2$ side-channel matrices per attention head ($4L$ scalars), or $O(Lh)$ overall for $h$ heads. The runtime involves a single multiplication of an $L \times (d+2)$ matrix with its $(d+2) \times L$ counterpart per head, plus the usual $(L \times L) \times (L \times d)$ product with $V$, with no additional $O(L^2)$ operations beyond those of standard attention. The overall time complexity remains $O(L^2 d)$ per head, and the memory footprint for the embeddings is $O(Ld)$, matching standard Transformer attention efficiency.
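For concreteness, at $L = 8192$ and $h = 32$ heads in float32 (illustrative numbers, not taken from the source), the dense bias mask is $L/4$ times larger than HyPE's side channels:

```python
# Positional-signal memory in float32 bytes (illustrative sizes):
L, h = 8192, 32
dense_bias = L * L * h * 4   # one dense L x L bias mask per head
hype_side = 4 * L * h * 4    # two L x 2 side-channel matrices per head
ratio = dense_bias // hype_side
print(ratio)                 # L / 4 = 2048
```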

4. Compatibility with FlashAttention-2 and Backpropagation

HyPE's bias injection via concatenation of auxiliary channels into $Q$ and $K$ renders it fully compatible with FlashAttention-2's fused, IO-aware attention kernels. The two added channels are handled as ordinary feature dimensions, and all gradients propagate through $\eta^Q$ and $\eta^K$ under standard automatic differentiation. Any learnable parameters, such as per-head slopes $\mu_h$ or the amplitude $\tau$, are updated by the established backward pass without special handling or code modifications at the kernel level.
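While the fused kernels themselves cannot be exercised in a few lines, the differentiability claim can be illustrated directly: the bias is an analytic function of $\mu$, so its gradient exists in closed form and matches a finite-difference estimate. A NumPy sketch (an autodiff framework such as PyTorch would compute this gradient automatically):

```python
import numpy as np

def bias(mu, delta, tau=1.0):
    # HyPE bias as a function of the (learnable) slope mu
    return -tau * np.sinh(mu * delta)

delta = np.arange(-7, 8, dtype=float)   # relative distances j - i
mu, eps = 0.05, 1e-6

# Central finite differences vs. the analytic derivative
# d/dmu [-tau * sinh(mu*delta)] = -tau * delta * cosh(mu*delta):
fd_grad = (bias(mu + eps, delta) - bias(mu - eps, delta)) / (2 * eps)
analytic = -delta * np.cosh(mu * delta)
assert np.allclose(fd_grad, analytic, atol=1e-6)
```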

5. Hyperparameter Selection and ALiBi Approximation

The ALiBi method employs a linear bias $B^{\text{ALiBi}}_{i,j} = -m(j-i)$. For HyPE, the Taylor expansion

\sinh(x) = x + O(x^3)

implies

-\tau\sinh(\mu(j-i)) = -\tau\mu(j-i) + O\bigl((j-i)^3\mu^3\bigr).

Setting $\tau = 1$ and $\mu = m \ll 1/L$ yields

B_{i,j}^{\text{HyPE}(\mu,\tau=1)} \approx -m(j-i) = B_{i,j}^{\text{ALiBi}(m)} + O\bigl((j-i)^3 m^3\bigr)

so the cubic remainder is negligible for $|j-i| \leq L_\text{extra}$ provided $\mu < m_\text{max} \approx 1/L_\text{extra}$, where $L_\text{extra}$ is the target context length for extrapolation. Promoting $\tau$ to a learnable parameter, or using per-head amplitudes $\tau_h$, enables fine control over the overall bias magnitude during training.
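The quality of the approximation can be checked numerically. In this sketch the slope $m = 1/(2L_\text{extra})$ is an arbitrary illustrative choice satisfying $m < 1/L_\text{extra}$:

```python
import numpy as np

L_extra = 128
m = 1.0 / (2 * L_extra)             # illustrative slope with m < 1/L_extra
delta = np.arange(L_extra + 1, dtype=float)

hype = -np.sinh(m * delta)          # HyPE bias with tau = 1, mu = m
alibi = -m * delta                  # ALiBi linear bias

err = np.max(np.abs(hype - alibi))
bound = (L_extra * m) ** 3          # scale of the cubic remainder
assert err < bound                  # worst-case gap ~ (m*delta)^3 / 6
```

The maximum deviation stays well inside the $O((j-i)^3 m^3)$ envelope over the whole range $|j-i| \leq L_\text{extra}$.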

6. Generalization Beyond Pretraining Lengths

HyPE supports input-length extrapolation due to its analytic design. With $\mu < 1/L_\text{extra}$, the hyperbolic bias grows almost linearly up to $L_\text{extra}$ before saturating, preventing divergence. As HyPE relies on no fixed-size sinusoidal tables or learned absolute embeddings, it generalizes without modification to arbitrary sequence lengths at inference. The bound on the approximation error to the linear bias, derived from the Taylor expansion, further supports its theoretical robustness. While large-scale empirical results are deferred, HyPE offers a compact, differentiable framework for positional bias injection, achieving extrapolation benefits analogous to ALiBi, trainability, and compatibility with advanced attention mechanisms at minimal memory and computational cost (Angelotti, 2023).
