
Hyperbolic Relative Bias (HyPE) in Transformers

Updated 31 January 2026
  • Hyperbolic Relative Bias (HyPE) is a relative positional encoding method that uses hyperbolic functions to dynamically inject biases into Transformer attention mechanisms.
  • It integrates auxiliary hyperbolic embeddings into queries and keys, ensuring memory efficiency and compatibility with advanced kernels like FlashAttention-2.
  • Its analytic formulation allows seamless long-context extrapolation, approximating linear biases such as ALiBi and generalizing beyond pretraining lengths.

Hyperbolic Relative Bias (HyPE) is a relative positional encoding scheme for Transformer-based architectures that leverages hyperbolic functions to encode token relationships, enabling efficient context-length extrapolation and seamless integration with advanced attention mechanisms. Unlike traditional methods for sequential order imposition—such as absolute positional embeddings or dense relative-bias masks—HyPE introduces a fully differentiable, memory-efficient formulation where the positional bias is indirectly injected into the attention computation using hyperbolic embeddings. HyPE is compatible with state-of-the-art fused attention kernels, supports gradient propagation for all learnable parameters, and analytically approximates linear biases such as the ALiBi method, facilitating extrapolation well beyond pretraining context lengths (Angelotti, 2023).

1. Mathematical Formulation of Hyperbolic Bias

HyPE constructs a relative-position bias matrix $B$ parameterized by the sequence length $L$ and the hidden dimension $d$ of each attention head. The matrix is defined by

B = [B_{i,j}]_{0 \leq i,j < L} \quad \text{where} \quad B_{i,j} = -\tau\,\sinh(\mu\,(j-i))

with $\mu \in \mathbb{R}$ as the slope and $\tau > 0$ as the amplitude. The hyperbolic identity

2\sinh(x) = e^{x} - e^{-x}

implies

B_{i,j} = -\frac{\tau}{2}\left(e^{\mu(j-i)} - e^{-\mu(j-i)}\right).

This bias function translates relative token distances $\Delta = j-i$ into soft positional biases, which can be tuned via the hyperparameters $\mu$ and $\tau$.
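As a sanity check, the definition and its exponential form can be sketched in a few lines of NumPy (a minimal illustration with hypothetical helper names; the point of HyPE is that this dense matrix need never be built in practice):

```python
import numpy as np

def hype_bias(L, mu, tau):
    """Dense HyPE bias B[i, j] = -tau * sinh(mu * (j - i)),
    materialized here only for illustration."""
    idx = np.arange(L)
    delta = idx[None, :] - idx[:, None]   # relative distance j - i
    return -tau * np.sinh(mu * delta)

L, mu, tau = 8, 0.1, 1.0
B = hype_bias(L, mu, tau)

# Equivalent exponential form via 2*sinh(x) = e^x - e^{-x}:
delta = np.arange(L)[None, :] - np.arange(L)[:, None]
B_exp = -(tau / 2) * (np.exp(mu * delta) - np.exp(-mu * delta))
assert np.allclose(B, B_exp)
```

Note that $B$ is antisymmetric ($B_{j,i} = -B_{i,j}$), so later tokens ($j > i$) receive a negative bias while earlier tokens receive a positive one.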

2. Integration into Transformer Attention

HyPE modifies the attention computation pipeline by introducing two hyperbolic embedding matrices per head,

\eta^Q \in \mathbb{R}^{L \times 2}, \quad \eta^K \in \mathbb{R}^{L \times 2}

with entries

\eta^Q_{i,j} = \frac{\tau\sqrt{d}}{2}\,\exp\bigl((-1)^{j+1}\mu i\bigr), \quad \eta^K_{i,j} = (-1)^{j+1}\exp\bigl((-1)^j \mu i\bigr)

for $j \in \{0,1\}$. These embeddings are concatenated as additional channels onto the usual query and key matrices,

\widehat Q = \mathrm{concat}(Q, \eta^Q) \in \mathbb{R}^{L \times (d+2)}, \quad \widehat K = \mathrm{concat}(K, \eta^K) \in \mathbb{R}^{L \times (d+2)}.

The attention logits are then

\widehat Q \widehat K^\top = Q K^\top + \sqrt{d}\, B

where the bias $B$ arises via the hyperbolic identity. The final output is computed as

O = \mathrm{softmax}\left( \frac{Q K^\top}{\sqrt{d}} + B \right) V = \mathrm{softmax}\left( \frac{\widehat Q \widehat K^\top}{\sqrt{d}} \right) V

and the explicit bias matrix $B$ is never materialized, ensuring efficient computation.
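The cancellation above is easy to verify numerically. A minimal NumPy sketch (random $Q$ and $K$; all variable names hypothetical) builds the side channels from the formulas for $\eta^Q$ and $\eta^K$ and checks that the concatenated product reproduces $QK^\top + \sqrt{d}\,B$:

```python
import numpy as np

rng = np.random.default_rng(0)
L, d = 16, 32
mu, tau = 0.05, 1.0
Q = rng.standard_normal((L, d))
K = rng.standard_normal((L, d))

i = np.arange(L)
signs = np.array([-1.0, 1.0])                  # (-1)^(j+1) for j = 0, 1
# eta^Q[i, j] = (tau * sqrt(d) / 2) * exp((-1)^(j+1) * mu * i)
eta_Q = (tau * np.sqrt(d) / 2) * np.exp(signs[None, :] * mu * i[:, None])
# eta^K[i, j] = (-1)^(j+1) * exp((-1)^j * mu * i)
eta_K = signs[None, :] * np.exp(-signs[None, :] * mu * i[:, None])

Q_hat = np.concatenate([Q, eta_Q], axis=1)     # shape (L, d+2)
K_hat = np.concatenate([K, eta_K], axis=1)     # shape (L, d+2)

delta = i[None, :] - i[:, None]
B = -tau * np.sinh(mu * delta)
# The side channels reproduce sqrt(d) * B without materializing it:
assert np.allclose(Q_hat @ K_hat.T, Q @ K.T + np.sqrt(d) * B)
```

Because the bias enters through two extra columns of $\widehat Q$ and $\widehat K$, any attention kernel that accepts query/key matrices of width $d+2$ computes the biased logits with no further changes.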

3. Computational Complexity and Memory Footprint

Standard relative-position bias approaches incur $O(L^2)$ memory cost for storing a dense bias mask. HyPE requires only two $L \times 2$ side-channel matrices per attention head ($4L$ scalars), or $O(Lh)$ overall for $h$ heads. The runtime involves a single multiplication of an $L \times (d+2)$ matrix with its $(d+2) \times L$ counterpart per head, plus the usual $(L \times L) \times (L \times d)$ product with $V$, with no additional $O(L^2)$ operations beyond those of standard attention. The overall time complexity remains $O(L^2 d)$ per head, and the memory footprint for the embeddings is $O(Ld)$, matching standard Transformer attention efficiency.
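For concreteness, at $L = 8192$ and $h = 32$ heads in float32 (illustrative numbers, not taken from the source), the dense bias mask is $L/4$ times larger than HyPE's side channels:

```python
# Positional-signal memory in float32 bytes (illustrative sizes):
L, h = 8192, 32
dense_bias = L * L * h * 4   # one dense L x L bias mask per head
hype_side = 4 * L * h * 4    # two L x 2 side-channel matrices per head
ratio = dense_bias // hype_side
print(ratio)                 # L / 4 = 2048
```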

4. Compatibility with FlashAttention-2 and Backpropagation

HyPE's bias injection via concatenation of auxiliary channels into $Q$ and $K$ renders it fully compatible with FlashAttention-2's fused, IO-aware attention kernels. The two added channels are handled as ordinary feature dimensions, and all gradients propagate through $\eta^Q$ and $\eta^K$ under standard automatic differentiation. Any learnable parameters, such as per-head slopes $\mu_h$ or the amplitude $\tau$, are updated by the established backward pass without special handling or code modifications at the kernel level.
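While the fused kernels themselves cannot be exercised in a few lines, the differentiability claim can be illustrated directly: the bias is an analytic function of $\mu$, so its gradient exists in closed form and matches a finite-difference estimate. A NumPy sketch (an autodiff framework such as PyTorch would compute this gradient automatically):

```python
import numpy as np

def bias(mu, delta, tau=1.0):
    # HyPE bias as a function of the (learnable) slope mu
    return -tau * np.sinh(mu * delta)

delta = np.arange(-7, 8, dtype=float)   # relative distances j - i
mu, eps = 0.05, 1e-6

# Central finite differences vs. the analytic derivative
# d/dmu [-tau * sinh(mu*delta)] = -tau * delta * cosh(mu*delta):
fd_grad = (bias(mu + eps, delta) - bias(mu - eps, delta)) / (2 * eps)
analytic = -delta * np.cosh(mu * delta)
assert np.allclose(fd_grad, analytic, atol=1e-6)
```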

5. Hyperparameter Selection and ALiBi Approximation

The ALiBi method employs a linear bias $B^{\text{ALiBi}}_{i,j} = -m(j-i)$. For HyPE, the Taylor expansion

\sinh(x) = x + O(x^3)

implies

-\tau\sinh(\mu(j-i)) = -\tau\mu(j-i) + O\bigl((j-i)^3\mu^3\bigr).

Setting $\tau = 1$ and $\mu = m \ll 1/L$ yields

B_{i,j}^{\text{HyPE}(\mu,\tau=1)} \approx -m(j-i) = B_{i,j}^{\text{ALiBi}(m)} + O\bigl((j-i)^3 m^3\bigr)

so the cubic remainder is negligible for $|j-i| \leq L_\text{extra}$ provided $\mu < m_\text{max} \approx 1/L_\text{extra}$, where $L_\text{extra}$ is the target context length for extrapolation. Promoting $\tau$ to a learnable parameter, or using per-head amplitudes $\tau_h$, enables fine control over the overall bias magnitude during training.
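The quality of the approximation can be checked numerically. In this sketch the slope $m = 1/(2L_\text{extra})$ is an arbitrary illustrative choice satisfying $m < 1/L_\text{extra}$:

```python
import numpy as np

L_extra = 128
m = 1.0 / (2 * L_extra)             # illustrative slope with m < 1/L_extra
delta = np.arange(L_extra + 1, dtype=float)

hype = -np.sinh(m * delta)          # HyPE bias with tau = 1, mu = m
alibi = -m * delta                  # ALiBi linear bias

err = np.max(np.abs(hype - alibi))
bound = (L_extra * m) ** 3          # scale of the cubic remainder
assert err < bound                  # worst-case gap ~ (m*delta)^3 / 6
```

The maximum deviation stays well inside the $O((j-i)^3 m^3)$ envelope over the whole range $|j-i| \leq L_\text{extra}$.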

6. Generalization Beyond Pretraining Lengths

HyPE supports input-length extrapolation due to its analytic design. With $\mu < 1/L_\text{extra}$, the hyperbolic bias grows almost linearly up to $L_\text{extra}$ before saturating, preventing divergence. As HyPE relies on no fixed-size sinusoidal tables or learned absolute embeddings, it generalizes without modification to arbitrary sequence lengths at inference. The bound on the approximation error to the linear bias, derived from the Taylor expansion, further supports its theoretical robustness. While large-scale empirical results are deferred, HyPE offers a compact, differentiable framework for positional bias injection, achieving extrapolation benefits analogous to ALiBi, trainability, and compatibility with advanced attention mechanisms at minimal memory and computational cost (Angelotti, 2023).
