Hyperbolic Relative Bias (HyPE) in Transformers
- Hyperbolic Relative Bias (HyPE) is a relative positional encoding method that uses hyperbolic functions to dynamically inject biases into Transformer attention mechanisms.
- It integrates auxiliary hyperbolic embeddings into queries and keys, ensuring memory efficiency and compatibility with advanced kernels like FlashAttention-2.
- Its analytic formulation allows seamless long-context extrapolation, approximating linear biases such as ALiBi and generalizing beyond pretraining lengths.
Hyperbolic Relative Bias (HyPE) is a relative positional encoding scheme for Transformer-based architectures that leverages hyperbolic functions to encode token relationships, enabling efficient context-length extrapolation and seamless integration with advanced attention mechanisms. Unlike traditional methods for imposing sequential order, such as absolute positional embeddings or dense relative-bias masks, HyPE introduces a fully differentiable, memory-efficient formulation in which the positional bias is injected indirectly into the attention computation through hyperbolic embeddings. HyPE is compatible with state-of-the-art fused attention kernels, supports gradient propagation for all learnable parameters, and analytically approximates linear biases such as the ALiBi method, facilitating extrapolation well beyond pretraining context lengths (Angelotti, 2023).
1. Mathematical Formulation of Hyperbolic Bias
HyPE constructs a relative-position bias matrix $B \in \mathbb{R}^{L \times L}$, parameterized by the sequence length $L$ and shared across the hidden dimension of each attention head. The matrix is defined by

$$B_{ij} = -a\,\sinh\bigl(\lambda\,(i - j)\bigr),$$

with $\lambda > 0$ as the slope and $a > 0$ as the amplitude. The hyperbolic identity

$$\sinh(x - y) = \sinh(x)\cosh(y) - \cosh(x)\sinh(y)$$

implies

$$B_{ij} = -a\bigl[\sinh(\lambda i)\cosh(\lambda j) - \cosh(\lambda i)\sinh(\lambda j)\bigr].$$

This bias function translates relative token distances into soft positional biases, which can be tuned via the $\lambda$ and $a$ hyperparameters.
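A minimal NumPy sketch (the names `a` and `lam` for the amplitude and slope, and the toy sizes, are illustrative assumptions) can confirm that the direct definition of the bias and its expansion via the hyperbolic identity agree:

```python
import numpy as np

L = 8                # sequence length (toy size)
a, lam = 0.5, 0.01   # amplitude and slope hyperparameters

i = np.arange(L)[:, None]   # row positions
j = np.arange(L)[None, :]   # column positions

# Direct definition: B_ij = -a * sinh(lam * (i - j))
B_direct = -a * np.sinh(lam * (i - j))

# Expanded via sinh(x - y) = sinh(x)cosh(y) - cosh(x)sinh(y)
B_expanded = -a * (np.sinh(lam * i) * np.cosh(lam * j)
                   - np.cosh(lam * i) * np.sinh(lam * j))

assert np.allclose(B_direct, B_expanded)
```

Note that for $i > j$ (a query attending to an earlier key) the bias is negative and grows in magnitude with distance, mirroring the distance penalty of ALiBi-style biases.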
2. Integration into Transformer Attention
HyPE modifies the attention computation pipeline by introducing two hyperbolic embedding matrices per head, $H^{(q)}, H^{(k)} \in \mathbb{R}^{L \times 2}$, with entries

$$H^{(q)}_i = \bigl(-a\sinh(\lambda i),\; a\cosh(\lambda i)\bigr), \qquad H^{(k)}_j = \bigl(\cosh(\lambda j),\; \sinh(\lambda j)\bigr)$$

for $i, j = 1, \dots, L$. These embeddings are concatenated as additional channels to the usual query and key matrices,

$$\tilde{Q} = [\,Q \,\|\, H^{(q)}\,], \qquad \tilde{K} = [\,K \,\|\, H^{(k)}\,] \in \mathbb{R}^{L \times (d+2)}.$$

The attention logits are then

$$S = \frac{\tilde{Q}\tilde{K}^{\top}}{\sqrt{d}} = \frac{QK^{\top}}{\sqrt{d}} + \frac{B}{\sqrt{d}},$$

where the bias $B_{ij} = -a\sinh(\lambda(i-j))$ arises via the hyperbolic identity (the $1/\sqrt{d}$ factor can be absorbed into the amplitude $a$). The final output is computed as

$$O = \mathrm{softmax}(S)\,V,$$

and the explicit $L \times L$ bias matrix is never materialized, ensuring efficient computation.
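The factorization can be checked numerically. The sketch below (plain NumPy with toy dimensions; the $1/\sqrt{d}$ scaling is omitted since it can be absorbed into the amplitude) concatenates the two hyperbolic channels onto random $Q$ and $K$ and verifies that a single matrix multiplication reproduces $QK^{\top} + B$ without ever materializing $B$:

```python
import numpy as np

rng = np.random.default_rng(0)
L, d = 8, 16         # toy sequence length and head dimension
a, lam = 0.5, 0.01   # amplitude and slope

Q = rng.normal(size=(L, d))
K = rng.normal(size=(L, d))

pos = np.arange(L)
# Two extra channels per position, chosen so their dot product equals B_ij
Hq = np.stack([-a * np.sinh(lam * pos), a * np.cosh(lam * pos)], axis=1)
Hk = np.stack([np.cosh(lam * pos), np.sinh(lam * pos)], axis=1)

Qt = np.concatenate([Q, Hq], axis=1)   # shape (L, d + 2)
Kt = np.concatenate([K, Hk], axis=1)

logits_fused = Qt @ Kt.T               # QK^T + B in one matmul

# Reference: materialize B explicitly and add it
i, j = pos[:, None], pos[None, :]
B = -a * np.sinh(lam * (i - j))
logits_explicit = Q @ K.T + B

assert np.allclose(logits_fused, logits_explicit)
```

Because the bias enters through ordinary feature channels, any attention kernel that accepts arbitrary head dimensions computes it for free.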
3. Computational Complexity and Memory Footprint
Standard relative-position bias approaches incur $O(L^2)$ memory cost for storing a dense bias mask. HyPE requires only two $L \times 2$ side-channel matrices per attention head ($4L$ scalars), or $O(HL)$ overall for $H$ heads. The runtime involves a single fused matrix multiplication per head at the usual $O(L^2 d)$ cost, with no additional $O(L^2)$ operations. The overall time complexity remains $O(L^2 d)$ per head, and the memory footprint for the embeddings is $O(L)$, matching standard Transformer attention efficiency.
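To make the footprint concrete, here is a back-of-the-envelope comparison for illustrative sizes (fp32, $L = 4096$, $H = 32$; these numbers are assumptions for the example, not figures from the source):

```python
L, H = 4096, 32       # context length and number of heads (illustrative)
bytes_fp32 = 4

# Dense per-head bias mask: H * L^2 scalars
dense_mask = H * L * L * bytes_fp32

# HyPE side channels: 4L scalars per head (two L x 2 matrices)
hype_channels = H * 4 * L * bytes_fp32

print(f"dense bias mask : {dense_mask / 2**20:8.1f} MiB")
print(f"HyPE channels   : {hype_channels / 2**10:8.1f} KiB")
```

At these sizes the dense mask costs 2 GiB while the side channels cost 2 MiB, a factor of $L/4$ (here 1024x) reduction.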
4. Compatibility with FlashAttention-2 and Backpropagation
HyPE's bias injection via concatenation of auxiliary channels into $\tilde{Q}$ and $\tilde{K}$ renders it fully compatible with FlashAttention-2's fused, IO-aware attention kernels. The two added channels are treated as ordinary feature dimensions, and all gradients propagate through $\tilde{Q}$ and $\tilde{K}$ under standard automatic differentiation. Any learnable parameters, such as the per-head slope $\lambda$ or amplitude $a$, are updated by the established backward pass without special handling or code modifications at the kernel level.
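Differentiability with respect to the slope can be illustrated without any autodiff framework. The sketch below (a toy scalar loss; all names and sizes are illustrative) computes the closed-form derivative of a loss through the bias with respect to $\lambda$ and checks it against a central finite difference, i.e., the same gradient an autodiff engine would produce through the hyperbolic channels:

```python
import numpy as np

L, a = 8, 0.5
i = np.arange(L)[:, None]
j = np.arange(L)[None, :]

def bias(lam):
    # B_ij = -a * sinh(lam * (i - j)), smooth in lam
    return -a * np.sinh(lam * (i - j))

def loss(lam):
    return np.sum(bias(lam) ** 2)   # toy scalar objective

def dloss_dlam(lam):
    # dB/dlam = -a * (i - j) * cosh(lam * (i - j)); chain rule through the loss
    dB = -a * (i - j) * np.cosh(lam * (i - j))
    return np.sum(2 * bias(lam) * dB)

lam, eps = 0.01, 1e-6
fd = (loss(lam + eps) - loss(lam - eps)) / (2 * eps)
assert np.isclose(dloss_dlam(lam), fd, rtol=1e-4)
```

In practice one would simply mark `lam` (and optionally `a`) as trainable tensors and let the framework's backward pass handle this.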
5. Hyperparameter Selection and ALiBi Approximation
The ALiBi method employs a linear bias $b_{ij} = -m\,(i - j)$ with per-head slope $m$. For HyPE, the Taylor expansion

$$\sinh(x) = x + \frac{x^3}{6} + O(x^5)$$

implies

$$B_{ij} = -a\lambda\,(i - j) - \frac{a\lambda^3}{6}(i - j)^3 + O\bigl(\lambda^5 (i - j)^5\bigr).$$

Setting $a = m/\lambda$ and taking $\lambda$ small yields

$$B_{ij} = -m\,(i - j) - \frac{m\lambda^2}{6}(i - j)^3 + O\bigl(\lambda^4 (i - j)^5\bigr),$$

so the cubic remainder is negligible for $|i - j| \le L_{\mathrm{ext}}$ if $\lambda \ll \sqrt{6}/L_{\mathrm{ext}}$, where $L_{\mathrm{ext}}$ is the context length targeted for extrapolation. Promoting $a$ to a learnable parameter, or using per-head slopes $\lambda_h$, enables fine control over the overall bias magnitude during training.
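The quality of the linear approximation is easy to verify numerically. In this sketch the slope $m$, target length, and tolerance are illustrative choices, with $\lambda$ set well below the $\sqrt{6}/L_{\mathrm{ext}}$ threshold:

```python
import numpy as np

L_ext = 512                       # target extrapolation length (illustrative)
m = 1 / 256                       # ALiBi-style per-head slope (illustrative)
lam = 0.05 * np.sqrt(6) / L_ext   # slope well below the sqrt(6)/L_ext threshold
a = m / lam                       # amplitude chosen so a * lam = m

x = np.arange(1, L_ext + 1)       # relative distances i - j
hype_bias = -a * np.sinh(lam * x)
alibi_bias = -m * x

# Relative deviation from the linear ALiBi bias stays small over the window
rel_err = np.max(np.abs(hype_bias - alibi_bias) / np.abs(alibi_bias))
assert rel_err < 3e-3
```

Consistent with the cubic remainder above, the worst-case relative deviation here is roughly $\lambda^2 L_{\mathrm{ext}}^2 / 6 \approx 2.5 \times 10^{-3}$.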
6. Generalization Beyond Pretraining Lengths
HyPE supports input-length extrapolation due to its analytic design. With $\lambda \lesssim 1/L_{\mathrm{ext}}$, the hyperbolic bias grows almost linearly up to relative distances of order $L_{\mathrm{ext}}$; beyond that, the increasingly negative bias saturates the softmax for distant tokens, preventing divergence of the attention weights. As HyPE relies on no fixed-size sinusoidal tables or learned absolute embeddings, it generalizes without modification to arbitrary sequence lengths at inference. The bound on the approximation error relative to the linear bias, derived from the Taylor expansion, further supports its theoretical robustness. While large-scale empirical results are deferred, HyPE offers a compact, differentiable framework for positional bias injection, achieving extrapolation benefits analogous to ALiBi, trainability, and compatibility with advanced attention mechanisms at minimal memory and computational cost (Angelotti, 2023).
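Because the channels are generated analytically from the position index, the same recipe produces a valid bias at any sequence length. The sketch below (illustrative parameters and a hypothetical helper name) builds the side channels at a short and a much longer length and confirms the implied bias matrix is well defined in both cases, with no lookup table to resize:

```python
import numpy as np

def hype_channels(L, a, lam):
    """Build the (L, 2) query/key side channels for any sequence length L."""
    pos = np.arange(L)
    Hq = np.stack([-a * np.sinh(lam * pos), a * np.cosh(lam * pos)], axis=1)
    Hk = np.stack([np.cosh(lam * pos), np.sinh(lam * pos)], axis=1)
    return Hq, Hk

a, lam = 0.5, 1e-3
# Same analytic recipe at a "pretraining" length and at 4x that length:
for L in (512, 2048):
    Hq, Hk = hype_channels(L, a, lam)
    B = Hq @ Hk.T   # implied bias, fully defined at any L
    i, j = np.arange(L)[:, None], np.arange(L)[None, :]
    assert B.shape == (L, L)
    assert np.allclose(B, -a * np.sinh(lam * (i - j)))
```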