Hyperbolic Rotary Positional Encoding (HoPE)

Updated 20 May 2026

HoPE is a hyperbolic extension of rotary positional encoding that applies Lorentz boosts to induce monotonic exponential decay in attention weights.
It replaces periodic Euclidean rotations with hyperbolic functions, resolving oscillatory artifacts and improving long-range extrapolation in transformer models.
The method maintains efficient complexity similar to RoPE while enhancing performance on extended-sequence language tasks and graph-structured representations.

Hyperbolic Rotary Positional Encoding (HoPE) is a geometric generalization of rotary positional encodings designed for stable and efficient modeling of long-range dependencies in transformers. HoPE replaces the periodic Euclidean rotations of standard rotary positional encoding (RoPE) with Lorentz boosts parameterized by hyperbolic functions, inducing monotonic exponential decay of attention weights as token distances increase. This formulation resolves the oscillatory and resonance effects pervasive in traditional RoPE, achieving improved extrapolation and performance in extended-sequence transformers for language and representation learning tasks (Dai et al., 5 Sep 2025, Xu et al., 20 Sep 2025).

1. Theoretical Foundations

HoPE arises by reinterpreting the geometric mechanism underlying positional encodings in self-attention. In RoPE, token vectors are decomposed into 2D subspaces, each rotated by an angle proportional to token position using block-diagonal matrices $\rho_E(p\,g_i)$ with frequencies $g_i$. The resulting dot-product attention kernel for tokens at positions $m$ and $n$ is: $\langle q_m, k_n \rangle_{\text{RoPE}} = \sum_{i=1}^{d/2} (q_m^{(i)})^\top \rho_E(g_i)^{n-m} k_n^{(i)}$, where the rotation is periodic in $n-m$, so attention weights oscillate at large token distances because $\cos$ and $\sin$ are periodic.
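The relative-position property and the periodicity of the RoPE kernel can be checked numerically on a single 2D block. The following is a minimal sketch; the frequency value `g = 0.5` and the helper names `rope_block` and `rope_score` are illustrative assumptions, not taken from the cited papers.

```python
import numpy as np

def rope_block(p, g):
    """2x2 Euclidean rotation by angle p * g (one RoPE block)."""
    a = p * g
    return np.array([[np.cos(a), -np.sin(a)],
                     [np.sin(a),  np.cos(a)]])

def rope_score(q, k, m, n, g):
    """Dot product of a query rotated to position m and a key rotated to position n."""
    return (rope_block(m, g) @ q) @ (rope_block(n, g) @ k)

g = 0.5                      # hypothetical frequency for this block
q = np.array([1.0, 0.0])
k = np.array([1.0, 0.0])

# The score depends only on the offset n - m (here 7 in both cases).
s1 = rope_score(q, k, 3, 10, g)
s2 = rope_score(q, k, 20, 27, g)

# With q = k = (1, 0) the score equals cos((n - m) * g), which is periodic.
scores = np.array([rope_score(q, k, 0, n, g) for n in range(40)])
```

Because `scores` follows a cosine, it rises again after its first minimum, which is the oscillation the text attributes to RoPE at large distances.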

HoPE replaces these planar (Euclidean) rotations with Lorentz transformations in hyperbolic space. The canonical 2D Lorentz boost (the hyperbolic analog of a rotation) is given by: $B_i(\Delta p) = \begin{pmatrix} \cosh(\Delta p\,\theta_i) & \sinh(\Delta p\,\theta_i) \\ \sinh(\Delta p\,\theta_i) & \cosh(\Delta p\,\theta_i) \end{pmatrix}$, where $\theta_i$ parameterizes per-block curvature and $\Delta p$ is the positional offset. The full HoPE operator over $d$ dimensions is: $B(\Delta p) = e^{-\gamma \Delta p} \bigoplus_{i=1}^{d/2} B_i(\Delta p)$, where the gamma factor $e^{-\gamma \Delta p}$ provides additional exponential damping.
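The group structure of the 2D Lorentz boost can be verified directly. This sketch builds the $\cosh$/$\sinh$ matrix given above and checks three standard properties: boosts compose additively in the offset, they have unit determinant, and they preserve the Minkowski form $\mathrm{diag}(1, -1)$. The function name `lorentz_boost` and the value `theta = 0.1` are assumptions chosen for illustration.

```python
import numpy as np

def lorentz_boost(dp, theta):
    """2x2 Lorentz boost with rapidity dp * theta."""
    x = dp * theta
    return np.array([[np.cosh(x), np.sinh(x)],
                     [np.sinh(x), np.cosh(x)]])

theta = 0.1                                   # hypothetical per-block curvature
B3 = lorentz_boost(3, theta)
B4 = lorentz_boost(4, theta)
B7 = lorentz_boost(7, theta)

composed = B3 @ B4                            # should equal B7
det = np.linalg.det(B7)                       # should equal 1
round_trip = lorentz_boost(-7, theta) @ B7    # should equal the identity

eta = np.diag([1.0, -1.0])                    # Minkowski metric in 1+1 dimensions
preserved = B7.T @ eta @ B7                   # should equal eta
```

Unlike a rotation, the boost is symmetric rather than orthogonal, so its inverse is obtained by negating the offset rather than by transposing.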

Crucially, as the curvature vanishes ($\theta_i \to 0$), the hyperbolic formulation reverts to the Euclidean (RoPE) case, as shown by the limiting behavior of hyperbolic trigonometric functions: $\cosh(x) \approx 1$ and $\sinh(x) \approx x$ for $|x| \ll 1$. Thus, RoPE is a special case of HoPE with zero curvature (Dai et al., 5 Sep 2025).

2. Mathematical Properties

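The defining property of HoPE is the strict monotonicity of the attention decay. For a single 2D subspace, the unnormalized dot-product between the query at $m$ and key at $n$ is: $\langle \tilde q_m^{(i)}, \tilde k_n^{(i)} \rangle = \left(e^{-\gamma m} B_i(m)\, q^{(i)}\right)^\top \left(e^{\gamma n} B_i(-n)\, k^{(i)}\right)$. Combining transformations yields: $\langle \tilde q_m^{(i)}, \tilde k_n^{(i)} \rangle = e^{-\gamma (m-n)}\, (q^{(i)})^\top B_i(m-n)\, k^{(i)}$, resulting in exponential decay in $m-n$ whenever $\gamma > \theta_i$. This structure eliminates the oscillatory “resonances” present in RoPE, in which attention periodically recurs for increasing token separations.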
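The contrast between monotone decay and oscillation can be illustrated numerically. The sketch below assumes, for illustration only, that the single-block HoPE kernel for aligned unit vectors has the form $e^{-\gamma \Delta}\cosh(\theta \Delta)$ with damping $\gamma$ larger than curvature $\theta$, and compares it with the RoPE kernel $\cos(\theta \Delta)$. The values of `theta` and `gamma` are hypothetical.

```python
import numpy as np

theta, gamma = 0.05, 0.08          # hypothetical values with gamma > theta
delta = np.arange(0, 200)          # token offsets

# Assumed single-block kernels for q = k = (1, 0).
hope_kernel = np.exp(-gamma * delta) * np.cosh(theta * delta)
rope_kernel = np.cos(theta * delta)

# A kernel is strictly decreasing if every successive difference is negative.
hope_monotone = bool(np.all(np.diff(hope_kernel) < 0))
rope_monotone = bool(np.all(np.diff(rope_kernel) < 0))
```

Under these assumptions the damped hyperbolic kernel behaves like $\tfrac{1}{2} e^{-(\gamma - \theta)\Delta}$ at large offsets, so the gap $\gamma - \theta$ sets the decay rate.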

Additionally, the block-diagonal construction in each embedding plane allows HoPE to retain the efficient complexity scaling of standard rotary encodings without additional dependence on sequence length. The operator is implemented as a direct sum over $d/2$ two-dimensional Lorentz blocks, each with learnable or fixed $\theta_i$, with a global damping $e^{-\gamma \Delta p}$. For the multi-dimensional case, the HoPE operator is: $B(\Delta p) = e^{-\gamma \Delta p} \bigoplus_{i=1}^{d/2} B_i(\Delta p)$
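The efficiency claim rests on never materializing the block-diagonal matrix. As a rough sketch, the direct sum of 2x2 boosts can be applied with elementwise operations on coordinate pairs, at cost linear in the embedding dimension, and compared against an explicit dense matrix. The helper names `apply_boosts` and `dense_operator` and the `theta` values are illustrative assumptions; damping is omitted here to isolate the block structure.

```python
import numpy as np

def apply_boosts(x, p, theta):
    """Apply the direct sum of 2x2 boosts B_i(p) to x of shape (d,) in O(d)."""
    pairs = x.reshape(-1, 2)                         # (d/2, 2) coordinate pairs
    c, s = np.cosh(p * theta), np.sinh(p * theta)    # one value per block
    out = np.stack([c * pairs[:, 0] + s * pairs[:, 1],
                    s * pairs[:, 0] + c * pairs[:, 1]], axis=-1)
    return out.reshape(-1)

def dense_operator(p, theta):
    """Explicit (d, d) block-diagonal matrix, built only for comparison."""
    d = 2 * len(theta)
    M = np.zeros((d, d))
    for i, t in enumerate(theta):
        ch, sh = np.cosh(p * t), np.sinh(p * t)
        M[2 * i:2 * i + 2, 2 * i:2 * i + 2] = [[ch, sh], [sh, ch]]
    return M

rng = np.random.default_rng(0)
theta = np.array([0.2, 0.1, 0.05, 0.025])   # hypothetical per-block curvatures
x = rng.normal(size=8)

fast = apply_boosts(x, 5, theta)
dense = dense_operator(5, theta) @ x
```

The elementwise version uses the same memory layout as common RoPE implementations, so it can replace the rotation step without changing tensor shapes.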

3. Implementation Details

HoPE requires minimal changes to transformer workflows. Input token representations are split into 2D chunks; for each chunk, the Lorentz boost with parameter $\theta_i$ is computed at a position-dependent argument. Query and key chunks are transformed using $\cosh$/$\sinh$-parametrized matrices, with exponential damping applied elementwise. Pseudocode (see Dai et al., 5 Sep 2025) illustrates the algorithmic loop over the $d/2$ subspaces:

For queries (at position $m$): apply $B_i(m)$ with $e^{-\gamma m}$ multiplicative damping.
For keys (at position $n$): apply $B_i(-n)$ with $e^{+\gamma n}$ (i.e., reciprocal) damping.
The final dot-product incorporates both relative Lorentz boosts and overall distance-dependent decay.
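The query and key steps above can be sketched end to end. This example assumes that queries at position $m$ receive the boost $B_i(m)$ scaled by $e^{-\gamma m}$ and that keys at position $n$ receive the inverse boost $B_i(-n)$ scaled by $e^{+\gamma n}$; these choices, the function names, and the parameter values are assumptions for illustration rather than the reference implementation. Under them, the score depends only on the offset $m - n$.

```python
import numpy as np

def boost_pairs(x, p, theta):
    """Apply 2x2 boosts B_i(p) to consecutive coordinate pairs of x."""
    pairs = x.reshape(-1, 2)
    c, s = np.cosh(p * theta), np.sinh(p * theta)
    out = np.stack([c * pairs[:, 0] + s * pairs[:, 1],
                    s * pairs[:, 0] + c * pairs[:, 1]], axis=-1)
    return out.reshape(-1)

def hope_query(q, m, theta, gamma):
    """Assumed query transform: boost by +m with exp(-gamma * m) damping."""
    return np.exp(-gamma * m) * boost_pairs(q, m, theta)

def hope_key(k, n, theta, gamma):
    """Assumed key transform: inverse boost (-n) with reciprocal damping."""
    return np.exp(gamma * n) * boost_pairs(k, -n, theta)

def hope_score(q, k, m, n, theta, gamma):
    return hope_query(q, m, theta, gamma) @ hope_key(k, n, theta, gamma)

rng = np.random.default_rng(0)
theta = np.array([0.2, 0.1, 0.05, 0.025])   # hypothetical curvatures
gamma = 0.3                                  # hypothetical damping
q, k = rng.normal(size=8), rng.normal(size=8)

score_a = hope_score(q, k, 10, 4, theta, gamma)    # offset 6
score_b = hope_score(q, k, 26, 20, theta, gamma)   # offset 6, shifted positions

# Closed form: exp(-gamma * (m - n)) * q^T B(m - n) k
reference = np.exp(-gamma * 6) * (q @ boost_pairs(k, 6, theta))
```

One consequence of this design is that the key-side factor $e^{+\gamma n}$ grows with absolute position, so a practical implementation would need to guard against overflow at long contexts.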

Numerical stability for large $|p\,\theta_i|$ is addressed using numerically stable library routines for hyperbolic functions or clamping input ranges.
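One way to see the stability issue is to compare a naive damped hyperbolic cosine against a version that folds the damping into the exponent. This is a swapped-in technique for illustration, not the routine described in the paper: it uses the identity $e^{-\gamma p}\cosh(\theta p) = \tfrac{1}{2}\left(e^{(\theta-\gamma)p} + e^{-(\theta+\gamma)p}\right)$, so no intermediate value overflows when $\gamma > \theta$. All names and parameter values are hypothetical.

```python
import numpy as np

def damped_cosh_naive(p, theta, gamma):
    """exp(-gamma p) * cosh(theta p); yields 0 * inf = nan once cosh overflows."""
    with np.errstate(over="ignore", invalid="ignore"):
        return np.exp(-gamma * p) * np.cosh(theta * p)

def damped_cosh_stable(p, theta, gamma):
    """Same quantity with the damping folded into each exponent."""
    return 0.5 * (np.exp((theta - gamma) * p) + np.exp(-(theta + gamma) * p))

def damped_sinh_stable(p, theta, gamma):
    """Companion off-diagonal entry of the damped boost."""
    return 0.5 * (np.exp((theta - gamma) * p) - np.exp(-(theta + gamma) * p))

theta, gamma = 0.01, 0.0101     # hypothetical values with gamma > theta
small, large = 50.0, 1.0e5      # theta * large = 1000 overflows float64 cosh

naive_small = damped_cosh_naive(small, theta, gamma)
stable_small = damped_cosh_stable(small, theta, gamma)
naive_large = damped_cosh_naive(large, theta, gamma)
stable_large = damped_cosh_stable(large, theta, gamma)
```

At the small position both versions agree; at the large position the naive product is not a number while the folded form stays finite and positive.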

HoPE introduces only two extra parameter vectors of length $d/2$: $\theta$ (curvatures) and $\gamma$ (damping), and does not require additional state per sequence or token (Dai et al., 5 Sep 2025).

4. Empirical Evaluation

HoPE has been evaluated extensively on standard long-context benchmarks for language modeling and downstream tasks (Dai et al., 5 Sep 2025).

Zero-Shot Perplexity

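On PG19 and arXiv, HoPE attains lower zero-shot perplexity than RoPE and ALiBi once the evaluation length exceeds the shortest setting. In the table below, HoPE is slightly worse at 1024 tokens but has the lowest perplexity at every length from 2048 to 6144 (for example, 110.02 at 6144 versus 144.13 for RoPE and 132.80 for ALiBi), indicating improved robustness and generalization to context sizes not seen in training.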

Method	1024	2048	3072	4096	5120	6144
RoPE	12.82	25.80	56.28	88.59	116.63	144.13
HoPE	13.35	16.46	35.07	60.03	85.94	110.02
ALiBi	11.95	25.11	52.54	79.04	107.59	132.80
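The size of the gap is easier to read as a relative reduction. The short script below copies the perplexity values from the table above and computes, for each sequence length, how much lower HoPE's perplexity is than each baseline; the function name `relative_reduction` is an illustrative choice.

```python
lengths = [1024, 2048, 3072, 4096, 5120, 6144]
ppl = {
    "RoPE":  [12.82, 25.80, 56.28, 88.59, 116.63, 144.13],
    "HoPE":  [13.35, 16.46, 35.07, 60.03, 85.94, 110.02],
    "ALiBi": [11.95, 25.11, 52.54, 79.04, 107.59, 132.80],
}

def relative_reduction(baseline, method="HoPE"):
    """Fractional perplexity reduction of `method` relative to `baseline`."""
    return [1.0 - h / b for h, b in zip(ppl[method], ppl[baseline])]

vs_rope = relative_reduction("RoPE")
vs_alibi = relative_reduction("ALiBi")

for n, r, a in zip(lengths, vs_rope, vs_alibi):
    # Positive values mean HoPE has lower perplexity than the baseline.
    print(f"{n:5d}  vs RoPE: {r:+.1%}  vs ALiBi: {a:+.1%}")
```

The reduction is negative at 1024 tokens, where HoPE trails both baselines, and positive at every longer length.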

Downstream Tasks

On SCROLLS, covering QA, NLI, and summarization (sequence length up to 8192), HoPE outperforms or matches sinusoidal, RoPE, and ALiBi encodings across all major metrics. Ablation studies confirm that the damping factor $\gamma$ is influential in controlling the rate of attention decay; over-damping may harm capacity, while under-damping reduces long-range selectivity.

5. Geometric and Causal Extensions

The hyperbolic formalism underlying HoPE extends naturally to other forms of representation, including causality-informed and non-sequential feature sets. In CAPE ("Causality-Induced Positional Encoding") (Xu et al., 20 Sep 2025), feature graphs inferred by structural equation modeling are embedded as points in hyperbolic space (using the hyperboloid/Lorentz model), with rotary angles derived from hyperbolic distances. This associates strength of causal relationships (inverse hyperbolic separation) and specificity (radial position) to attention weighting, further leveraging the monotonicity and flexibility of hyperbolic encodings.
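For readers unfamiliar with the hyperboloid (Lorentz) model, the sketch below shows the standard way to place points on the unit hyperboloid and measure hyperbolic distance between them via the Lorentzian inner product. It does not reproduce how CAPE converts distances into rotary angles, which is not specified here; the function names are illustrative assumptions.

```python
import numpy as np

def lorentz_inner(x, y):
    """Lorentzian inner product: -x0*y0 + sum of the remaining products."""
    return -x[0] * y[0] + np.dot(x[1:], y[1:])

def lift_to_hyperboloid(v):
    """Map v in R^n to (sqrt(1 + |v|^2), v), a point with <x, x>_L = -1."""
    v = np.asarray(v, dtype=float)
    return np.concatenate([[np.sqrt(1.0 + v @ v)], v])

def hyperbolic_distance(x, y):
    """Geodesic distance on the unit hyperboloid: arccosh(-<x, y>_L)."""
    # Clip guards against arguments slightly below 1 from rounding.
    return np.arccosh(np.clip(-lorentz_inner(x, y), 1.0, None))

origin = lift_to_hyperboloid([0.0, 0.0])
a = lift_to_hyperboloid([np.sinh(1.0), 0.0])   # one unit from the origin
b = lift_to_hyperboloid([0.3, -1.2])

d_origin_a = hyperbolic_distance(origin, a)
d_ab = hyperbolic_distance(a, b)
d_ba = hyperbolic_distance(b, a)
```

Distance from the origin plays the role of the radial position mentioned above, while pairwise distance corresponds to hyperbolic separation.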

In such settings, the self-attention computation keeps the rotary form, but the rotary matrices are derived from hyperbolic embeddings of the underlying causal DAG rather than from sequence positions. This generalizes HoPE beyond sequential data to arbitrary relational or graph-structured features.

6. Strengths, Limitations, and Future Directions

HoPE's strengths include:

Monotonic, bias-free exponential decay of attention weights as a function of token/feature separation.
Theoretical grounding in the Lorentz group, unifying and extending prior Euclidean methods.
Efficient implementation matching RoPE’s computational cost.
Robustness against resonance and oscillatory failure modes.

Limitations:

Choice of curvature and damping parameters ($\theta_i$, $\gamma$) must be tuned for each application; improper calibration can under- or over-damp long-range attention.
Empirical validation to date is limited to textual modeling, with cross-modal and multimodal generalization remaining to be systematically verified (Dai et al., 5 Sep 2025).

Potential extensions include learnable curvature per transformer head, incorporation into encoder-decoder and retrieval-augmented architectures, integration with alternative hyperbolic models (Poincaré disk or upper half-plane), and hybridization with sparse attention patterns for extreme sequence lengths.

HoPE generalizes the following prior methodologies:

Rotary Positional Encoding (RoPE), recoverable as the curvature $\theta_i \to 0$ limit of the Lorentz-boost formalism.
Linear or absolute encodings (e.g., sinusoidal), improved by HoPE's geometric damping and stable extrapolation.
Causality-aware hyperbolic embeddings, as in CAPE, which systematically encode graph relationships and feature hierarchies in hyperbolic space and inject them via rotary attention mechanisms into transformers (Xu et al., 20 Sep 2025).

Both theoretical and empirical analyses demonstrate that HoPE achieves a smoother, geometry-consistent decay of attention, supporting improved length extrapolation and modeling capacity for long-sequence and graph-structured data domains.

References

Dai et al. (5 Sep 2025). HoPE: Hyperbolic Rotary Positional Encoding for Stable Long-Range Dependency Modeling in Large Language Models.

Xu et al. (20 Sep 2025). Causality-Induced Positional Encoding for Transformer-Based Representation Learning of Non-Sequential Features.
