Hyperbolic Rotary Positional Encoding (HoPE)

Updated 20 May 2026
  • HoPE is a hyperbolic extension of rotary positional encoding that applies Lorentz boosts to induce monotonic exponential decay in attention weights.
  • It replaces periodic Euclidean rotations with hyperbolic functions, resolving oscillatory artifacts and improving long-range extrapolation in transformer models.
  • The method maintains efficient complexity similar to RoPE while enhancing performance on extended-sequence language tasks and graph-structured representations.

Hyperbolic Rotary Positional Encoding (HoPE) is a geometric generalization of rotary positional encodings designed for stable and efficient modeling of long-range dependencies in transformers. HoPE replaces the periodic Euclidean rotations of standard rotary positional encoding (RoPE) with Lorentz boosts parameterized by hyperbolic functions, inducing monotonic exponential decay of attention weights as token distances increase. This formulation resolves the oscillatory and resonance effects pervasive in traditional RoPE, achieving improved extrapolation and performance in extended-sequence transformers for language and representation learning tasks (Dai et al., 5 Sep 2025, Xu et al., 20 Sep 2025).

1. Theoretical Foundations

HoPE arises by reinterpreting the geometric mechanism underlying positional encodings in self-attention. In RoPE, token vectors are decomposed into 2D subspaces, each rotated by an angle proportional to token position using block-diagonal matrices $\rho_E(p\,g_i)$ with frequencies $g_i$. The resulting dot-product attention kernel for tokens at positions $m$ and $n$ is expressed as:

$$\langle q_m, k_n \rangle_{\text{RoPE}} = \sum_{i=1}^{d/2} (q_m^{(i)})^\top \rho_E(g_i)^{n-m} k_n^{(i)}$$

where the rotation is periodic in $n-m$, causing oscillatory attention weights for large token distances due to the trigonometric nature of the $\cos$ and $\sin$ functions.
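To make the periodicity concrete, the following minimal sketch evaluates the RoPE score in a single 2D subspace. The helper names (`rotate`, `rope_score`), the sample vectors, and the frequency are illustrative assumptions, not values from the cited papers.

```python
import math

def rotate(v, angle):
    """Rotate a 2D vector counter-clockwise by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return (c * v[0] - s * v[1], s * v[0] + c * v[1])

def rope_score(q, k, m, n, g):
    """Dot product of a query rotated to position m and a key rotated to position n."""
    qm = rotate(q, m * g)
    kn = rotate(k, n * g)
    return qm[0] * kn[0] + qm[1] * kn[1]

q, k = (1.0, 0.5), (0.8, 0.3)   # illustrative vectors (assumption)
g = 2 * math.pi / 8             # illustrative frequency: period of 8 positions

# The score depends only on the offset n - m ...
same_offset = (rope_score(q, k, 3, 10, g), rope_score(q, k, 0, 7, g))

# ... and it repeats whenever the offset grows by a full period.
scores = [rope_score(q, k, 0, delta, g) for delta in range(17)]
```

With this frequency, `scores[0]`, `scores[8]`, and `scores[16]` coincide, and the score changes sign in between, which is the oscillation described above.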

HoPE replaces these planar (Euclidean) rotations with Lorentz transformations in hyperbolic space. The canonical 2D Lorentz boost (the hyperbolic analog of a rotation) is given by:

$$B_i(\Delta p) = \begin{pmatrix} \cosh(\Delta p\,\theta_i) & \sinh(\Delta p\,\theta_i) \\ \sinh(\Delta p\,\theta_i) & \cosh(\Delta p\,\theta_i) \end{pmatrix}$$

where $\theta_i$ parameterizes per-block curvature and $\Delta p$ is the positional offset. The full HoPE operator over $d$ dimensions combines the $d/2$ blocks $B_i(\Delta p)$ in block-diagonal form, and a damping factor governed by $\gamma$ provides additional exponential damping.
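Two standard properties of 2D Lorentz boosts are worth checking numerically: boosts compose additively in their argument, and they preserve the Minkowski form $u_0 v_0 - u_1 v_1$ rather than the Euclidean dot product. The sketch below verifies both; the function names and test values are illustrative assumptions.

```python
import math

def boost(phi):
    """2x2 Lorentz boost with rapidity phi, as nested tuples."""
    c, s = math.cosh(phi), math.sinh(phi)
    return ((c, s), (s, c))

def matmul(a, b):
    """Product of two 2x2 matrices."""
    return tuple(
        tuple(sum(a[i][t] * b[t][j] for t in range(2)) for j in range(2))
        for i in range(2)
    )

def apply(a, v):
    """Apply a 2x2 matrix to a 2D vector."""
    return (a[0][0] * v[0] + a[0][1] * v[1], a[1][0] * v[0] + a[1][1] * v[1])

def minkowski(u, v):
    """Minkowski bilinear form with signature (+, -)."""
    return u[0] * v[0] - u[1] * v[1]

# Composition: B(0.3) B(0.9) should equal B(1.2).
composed = matmul(boost(0.3), boost(0.9))
direct = boost(1.2)

# Invariance: boosting both vectors leaves the Minkowski form unchanged.
u, v = (1.0, 0.5), (0.8, 0.3)
before = minkowski(u, v)
after = minkowski(apply(boost(0.7), u), apply(boost(0.7), v))
```

The additive composition rule is what allows a pair of position-dependent boosts to collapse into a single boost that depends only on the relative offset.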

Crucially, as the curvature vanishes ($\theta_i \to 0$), the hyperbolic formulation reverts to the Euclidean (RoPE) case, as shown by the limiting behavior of the hyperbolic functions. Thus, RoPE is a special case of HoPE with zero curvature (Dai et al., 5 Sep 2025).
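For reference, the following are standard identities for the hyperbolic functions. They are general facts and not necessarily the exact expressions used by Dai et al.: the first line gives the small-argument expansions, and the second gives the analytic continuation that links hyperbolic and circular functions.

```latex
\cosh x = 1 + \tfrac{x^2}{2} + O(x^4), \qquad \sinh x = x + O(x^3) \qquad (x \to 0)

\cosh(ix) = \cos x, \qquad \sinh(ix) = i \sin x
```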

2. Mathematical Properties

The defining property of HoPE is the strict monotonicity of the attention decay. For a single 2D subspace, the unnormalized dot-product between the query at position $m$ and the key at position $n$ combines the two position-dependent transformations into a single term that depends only on the relative offset, resulting in exponential decay in $|m-n|$ when the damping and curvature parameters are chosen appropriately. This structure eliminates the oscillatory “resonances” present in RoPE, in which attention periodically recurs for increasing token separations.
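The contrast between monotonic decay and periodic recurrence can be illustrated with a single-block sketch. The kernel form used here, $e^{-\gamma|\Delta|}\, q^\top B(\Delta\theta)\, k$, is an assumption made for illustration and is not copied from the source; the function names, vectors, and parameter values are likewise hypothetical. In this sketch the damping rate must exceed the boost rate ($\gamma > \theta$) for the score to decay, because $\cosh$ and $\sinh$ themselves grow exponentially.

```python
import math

def hope_score(q, k, delta, theta, gamma):
    """Assumed single-block kernel: exp(-gamma*|delta|) * q^T B(delta*theta) k."""
    phi = delta * theta
    c, s = math.cosh(phi), math.sinh(phi)
    boosted_k = (c * k[0] + s * k[1], s * k[0] + c * k[1])
    damping = math.exp(-gamma * abs(delta))
    return damping * (q[0] * boosted_k[0] + q[1] * boosted_k[1])

def rope_score(q, k, delta, g):
    """Single-block RoPE kernel: q^T R(delta*g) k."""
    c, s = math.cos(delta * g), math.sin(delta * g)
    rotated_k = (c * k[0] - s * k[1], s * k[0] + c * k[1])
    return q[0] * rotated_k[0] + q[1] * rotated_k[1]

def is_strictly_decreasing(xs):
    return all(a > b for a, b in zip(xs, xs[1:]))

q, k = (1.0, 0.5), (0.8, 0.3)   # illustrative vectors (assumption)
theta, gamma = 0.05, 0.08       # illustrative values with gamma > theta

hope = [hope_score(q, k, d, theta, gamma) for d in range(64)]
rope = [rope_score(q, k, d, 0.3) for d in range(64)]

print(is_strictly_decreasing(hope))  # damped boost kernel
print(is_strictly_decreasing(rope))  # rotation kernel
```

For these inputs the damped boost scores fall strictly with distance while the rotation scores rise and fall. Setting `gamma` below `theta` in the same sketch makes the score grow with distance instead, which is why the parameter condition matters.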

Additionally, the block-diagonal construction in each embedding plane allows HoPE to retain the efficient complexity scaling of standard rotary encodings without additional dependence on sequence length. The operator is implemented as a direct sum over $d/2$ two-dimensional Lorentz blocks, each with a learnable or fixed $\theta_i$, together with a global damping $\gamma$.

3. Implementation Details

HoPE requires minimal changes to transformer workflows. Input token representations are split into 2D chunks; for each chunk, the Lorentz boost with parameter $\theta_i$ is computed at a position-dependent argument. Query and key chunks are rotated using $\cosh$/$\sinh$-parametrized matrices, with exponential damping applied elementwise. Pseudocode (see Dai et al., 5 Sep 2025) illustrates the algorithmic loop over the $d/2$ subspaces:

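  • For queries (at position $m$): apply the Lorentz boost evaluated at the query position, with multiplicative exponential damping.
  • For keys (at position $n$): apply the Lorentz boost evaluated at the key position, with the inverse (i.e., reciprocal) damping.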
  • The final dot-product incorporates both relative Lorentz boosts and overall distance-dependent decay.
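The steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pseudocode: it assumes the query block is boosted by $+m\theta_i$ and scaled by $e^{-m\gamma_i}$, the key block is boosted by $-n\theta_i$ and scaled by $e^{+n\gamma_i}$, and damping is given per block. The function names `encode_query`, `encode_key`, and `score` are hypothetical.

```python
import math

def boost_apply(v, phi):
    """Apply the 2x2 boost [[cosh, sinh], [sinh, cosh]] with argument phi to v."""
    c, s = math.cosh(phi), math.sinh(phi)
    return (c * v[0] + s * v[1], s * v[0] + c * v[1])

def encode_query(q_blocks, m, thetas, gammas):
    """Assumed query map: boost by +m*theta_i, scale by exp(-m*gamma_i)."""
    out = []
    for block, theta, gamma in zip(q_blocks, thetas, gammas):
        b = boost_apply(block, m * theta)
        scale = math.exp(-m * gamma)
        out.append((scale * b[0], scale * b[1]))
    return out

def encode_key(k_blocks, n, thetas, gammas):
    """Assumed key map: boost by -n*theta_i, scale by exp(+n*gamma_i)."""
    out = []
    for block, theta, gamma in zip(k_blocks, thetas, gammas):
        b = boost_apply(block, -n * theta)
        scale = math.exp(n * gamma)
        out.append((scale * b[0], scale * b[1]))
    return out

def score(q_enc, k_enc):
    """Sum of per-block Euclidean dot products."""
    return sum(a[0] * b[0] + a[1] * b[1] for a, b in zip(q_enc, k_enc))

# Two 2D blocks (d = 4); all values are illustrative assumptions.
q_blocks = [(1.0, 0.5), (0.2, -0.4)]
k_blocks = [(0.8, 0.3), (0.6, 0.1)]
thetas, gammas = (0.05, 0.02), (0.08, 0.04)

# Same offset m - n = 6 at two different absolute positions.
s_a = score(encode_query(q_blocks, 10, thetas, gammas), encode_key(k_blocks, 4, thetas, gammas))
s_b = score(encode_query(q_blocks, 16, thetas, gammas), encode_key(k_blocks, 10, thetas, gammas))
```

Because the boost matrix is symmetric and boosts compose additively, the score under these assumptions reduces to $\sum_i e^{-(m-n)\gamma_i}\, q^{(i)\top} B((m-n)\theta_i)\, k^{(i)}$, so `s_a` and `s_b` agree. Note that the key-side factor $e^{+n\gamma_i}$ grows with position in this formulation, which motivates numerical care for long sequences.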

Numerical stability for large arguments is addressed using numerically stable library routines for hyperbolic functions or by clamping input ranges.
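One way to see the issue is that $\cosh$ overflows double precision for large arguments even when the damped product is tiny. The sketch below shows a possible workaround that folds the damping into the exponents using $\cosh x = (e^{x} + e^{-x})/2$, alongside a simple clamp helper. It assumes a damped term of the form $e^{-\gamma\Delta}\cosh(\theta\Delta)$ with $\Delta \ge 0$; this form and all names are illustrative assumptions, not the source's implementation.

```python
import math

def damped_cosh_naive(delta, theta, gamma):
    """exp(-gamma*delta) * cosh(theta*delta), computed factor by factor."""
    return math.exp(-gamma * delta) * math.cosh(theta * delta)

def damped_cosh_stable(delta, theta, gamma):
    """Same quantity with the damping folded into each exponent."""
    return 0.5 * (
        math.exp(-(gamma - theta) * delta) + math.exp(-(gamma + theta) * delta)
    )

def clamp(x, lo, hi):
    """Restrict x to the closed interval [lo, hi]."""
    return max(lo, min(hi, x))

def naive_overflows(delta, theta, gamma):
    """Return True if the naive evaluation raises OverflowError."""
    try:
        damped_cosh_naive(delta, theta, gamma)
    except OverflowError:
        return True
    return False

small = (damped_cosh_naive(3, 0.5, 0.6), damped_cosh_stable(3, 0.5, 0.6))
large_ok = damped_cosh_stable(2000, 0.5, 0.6)     # finite, very small
large_fails = naive_overflows(2000, 0.5, 0.6)     # cosh(1000) overflows
```

The two versions agree for small arguments. For the large offset, the naive version fails inside `math.cosh`, while the folded version returns a small finite value. Clamping is the blunter alternative: it avoids overflow but changes the value once the argument leaves the clamped range.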

HoPE introduces only two extra parameter vectors, $\theta$ (curvatures) and $\gamma$ (damping), and does not require additional state per sequence or token (Dai et al., 5 Sep 2025).

4. Empirical Evaluation

HoPE has been evaluated extensively on standard long-context benchmarks for language modeling and downstream tasks (Dai et al., 5 Sep 2025).

Zero-Shot Perplexity

On PG19 and arXiv, HoPE achieves lower perplexity than RoPE and ALiBi at sequence lengths of 2048 tokens and beyond (for example, 110.02 versus 144.13 and 132.80 at 6144 tokens), while trailing both slightly at 1024 tokens. This indicates improved robustness and generalization to context lengths not seen during training.

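Perplexity by evaluation sequence length (tokens):

| Method | 1024 | 2048 | 3072 | 4096 | 5120 | 6144 |
|--------|------|------|------|------|------|------|
| RoPE | 12.82 | 25.80 | 56.28 | 88.59 | 116.63 | 144.13 |
| HoPE | 13.35 | 16.46 | 35.07 | 60.03 | 85.94 | 110.02 |
| ALiBi | 11.95 | 25.11 | 52.54 | 79.04 | 107.59 | 132.80 |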
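The size of the gap can be read off the table with a short calculation. The sketch below computes HoPE's relative perplexity reduction against each baseline at every length; the numbers are copied from the table, and the helper name `relative_reduction` is an illustrative assumption.

```python
lengths = [1024, 2048, 3072, 4096, 5120, 6144]

# Perplexity values copied from the table above.
ppl = {
    "RoPE": [12.82, 25.80, 56.28, 88.59, 116.63, 144.13],
    "HoPE": [13.35, 16.46, 35.07, 60.03, 85.94, 110.02],
    "ALiBi": [11.95, 25.11, 52.54, 79.04, 107.59, 132.80],
}

def relative_reduction(baseline, method):
    """Fraction by which `method` lowers perplexity relative to `baseline`."""
    return [(b - h) / b for b, h in zip(ppl[baseline], ppl[method])]

vs_rope = dict(zip(lengths, relative_reduction("RoPE", "HoPE")))
vs_alibi = dict(zip(lengths, relative_reduction("ALiBi", "HoPE")))

for length in lengths:
    print(f"{length:>5}  vs RoPE: {vs_rope[length]:+.1%}  vs ALiBi: {vs_alibi[length]:+.1%}")
```

A negative value means HoPE has the higher perplexity, which occurs only at 1024 tokens; at every longer length the reduction is positive against both baselines.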

Downstream Tasks

On SCROLLS, covering QA, NLI, and summarization (sequence lengths up to 8192), HoPE outperforms or matches sinusoidal, RoPE, and ALiBi encodings across all major metrics. Ablation studies confirm that the damping factor $\gamma$ is influential in controlling the rate of attention decay; over-damping may harm capacity, while under-damping reduces long-range selectivity.

5. Geometric and Causal Extensions

The hyperbolic formalism underlying HoPE extends naturally to other forms of representation, including causality-informed and non-sequential feature sets. In CAPE ("Causality-Induced Positional Encoding") (Xu et al., 20 Sep 2025), feature graphs inferred by structural equation modeling are embedded as points in hyperbolic space (using the hyperboloid/Lorentz model), with rotary angles derived from hyperbolic distances. This associates strength of causal relationships (inverse hyperbolic separation) and specificity (radial position) to attention weighting, further leveraging the monotonicity and flexibility of hyperbolic encodings.

In such settings, the self-attention computation uses rotary matrices derived from hyperbolic embeddings of the underlying causal DAG. This generalizes HoPE beyond sequential data to arbitrary relational or graph-structured features.
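As background for the hyperboloid (Lorentz) model mentioned above, the following sketch computes the standard hyperbolic distance $d(x, y) = \operatorname{arccosh}(-\langle x, y \rangle_L)$ between points on the unit hyperboloid. How CAPE maps such distances to rotary angles is not specified here, so the sketch stops at the distance computation; the function names are illustrative assumptions.

```python
import math

def lift(v):
    """Lift Euclidean coordinates v onto the hyperboloid -x0^2 + |v|^2 = -1, x0 > 0."""
    x0 = math.sqrt(1.0 + sum(c * c for c in v))
    return (x0,) + tuple(v)

def lorentz_inner(x, y):
    """Lorentzian inner product with signature (-, +, ..., +)."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))

def hyperbolic_distance(x, y):
    """Geodesic distance on the hyperboloid model."""
    # The clamp guards against rounding pushing the argument slightly below 1.
    return math.acosh(max(1.0, -lorentz_inner(x, y)))

# In one spatial dimension, the point sinh(a) lifts to (cosh(a), sinh(a)),
# so the distance between the lifts of sinh(a) and sinh(b) is |a - b|.
a = lift((math.sinh(1.0),))
b = lift((math.sinh(2.5),))
dist_ab = hyperbolic_distance(a, b)

# A two-dimensional example with arbitrary illustrative points.
p = lift((0.3, -0.2))
r = lift((1.5, 0.7))
dist_pr = hyperbolic_distance(p, r)
```

In the one-dimensional check, `dist_ab` recovers the rapidity difference 1.5, which ties this distance to the boost arguments used in the earlier sections.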

6. Strengths, Limitations, and Future Directions

HoPE's strengths include:

  • Monotonic, bias-free exponential decay of attention weights as a function of token/feature separation.
  • Theoretical grounding in the Lorentz group, unifying and extending prior Euclidean methods.
  • Efficient implementation matching RoPE’s computational cost.
  • Robustness against resonance and oscillatory failure modes.

Limitations:

  • Choice of curvature and damping parameters ($\theta_i$, $\gamma$) must be tuned for each application; improper calibration can under- or over-damp long-range attention.
  • Empirical validation to date is limited to textual modeling, with cross-modal and multimodal generalization remaining to be systematically verified (Dai et al., 5 Sep 2025).

Potential extensions include learnable curvature per transformer head, incorporation into encoder-decoder and retrieval-augmented architectures, integration with alternative hyperbolic models (Poincaré disk or upper half-plane), and hybridization with sparse attention patterns for extreme sequence lengths.

HoPE generalizes the following prior methodologies:

  • Rotary Positional Encoding (RoPE), recoverable as the curvature $\theta_i \to 0$ limit of the Lorentz-boost formalism.
  • Linear or absolute encodings (e.g., sinusoidal), improved by HoPE's geometric damping and stable extrapolation.
  • Causality-aware hyperbolic embeddings, as in CAPE, which systematically encode graph relationships and feature hierarchies in hyperbolic space and inject them via rotary attention mechanisms into transformers (Xu et al., 20 Sep 2025).

Both theoretical and empirical analyses demonstrate that HoPE achieves a smoother, geometry-consistent decay of attention, supporting improved length extrapolation and modeling capacity for long-sequence and graph-structured data domains.
