- The paper introduces HELM, which leverages mixture-of-curvature experts in hyperbolic space to capture the semantic hierarchies of natural language.
- It proposes novel modules including Hyperbolic Multi-Head Latent Attention, Hyperbolic Rotary Positional Encoding, and hyperbolic RMSNorm to enhance efficiency and stability.
- Experiments demonstrate that HELM variants outperform Euclidean baselines on reasoning benchmarks, with the 1B HELM-MiCE model reaching an average accuracy of 24.9%, versus 23.9% for the comparable Euclidean baseline.
This paper introduces HELM, a family of hyperbolic large language models designed to operate entirely within hyperbolic space. The core motivation stems from the observation that natural language exhibits inherent semantic hierarchies and complex geometric structure, which traditional LLMs, relying on Euclidean geometry and operations such as dot products, do not fully capture. Empirical analysis of token embeddings from existing LLMs reveals a wide distribution of negative Ricci curvature, suggesting that hyperbolic space, known for its expansive, scale-free, and low-distortion properties, is a more suitable embedding space for text data.
The paper addresses several limitations of previous attempts at hyperbolic LMs, including representational inflexibility, lack of essential modern LLM modules, and poor scalability. To this end, HELM incorporates several key architectural innovations, summarized below and illustrated with brief code sketches after the list:
- Mixture-of-Curvature Experts (MiCE): To capture the observed variation in token embedding curvature, HELM-MiCE employs an architecture where different experts within a layer operate in distinct hyperbolic spaces with varying curvatures. This allows the model to learn more fine-grained geometric structures. Input tokens are projected to the expert's specific manifold, processed by the expert, and then projected back to the input manifold before being combined using a Lorentzian centroid and residual connection. The gating mechanism for selecting experts is also formulated based on minimizing hyperbolic distance.
- Hyperbolic Multi-Head Latent Attention (HMLA): To address the scalability issues of quadratic self-attention in hyperbolic space, HMLA is introduced. Inspired by Euclidean MLA, HMLA projects queries, keys, and values to a lower-dimensional latent space before performing attention. This significantly reduces the memory footprint for the KV cache during inference and active memory during training compared to standard hyperbolic self-attention. The memory complexity is reduced from O(2·n·n_h·L) to O((n_kv + n_r)·L), where n is the embedding dimension, n_h the number of heads, L the number of layers, and n_kv and n_r the latent dimensions, with n_kv, n_r ≪ n·n_h.
- Hyperbolic Rotary Positional Encoding (HoPE): A hyperbolic equivalent of Rotary Positional Encoding (RoPE) is developed. HoPE applies a Lorentzian rotation, which acts as a Euclidean rotation on the space-like dimensions, parameterized by the token position. Theoretical analysis shows that HoPE encodes relative positional information, exhibits long-term decay of attention scores with increasing relative distance, allows attention to be maximal at arbitrary relative distances, and enables learning diagonal and off-diagonal attention patterns, similar to its Euclidean counterpart.
- Hyperbolic RMSNorm: A hyperbolic version of Root Mean Square Normalization is introduced, applied to the space-like dimension of the Lorentz vector. This formulation is shown to be invariant to input scaling, contributing to training stability, analogous to Euclidean RMSNorm.
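To make the MiCE routing concrete, here is a minimal PyTorch sketch. It assumes the Lorentz model with curvature -c (time coordinate stored first) and a simple re-lifting of the space-like part to move between manifolds; the function names, the dense expert loop, and the softmax-over-negative-distance gate are illustrative choices rather than the paper's exact formulation.

```python
import torch

def lorentz_inner(x, y):
    # Lorentzian inner product <x, y>_L = -x_0 y_0 + <x_space, y_space>.
    return -x[..., :1] * y[..., :1] + (x[..., 1:] * y[..., 1:]).sum(-1, keepdim=True)

def lift(space, c):
    # Place a space-like vector onto the hyperboloid <x, x>_L = -1/c.
    time = torch.sqrt(1.0 / c + space.pow(2).sum(-1, keepdim=True))
    return torch.cat([time, space], dim=-1)

def switch_curvature(x, c_to):
    # Move a point to a hyperboloid of different curvature by reusing its
    # space-like part (one simple choice; the paper's projection may differ).
    return lift(x[..., 1:], c_to)

def mice_layer(x, experts, curvatures, centroids, c_in=1.0):
    # x: (n_tokens, d+1) Lorentz points of curvature -c_in.
    # experts[i]: callable operating on the hyperboloid of curvature -curvatures[i].
    # centroids: (n_experts, d+1) learnable routing points on the input manifold.
    # Gating favors experts whose centroid is closest in hyperbolic distance.
    inner = torch.clamp(-c_in * lorentz_inner(x.unsqueeze(1), centroids.unsqueeze(0)),
                        min=1.0 + 1e-7)
    dist = (torch.acosh(inner) / c_in ** 0.5).squeeze(-1)   # (n_tokens, n_experts)
    gates = torch.softmax(-dist, dim=-1)

    # For clarity every expert processes every token; a real MoE routes top-k only.
    mix = torch.zeros_like(x)
    for e, (expert, c_e) in enumerate(zip(experts, curvatures)):
        y = expert(switch_curvature(x, c_e))      # run the expert in its own geometry
        y = switch_curvature(y, c_in)             # project back to the input manifold
        mix = mix + gates[:, e:e + 1] * y

    # Lorentzian centroid: renormalize the weighted ambient sum onto the hyperboloid.
    norm = torch.sqrt(torch.clamp(-lorentz_inner(mix, mix), min=1e-7))
    return mix / (c_in ** 0.5 * norm)
```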
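The memory saving behind HMLA is easiest to see in a Euclidean analogue. The sketch below shows the latent KV compression that MLA-style attention builds on; HELM additionally wraps such projections in Lorentzian linear maps, which is omitted here, and all module names and dimensions are illustrative.

```python
import torch
import torch.nn as nn

class LatentKVCache(nn.Module):
    # Per token, only a low-rank latent c_kv plus a small rotary key k_r is cached;
    # full per-head keys/values are reconstructed from the latent at attention time.
    def __init__(self, d_model=1024, n_heads=16, d_head=64, d_latent=128, d_rope=32):
        super().__init__()
        self.down_kv = nn.Linear(d_model, d_latent)         # cached: d_latent floats per token
        self.to_k_rope = nn.Linear(d_model, d_rope)         # cached: d_rope floats per token
        self.up_k = nn.Linear(d_latent, n_heads * d_head)   # applied on the fly
        self.up_v = nn.Linear(d_latent, n_heads * d_head)

    def cache(self, h):          # h: (batch, d_model) hidden state of a new token
        return self.down_kv(h), self.to_k_rope(h)

    def expand(self, c_kv):      # c_kv: (batch, seq, d_latent) cached latents
        return self.up_k(c_kv), self.up_v(c_kv)

# With these example sizes, each token/layer caches d_latent + d_rope = 160 floats,
# versus 2 * n_heads * d_head = 2048 floats for a standard multi-head KV cache.
```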
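A HoPE-style rotation can be sketched as follows, again assuming the Lorentz model with the time coordinate first. The split-half pairing of dimensions and the per-position application are assumptions, not the paper's exact parameterization; the key point is that rotating only the space-like part preserves its Euclidean norm, so the point stays on the same hyperboloid.

```python
import torch

def rope_rotate(x_space, pos, base=10000.0):
    # RoPE-style rotation applied pairwise to the space-like coordinates.
    d = x_space.shape[-1]
    assert d % 2 == 0
    half = d // 2
    freqs = base ** (-torch.arange(half, dtype=x_space.dtype) / half)
    angles = pos * freqs                      # pos: scalar token position
    cos, sin = torch.cos(angles), torch.sin(angles)
    x1, x2 = x_space[..., :half], x_space[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

def hope_like(x, pos, c=1.0):
    # x: (..., d+1) point on the hyperboloid <x, x>_L = -1/c.
    # The rotation preserves the norm of the space-like part, so recomputing the
    # time-like coordinate returns exactly the original value: the Lorentz
    # constraint is maintained.
    space = rope_rotate(x[..., 1:], pos)
    time = torch.sqrt(1.0 / c + space.pow(2).sum(-1, keepdim=True))
    return torch.cat([time, space], dim=-1)
```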
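Finally, a minimal sketch of the hyperbolic RMSNorm idea: normalize only the space-like coordinates with Euclidean RMSNorm, then recompute the time-like coordinate so the result lies back on the hyperboloid. The exact formulation in the paper may differ; this illustrates why the operation inherits RMSNorm's scale invariance.

```python
import torch

def hyperbolic_rmsnorm(x, weight, c=1.0, eps=1e-6):
    # x: (..., d+1) Lorentz point; weight: (d,) learnable gain.
    # Scaling the space-like input by any a > 0 leaves space / rms unchanged,
    # mirroring the invariance of Euclidean RMSNorm.
    space = x[..., 1:]
    rms = torch.sqrt(space.pow(2).mean(-1, keepdim=True) + eps)
    space = weight * space / rms
    time = torch.sqrt(1.0 / c + space.pow(2).sum(-1, keepdim=True))
    return torch.cat([time, space], dim=-1)
```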
The overall HELM architecture follows a decoder-only transformer structure. Token embeddings are initialized in hyperbolic space (specifically, the Lorentz model). Decoder blocks consist of a hyperbolic attention mechanism (HMLA or self-attention) and a hyperbolic feedforward network (either a dense Hyperbolic SwiGLU FFN - HFFNSG - or a MiCE module). Hyperbolic RMSNorm is applied before the attention and FFN layers, and Lorentzian residual connections are used to combine layer outputs. The final output is projected to logits for next-token prediction.
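Schematically, each decoder block composes these pieces in a pre-norm arrangement. The sketch below uses hypothetical callables standing in for the modules described above, not the released implementation.

```python
def helm_block(x, attn, ffn, norm1, norm2, residual):
    # attn: HMLA or Lorentz self-attention; ffn: HFFN-SG or a MiCE module;
    # norm1/norm2: hyperbolic RMSNorm; residual: Lorentzian residual x (+)_L y.
    x = residual(x, attn(norm1(x)))   # pre-norm attention sub-layer
    x = residual(x, ffn(norm2(x)))    # pre-norm feedforward sub-layer
    return x
```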
For practical implementation, the Lorentz model of hyperbolic geometry is chosen for its optimization stability. Key operations like Lorentzian linear transformations (HLT), Lorentzian residual connections (x ⊕_L y), and Lorentzian self-attention are defined based on the Lorentz inner product and distance. The paper provides mathematical details and proofs for the theoretical properties of HoPE and Hyperbolic RMSNorm.
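For reference, the basic Lorentz-model quantities the text refers to can be written compactly as follows (a sketch assuming curvature -c with the time coordinate stored first; the residual shown is one common formulation and may not match the paper's exact definition).

```python
import torch

def lorentz_inner(x, y):
    # <x, y>_L = -x_0 y_0 + <x_space, y_space>.
    return -x[..., :1] * y[..., :1] + (x[..., 1:] * y[..., 1:]).sum(-1, keepdim=True)

def lorentz_distance(x, y, c=1.0):
    # Geodesic distance on the hyperboloid <x, x>_L = -1/c.
    inner = torch.clamp(-c * lorentz_inner(x, y), min=1.0 + 1e-7)
    return torch.acosh(inner) / c ** 0.5

def lorentz_residual(x, y, c=1.0):
    # One way to realize x (+)_L y: add the points in ambient coordinates and
    # rescale the sum back onto the hyperboloid (learned combination weights,
    # if any, are omitted).
    z = x + y
    norm = torch.sqrt(torch.clamp(-lorentz_inner(z, z), min=1e-7))
    return z / (c ** 0.5 * norm)
```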
Experimental evaluation is conducted on standard multiple-choice question answering benchmarks (MMLU, ARC-Challenge, CommonsenseQA, HellaSwag, OpenBookQA) at 100M and 1B parameter scales. HELM-D (dense hyperbolic model) and HELM-MiCE (hyperbolic model with Mixture-of-Curvature Experts) are trained from scratch on a 5B-token English Wikipedia dataset. These are compared against Euclidean baselines (LLaMA for dense, DeepSeekV3 for MoE) trained on the same dataset and at comparable scales.
The results demonstrate that both HELM-D and HELM-MiCE consistently outperform their Euclidean counterparts across the benchmarks, particularly on reasoning tasks like MMLU and ARC-Challenge. The HELM-MiCE variants generally achieve higher accuracy than HELM-D, highlighting the benefit of the mixed-curvature approach. For instance, the 1B HELM-MiCE model achieved an average accuracy of 24.9% across the benchmarks, compared to 23.9% for the 1B DeepSeekV3 baseline.
Ablation studies further support the proposed modules. A HELM-MiCE variant with constant curvature across experts (MiCE-Const) performs worse than HELM-MiCE with distinct curvatures, confirming the advantage of allowing experts to operate in different geometric spaces. Replacing HoPE with learned relative positional encodings results in slightly lower overall performance, suggesting that HoPE is the better fit for positional encoding in hyperbolic space.
On the implementation side, the authors rely on Riemannian optimization techniques and take care to preserve numerical stability, particularly when training at larger scales (e.g., by adapting methods from hyperbolic word-embedding training). The paper notes that HELM models required longer training times (1.5x to 1.8x) than the Euclidean baselines on the tested hardware.
The paper concludes by acknowledging limitations such as training on a smaller dataset compared to commercial LLMs and potential under-exposure to specific domains. Future work could involve scaling HELM to larger datasets and compute scales, guided by hyperbolic scaling laws, and exploring further improvements in hyperbolic model architectures.
The open-source code for HELM is available on GitHub at github.com/Graph-and-Geometric-Learning/helm.