HELM: Hyperbolic Large Language Models via Mixture-of-Curvature Experts (2505.24722v1)

Published 30 May 2025 in cs.LG and cs.AI

Abstract: LLMs have shown great success in text modeling tasks across domains. However, natural language exhibits inherent semantic hierarchies and nuanced geometric structure, which current LLMs do not capture completely owing to their reliance on Euclidean operations. Recent studies have also shown that not respecting the geometry of token embeddings leads to training instabilities and degradation of generative capabilities. These findings suggest that shifting to non-Euclidean geometries can better align LLMs with the underlying geometry of text. We thus propose to operate fully in Hyperbolic space, known for its expansive, scale-free, and low-distortion properties. We thus introduce HELM, a family of HypErbolic LLMs, offering a geometric rethinking of the Transformer-based LLM that addresses the representational inflexibility, missing set of necessary operations, and poor scalability of existing hyperbolic LMs. We additionally introduce a Mixture-of-Curvature Experts model, HELM-MICE, where each expert operates in a distinct curvature space to encode more fine-grained geometric structure from text, as well as a dense model, HELM-D. For HELM-MICE, we further develop hyperbolic Multi-Head Latent Attention (HMLA) for efficient, reduced-KV-cache training and inference. For both models, we develop essential hyperbolic equivalents of rotary positional encodings and RMS normalization. We are the first to train fully hyperbolic LLMs at billion-parameter scale, and evaluate them on well-known benchmarks such as MMLU and ARC, spanning STEM problem-solving, general knowledge, and commonsense reasoning. Our results show consistent gains from our HELM architectures -- up to 4% -- over popular Euclidean architectures used in LLaMA and DeepSeek, highlighting the efficacy and enhanced reasoning afforded by hyperbolic geometry in large-scale LM pretraining.

Summary

  • The paper introduces HELM, which leverages mixture-of-curvature experts in hyperbolic space to capture the semantic hierarchies of natural language.
  • It proposes novel modules including Hyperbolic Multi-Head Latent Attention, Hyperbolic Rotary Positional Encoding, and hyperbolic RMSNorm to enhance efficiency and stability.
  • Experiments demonstrate that HELM variants outperform Euclidean baselines on reasoning benchmarks, with the 1B HELM-MiCE model reaching 24.9% average accuracy versus 23.9% for a comparable DeepSeekV3 baseline.

This paper introduces HELM (Hyperbolic LLMs), a family of LLMs designed to operate entirely within hyperbolic space. The core motivation stems from the observation that natural language exhibits inherent semantic hierarchies and complex geometric structures, which are not fully captured by traditional LLMs relying on Euclidean geometry and operations like dot products. Empirical analysis of token embeddings from existing LLMs reveals a wide distribution of negative Ricci curvature, suggesting that hyperbolic space, known for its expansive, scale-free, and low-distortion properties, is a more suitable embedding space for text data.

The paper addresses several limitations of previous attempts at hyperbolic LMs, including representational inflexibility, lack of essential modern LLM modules, and poor scalability. To this end, HELM incorporates several key architectural innovations:

  1. Mixture-of-Curvature Experts (MiCE): To capture the observed variation in token embedding curvature, HELM-MiCE employs an architecture where different experts within a layer operate in distinct hyperbolic spaces with varying curvatures. This allows the model to learn more fine-grained geometric structures. Input tokens are projected to the expert's specific manifold, processed by the expert, and then projected back to the input manifold before being combined using a Lorentzian centroid and residual connection. The gating mechanism for selecting experts is also formulated based on minimizing hyperbolic distance.
  2. Hyperbolic Multi-Head Latent Attention (HMLA): To address the scalability issues of quadratic self-attention in hyperbolic space, HMLA is introduced. Inspired by Euclidean MLA, HMLA projects queries, keys, and values to a lower-dimensional latent space before performing attention. This significantly reduces the memory footprint for the KV cache during inference and active memory during training compared to standard hyperbolic self-attention. The memory complexity is reduced from O(2·n·n_h·L) to O((n_kv + n_r)·L), where n is the embedding dimension, n_h the number of heads, L the number of layers, and n_kv and n_r the latent dimensions, with n_kv, n_r ≪ n·n_h.
  3. Hyperbolic Rotary Positional Encoding (HoPE): A hyperbolic equivalent of Rotary Positional Encoding (RoPE) is developed. HoPE applies a Lorentzian rotation, which acts as a Euclidean rotation on the space-like dimensions, parameterized by the token position. Theoretical analysis shows that HoPE encodes relative positional information, exhibits long-term decay of attention scores with increasing relative distance, allows attention to be maximal at arbitrary relative distances, and enables learning diagonal and off-diagonal attention patterns, similar to its Euclidean counterpart.
  4. Hyperbolic RMSNorm: A hyperbolic version of Root Mean Square Normalization is introduced, applied to the space-like dimension of the Lorentz vector. This formulation is shown to be invariant to input scaling, contributing to training stability, analogous to Euclidean RMSNorm.
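
To make the HoPE construction in item 3 concrete, the sketch below applies RoPE-style rotations only to the space-like coordinates of a Lorentz vector, leaving the time-like coordinate fixed; such a map is a Lorentz rotation and therefore preserves Lorentz inner products, and attention scores built from it depend only on relative position. Function names and the frequency schedule are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def hope_rotate(x, pos, base=10000.0):
    # Rotate consecutive pairs of the space-like coordinates x[1:] by
    # position-dependent angles (RoPE-style). The time-like coordinate
    # x[0] is untouched, so the map acts as a Lorentz rotation and
    # preserves the Lorentz inner product.
    out = x.copy()
    space = x[1:]
    d = len(space)
    for i in range(0, d - 1, 2):
        theta = pos * base ** (-i / d)
        c, s = np.cos(theta), np.sin(theta)
        a, b = space[i], space[i + 1]
        out[1 + i] = c * a - s * b
        out[2 + i] = s * a + c * b
    return out
```

Because the rotation matrices for positions m and n compose to a rotation by n − m on the space-like block, the Lorentz inner product between a rotated query and key encodes only their relative distance.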

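The hyperbolic RMSNorm of item 4 can be sketched as ordinary RMSNorm applied to the space-like part of a Lorentz vector, followed by recomputing the time-like coordinate so the output stays on the hyperboloid; the helper name and the re-lifting step are assumptions for illustration, not the paper's exact definition.

```python
import numpy as np

def hyperbolic_rmsnorm(x, gain, K=-1.0, eps=1e-8):
    # Normalize only the space-like part x[1:] by its RMS, then solve
    # <y, y>_L = 1/K for the time-like coordinate so the result stays
    # on the hyperboloid of curvature K < 0.
    space = x[1:]
    rms = np.sqrt(np.mean(space ** 2) + eps)
    space = gain * space / rms
    x0 = np.sqrt(1.0 / (-K) + space @ space)
    return np.concatenate(([x0], space))
```

Since the RMS scales linearly with the space-like part, scaling that part leaves the output (essentially) unchanged, mirroring the scale invariance of Euclidean RMSNorm.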
The overall HELM architecture follows a decoder-only transformer structure. Token embeddings are initialized in hyperbolic space (specifically, the Lorentz model). Decoder blocks consist of a hyperbolic attention mechanism (HMLA or self-attention) and a hyperbolic feedforward network (either a dense Hyperbolic SwiGLU FFN, HFFN_SG, or a MiCE module). Hyperbolic RMSNorm is applied before the attention and FFN layers, and Lorentzian residual connections are used to combine layer outputs. The final output is projected to logits for next-token prediction.
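
The KV-cache savings from HMLA can be made concrete with a back-of-the-envelope calculation; the configuration numbers below are illustrative assumptions, not the paper's actual model settings.

```python
# Illustrative per-token KV-cache comparison for HMLA vs. standard
# multi-head attention, using the complexity terms from the paper.
# All configuration numbers here are assumptions for illustration.
n, n_h, L = 2048, 16, 24     # embedding dim, attention heads, layers
n_kv, n_r = 512, 64          # latent KV dim and rotary dim

standard_cache = 2 * n * n_h * L    # O(2 n n_h L) cache entries
hmla_cache = (n_kv + n_r) * L       # O((n_kv + n_r) L) cache entries

reduction = standard_cache / hmla_cache  # over 100x fewer entries here
```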

For practical implementation, the Lorentz model of hyperbolic geometry is chosen for its optimization stability. Key operations like Lorentzian linear transformations (HLT), Lorentzian residual connections (x ⊕_L y), and Lorentzian self-attention are defined based on the Lorentz inner product and distance. The paper provides mathematical details and proofs for the theoretical properties of HoPE and Hyperbolic RMSNorm.

Experimental evaluation is conducted on standard multiple-choice question answering benchmarks (MMLU, ARC-Challenge, CommonsenseQA, HellaSwag, OpenbookQA) at 100M and 1B parameter scales. HELM-D (dense hyperbolic model) and HELM-MiCE (hyperbolic model with Mixture-of-Curvature Experts) are trained from scratch on a 5B token English Wikipedia dataset. These are compared against Euclidean baselines (LLaMA for dense, DeepSeekV3 for MoE) trained on the same dataset and at comparable scales.

The results demonstrate that both HELM-D and HELM-MiCE consistently outperform their Euclidean counterparts across the benchmarks, particularly on reasoning tasks like MMLU and ARC-Challenge. The HELM-MiCE variants generally achieve higher accuracy than HELM-D, highlighting the benefit of the mixed-curvature approach. For instance, the 1B HELM-MiCE model achieved an average accuracy of 24.9% across the benchmarks, compared to 23.9% for the 1B DeepSeekV3 baseline.

Ablation studies further support the proposed modules. A HELM-MiCE variant with constant curvature across experts (MiCE-Const) performs worse than HELM-MiCE with distinct curvatures, confirming the advantage of allowing experts to operate in different geometric spaces. Replacing HoPE with learned relative positional encodings results in slightly lower overall performance, suggesting that HoPE is the more effective choice in hyperbolic space.

Implementation considerations include the use of Riemannian optimization techniques and attention to numerical stability, particularly during training at larger scales (e.g., adapting techniques from hyperbolic word embedding training). The paper notes that HELM models required 1.5x to 1.8x longer training times than the Euclidean baselines on the tested hardware.

The paper concludes by acknowledging limitations such as training on a smaller dataset compared to commercial LLMs and potential under-exposure to specific domains. Future work could involve scaling HELM to larger datasets and compute scales, guided by hyperbolic scaling laws, and exploring further improvements in hyperbolic model architectures.

The open-source code for HELM is available on GitHub at github.com/Graph-and-Geometric-Learning/helm.