GraphMoRE: Riemannian Mixture-of-Experts

Updated 10 May 2026

GraphMoRE is a neural architecture that dynamically assigns graph components to distinct Riemannian experts defined on constant-curvature manifolds.
Its topology-aware gating mechanism minimizes geometric distortion and enhances downstream task efficacy in anomaly detection and graph embedding.
Training combines unsupervised Riemannian-Adam optimization with an explicit distortion loss, achieving superior performance over single-manifold models.

Riemannian Mixture-of-Experts (GraphMoRE) encompasses a class of neural architectures for graph representation learning that operate via sparse gating of multiple expert models, each defined on potentially distinct Riemannian geometries. This approach addresses the heterogeneous geometric and topological structures in real-world graphs by dynamically assigning each node, substructure, or input sample to the most suitable combination of manifold-expert networks. The general principle is to reduce embedding distortion and improve downstream task efficacy by personalizing the geometric context through which each element is processed.

1. Motivation: Heterogeneity and the Need for Mixture-of-Riemannian-Experts

Graph structures often consist of heterogeneous subcomponents—such as chains, cliques, hierarchical trees, or lattices—each aligning most naturally with different constant-curvature spaces (hyperbolic for hierarchies, Euclidean for grids, spherical for cliques). Embedding all vertices into a single, globally homogeneous manifold is fundamentally limited: it guarantees high distortion for topological features that do not conform to the selected curvature. Empirical findings from anomaly detection and graph embedding demonstrate that task performance, as measured by AUROC, can vary by 10–20 points as the global curvature parameter $\kappa$ is swept from negative (hyperbolic) to positive (spherical) values, with no single choice working well for all graph types (Zhao et al., 6 Feb 2026).

Prior attempts to address this, such as product manifolds or multi-manifold models, remain globally homogeneous in curvature profile and are insufficiently adaptive to local topology (Guo et al., 2024). GraphMoRE frameworks, in contrast, implement node- or subgraph-level selection and fusion of a set of Riemannian experts, each defined by a separate constant-curvature manifold. This adaptive, per-sample mixture minimizes distortion and captures the latent geometry of diverse graphs.

2. Architectural Foundations: The MoRE Framework

The principal GraphMoRE architectures share the following ingredients:

Curvature-Specialized Riemannian Experts: Each expert consists of a (potentially deep) graph neural network operating in its own manifold $\mathcal{M}_{\kappa}$, with geometry governed by its curvature parameter $\kappa$ (e.g., Poincaré/hyperbolic for $\kappa<0$, Euclidean for $\kappa=0$, spherical for $\kappa>0$). These GNNs perform message passing via exponential/logarithmic maps at the origin, applying aggregation in the local tangent space.
Topology- or Task-Aware Gating Mechanisms: To allocate elements to experts, GraphMoRE employs gating networks that infer local geometry. In anomaly detection (Zhao et al., 6 Feb 2026), the gate is anomaly-aware and memory-driven, routing samples to experts by historical reconstruction fidelity. In topology modeling (Guo et al., 2024), the gate encodes multi-scale neighborhood information via graph samplers and deep set encoders, producing a softmax distribution over expert weights for each node.
Fusion and Alignment of Expert Outputs: Outputs from multiple experts are fused via a weighted manifold product at the origin, yielding a personalized, mixed-curvature embedding per node. To enable meaningful comparison and pooling across samples with non-identical expert mixtures, GraphMoRE introduces alignment techniques, such as attention-based reweighting, that privilege shared expert components in distance computations.
Distortion Minimization and Task-driven Losses: The training objectives couple standard downstream losses with explicit regularization terms for geometric distortion. The latter penalizes discrepancies between graph-theoretic and Riemannian distances, thereby steering the gating mechanism to favor low-distortion embeddings.
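The expert-level message passing described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the $\kappa$-stereographic model (suggested by the manifold definition in Section 3), mean aggregation with self-loops, and a single linear transform. The names `tan_k`, `arctan_k`, `exp0`, `log0`, and `expert_layer` are hypothetical.

```python
import numpy as np

def tan_k(u, kappa):
    """Curvature-dependent tangent: tan (kappa > 0), identity (kappa = 0), tanh (kappa < 0)."""
    s = np.sqrt(abs(kappa))
    if kappa > 0:
        return np.tan(s * u) / s
    if kappa < 0:
        return np.tanh(s * u) / s
    return u

def arctan_k(u, kappa):
    """Inverse of tan_k on its principal branch."""
    s = np.sqrt(abs(kappa))
    if kappa > 0:
        return np.arctan(s * u) / s
    if kappa < 0:
        return np.arctanh(s * u) / s
    return u

def exp0(v, kappa, eps=1e-9):
    """Exponential map at the origin: tangent vector -> point on the manifold."""
    norm = np.maximum(np.linalg.norm(v, axis=-1, keepdims=True), eps)
    return tan_k(norm, kappa) * v / norm

def log0(x, kappa, eps=1e-9):
    """Logarithmic map at the origin: point on the manifold -> tangent vector."""
    norm = np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)
    return arctan_k(norm, kappa) * x / norm

def expert_layer(x, adj, weight, kappa):
    """One assumed expert layer: aggregate neighbours in the tangent space at the origin."""
    h = log0(x, kappa)                       # manifold -> tangent space
    a = adj + np.eye(adj.shape[0])           # add self-loops
    a = a / a.sum(axis=1, keepdims=True)     # row-normalise (mean aggregation)
    return exp0(a @ h @ weight, kappa)       # aggregate, transform, map back
```

For $\kappa>0$ the tangent norm must stay below $\pi/(2\sqrt{\kappa})$ for `exp0` to remain finite, so a practical implementation would clip or rescale tangent vectors before mapping back.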

3. Riemannian Geometry and Mixture Operations

Each expert operates in a constant-curvature Riemannian manifold $\mathcal{M}_\kappa^d$, defined and parameterized as follows (Guo et al., 2024, Zhao et al., 6 Feb 2026):

$\mathcal{M}_\kappa^d = \left\{x \in \mathbb{R}^d \mid -\kappa \|x\|^2 < 1 \right\}$

Geodesic Distance: For $\kappa<0$ (hyperbolic), distance is given by the Poincaré ball metric; for $\kappa=0$ (Euclidean), by $\|x-y\|_2$; and for $\kappa>0$ (spherical), by spherical arc length.
(Log, Exp) Maps: To preserve computational robustness, feature aggregation and update steps occur in tangent space, with encodings mapped to and from the manifold via the corresponding exponential and logarithmic maps.
Mixture Fusion: The fused embedding $z_i$ for a node $v_i$ is computed as a weighted manifold product: $z_i = \left( w_{i,1} \otimes_{\kappa_1} z_i^{(1)},\ \ldots,\ w_{i,K} \otimes_{\kappa_K} z_i^{(K)} \right) \in \mathcal{M}_{\kappa_1}^d \times \cdots \times \mathcal{M}_{\kappa_K}^d$, where $z_i^{(k)}$ is the output of expert $k$, $w_{i,k}$ is the gating weight, and $\otimes_{\kappa_k}$ denotes scaling in tangent space followed by mapping onto the manifold. This produces a non-homogeneous, per-node "personalized" geometric context.
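Alignment for Cross-Mixture Distances: To meaningfully compare nodes with differing expert mixtures, GraphMoRE computes alignment coefficients via a softmax over the Hadamard product of gating assignments, yielding aligned weights $\alpha_{ij} = \mathrm{softmax}(w_i \odot w_j)$ for pairwise distance computations: $d(z_i, z_j) = \sum_{k=1}^{K} \alpha_{ij,k}\, d_{\kappa_k}\!\left(z_i^{(k)}, z_j^{(k)}\right)$, where $w_i$ is the vector of gating weights of node $v_i$ and $z_i^{(k)}$ its embedding under expert $k$. This concentrates the comparison on the experts that both nodes weight heavily.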
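The fusion and alignment steps can be illustrated with a short sketch. It rests on several assumptions: the $\kappa$-stereographic model with its standard Möbius addition and scaling, concatenation as the product operation, and an aligned distance taken as the $\alpha$-weighted sum of per-expert geodesic distances. All function names are hypothetical, and the exact form in the cited papers may differ.

```python
import numpy as np

def tan_k(u, kappa):
    s = np.sqrt(abs(kappa))
    if kappa > 0:
        return np.tan(s * u) / s
    if kappa < 0:
        return np.tanh(s * u) / s
    return u

def arctan_k(u, kappa):
    s = np.sqrt(abs(kappa))
    if kappa > 0:
        return np.arctan(s * u) / s
    if kappa < 0:
        return np.arctanh(s * u) / s
    return u

def mobius_scale(w, z, kappa, eps=1e-9):
    """Scale log0(z) by w in tangent space, then map back with exp0."""
    norm = max(np.linalg.norm(z), eps)
    return tan_k(w * arctan_k(norm, kappa), kappa) * z / norm

def mobius_add(x, y, kappa):
    """Moebius addition in the kappa-stereographic model."""
    xy, x2, y2 = x @ y, x @ x, y @ y
    num = (1 - 2 * kappa * xy - kappa * y2) * x + (1 + kappa * x2) * y
    return num / (1 - 2 * kappa * xy + kappa ** 2 * x2 * y2)

def dist(x, y, kappa):
    """Geodesic distance; kappa = 0 gives 2 * ||x - y|| (Euclidean up to the model's factor 2)."""
    return 2.0 * arctan_k(np.linalg.norm(mobius_add(-x, y, kappa)), kappa)

def fuse(expert_embs, gate, kappas):
    """Weighted manifold product: one rescaled block per expert, concatenated."""
    return np.concatenate([mobius_scale(w, z, k)
                           for w, z, k in zip(gate, expert_embs, kappas)])

def aligned_distance(embs_i, gate_i, embs_j, gate_j, kappas):
    """Softmax over the Hadamard product of gates, then weighted per-expert distances."""
    logits = gate_i * gate_j
    alpha = np.exp(logits - logits.max())
    alpha /= alpha.sum()
    d = np.array([dist(zi, zj, k) for zi, zj, k in zip(embs_i, embs_j, kappas)])
    return float(alpha @ d), alpha
```

Because the softmax is taken over products of gate weights, an expert contributes strongly to the distance only when both nodes assign it a large weight.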

4. Applications: Anomaly Detection and Topological Embedding

Graph Anomaly Detection. The GAD-MoRE model (Zhao et al., 6 Feb 2026) demonstrates the advantage of diverse curvature experts in zero-shot anomaly detection. Its key contributions include the anomaly-aware multi-curvature feature alignment (MCFA), a residual-embedding GNN backbone, reconstruction-based mixture scoring, and the memory-based dynamic router (MDR) that adaptively chooses experts based on historical anomaly reconstruction. Anomaly scores are computed via reconstruction error in the mixed-curvature embedding, with ablation showing that all three components (MCFA, MoRE, MDR) are essential for performance.

GAD-MoRE reports 82.09% average AUROC and 36.96% average AUPRC across standard benchmarks. These results exceed both zero-shot and fine-tuned state-of-the-art baselines, with improvements of more than 5 AUROC points and roughly 3 AUPRC points over the best generalist competitors on multiple domains.

Topological Heterogeneity and Foundation Models. In general representation learning, the GraphMoRE framework (Guo et al., 2024) integrates Riemannian experts with a topology-aware gate, enabling each node to inhabit a distinct mixed-curvature context. Objective function design penalizes deviation from graph shortest-path distances, directly minimizing geometric distortion. This yields consistently lower distortion (e.g., 0.22 on Cora versus 0.69 or higher for product-manifold baselines), AUCs stable above 98% on synthetic heterogeneous graphs, and improved node classification performance over previous methods.

The table summarizes core architectural components across representative GraphMoRE instantiations:

Framework	Key Components	Specialized for
GAD-MoRE (Zhao et al., 6 Feb 2026)	MCFA, MoRE, MDR	Zero-shot anomaly detection
GraphMoRE (Guo et al., 2024)	Topology-aware gate, distortion loss	General topological embedding
MoG (Zhang et al., 2024)	Expert graph sparsifiers, Grassmann mixture	Graph sparsification

5. Training Procedures and Optimization Strategies

GraphMoRE architectures are trained end-to-end, predominantly in an unsupervised manner, using Riemannian-Adam optimizers for expert network parameters and Adam for gating/head parameters. Key losses include:

Reconstruction or Downstream Task Loss: $\mathcal{L}_{\text{task}}$, e.g., reconstruction, cross-entropy, or link prediction.
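Distortion Regularizer: $\mathcal{L}_{\text{dist}}$ penalizes quadratic deviation between Riemannian and shortest-path distances: $\mathcal{L}_{\text{dist}} = \sum_{i \neq j} \left| \left( \frac{d_{\mathcal{M}}(z_i, z_j)}{d_G(v_i, v_j)} \right)^2 - 1 \right|$, where $d_{\mathcal{M}}$ is the distance between embeddings and $d_G$ the graph shortest-path distance.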
Gating Entropy / Importance Losses: For robust expert usage and high-entropy allocation.
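The regularizers listed above can be sketched numerically. This is an illustration under stated assumptions rather than the published objective: the distortion term uses the squared-ratio form $|(d_{\mathcal{M}}/d_G)^2 - 1|$ averaged over node pairs, the importance term is the squared coefficient of variation of per-expert load (a common mixture-of-experts choice), and the function names are hypothetical.

```python
import numpy as np

def distortion_loss(d_manifold, d_graph):
    """Mean |(d_M / d_G)^2 - 1| over unordered node pairs (assumes a connected graph)."""
    iu = np.triu_indices_from(d_graph, k=1)   # pairs i < j, so d_graph > 0
    ratio = d_manifold[iu] / d_graph[iu]
    return float(np.mean(np.abs(ratio ** 2 - 1.0)))

def gate_entropy(gates, eps=1e-12):
    """Average Shannon entropy of the per-node gate distributions."""
    return float(-np.mean(np.sum(gates * np.log(gates + eps), axis=1)))

def importance_loss(gates):
    """Squared coefficient of variation of total gate mass per expert."""
    importance = gates.sum(axis=0)
    return float(importance.var() / importance.mean() ** 2)
```

A perfectly isometric embedding gives a distortion of zero, uniform gates give the maximum entropy $\log K$ and zero importance loss, and gates that collapse onto a single expert are penalized by the importance term.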

Hyperparameters include the number of experts $K$ (typically 3–5), initialization of curvature parameters $\kappa_k$, manifold embedding dimension $d$, and the top-$k$ routing control used in mixture fusion.
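Top-$k$ routing can be illustrated with a small sketch. It assumes the common sparse-gating recipe of keeping the $k$ largest gate logits per node and renormalizing with a softmax over the survivors; the source does not specify the exact routing rule, and `top_k_gate` is a hypothetical name.

```python
import numpy as np

def top_k_gate(logits, k):
    """Keep the k largest logits per row, softmax over them, zero weight elsewhere."""
    idx = np.argpartition(logits, -k, axis=1)[:, -k:]   # indices of the top-k experts
    masked = np.full_like(logits, -np.inf)              # logits must be a float array
    np.put_along_axis(masked, idx, np.take_along_axis(logits, idx, axis=1), axis=1)
    masked -= masked.max(axis=1, keepdims=True)         # numerical stability
    weights = np.exp(masked)                            # exp(-inf) = 0 for dropped experts
    return weights / weights.sum(axis=1, keepdims=True)

logits = np.array([[2.0, 1.0, 0.0, -1.0],
                   [0.0, 0.0, 3.0, 1.0]])
weights = top_k_gate(logits, k=2)   # each row has exactly two non-zero weights
```

Setting $k = K$ recovers dense softmax gating, while smaller $k$ trades expressiveness for fewer expert evaluations per node.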

Training follows a standard "sample → encode → gate → expert → fuse → loss" pipeline. Memory- or history-based gating, when deployed, adds state tracking for episodic expert selection.
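A forward pass through the encode, gate, expert, and fuse stages might look like the sketch below. It is a simplified illustration with assumed components: a one-layer `tanh` encoder, a dense softmax gate on neighbourhood-averaged features, single-layer experts, and the $\kappa$-stereographic exponential map. The `params` dictionary layout and the function names are hypothetical, and the loss and optimizer step are left out.

```python
import numpy as np

def exp0(v, kappa, eps=1e-9):
    """Exponential map at the origin of the kappa-stereographic model."""
    n = np.maximum(np.linalg.norm(v, axis=-1, keepdims=True), eps)
    s = np.sqrt(abs(kappa))
    if kappa > 0:
        r = np.tan(np.minimum(s * n, 1.5)) / s   # clip below pi/2 so tan stays finite
    elif kappa < 0:
        r = np.tanh(s * n) / s
    else:
        r = n
    return r * v / n

def forward(x, adj, params, kappas):
    """Encode -> gate -> experts -> fuse; returns fused embeddings and gate weights."""
    a = adj + np.eye(adj.shape[0])
    a = a / a.sum(axis=1, keepdims=True)               # mean aggregation with self-loops
    h = np.tanh(x @ params["encoder"])                 # encode node features
    logits = (a @ h) @ params["gate"]                  # gate sees neighbourhood context
    logits -= logits.max(axis=1, keepdims=True)
    gates = np.exp(logits)
    gates /= gates.sum(axis=1, keepdims=True)          # softmax over experts
    blocks = []
    for k, kappa in enumerate(kappas):
        tangent = a @ h @ params["experts"][k]         # expert k, in tangent space
        # Scaling by the gate weight before exp0 is the tangent-space scaling of fusion.
        blocks.append(exp0(gates[:, k:k + 1] * tangent, kappa))
    return np.concatenate(blocks, axis=1), gates       # product-manifold embedding
```

The returned embedding stacks one block per expert, so a task or distortion loss would be evaluated on these blocks together with the gate weights.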

6. Experimental Insights and Comparative Results

GraphMoRE consistently exhibits superior performance across tasks demanding adaptation to structural or geometric heterogeneity:

In anomaly detection, ablation studies reveal that omitting any of MCFA, MoRE, or MDR components degrades AUROC by 5–7 points. GAD-MoRE significantly exceeds zero-shot and few-shot generalist baselines (Zhao et al., 6 Feb 2026).
In topological embedding, GraphMoRE obtains the lowest average embedding distortion and the highest link prediction/node classification metrics on real and synthetic datasets (Guo et al., 2024).
In graph sparsification, Mixture-of-Graphs (MoG; Zhang et al., 2024) variants using expert sparsifiers and Riemannian barycenters yield improved sparsity-utility tradeoffs and speedups with negligible test score reductions.

Alignment strategies and explicit distortion regularization are critical for cross-manifold distance consistency and robust downstream performance.

7. Implications and Future Directions

The Riemannian Mixture-of-Experts paradigm establishes a flexible architectural template for learning on graphs that combine fundamentally different geometric properties. It suggests a foundational approach for next-generation graph models ("graph foundation models") that adaptively learn and combine geometric priors at the node or sample level, moving beyond rigid or global manifold choices. A plausible implication is that variants incorporating additional manifold types, task-conditional gating, or continuous curvature mixtures may further improve modeling of real-world graph distributions and anomalies. The methodology remains distinguished by its capacity for local geometric adaptation, principled distortion minimization, and demonstrable empirical superiority across diverse tasks (Zhao et al., 6 Feb 2026, Guo et al., 2024).