
GMoPE: Graph Mixture-of-Prompt-Experts

Updated 7 November 2025
  • GMoPE is an architectural framework integrating Mixture-of-Experts and prompt-based learning to specialize graph neural networks across heterogeneous domains.
  • It employs structure-aware routing and soft orthogonality regularization to ensure expert specialization while reducing adaptation costs via prompt-only fine-tuning.
  • GMoPE achieves state-of-the-art performance in graph tasks, delivering improved AUC and efficiency over prior methods with minimal adaptation overhead.

A Graph Mixture-of-Prompt-Experts (GMoPE) Framework for Graph Foundation Models

GMoPE (Graph Mixture-of-Prompt-Experts) is an architectural framework for constructing graph foundation models (GFMs) with enhanced scalability, transferability, and adaptation efficiency. GMoPE systematically addresses key challenges in graph transfer learning—including negative transfer, expert collapse, and adaptation cost—by structurally integrating Mixture-of-Experts (MoE) with prompt-based learning for graph neural networks (GNNs). The approach enforces expert specialization via prompt orthogonality, enables structure-aware routing, and reduces adaptation complexity through prompt-only fine-tuning, collectively yielding empirically and theoretically superior cross-domain performance.

1. Architectural Principles and MoE-Prompt Integration

GMoPE comprises $M$ expert GNNs, each associated with a trainable prompt vector $\mathbf{p}_m \in \mathbb{R}^{d_p}$ acting as a semantic and structural bias. For each graph input, node features are first aligned (optionally via SVD preprocessing for feature unification). For expert $m$, a prompt-augmented input is constructed as

$$\hat{X}_m^{(i)} = \left[\tilde{X}^{(i)} \,\|\, \mathbf{p}_m \mathbf{1}^{\top}\right] \in \mathbb{R}^{|\mathcal{V}^{(i)}| \times (d_0 + d_p)},$$

where $d_0$ is the feature dimension and the prompt is broadcast uniformly across nodes. All expert and prompt parameters are jointly optimized during pretraining.
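
As a concrete illustration, the prompt-broadcasting step can be written in a few lines of PyTorch. This is a minimal sketch, not the authors' code; the function name and tensor shapes are assumptions for exposition:

```python
import torch

def prompt_augment(x: torch.Tensor, prompt: torch.Tensor) -> torch.Tensor:
    """Broadcast a per-expert prompt vector to every node and concatenate it
    with the (aligned) node feature matrix.

    x:      [num_nodes, d0]  aligned node features (X tilde)
    prompt: [d_p]            trainable prompt vector p_m for expert m
    returns [num_nodes, d0 + d_p]
    """
    broadcast = prompt.unsqueeze(0).expand(x.size(0), -1)  # one copy of p_m per node
    return torch.cat([x, broadcast], dim=-1)

# Example: 5 nodes, 16-dim features, 4-dim prompt -> output shape [5, 20]
x = torch.randn(5, 16)
p_m = torch.nn.Parameter(torch.randn(4))
x_hat_m = prompt_augment(x, p_m)
```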

Unlike prior prompt methods (e.g., GPF) that use a single learnable prompt with a shared GNN backbone, GMoPE’s per-expert prompting enables conditional specialization for each region of the graph or domain partition, strictly increasing the expressivity class: $\mathcal{F}_\mathrm{GPF} \subsetneq \mathcal{F}_\mathrm{GMoPE}$.

2. Structure-Aware Routing and Confidence-Weighted Aggregation

GMoPE employs a structure-aware routing mechanism to dynamically combine experts' outputs. During training and inference, the router computes, for each expert, a "Rawscore" from the expert's mean task loss across the batch $\mathcal{B}$:

$$\mathrm{Rawscore}_m = \frac{1}{|\mathcal{B}|} \sum_{s \in \mathcal{B}} \mathcal{L}\!\left(E_m(\hat{X}_m^{(i)}; s)\right).$$

The gating vector $g(x)_i$ is then softmaxed over the top-$K$ experts with temperature $\tau$ (soft router), or averaged over the top-$K$ (hard router), selecting the region-appropriate specialists:

$$g(x)_i = \begin{cases} \dfrac{\exp(\mathrm{Rawscore}_i / \tau)}{\sum_{j \in \mathcal{K}} \exp(\mathrm{Rawscore}_j / \tau)} & i \in \mathcal{K} \\ 0 & \text{otherwise} \end{cases}$$

where $\mathcal{K}$ is the set of top-$K$ experts.
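
The gating step can be sketched as follows. This is illustrative PyTorch, not the paper's code; whether the loss is negated before the softmax (so that lower-loss experts receive higher weight) is a convention the formula above leaves implicit, and this sketch follows the formula as written:

```python
import torch
import torch.nn.functional as F

def soft_router(raw_scores: torch.Tensor, k: int, tau: float) -> torch.Tensor:
    """Temperature-scaled softmax gate restricted to the top-K experts.

    raw_scores: [M] per-expert Rawscore (mean task loss over a batch)
    returns:    [M] gating vector g(x); entries outside the top-K are zero
    """
    topk_vals, topk_idx = raw_scores.topk(k)              # select top-K experts
    gates = torch.zeros_like(raw_scores)
    gates[topk_idx] = F.softmax(topk_vals / tau, dim=0)   # softmax over the top-K only
    return gates

def hard_router(raw_scores: torch.Tensor, k: int) -> torch.Tensor:
    """Hard-router variant: uniform average over the top-K experts."""
    _, topk_idx = raw_scores.topk(k)
    gates = torch.zeros_like(raw_scores)
    gates[topk_idx] = 1.0 / k
    return gates
```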

At inference, instead of uniform averaging, a confidence-guided aggregation is applied: each expert output is weighted by its inverted normalized entropy, providing robustness and better calibration.
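
A sketch of this inference-time aggregation, assuming each expert emits a class-probability vector; the exact entropy normalization and weight renormalization are assumptions here rather than details taken from the paper:

```python
import math
import torch

def confidence_weighted_aggregate(expert_probs: torch.Tensor) -> torch.Tensor:
    """Aggregate expert predictions weighted by (1 - normalized entropy).

    expert_probs: [M, C] per-expert predicted class probabilities
    returns:      [C]    aggregated class distribution
    """
    eps = 1e-12
    entropy = -(expert_probs * (expert_probs + eps).log()).sum(dim=-1)   # [M]
    norm_entropy = entropy / math.log(expert_probs.size(-1))             # scale to [0, 1]
    confidence = 1.0 - norm_entropy                                      # inverted normalized entropy
    weights = confidence / confidence.sum().clamp_min(eps)               # normalize weights
    return (weights.unsqueeze(-1) * expert_probs).sum(dim=0)
```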

3. Soft Orthogonality Regularization and Expert Specialization

To avoid expert collapse (all experts converging to similar behavior) and enforce specialization, GMoPE introduces a soft orthogonality loss among prompt vectors:

$$\mathcal{L}_\mathrm{ortho} = \frac{1}{M(M-1)} \sum_{m \neq n} \exp\!\left(\frac{\mathbf{p}_m^\top \mathbf{p}_n}{\|\mathbf{p}_m\|_2\, \|\mathbf{p}_n\|_2}\right).$$

This constraint, integrated into the pretraining objective,

$$\min_{\theta,\,\mathbf{p}}\; \lambda \mathcal{L}_\mathrm{ortho} + \frac{1}{BM} \sum_{i=1}^{B} \sum_{m=1}^{M} g_m^i(\hat{X}_m^i)\, \mathcal{L}_\mathrm{pre}(\Phi_{\theta_m}),$$

systematically encourages prompt vectors to occupy orthogonal subspaces, promoting semantic and structural diversity among experts. Ablation studies demonstrate that omitting this term increases expert overlap and diminishes transfer performance.
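
The loss above is straightforward to compute from the stacked prompt vectors. A minimal PyTorch sketch (not the authors' implementation):

```python
import torch
import torch.nn.functional as F

def soft_orthogonality_loss(prompts: torch.Tensor) -> torch.Tensor:
    """L_ortho = (1 / (M(M-1))) * sum over m != n of exp(cos(p_m, p_n)).

    prompts: [M, d_p] stacked expert prompt vectors
    """
    m = prompts.size(0)
    normed = F.normalize(prompts, p=2, dim=-1)                       # unit-norm prompts
    cos = normed @ normed.t()                                        # [M, M] cosine similarities
    mask = ~torch.eye(m, dtype=torch.bool, device=prompts.device)    # drop m == n terms
    return cos[mask].exp().mean()                                    # mean over M(M-1) pairs
```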

4. Prompt-Only Fine-Tuning and Adaptation Efficiency

A central feature of GMoPE is its prompt-only fine-tuning strategy for downstream transfer. After pretraining:

  • All expert GNN weights and routing parameters are frozen.
  • Only the expert-specific prompt vectors {pm}\{\mathbf{p}_m\} and the (lightweight) prediction head are updated for the target task/domain.

This yields a significant reduction in the number of adaptation parameters, often to less than 1% of full fine-tuning, while retaining performance close to, or exceeding, full-parameter methods. For example, reported tunable parameter counts for graph classification are:

Model                  Tunable Params (Graph Cls.)
FT (full fine-tuning)  3,584
GraphPrompt            32
GPF                    16
AnyGraph (MoE, FT)     10,752
GMoPE                  12

The adaptation cost scales with the number of experts and task classes, not with the size of the GNN or input features.
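
In practice, the prompt-only transfer setup amounts to freezing the pretrained experts and router and optimizing only the prompts and the prediction head. A hedged sketch, with hypothetical module names and an illustrative optimizer choice:

```python
import torch

def prepare_prompt_only_finetuning(experts, router, prompts, head, lr=1e-3):
    """Freeze pretrained expert GNNs and the router; keep prompts and head trainable.

    experts: iterable of expert GNN nn.Modules (frozen)
    router:  routing nn.Module (frozen)
    prompts: nn.ParameterList of per-expert prompt vectors (trainable)
    head:    lightweight prediction head nn.Module (trainable)
    """
    for module in list(experts) + [router]:
        for p in module.parameters():
            p.requires_grad_(False)                    # freeze backbone parameters
    trainable = list(prompts.parameters()) + list(head.parameters())
    return torch.optim.Adam(trainable, lr=lr)          # illustrative optimizer / learning rate
```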

5. Empirical Performance Across Graph Domains

GMoPE establishes state-of-the-art results across link prediction, node classification, and graph classification tasks, on both citation/e-commerce graphs (Cora, Citeseer, Pubmed, Computers, Photo) and molecular datasets (PROTEINS, DD, NCI109). Results show:

  • Up to +3.42% AUC over prior MoE (AnyGraph) in link prediction.
  • +19.84% (node) and +6.84% (graph) over strong prompt baselines (GraphPrompt).
  • Robustness in both unsupervised (GAE, DGI, GraphCL) and supervised (EdgePred) pretraining regimes.
  • Parameter and adaptation overhead remains minimal due to prompt-only fine-tuning, with negligible SVD/alignment preprocessing cost.

Ablations confirm the necessity of MoE+prompt hybridization, prompt orthogonality, and structure-based routing for maximal generalization and transfer efficiency.

6. Implications for Scalable Cross-Domain Graph Learning

GMoPE's approach enables foundation models to:

  • Specialize structurally and semantically across highly heterogeneous graph domains by combining expert diversity (via orthogonal prompts) and dynamic routing.
  • Deliver efficient, rapid adaptation for new tasks via low-overhead, prompt-centric fine-tuning.
  • Serve as a modular backbone, supporting arbitrary GNN architectures as experts and compatibility with dominant graph pretraining strategies.
  • Avoid negative transfer and catastrophic forgetting through fixed expert parameters and selective prompt adaptation.

The function class achievable under GMoPE strictly subsumes that of existing prompt-based GNNs, offering greater flexibility and theoretical expressivity.

7. Mathematical Summary and Theoretical Foundation

Soft orthogonality loss:

$$\mathcal{L}_\mathrm{ortho} = \frac{1}{M(M-1)} \sum_{m \neq n} \exp\!\left(\frac{\mathbf{p}_m^\top \mathbf{p}_n}{\|\mathbf{p}_m\|_2\, \|\mathbf{p}_n\|_2}\right)$$

Prompt-only fine-tuning (transfer) objective:

$$\min_{\mathbf{p},\,\phi}\; \lambda \mathcal{L}_\mathrm{ortho} + \frac{1}{BM} \sum_{i=1}^{B} \sum_{m=1}^{M} g_m^i(\hat{X}_m^i)\, \mathcal{L}_\mathrm{task}\!\left(f_\phi(\Phi_{\theta_m})\right)$$

where $\lambda$ tunes the strength of the orthogonality regularization and $f_\phi(\Phi_{\theta_m})$ is the prediction head on top of frozen expert $m$.


GMoPE demonstrates that a synergy of MoE architecture and prompt-conditioned specialization, mediated by soft orthogonality and structure-aware routing, is fundamental for building scalable, robust, and parameter-efficient graph foundation models. The framework offers strong empirical advances and a strictly larger function class than prior prompt-based approaches, paving the way for universal graph learning systems adaptable across diverse domains with minimal adaptation overhead.
