Multiscale Interaction Mixture of Experts
- MI-MoE is a topology-aware framework that replaces fixed neighborhood definitions with a multiscale ensemble of distance-cutoff experts for adaptive molecular modeling.
- It leverages multiple interaction ranges—from covalent bonds to long-range packing—with a topological gating mechanism that dynamically routes inputs based on persistent descriptors.
- Empirical evaluations demonstrate that MI-MoE significantly improves regression and classification performance on standard benchmarks, such as MoleculeNet and polymer property prediction tasks.
The Multiscale Interaction Mixture of Experts (MI-MoE) is a topology-aware framework designed for efficient and adaptive modeling of spatial interactions in 3D molecular graph neural networks (GNNs). The central innovation of MI-MoE is to replace rigid, globally fixed neighborhood definitions with an ensemble of distance-cutoff experts routed by topological descriptors capturing multiscale connectivity. MI-MoE explicitly targets the diverse length scales at which critical molecular phenomena—such as non-covalent interactions, stereoelectronic effects, and long-range packing forces—manifest, and addresses the challenge that no single geometric cutoff suffices for all tasks. It consistently improves leading 3D GNN backbones in regression and classification tasks across molecular and polymer property benchmarks by adaptively modulating the interaction budget using topological summaries of the input conformer (Nguyen et al., 19 Jan 2026).
1. Distance-Cutoff Expert Formulation
MI-MoE abandons the conventional single-cutoff scheme in GNN-based molecular modeling by employing an ensemble of interaction experts, each corresponding to a physically motivated distance radius. Given a molecular point cloud $X = \{x_i\}_{i=1}^{N}$, where $i$ indexes atoms and $x_i \in \mathbb{R}^3$ are atomic coordinates, the interaction graph for cutoff $c$ is $G_c = (V, E_c)$ with $E_c = \{(i, j) : \|x_i - x_j\| \le c,\ i \ne j\}$.
Because $E_c \subseteq E_{c'}$ whenever $c \le c'$, a filtration of graphs emerges as $c$ increases. MI-MoE selects cutoffs $c_e \in \{2.0, 2.5, 3.0, 3.5, 4.0\}$ Å, ranging from covalent to medium-range packing effects. Each expert applies an independent $L$-layer 3D GNN (e.g., SchNet, DimeNet++, PaiNN) to $G_{c_e}$ with its own parameters.
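Under this cutoff definition, the edge sets for increasing radii are nested, which is what makes the construction a filtration. A minimal sketch in pure Python, with toy coordinates and illustrative names (not the paper's implementation):

```python
from itertools import combinations
import math

def edges_within_cutoff(coords, cutoff):
    """Return the undirected edge set E_c = {(i, j) : ||x_i - x_j|| <= c}."""
    edges = set()
    for i, j in combinations(range(len(coords)), 2):
        if math.dist(coords[i], coords[j]) <= cutoff:
            edges.add((i, j))
    return edges

# Toy 3-atom point cloud (coordinates in Angstroms, purely illustrative).
coords = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (0.0, 3.2, 0.0)]
cutoffs = [2.0, 2.5, 3.0, 3.5, 4.0]
graphs = {c: edges_within_cutoff(coords, c) for c in cutoffs}

# Filtration property: edge sets are nested as the cutoff grows.
for small, large in zip(cutoffs, cutoffs[1:]):
    assert graphs[small] <= graphs[large]
```

Each expert operates on one of these nested graphs; a shorter cutoff sees only the covalent skeleton, while a longer one also sees non-bonded contacts.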
The message-passing at expert $e$, layer $l$ proceeds as:
- For atom $i$ with neighbor set $\mathcal{N}_{c}(i) = \{j : (i, j) \in E_{c}\}$, compute edge messages $m_{ij}^{(l)} = \phi_m\big(h_i^{(l)}, h_j^{(l)}, e_{ij}\big)$,
where $e_{ij}$ encodes relative geometry (e.g., the interatomic distance $\|x_i - x_j\|$).
- Aggregate messages and update features: $h_i^{(l+1)} = \phi_u\big(h_i^{(l)}, \sum_{j \in \mathcal{N}_{c}(i)} m_{ij}^{(l)}\big)$.
After $L$ layers, a READOUT step yields the expert embedding $z_e = \mathrm{READOUT}\big(\{h_i^{(L)}\}_{i=1}^{N}\big)$.
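The per-expert loop can be sketched as follows. Scalar features and a fixed exponential distance envelope stand in for the learned message and update functions, and mean pooling stands in for READOUT; these are simplifications for illustration, not the SchNet/DimeNet++/PaiNN layers the paper actually uses:

```python
import math

def expert_forward(coords, features, cutoff, num_layers=2):
    """Toy cutoff expert: distance-enveloped message passing + mean readout."""
    n = len(coords)
    h = [float(f) for f in features]
    # Neighborhood induced by this expert's cutoff radius.
    neigh = {i: [j for j in range(n)
                 if j != i and math.dist(coords[i], coords[j]) <= cutoff]
             for i in range(n)}
    for _ in range(num_layers):
        # Residual update: each atom accumulates distance-weighted messages.
        h = [h[i] + sum(h[j] * math.exp(-math.dist(coords[i], coords[j]))
                        for j in neigh[i])
             for i in range(n)]
    return sum(h) / n  # mean READOUT -> scalar expert embedding z_e
```

With a cutoff below the shortest interatomic distance, no messages flow and the readout is just the mean of the initial features, which is the degenerate limit of the filtration.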
Short-range ($2.0$ Å) experts capture covalent structure; intermediate cutoffs ($2.5$–$3.0$ Å) address hydrogen bonds and steric constraints; larger cutoffs ($3.5$–$4.0$ Å) encompass extended, diffuse interactions.
2. Topological Gating and Routing
Expert activation in MI-MoE is governed by a filtration-based topological encoder that inspects the conformer's multiscale structure. A dense sequence of radii $r_1 < \dots < r_T$ discretizes a fixed radius interval. At each scale $r_t$, five normalized descriptors are computed:
- Randić index
- Wiener index
- Global efficiency
- Betti curves from persistent homology ($\beta_0$ and $\beta_1$)
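The graph-level descriptors at a single scale can be computed directly from the edge set. The sketch below uses hop distances and the identity $\beta_1 = |E| - |V| + \beta_0$ for the cycle rank of a graph; values are left unnormalized, and the paper's exact normalization is not reproduced here:

```python
from itertools import combinations

def hop_distances(n, edges):
    """All-pairs BFS hop distances; float('inf') for disconnected pairs."""
    adj = {i: set() for i in range(n)}
    for i, j in edges:
        adj[i].add(j)
        adj[j].add(i)
    dist = {}
    for s in range(n):
        d = {s: 0}
        frontier = [s]
        while frontier:
            nxt = []
            for u in frontier:
                for v in adj[u]:
                    if v not in d:
                        d[v] = d[u] + 1
                        nxt.append(v)
            frontier = nxt
        for t in range(n):
            dist[(s, t)] = d.get(t, float("inf"))
    return dist

def scale_descriptors(n, edges):
    """Randic index, Wiener index, global efficiency, beta_0, beta_1
    of one filtration graph (unnormalized, illustrative)."""
    deg = [0] * n
    for i, j in edges:
        deg[i] += 1
        deg[j] += 1
    randic = sum((deg[i] * deg[j]) ** -0.5 for i, j in edges)
    d = hop_distances(n, edges)
    pairs = list(combinations(range(n), 2))
    wiener = sum(d[(i, j)] for i, j in pairs if d[(i, j)] != float("inf"))
    efficiency = sum(1.0 / d[(i, j)] for i, j in pairs
                     if 0 < d[(i, j)] != float("inf")) / len(pairs)
    comps = {frozenset(t for t in range(n) if d[(s, t)] != float("inf"))
             for s in range(n)}
    beta0 = len(comps)              # connected components
    beta1 = len(edges) - n + beta0  # independent cycles (cycle rank)
    return randic, wiener, efficiency, beta0, beta1
```

Evaluating these five numbers at every radius $r_t$ traces out the Betti and index curves the gating network consumes.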
Concatenating the descriptors across all $T$ scales gives the topological trajectory of the conformer. After flattening, an MLP computes unnormalized expert scores. Sparse gating is enforced: only the top-$k$ (typically $k = 2$) experts remain active via masking and softmax, producing a sparse attention vector $\alpha$ over experts. This vector adaptively routes the molecular input to the most informative interaction scales as dictated by its topological connectivity across radii.
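Top-$k$ masking followed by a softmax over the surviving logits can be sketched as follows (function name and toy scores are illustrative):

```python
import math

def topk_sparse_gate(scores, k=2):
    """Mask all but the top-k logits, then softmax over the survivors."""
    top = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)[:k]
    exps = {e: math.exp(scores[e]) for e in top}
    z = sum(exps.values())
    # Inactive experts get exactly zero weight, so they can be skipped entirely.
    return [exps[e] / z if e in exps else 0.0 for e in range(len(scores))]

alpha = topk_sparse_gate([0.2, 1.5, -0.3, 0.9, 0.1], k=2)
```

Here only experts 1 and 3 survive the mask, and their weights renormalize to sum to one; the remaining three experts contribute nothing to the forward pass.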
3. Forward Pass and Model Architecture Integration
MI-MoE functions as a modular enhancement for any compatible 3D GNN backbone. The forward pass incorporates the following steps:
- Construct the sparse expert graphs $G_{c_e}$ and the dense filtration graphs used by the topological encoder.
- Compute the topological trajectory via the filtration descriptors.
- Obtain gating logits and apply top-$k$ masking for sparse activation.
- Compute each active expert $e$'s embedding $z_e$.
- Form the final representation as the gated sum $z = \sum_{e} \alpha_e z_e$, where $\alpha_e$ are the sparse gating weights (zero for inactive experts).
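The final mixing step reduces to a gated sum in which zero-weight experts are skipped; a minimal sketch:

```python
def mix_experts(alpha, expert_embeddings):
    """Final representation z = sum_e alpha_e * z_e over active experts."""
    dim = len(expert_embeddings[0])
    z = [0.0] * dim
    for a, emb in zip(alpha, expert_embeddings):
        if a == 0.0:
            continue  # sparse routing: inactive experts contribute nothing
        for d in range(dim):
            z[d] += a * emb[d]
    return z
```

In practice the skipped experts need not be evaluated at all, which is what keeps the multi-expert forward pass affordable.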
Each expert is a shallow GNN sharing a common hidden dimension, and the gating MLP has two hidden layers of width $256$. Despite using multiple experts, MI-MoE maintains competitive parameter counts by halving the number of layers per expert compared to standalone baselines.
4. Training Objectives and Hyperparameters
The training objective comprises the primary task loss (e.g., MSE for regression, cross-entropy for classification) and two MoE regularizers:
- Score balance ($\mathcal{L}_{\text{score}}$): encourages a uniform distribution of gating weights across experts.
- Load balance ($\mathcal{L}_{\text{load}}$): targets even expert selection frequencies across the batch.
The total loss is
$$\mathcal{L} = \mathcal{L}_{\text{task}} + \lambda_{\text{score}}\,\mathcal{L}_{\text{score}} + \lambda_{\text{load}}\,\mathcal{L}_{\text{load}},$$
with scalar coefficients $\lambda_{\text{score}}$ and $\lambda_{\text{load}}$ weighting the regularizers.
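The paper does not spell out the exact regularizer forms here; one common MoE formulation, shown below as an assumption rather than the paper's definition, penalizes the squared coefficient of variation of the per-expert mean gate weight (score balance) and of the per-expert selection frequency (load balance):

```python
def balance_losses(gate_matrix):
    """gate_matrix[b][e]: sparse gate weight of expert e for batch item b.
    Returns (score_balance, load_balance), each the squared coefficient of
    variation (variance / mean^2) of a per-expert statistic -- both are zero
    when experts are used uniformly."""
    B = len(gate_matrix)
    E = len(gate_matrix[0])

    def cv_sq(values):
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        return var / (mean ** 2) if mean > 0 else 0.0

    # Mean gate weight per expert (score balance statistic).
    mean_score = [sum(row[e] for row in gate_matrix) / B for e in range(E)]
    # Fraction of batch items that selected each expert (load balance statistic).
    load = [sum(1.0 for row in gate_matrix if row[e] > 0) / B for e in range(E)]
    return cv_sq(mean_score), cv_sq(load)
```

A perfectly balanced batch yields $(0, 0)$, while a batch that routes everything to one expert yields strictly positive penalties.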
Optimization uses AdamW with the learning rate, weight decay, batch size, and dropout rate selected on validation data. A cosine-annealing learning-rate schedule with a 10-epoch warm-up and up to 120 epochs is used, with early stopping if no performance improvement is observed for 30 epochs.
Key hyperparameters include five experts with cutoffs $\{2.0, 2.5, 3.0, 3.5, 4.0\}$ Å, a dense radii window for the filtration descriptors, an MLP gate of depth two (width $256$), and top-$k$ expert selection with $k = 2$.
5. Empirical Results and Benchmark Performance
Experimental evaluation uses MoleculeNet tasks (regression: FreeSolv, ESOL, Lipophilicity; classification: BACE, BBBP, SIDER, Tox21, ClinTox) and polymer property prediction (electron affinity, ionization energy, crystallization, refractive index). MI-MoE is benchmarked against 2D GNNs, SMILES transformers, prominent 3D GNNs (including SchNet, DimeNet++, PaiNN), and recent MoE variants.
Key findings:
- Average RMSE reduction by 0.15 and ROC-AUC boost by 8% across MoleculeNet, relative to base 3D GNNs.
- MI-MoE-SchNet halves the error on electron affinity and crystallization tasks compared to best prior geometry- or topology-aware models.
Table 1. Representative MoleculeNet Results
| Backbone | Metric | Baseline | + MI-MoE | Δ |
|---|---|---|---|---|
| SchNet | RMSE | 1.23 ± 0.15 | 1.06 ± 0.20 | –0.17 |
| SchNet | ROC-AUC | 69.7 ± 4.9% | 81.3 ± 3.9% | +11.6% |
| DimeNet++ | RMSE | 1.14 ± 0.25 | 1.04 ± 0.22 | –0.10 |
| DimeNet++ | ROC-AUC | 78.7 ± 3.3% | 80.0 ± 2.5% | +1.3% |
| PaiNN | RMSE | 1.09 ± 0.30 | 0.97 ± 0.27 | –0.12 |
| PaiNN | ROC-AUC | 72.8 ± 4.7% | 79.9 ± 4.4% | +7.1% |
Table 2. Polymer Property Prediction RMSE
| Model | Electron affinity | Ionization energy | Crystallization | Refractive index |
|---|---|---|---|---|
| Mol-TDL | 0.263 | 0.417 | 15.86 | 0.068 |
| GEM | 0.274 | 0.313 | 17.82 | 0.092 |
| MI-MoE-SchNet | 0.148 | 0.240 | 8.95 | 0.065 |
Compared to Uni-Mol and TopExpert, MI-MoE is competitive or superior, particularly at similar or reduced computational depth.
6. Ablation Studies and Architectural Choices
Ablation experiments address choices in expert cutoffs, gating network architecture, and aggregation strategies:
- Cutoff selection: Shifting the cutoff set toward shorter or longer ranges affects performance, with task-dependent preferences (e.g., BACE favors the default mid-range set; FreeSolv/Tox21 sometimes prefer an extended range).
- Gating architectures: The MLP gate consistently outperforms the Transformer-based gate on most tasks; marginal Transformer gains observed in some classification regimes.
- Expert aggregation: Using topology to select a single expert ($k = 1$) already surpasses fixed-cutoff baselines; allowing a top-$2$ sparse mixture (full MI-MoE) further enhances generality and regularization.
7. Context, Significance, and Implications
MI-MoE advances 3D molecular graph learning by specializing expert GNNs to physically meaningful interaction ranges and leveraging persistent topological signatures for adaptive routing. Its modularity enables consistent performance improvements as a drop-in component across a variety of invariant and equivariant 3D GNN architectures and molecular domains. These results demonstrate the efficacy of topology-aware, multiscale routing over conventional neighborhood heuristics. A plausible implication is that filtration-based gating and multiscale ensembles may generalize to broader applications where the relevant interaction scales are heterogeneous and data-dependent (Nguyen et al., 19 Jan 2026).