Multiscale Interaction Mixture of Experts
- MI-MoE is a topology-aware framework that replaces fixed neighborhood definitions with a multiscale ensemble of distance-cutoff experts for adaptive molecular modeling.
- It leverages multiple interaction ranges—from covalent bonds to long-range packing—with a topological gating mechanism that dynamically routes inputs based on persistent descriptors.
- Empirical evaluations demonstrate that MI-MoE significantly improves regression and classification performance on standard benchmarks, such as MoleculeNet and polymer property prediction tasks.
The Multiscale Interaction Mixture of Experts (MI-MoE) is a topology-aware framework designed for efficient and adaptive modeling of spatial interactions in 3D molecular graph neural networks (GNNs). The central innovation of MI-MoE is to replace rigid, globally fixed neighborhood definitions with an ensemble of distance-cutoff experts routed by topological descriptors capturing multiscale connectivity. MI-MoE explicitly targets the diverse length scales at which critical molecular phenomena—such as non-covalent interactions, stereoelectronic effects, and long-range packing forces—manifest, and addresses the challenge that no single geometric cutoff suffices for all tasks. It consistently improves leading 3D GNN backbones in regression and classification tasks across molecular and polymer property benchmarks by adaptively modulating the interaction budget using topological summaries of the input conformer (Nguyen et al., 19 Jan 2026).
1. Distance-Cutoff Expert Formulation
MI-MoE abandons the conventional single-cutoff scheme in GNN-based molecular modeling by employing an ensemble of interaction experts, each corresponding to a physically motivated distance radius. Given a molecular point cloud $X = \{x_i\}_{i=1}^{N}$, where $i$ indexes atoms and $x_i \in \mathbb{R}^3$ are atomic coordinates, the interaction graph for cutoff $c$ is $G_c = (V, E_c)$ with $E_c = \{(i, j) : \|x_i - x_j\| \le c,\ i \ne j\}$.
Because $E_c \subseteq E_{c'}$ whenever $c \le c'$, a filtration of graphs emerges as $c$ increases. MI-MoE selects cutoffs $c_e \in \{2.0, 2.5, 3.0, 3.5, 4.0\}$ Å, ranging from covalent to medium-range packing effects. Each expert applies an independent $L$-layer 3D GNN (e.g., SchNet, DimeNet++, PaiNN) to $G_{c_e}$ with its own parameters.
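Under this cutoff definition, the edge sets for increasing radii are nested, which is what makes the construction a filtration. A minimal sketch in pure Python, with toy coordinates and illustrative names (not the paper's implementation):

```python
from itertools import combinations
import math

def edges_within_cutoff(coords, cutoff):
    """Return the undirected edge set E_c = {(i, j) : ||x_i - x_j|| <= c}."""
    edges = set()
    for i, j in combinations(range(len(coords)), 2):
        if math.dist(coords[i], coords[j]) <= cutoff:
            edges.add((i, j))
    return edges

# Toy 3-atom point cloud (coordinates in Angstroms, purely illustrative).
coords = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (0.0, 3.2, 0.0)]
cutoffs = [2.0, 2.5, 3.0, 3.5, 4.0]
graphs = {c: edges_within_cutoff(coords, c) for c in cutoffs}

# Filtration property: edge sets are nested as the cutoff grows.
for small, large in zip(cutoffs, cutoffs[1:]):
    assert graphs[small] <= graphs[large]
```

Each expert operates on one of these nested graphs; a shorter cutoff sees only the covalent skeleton, while a longer one also sees non-bonded contacts.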
The message-passing at expert $e$, layer $l$ proceeds as:
- For atom $i$ with neighbor set $\mathcal{N}_{c}(i) = \{j : (i, j) \in E_{c}\}$, compute edge messages $m_{ij}^{(l)} = \phi_m\big(h_i^{(l)}, h_j^{(l)}, e_{ij}\big)$,
where $e_{ij}$ encodes relative geometry (e.g., the interatomic distance $\|x_i - x_j\|$).
- Aggregate messages and update features: $h_i^{(l+1)} = \phi_u\big(h_i^{(l)}, \sum_{j \in \mathcal{N}_{c}(i)} m_{ij}^{(l)}\big)$.
After $L$ layers, a READOUT step yields the expert embedding $z_e = \mathrm{READOUT}\big(\{h_i^{(L)}\}_{i=1}^{N}\big)$.
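The per-expert loop can be sketched as follows. Scalar features and a fixed exponential distance envelope stand in for the learned message and update functions, and mean pooling stands in for READOUT; these are simplifications for illustration, not the SchNet/DimeNet++/PaiNN layers the paper actually uses:

```python
import math

def expert_forward(coords, features, cutoff, num_layers=2):
    """Toy cutoff expert: distance-enveloped message passing + mean readout."""
    n = len(coords)
    h = [float(f) for f in features]
    # Neighborhood induced by this expert's cutoff radius.
    neigh = {i: [j for j in range(n)
                 if j != i and math.dist(coords[i], coords[j]) <= cutoff]
             for i in range(n)}
    for _ in range(num_layers):
        # Residual update: each atom accumulates distance-weighted messages.
        h = [h[i] + sum(h[j] * math.exp(-math.dist(coords[i], coords[j]))
                        for j in neigh[i])
             for i in range(n)]
    return sum(h) / n  # mean READOUT -> scalar expert embedding z_e
```

With a cutoff below the shortest interatomic distance, no messages flow and the readout is just the mean of the initial features, which is the degenerate limit of the filtration.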
Short-range ($2.0$ Å) experts capture covalent structure; intermediate cutoffs ($2.5$–$3.0$ Å) address hydrogen bonds and steric constraints; larger cutoffs ($3.5$–$4.0$ Å) encompass extended, diffuse interactions.
2. Topological Gating and Routing
Expert activation in MI-MoE is governed by a filtration-based topological encoder that inspects the conformer's multiscale structure. A dense sequence of radii $r_1 < \dots < r_T$ discretizes a fixed radius interval. At each scale $r_t$, five normalized descriptors are computed:
- Randić index
- Wiener index
- Global efficiency
- Betti curves from persistent homology ($\beta_0$ and $\beta_1$)
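The graph-level descriptors at a single scale can be computed directly from the edge set. The sketch below uses hop distances and the identity $\beta_1 = |E| - |V| + \beta_0$ for the cycle rank of a graph; values are left unnormalized, and the paper's exact normalization is not reproduced here:

```python
from itertools import combinations

def hop_distances(n, edges):
    """All-pairs BFS hop distances; float('inf') for disconnected pairs."""
    adj = {i: set() for i in range(n)}
    for i, j in edges:
        adj[i].add(j)
        adj[j].add(i)
    dist = {}
    for s in range(n):
        d = {s: 0}
        frontier = [s]
        while frontier:
            nxt = []
            for u in frontier:
                for v in adj[u]:
                    if v not in d:
                        d[v] = d[u] + 1
                        nxt.append(v)
            frontier = nxt
        for t in range(n):
            dist[(s, t)] = d.get(t, float("inf"))
    return dist

def scale_descriptors(n, edges):
    """Randic index, Wiener index, global efficiency, beta_0, beta_1
    of one filtration graph (unnormalized, illustrative)."""
    deg = [0] * n
    for i, j in edges:
        deg[i] += 1
        deg[j] += 1
    randic = sum((deg[i] * deg[j]) ** -0.5 for i, j in edges)
    d = hop_distances(n, edges)
    pairs = list(combinations(range(n), 2))
    wiener = sum(d[(i, j)] for i, j in pairs if d[(i, j)] != float("inf"))
    efficiency = sum(1.0 / d[(i, j)] for i, j in pairs
                     if 0 < d[(i, j)] != float("inf")) / len(pairs)
    comps = {frozenset(t for t in range(n) if d[(s, t)] != float("inf"))
             for s in range(n)}
    beta0 = len(comps)              # connected components
    beta1 = len(edges) - n + beta0  # independent cycles (cycle rank)
    return randic, wiener, efficiency, beta0, beta1
```

Evaluating these five numbers at every radius $r_t$ traces out the Betti and index curves the gating network consumes.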
Concatenating the descriptors across all $T$ scales gives the topological trajectory of the conformer. After flattening, an MLP computes unnormalized expert scores. Sparse gating is enforced: only the top-$k$ (typically $k = 2$) experts remain active via masking and softmax, producing a sparse attention vector $\alpha$ over experts. This vector adaptively routes the molecular input to the most informative interaction scales as dictated by its topological connectivity across radii.
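Top-$k$ masking followed by a softmax over the surviving logits can be sketched as follows (function name and toy scores are illustrative):

```python
import math

def topk_sparse_gate(scores, k=2):
    """Mask all but the top-k logits, then softmax over the survivors."""
    top = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)[:k]
    exps = {e: math.exp(scores[e]) for e in top}
    z = sum(exps.values())
    # Inactive experts get exactly zero weight, so they can be skipped entirely.
    return [exps[e] / z if e in exps else 0.0 for e in range(len(scores))]

alpha = topk_sparse_gate([0.2, 1.5, -0.3, 0.9, 0.1], k=2)
```

Here only experts 1 and 3 survive the mask, and their weights renormalize to sum to one; the remaining three experts contribute nothing to the forward pass.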
3. Forward Pass and Model Architecture Integration
MI-MoE functions as a modular enhancement for any compatible 3D GNN backbone. The forward pass incorporates the following steps:
- Construct the sparse expert graphs $G_{c_e}$ and the dense filtration graphs used by the topological encoder.
- Compute the topological trajectory via the filtration descriptors.
- Obtain gating logits and apply top-$k$ masking for sparse activation.
- Compute each active expert $e$'s embedding $z_e$.
- Form the final representation as the gated sum $z = \sum_{e} \alpha_e z_e$, where $\alpha_e$ are the sparse gating weights (zero for inactive experts).
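The final mixing step reduces to a gated sum in which zero-weight experts are skipped; a minimal sketch:

```python
def mix_experts(alpha, expert_embeddings):
    """Final representation z = sum_e alpha_e * z_e over active experts."""
    dim = len(expert_embeddings[0])
    z = [0.0] * dim
    for a, emb in zip(alpha, expert_embeddings):
        if a == 0.0:
            continue  # sparse routing: inactive experts contribute nothing
        for d in range(dim):
            z[d] += a * emb[d]
    return z
```

In practice the skipped experts need not be evaluated at all, which is what keeps the multi-expert forward pass affordable.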
Each expert is a shallow GNN sharing a common hidden dimension, and the gating MLP has two hidden layers of width $256$. Despite using multiple experts, MI-MoE maintains competitive parameter counts by halving the number of layers per expert compared to standalone baselines.
4. Training Objectives and Hyperparameters
The training objective comprises the primary task loss (e.g., MSE for regression, cross-entropy for classification) and two MoE regularizers:
- Score balance ($\mathcal{L}_{\text{score}}$): encourages a uniform distribution of gating weights across experts.
- Load balance ($\mathcal{L}_{\text{load}}$): targets even expert selection frequencies across the batch.
The total loss is
$$\mathcal{L} = \mathcal{L}_{\text{task}} + \lambda_{\text{score}}\,\mathcal{L}_{\text{score}} + \lambda_{\text{load}}\,\mathcal{L}_{\text{load}},$$
with scalar coefficients $\lambda_{\text{score}}$ and $\lambda_{\text{load}}$ weighting the regularizers.
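The paper does not spell out the exact regularizer forms here; one common MoE formulation, shown below as an assumption rather than the paper's definition, penalizes the squared coefficient of variation of the per-expert mean gate weight (score balance) and of the per-expert selection frequency (load balance):

```python
def balance_losses(gate_matrix):
    """gate_matrix[b][e]: sparse gate weight of expert e for batch item b.
    Returns (score_balance, load_balance), each the squared coefficient of
    variation (variance / mean^2) of a per-expert statistic -- both are zero
    when experts are used uniformly."""
    B = len(gate_matrix)
    E = len(gate_matrix[0])

    def cv_sq(values):
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        return var / (mean ** 2) if mean > 0 else 0.0

    # Mean gate weight per expert (score balance statistic).
    mean_score = [sum(row[e] for row in gate_matrix) / B for e in range(E)]
    # Fraction of batch items that selected each expert (load balance statistic).
    load = [sum(1.0 for row in gate_matrix if row[e] > 0) / B for e in range(E)]
    return cv_sq(mean_score), cv_sq(load)
```

A perfectly balanced batch yields $(0, 0)$, while a batch that routes everything to one expert yields strictly positive penalties.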
Optimization uses AdamW with the learning rate, weight decay, batch size, and dropout rate selected on validation data. A cosine-annealing learning-rate schedule with a 10-epoch warm-up and up to 120 epochs is used, with early stopping if no performance improvement is observed for 30 epochs.
Key hyperparameters include five experts with cutoffs $\{2.0, 2.5, 3.0, 3.5, 4.0\}$ Å, a dense radii window for the filtration descriptors, an MLP gate of depth two (width $256$), and top-$k$ expert selection with $k = 2$.
5. Empirical Results and Benchmark Performance
Experimental evaluation uses MoleculeNet tasks (regression: FreeSolv, ESOL, Lipophilicity; classification: BACE, BBBP, SIDER, Tox21, ClinTox) and polymer property prediction (electron affinity, ionization energy, crystallization, refractive index). MI-MoE is benchmarked against 2D GNNs, SMILES transformers, prominent 3D GNNs (including SchNet, DimeNet++, PaiNN), and recent MoE variants.
Key findings:
- Average RMSE reduction by 0.15 and ROC-AUC boost by 8% across MoleculeNet, relative to base 3D GNNs.
- MI-MoE-SchNet halves the error on electron affinity and crystallization tasks compared to best prior geometry- or topology-aware models.
Table 1. Representative MoleculeNet Results
| Backbone | Metric | Baseline | + MI-MoE | Δ |
|---|---|---|---|---|
| SchNet | RMSE | 1.23 ± 0.15 | 1.06 ± 0.20 | –0.17 |
| SchNet | ROC-AUC | 69.7 ± 4.9% | 81.3 ± 3.9% | +11.6% |
| DimeNet++ | RMSE | 1.14 ± 0.25 | 1.04 ± 0.22 | –0.10 |
| DimeNet++ | ROC-AUC | 78.7 ± 3.3% | 80.0 ± 2.5% | +1.3% |
| PaiNN | RMSE | 1.09 ± 0.30 | 0.97 ± 0.27 | –0.12 |
| PaiNN | ROC-AUC | 72.8 ± 4.7% | 79.9 ± 4.4% | +7.1% |
Table 2. Polymer Property Prediction RMSE
| Model | Electron affinity | Ionization energy | Crystallization | Refractive index |
|---|---|---|---|---|
| Mol-TDL | 0.263 | 0.417 | 15.86 | 0.068 |
| GEM | 0.274 | 0.313 | 17.82 | 0.092 |
| MI-MoE-SchNet | 0.148 | 0.240 | 8.95 | 0.065 |
Compared to Uni-Mol and TopExpert, MI-MoE is competitive or superior, particularly at similar or reduced computational depth.
6. Ablation Studies and Architectural Choices
Ablation experiments address choices in expert cutoffs, gating network architecture, and aggregation strategies:
- Cutoff selection: Shifting the cutoff set toward shorter or longer ranges affects performance, with task-dependent preferences (e.g., BACE favors the default mid-range set; FreeSolv/Tox21 sometimes prefer an extended range).
- Gating architectures: The MLP gate consistently outperforms the Transformer-based gate on most tasks; marginal Transformer gains observed in some classification regimes.
- Expert aggregation: Using topology to select a single expert ($k = 1$) already surpasses fixed-cutoff baselines; allowing a top-$2$ sparse mixture (full MI-MoE) further enhances generality and regularization.
7. Context, Significance, and Implications
MI-MoE advances 3D molecular graph learning by specializing expert GNNs to physically meaningful interaction ranges and leveraging persistent topological signatures for adaptive routing. Its modularity enables consistent performance improvements as a drop-in component across a variety of invariant and equivariant 3D GNN architectures and molecular domains. These results demonstrate the efficacy of topology-aware, multiscale routing over conventional neighborhood heuristics. A plausible implication is that filtration-based gating and multiscale ensembles may generalize to broader applications where the relevant interaction scales are heterogeneous and data-dependent (Nguyen et al., 19 Jan 2026).