
Multiscale Interaction Mixture of Experts

Updated 26 January 2026
  • MI-MoE is a topology-aware framework that replaces fixed neighborhood definitions with a multiscale ensemble of distance-cutoff experts for adaptive molecular modeling.
  • It leverages multiple interaction ranges—from covalent bonds to long-range packing—with a topological gating mechanism that dynamically routes inputs based on persistent descriptors.
  • Empirical evaluations demonstrate that MI-MoE significantly improves regression and classification performance on standard benchmarks, such as MoleculeNet and polymer property prediction tasks.

The Multiscale Interaction Mixture of Experts (MI-MoE) is a topology-aware framework designed for efficient and adaptive modeling of spatial interactions in 3D molecular graph neural networks (GNNs). The central innovation of MI-MoE is to replace rigid, globally fixed neighborhood definitions with an ensemble of distance-cutoff experts routed by topological descriptors capturing multiscale connectivity. MI-MoE explicitly targets the diverse length scales at which critical molecular phenomena—such as non-covalent interactions, stereoelectronic effects, and long-range packing forces—manifest, and addresses the challenge that no single geometric cutoff suffices for all tasks. It consistently improves leading 3D GNN backbones in regression and classification tasks across molecular and polymer property benchmarks by adaptively modulating the interaction budget using topological summaries of the input conformer (Nguyen et al., 19 Jan 2026).

1. Distance-Cutoff Expert Formulation

MI-MoE abandons the conventional single-cutoff scheme in GNN-based molecular modeling by employing an ensemble of $K$ interaction experts, each corresponding to a physically motivated distance radius. Given a molecular point cloud $\mathcal{P}=(\mathcal{V},\mathbf{R})$, where $\mathcal{V}$ indexes $n$ atoms and $\mathbf{R}\in\mathbb{R}^{n\times3}$ are atomic coordinates, the interaction graph for cutoff $r$ is

\mathcal{G}^{(r)} = (\mathcal{V},\; \mathcal{E}^{(r)},\; \mathbf{R}), \quad \mathcal{E}^{(r)} = \{(i, j): \|\mathbf{r}_i - \mathbf{r}_j\| \le r\}.

A filtration $\mathcal{G}^{(r_1)} \subseteq \cdots \subseteq \mathcal{G}^{(r_M)}$ emerges as $r$ increases. MI-MoE selects $K=5$ cutoffs $\{2.0,\,2.5,\,3.0,\,3.5,\,4.0\}$ Å, ranging from covalent bonding to medium-range packing effects. Each expert $E_k$ applies an independent $L$-layer 3D GNN (e.g., SchNet, DimeNet++, PaiNN) to $\mathcal{G}^{(c_k)}$ with its own parameters.
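A minimal NumPy sketch of this cutoff-graph construction follows; the `radius_edges` helper and the toy collinear coordinates are illustrative, not from the paper:

```python
import numpy as np

def radius_edges(coords, cutoff):
    """Edge set E^(r) = {(i, j) : ||r_i - r_j|| <= r, i != j} as directed pairs."""
    diff = coords[:, None, :] - coords[None, :, :]          # pairwise displacements
    dist = np.linalg.norm(diff, axis=-1)                    # pairwise distances
    i, j = np.where((dist <= cutoff) & ~np.eye(len(coords), dtype=bool))
    return list(zip(i.tolist(), j.tolist()))

# The K = 5 cutoffs used by MI-MoE (in Angstroms).
cutoffs = [2.0, 2.5, 3.0, 3.5, 4.0]

# Toy 3-atom chain: neighbors 1.5 A apart, ends 3.0 A apart.
coords = np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [3.0, 0.0, 0.0]])
graphs = {c: radius_edges(coords, c) for c in cutoffs}

# Edge sets grow monotonically with r, forming the filtration described above.
assert all(set(graphs[cutoffs[k]]) <= set(graphs[cutoffs[k + 1]])
           for k in range(len(cutoffs) - 1))
```

At the $2.0$ Å cutoff only the two bonded pairs appear; at $3.0$ Å the end atoms also become neighbors, illustrating how each expert sees a different interaction graph.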

The message passing at expert $k$, layer $\ell$ proceeds as follows:

  • For $i\in\mathcal{V}$ with each neighbor $j\in \mathcal{N}_{c_k}(i)$, compute edge messages

\mathbf{m}_{ij}^{(\ell)} = \mathrm{MSG}^{(\ell)}\bigl(\mathbf{h}_i^{(\ell-1)}, \mathbf{h}_j^{(\ell-1)}, \mathbf{x}_i^{(\ell-1)}, \mathbf{x}_j^{(\ell-1)}, \mathbf{e}_{ij}\bigr)

where $\mathbf{e}_{ij}$ encodes relative geometry.

  • Aggregate messages and update features:

mi(â„“)=AGG(â„“){mij(â„“)}\mathbf{m}_i^{(\ell)} = \mathrm{AGG}^{(\ell)}\left\{\mathbf{m}_{ij}^{(\ell)}\right\}

(\mathbf{h}_i^{(\ell)},\,\mathbf{x}_i^{(\ell)}) = \mathrm{UPD}^{(\ell)}(\mathbf{h}_i^{(\ell-1)},\,\mathbf{x}_i^{(\ell-1)},\,\mathbf{m}_i^{(\ell)})

After $L$ layers, a READOUT step yields the graph embedding $\mathbf{h}_{\mathcal{G}^{(c_k)}}$.

Short-range ($2.0$ Å) experts capture covalent structure; intermediate cutoffs ($2.5$–$3.0$ Å) address hydrogen bonds and steric constraints; larger cutoffs ($3.5$–$4.0$ Å) encompass extended, diffuse interactions.
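The per-expert message-passing loop above can be sketched in NumPy; this is a drastically simplified stand-in (plain matrix products with sum aggregation, and no geometric edge features $\mathbf{e}_{ij}$), not the actual SchNet/DimeNet++/PaiNN operators:

```python
import numpy as np

rng = np.random.default_rng(0)

def message_passing_layer(h, edges, W_msg, W_upd):
    """One simplified layer: m_ij = tanh((h_i || h_j) W_msg), sum-aggregated,
    followed by an update of node features. Coordinates x_i are omitted."""
    m = np.zeros_like(h)
    for i, j in edges:                           # MSG + AGG (sum over neighbors)
        m[i] += np.tanh(np.concatenate([h[i], h[j]]) @ W_msg)
    return np.tanh(np.concatenate([h, m], axis=1) @ W_upd)   # UPD

d = 8                                            # toy hidden dimension
h = rng.normal(size=(3, d))                      # 3 atoms
edges = [(0, 1), (1, 0), (1, 2), (2, 1)]         # a short-range interaction graph
W_msg = rng.normal(size=(2 * d, d)) * 0.1
W_upd = rng.normal(size=(2 * d, d)) * 0.1

for _ in range(3):                               # L = 3 layers per expert
    h = message_passing_layer(h, edges, W_msg, W_upd)

readout = h.mean(axis=0)                         # permutation-invariant READOUT
```

Each expert runs this loop on its own cutoff graph with its own weights, producing one embedding per interaction scale.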

2. Topological Gating and Routing

Expert activation in MI-MoE is governed by a filtration-based topological encoder that inspects the conformer's multiscale structure. A dense sequence of radii $\{r_t\}_{t=1}^T$ discretizes the interval $[c_1 - w/2,\; c_K + w/2]$ (typically $T\approx 20$, $w=1.0$ Å, $\Delta r=0.25$ Å). At each scale $\mathcal{G}^{(r_t)}$, five normalized topological descriptors are computed.

Concatenating these descriptors across scales gives the topological trajectory $X_{\mathrm{topo}}\in \mathbb{R}^{T\times5}$. After flattening, an MLP computes unnormalized expert scores $\boldsymbol{\alpha}^{\mathrm{raw}} = f_{\mathrm{topo}}(X_{\mathrm{topo}})\in \mathbb{R}^{K}$. Sparse gating is enforced: only the top-$k$ (typically $k=2$) experts remain active via masking and softmax, producing a sparse attention vector $\boldsymbol{\alpha}=(\alpha_1,\dots,\alpha_K)$. This vector adaptively routes the molecular input to the most informative interaction scales, as dictated by its topological connectivity across radii.
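The masking-plus-softmax step can be sketched as below; `topk_gate` is a hypothetical helper, and the descriptor MLP feeding it is omitted since the five descriptors are not enumerated here:

```python
import numpy as np

def topk_gate(raw_scores, k=2):
    """Zero out all but the top-k raw expert scores, then softmax the survivors,
    yielding a sparse mixture vector alpha that sums to one."""
    raw = np.asarray(raw_scores, dtype=float)
    alpha = np.zeros_like(raw)
    top = np.argsort(raw)[-k:]                   # indices of the k largest scores
    z = np.exp(raw[top] - raw[top].max())        # numerically stable softmax
    alpha[top] = z / z.sum()
    return alpha

# Hypothetical raw scores for K = 5 experts from the gating MLP.
alpha = topk_gate([0.1, 2.0, -1.0, 1.5, 0.3], k=2)
```

Only two of the five experts receive nonzero weight, so only two expert GNNs need to be evaluated at full strength for this input.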

3. Forward Pass and Model Architecture Integration

MI-MoE functions as a modular enhancement for any compatible 3D GNN backbone. The forward pass incorporates the following steps:

  1. Construct sparse graphs $\{\mathcal{G}^{(c_k)}\}_{k=1}^K$ and dense graphs $\{\mathcal{G}^{(r_t)}\}_{t=1}^T$.
  2. Compute $X_{\mathrm{topo}}$ via filtration descriptors.
  3. Obtain gating logits and apply top-$k$ masking for sparse activation.
  4. Compute each expert $E_k$'s embedding $\mathbf{h}_{\mathcal{G}^{(c_k)}}$.
  5. Form the final representation:

\mathbf{h} = \sum_{k=1}^K \alpha_k\,\mathbf{h}_{\mathcal{G}^{(c_k)}}

Experts are typically $L=3$ layers deep (hidden dimension $d=128$), and the gating MLP has two hidden layers of width $256$. Despite using multiple experts, MI-MoE maintains competitive parameter counts by halving the number of layers per expert relative to standalone baselines.
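The final mixture (step 5) reduces to a sparse weighted sum of expert embeddings; a sketch with placeholder embeddings and a fixed top-2 gate vector:

```python
import numpy as np

K, d = 5, 128                                             # experts, hidden dim
rng = np.random.default_rng(1)
expert_embeddings = rng.normal(size=(K, d))               # stand-ins for h_{G^(c_k)}
alpha = np.array([0.0, 0.6, 0.0, 0.4, 0.0])               # sparse top-2 gate output

h = alpha @ expert_embeddings                             # h = sum_k alpha_k h_k
```

Because three of the five weights are exactly zero, the corresponding experts contribute nothing, which is what keeps the effective interaction budget adaptive per input.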

4. Training Objectives and Hyperparameters

The training objective comprises the primary task loss $\mathcal{L}_{\mathrm{task}}$ (e.g., MSE for regression, cross-entropy for classification) and two MoE regularizers:

  • Score balance ($\mathcal{L}_{\mathrm{score}}$): encourages a uniform distribution of gating weights across experts.
  • Load balance ($\mathcal{L}_{\mathrm{load}}$): targets even expert selection frequencies across the batch.

The total loss is

\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{task}} + \lambda \left( \mathcal{L}_{\mathrm{score}} + \mathcal{L}_{\mathrm{load}} \right)

with $\lambda=1.0$.
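A sketch of the combined objective. The exact forms of $\mathcal{L}_{\mathrm{score}}$ and $\mathcal{L}_{\mathrm{load}}$ are not spelled out here, so the squared-coefficient-of-variation penalties below (a common choice in sparse-MoE training) are an assumption:

```python
import numpy as np

def moe_regularizers(gate_weights):
    """gate_weights: (batch, K) sparse mixture weights.
    Hypothetical forms (not specified in the source):
    - score balance: squared CV of the mean gate weight per expert
    - load balance:  squared CV of each expert's selection frequency."""
    importance = gate_weights.mean(axis=0)          # mean gate weight per expert
    load = (gate_weights > 0).mean(axis=0)          # how often each expert fires
    cv2 = lambda x: (x.std() / (x.mean() + 1e-8)) ** 2
    return cv2(importance), cv2(load)

# Toy batch of 3 molecules, K = 5 experts, top-2 gating per row.
gates = np.array([[0.7, 0.3, 0.0, 0.0, 0.0],
                  [0.0, 0.5, 0.5, 0.0, 0.0],
                  [0.0, 0.0, 0.0, 0.6, 0.4]])
l_score, l_load = moe_regularizers(gates)

task_loss = 0.42                                    # placeholder for L_task
total = task_loss + 1.0 * (l_score + l_load)        # lambda = 1.0
```

Both penalties vanish only when every expert receives equal average weight and is selected equally often, which discourages the gate from collapsing onto a single cutoff.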

Optimization uses AdamW with learning rates in $\{10^{-4}, 10^{-3}\}$ (selected on validation), weight decay $10^{-5}$, batch sizes in $\{32, 64\}$, and dropout in $\{0, 0.5\}$. A cosine-annealing learning-rate schedule with a 10-epoch warm-up and up to 120 epochs is used, with early stopping if no validation improvement is observed for 30 epochs.

Key hyperparameters include five experts with cutoffs $\{2.0, 2.5, 3.0, 3.5, 4.0\}$ Å, dense-radii window $w=1.0$ Å, $\Delta r=0.25$ Å, a two-layer gating MLP of width $256$, and top-$k=2$ selected experts.

5. Empirical Results and Benchmark Performance

Experimental evaluation uses MoleculeNet tasks (regression: FreeSolv, ESOL, Lipophilicity; classification: BACE, BBBP, SIDER, Tox21, ClinTox) and polymer property prediction (electron affinity $E_{ea}$, ionization energy $E_i$, crystallization $X_c$, refractive index $\eta_c$). MI-MoE is benchmarked against 2D GNNs, SMILES transformers, prominent 3D GNNs (including SchNet, DimeNet++, PaiNN), and recent MoE variants.

Key findings:

  • Average RMSE reduction of $\sim 0.15$ and ROC-AUC improvement of $\sim 8\%$ across MoleculeNet, relative to base 3D GNNs.
  • MI-MoE-SchNet halves the error on electron affinity and crystallization tasks compared to best prior geometry- or topology-aware models.

Table 1. Representative MoleculeNet Results

Backbone    Metric    Baseline       + MI-MoE       Δ
SchNet      RMSE      1.23 ± 0.15    1.06 ± 0.20    −0.17
SchNet      ROC-AUC   69.7 ± 4.9%    81.3 ± 3.9%    +11.6%
DimeNet++   RMSE      1.14 ± 0.25    1.04 ± 0.22    −0.10
DimeNet++   ROC-AUC   78.7 ± 3.3%    80.0 ± 2.5%    +1.3%
PaiNN       RMSE      1.09 ± 0.30    0.97 ± 0.27    −0.12
PaiNN       ROC-AUC   72.8 ± 4.7%    79.9 ± 4.4%    +7.1%

Table 2. Polymer Property Prediction RMSE

Model           $E_{ea}$   $E_i$   $X_c$   $\eta_c$
Mol-TDL         0.263      0.417   15.86   0.068
GEM             0.274      0.313   17.82   0.092
MI-MoE-SchNet   0.148      0.240   8.95    0.065

Compared to Uni-Mol and TopExpert, MI-MoE is competitive or superior, particularly at similar or reduced computational depth.

6. Ablation Studies and Architectural Choices

Ablation experiments address choices in expert cutoffs, gating network architecture, and aggregation strategies:

  • Cutoff selection: Varying the cutoff sets ($\{2,3,4,5,6\}$ Å and $\{2,4,6,8,10\}$ Å) affects performance, with task-dependent preferences (e.g., BACE favors the default mid-range set; FreeSolv and Tox21 sometimes prefer the extended range).
  • Gating architectures: The MLP gate consistently outperforms a Transformer-based gate on most tasks, with marginal Transformer gains in some classification regimes.
  • Expert aggregation: Using topology to select a single expert ($k=1$) already surpasses fixed-cutoff baselines; allowing a top-2 sparse mixture (full MI-MoE) further enhances generality and regularization.

7. Context, Significance, and Implications

MI-MoE advances 3D molecular graph learning by specializing expert GNNs to physically meaningful interaction ranges and leveraging persistent topological signatures for adaptive routing. Its modularity enables consistent performance improvements as a drop-in component across a variety of invariant and equivariant 3D GNN architectures and molecular domains. These results demonstrate the efficacy of topology-aware, multiscale routing over conventional neighborhood heuristics. A plausible implication is that filtration-based gating and multiscale ensembles may generalize to broader applications where the relevant interaction scales are heterogeneous and data-dependent (Nguyen et al., 19 Jan 2026).
