Neural Operator Experts

Updated 9 May 2026

Neural Operator Experts are specialized machine learning architectures that partition infinite-dimensional operator learning among dynamically selected subnetworks, enhancing efficiency in solving complex PDEs.
They employ mixture-of-experts frameworks using soft and hard gating strategies to allocate tasks based on spatial decomposition, physics regimes, and multiscale features.
These models deliver scalability and interpretability, achieving significant speedups and universal approximation capabilities in high-dimensional scientific computing.

Neural operator experts constitute a class of machine learning architectures in which a set of specialized operator-learning subnetworks (experts) are dynamically selected, combined, or otherwise orchestrated to learn rich infinite-dimensional mappings between function spaces, particularly solutions to parametric partial differential equations (PDEs). These architectures use gating mechanisms—either hard or soft, spatial, temporal, or context-dependent—to route information or blend outputs from various experts, allowing the model to partition complex operator solution spaces and to address heterogeneity, nonstationarity, domain-decomposition, PDE-type diversity, and multi-resolution features with increased accuracy, interpretability, and efficiency. The neural operator expert paradigm underlies advances in foundational and multitask operator learning, mixture-of-experts (MoE) architectures, local-global/multiscale surrogates, physics-informed surrogates, and rigorous universal approximation frameworks for operator learning across scientific computing, geophysics, climate modeling, and more.

1. Conceptual Foundations and Architectures

Neural operator experts generalize the vanilla neural operator paradigm, in which a single network learns a map $\mathcal{G}: \mathcal{U} \to \mathcal{V}$ between function spaces, by distributing this task over a collection of dynamically or statically selected subnetworks, each an “expert” neural operator, possibly of distinct architecture or kernel. The most common realization places these experts in a mixture-of-experts (MoE) framework, where a gating or routing network assigns a weight or routing decision to one or more experts for each input sample, each spatial location, or each physical regime. The expert outputs are then combined, often as a weighted sum or via a winner-take-all (hard routing) policy.

Two broad paradigms are prevalent:

Parallel soft-MoE: The output at each location or context is a convex (partition-of-unity) combination of the experts, with softmax-normalized gating (Deighan et al., 6 Feb 2025, Tripura et al., 2023, Sharma et al., 2024).
Tree/hierarchical/hard-MoE: Inputs are recursively routed down a (possibly deep) hierarchy of gating modules to a specialized leaf expert (Kratsios et al., 2024).

Experts may be designed to specialize along different axes:

Spatial decomposition (domain regions, near-well vs. far-field in reservoir models) (Deighan et al., 6 Feb 2025, Du et al., 2024, Sharma et al., 2024);
Physics or PDE regimes (different equation type/expert per operator or regime) (Wang et al., 29 Oct 2025, Gopakumar et al., 26 Feb 2026);
Wavelet/frequency/multiscale decomposition (local wavelet experts, scale-wise experts) (Tripura et al., 2023, Etienam et al., 2024, Etienam et al., 2024);
Boundary condition or domain geometry specialization (Deighan et al., 6 Feb 2025, Sharma et al., 2024);
Time-step scale (in multi-stride temporal models) (Pan et al., 14 Apr 2026).

2. Mathematical Formulations and Universal Approximation

The general MoE neural operator can be written as

$\mathcal{P}(u)(x) = \sum_{i=1}^K g_i(x) O_i(u)(x)$

where $\{O_i\}$ are expert neural operator modules and $g_i(x)$ are spatially (or contextually) distributed gating weights, constrained such that $g_i(x) \ge 0, \sum_{i} g_i(x) = 1$ almost everywhere. For hierarchical hard-MoE schemes, input routing is via tree search to the most appropriate expert (Kratsios et al., 2024).
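The soft-MoE formula above can be sketched on a 1D grid as follows. This is a minimal illustration, not any cited paper's implementation: the three stand-in experts (identity, finite-difference derivative, crude antiderivative), the Gaussian-shaped gating logits, and the `centres` values are all assumptions chosen only to make the partition-of-unity constraint visible.

```python
import numpy as np

def softmax(logits, axis=0):
    """Numerically stable softmax along the given axis."""
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# Grid on [0, 1] and a sample input function u(x).
x = np.linspace(0.0, 1.0, 64)
u = np.sin(2 * np.pi * x)

# Hypothetical stand-in experts O_i (real experts would be neural operators).
experts = [
    lambda v: v,                               # identity
    lambda v: np.gradient(v, x),               # finite-difference derivative
    lambda v: np.cumsum(v) * (x[1] - x[0]),    # crude antiderivative
]

# Hypothetical gating logits: each expert is preferred near a different centre.
centres = np.array([0.2, 0.5, 0.8])
logits = -((x[None, :] - centres[:, None]) ** 2) / 0.02   # shape (K, n)
g = softmax(logits, axis=0)                                # g_i(x) >= 0, sum_i g_i(x) = 1

expert_out = np.stack([O(u) for O in experts])             # O_i(u)(x), shape (K, n)
P_u = (g * expert_out).sum(axis=0)                         # P(u)(x) = sum_i g_i(x) O_i(u)(x)
```

Because the gates form a convex combination at every grid point, `P_u` always lies between the smallest and largest expert output at that point.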

Mixture-of-experts operator architectures soften the curse of dimensionality by decomposing the approximation task across many local, lower-capacity experts, each handling a (typically Lipschitz) subregion or submanifold of the function space. The distributed universal approximation theorem states that any uniformly continuous nonlinear operator $G^+$ between $L^2$ spaces can be uniformly approximated over Sobolev balls by a mixture of neural operators (MoNO), with each expert’s parameter count scaling as $O(\varepsilon^{-1})$ for accuracy $\varepsilon$, at the expense of a number of experts that grows exponentially in the dimension and the inverse error (Kratsios et al., 2024). This construction keeps each expert module computationally tractable and deployable, addressing the memory bottlenecks associated with monolithic global operator learners.

3. Gating Mechanisms and Expert Specialization

Gating, determining how and where experts are activated, underpins expert specialization and efficiency:

Softmax gating: A neural network (often a small MLP or CNN) computes location-dependent logits, which are normalized to convex weights (Sharma et al., 2024, Tripura et al., 2023, Deighan et al., 6 Feb 2025).
Hard gating (tree): Routing proceeds hierarchically by a nearest-neighbor criterion in embedding space; only one expert is evaluated per sample (Kratsios et al., 2024).
Label-conditioned gating: For multitask PDE learning, gating may receive an explicit PDE-type label as input to optimize expert selection (Tripura et al., 2023).
Multi-stride/time-step routing: For multi-scale temporal prediction, gating is based on the log-scale of the stride (Pan et al., 14 Apr 2026).
Domain mask gating: Characteristic or distance masks handle physical boundaries or domain decompositions (Deighan et al., 6 Feb 2025).
Memory-based gating: Semantic memory banks store and recall gating weights per previously seen physics or PDE, supporting continual learning (Tripura et al., 2023).
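As one illustration of the domain mask item in the list above, the sketch below builds a hard characteristic mask and a smooth distance-based mask for a channel with walls at $y=0$ and $y=1$. The layer thickness `delta`, the blend width `width`, and the sigmoid blending are assumptions made for this example rather than details taken from the cited work.

```python
import numpy as np

# Wall-normal coordinate for a channel with walls at y = 0 and y = 1.
y = np.linspace(0.0, 1.0, 101)
dist = np.minimum(y, 1.0 - y)            # distance to the nearest wall

# Hypothetical near-wall layer thickness and blending width.
delta, width = 0.1, 0.02

# Hard (characteristic) mask: 1 inside the near-wall layer, 0 in the core.
hard_wall = (dist < delta).astype(float)
hard_core = 1.0 - hard_wall

# Soft distance mask: a sigmoid in the wall distance, close to 1 near the
# walls and close to 0 in the core; the two gates sum to one everywhere.
g_wall = 1.0 / (1.0 + np.exp((dist - delta) / width))
g_core = 1.0 - g_wall
gates = np.stack([g_wall, g_core])       # shape (2, n): near-wall expert, core expert
```

The soft variant avoids a discontinuity in the blended output at the interface between the two subdomains, at the price of evaluating both experts in the transition zone.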

Training often encourages sparsity, either by regularizing the gating entropy or by top-K activation, both for memory and computational efficiency and for interpretability, since routings tend to cluster by PDE type or physical regime (Wang et al., 29 Oct 2025, Sharma et al., 2024).
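The two sparsity devices mentioned above can be sketched as follows. This is a generic illustration under stated assumptions: the helper names `top_k_gates` and `gate_entropy` are hypothetical, and the random logits stand in for the output of a gating network.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax; entries equal to -inf receive weight 0."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def top_k_gates(logits, k):
    """Keep the k largest logits per row, renormalize them, and zero the rest."""
    idx = np.argpartition(logits, -k, axis=-1)[..., -k:]
    masked = np.full_like(logits, -np.inf)
    kept = np.take_along_axis(logits, idx, axis=-1)
    np.put_along_axis(masked, idx, kept, axis=-1)
    return softmax(masked, axis=-1)

def gate_entropy(gates, eps=1e-12):
    """Mean Shannon entropy of the gate distributions (a possible penalty term)."""
    return float(-(gates * np.log(gates + eps)).sum(axis=-1).mean())

rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 8))         # 5 samples, 8 experts (stand-in values)
dense = softmax(logits)                  # every expert receives some weight
sparse = top_k_gates(logits, k=2)        # exactly 2 active experts per sample
```

Adding `gate_entropy(dense)` to the training loss with a small coefficient pushes the gates toward confident, low-entropy routings, while `top_k_gates` enforces sparsity exactly: with `k=2` the entropy of each row is at most $\log 2$.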

4. Key Architectures and Representative Applications

Mixture-of-Experts Operator Transformer (MoE-POT)

MoE-POT (Wang et al., 29 Oct 2025) adopts a sparse MoE Transformer architecture: at each layer, a router selects 4 experts (among 16 routed plus 2 shared) via softmax and top-K gating. Each expert is a convolutional subnetwork; two shared experts span universal features, while routed experts handle PDE-type specific behaviors. MoE-POT is pre-trained on multi-PDE datasets, achieving 40% lower zero-shot errors at equal activated parameter counts compared to dense transformer baselines, and enables accurate dataset-type inference from routing vectors.
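A layer with shared and routed experts can be sketched as below. This is not the MoE-POT implementation: the experts here are plain linear maps standing in for convolutional subnetworks, the weights are random, and the sketch assumes that the two shared experts are always active while the router picks the top 4 of the 16 routed experts, with the softmax taken over the selected logits only.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_routed, n_shared, k = 8, 16, 2, 4   # feature width; 16 routed + 2 shared; top-4

# Stand-in experts: linear maps (assumption; the source describes conv subnetworks).
W_routed = rng.normal(size=(n_routed, d, d)) / np.sqrt(d)
W_shared = rng.normal(size=(n_shared, d, d)) / np.sqrt(d)
W_router = rng.normal(size=(d, n_routed))

def moe_layer(h):
    """h: (tokens, d). Returns the layer output and the selected expert indices."""
    logits = h @ W_router                              # (tokens, n_routed)
    top = np.argsort(logits, axis=-1)[:, -k:]          # indices of the k largest logits
    sel = np.take_along_axis(logits, top, axis=-1)     # (tokens, k)
    w = np.exp(sel - sel.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)                 # softmax over the selected k

    out = np.einsum("td,sde->te", h, W_shared)         # shared experts: always applied
    for t in range(h.shape[0]):                        # routed experts: only k evaluated
        for j in range(k):
            out[t] += w[t, j] * (h[t] @ W_routed[top[t, j]])
    return out, top

tokens = rng.normal(size=(5, d))
out, chosen = moe_layer(tokens)
```

The rows of `chosen` are the kind of routing vectors that could be inspected after training to see whether inputs from the same PDE family are sent to the same experts.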

Neural Combinatorial Wavelet Neural Operator (NCWNO)

NCWNO (Tripura et al., 2023) uses wavelet-domain local experts; each layer applies an integral operator realized as a convex ensemble of wavelet-kernel experts. Gating is spatially and task-dependent, with expert kernel parameters frozen during continual adaptation. This architecture is robust against catastrophic forgetting and enables rapid foundation-to-specialist transfer across multiple PDE regimes, with empirical results outperforming single-task FNO/DeepONet by one to two orders of magnitude in error.
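The idea of adapting to a new task while expert kernels stay frozen can be illustrated with a toy problem in which only the gating logits are trained. Everything here is an assumption for illustration: the experts are random matrices rather than wavelet kernels, the gate is a single global softmax rather than a spatially varying one, and plain gradient descent with a hand-derived gradient replaces a real optimizer.

```python
import numpy as np

rng = np.random.default_rng(0)
K, d = 3, 6
experts = rng.normal(size=(K, d, d))     # frozen expert kernels (stand-ins)
frozen_copy = experts.copy()

# Hypothetical new task: the target operator is a convex mix of the frozen experts.
true_w = np.array([0.7, 0.2, 0.1])
U = rng.normal(size=(32, d))                           # input samples
Y = np.einsum("k,nd,kde->ne", true_w, U, experts)      # target outputs

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def loss_and_grad(logits):
    """Mean squared error of the gated mixture and its gradient w.r.t. the logits."""
    w = softmax(logits)
    E = np.einsum("nd,kde->kne", U, experts)           # per-expert outputs, (K, n, d)
    resid = np.tensordot(w, E, axes=1) - Y             # (n, d)
    loss = 0.5 * np.mean(np.sum(resid ** 2, axis=1))
    dL_dw = np.einsum("kne,ne->k", E, resid) / U.shape[0]
    grad = w * (dL_dw - np.dot(w, dL_dw))              # chain rule through the softmax
    return loss, grad

logits = np.zeros(K)                     # start from uniform gates
losses = []
for _ in range(1000):
    loss, grad = loss_and_grad(logits)
    losses.append(loss)
    logits -= 0.02 * grad                # only the gate is updated; experts untouched
```

Since the expert arrays are never written to, whatever they encoded for earlier tasks is preserved by construction, which is the intuition behind freezing them during continual adaptation.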

Hierarchical MoE and MoNO

Kratsios et al. (2024) rigorously establish distributed universal approximation for operator learning. The MoNO architecture organizes experts as the leaves of a v-ary tree, with recursive MLP-based routing. Each expert approximates the operator on a local Sobolev ball, so the curse-of-dimensionality scaling shifts from the depth, width, and rank of a single network to the number of experts.
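Hard routing down a tree to a single leaf expert can be sketched as follows. This is a simplified illustration, not the MoNO construction: it assumes a binary tree ($v=2$), nearest-centroid routing in place of learned routing modules, a hypothetical two-number embedding `embed`, and scalar-multiplication experts as placeholders for neural operators.

```python
import numpy as np

class Node:
    """Tree node: internal nodes hold children, leaves hold an expert callable."""
    def __init__(self, centroid, children=None, expert=None):
        self.centroid = np.asarray(centroid, dtype=float)
        self.children = children or []
        self.expert = expert

def route(node, z):
    """Descend to a leaf, always following the child whose centroid is nearest to z."""
    depth = 0
    while node.children:
        dists = [np.linalg.norm(z - c.centroid) for c in node.children]
        node = node.children[int(np.argmin(dists))]
        depth += 1
    return node, depth

def make_leaf(centroid, scale):
    # Placeholder expert: multiplies the input function by a constant.
    return Node(centroid, expert=lambda u, s=scale: s * u)

left = Node([-1.0, 0.0], children=[make_leaf([-1.0, -1.0], 1.0),
                                   make_leaf([-1.0, 1.0], 2.0)])
right = Node([1.0, 0.0], children=[make_leaf([1.0, -1.0], 3.0),
                                   make_leaf([1.0, 1.0], 4.0)])
root = Node([0.0, 0.0], children=[left, right])

def embed(u):
    """Hypothetical embedding of a sampled function: its mean and end-to-end change."""
    return np.array([u.mean(), u[-1] - u[0]])

def hard_moe(u):
    """Evaluate only the single leaf expert selected by tree routing."""
    leaf, _ = route(root, embed(u))
    return leaf.expert(u)

u = np.linspace(0.5, 1.5, 11)            # mean 1.0, increasing
v = hard_moe(u)                          # routed to the leaf with scale 4.0
```

Only the routing path and one leaf are touched per input, so the cost of a forward pass grows with the tree depth rather than with the total number of experts.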

Multi-Stride MoE Temporal Neural Operators

Ms-MoE-IFactFormer (Pan et al., 14 Apr 2026) combines an implicit factorized Transformer backbone with stride-dependent routed experts (plus a shared module), gating by log-scale of time-step. This construction achieves stable, fine-time-step, long-horizon turbulence prediction, maintaining long-time statistical fidelity and physical realism even at temporal resolutions up to 20$\times$ finer than prior baselines.
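Gating on the log-scale of the time step, combined with a shared module, might look like the sketch below. The functional form is an assumption: the gate here is a softmax over squared distances in log-stride, and `expert_strides`, `tau`, and the toy experts are hypothetical placeholders, not the router described by Pan et al.

```python
import numpy as np

# Hypothetical "home" strides, one per routed expert.
expert_strides = np.array([1.0, 4.0, 16.0])

def stride_gates(stride, tau=0.5):
    """Soft gates from the distance between log(stride) and each expert's log-stride."""
    logits = -((np.log(stride) - np.log(expert_strides)) ** 2) / tau
    e = np.exp(logits - logits.max())
    return e / e.sum()

def step(u, stride, experts, shared):
    """One prediction step: shared module plus gate-weighted routed experts."""
    g = stride_gates(stride)
    return shared(u) + sum(g_i * f(u) for g_i, f in zip(g, experts))

# Toy stand-ins: each routed expert scales the state; the shared module is the identity.
experts = [lambda u, c=c: c * u for c in (0.1, 0.2, 0.3)]
shared = lambda u: u

u_next = step(np.ones(4), stride=4.0, experts=experts, shared=shared)
```

Working in log-stride means that strides 1, 4, and 16 are equally spaced from the router's point of view, so a stride of 4 sits symmetrically between its two neighbours.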

DeepONet-based MoE and Ensemble Operators

Spatial MoE DeepONets (via partition-of-unity, or PoU, gating) and trunk-ensemble DeepONets (Sharma et al., 2024) combine local trunk networks using fixed or learnable spatial weights, achieving 2–4$\times$ lower test error on 2D/3D PDEs versus single-trunk or POD-DeepONet. These methods enable adaptive treatment of sharp local gradients, domain inhomogeneities, or mixed boundary conditions.
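A forward pass of a DeepONet whose trunk is a partition-of-unity blend of local trunks can be sketched as below. The sizes, random weights, single-layer `tanh` networks, and the compactly supported bump functions in `pou_weights` are all assumptions for illustration; they are not the architecture or weights of Sharma et al.

```python
import numpy as np

rng = np.random.default_rng(0)
m, p, K = 20, 5, 3                       # sensors, latent width, number of local trunks

# Random single-layer stand-ins for the branch and the K local trunk networks.
W_branch = rng.normal(size=(m, p))
W_trunk = rng.normal(size=(K, 1, p))
b_trunk = rng.normal(size=(K, p))
centres = np.linspace(0.0, 1.0, K)       # patch centres on the domain [0, 1]

def branch(u_sensors):
    """Encode the input function sampled at m sensors: (m,) -> (p,)."""
    return np.tanh(u_sensors @ W_branch)

def local_trunk(k, y):
    """k-th local trunk evaluated at query points y: (n,) -> (n, p)."""
    return np.tanh(y[:, None] * W_trunk[k] + b_trunk[k])

def pou_weights(y, radius=0.6):
    """Compactly supported bumps, normalized to sum to one at every query point."""
    r = np.abs(y[:, None] - centres[None, :]) / radius
    bumps = np.maximum(0.0, 1.0 - r) ** 2
    return bumps / bumps.sum(axis=1, keepdims=True)    # (n, K)

def pou_deeponet(u_sensors, y):
    """G(u)(y) = sum_j b_j(u) * sum_k w_k(y) t^k_j(y)."""
    b = branch(u_sensors)                                           # (p,)
    w = pou_weights(y)                                              # (n, K)
    T = np.stack([local_trunk(k, y) for k in range(K)], axis=1)     # (n, K, p)
    trunk = np.einsum("nk,nkp->np", w, T)                           # blended trunk
    return trunk @ b                                                # (n,)

y = np.linspace(0.0, 1.0, 50)
u_sensors = np.sin(np.linspace(0.0, np.pi, m))
out = pou_deeponet(u_sensors, y)
```

With `radius=0.6` and centres at 0, 0.5, and 1, every point of the domain is covered by at least one bump, so the normalization is well defined; a learnable variant would replace the fixed bumps with a gating network.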

Physics-Informed and Domain-Decomposition MoE

In practical geoscience and reservoir workflows, PINO-CCR surrogates (Etienam et al., 2024, Etienam et al., 2024) use a physics-informed neural operator (PINO) for global field solutions, together with a cluster-classify-regress (CCR) MoE architecture that corrects well rates via local domain experts trained on the Peaceman model. In turbulent flow, domain-decomposed MoE neural operators assign distinct experts to near-wall and core flow (Deighan et al., 6 Feb 2025).

5. Practical Implementation and Scaling

Operator expert architectures raise several practical considerations:

The number and capacity of experts—greater numbers yield error reductions (up to hardware/memory limits), but per-epoch time scales linearly with expert count (Tripura et al., 2023, Wang et al., 29 Oct 2025).
Gating network architecture and regularization (balancing specificity vs. redundancy).
Training protocols, such as freezing expert parameters in continual learning and supporting zero-shot transfer via semantic memory (Tripura et al., 2023).
Efficiency: sparse MoE routing allows large model capacity at modest inference cost, since only a selected subset of experts is activated per example (Wang et al., 29 Oct 2025).
Interpretability: routers can be analyzed to recover dataset/PDE classification, expert specialization, and even support OOD detection (Wang et al., 29 Oct 2025, Deighan et al., 6 Feb 2025).
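The capacity-versus-cost point in the list above reduces to simple arithmetic, sketched here. The per-expert parameter count is a made-up figure, and the helper `moe_param_counts` assumes that shared experts are always active and that the top-K selection applies to routed experts only.

```python
def moe_param_counts(n_routed, n_shared, top_k, params_per_expert, router_params=0):
    """Return (total, active) parameter counts for one sparse MoE layer.

    Assumes equally sized experts, always-active shared experts, and
    top_k routed experts evaluated per input.
    """
    total = (n_routed + n_shared) * params_per_expert + router_params
    active = (top_k + n_shared) * params_per_expert + router_params
    return total, active

# Hypothetical layer: 16 routed + 2 shared experts, top-4 routing,
# one million parameters per expert.
total, active = moe_param_counts(16, 2, 4, 1_000_000)
fraction = active / total                # share of parameters touched per input
```

Under these assumptions a third of the layer's parameters are used per input, which is why comparisons against dense baselines are usually made at equal activated, rather than total, parameter counts.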

Empirical studies show that these architectures offer orders-of-magnitude speedup versus classical PDE solvers, efficient continual adaptation, robustness to catastrophic forgetting, and physically interpretable decomposition of solution feature space.

6. Theoretical and Algorithmic Implications

Mixture-of-experts neural operator architectures provide a rigorous path to easing the curse of dimensionality in high-accuracy operator learning:

Each local expert can be trained independently, enabling data-parallelism and modularization.
Hard or soft partitioning aligns with natural decompositions of the function space, e.g. by physical regime, spatial subdomain, or scale.
Expert interpretability affords physical validation (e.g., learned kernels and gates can be compared to analytical stencils or physical heuristics) (Gopakumar et al., 26 Feb 2026, Tripura et al., 2023).
Memory and compute load can be distributed: at inference, only the active expert(s) are resident; gating depth can be matched to hardware constraints (Kratsios et al., 2024).

Limitations include the exponential scaling of expert count with ambient dimension and inverse error (in worst-case theory), which forces a trade-off between the size of each expert and the number of experts. Practical architectures use sparsity and context-specific routers to mediate this trade-off.

7. Emerging Directions and Open Challenges

Neural operator expert frameworks are rapidly evolving:

Physics-informed and equation-constrained gating, as seen in operator-splitting-based MoE frameworks for modular PDE learning (Gopakumar et al., 26 Feb 2026);
Foundation operator models with memory-based gating and semantic memory for task recall and transfer (Tripura et al., 2023, Bacho et al., 25 Nov 2025);
Hybrid schemes integrating signal processing bias (Hilbert- or wavelet-based experts), multi-dimensional analytic transforms, and modular combination with classical numerical stencils (Pordanesh et al., 6 Aug 2025);
Adaptive, online expert addition/removal, and dynamic resource allocation;
Extending MoE operator learning to unstructured domains, non-Euclidean geometries, irregular boundary conditions, and parametric model selection.

Challenges remain in optimally allocating expert capacity, handling OOD generalization, quantifying uncertainty (especially in Bayesian expert models), and providing a unified theory for dynamic, data-adaptive expert routing.

References

"Mixture of Experts Soften the Curse of Dimensionality in Operator Learning" (Kratsios et al., 2024)
"A foundational neural operator that continuously learns without forgetting" (Tripura et al., 2023)
"Mixture-of-Experts Operator Transformer for Large-Scale PDE Pre-Training" (Wang et al., 29 Oct 2025)
"Stable Fine-Time-Step Long-Horizon Turbulence Prediction with a Multi-Stepsize Mixture-of-Experts Neural Operator" (Pan et al., 14 Apr 2026)
"Ensemble and Mixture-of-Experts DeepONets For Operator Learning" (Sharma et al., 2024)
"Mixture of neural operator experts for learning boundary conditions and model selection" (Deighan et al., 6 Feb 2025)
"Learning Physical Operators using Neural Operators" (Gopakumar et al., 26 Feb 2026)
"Reservoir History Matching of the Norne field with generative exotic priors and a coupled Mixture of Experts -- Physics Informed Neural Operator Forward Model" (Etienam et al., 2024)
"Hilbert Neural Operator: Operator Learning in the Analytic Signal Domain" (Pordanesh et al., 6 Aug 2025)