MoE-POT: Mixture-of-Experts Operator Transformer

Updated 12 May 2026

The paper introduces a novel sparse-activated Transformer that specializes in efficiently learning solution operators for heterogeneous time-dependent PDEs.
It integrates a multi-head Fourier integral operator layer with a dynamic Mixture-of-Experts module that activates only a subset of experts per token to improve inference speed.
Empirical results demonstrate up to 40% error reduction and high interpretability, enabling robust scaling and meta-learning potential for diverse PDE tasks.

The Mixture-of-Experts Operator Transformer (MoE-POT) is a sparse-activated Transformer architecture designed for efficient and expressive large-scale pre-training of solution operators for time-dependent partial differential equations (PDEs). MoE-POT is motivated by the heterogeneity of PDE datasets and the inefficiency of conventional dense neural operators in handling diverse equation types and large parameter spaces. By introducing structured expert specialization, learned routing, and parameter-efficient design, MoE-POT advances operator learning for multiscale and heterogeneous scientific problems (Wang et al., 29 Oct 2025, Sharma et al., 2024, Gao et al., 2022).

1. Architectural Foundations

MoE-POT replaces the conventional dense feed-forward network blocks in each Transformer layer with a Mixture-of-Experts (MoE) module. Each block consists of a multi-head Fourier integral operator (FIO) layer followed by this MoE layer. The key MoE layer incorporates two classes of experts:

Routed Experts: A set of $N_r=16$ small convolutional sub-networks $\{E^{(\ell)}_{r,1}, \dots, E^{(\ell)}_{r,16}\}$, each able to adapt to equation-specific dynamics.
Shared Experts: $N_s=2$ convolutional sub-networks $\{E^{(\ell)}_{s,1}, E^{(\ell)}_{s,2}\}$, always activated, capturing universal PDE properties and enforcing global inductive biases such as conservation and symmetry.

Given post-Fourier features $z_0^\ell(x)\in \mathbb{R}^d$ at location $x$, the MoE layer's router-gating network, parametrized as a small CNN, produces routing logits $s^\ell(z_0^\ell(x)) \in \mathbb{R}^{N_r}$. Softmax normalization of these logits gives a routing distribution $p^\ell(x)$ over the routed experts, of which only the top-$K$ ($K=4$), with indices $i_1, \dots, i_K$, are activated per token (i.e., sparse MoE). The output is the sum of the activated routed experts' outputs, weighted by their routing probabilities $p^\ell_{i_k}(x)$, plus an equally weighted average of the shared experts' outputs:

$z^{\ell+1}(x) = \frac{1}{N_s} \sum_{m=1}^{N_s} E^{(\ell)}_{s,m}(z_0^\ell(x)) + \sum_{k=1}^K p^\ell_{i_k}(x)E^{(\ell)}_{r,i_k}(z_0^\ell(x))$

This structure ensures locality, adaptivity, and parameter efficiency (Wang et al., 29 Oct 2025, Sharma et al., 2024).
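The combination rule above can be sketched for a single token as follows. This is a minimal illustration, not the authors' implementation: the experts are toy callables standing in for the convolutional sub-networks, the names `moe_combine` and `scale_expert` are hypothetical, and the routing probabilities are used as written in the equation, without renormalizing over the selected experts.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def moe_combine(z, router_logits, routed_experts, shared_experts, k=4):
    """Combine expert outputs for one token feature vector z.

    routed_experts and shared_experts are lists of callables mapping a
    feature vector to a feature vector (an assumption of this sketch).
    """
    probs = softmax(router_logits)
    # Indices of the k largest routing probabilities.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    out = [0.0] * len(z)
    # Shared experts: always active, equally weighted by 1 / N_s.
    for expert in shared_experts:
        y = expert(z)
        out = [o + v / len(shared_experts) for o, v in zip(out, y)]
    # Routed experts: only the top-k, weighted by routing probability.
    for i in top:
        y = routed_experts[i](z)
        out = [o + probs[i] * v for o, v in zip(out, y)]
    return out, probs, top

def scale_expert(c):
    """Toy expert that multiplies every feature by c (hypothetical)."""
    return lambda z: [c * v for v in z]

routed = [scale_expert(float(i)) for i in range(16)]   # N_r = 16
shared = [scale_expert(1.0), scale_expert(3.0)]        # N_s = 2
logits = [0.1 * i for i in range(16)]                  # toy router output
out, probs, top = moe_combine([1.0, -2.0], logits, routed, shared, k=4)
```

With these toy logits the four experts with the largest indices are selected, and the twelve unselected experts are never evaluated, which is where the inference savings come from.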

2. Sparse Activation, Cost-Effectiveness, and Scaling

MoE-POT decouples model capacity from inference cost through controllable expert activation. At inference, only $K+N_s=6$ of the $N_r+N_s=18$ expert networks are active per token and layer, so the activated parameters are a fraction of the model's total. For example, only 90M of MoE-POT-Small's parameters are active, and MoE-POT-Medium likewise activates only a subset of its total parameters. This sparsity yields significant speed-ups (30–40% faster per step than dense DPOT models at a fixed activation budget) and allows models with much larger total parameter counts to be deployed without prohibitive computational overhead (Wang et al., 29 Oct 2025).
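The activation arithmetic can be made concrete with a short sketch. The helper names are hypothetical. The parameter-budget function assumes that every expert has the same size and that all non-expert parameters (attention, Fourier layers, router) are always active; neither assumption is confirmed by the source, and the sizes passed in are toy values, not the paper's figures.

```python
def expert_activation(n_routed=16, n_shared=2, top_k=4):
    """Return (active, total, fraction) of expert networks used per token and layer."""
    active = top_k + n_shared
    total = n_routed + n_shared
    return active, total, active / total

def parameter_budget(always_on, per_expert, n_layers,
                     n_routed=16, n_shared=2, top_k=4):
    """Total vs. activated parameter counts.

    Assumes equal-size experts and an always-active remainder `always_on`
    (simplifications made for this sketch only).
    """
    total = always_on + n_layers * (n_routed + n_shared) * per_expert
    active = always_on + n_layers * (top_k + n_shared) * per_expert
    return total, active

active, total, fraction = expert_activation()
# Hypothetical sizes chosen only to show the arithmetic.
toy_total, toy_active = parameter_budget(always_on=10, per_expert=1, n_layers=4)
```

Under these assumptions one third of the expert networks run per token, and growing the number of routed experts raises the total parameter count while leaving the activated count unchanged.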

3. Routing, Regularization, and Specialization

The router-gating network learns to select experts dynamically, focusing capacity on equations with similar dynamics or localized features. To prevent expert collapse—a pathological concentration of routing on a few experts—each layer includes a load-balancing regularization term:

$\mathcal{L}^{\ell}_{\mathrm{bal}} = \mathrm{CV}\big(I^{\ell}_1, \dots, I^{\ell}_{N_r}\big)^2, \qquad I^{\ell}_i = \sum_{x} p^{\ell}_i(x)$

where $I^{\ell}_i$ is each expert's aggregate routing mass, summed over the tokens in a batch, and $\mathrm{CV}(\cdot)$ is the coefficient of variation (standard deviation divided by mean). This encourages balanced usage across all experts and supports emergent specialization. Empirical results demonstrate that the router's top-$K$ selection enables experts to cluster around specific PDE types or solution regimes (Wang et al., 29 Oct 2025).
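A coefficient-of-variation penalty on expert routing mass can be sketched as below. Using the squared CV and the population variance follows the common importance-loss convention for sparse MoE models and is an assumption here; the function names are hypothetical.

```python
def routing_mass(probs_per_token):
    """Sum each expert's routing probability over all tokens."""
    n_experts = len(probs_per_token[0])
    return [sum(p[i] for p in probs_per_token) for i in range(n_experts)]

def cv_squared(values):
    """Squared coefficient of variation: population variance / mean^2."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return var / (mean ** 2)

def load_balance_loss(probs_per_token):
    """Penalty that is zero when all experts receive equal routing mass."""
    return cv_squared(routing_mass(probs_per_token))

# Two tokens, four experts: perfectly balanced vs. fully collapsed routing.
balanced = [[0.25, 0.25, 0.25, 0.25], [0.25, 0.25, 0.25, 0.25]]
collapsed = [[1.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0]]
loss_balanced = load_balance_loss(balanced)
loss_collapsed = load_balance_loss(collapsed)
```

The penalty vanishes for uniform usage and grows as routing concentrates on fewer experts, so minimizing it pushes the router away from expert collapse.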

4. Operator Pre-Training Methodology

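MoE-POT builds on the auto-regressive denoising operator paradigm introduced in DPOT: given the preceding solution frames $u_{t-T+1}, \dots, u_t$, the model predicts the next frame $u_{t+1}$. During pre-training:

Inputs for each time step are patchified, Fourier-feature-embedded, and temporally aggregated.
Gaussian noise $\varepsilon$ is added to the input frames for robustness.
The loss function sums squared errors over time and includes the balance regularization from all layers:

$\mathcal{L} = \sum_{t} \left\| \hat{u}_{t+1} - u_{t+1} \right\|_2^2 + \lambda \sum_{\ell} \mathcal{L}^{\ell}_{\mathrm{bal}}$

where $\hat{u}_{t+1}$ is the prediction from the noised input frames, $\mathcal{L}^{\ell}_{\mathrm{bal}}$ is the load-balancing term of layer $\ell$, and $\lambda$ is its weight.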
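The pre-training objective described above can be sketched in a few lines. This is a simplified illustration under stated assumptions: frames are flat lists of floats, patchification and Fourier embedding are omitted, the balance terms are passed in as precomputed numbers, and the weight `lam`, the noise scale `sigma`, and all function names are hypothetical.

```python
import random

def add_gaussian_noise(frames, sigma, rng):
    """Perturb every value of every input frame with N(0, sigma^2) noise."""
    return [[v + rng.gauss(0.0, sigma) for v in frame] for frame in frames]

def squared_error(pred, target):
    return sum((p - t) ** 2 for p, t in zip(pred, target))

def pretraining_loss(model, trajectory, context, balance_terms,
                     lam=0.01, sigma=0.0, rng=None):
    """Sum of squared next-frame errors plus weighted balance terms.

    model: callable mapping a list of `context` frames to the next frame.
    balance_terms: one load-balancing value per layer (precomputed here).
    """
    rng = rng or random.Random(0)
    loss = 0.0
    for t in range(context, len(trajectory)):
        window = add_gaussian_noise(trajectory[t - context:t], sigma, rng)
        loss += squared_error(model(window), trajectory[t])
    return loss + lam * sum(balance_terms)

def persistence(window):
    """Toy model that repeats the most recent frame."""
    return window[-1]

trajectory = [[0.0], [1.0], [2.0], [3.0]]
loss = pretraining_loss(persistence, trajectory, context=2,
                        balance_terms=[0.5, 0.5], lam=0.1)
```

In the toy run the persistence model is off by one unit at each of the two predicted frames, and the remaining contribution comes from the weighted balance terms.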

Pre-training uses six public time-dependent PDE datasets representing diverse dynamics: incompressible/compressible Navier–Stokes, shallow water, reaction–diffusion, and flows on irregular domains (Wang et al., 29 Oct 2025).

5. Empirical Performance and Scaling Laws

MoE-POT exhibits robust scaling properties. At fixed activation-parameter budgets, it consistently outperforms dense neural operator baselines.

On one Navier–Stokes (NS) task, MoE-POT reduces test L2RE from 0.0569 (DPOT-Medium, 122M activated) to 0.0552 (MoE-POT-Small, 90M activated).
On the CNS task, the same comparison yields a 57% relative L2RE reduction.
Across the six tasks, zero-shot error reductions reach up to 40% compared to dense baselines.

Fine-tuning further improves results, with MoE-POT-Medium achieving state-of-the-art results on five of the six tasks examined. MoE-POT models also follow a scaling law: for a given activation-parameter budget, error decreases monotonically with model size, outperforming dense models of the same or larger sizes (Wang et al., 29 Oct 2025).
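For readers who want to reproduce the arithmetic behind these comparisons, the sketch below computes a relative L2 error and the relative reduction between two reported values. It assumes the usual definition of L2RE, the L2 norm of the error divided by the L2 norm of the reference, which the text above does not spell out; the function names are hypothetical.

```python
import math

def l2re(pred, target):
    """Relative L2 error: ||pred - target||_2 / ||target||_2."""
    num = math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)))
    den = math.sqrt(sum(t ** 2 for t in target))
    return num / den

def relative_reduction(baseline, improved):
    """Fractional decrease of `improved` with respect to `baseline`."""
    return (baseline - improved) / baseline

# Reported NS values: DPOT-Medium 0.0569 vs. MoE-POT-Small 0.0552.
ns_gain = relative_reduction(0.0569, 0.0552)
```

For the NS figures quoted above this gives a relative reduction of roughly 3%, achieved with fewer activated parameters.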

6. Interpretability and Expert Specialization

A salient property of MoE-POT is the interpretability of its gating mechanism. The expert-routing decisions cluster naturally by equation type: the routing patterns from a held-out sample are sufficient to recover the originating dataset identity with over 98% accuracy, without any explicit classification objective. This emergent behavior suggests that routed experts develop specialization to distinct PDE regimes, while shared experts encode broadly applicable priors. The gating clusters become stable and generalize to out-of-distribution datasets by epoch 250 of pre-training (Wang et al., 29 Oct 2025). A plausible implication is that router-gating outputs could be used for unsupervised discovery or meta-learning over broad scientific datasets.
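One simple way to test whether routing patterns identify a dataset is a nearest-centroid classifier on expert-usage histograms, sketched below. The source does not state which classifier was used, so this is only one possible approach; the routing indices are synthetic toy values, and the dataset labels and function names are hypothetical.

```python
def routing_histogram(selected_experts, n_experts):
    """Fraction of routing decisions that went to each expert."""
    counts = [0] * n_experts
    for idx in selected_experts:
        counts[idx] += 1
    return [c / len(selected_experts) for c in counts]

def centroid(histograms):
    """Element-wise mean of a list of equal-length histograms."""
    n = len(histograms)
    return [sum(h[i] for h in histograms) / n for i in range(len(histograms[0]))]

def nearest_label(hist, centroids):
    """Label whose centroid has the smallest squared distance to hist."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: dist(hist, centroids[label]))

# Synthetic routing decisions (expert indices) for two made-up datasets.
train = {
    "dataset_a": [[0, 1, 0, 1], [0, 0, 1, 0]],
    "dataset_b": [[2, 3, 3, 2], [3, 3, 2, 3]],
}
centroids = {
    label: centroid([routing_histogram(s, 4) for s in samples])
    for label, samples in train.items()
}
predicted = nearest_label(routing_histogram([3, 2, 3, 3], 4), centroids)
```

If experts specialize by equation type, as the text reports, histograms from different datasets concentrate on different experts and even a classifier this simple separates them.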

7. Relation to MoE Operator Networks and Broader Implications

The MoE-POT design builds upon prior mixture-of-experts surrogates for operator learning, notably partition-of-unity MoE DeepONets (PoU-MoE) (Sharma et al., 2024). PoU-MoE statically partitions the domain with kernelized spatial weights, while MoE-POT generalizes this by replacing static partitions with learned, input- and context-dependent routing and by leveraging deep expert convolutional networks and self-attention. Related parameter-efficient MoE architectures for language modeling employ matrix product operator (MPO) factorization and parameter sharing among experts for further compression (Gao et al., 2022). However, current MoE-POT implementations do not yet incorporate such tensor factorization, instead leveraging convolutional experts and token-wise gating.

MoE-POT enables state-of-the-art operator surrogates with unprecedented parameter counts and inference efficiency, while introducing mechanisms for specialization, interpretability, and potential meta-reasoning over heterogeneous PDEs. Limitations include the lack of theoretical guarantees on router mechanism optimality and challenges in extending the approach to continuous-parameter PDE families or parameter-conditioned operator learning. Future research directions include theoretical analysis of router specialization, integration with MPO-based factorization, parameter-conditioned transformers, and automated pre-training curriculum design informed by gating statistics (Wang et al., 29 Oct 2025, Sharma et al., 2024, Gao et al., 2022).