
Universal Approximation Theorem for MoEs

Updated 5 December 2025
  • MoEs are flexible neural network architectures that use gating functions to combine specialized experts for adaptive, localized function approximation.
  • The universal approximation theorem for MoEs guarantees that any continuous function or density can be approximated arbitrarily well using a sufficient number of experts and proper gate designs.
  • Extensions of the theorem cover multivariate outputs, operator learning, and mixed-effects models, highlighting practical implications for scalable and efficient high-dimensional modeling.

A mixture of experts (MoE) model is a flexible neural network architecture in which multiple specialized sub-models ("experts") contribute to the final output via gating functions dependent on the input, allowing modular, data-adaptive partitioning. The universal approximation theorem for MoEs formalizes the capacity of these models to approximate broad classes of functions and probability distributions to arbitrary accuracy, extending classical approximation results to architectures structured by both gating and local expert components.

1. Mathematical Formulation and General Setup

For a compact domain $\mathcal{X} \subset \mathbb{R}^d$, a standard MoE mean function takes the form

$$g(x; \Theta) = \sum_{k=1}^{K} \pi_k(x; W, b)\, h_k(x; \theta_k)$$

where $\pi_k(x; W, b)$ is a gating function (typically softmax-linear or Gaussian radial basis) and $h_k(x; \theta_k)$ is the $k$-th expert drawn from a class $\mathcal{H}$ (e.g., polynomials, neural nets, or local linear regressors). The function space of interest is $C(\mathcal{X})$, the Banach space of continuous real-valued functions over $\mathcal{X}$ under the uniform norm. In density estimation or conditional modeling, analogous MoE forms combine gating and expert densities over product domains $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$, extending the setup to Lebesgue–Bochner spaces, conditional probability spaces, or more complex mixed-effects distributions (Nguyen et al., 2016, Nguyen et al., 2020, Fung et al., 2022).
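
As a concrete, deliberately minimal illustration of this form, the NumPy sketch below evaluates a softmax-gated MoE mean with affine experts; all names (`moe_mean`, `gate_W`, `expert_W`, and so on) are illustrative choices, not notation from the cited papers.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    return np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)

def moe_mean(x, gate_W, gate_b, expert_W, expert_b):
    """Evaluate g(x) = sum_k pi_k(x) h_k(x) for a batch of inputs.

    x        : (N, d) inputs
    gate_W   : (K, d), gate_b : (K,)    -- softmax-linear gating parameters
    expert_W : (K, d), expert_b : (K,)  -- affine (local linear) experts
    """
    pi = softmax(x @ gate_W.T + gate_b)   # (N, K) gating weights, sum to 1 per input
    h = x @ expert_W.T + expert_b         # (N, K) expert outputs h_k(x)
    return (pi * h).sum(axis=-1)          # (N,)  mixture mean g(x)

# Example: K = 3 experts on inputs in R^2.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=(5, 2))
g = moe_mean(x, rng.normal(size=(3, 2)), np.zeros(3),
             rng.normal(size=(3, 2)), np.zeros(3))
print(g.shape)  # (5,)
```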

2. Universal Approximation Theorems for MoE Mean and Density Functions

The central result asserts that for any $f \in C(\mathcal{X})$ and $\varepsilon > 0$, there exist a number of experts $K$ and parameters $\Theta$ such that

$$\|f - g\|_\infty < \varepsilon$$

provided two requirements hold:

  • The gating functions $\pi_k$ can approximate any partition of unity over $\mathcal{X}$.
  • The expert family $\mathcal{H}$ is itself universal in $C(\mathcal{X})$.

This result is agnostic to expert type; for example, polynomials, single-layer sigmoidal networks, or local regressors all suffice. The proof exploits classical partition-of-unity arguments, showing that the gating network can approximate any indicator or bump function, while the experts locally approximate target values (Nguyen et al., 2016). No upper bound on $K$ is required a priori; the required capacity grows as $\varepsilon \to 0$.
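
The partition-of-unity requirement admits a simple numerical check (an illustrative construction of ours, not taken from the cited papers): softmax-linear gates with increasingly sharp logits converge to the indicator functions of a nearest-center partition of $[0,1]$, up to the unavoidable smoothing at cell boundaries.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    return np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)

# K equally spaced centers on [0, 1]; each gate should "own" one nearest-center cell.
K = 5
centers = (np.arange(K) + 0.5) / K
x = np.linspace(0, 1, 1001)[:, None]          # (N, 1) evaluation grid

for sharpness in (10.0, 100.0, 1000.0):
    # Softmax-linear logits a_k * x + b_k chosen so that, up to a term shared by
    # all k, the logit equals -sharpness * (x - c_k)^2 / 2 (Gaussian gating in disguise).
    a = sharpness * centers
    b = -sharpness * centers ** 2 / 2
    pi = softmax(x * a + b)                   # (N, K) gating weights

    # Indicator of each center's nearest-neighbour cell (the target partition of unity).
    hard = (np.abs(x - centers).argmin(axis=1)[:, None] == np.arange(K)).astype(float)
    print(f"sharpness={sharpness:7.1f}   mean |gate - indicator| = "
          f"{np.abs(pi - hard).mean():.4f}")
```

As the sharpness grows, the average deviation from the hard partition shrinks, with residual error concentrated near the cell boundaries.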

For density functions and conditional modeling, a similar theorem establishes that, for a target conditional PDF $f(y \mid x) \in C(\mathcal{Z})$ and a base expert PDF $\psi \in C(\mathbb{R}^q)$, finite MoE mixtures

$$m(y \mid x) = \sum_{k=1}^{K} \mathrm{Gate}_k(x; \gamma)\, \mathrm{Expert}_k(y; \beta_k)$$

are dense in $L^p(\mathcal{Z})$ for $1 \leq p < \infty$, with almost-uniform convergence in the univariate input case (Nguyen et al., 2020). Gating-density and expert-density lemmas confirm that softmax and Gaussian gating architectures equivalently achieve universality.
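
A minimal sketch of such a conditional density model, assuming softmax-linear gates in $x$ and univariate Gaussian experts in $y$ (parameter names are illustrative):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    return np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)

def gaussian_pdf(y, mu, sigma):
    return np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def moe_conditional_density(y, x, gate_W, gate_b, mu, sigma):
    """m(y|x) = sum_k Gate_k(x) * N(y; mu_k, sigma_k^2).

    y : (N,) responses, x : (N, d) covariates
    gate_W : (K, d), gate_b : (K,)  -- softmax-linear gating
    mu, sigma : (K,)                -- per-expert Gaussian parameters
    """
    gates = softmax(x @ gate_W.T + gate_b)          # (N, K)
    experts = gaussian_pdf(y[:, None], mu, sigma)   # (N, K) expert densities at y
    # Each m(.|x) integrates to 1: the gates sum to 1 and each expert is a PDF.
    return (gates * experts).sum(axis=-1)           # (N,) density values

rng = np.random.default_rng(1)
x = rng.normal(size=(4, 2))
y = rng.normal(size=4)
dens = moe_conditional_density(y, x, rng.normal(size=(3, 2)), np.zeros(3),
                               np.array([-1.0, 0.0, 1.0]), np.ones(3))
print(dens)
```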

3. Extensions to Multivariate Outputs, Operator Learning, and Mixed-Effects Models

The universality property extends to multiple-output MoLE models (Nguyen et al., 2017), operator-valued MoEs (Kratsios et al., 13 Apr 2024), and mixed-effects models for multilevel or hierarchical data (Fung et al., 2022):

  • For multivariate outputs: vector-valued MoE mean functions are dense in $C_q(\mathcal{X})$, the space of continuous $\mathbb{R}^q$-valued functions on $\mathcal{X}$, while density models approximate joint conditionals via closure under multiplication and addition. The proofs adapt the marginal approximation arguments coordinate-wise and use combinatorial closure lemmas.
  • In operator learning: mixtures of neural operators (MoNOs) achieve uniform approximation of nonlinear Lipschitz operators $G^{+}\colon K \subset L^2([0,1]^d) \to L^2([0,1]^{d'})$ on Sobolev balls, with per-expert network complexity scaling as $O(\varepsilon^{-1})$, while the number of experts absorbs the curse of dimensionality. Routing is implemented via a hierarchical $k$-means tree in which each leaf holds a local neural-operator expert (Kratsios et al., 13 Apr 2024); a routing sketch follows this list.
  • For mixed-effects regression: The mixed MoE (MMoE) models, with gating dependent on both observed features and random effects, are dense in the space of continuous mixed-effects models under weak convergence metrics. The denseness result accommodates arbitrary dependency structures via nested softmax routers under minimal technical assumptions, leveraging expert denseness and softmax partitioning in both the observed and latent-effect spaces (Fung et al., 2022).
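
The hierarchical $k$-means routing mentioned in the operator-learning item above can be sketched in a few lines. Here plain finite-dimensional vectors stand in for discretized $L^2$ inputs, and the tree depth, leaf size, and helper names are illustrative choices rather than the construction of Kratsios et al.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

def kmeans(points, k, iters=25):
    """Plain Lloyd's algorithm; returns (centroids, labels)."""
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        labels = np.linalg.norm(points[:, None] - centroids[None], axis=-1).argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = points[labels == j].mean(axis=0)
    return centroids, labels

def build_tree(points, expert_ids, branching=2, leaf_size=8):
    """Hierarchical k-means tree: internal nodes store centroids,
    leaves store the id of the local expert responsible for that cell."""
    if len(points) <= leaf_size:
        return {"leaf": True, "expert": next(expert_ids)}
    centroids, labels = kmeans(points, branching)
    if min((labels == j).sum() for j in range(branching)) == 0:
        # Degenerate split; fall back to a leaf to keep the sketch simple.
        return {"leaf": True, "expert": next(expert_ids)}
    children = [build_tree(points[labels == j], expert_ids, branching, leaf_size)
                for j in range(branching)]
    return {"leaf": False, "centroids": centroids, "children": children}

def route(tree, x):
    """Descend the tree by nearest centroid; return the selected leaf expert's id."""
    while not tree["leaf"]:
        j = np.linalg.norm(tree["centroids"] - x, axis=-1).argmin()
        tree = tree["children"][j]
    return tree["expert"]

# Toy demo: R^16 vectors stand in for discretized input functions.
data = rng.normal(size=(200, 16))
tree = build_tree(data, itertools.count())
print("routed to expert", route(tree, rng.normal(size=16)))
```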

4. Key Proof Ingredients and Architectural Implications

Classical proofs for MoE universal approximation deploy several critical steps:

  • Partitioning the input (or latent-factor) space into fine covers on which the target function or density is nearly constant.
  • Constructing partitions of unity that the gating network (softmax or Gaussian gates) can approximate closely.
  • Approximating the target locally with experts drawn from a class that is universal in the relevant function space.
  • Assembling the global function or density by summing over gates and experts, using triangle inequalities to bound the total error.
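
A toy end-to-end run of this recipe on $[0,1]$ (entirely an illustrative construction): Gaussian-style gates over equally spaced centers act as an approximate partition of unity, constant experts pinned at the cell centers supply the local approximations, and the sup-norm error of the assembled mixture shrinks as the number of experts grows.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    return np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)

f = lambda t: np.sin(2 * np.pi * t) + 0.5 * t      # target in C([0, 1])
x = np.linspace(0, 1, 2001)[:, None]

for K in (4, 16, 64):
    centers = (np.arange(K) + 0.5) / K
    # Gates: sharpened Gaussian-style logits approximating a partition of unity.
    sharpness = 50.0 * K ** 2
    pi = softmax(-sharpness * (x - centers) ** 2)   # (N, K)
    # Experts: constants equal to the target's value at each cell center,
    # the simplest member of a universal local family.
    h = f(centers)                                  # (K,)
    g = (pi * h).sum(axis=-1)                       # assembled MoE mean
    print(f"K={K:3d}  sup |f - g| = {np.abs(f(x[:, 0]) - g).max():.4f}")
```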

These steps extend naturally to operator learning, multilevel random effect models, and multivariate outputs via combinatorial closure properties (addition/multiplication) and tree or nested routing mechanisms.

5. Quantitative Rates, Sample Complexity, and VC-Dimension Bounds

For MoEs with (P)ReLU MLP experts, quantitative approximation bounds are achieved:

  • For any Lipschitz $f\colon [0,1]^n \to \mathbb{R}$ and $\varepsilon > 0$, the MoE model achieves uniform approximation with $L = O(\varepsilon^{-1})$ experts, each an MLP of depth $J = O(\varepsilon^{-1})$ and width $W = O(1)$; importantly, only one expert is activated at inference, loading $O(\varepsilon^{-1})$ parameters and thus mitigating the canonical $O(\varepsilon^{-n/2})$ ReLU-network complexity (Kratsios et al., 5 Feb 2024); a hard-routing sketch follows this list.
  • The VC-dimension of such MoMLP architectures is $O\!\left(L \log^2 L \, \max\{\, nL \log L,\; JW^2 \log(JW) \,\}\right)$, supporting PAC generalization theory for MoEs (Kratsios et al., 5 Feb 2024).
  • In MoNO-based operator learning, the complexity per expert grows polynomially (e.g., $O(\varepsilon^{-1})$), while the number of experts $\Lambda$ scales as $O(\varepsilon^{-d/2})$, transferring the curse of dimensionality from expert size to routing complexity.
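
A minimal sketch of top-1 ("hard") routing, illustrating why only one expert's parameters need to be touched at inference; the sizes, router, and helper names below are illustrative and not the construction of Kratsios et al.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def mlp(x, layers):
    """Apply a ReLU MLP given as a list of (W, b) pairs."""
    for W, b in layers[:-1]:
        x = relu(x @ W + b)
    W, b = layers[-1]
    return x @ W + b

def hard_routed_moe(x, router_W, router_b, experts):
    """Top-1 routed MoE: only the selected expert's parameters are evaluated."""
    k = int(np.argmax(x @ router_W + router_b))   # cheap routing decision
    return mlp(x, experts[k]), k                  # one expert used, L-1 skipped

# Toy setup: L experts, each a small ReLU MLP on [0,1]^n (sizes are illustrative).
rng = np.random.default_rng(2)
n, L, width, depth = 4, 8, 16, 3

def random_mlp():
    dims = [n] + [width] * (depth - 1) + [1]
    return [(rng.normal(size=(a, b)) * 0.3, np.zeros(b)) for a, b in zip(dims, dims[1:])]

experts = [random_mlp() for _ in range(L)]
router_W, router_b = rng.normal(size=(n, L)), np.zeros(L)

x = rng.uniform(size=n)
y, k = hard_routed_moe(x, router_W, router_b, experts)
active = sum(W.size + b.size for W, b in experts[k])
total = sum(W.size + b.size for e in experts for W, b in e)
print(f"routed to expert {k}; active params {active} of {total} total")
```

The active parameter count scales with one expert's size, while the total parameter budget grows linearly in the number of experts.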

6. Generalizations, Relaxed Assumptions, and Comparison to Classical Theorems

The approximation guarantee generalizes over:

  • Domain shape (compact or bounded, possibly non-cube).
  • Smoothness of target (continuous functions in $C(\mathcal{X})$, weakly continuous CDFs, or conditional PDFs).
  • Gating class (softmax-linear, Gaussian radial, bump-like gates).
  • Expert architecture (polynomial, sigmoidal nets, local affine, MLPs, or neural operators).

MoE universal approximation parallels the Cybenko/Hornik results for single-layer neural nets, offering enhanced flexibility via modular data-driven partitioning and local specialization. In many cases, MoEs afford faster convergence and reduced parameterization relative to wide single-layer networks, largely due to their localized treatment of complex functions (Nguyen et al., 2016). For stochastic or mixed-effects modeling, the universality in weak convergence does not require strong moment, differentiability, or smoothness assumptions on the target, making the MoE framework broadly applicable (Fung et al., 2022).

7. Practical Considerations and Modelling Design

  • In applications, the MoE architecture enables scalable, memory-efficient models: inference loads only the relevant expert, which supports deployments with very large total parameter counts.
  • The design of the gating network (soft versus hard routing, tree-structured or hierarchical routing, logit sharpness) controls partition granularity and expressiveness.
  • For mixed-effects data, the ability to universally approximate complex dependency structures with simple softmax gating and expert families facilitates consistent estimation via EM or variational inference.
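
As a toy illustration of the estimation point, here is a compact EM loop for a two-expert mixture of linear regressions (a minimal illustrative construction; Gaussian gating over the covariate is chosen so that every M-step is closed-form).

```python
import numpy as np

rng = np.random.default_rng(3)

def normal_pdf(z, mean, var):
    return np.exp(-0.5 * (z - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

# Synthetic data from two regimes: y = 2x + noise for x < 0, y = -x + 1 + noise otherwise.
N = 400
x = rng.uniform(-2, 2, size=N)
y = np.where(x < 0, 2 * x, -x + 1) + 0.1 * rng.normal(size=N)

K = 2
# Initialise: mixing weights, Gaussian gate parameters over x, linear experts, noise variances.
alpha = np.full(K, 1 / K)
mu, tau2 = rng.choice(x, K), np.full(K, 1.0)
w, b, sigma2 = rng.normal(size=K), np.zeros(K), np.full(K, 1.0)

for it in range(100):
    # E-step: responsibility of expert k for point n under gate * expert likelihood.
    gate = alpha * normal_pdf(x[:, None], mu, tau2)             # (N, K)
    lik = normal_pdf(y[:, None], w * x[:, None] + b, sigma2)    # (N, K)
    r = gate * lik
    r /= r.sum(axis=1, keepdims=True)

    # M-step: weighted maximum-likelihood updates, all closed-form for this model.
    nk = r.sum(axis=0)
    alpha = nk / N
    mu = (r * x[:, None]).sum(axis=0) / nk
    tau2 = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    for k in range(K):                        # weighted least squares per expert
        A = np.stack([x, np.ones(N)], axis=1)
        Wr = r[:, k]
        coef, *_ = np.linalg.lstsq(A * np.sqrt(Wr)[:, None], y * np.sqrt(Wr), rcond=None)
        w[k], b[k] = coef
        sigma2[k] = (Wr * (y - w[k] * x - b[k]) ** 2).sum() / nk[k]

print("recovered slopes:", np.round(w, 2), "intercepts:", np.round(b, 2))
```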

A plausible implication is that MoEs, in both deterministic and stochastic settings, constitute a framework for modular universal function and distribution modeling, whose expressiveness is matched by practical scalability and partition-awareness. In operator learning, the curse of dimensionality is redirected into routing complexity, not per-expert capacity, suggesting efficient parameter deployment in high-dimensional settings.


References: Nguyen et al. (2016); Nguyen et al. (2017); Nguyen et al. (2020); Fung et al. (2022); Kratsios et al. (5 Feb 2024); Kratsios et al. (13 Apr 2024)
