Mixture of Neuron Experts (MoNE)
- The paper demonstrates that MoNE performs gating at the neuron level, retaining only highly active neurons within each selected expert and thereby reducing the number of active parameters, while universal approximation results guarantee that such mixtures can approximate any continuous target function.
- It employs intra-expert top-k selection to enhance computational efficiency while matching or exceeding the accuracy of traditional MoE models.
- MoNE’s design enables practical model pruning and adaptive inference, offering a scalable framework for efficient neural network deployment in resource-constrained settings.
Mixture of Neuron Experts (MoNE) is a neural network architecture and theoretical framework built on the classical Mixture-of-Experts (MoE) paradigm but refined to operate at a finer granularity, whereby expert selection and activation are performed at the neuron or subnetwork level within each expert. MoNE leverages intra-expert sparsity and neuron-level gating to improve both parameter utilization and computational efficiency. Core theoretical contributions include universal approximation theorems, practical sparsification studies, and empirical demonstrations that neuron-granular mixtures match or exceed the performance of traditional MoE with lower active parameter counts (Cheng et al., 7 Oct 2025, Nguyen et al., 2016).
1. Theoretical Foundations and Universal Approximation
The foundational universal approximation theorem for MoNE establishes that for any continuous target function $f$ defined on a compact set $K$ and any desired accuracy $\varepsilon > 0$, there exists a configuration of MoNE parameters such that the MoNE mean function $m$ satisfies $\sup_{x \in K} \lVert f(x) - m(x) \rVert < \varepsilon$.
MoNE mean functions are of the form
$$m(x) = \sum_{j=1}^{J} g_j(x)\, e_j(x),$$
where the $g_j$ are gating functions (nonnegative and summing to one) and the $e_j$ are expert functions. This class is dense in $C(K)$, the space of continuous functions on the compact domain $K$ (Nguyen et al., 2016). This result is analogous to the universal approximation property for fully connected networks, but MoNE achieves approximation by soft-partitioning the input space and delegating localized approximation tasks to neuron-level experts.
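As a concrete illustration of this mean-function form, the following minimal PyTorch sketch evaluates a softmax-gated mixture of linear experts; the dimensions, parameter names, and the choice of linear experts are illustrative assumptions rather than the constructions used in the proofs:

```python
# Minimal sketch of a MoNE-style mean function m(x) = sum_j g_j(x) * e_j(x)
# with softmax gating and linear experts; all names/shapes are illustrative.
import torch

torch.manual_seed(0)
d_in, d_out, n_experts = 4, 2, 8

gate_W = torch.randn(d_in, n_experts)           # gating parameters
expert_W = torch.randn(n_experts, d_in, d_out)  # one small linear expert per index j

def mone_mean(x: torch.Tensor) -> torch.Tensor:
    """Evaluate m(x) for a batch x of shape [B, d_in]."""
    g = torch.softmax(x @ gate_W, dim=-1)         # [B, n_experts], rows sum to 1
    e = torch.einsum("bi,jio->bjo", x, expert_W)  # [B, n_experts, d_out]
    return torch.einsum("bj,bjo->bo", g, e)       # gate-weighted combination

x = torch.randn(3, d_in)
print(mone_mean(x).shape)  # torch.Size([3, 2])
```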
The key implications are:
- MoNE architectures can approximate any continuous function arbitrarily well given sufficient neurons and flexible gating.
- The modular structure, with each neuron expert focusing on local regimes, leads to potentially improved efficiency and interpretable, locally-tuned approximations.
- Extensions to multiple-output and conditional density settings also hold, with denseness guarantees for vector-valued functions using Gaussian (or softmax) gating (Nguyen et al., 2017).
2. Motivations and Sparsification Observations
Empirical analyses reveal that in standard MoE models, activated experts contain many neurons with near-zero activation, implying a significant degree of underutilization of the network's capacity. Systematic pruning of expert parameters by ranking their activation magnitudes shows that a large fraction of the parameters within the activated experts can be removed with negligible task-performance degradation; substantial performance drops occur only at much higher pruning ratios (Cheng et al., 7 Oct 2025). Visualization further confirms that most neuron activations remain near zero across a variety of MoE instantiations. A simple activation-magnitude pruning probe of this kind is sketched below.
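The following minimal PyTorch sketch reproduces this kind of sparsification probe on a single FFN expert of the assumed form $E(x)=\sigma(xW_1)W_2$; the pruning ratios, shapes, and names are illustrative and not the authors' exact protocol:

```python
# Sketch of an activation-magnitude pruning probe on one FFN expert (assumed
# form E(x) = relu(x @ W1) @ W2). Zeroing the lowest-|activation| neurons
# per token approximates the pruning study described above.
import torch

torch.manual_seed(0)
d_model, d_hidden = 64, 256
W1 = torch.randn(d_model, d_hidden) / d_model**0.5
W2 = torch.randn(d_hidden, d_model) / d_hidden**0.5

x = torch.randn(32, d_model)                 # a batch of token representations
a = torch.relu(x @ W1)                       # per-neuron activations [32, d_hidden]
full_out = a @ W2

for prune_ratio in (0.25, 0.5, 0.75):
    k_keep = int(d_hidden * (1 - prune_ratio))
    idx = a.abs().topk(k_keep, dim=-1).indices           # per-token top-|a| neurons
    mask = torch.zeros_like(a).scatter_(-1, idx, 1.0)
    pruned_out = (a * mask) @ W2
    rel_err = (pruned_out - full_out).norm() / full_out.norm()
    print(f"prune {prune_ratio:.0%}: relative output error {rel_err:.3f}")
```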
This observation motivates the MoNE methodology:
- Rather than performing expert selection only at the expert level, perform top-$k$ selection at the neuron level within each activated expert.
- This approach can halve the number of activated parameters per MoE layer while retaining full (or superior) predictive accuracy compared to traditional MoE evaluated at equivalent activated parameter budgets.
3. MoNE Architecture and Inference Mechanism
MoNE achieves neuron-granular expert selection within each activated expert by applying a top-$k$ threshold to per-neuron gating values. For an input $x$, consider an activated expert with up-projection $W_1$, down-projection $W_2$, and activation function $\sigma$, so that its output is $E(x) = \sigma(x W_1)\, W_2$. Let $a = \sigma(x W_1)$ denote the vector of per-neuron activations. The expert output can then be rewritten as a sum over neuron experts, $E(x) = \sum_{i} a_i\, W_2[i,:]$, with each $a_i$ serving as the gating value of neuron $i$. The set $\mathcal{S} = \operatorname{TopK}_i(|a_i|,\, k)$ identifies the $k$ highest-magnitude neuron activations within each expert. Only this subset is retained in the forward computation, reducing both the volume of computation and the number of active parameters.
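The decomposition $E(x)=\sum_i a_i\,W_2[i,:]$ and the effect of keeping only the top-$k$ neurons can be checked numerically with a minimal sketch; the ReLU FFN form, shapes, and names are illustrative assumptions:

```python
# Sketch of the neuron-expert decomposition E(x) = sum_i a_i * W2[i, :] and
# intra-expert top-k selection (assumed FFN form; names are illustrative).
import torch

torch.manual_seed(0)
d_model, d_hidden, k = 16, 64, 16

W1 = torch.randn(d_model, d_hidden) / d_model**0.5
W2 = torch.randn(d_hidden, d_model) / d_hidden**0.5
x = torch.randn(d_model)

a = torch.relu(x @ W1)                         # per-neuron gating values a_i
full = a @ W2                                  # standard expert output
as_sum = sum(a[i] * W2[i] for i in range(d_hidden))
print(torch.allclose(full, as_sum, atol=1e-4))  # True: expert = sum of neuron experts

topk = a.abs().topk(k).indices                 # S: the k highest-|a_i| neurons
sparse = a[topk] @ W2[topk]                    # only |S| rows of W2 are touched
print((full - sparse).norm() / full.norm())    # error from dropping the other neurons
```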
Key properties of the MoNE approach:
- No additional router parameters or inter-expert communication are required for neuron-level selection.
- The computational overhead of the intra-expert top-$k$ operation is negligible relative to full expert computation.
4. Parameter Utilization, Efficiency, and Performance
Experiments on a variety of model and task configurations reveal several consistent findings (Cheng et al., 7 Oct 2025):
- MoNE matched or exceeded the task accuracy of traditional MoE models while activating only about half of the MoE layer's parameters.
- At matched activated parameter counts, MoNE consistently outperformed standard MoE, with relative improvements on the order of $1\%$ or more in several settings.
- Inference latency and GPU memory consumption are comparable to traditional MoE, because the top-$k$ operation for intra-expert selection is lightweight and local.
- Introducing a neuron-granular load balance loss (NG-LBL) further encourages uniform utilization of neuron experts, mitigating cases where a small group of neuron experts receives a disproportionate fraction of the gating mass.
5. Mathematical Formulations and Implementation
MoNE's formalism proceeds by decomposing an expert's output as a sum over neuron experts and applying a top-$k$ operator:
- For each selected expert, select the $k$ neurons with the highest gating values $|a_i|$.
- Only these neurons and the associated rows of $W_2$ are used to compute the output for the given input.
- No additional routing network is needed for this fine-grained selection; the selection is performed using per-sample activations (a minimal layer-level sketch follows this list).
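Putting expert-level routing and intra-expert neuron selection together, a MoNE-style layer might look like the following minimal sketch; the module names, the softmax router, the ReLU FFN expert form, and all hyperparameters are illustrative assumptions rather than the paper's implementation:

```python
# Minimal MoNE-style layer sketch: standard top-k expert routing, plus
# intra-expert top-k neuron selection with no extra router parameters.
import torch
import torch.nn as nn

class MoNELayer(nn.Module):
    def __init__(self, d_model=64, d_hidden=256, n_experts=8, k_experts=2, k_neurons=128):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.W1 = nn.Parameter(torch.randn(n_experts, d_model, d_hidden) / d_model**0.5)
        self.W2 = nn.Parameter(torch.randn(n_experts, d_hidden, d_model) / d_hidden**0.5)
        self.k_experts, self.k_neurons = k_experts, k_neurons

    def forward(self, x):                                    # x: [tokens, d_model]
        scores = torch.softmax(self.router(x), dim=-1)       # expert-level gating
        top_w, top_e = scores.topk(self.k_experts, dim=-1)   # [tokens, k_experts]
        out = torch.zeros_like(x)
        for slot in range(self.k_experts):
            e = top_e[:, slot]                                # expert id per token
            a = torch.relu(torch.einsum("td,tdh->th", x, self.W1[e]))  # neuron activations
            idx = a.abs().topk(self.k_neurons, dim=-1).indices          # intra-expert top-k
            mask = torch.zeros_like(a).scatter_(-1, idx, 1.0)
            y = torch.einsum("th,thd->td", a * mask, self.W2[e])
            out = out + top_w[:, slot:slot + 1] * y
        return out

layer = MoNELayer()
print(layer(torch.randn(4, 64)).shape)   # torch.Size([4, 64])
```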
An additional neuron-granular load balance loss is introduced,
$$\mathcal{L}_{\text{NG-LBL}} = \sum_{e}\sum_{i} f_{e,i}\, P_{e,i},$$
where $f_{e,i}$ is the fraction of tokens routed to neuron $i$ in expert $e$ and $P_{e,i}$ is the average gating value assigned to that neuron. This loss promotes balanced neuron utilization throughout training.
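A minimal sketch of such a neuron-granular balance term, assuming the product form above; the scaling coefficient `alpha`, the normalization by the number of neurons, and all names are illustrative:

```python
# Sketch of a neuron-granular load-balance term of the assumed form
# L = alpha * n_neurons * sum_{e,i} f_{e,i} * P_{e,i}; names are illustrative.
import torch

def ng_load_balance_loss(gates: torch.Tensor, k: int, alpha: float = 1e-2) -> torch.Tensor:
    """gates: [tokens, experts, neurons] non-negative per-neuron gating values
    for the experts activated for each token (zeros for unselected experts)."""
    n_tokens, n_experts, n_neurons = gates.shape
    # f[e, i]: fraction of tokens whose intra-expert top-k includes neuron i of expert e
    topk_idx = gates.topk(k, dim=-1).indices                      # [tokens, experts, k]
    selected = torch.zeros_like(gates).scatter_(-1, topk_idx, 1.0)
    f = selected.mean(dim=0)                                      # [experts, neurons]
    # P[e, i]: average normalized gating value assigned to neuron i of expert e
    probs = gates / gates.sum(dim=-1, keepdim=True).clamp_min(1e-9)
    P = probs.mean(dim=0)                                         # [experts, neurons]
    return alpha * n_neurons * (f * P).sum()

gates = torch.relu(torch.randn(32, 4, 64))   # toy per-neuron gating values
print(ng_load_balance_loss(gates, k=16))
```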
6. Practical Implications and Future Directions
MoNE extends the efficiency and scalability advantages of MoE to a finer granularity of computation. By selecting only neuron-level subexperts with high activation, MoNE
- Substantially increases the effective utilization of activated parameters per token.
- Achieves considerable reduction in the number of active parameters and computation without degrading performance.
- Suggests strategies for structured model pruning, compression, and adaptive computation at inference, applicable in resource-constrained environments.
Theoretical results motivate further research into adaptive algorithms for expert/neuron selection and exploration of how such granular mixtures can be extended to more complex or hierarchical structures (e.g., mixtures of subnets or layers) (Nguyen et al., 2016). Open questions include the optimal criteria for intra-expert neuron selection beyond top-$k$ magnitude and the interaction between neuron-granular gating and global model calibration.
7. Context Within the Landscape of MoE Architectures
MoNE is situated as a refinement of modern mixture-of-experts models that seeks to address the inefficiencies of classic expert-level selection by leveraging empirically observed sparsity at the neuron level within experts. This is distinct from prior approaches that target expert-level routing, hierarchical mixtures, or parameter sharing, and instead exploits the activation structure within each expert for dynamic neuron-wise gating. The approach is supported by both theoretical density results and rigorous empirical benchmarking, demonstrating its practical viability for large-scale deployment of sparse neural architectures.
In summary, Mixture of Neuron Experts (MoNE) delivers a theoretically grounded, practically validated framework for improving parameter and computational efficiency in MoE-like neural architectures by leveraging neuron-level sparsity and selection. Its principal innovation lies in intra-expert, neuron-granular mixture modeling, enabling greater utilization of neural capacity and robust performance at lower activation budgets (Cheng et al., 7 Oct 2025, Nguyen et al., 2016).