Mixture of Experts (MoE) Fusion
- Mixture of Experts (MoE) fusion is a technique that leverages adaptive gating to dynamically combine outputs from specialized models based on input characteristics.
- It utilizes both token-level and macro-level routing to achieve expert specialization, computational sparsity, and enhanced prediction accuracy.
- Recent extensions incorporate grouped and dynamic routing strategies that preserve information and improve hardware efficiency and uncertainty quantification.
Mixture of Experts (MoE) fusion is a flexible, modular strategy for combining outputs from multiple specialized submodels (“experts”). The core property of MoE fusion is its use of a learned or algorithmic gating mechanism to select or weight experts for each input, often on a per-sample, per-token, or per-modality basis. MoE fusion has evolved from classical neural MoE layers to sophisticated multi-stage, macro-scale, and hybrid formulations underpinning contemporary large-scale and multimodal AI systems. This article details the principles, mathematical foundations, and representative architectures of MoE fusion, with an emphasis on recent advances spanning neural, multimodal, macro-agent, and hardware-optimized contexts, as exemplified in state-of-the-art research.
1. Mathematical Foundations of Mixture-of-Experts Fusion
The canonical MoE fusion computes a weighted sum of expert outputs using data-dependent gating:

$$y(x) = \sum_{i=1}^{N} g_i(x)\, f_i(x),$$

where $f_i(x)$ is the output of the $i$-th expert (typically a neural subnetwork or MLP), and $g_i(x)$, with $g_i(x) \geq 0$ and $\sum_{i=1}^{N} g_i(x) = 1$, is a routing (gating) weight generated by a parametric router (often a softmax over linear projections of $x$). In sparsely-gated variants, only the top-$k$ routing weights are nonzero for efficiency (Shu et al., 17 Nov 2025):
- Router computation: $g(x) = \mathrm{softmax}(W_r x)$.
Top-$k$ selection is then applied to sparsify $g(x)$.
- Expert evaluation and fusion: $y(x) = \sum_{i \in \mathrm{TopK}(g(x))} g_i(x)\, f_i(x)$,
optionally followed by a residual connection and normalization. Auxiliary regularization, such as a load-balancing loss, is commonly added to encourage uniform expert utilization:

$$\mathcal{L}_{\mathrm{balance}} = N \sum_{i=1}^{N} \bar{f}_i\, \bar{p}_i,$$

where $\bar{f}_i$ (the fraction of tokens dispatched to expert $i$) and $\bar{p}_i$ (the mean router probability assigned to expert $i$) denote expert usage statistics (Shu et al., 17 Nov 2025). A runnable sketch of this layer appears after this list.
- Cross-model macro-fusion (multi-agent or ensemble scale): MoE can occur not only within a network but across model outputs, using concatenation, learned weighting, or decision agents to combine predictions (Shu et al., 17 Nov 2025, Liu et al., 18 Nov 2025).
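To make the layer-level computation concrete, the following is a minimal, self-contained PyTorch sketch of a sparsely-gated MoE layer with a softmax router, top-$k$ dispatch, and a Switch-style load-balancing term. All names (`SparseMoE`, the MLP expert shape, the auxiliary-loss estimator) are illustrative assumptions, not code from the cited papers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    """Sparsely-gated MoE layer: softmax router, top-k dispatch, weighted
    fusion of expert outputs, plus an auxiliary load-balancing loss.
    Illustrative sketch only."""

    def __init__(self, d_model: int, n_experts: int = 4, k: int = 2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)              # W_r
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model),
                          nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts))

    def forward(self, x: torch.Tensor):                          # x: (tokens, d_model)
        probs = F.softmax(self.router(x), dim=-1)                # dense g(x)
        topv, topi = probs.topk(self.k, dim=-1)                  # top-k sparsification
        topv = topv / topv.sum(dim=-1, keepdim=True)             # renormalize kept weights

        y = torch.zeros_like(x)
        for slot in range(self.k):                               # fuse selected experts
            for e, expert in enumerate(self.experts):
                mask = topi[:, slot] == e
                if mask.any():
                    y[mask] += topv[mask, slot:slot + 1] * expert(x[mask])

        # Load balance: N * sum_i (fraction of tokens whose top-1 expert is i)
        #               * (mean router probability assigned to expert i)
        n = probs.shape[-1]
        frac = F.one_hot(topi[:, 0], n).float().mean(dim=0)
        aux = n * (frac * probs.mean(dim=0)).sum()
        return y, aux

tokens = torch.randn(16, 64)
layer = SparseMoE(d_model=64)
out, aux_loss = layer(tokens)          # fused output plus auxiliary loss term
```

Renormalizing the kept top-$k$ weights, as done here, is one common convention; some implementations instead keep the raw softmax values so the router learns calibrated magnitudes.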
2. Micro- and Macro-scale MoE Fusion Architectures
Micro-level MoE fusion occurs within neural layers, typically replacing a feed-forward subnetwork by a sparse MoE block:
- Token-level routing: the router assigns each token representation $x_t$ to a small subset of experts, yielding token-wise specialization while maintaining parameter efficiency and computational sparsity (Shu et al., 17 Nov 2025).
- Implementation: In LLaMoE (an MoE-augmented LLaMA 3.1 8B), only the feed-forward sublayer of the final transformer block is replaced with an MoE FFN comprising 4 experts, top-2 routing, and SwiGLU activations; all other weights are frozen (Shu et al., 17 Nov 2025). A sketch of this pattern follows.
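As one concrete reading of this setup, the sketch below pairs a SwiGLU feed-forward expert with a hypothetical helper that freezes a base model and swaps a trainable MoE FFN into its final block. The attribute layout (`model.layers[-1].mlp`) and helper names are assumptions for illustration, not LLaMoE's actual code; the `SparseMoE` module from the earlier sketch could serve as the `moe_ffn` argument.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUExpert(nn.Module):
    """LLaMA-style SwiGLU feed-forward: w2(SiLU(w1 x) * w3 x)."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff, bias=False)   # gate projection
        self.w3 = nn.Linear(d_model, d_ff, bias=False)   # value projection
        self.w2 = nn.Linear(d_ff, d_model, bias=False)   # down projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

def moefy_final_block(model: nn.Module, moe_ffn: nn.Module) -> nn.Module:
    """Freeze every base weight, then replace the final block's FFN with a
    trainable MoE FFN (hypothetical attribute layout, for illustration)."""
    for p in model.parameters():
        p.requires_grad = False            # frozen backbone
    model.layers[-1].mlp = moe_ffn         # only the MoE FFN receives gradients
    return model

expert = SwiGLUExpert(d_model=512, d_ff=1024)
y = expert(torch.randn(2, 512))            # (2, 512)
```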
Macro-level MoE fusion generalizes this principle across larger architectural nodes or even whole networks:
- Multi-agent ensembles: Agents (e.g., LLaMoE, GPT-4o, DeepSeek V3) independently process the same task, and their outputs, together with the prompt, are concatenated and passed to a meta-agent (e.g., GPT-4o as decision agent), which performs the final fusion/classification (Shu et al., 17 Nov 2025).
- Hierarchical MoE: Fusion can be performed in stages—micro fusion within agents, macro fusion across agents—enabling both fine-grained and high-level specialization.
- Anisotropy-preserving concatenation: In frameworks such as that of Liu et al. (18 Nov 2025), expert outputs are projected and concatenated, preserving the disjoint geometries of heterogeneous embeddings and improving downstream nonlinear discrimination compared to scalar-mix (weighted-sum) fusion; see the sketch below.
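The contrast between scalar-mix and concatenation fusion is easy to see in code. The sketch below uses hypothetical embedding sizes, projection width, and class names; it illustrates the fusion geometry, not the cited system.

```python
import torch
import torch.nn as nn

class ConcatFusion(nn.Module):
    """Project each expert's embedding, then concatenate rather than average,
    so the downstream head sees each expert's subspace intact (illustrative)."""
    def __init__(self, expert_dims: list[int], d_proj: int, n_classes: int):
        super().__init__()
        self.projs = nn.ModuleList(nn.Linear(d, d_proj) for d in expert_dims)
        self.head = nn.Linear(d_proj * len(expert_dims), n_classes)

    def forward(self, expert_embs: list[torch.Tensor]) -> torch.Tensor:
        z = torch.cat([p(e) for p, e in zip(self.projs, expert_embs)], dim=-1)
        return self.head(z)

def scalar_mix(expert_embs: list[torch.Tensor], w: torch.Tensor) -> torch.Tensor:
    """Weighted-sum baseline: collapses all experts into one shared space
    (and requires a common embedding dimension)."""
    return sum(wi * e for wi, e in zip(w, expert_embs))

embs = [torch.randn(8, 256), torch.randn(8, 512)]  # two heterogeneous experts
logits = ConcatFusion([256, 512], d_proj=128, n_classes=3)(embs)
```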
3. Extensions: Grouped, Dynamic, and Information-Preserving Fusion
Recent MoE fusion architectures introduce specialized routing mechanisms or fusion policies to address domain-specific challenges and expert heterogeneity:
- Grouped routing (AT-MoE): Experts are organized into interpretable groups (e.g., by function, domain, or style), each with its own temperature-controlled softmax allocation, followed by within-group normalization. This ensures interpretable, layered, and controllable fusion of LoRA-based experts fine-tuned for specific sub-tasks (Li et al., 2024). Schematically, with $r^{(g)}_j$ the router logit for expert $j$ in group $g$ and $\tau_g$ the group temperature,

$$w^{(g)}_j = \frac{\exp\!\left(r^{(g)}_j/\tau_g\right)}{\sum_{j'} \exp\!\left(r^{(g)}_{j'}/\tau_g\right)},$$

and group-level weights then combine the per-group mixtures (see the sketch after this list).
- Expert merging/refreshing: To avoid redundant or underused experts, training-time fusion procedures merge high-usage experts into generalized ones and overwrite infrequently used experts, thus promoting generalization and addressing catastrophic forgetting in incremental learning (Park, 2024).
- Functional alignment across models: To harmonize disparate pre-trained experts (e.g., from different LLM checkpoints), functional alignment via activation-based neuron permutation addresses parameter misalignment, followed by a lightweight router to synthesize the multi-expert output (Wang et al., 23 Sep 2025).
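As mentioned in the grouped-routing item above, allocation proceeds in two stages. The sketch below implements one plausible reading of that scheme: a group-level softmax over pooled logits, then a per-group, temperature-controlled softmax within each group. The pooling choice and temperatures are assumptions, not AT-MoE's exact equations.

```python
import torch
import torch.nn.functional as F

def grouped_route(logits, groups, group_temp, temps):
    """Two-stage grouped routing (illustrative): softmax over groups, then a
    temperature-controlled softmax within each group; rows sum to 1."""
    # Stage 1: group-level allocation from mean-pooled group logits.
    group_logits = torch.stack([logits[:, g].mean(-1) for g in groups], dim=-1)
    alpha = F.softmax(group_logits / group_temp, dim=-1)      # (batch, n_groups)

    # Stage 2: within-group softmax with a per-group temperature.
    weights = torch.zeros_like(logits)
    for gi, g in enumerate(groups):
        w = F.softmax(logits[:, g] / temps[gi], dim=-1)       # normalized in-group
        weights[:, g] = alpha[:, gi:gi + 1] * w               # layered fusion weight
    return weights

logits = torch.randn(4, 6)                  # router logits for 6 experts
groups = [[0, 1, 2], [3, 4, 5]]             # e.g., domain experts vs. style experts
w = grouped_route(logits, groups, 1.0, [0.5, 2.0])
```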
4. Information Fusion Beyond Weighted Mixtures
While classical MoE fuses expert outputs via weighted averaging, recent work emphasizes the preservation of geometry and distributional structure:
- Concatenation fusion: Instead of averaging, concatenating projected expert outputs maintains their anisotropy (preserving the discriminative “spikes” in embedding space), enabling the downstream classifier to exploit features from distinct expert spaces independently (Liu et al., 18 Nov 2025).
- Distribution-level fusion: In CoCoAFusE, predictive distributions from experts are fused not just by weighting but by interpolating means and variances as a function of a learned blending parameter, enabling a smooth transition between competitive (“mixture”) and collaborative (“blend”) regimes and yielding tighter posterior intervals and improved uncertainty quantification (Ugolini et al., 2 May 2025); a numerical sketch follows this list.
- Groupwise and dynamic gating: Nonlinear group-by-group or token-by-token routing enables selective activation, interpretable control, and support for missing modalities or variable subnetwork sparsity (Li et al., 2024, Wang et al., 29 Aug 2025, Zhang et al., 1 Dec 2025).
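As flagged in the distribution-level item above, here is a small numerical sketch of the mixture-vs-blend distinction for Gaussian expert predictions: the competitive end keeps the full moment-matched mixture variance (including disagreement between expert means), the collaborative end keeps only the averaged within-expert variance, and a blending parameter interpolates. This illustrates the idea only; it is not CoCoAFusE's actual formulation.

```python
import torch

def fuse_gaussians(mu, var, w, alpha: float):
    """Fuse per-expert Gaussians N(mu_i, var_i) with weights w (illustrative).
    alpha=0 -> competitive 'mixture' variance (keeps expert disagreement);
    alpha=1 -> collaborative 'blend' variance (within-expert spread only)."""
    mean = (w * mu).sum(-1)                                    # fused mean
    blend_var = (w * var).sum(-1)                              # within-expert spread
    between = (w * (mu - mean.unsqueeze(-1)) ** 2).sum(-1)     # disagreement term
    return mean, blend_var + (1 - alpha) * between

mu = torch.tensor([[-1.0, 1.0]])            # two disagreeing experts
var = torch.tensor([[0.1, 0.1]])
w = torch.tensor([[0.5, 0.5]])
for a in (0.0, 0.5, 1.0):                   # intervals tighten as alpha -> 1
    print(a, fuse_gaussians(mu, var, w, a))
```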
5. Applications, Empirical Performance, and Specializations
MoE fusion is deployed in a broad array of application domains, with significant empirical performance improvements:
- Language understanding and generation: Two-tier MoE (e.g., MoMoE) achieves state-of-the-art F1 and precision by combining token-level specialization with decision-level correction among agents. On financial sentiment analysis tasks, macro-level MoA fusion yielded a further 1.9% F1-score gain over single-agent MoE, with MoMoE attaining the highest overall performance (Shu et al., 17 Nov 2025).
- Multilingual and multimodal fusion: Anisotropy-preserving concatenation in cross-lingual LLM ensembles improves area under curve (AUC) and query throughput over standard dense or scalar-mixed baselines, supporting efficient, region-specialized expert routing (Liu et al., 18 Nov 2025).
- Healthcare and missing modalities: Specialized experts are dynamically selected based on the available data combinations and missingness indicators, with gating MLPs automatically shifting weight to experts matching the observed pattern and yielding robust clinical prediction across variable modality availability (Wang et al., 29 Aug 2025); see the sketch after this list.
- 3D vision and multimodal grounding: MoE3D combines sparse token-expert gating, information aggregation through attention, and progressive pretraining. Top-1 MoE gating outperforms top-k or dense fusion, improving mean IoU by up to 6.1 on challenging multi-modal 3D tasks (Li et al., 27 Nov 2025).
- Hardware efficiency and deployment: MoE fusion enables high-throughput deployment of enormous LLMs by activating only a sparse subset of model weights per token, with further gains from hardware-aware scheduling, unified dataflows, and DRAM-access minimization (Huang et al., 25 Jul 2025).
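For the missing-modality item above, the following sketch shows one way such gating can be wired: one encoder per modality, a learned "absent" embedding substituted for unobserved inputs, and a gating MLP conditioned on the missingness mask that reweights only the observed experts. Architecture and names are illustrative assumptions, not the cited system.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MissingModalityMoE(nn.Module):
    """One expert per modality; a gating MLP reads the missingness mask and
    shifts weight onto experts whose modalities are observed (illustrative)."""
    def __init__(self, dims, d):
        super().__init__()
        self.encoders = nn.ModuleList(nn.Linear(di, d) for di in dims)
        self.absent = nn.Parameter(torch.zeros(len(dims), d))   # learned placeholders
        self.gate = nn.Sequential(nn.Linear(len(dims), 32),
                                  nn.ReLU(), nn.Linear(32, len(dims)))

    def forward(self, xs, mask):            # xs: list of (B, d_i); mask: (B, M) in {0,1}
        feats = torch.stack(
            [torch.where(mask[:, i:i + 1].bool(), enc(x), self.absent[i])
             for i, (enc, x) in enumerate(zip(self.encoders, xs))], dim=1)
        logits = self.gate(mask.float())
        logits = logits.masked_fill(~mask.bool(), float("-inf"))  # exclude missing
        w = F.softmax(logits, dim=-1)                             # (B, M)
        return (w.unsqueeze(-1) * feats).sum(dim=1)               # fused (B, d)

xs = [torch.randn(4, 10), torch.randn(4, 20)]       # two clinical modalities
mask = torch.tensor([[1, 1], [1, 0], [0, 1], [1, 1]])
fused = MissingModalityMoE([10, 20], d=64)(xs, mask)
```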
6. Theoretical Considerations, Regularization, and Optimization
MoE fusion poses several theoretical and algorithmic challenges, addressed in contemporary research as follows:
- Load-balancing regularization: Auxiliary losses penalize unbalanced expert utilization, preventing “expert collapse” and promoting training stability in both neural and macro-level MoE architectures (Shu et al., 17 Nov 2025, Liu et al., 18 Nov 2025, Xiang et al., 11 Aug 2025).
- Convergence properties: Gating function design affects statistical efficiency; for example, Laplace gating in sparse-MoE architectures achieves strictly better convergence rates for density estimation than softmax or Gaussian gates (Han et al., 2024). The two gate forms are contrasted in the sketch after this list.
- Dynamic adaptation: Expert fusion can be viewed as a form of dynamic ensemble learning, with gates adapting to reflect domain shifts, specialty domains, or even missing data (via learned “absent” embeddings). Optimization typically proceeds via EM or end-to-end gradient descent, sometimes supplemented by periodic expert merging, functional alignment, or expert pruning (Park, 2024, Wang et al., 23 Sep 2025).
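The gate-design contrast referenced above can be made tangible with a toy comparison of the two functional forms: a linear-softmax gate versus a distance-based Laplace gate. This is a sketch of the design space only, not the construction analyzed by Han et al. (2024).

```python
import torch
import torch.nn.functional as F

def softmax_gate(x, W):
    """Linear-softmax gate: g_i(x) proportional to exp(w_i . x)."""
    return F.softmax(x @ W.T, dim=-1)

def laplace_gate(x, centers, scale=1.0):
    """Laplace gate: g_i(x) proportional to exp(-||x - c_i|| / scale);
    weight decays with distance to each expert's center, localizing routing."""
    dists = torch.cdist(x, centers)            # (batch, n_experts)
    return F.softmax(-dists / scale, dim=-1)   # equals normalized exp(-dist/scale)

x = torch.randn(5, 8)
print(softmax_gate(x, torch.randn(4, 8)))      # 4 experts, dot-product logits
print(laplace_gate(x, torch.randn(4, 8)))      # 4 experts, distance-based
```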
7. Limitations, Open Challenges, and Future Directions
MoE fusion is limited by several structural and statistical factors:
- Expert coordination: As the expert pool grows, ensuring diversity and avoiding redundancy becomes challenging; merging procedures and diversity-promoting losses mitigate, but do not eliminate, over-specialization or collapse (Park, 2024).
- Architectural compatibility: Macro-fusion techniques (e.g., Symphony-MoE) necessitate identical architectures among source models for expert alignment; generalizing to systematically different architectures remains open (Wang et al., 23 Sep 2025).
- Computational efficiency: While sparse gating and hardware-aware engineering alleviate inference costs, scaling to “infinite” or continuous expert spaces, as in ∞-MoE, brings new challenges in routing efficiency and kernel implementation (Takashiro et al., 25 Jan 2026).
- Interpretability: Progressive routing, group-specific experts, and interpretable fusion strategies (e.g., in AT-MoE) enhance understanding, but overall global behavior of large-scale MoE systems can remain opaque.
MoE fusion remains a rapidly evolving paradigm, with advances in multi-level specialization, distribution-preserving fusion, and large-scale ensemble coordination driving continued gains across a diverse spectrum of AI domains. Recent work demonstrates that algorithmically principled fusion—sparse, dynamic, structure-aware, and information-preserving—can enhance both performance and efficiency, while promoting model robustness and adaptability (Shu et al., 17 Nov 2025, Liu et al., 18 Nov 2025, Wang et al., 29 Aug 2025, Li et al., 27 Nov 2025, Park, 2024).