
Soft Mixture-of-Experts (Soft MoE)

Updated 6 December 2025
  • Soft Mixture-of-Experts is a neural architecture that employs soft, fully differentiable gating to mix expert outputs via convex combinations.
  • It uses input- and context-dependent routing mechanisms to improve training stability, efficiency, and scalability compared to hard-gated MoEs.
  • Empirical studies and theoretical analyses show that Soft MoE enhances performance in diverse domains such as vision, language, and reinforcement learning.

A Soft Mixture-of-Experts (Soft MoE) is a neural architecture that replaces hard, discrete expert selection with smooth, fully differentiable routing mechanisms, enabling the dynamic, learnable allocation of input features to multiple expert subnetworks. This approach generalizes classical MoE frameworks by allowing each output to be a convex combination of the outputs of all experts, mixed via input- or context-dependent soft gating functions. Soft MoE architectures offer major advantages in terms of training stability, efficiency, scalability, and consistent expert utilization compared to hard-gated (top-k or sparse) MoEs. Soft MoE mechanisms have been adopted across deep learning domains, including vision, language, multimodal, and reinforcement learning systems.

1. Mathematical Formulation and Gating Mechanisms

At its core, a Soft MoE layer consists of $n$ expert subnetworks $f_1, \ldots, f_n$, each mapping $\mathbb{R}^d \to \mathbb{R}^d$ (or a task-specific output space). Instead of routing each input or token to a single expert via hard or top-k gating, Soft MoE computes soft routing weights for each input, so every expert can contribute fractionally to the final prediction.

Unified Routing Structure

The general input-output mapping for a batch of $m$ tokens $X \in \mathbb{R}^{m \times d}$ is:

  • Each expert receives a convex combination of tokens:

$$z_j = \sum_{i=1}^{m} D_{ij}\, x_i \quad \text{with} \quad D \in \mathbb{R}^{m \times n}\ \text{(column-wise softmax)}$$

  • Each expert computes $y_j = f_j(z_j)$.
  • The output $\tilde{y}_i$ for token $x_i$ aggregates all expert outputs:

$$\tilde{y}_i = \sum_{j=1}^{n} C_{ij}\, y_j \quad \text{with} \quad C \in \mathbb{R}^{m \times n}\ \text{(row-wise softmax)}$$

where $D$ and $C$ are fast to compute via linear projections and softmaxes on the token and expert axes, respectively (Puigcerver et al., 2023, Chung et al., 2 Sep 2024, Liu et al., 29 Jan 2024).

Modern Soft MoE implementations (vision, transformers, speech, adapters, etc.) instantiate this abstract recipe by leveraging learned slot-projection matrices, softmax gating, and differentiable recombination steps, all optimized jointly with the expert parameters via backpropagation.
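As a concrete illustration of this recipe, the following is a minimal PyTorch sketch of a Soft MoE layer with one slot per expert; the class name, layer sizes, and two-layer MLP experts are illustrative assumptions rather than a reference implementation.

```python
# Minimal Soft MoE layer sketch in PyTorch, following the dispatch/combine
# recipe above with one slot per expert (p = 1). The class name, sizes, and
# two-layer MLP experts are illustrative choices, not a reference implementation.
import torch
import torch.nn as nn


class SoftMoE(nn.Module):
    def __init__(self, d_model: int, n_experts: int, d_hidden: int):
        super().__init__()
        # One learnable slot vector per expert; affinity logits = X @ slots.
        self.slots = nn.Parameter(torch.randn(d_model, n_experts) / d_model**0.5)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, m tokens, d_model)
        logits = torch.einsum("bmd,dn->bmn", x, self.slots)     # (b, m, n)
        dispatch = logits.softmax(dim=1)   # column-wise: over tokens, per slot (D)
        combine = logits.softmax(dim=2)    # row-wise: over slots, per token (C)
        z = torch.einsum("bmn,bmd->bnd", dispatch, x)           # z_j = sum_i D_ij x_i
        y = torch.stack([f(z[:, j]) for j, f in enumerate(self.experts)],
                        dim=1)                                  # y_j = f_j(z_j)
        return torch.einsum("bmn,bnd->bmd", combine, y)         # sum_j C_ij y_j


# Example: a batch of 2 sequences, 8 tokens of width 64, routed through 4 experts.
layer = SoftMoE(d_model=64, n_experts=4, d_hidden=128)
print(layer(torch.randn(2, 8, 64)).shape)  # torch.Size([2, 8, 64])
```

Because every weight in `dispatch` and `combine` is strictly positive, gradients flow to all experts on every step, which is the property that distinguishes this layer from top-k routed alternatives.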

2. Theoretical Properties and Convergence Analysis

Recent advances establish rigorous statistical guarantees and sample complexity bounds for softmax-gated MoE models under regression and classification settings. Given a family of expert functions $h(x; \eta)$ and softmax gating:

$$f_G(x) = \sum_{j=1}^{k} \pi_j(x)\, h(x, \eta_j), \qquad \pi_j(x) = \frac{e^{\omega_j^\top x + \beta_j}}{\sum_{\ell} e^{\omega_\ell^\top x + \beta_\ell}}$$

key results include:

  • Convergence rates: Under strong identifiability conditions (see below), parameter and function estimation achieves optimal rates $O_P((\log n / n)^{1/2})$ for regression/classification, and $O_P((\log n / n)^{1/4})$ for over-specified models (Nguyen et al., 5 Mar 2025, Nguyen et al., 5 Feb 2024, Nguyen et al., 2023).
  • Strong identifiability: Required for sharp convergence, this condition demands that the expert family $h(x, \eta)$ be such that its low-order derivatives yield linearly independent features. Two-layer nonlinear MLPs with sigmoidal/tanh activations satisfy this; linear and polynomial experts do not, leading to exponentially slower parameter rates owing to coupling under PDE constraints in the Taylor expansion (Nguyen et al., 5 Mar 2025, Nguyen et al., 5 Feb 2024).
  • Failure of single-expert soft routing: Soft MoE with a single expert cannot represent certain simple convex functions even with an arbitrarily powerful expert, revealing an implicit bias toward distributed, specialized representations (Chung et al., 2 Sep 2024).
  • Modified gates and routers: Replacing the gating softmax input by a nonlinear transformation $M(x)$, so that expert and gating parameter subspaces are algebraically independent, eliminates pathological estimation slowdowns in the presence of expert collapse (Nguyen et al., 2023).

These theoretical findings dictate architectural and implementation choices for large-scale MoE systems, favoring strongly identifiable, nonlinear experts and nonlinear router gates, and justifying the preference for softmax-based, fully differentiable routing strategies.
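To make the estimation setting above concrete, here is a small sketch of a softmax-gated mixture $f_G$ with two-layer tanh experts (a strongly identifiable expert class); all sizes and names are illustrative assumptions.

```python
# Minimal sketch of the softmax-gated mixture f_G(x) from the formula above,
# with two-layer tanh experts. Architecture sizes and names are assumptions.
import torch
import torch.nn as nn


class SoftmaxGatedMoE(nn.Module):
    def __init__(self, d_in: int, d_out: int, k_experts: int, d_hidden: int = 32):
        super().__init__()
        self.gate = nn.Linear(d_in, k_experts)   # logits omega_j^T x + beta_j
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_in, d_hidden), nn.Tanh(),
                          nn.Linear(d_hidden, d_out))
            for _ in range(k_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        pi = self.gate(x).softmax(dim=-1)                     # pi_j(x), shape (batch, k)
        h = torch.stack([f(x) for f in self.experts], dim=1)  # h(x, eta_j), (batch, k, d_out)
        return torch.einsum("bk,bkd->bd", pi, h)              # sum_j pi_j(x) h(x, eta_j)


model = SoftmaxGatedMoE(d_in=4, d_out=1, k_experts=3)
print(model(torch.randn(5, 4)).shape)  # torch.Size([5, 1])
```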

3. Architectural Variants and Implementation Details

Soft MoE principles have yielded diverse, domain-adapted architectures:

  • Vision and language transformers: Replacement of MLP blocks in ViTs or transformer encoders/decoders with Soft MoE layers, often with a single slot per expert ($p=1$) and up to hundreds of experts per layer (Puigcerver et al., 2023, Liu et al., 29 Jan 2024, Wu et al., 2023, Cappellazzo et al., 1 Feb 2024). Routing is achieved by computing per-slot affinity matrices via normalized projections, followed by row-wise and column-wise softmax normalization.
  • Adapters and parameter-efficient transfer learning: Soft mixture-of-adapters (Soft-MoA) utilizes small bottleneck adapters as experts; soft routing is applied between token representations and adapters for scalable fine-tuning with minimal overhead (Cappellazzo et al., 1 Feb 2024).
  • Low-rank experts: In multimodal and large LMM backbones, Soft MoE can use LoRA-style low-rank adapters as experts. All routing and recombination remains fully differentiable, and training involves only LoRA and routing weights for efficient instruction-tuning (Wu et al., 2023).
  • Multi-task and multi-gate MoE: Multi-gate MoE structures (MMoE) allocate per-task gates over a shared expert pool (Huang et al., 2023). Extensions like Balanced MoE (BMoE) layer task-gradient normalization atop soft expert allocation to mitigate negative transfer (Huang et al., 2023).
  • Specialized regularization in multimodal settings: Auxiliary objectives based on KL divergence between per-modality routing distributions (as in Soft Modality-Aware Routing, SMAR) promote expert specialization without architectural changes, crucial for balancing multimodal and text-only performance (Xia et al., 6 Jun 2025).
  • Adversarial regularization and hierarchical soft constraints: Additional loss terms can encourage expert diversity (output disagreement among non-selected experts) and softly promote expert sharing among related hierarchical classes (Xiao et al., 2020).

Soft MoE always preserves full differentiability of the compute graph, enabling end-to-end training with standard SGD optimizers.
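For instance, the low-rank-expert variant above can be sketched as follows: LoRA-style adapters serve as experts over a frozen base projection, and only the adapters and routing slots receive gradients. The class and parameter names (`LowRankExpert`, `SoftMoLoRA`, `rank`) are assumptions for illustration, not taken from the cited work.

```python
# Hedged sketch: Soft MoE routing over LoRA-style low-rank adapter experts,
# added to a frozen base linear projection. Names and sizes are illustrative.
import torch
import torch.nn as nn


class LowRankExpert(nn.Module):
    """Low-rank update Delta(x) = alpha * (x @ A) @ B, with rank r << d."""
    def __init__(self, d_in: int, d_out: int, rank: int = 8, alpha: float = 1.0):
        super().__init__()
        self.A = nn.Parameter(torch.randn(d_in, rank) / d_in**0.5)
        self.B = nn.Parameter(torch.zeros(rank, d_out))  # zero init: no update at start
        self.alpha = alpha

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.alpha * (x @ self.A) @ self.B


class SoftMoLoRA(nn.Module):
    """Frozen base projection plus a soft-routed mixture of low-rank adapters."""
    def __init__(self, base_linear: nn.Linear, n_experts: int = 4, rank: int = 8):
        super().__init__()
        self.base = base_linear.requires_grad_(False)  # pretrained weight stays frozen
        d_in, d_out = base_linear.in_features, base_linear.out_features
        self.slots = nn.Parameter(torch.randn(d_in, n_experts) / d_in**0.5)
        self.experts = nn.ModuleList(
            [LowRankExpert(d_in, d_out, rank) for _ in range(n_experts)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, d_in)
        logits = torch.einsum("bmd,dn->bmn", x, self.slots)
        dispatch = logits.softmax(dim=1)   # over tokens, per expert slot
        combine = logits.softmax(dim=2)    # over expert slots, per token
        z = torch.einsum("bmn,bmd->bnd", dispatch, x)
        y = torch.stack([f(z[:, j]) for j, f in enumerate(self.experts)], dim=1)
        return self.base(x) + torch.einsum("bmn,bnd->bmd", combine, y)
```

Only `slots` and the adapters' `A`/`B` matrices are trainable here, which is what keeps the fine-tuning budget small while the routing remains fully differentiable.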

4. Practical Benefits and Empirical Outcomes

The Soft MoE paradigm confers multiple empirical advantages:

  • Scalability and parameter efficiency: Orders-of-magnitude increase in parameter count for constant or marginally higher inference FLOPs compared to dense or sparse MoEs; e.g., Soft MoE Huge/14 with 128 experts and 16 MoE layers achieves 27.3B parameters with only 2% more inference time than ViT Huge/14 (Puigcerver et al., 2023).
  • Training stability: Absence of dropped tokens, expert collapse, or routing instability, even with hundreds of experts and large sequence lengths, due to everywhere-positive routing weights and the elimination of hard buffer constraints (Puigcerver et al., 2023, Liu et al., 29 Jan 2024).
  • Improved performance: Across vision, audio, and multimodal benchmarks, Soft MoE consistently outperforms both dense MLP transformers and standard top-k/sparse MoEs at fixed compute, and shows superior scaling with increasing expert pool size while avoiding dead experts (Puigcerver et al., 2023, Cappellazzo et al., 1 Feb 2024, Wu et al., 2023, Liu et al., 29 Jan 2024).
  • Simplified implementation: Only two softmaxes and linear projections per layer are required, with no need for load-balancing or routing-entropy auxiliary losses, in contrast to sparse/top-k MoEs (Puigcerver et al., 2023, Liu et al., 29 Jan 2024).
  • Expert specialization and inference efficiency: Empirically, as the number of experts grows (even at fixed total parameter count), per-input routing weights become informative; thus, a small subset of experts can suffice for each input with negligible accuracy loss, enabling significant inference speedups (Chung et al., 2 Sep 2024).
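The expert-subset heuristic in the last bullet can be sketched as follows: score each expert by its total per-input combine weight, evaluate only the top-$s$ experts, and renormalize. The scoring rule and the `top_s` parameter are illustrative assumptions, not a prescribed algorithm.

```python
# Hedged sketch of subset routing at inference: keep only the s experts with the
# largest total gate mass for this input, then recompute the soft routing over them.
import torch


def soft_moe_with_expert_subset(x, slots, experts, top_s: int):
    # x: (m, d) tokens of one input; slots: (d, n); experts: list of n callables.
    logits = x @ slots                          # (m, n)
    combine = logits.softmax(dim=1)             # per-token weights over experts
    score = combine.sum(dim=0)                  # total gate mass per expert
    keep = score.topk(top_s).indices            # indices of the s most-used experts
    dispatch = logits[:, keep].softmax(dim=0)   # dispatch over kept slots only
    combine_s = combine[:, keep]
    combine_s = combine_s / combine_s.sum(dim=1, keepdim=True)  # renormalize rows
    z = dispatch.t() @ x                        # (s, d) slot inputs
    y = torch.stack([experts[j](z[i]) for i, j in enumerate(keep.tolist())])
    return combine_s @ y                        # (m, d) approximate output


# Example with random linear experts.
d, n, m = 16, 8, 10
slots = torch.randn(d, n)
experts = [torch.nn.Linear(d, d) for _ in range(n)]
approx = soft_moe_with_expert_subset(torch.randn(m, d), slots, experts, top_s=2)
print(approx.shape)  # torch.Size([10, 16])
```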

Domain-specific adaptations—including modality-specific and cross-modality soft expert ensembles, hierarchical soft-gating with semantic constraints, and low-rank expert parameterizations—enable wide-ranging, application-tailored deployments across language, remote sensing, and multimodal LLMs (Hackel et al., 17 Sep 2025, Xia et al., 6 Jun 2025, Wu et al., 2023).

5. Training Dynamics and Optimization Landscape

Recent theoretical work has elucidated the feature learning dynamics of overparameterized Soft MoEs:

  • Student-teacher recovery: With random initialization and moderate overparameterization, joint gradient flow on router and expert parameters induces a feature learning phase guided by the experts, resulting in sequential alignment of student experts and routers to their teacher counterparts (Liao et al., 8 Oct 2025). This guided phase transition can be rigorously proved under population gradient flow, leveraging Hermite expansions for nonlinearity analysis.
  • Pruning and fine-tuning: Redundant experts can be pruned post-training with little to no generalization gap, after which local strongly convex fine-tuning achieves global optimality (Liao et al., 8 Oct 2025). The optimization landscape of the pruned Soft MoE model is locally strongly convex around the global minimum, with no spurious stationary points.
  • Router nonlinearity and initialization: Theoretical results recommend sufficiently nonlinear router architectures (e.g., sigmoid or softmax gates) and moderate overparameterization ($m \gtrsim m^* \log m^*$ for $m^*$ teacher experts, where $m$ is the number of student experts) to ensure feature recovery and optimization success.
  • Smoothness benefits: Soft routing grants smoothness properties—such as positive definiteness of the population loss Hessian near the global optimum and the absence of pathological high-order saddle points—that facilitate both gradient-based optimization and generalization.
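A rough sketch of usage-based expert pruning in this spirit is given below, assuming a layer exposing `slots` and `experts` attributes as in the Section 1 sketch; the usage statistic and keep ratio are illustrative choices rather than the procedure of the cited analysis.

```python
# Hedged sketch: prune experts whose total combine-weight mass on a calibration
# batch is smallest, then fine-tune the pruned layer with a standard optimizer.
import torch
import torch.nn as nn


@torch.no_grad()
def prune_experts(layer, calib_tokens: torch.Tensor, keep_ratio: float = 0.5):
    # calib_tokens: (batch, m, d) calibration inputs for measuring expert usage.
    logits = torch.einsum("bmd,dn->bmn", calib_tokens, layer.slots)
    usage = logits.softmax(dim=2).sum(dim=(0, 1))        # total combine mass per expert
    n_keep = max(1, int(keep_ratio * usage.numel()))
    keep = usage.topk(n_keep).indices.sort().values      # most-used experts, original order
    layer.slots = nn.Parameter(layer.slots[:, keep].clone())
    layer.experts = nn.ModuleList([layer.experts[j] for j in keep.tolist()])
    return keep
```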

6. Applications Across Domains

Soft MoE architectures have been successfully applied in a breadth of domains:

  • Large-scale vision backbones: Replacement of MLP blocks in transformers for improved upstream and downstream performance at negligible inference cost (Puigcerver et al., 2023, Liu et al., 29 Jan 2024).
  • Multimodal LMMs: Instruction-tuned models for vision-language tasks leverage Soft MoE-based low-rank adapters for scalable, specialist-expert, and generalist performance, with state-of-the-art results (Wu et al., 2023, Xia et al., 6 Jun 2025).
  • Parameter-efficient transfer learning: Soft mixture-of-adapters for audio transformers delivers near-dense performance with only 20-30% additional training time versus single adapters (Cappellazzo et al., 1 Feb 2024).
  • Industrial multi-task learning: Multi-gate MoE with soft sharing and dynamic gradient balancing for multi-quality variable estimation achieves significant improvement in predictive accuracy and robustness to negative transfer (Huang et al., 2023).
  • Question generation and template-based NLP: Soft MoE over latent template slots enables compositional control for complexity-aware sequence generation (Bi et al., 2021).
  • Remote sensing: Efficient Soft MoE integration into CSMAE foundation models roughly doubles computational efficiency over standard approaches for Earth observation (EO) tasks (Hackel et al., 17 Sep 2025).
  • Policy learning in RL: Soft mixture-of-Gaussian policies equipped with differentiable softmax gating and variance-reduced gradient estimators expedite multimodal skill acquisition in continuous control (Ren et al., 2021).
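As an illustration of the last item, a minimal sketch of a soft mixture-of-Gaussians policy with softmax gating, built on `torch.distributions`; the component count, network sizes, and names are assumptions, and the variance-reduced gradient estimators of the cited work are omitted.

```python
# Hedged sketch: mixture-of-Gaussians policy whose mixing weights come from a
# softmax gate; log-probabilities are differentiable in all policy parameters.
import torch
import torch.nn as nn
from torch.distributions import Categorical, Independent, MixtureSameFamily, Normal


class SoftMoGPolicy(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int, n_components: int = 4, hidden: int = 64):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh())
        self.gate = nn.Linear(hidden, n_components)              # softmax mixing weights
        self.mean = nn.Linear(hidden, n_components * act_dim)
        self.log_std = nn.Parameter(torch.zeros(n_components, act_dim))
        self.n, self.act_dim = n_components, act_dim

    def dist(self, obs: torch.Tensor) -> MixtureSameFamily:
        h = self.trunk(obs)
        mix = Categorical(logits=self.gate(h))
        mean = self.mean(h).view(-1, self.n, self.act_dim)
        comp = Independent(Normal(mean, self.log_std.exp()), 1)
        return MixtureSameFamily(mix, comp)


policy = SoftMoGPolicy(obs_dim=8, act_dim=2)
d = policy.dist(torch.randn(5, 8))
actions = d.sample()                 # (5, 2)
print(actions.shape, d.log_prob(actions).shape)
```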

7. Limitations and Open Problems

  • Single-expert limitations: Soft MoEs with only a single expert and soft routing cannot emulate even simple convex functions, emphasizing the necessity for multiplicity in expert representation (Chung et al., 2 Sep 2024).
  • Expert identifiability: Polynomial or input-independent experts lead to identifiability pathologies and arbitrarily slow parameter recovery, mandating the use of input-dependent, strongly identifiable (nonlinear) expert classes (Nguyen et al., 5 Mar 2025, Nguyen et al., 5 Feb 2024).
  • Subset discovery and specialization: Although explicit discovery of minimal expert subsets sufficient for a given input is computationally intractable, gating-induced heuristics (e.g., ranking by total per-input gate weight) enable efficient approximate specialization (Chung et al., 2 Sep 2024).
  • Router design and sample efficiency: Gate and router architectures must be designed to avoid algebraic coupling with expert parameters; nonlinear gates restore polynomial sample complexity lost by linear router-expert PDE ties (Nguyen et al., 2023, Nguyen et al., 5 Mar 2025).

Soft Mixture-of-Experts methods, by avoiding hard routing constraints and by ensuring smooth, dense expert utilization, continue to drive scalable, stable, and efficient architectures across deep learning domains while posing analytic and computational challenges that inform ongoing theoretical research.
