Mixture-of-Experts Scaling
- Mixture-of-Experts (MoE) scaling is a neural architecture that uses conditional computation via a router to activate a sparse set of expert subnetworks per token.
- It achieves parameter decoupling and sublinear compute growth by routing inputs to only the top-k experts, reducing per-token computation despite large total parameter counts.
- Recent research presents precise scaling laws, optimal hyperparameter strategies, and innovative architectures like CartesianMoE and ReXMoE to maximize efficiency and enable compression.
Mixture-of-Experts (MoE) Scaling refers to a family of neural network architectures and training paradigms that enable models to scale their parameter count and representational capacity beyond what is computationally feasible in a dense setting, by leveraging conditional computation and increased architectural modularity. MoE models utilize a set of "experts"—typically independent feed-forward subnetworks—and a router mechanism that, for each input (often at the token level), selects or gates a sparse subset of these experts per forward pass. This approach provides sublinear computational and memory scaling with respect to total parameter count, allowing extremely large language and vision models to be trained and deployed efficiently. Recent advances in theory and system design have yielded precise scaling laws, optimal hyperparameter regimes, and specialized compression techniques, making MoE the dominant paradigm for parameter-efficient scaling of large language and vision models.
1. Fundamental Principles of MoE Scaling
A canonical MoE layer replaces a dense subnetwork (e.g., a Transformer FFN block) with $E$ parallel "experts," each a parameterized subnetwork $f_i$ (typically a two-layer MLP with hidden dimension $d_{\text{expert}}$). For each token or input state $x$, a router computes a probability vector $p(x) = \mathrm{softmax}(W_r x) \in \mathbb{R}^{E}$, and the top-$k$ largest entries (with $k \ll E$) are selected. The output is

$$y(x) \;=\; \sum_{i \,\in\, \mathrm{TopK}(p(x),\,k)} p_i(x)\, f_i(x).$$
This "conditional computation" means that for any token, only a small fraction of all experts are evaluated, making the compute and memory cost per token independent, or only weakly dependent, on the total parameter count. This decoupling enables models with many billions to trillions of parameters to be tractably trained and served (Kim et al., 2021).
Key scaling properties:
- Parameter decoupling: Total parameters grow linearly in the number of experts $E$, but per-token compute cost grows only with the number of active experts $k$.
- Sparsity factor: Network sparsity can be characterized by the fraction of experts active per token, $k/E$; lower values increase the scaling benefits and change the representational regime (Chaudhari et al., 26 Oct 2025).
- Sublinear compute: Overall compute grows sublinearly with parameter count, as the numerical sketch below illustrates.
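A back-of-the-envelope calculation makes the decoupling concrete; the layer dimensions below are arbitrary illustrative values, and the router cost (proportional to $E$) is omitted as negligible.

```python
def moe_ffn_stats(d_model: int, d_expert: int, num_experts: int, k: int):
    """Parameter count vs. per-token cost for a bank of two-layer MLP experts."""
    params_per_expert = 2 * d_model * d_expert       # up- and down-projection weights
    total_params = num_experts * params_per_expert   # grows linearly in E
    active_params = k * params_per_expert            # per-token cost grows only with k
    flops_per_token = 2 * active_params              # ~2 FLOPs per active weight (multiply-add)
    return total_params, active_params, flops_per_token

for E in (8, 64, 256):
    total, active, flops = moe_ffn_stats(d_model=4096, d_expert=14336, num_experts=E, k=2)
    print(f"E={E:3d}: total={total / 1e9:6.1f}B params, "
          f"active={active / 1e9:4.2f}B, per-token FLOPs ~ {flops / 1e9:4.1f}G")
```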
2. Joint Scaling Laws and Compute-Optimal MoE Regimes
Recent work has resulted in precise scaling laws that relate model loss to data size (tokens), total parameter count, the number of active parameters per token, the number of active experts, and the ratio of shared experts (Zhao et al., 28 Sep 2025, Ludziejewski et al., 7 Feb 2025, Tian et al., 23 Jul 2025):
- Compute- and memory-optimal selection: Given a total parameter or compute budget, one can derive closed-form or algorithmic optima for the number of active experts, the shared-expert ratio, and the activation ratio (active/total parameters), yielding explicit recipes for MoE hyperparameter selection (Zhao et al., 28 Sep 2025, Ludziejewski et al., 7 Feb 2025). The optimal number of active experts and shared-expert ratio are nearly universal, while the activation ratio should decrease as total model size increases (e.g., from roughly 25% to 10% to 5% as total parameter count grows).
- Efficiency Leverage (EL): The computational advantage of MoE over a dense baseline is quantified by EL, indicating how much less compute is needed to achieve comparable loss. Scaling laws for EL show superlinear gains as the activation ratio decreases, especially when expert granularity lies in its optimal range (up to roughly 12) and the total compute budget is large (Tian et al., 23 Jul 2025).
Empirically, an EL of 7× is consistently achieved in practice: for example, Ling-mini-beta (17.5B total, 0.85B active parameters) trained on 1T tokens matches the performance of a 6.1B dense model trained on the same data (Tian et al., 23 Jul 2025).
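As a concrete illustration of how such laws are used, the sketch below inverts two hypothetical loss-versus-compute power laws (one dense, one MoE) to compute an EL-style ratio at a given target loss; the functional form and all coefficients are placeholders, not the fitted laws of the cited papers.

```python
# Hypothetical fitted loss-vs-compute power laws, L(C) = a * C**(-b).
# Coefficients are placeholders chosen for illustration only.
DENSE_FIT = dict(a=25.0, b=0.0500)
MOE_FIT = dict(a=24.0, b=0.0515)   # MoE trained at a low activation ratio

def compute_for_loss(target_loss: float, a: float, b: float) -> float:
    """Invert L = a * C**(-b) to get the compute C needed to reach target_loss."""
    return (a / target_loss) ** (1.0 / b)

def efficiency_leverage(target_loss: float) -> float:
    """EL = dense compute / MoE compute required to reach the same loss."""
    return compute_for_loss(target_loss, **DENSE_FIT) / compute_for_loss(target_loss, **MOE_FIT)

for loss in (2.2, 2.0, 1.8):
    # EL grows as the loss target tightens, i.e., as the compute budget increases.
    print(f"target loss {loss}: EL ~ {efficiency_leverage(loss):.1f}x")
```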
3. Architectures and System-Level Innovations
MoE architectural research has expanded in several directions:
- Fine-grained MoE: Introduces a granularity hyperparameter (the ratio between the dense FFN hidden size and the expert hidden size), enabling scaling to large expert counts with smaller expert dimension. Empirical scaling laws show that the optimal granularity increases with compute and that the standard coarse-grained setting (granularity of one) is highly suboptimal; a minimal sketch follows this list (Krajewski et al., 12 Feb 2024, Krajewski et al., 3 Jun 2025).
- CartesianMoE: Splits experts into two subgroups, sequentially routed and combined, forming a Cartesian product. This achieves multi-tier knowledge sharing—global, group-wise, and expert-specific—yielding improved perplexity and downstream task performance, as well as better robustness to expert ablation (Su et al., 21 Oct 2024).
- ReXMoE: Loosens the restriction of layer-local expert sets, enabling experts to be reused across adjacent layers via progressive scaling routing (PSR). This approach yields improved expressivity and allows richer expert combinations without inflating parameter counts (Tan et al., 20 Oct 2025).
- Multilinear MoE: Approximates the massive weight tensor of all experts via low-rank tensor factorizations (e.g., Tucker, CP, TT). Enables scaling the number of experts well beyond what is tractable by direct parameterization, with empirical evidence of improved monosemanticity and class specialization (Oldfield et al., 19 Feb 2024).
- Task/Domain-level Routing: Task-level MoE routes at the task rather than token level, making it possible to extract small, ready-to-deploy sub-networks from large MoEs for efficient inference and serving (Kudugunta et al., 2021).
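Returning to the fine-grained bullet above, the sketch below shows the bookkeeping behind granularity: each dense-FFN-sized expert is split into narrower experts while the number of active experts is scaled up by the same factor, so total and active parameter counts stay fixed but the router gains far more combinations to choose from. The function name and dimensions are illustrative.

```python
def fine_grain(d_ff_dense: int, granularity: int, experts_per_dense_ffn: int, k_base: int):
    """Split each dense-FFN-sized expert into `granularity` narrower experts.

    Total parameters and active parameters per token are preserved; only the
    number (and width) of experts, and hence routing flexibility, changes.
    """
    d_expert = d_ff_dense // granularity
    num_experts = experts_per_dense_ffn * granularity
    k = k_base * granularity          # keep the active parameter count constant
    return d_expert, num_experts, k

for G in (1, 4, 16):
    d_expert, num_experts, k = fine_grain(
        d_ff_dense=14336, granularity=G, experts_per_dense_ffn=8, k_base=2
    )
    print(f"G={G:2d}: {num_experts:4d} experts of width {d_expert:5d}, k={k:2d} active per token")
```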
System-level contributions (e.g., DeepSpeed MoE) have enabled scaling to trillions of parameters by combining data parallelism, expert parallelism, model parallelism, memory offload, and modular checkpointing, with optimized all-to-all communication and memory management (Kim et al., 2021, Singh et al., 2 Oct 2025).
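The core data movement in expert parallelism is a dispatch/combine permutation: tokens are grouped by assigned expert, sent to the device hosting that expert (the all-to-all), processed, and scattered back to their original positions. The single-process sketch below mimics that permutation with a sort and its inverse; it is illustrative only and omits capacity limits, padding, and actual communication.

```python
import torch

def dispatch_combine(x, expert_idx, experts):
    """Group tokens by expert (dispatch), run each group, restore order (combine)."""
    order = torch.argsort(expert_idx)                 # dispatch permutation
    grouped = x[order]
    counts = torch.bincount(expert_idx, minlength=len(experts)).tolist()
    outputs = []
    start = 0
    for e, n in enumerate(counts):                    # each expert sees one contiguous batch
        if n:
            outputs.append(experts[e](grouped[start:start + n]))
        start += n
    combined = torch.cat(outputs, dim=0) if outputs else grouped
    out = torch.empty_like(combined)
    out[order] = combined                             # inverse permutation (combine)
    return out

# Usage with k=1 routing over 4 toy "experts" (plain linear layers stand in for expert MLPs).
experts = [torch.nn.Linear(16, 16) for _ in range(4)]
x = torch.randn(32, 16)
expert_idx = torch.randint(0, 4, (32,))
y = dispatch_combine(x, expert_idx, experts)
```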
4. Training Paradigms and Inference-Time Elasticity
A central challenge of MoE scaling is robust adaptation to dynamic computational budgets and serving environments:
- Elastic Mixture-of-Experts (EMoE): Standard MoE routers fail when the number of activated experts at inference differs from the training-time $k$; performance quickly degrades. EMoE addresses this by stochastically varying $k$ during training (sampling it from a larger range) and adding a KL-divergence router loss, enabling inference-time adjustment of $k$ to roughly twice the training value or more with monotonic performance gains (a generic elastic-$k$ sketch follows this list) (Gu et al., 26 Sep 2025).
- Matryoshka MoE (M-MoE): Randomizes $k$ both globally and per-layer during training, leading to a consistent expert ranking that unlocks per-layer and per-inference elasticity. A single M-MoE model matches the full suite of fixed-$k$ "specialist" models across a range of $k$ values, at reduced training cost (Wang et al., 30 Sep 2025).
- ElasticMoE system: In production serving, enables fine-grained, zero-downtime scaling of MoE LLMs in cloud environments by decoupling inference from memory operations, using zero-copy expert migration and high-bandwidth peer-to-peer transfers. This yields substantially lower scale-up latency and higher throughput during scaling events (Singh et al., 2 Oct 2025).
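A hedged sketch of the elastic-$k$ training idea common to these methods: at each step a value of $k$ is sampled from a range rather than held fixed, so the router learns a ranking that degrades gracefully when $k$ changes at inference. The sampling range, loss weight, and the KL-style regularizer below are placeholders, not the exact recipes of the cited papers.

```python
import random
import torch
import torch.nn.functional as F

def elastic_k_router_step(router_logits, k_min=1, k_max=8, kl_weight=0.1):
    """One illustrative routing step with a randomly sampled k.

    Returns the top-k weights/indices to use for expert dispatch, plus a
    KL-style regularizer that encourages the router to concentrate its mass
    on the selected top-k experts (a stand-in for the consistency losses
    used by elastic-k methods).
    """
    k = random.randint(k_min, k_max)                  # sampled per training step
    full_probs = F.softmax(router_logits, dim=-1)
    topk_p, topk_idx = full_probs.topk(k, dim=-1)
    topk_renorm = topk_p / topk_p.sum(dim=-1, keepdim=True)

    # sum_i q_i * (log q_i - log p_i) over the selected entries; equals
    # -log(total top-k probability), so it rewards concentrated routing.
    selected_full = full_probs.gather(-1, topk_idx)
    kl = (topk_renorm * (topk_renorm.log() - selected_full.log())).sum(-1).mean()

    return topk_renorm, topk_idx, kl_weight * kl, k

logits = torch.randn(16, 64, requires_grad=True)      # 16 tokens, 64 experts
weights, indices, kl_loss, k = elastic_k_router_step(logits)
kl_loss.backward()                                    # would be added to the task loss
```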
5. Compression and Efficient Deployment
MoE models, with their high total parameter counts, pose unique challenges for deployment. Recent compression and slimming methods include:
- Expert Slimming & Trimming: Individual expert weights are pruned or quantized (Expert Slimming), and entire experts are pruned based on low routing frequency (Expert Trimming/Drop). These can be combined with Layer Drop (pruning entire MoE layers) and Block Drop (pruning whole transformer blocks) for aggressive compression (He et al., 4 Jun 2024).
- MoBE: Each up/gate matrix in an expert is decomposed into a unique thin matrix and a shared basis expansion (a convex combination of layer-level bases), reducing parameter count by roughly 24% or more with only a small accuracy drop, even at trillion-parameter scale (Chen et al., 7 Aug 2025).
- EAC-MoE: Expert-selection calibration after quantization (QESC) addresses "expert-shift" errors in the router induced by weight quantization, focusing on preserving the correct top-$k$ expert indices. Pruning based on expert-selection frequency (PESF) dynamically removes experts that are never (or rarely) routed to, reducing memory and inference cost with minimal quality loss (Chen et al., 3 Aug 2025); a minimal frequency-based trimming sketch follows this list.
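The sketch below illustrates frequency-based expert trimming as described above: routing statistics are collected on a calibration set and experts that almost never appear in the top-$k$ are dropped. The threshold, the skewed synthetic logits, and the calibration loop are illustrative; real methods additionally recalibrate the router and handle load balancing.

```python
import torch

@torch.no_grad()
def expert_routing_frequency(router_logits_batches, num_experts, k=2):
    """Count how often each expert appears in the top-k over a calibration set."""
    counts = torch.zeros(num_experts)
    total = 0
    for logits in router_logits_batches:              # logits: (num_tokens, num_experts)
        topk_idx = logits.topk(k, dim=-1).indices
        counts += torch.bincount(topk_idx.flatten(), minlength=num_experts).float()
        total += topk_idx.numel()
    return counts / total                             # fraction of top-k slots per expert

def experts_to_keep(freq, min_freq=1e-3):
    """Drop experts that are (almost) never routed to."""
    return (freq >= min_freq).nonzero(as_tuple=True)[0].tolist()

# Hypothetical calibration pass: synthetic logits, skewed so some experts are rarely picked.
bias = torch.linspace(-4.0, 4.0, 64)
batches = [torch.randn(1024, 64) + bias for _ in range(10)]
freq = expert_routing_frequency(batches, num_experts=64, k=2)
keep = experts_to_keep(freq, min_freq=1e-3)
print(f"keeping {len(keep)}/64 experts")
```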
6. Interpretability, Specialization, and Representation
MoE scaling has direct implications for model interpretability and specialization:
- Sparsity vs. Superposition: Network sparsity, i.e., the fraction of experts (or parameters) active per token, is the primary control parameter for monosemanticity and interpretability. As this fraction approaches zero (extreme sparsity), MoEs represent features monosemantically, reducing interference among features. Conversely, dense models (a single always-active expert) exhibit superposed features, impeding interpretability (Chaudhari et al., 26 Oct 2025).
- Fine-grained and Factorized MoE: Fine-grained MoE and tensor-factorized MoE (e.g., µMoE) preserve end-to-end differentiability and scale experts to the thousands, enabling interpretable, class-specialized representations while keeping memory and compute tractable (Krajewski et al., 12 Feb 2024, Oldfield et al., 19 Feb 2024).
A plausible implication is that, at scale, experts in properly regularized, highly sparse MoEs can become naturally aligned with coherent feature or task subdomains, enabling more interpretable and debuggable large models.
7. Applications and Empirical Scaling Results
- Language Modeling: MoE scaling enables training and serving of models with 100B–1T parameters on existing hardware, yielding state-of-the-art quality for equivalent or reduced compute (Kim et al., 2021, Zhao et al., 28 Sep 2025, Tian et al., 23 Jul 2025). Scaling curves show the MoE–dense efficiency gap widens with growing compute and data budgets (Krajewski et al., 12 Feb 2024, Tian et al., 23 Jul 2025).
- Vision and Multimodal: In computer vision, MoE yields maximum benefit for per-token activated parameter budgets in the 20–60M regime; higher capacity brings diminishing returns, and the MoE advantage is most pronounced under moderate-to-large sample and model size (Videau et al., 27 Nov 2024).
- Reinforcement Learning: MoE modules (notably with soft rather than hard routing) restore monotonic scaling of performance with parameter count in deep RL, a regime where dense scaling commonly hurts. Adding MoE blocks improves both final performance and the stability of learning dynamics in Atari and offline RL (Obando-Ceron et al., 13 Feb 2024).
- Multilingual and Task Routing: Task-level routing in massively multilingual NMT and multitask models preserves MoE quality while activating only task-relevant experts at inference, yielding substantially higher throughput and significant memory savings (Kudugunta et al., 2021).
- System Scaling: Efficient architectures (e.g., DeepSpeed MoE) combine multi-dimensional parallelism and heterogeneous memory offload to support training at the trillion-parameter scale, with near-linear weak scaling and practical utilization (Kim et al., 2021, Singh et al., 2 Oct 2025).
Overall, Mixture-of-Experts scaling, now buttressed with empirically validated scaling laws and architectural advances, is central to the efficient construction, training, compression, and deployment of next-generation large-scale neural networks (Zhao et al., 28 Sep 2025, Tian et al., 23 Jul 2025, He et al., 4 Jun 2024, Krajewski et al., 3 Jun 2025).