Slicing and Dicing: Configuring Optimal Mixtures of Experts

Published 12 May 2026 in cs.LG and cs.CL | (2605.11689v1)

Abstract: Mixture-of-Experts (MoE) architectures have become standard in LLMs, yet many of their core design choices - expert count, granularity, shared experts, load balancing, token dropping - have only been studied one or two at a time over narrow configuration ranges. It remains an open question whether these choices can be optimized independently, without considering interactions. We present the first systematic study of over 2,000 pretraining runs spanning models up to 6.6B total parameters, in which we exhaustively vary total experts, expert dimension, heterogeneous expert sizing within a single layer, shared expert size and load-balancing mechanisms. We find that at every active-parameter scale that we study, performance consistently improves with total MoE parameters even at extreme active expert parameter ratios like 128. Further, the optimal expert size is nearly invariant to total parameter count and depends only on active parameter count. Third, we see that other choices like shared experts, heterogeneous experts and load-balancing settings have small effects relative to expert count and granularity, although dropless routing yields a consistent gain. Overall, our results suggest a simpler recipe: focus on expert count and granularity; other choices have minimal effect on final quality.

Authors (4)

Summary

The paper shows that performance gains in MoE LMs primarily stem from increasing total expert count at a fixed coarse granularity within compute and memory constraints.
It rigorously decouples hyperparameters to demonstrate that expert heterogeneity and always-active modules offer minimal benefit or can even impair performance.
Empirical analysis across scales indicates that proper routing and load balancing are secondary to efficient parameter allocation in enhancing language modeling accuracy.

Slicing and Dicing: A Systematic Analysis of Mixture-of-Experts Configuration Space

Overview and Motivation

This paper provides the most comprehensive empirical study to date of Mixture-of-Experts (MoE) architectural choices for LLMs, interrogating the interplay among expert count, expert granularity, sharing/heterogeneity, and routing/load-balancing mechanisms across over 2,000 pretraining runs. Spanning model scales from 10M to 6.6B parameters, the work rigorously decouples multiple often-tied hyperparameters and exposes how performance emerges as a function of both active and inactive parameter allocation. The principal finding is that, within compute- and memory-matched settings, performance gains trace almost entirely to maximizing total expert parameterization—primarily by increasing expert count—while auxiliary design choices yield only minimal additional benefit. This significantly simplifies the practical optimization landscape for MoE LMs.

Detailed Experimental Design

The experimental suite exhaustively explores a multi-dimensional MoE design space, including both homogeneous and heterogeneous expert sets, optionally supplemented by always-active "generalist" FFNs, as well as systematic sweeps over load-balancing hyperparameters and routing choices. Models are FLOP-matched on the basis of active FFN parameters, isolating performance effects attributable to varying total expert count $n$ , granularity $g$ (fractional expert width versus dense baseline), or activation sparsity $s = n \cdot g$ . Furthermore, expert size and expert count are manipulated independently to tease apart their contributions (Figure 1).
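As a rough illustration of the bookkeeping above, the following sketch computes active and total expert parameters for a single MoE layer. It assumes a plain two-matrix FFN (up- and down-projection, no gating or biases) and the coupling $k = 1/g$ active experts per token, so that active FFN parameters match the dense baseline. The class and field names (`MoELayerShape`, `d_model`, `d_ff_dense`) are illustrative and do not come from the paper.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass(frozen=True)
class MoELayerShape:
    """Hypothetical container for one MoE layer's shape (names are assumptions)."""
    d_model: int      # hidden size
    d_ff_dense: int   # FFN width of the dense baseline
    n: int            # total number of routed experts
    g: Fraction       # granularity: expert width / dense FFN width

    @property
    def k(self) -> int:
        # Active experts per token under the assumed k = 1/g coupling.
        inv = 1 / self.g
        if inv.denominator != 1:
            raise ValueError("1/g must be an integer under the k = 1/g coupling")
        return inv.numerator

    @property
    def expert_width(self) -> int:
        width = self.g * self.d_ff_dense
        if width.denominator != 1:
            raise ValueError("g * d_ff_dense must be an integer")
        return width.numerator

    @property
    def params_per_expert(self) -> int:
        # Assumes up- and down-projection only (no gate matrix, no biases).
        return 2 * self.d_model * self.expert_width

    @property
    def active_params(self) -> int:
        return self.k * self.params_per_expert

    @property
    def total_params(self) -> int:
        return self.n * self.params_per_expert

    @property
    def sparsity(self) -> Fraction:
        # Activation sparsity s = n * g, equal to total / active expert parameters.
        return self.n * self.g


shape = MoELayerShape(d_model=512, d_ff_dense=2048, n=64, g=Fraction(1, 4))
print(shape.k, shape.active_params, shape.total_params, shape.sparsity)
```

Under these assumptions, the active parameter count equals that of the dense FFN regardless of $n$, which is what makes comparisons across expert counts FLOP-matched.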

Figure 1: MoE layer schematic. Each token traverses shared self-attention, is routed to high-affinity experts (possibly of heterogeneous sizes), and the expert outputs are combined. Generalist (G) modules are always active (not routed); heterogeneity can arise from mixing expert granularities within a layer.
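The routing step in the schematic can be sketched in a few lines. This is a minimal token-choice top-$k$ router, assuming softmax scores that are renormalized over the selected experts and a weighted sum of expert outputs; the paper's exact router may differ, and the function names are hypothetical.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def softmax(logits: Sequence[float]) -> List[float]:
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(router_logits: Sequence[float], k: int) -> List[Tuple[int, float]]:
    """Return (expert index, weight) for the k highest-scoring experts."""
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in chosen)
    # Assumption: weights are renormalized over the chosen experts.
    return [(i, probs[i] / mass) for i in chosen]


def moe_forward(
    x: Vector,
    router_logits: Sequence[float],
    experts: Sequence[Callable[[Vector], Vector]],
    k: int,
) -> Vector:
    """Combine the outputs of the selected experts by weighted summation."""
    out = [0.0] * len(x)
    for idx, weight in top_k_route(router_logits, k):
        y = experts[idx](x)
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out


# Toy experts that simply scale their input by a constant.
experts = [lambda v, s=s: [s * vi for vi in v] for s in (1.0, 2.0, 3.0, 4.0)]
print(moe_forward([1.0, 2.0], [0.0, 0.0, 5.0, 5.0], experts, k=2))
```

Only the $k$ selected experts are evaluated per token, so per-token compute stays fixed as the number of experts grows.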

At all scales, language modeling performance is evaluated using macro-average cross-entropy on diverse held-out domains; ablation analyses are conducted on downstream tasks (e.g., BoolQ, HellaSwag, MMLU) as well. The models are pretrained on a multilingual, high-quality corpus following Chinchilla guidelines for data-to-parameter ratio, with all hyperparameters precisely documented for rigorous comparison.
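To make the metric concrete, here is a small sketch contrasting a macro average (every domain counts equally) with a micro average (every sample counts equally). It assumes per-sample cross-entropy values are already grouped by domain; the domain names and numbers are made up for illustration, and the paper's exact averaging unit is not specified here.

```python
from statistics import fmean
from typing import Dict, List


def macro_average_loss(per_domain_losses: Dict[str, List[float]]) -> float:
    """Mean of per-domain mean losses: each domain has equal weight."""
    return fmean(fmean(losses) for losses in per_domain_losses.values())


def micro_average_loss(per_domain_losses: Dict[str, List[float]]) -> float:
    """Mean over all samples: large domains dominate."""
    return fmean(x for losses in per_domain_losses.values() for x in losses)


# Hypothetical held-out losses; "web" has many more samples than the others.
losses = {"web": [2.0] * 6, "code": [1.0], "math": [1.5]}
print(macro_average_loss(losses))  # 1.5
print(micro_average_loss(losses))  # 1.8125
```

The macro average prevents one large domain from masking regressions on smaller ones.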

Key Empirical Findings

1. Performance Is Predominantly Parameter-Driven

Across all activation scales $\geq 50$M, MoEs consistently and substantially outperform parameter-matched dense baselines so long as the total (inactive) parameter budget is maximized, primarily by increasing $n$ at nearly fixed $g$ (Figures 2, 8, 9, 11). This effect subsides only at very small compute, where dense models saturate data fit before sparsity provides leverage (Figures 14, 18, 19).

Figure 2: Validation performance scales monotonically with total expert parameter count, holding active parameters fixed.

Figure 3: Scaling trends for 50-110M active, up to 1.4B total parameters reinforce the dominance of total parameter count over other settings.

Notably, expert granularity $g$ (i.e., the size of each expert) should remain coarse (e.g., $1/4$ to $1/2$ of dense width), only slowly increasing with extreme activation sparsity, rather than becoming ultra-fine-grained (Figures 2, 8, 20, 21). Increasing $n$ at near-constant $g$ provides the best utilization of memory for performance uplift.

2. Expert Heterogeneity and Generalists Are Ineffective or Detrimental

Introducing expert heterogeneity (mixed granularities within a layer) or always-active generalist layers does not confer any systematic improvement over the best homogeneous MoE at fixed memory/compute. Heterogeneous MoEs interpolate between the best matching homogeneous settings but provide negligible or no gains (Figures 6, 12, 13).

Figure 4: Heterogeneous expert pools underperform relative to the envelope traced out by homogeneous configurations at the same sparsity and parameter budget.

Adding a generalist FFN module consistently reduces validation accuracy, even when allocated considerable width (Figure 5):

Figure 5: The inclusion of a generalist module consistently hurts performance regardless of configuration.

These results suggest the architectural flexibility for mixing expert types is unnecessary or even counterproductive under optimal parameter allocation.

3. Routing and Load Balancing: Marginal Gains, Little Sensitivity

MoE performance is robust to a wide range of load balancing hyperparameters and auxiliary routing losses, requiring only coarse tuning to avoid degenerate expert loading. Dropless routing (preventing token loss due to routing overflow) provides a modest but consistent gain, particularly at lower expert counts, but as expert count increases, its impact wanes (Figure 6).

Figure 6: Dropless routing marginally outperforms default (token-dropping) routing, especially at moderate expert count.
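The difference between capacity-limited and dropless routing can be illustrated by counting overflow. The sketch below assumes the common capacity rule, capacity = ceil(capacity_factor × tokens × k / n), and treats dropless routing simply as having no capacity limit; real dropless kernels achieve this with block-sparse computation rather than an unbounded buffer. All names are hypothetical.

```python
import math
from collections import Counter
from typing import List, Optional


def expert_capacity(num_tokens: int, k: int, n: int, capacity_factor: float) -> int:
    """Per-expert slot count under the assumed capacity-factor rule."""
    return math.ceil(capacity_factor * num_tokens * k / n)


def count_dropped(
    assignments: List[List[int]], n: int, capacity_factor: Optional[float]
) -> int:
    """Count token-to-expert assignments that overflow expert capacity.

    assignments[t] lists the k experts chosen for token t.
    capacity_factor=None models dropless routing (nothing is dropped).
    """
    if capacity_factor is None or not assignments:
        return 0
    k = len(assignments[0])
    cap = expert_capacity(len(assignments), k, n, capacity_factor)
    load: Counter = Counter()
    dropped = 0
    for chosen in assignments:
        for e in chosen:
            if load[e] >= cap:
                dropped += 1  # expert is full: this assignment is dropped
            else:
                load[e] += 1
    return dropped


# Four tokens, k=2, n=4, with expert 0 heavily over-subscribed.
routing = [[0, 1], [0, 1], [0, 1], [0, 2]]
print(count_dropped(routing, n=4, capacity_factor=1.0))   # 3
print(count_dropped(routing, n=4, capacity_factor=1.25))  # 1
print(count_dropped(routing, n=4, capacity_factor=None))  # 0
```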

Poorly chosen load balancing settings (especially excessively high loss-free bias) can impair performance (Figure 7), but within moderate ranges, the sensitivity is low. These findings demote the importance of elaborate auxiliary losses or routing innovations relative to macro architectural decisions.

Figure 7: Load balancing mechanisms must be adequately tuned; very strong loss-free bias can harm results, but moderate settings are robust.
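For orientation, the two balancing mechanisms can be sketched as follows. This is not the paper's code: it assumes a Switch-Transformer-style auxiliary loss, $n \sum_i f_i P_i$, and a sign-based bias update in the style of auxiliary-loss-free balancing, where a per-expert bias used only for expert selection is nudged by a step $\gamma$. Function names are hypothetical.

```python
from typing import List


def aux_balance_loss(dispatch_fraction: List[float], mean_router_prob: List[float]) -> float:
    """Assumed auxiliary loss: n * sum_i f_i * P_i.

    f_i is the fraction of assignments sent to expert i and P_i the mean router
    probability for expert i. The value is 1.0 when both are uniform.
    """
    n = len(dispatch_fraction)
    return n * sum(f * p for f, p in zip(dispatch_fraction, mean_router_prob))


def update_loss_free_bias(bias: List[float], load: List[float], gamma: float) -> List[float]:
    """Assumed loss-free update: raise bias of under-loaded experts, lower over-loaded ones."""
    mean_load = sum(load) / len(load)

    def sign(x: float) -> int:
        return (x > 0) - (x < 0)

    return [b + gamma * sign(mean_load - c) for b, c in zip(bias, load)]


uniform = [0.25] * 4
skewed = [0.7, 0.1, 0.1, 0.1]
print(aux_balance_loss(uniform, uniform))  # 1.0
print(aux_balance_loss(skewed, skewed))    # about 2.08
print(update_loss_free_bias([0.0] * 4, [10, 2, 2, 2], gamma=1e-3))
```

Because the bias step has a fixed size, a large $\gamma$ can overcorrect from one step to the next, which is one plausible reading of why very strong settings hurt.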

4. MoE Advantage Emerges Only Above a Compute Floor

At low activation scales (10-20M active parameters), even sweeping expert count and granularity, MoE LMs underperform dense baselines under Chinchilla-optimal token count (Figures 14-17). Only by increasing the data budget (by 5-20$\times$), thus matching the compute of larger models, do MoEs begin to surpass dense models. This aligns the empirical regime with scaling law predictions.

Figure 8: At 10M active, MoEs underperform at Chinchilla-optimal tokens, but win after 20$\times$ more data (compute-matched with 50M scale).

Figure 9: Similarly, at 20M active, accessing more data is required for MoEs to overtake dense models.
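A back-of-envelope check of the compute matching, using two standard rules of thumb that are assumptions here rather than the paper's own accounting: training compute $C \approx 6ND$ for $N$ parameters and $D$ tokens, and roughly 20 tokens per parameter for a Chinchilla-style budget.

```python
def chinchilla_tokens(params: int, tokens_per_param: int = 20) -> int:
    """Rule-of-thumb token budget (assumed ~20 tokens per parameter)."""
    return tokens_per_param * params


def train_flops(params: int, tokens: int) -> int:
    """Rule-of-thumb training compute, C ~ 6 * N * D."""
    return 6 * params * tokens


def data_multiplier_to_match(small_params: int, large_params: int) -> float:
    """How much more data the small model needs to match the large model's compute."""
    target = train_flops(large_params, chinchilla_tokens(large_params))
    base = train_flops(small_params, chinchilla_tokens(small_params))
    return target / base


print(data_multiplier_to_match(10_000_000, 50_000_000))  # 25.0
print(data_multiplier_to_match(20_000_000, 50_000_000))  # 6.25
```

Under these simplified assumptions the multipliers come out at the same order of magnitude as the 5-20$\times$ range reported above; the paper's parameter and token accounting may differ in detail.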

5. Fine-Grained Divisions Without Sparsity Harm Performance

Dividing FFN parameters into multiple always-active smaller components (without exploiting conditional routing or sparsity) does not improve, and often impairs, performance, further underscoring the necessity of both sparsity and overall parameter scaling for MoE benefit (Figure 10).

Figure 10: The performance boost of MoEs is not due to granularity alone; splitting dense FFNs without sparsity severely underperforms.

Implications and Theoretical Reflection

The results paint a strikingly simple picture: optimal MoE model design for a fixed compute/memory budget, and in the scaling-law regime, is almost entirely a matter of allocating as many total parameters as possible to conditional experts, maximizing expert count at a fixed, coarse granularity, and discarding most other architectural flexibility. Sophisticated tuning of auxiliary router losses, capacity factors, granularity mixing, and shared generalist modules confers minimal to negative benefit once global sparsity and parameter count are properly set.

This finding contradicts recent claims in the literature favoring intricate heterogeneity [wang2025hmoe], pervasive use of shared experts [zhao2025comprehensivescalinglawmixtureofexperts], or the necessity of highly fine-grained experts [tian2025greaterleveragescalinglaws]. Instead, the present results favor a scale-dependent, not fixed, recipe, advocating for aggressive increase in expert count as scale and budget permit. Consequently, theoretical analyses or practical recipes that tie optimal MoE configuration “constants” across very different scales may fail to generalize (cf. [misfitting]).

Practically, this enables the design of high-performing MoE LMs with minimal architecture search: select an active parameter target, choose a coarse expert granularity ($g \approx 1/4$ is near-optimal across most scales), allocate maximal expert count within the memory envelope, and adjust only minimally for router and load-balancing settings.
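The recipe can be sketched as a small sizing helper. It is a simplification under stated assumptions: only expert weights are counted (no attention, embeddings, optimizer state, or activations), each expert is a two-matrix FFN, and $k = 1/g$ experts are active per token. The function name and all arguments are hypothetical.

```python
from fractions import Fraction
from typing import Tuple


def max_experts_within_memory(
    d_model: int,
    d_ff_dense: int,
    g: Fraction,
    num_moe_layers: int,
    memory_budget_bytes: int,
    bytes_per_param: int = 2,
) -> Tuple[int, int, Fraction]:
    """Return (n, k, s): the largest expert count that fits, active experts, sparsity."""
    k = int(1 / g)                      # assumed k = 1/g coupling
    expert_width = int(g * d_ff_dense)  # assumes g * d_ff_dense is an integer
    params_per_expert = 2 * d_model * expert_width  # up + down projection only
    # Adding one expert to every MoE layer costs this many bytes of weights.
    bytes_per_expert = params_per_expert * num_moe_layers * bytes_per_param
    n = memory_budget_bytes // bytes_per_expert
    if n < k:
        raise ValueError("memory budget too small to hold the k active experts")
    return n, k, n * g


n, k, s = max_experts_within_memory(
    d_model=1024, d_ff_dense=4096, g=Fraction(1, 4),
    num_moe_layers=12, memory_budget_bytes=8 * 2**30,
)
print(n, k, s)  # 170 4 85/2
```

In practice the result would be rounded down to a count that divides evenly across devices; that step is omitted here.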

Future Directions

This systematic analysis opens several additional avenues:

Scaling Laws at Trillion-Parameter Regimes: How persistently do these trends extrapolate to ultra-large models? What are the practical hardware/throughput breakpoints for further increasing expert count?
Domain-Specific Experts: While heterogeneous granularity offers no consistent gain in language modeling, could domain-conditional experts for specialized modalities produce different outcomes?
Efficient Distributed Implementation: The hardware cost (especially memory) of extremely large $n$ must be balanced against model throughput and training complexity; system engineering advances (e.g., MegaBlocks routing [megablocks]) may further optimize for requested architectures.
Downstream Impact: While the paper’s downstream task ablations show some gain mirroring language modeling loss, task-specific trends and possible transfer improvements remain under-explored.

Conclusion

This paper provides strong evidence that most MoE design complexity, including expert heterogeneity, generalists, and fine-grained auxiliary routing, is unnecessary for optimal performance: maximizing total expert parameters (via count, not width) and ensuring effective routing and load balancing are sufficient, and almost entirely determine the accuracy achieved in language modeling regimes above the compute threshold. These findings should encourage both researchers and practitioners to prioritize these global configuration choices and to abandon excessive micro-architectural tuning except as required by resource constraints or hardware idiosyncrasies (2605.11689).

Explain it Like I'm 14

What is this paper about?

This paper studies a special way to build big LLMs called “Mixture of Experts” (MoE). Instead of having one giant part that processes every word, MoE uses many small “experts.” For each word, a “router” picks only a few experts to do the work. This makes training more efficient while still letting the model be very powerful.

The authors ran over 2,000 training experiments to figure out which MoE design choices really matter for getting good performance and which ones don’t.

What questions did the researchers ask?

In simple terms, they wanted to know:

How many experts should you have, and how big should each expert be?
Is it better to mix different sizes of experts or keep them all the same?
Should you include a “generalist” expert that always helps, no matter what?
How much do routing tricks (like balancing work across experts or avoiding dropped tokens) affect performance?
Can you tune these choices independently, or do they interact in complicated ways?

How did they study it?

They built many versions of MoE models and trained them on the same kind of data (a mix of web text, code, math, and encyclopedias). To make fair comparisons:

They kept the “active compute” the same across models. Think of it like giving every model the same amount of running time per word, even though some have more experts waiting on the sidelines.
They changed one thing at a time: total number of experts, the size of each expert (called “granularity”), whether experts had different sizes, whether there was a generalist expert that’s always used, and how the router balances work.
They measured performance with a standard score called “cross-entropy loss.” Lower is better.

A helpful analogy: imagine a school with many teachers (experts). For each student’s question (token), a guidance counselor (router) picks the best 1–2 teachers to answer. The school size (total experts) can be huge, but only a few teachers are active per question, keeping the time and effort per question the same.

What did they find?

Here are the main takeaways:

More total experts helps, even if most are inactive at any moment. The models got better as they added more experts and parameters overall, even when the ratio of total to active parameters was extremely large. In plain terms: a big bench of specialists to choose from is useful.
The best expert size depends mostly on how many parameters are active when a token is processed, not on the total size of the model. So pick expert sizes based on your per-token compute budget, then scale total experts up as memory allows.
Focus on two things: expert count and expert size (granularity). These two choices drive most of the performance. Other choices matter much less.
“Dropless routing” gives a consistent boost. Sometimes an expert’s “classroom” gets too full and extra tokens are dropped or sent elsewhere. Dropless routing makes sure tokens aren’t dropped, and this usually improves results.
Using a single active expert per token (top‑1 routing) underperforms. Activating two or more small experts per token tends to be better than relying on a single large expert.
Splitting the feed-forward network (FFN) into smaller pieces without sparsity hurts. If you keep all parts active all the time (no routing, no sparsity), performance gets worse. The benefit comes from smart, sparse activation, not just slicing the FFN.
“Generalist” experts that are always active don’t help and often hurt. Having a shared expert that participates for every token sounds nice, but experiments showed it usually makes things worse.
Mixing different expert sizes (heterogeneous experts) doesn’t beat a well‑tuned set of same‑size experts. Flexibility alone didn’t give improvements; the homogeneous setup performed as well or better.
Load balancing settings matter only a little—if you stay within reasonable ranges. You should avoid extreme settings that overcorrect routing, but otherwise, you don’t need to spend lots of time tuning these hyperparameters.
MoE beats dense models once you have enough compute or data. At very small scales, MoE didn’t outperform regular dense models. But with more data or bigger active model sizes, MoE consistently did better.

Why does it matter?

This study simplifies how to build strong MoE LLMs:

Put your effort into choosing the number of experts and their size. Set expert size based on your per-token compute, then add more total experts if you have memory to spare.
Turn on dropless routing.
Don’t rely on always-on generalist experts or complex mixes of different expert sizes.
Keep load balancing reasonable, but don’t overthink it.

For teams training large models, this is practical advice that can save time and resources. It suggests a clean recipe: scale total experts and pick a stable expert size, and you’ll likely get most of the benefits of MoE without diving into overly complicated configurations.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a concise list of concrete gaps and unresolved questions that future work could address:

Scaling beyond studied regimes: Do the reported trends (e.g., “increase total experts; keep granularity coarse”) hold at substantially larger active parameter counts (≥1B) and total parameters (≥10s–100s of billions or trillion-scale), or do returns saturate or reverse at very high activation sparsity $s$ and total expert counts (e.g., $n>1024$, $s>128$)?
Compute–data optimality: What are the compute-optimal trade-offs among expert count, granularity, and training tokens across data scales? How do optimal configurations shift under different data budgets and mixtures (given small models required much more data to realize MoE gains)?
Generalization across datasets: Do the conclusions persist across markedly different corpora (e.g., multilingual, speech-text, noisy web, domain-specialized biomedical/legal) and data qualities, not just the OLMoE mixture?
Downstream performance and robustness: How do configuration choices translate to downstream tasks (reasoning, coding, instruction-following), in-context learning, and robustness? The paper emphasizes validation cross-entropy and notes variance on downstream tasks without a systematic analysis.
Inference efficiency and systems impact: What are the real-world throughput, latency, memory bandwidth, and communication costs of increasing $n$ (with constant active parameters) across hardware backends (GPU, TPU) and expert-sharding strategies? Does dropless routing maintain system efficiency at scale?
Router algorithm breadth: How do alternative routing strategies (expert-choice, optimal transport/Sinkhorn, temperature/noise schedules, top-$k$ beyond the $k=1/g$ coupling, per-sequence/document routing) affect quality, stability, and balance under FLOP-matched comparisons?
Decoupling active expert count from granularity: What is the optimal $k$ (active experts) at fixed active parameter count when $k$ and per-expert size are varied independently (rather than enforcing $k\cdot g=1$)? Does an optimal $k$ emerge beyond 1–2 across scales?
Capacity management: How do capacity factor choices and alternative overflow handling interact with dropless routing, particularly at very high $n$, and what are the accuracy–efficiency trade-offs?
Load-balancing mechanisms and schedules: Can adaptive or learned schedules for auxiliary loss weight and loss-free bias $\gamma$ (rather than fixed $\{10^{-4},10^{-2}\}$ and $\{0,10^{-3}\}$) improve stability and performance, especially for large $n$ where fixed $\gamma$ appears too coarse?
Stability and router regularization: What is the systematic effect of Z-loss, entropy penalties, temperature annealing, or noise injection on router stability across scales, beyond preliminary or appendix-only ablations?
Expert heterogeneity search: Does more expressive heterogeneity (more than two pools, unequal compute splits, per-layer heterogeneity, learned/dynamic expert sizing, or continuous parameterization) provide benefits that the restricted two-pool, equal-split design may have missed?
Generalists under broader conditions: Are shared experts beneficial under other regimes (e.g., larger scales, multi-task mixtures, instruction tuning, different placement or warm-start/distillation of generalists, or multiple smaller generalists)? Current negative results may be configuration- or scale-specific.
Layerwise MoE placement: How do optimal expert count and granularity vary by layer depth, and does selective MoE placement (early vs middle vs late layers) outperform uniform placement?
Attention vs FFN sparsity: What are the effects of applying MoE to attention (or hybrid MoE across FFN and attention) on quality/efficiency relative to FFN-only MoE?
Expert specialization analysis: How do routing patterns and expert specialization evolve (e.g., by domain/token types), how stable are they across seeds, and do they correlate with performance? Can this inform adaptive reallocation or pruning/merging of experts?
Training duration and convergence: Do optimal configurations shift with much longer training (near-convergence) vs the reported budgets? Are observed trends transient early-training effects?
Optimizer and regularization sensitivity: How do optimizer choices, learning-rate schedules, weight decay, dropout, and normalization strategies specifically affect routers and experts across configurations?
Fine-grained experts at larger actives: The paper recommends relatively coarse experts (e.g., $g\approx 1/4$) at studied scales. Is there a reproducible functional relationship between active parameter count and optimal $g$ that can be formalized as a scaling law?
Tokenization and sequence length: How do tokenization schemes and sequence length/batch-shape (which affect per-step load variance) impact load imbalance, dropless benefits, and optimal capacity factors?
Fractional/continuous granularity: Does allowing non-dyadic granularity (beyond $\{1,1/2,1/4,\dots\}$) or continuous expert-width allocation yield additional gains under the same compute/memory budgets?
Multi-expert communication: Do mechanisms for cross-expert communication/aggregation (beyond weighted summation per token) improve learning and stability, especially at high $n$?
Safety and alignment considerations: How do MoE configuration choices impact harmful content generation, calibration, or controllability post-alignment, and do shared experts or heterogeneity affect safety tuning?
Reproducibility across infrastructures: Are results consistent across different training stacks (DeepSpeed, Megablocks variants), cluster topologies, and sharding strategies, given their potential to influence token dropping and utilization?

These gaps outline actionable directions to test the robustness, scalability, and practical utility of the proposed “simpler recipe” for configuring MoE models.

Practical Applications

Overview

This paper provides a large-scale, systematic exploration of Mixture-of-Experts (MoE) design choices across >2,000 pretraining runs (up to 6.6B total params). The central, actionable findings are:

Quality increases monotonically with total MoE parameters (inactive expert capacity), even at very high activation sparsity.
The optimal expert size (granularity) is largely determined by the active parameter scale, not the total parameter count.
Other knobs (shared/generalist experts, heterogeneous expert sizes, most load-balancing settings) have small effects; dropless routing yields a consistent, small gain.
Top-1 (single-expert) routing underperforms; MoE benefits do not materialize at very small active scales unless the data/compute budget is increased.

Below are concrete applications, grouped by immediacy, including sectors, potential tools/workflows, and feasibility assumptions.

Immediate Applications

These applications can be deployed now using existing frameworks (e.g., DeepSpeed-MoE, Megatron-Core, Fairseq-MoE, MegaBlocks) and standard GPU clusters.

Production-ready MoE configuration recipe and HPO reduction
- What to do: Narrow the search to only two primary knobs—total expert count n and expert granularity g—and set k (active experts) via k = 1/g. Use top-2-or-more routing (avoid top-1). Enable dropless routing. Avoid shared “generalist” experts and heterogeneous expert sizing by default. Keep load-balancing weights within reasonable ranges (e.g., αLB ~ 1e-2 to 1e-4) and avoid overly large loss-free bias at high n.
- Practical defaults by active parameter scale (illustrative from the paper’s trends):
  - ~50M active params: g in [1/4, 1/2]
  - ~110M active params: g in [1/8, 1/4]
  - ~300M active params: g ≤ 1/8
- Increase n as memory allows; expect quality to rise with total parameters for fixed FLOPs.
- Sectors: Software/AI platforms, cloud providers, model vendors, enterprise ML teams.
- Tools/workflows: “MoE Configurator” (script or UI) that takes active-parameter target, VRAM per device, number of devices, kernel availability (dropless), and outputs (n, g, k) with safe LB settings and router defaults. Integrate as a template in your training pipeline.
- Assumptions/dependencies: Block-sparse/dropless kernels available; FLOP-matched comparisons; results validated up to ~300M active and 6.6B total parameters on the reported data mixture.
Capacity gains at fixed training FLOPs (latency/cost budgets)
- What to do: When compute-per-token is fixed (e.g., latency or energy constraints), increase total expert count n (and thus total parameters) to improve quality without raising per-token FLOPs.
- Sectors: Customer support/chatbots, code assistants, search and ads, content moderation.
- Tools/products: SKU variants of your LLM (same latency target, varying total n for higher quality tiers); “memory-for-quality” pricing tiers in model serving.
- Assumptions/dependencies: Sufficient GPU memory or model-parallelism bandwidth to host more (inactive) experts.
Adopt dropless routing to reduce token loss and improve stability/quality
- What to do: Replace capacity-factor token drops with dropless routing (e.g., MegaBlocks). Gains are small but consistent across scales; parity at extremely high n.
- Sectors: Any production MoE deployment; regulated domains (where dropped tokens could be problematic).
- Tools/workflows: Kernel swap-in; router instrumentation for drop-rate to verify elimination; A/B tests.
- Assumptions/dependencies: Availability of dropless kernels; engineering integration cost.
GPU procurement and capacity planning with activation sparsity s
- What to do: Use s = n * g as a simple planning metric. Given a latency/FLOP budget (fixed k = 1/g), scale quality by adding total experts n until memory is saturated.
- Sectors: Cloud/HPC, finance/quant firms, AI infrastructure teams.
- Tools/workflows: Procurement calculators that translate budgeted FLOPs and VRAM into (n, g, k) with expected quality improvements; scheduler policies that prioritize VRAM for higher-n models when compute is fixed.
- Assumptions/dependencies: Memory constraints dominate; communication overhead manageable with your parallelism strategy.
Simplify architectures by removing always-on generalists and heterogeneous expert pools
- What to do: Default to homogeneous, routed-only experts. Paper shows generalists and heterogeneity do not beat well-configured homogeneous MoE at matched budgets.
- Sectors: All model builders (reduces engineering complexity, failure modes).
- Tools/workflows: De-scope feature flags for shared experts/heterogeneity; focus QA/testing on the simpler recipe.
- Assumptions/dependencies: Your tasks are similar to the study’s (LM pretraining, diverse text mixture). Some niche multi-task regimes may still benefit from shared experts—validate locally.
Safer router defaults
- What to do: Avoid top-1 routing; default to top-2-or-more. Keep LB weight in a modest range; avoid large loss-free bias at high n.
- Sectors: Any MoE user; especially new adopters who lack router expertise.
- Tools/workflows: Router presets; monitoring dashboards for load imbalance, expert utilization, and router loss.
- Assumptions/dependencies: Same router class and token-choice scheme as common MoE stacks.
When NOT to use MoE at very small scales
- What to do: If active parameters are ~10–20M and data/compute budget is small, a dense model may outperform MoE. If you must use MoE, increase the data/compute budget to realize benefits.
- Sectors: Academia/education, startups with very limited budgets.
- Tools/workflows: Budget checklists; decision tree in training docs (“dense vs MoE”).
- Assumptions/dependencies: Your scale is similar to those tested; task/data regimes comparable.
Reporting and evaluation standards that improve reproducibility and governance
- What to do: Report active parameters, total parameters, activation sparsity s, router drop rate (if any), load-imbalance metrics, and whether dropless routing was used. Compare models at matched FLOPs.
- Sectors: Academia, open-source communities, policy/AI governance.
- Tools/workflows: Model cards and eval templates; CI checks to enforce reporting; logs for router stats.
- Assumptions/dependencies: Organizational willingness to standardize metrics; availability of the paper’s code to copy instrumentation patterns.
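
The reporting item above could be backed by a small record type such as the one below. This is only a sketch: the field names are hypothetical, and the load-imbalance metric (maximum expert load divided by mean expert load) is one simple choice among several, not one prescribed by the paper.

```python
from collections import Counter
from dataclasses import asdict, dataclass
from typing import Iterable


@dataclass
class MoEReport:
    """Hypothetical model-card fields for an MoE run."""
    active_params: int
    total_params: int
    activation_sparsity: float  # s = n * g
    dropless: bool
    drop_rate: float            # dropped assignments / all assignments
    load_imbalance: float       # max expert load / mean expert load (assumed metric)


def load_imbalance(expert_ids: Iterable[int], n: int) -> float:
    """1.0 means perfectly balanced; n means every token went to one expert."""
    counts = Counter(expert_ids)
    loads = [counts.get(e, 0) for e in range(n)]
    total = sum(loads)
    if total == 0:
        raise ValueError("no routing assignments recorded")
    return max(loads) / (total / n)


# Made-up routing log: expert index chosen for each assignment.
routed = [0, 0, 0, 1, 2, 3, 0, 1]
report = MoEReport(
    active_params=50_000_000,
    total_params=800_000_000,
    activation_sparsity=16.0,
    dropless=True,
    drop_rate=0.0,
    load_imbalance=load_imbalance(routed, n=4),
)
print(asdict(report))
```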

Long-Term Applications

These require additional research, systems development, or ecosystem maturation, but are grounded in the paper’s findings.

Memory-first accelerator and system co-design for MoE
- Idea: Hardware and runtime stacks that prioritize large parameter storage and low-overhead activation of a small subset per token. Expect substantial gains since quality rises with total parameters at fixed FLOPs.
- Sectors: Semiconductors, cloud hardware, systems software.
- Potential products: MoE-optimized memory fabrics, near-memory compute for expert FFNs, router-aware interconnect scheduling.
- Dependencies/assumptions: Vendor kernel support for dropless sparse ops; network/memory bandwidth optimized for many small FFNs.
Elastic MoE: runtime adaptation of n, k, and g under resource and latency constraints
- Idea: Dynamically adjust the number of experts activated and/or total experts loaded based on VRAM headroom, latency SLOs, or user tier to maximize quality at runtime.
- Sectors: Cloud serving, consumer apps with variable device capabilities.
- Potential products: “Elastic Experts” serving tier; autoscalers that add/remove inactive experts on the fly.
- Dependencies/assumptions: Fast expert swapping/streaming; router stability under changing expert pools; robust caching/quantization.
AutoMoE controllers that learn to pick (n, g) under budget
- Idea: Automated controllers that, given a target active-param budget, VRAM, and training time, select granularity and total experts using priors from this paper’s scaling trends.
- Sectors: AutoML platforms, enterprise MLOps.
- Potential products: Planner that outputs a training curriculum and a minimal HPO sweep for n only; budget-to-quality estimators.
- Dependencies/assumptions: Transferability of the paper’s trends to new corpora and larger scales.
Revisiting heterogeneity and shared experts with new objectives
- Idea: Although the paper finds no gains in general, specialized regimes (multi-task safety, tool-use, speech+text, grounded RL) may need different inductive biases and new routing/loss designs to unlock benefits.
- Sectors: Healthcare (safety-critical triage), finance (compliance), robotics (multi-modal control).
- Potential products: Domain-aware router losses, expert-regularization for cross-task transfer, safety-specialist experts with calibrated routing.
- Dependencies/assumptions: New training objectives; careful auxiliary-loss tuning; robust evaluation to avoid overfitting artifacts.
Edge/on-device MoE via expert paging, quantization, and distillation
- Idea: Store many experts off-device and page/stream/quantize them; keep active compute low per token. Alternatively, distill many-expert MoEs into edge-suitable students.
- Sectors: Mobile, IoT, robotics.
- Potential products: Expert caches, cold-start expert prefetchers, MoE-to-dense student distillation pipelines.
- Dependencies/assumptions: Efficient storage, streaming, and quantization; predictable routing to prefetch correctly.
MoE-aware safety, robustness, and fairness auditing
- Idea: Build auditing suites that measure expert-level specialization, load balance by demographic inputs, and the impact of dropless routing on failure modes.
- Sectors: Policy/governance, regulated industries.
- Potential products: Router fairness dashboards; expert-behavior probes; safety regression tests by expert assignment.
- Dependencies/assumptions: Access to router logits/assignments; privacy-preserving logging.
Policy standards for energy/compute reporting that reflect MoE realities
- Idea: Require reporting both active and total parameters, activation sparsity s, dropless/dropped-token stats, and FLOP-matched evaluations in model cards and regulatory filings.
- Sectors: Government, standards bodies, publishers.
- Potential products: Certification checklists; standardized compute/quality disclosures; benchmarks that fix per-token FLOPs.
- Dependencies/assumptions: Community consensus; tooling to verify claims; alignment with broader AI transparency efforts.
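The reporting items named in the last proposal (active and total parameters, activation sparsity s, dropped-token statistics) can be sketched as a small record. This is only an illustration, not a format proposed by the paper: `MoELayerSpec`, `disclosure`, and the field names are hypothetical, and the parameter count assumes a plain two-matrix FFN per expert with no gating and no biases.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MoELayerSpec:
    """Hypothetical description of one MoE layer (names are illustrative)."""
    d_model: int    # transformer hidden size
    d_expert: int   # intermediate FFN dimension of a single expert
    n_experts: int  # total routed experts in the layer
    k_active: int   # experts activated per token

    def params_per_expert(self) -> int:
        # Assumption: up-projection + down-projection, no biases, no gating.
        return 2 * self.d_model * self.d_expert

    def total_ffn_params(self) -> int:
        return self.n_experts * self.params_per_expert()

    def active_ffn_params(self) -> int:
        return self.k_active * self.params_per_expert()

    def activation_sparsity(self) -> float:
        # Total FFN parameters relative to active FFN parameters.
        return self.total_ffn_params() / self.active_ffn_params()


def disclosure(spec: MoELayerSpec, slots_routed: int, slots_dropped: int) -> dict:
    """Collect the per-layer quantities a model card might report."""
    return {
        "total_ffn_params": spec.total_ffn_params(),
        "active_ffn_params": spec.active_ffn_params(),
        "activation_sparsity": spec.activation_sparsity(),
        "dropped_fraction": slots_dropped / slots_routed if slots_routed else 0.0,
    }


if __name__ == "__main__":
    spec = MoELayerSpec(d_model=512, d_expert=1024, n_experts=64, k_active=2)
    print(disclosure(spec, slots_routed=10_000, slots_dropped=0))
```

Under dropless routing the dropped fraction is zero by construction, so the field mainly matters for capacity-limited routers.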

Notes on Assumptions and Dependencies Across Applications

External validity: Results are established up to ~300M active and ~6.6B total parameters, on a particular data mixture and token-choice routing. Extrapolation to trillion-parameter total scales or very different data distributions should be validated.
Systems preconditions: Dropless routing depends on block-sparse kernels (e.g., MegaBlocks) and compatible frameworks; parallelism and communication overheads can limit scalability as n grows.
Memory as a first-class constraint: Gains rely on hosting more total experts. If VRAM or memory bandwidth is tight (edge/embedded), benefits may be constrained without advanced offloading or quantization.
Hyperparameter ranges: Load-balancing settings are robust within “reasonable” ranges but can hurt at high n if loss-free bias is too large; routers and kernels should be tested per deployment.
Small-scale caveat: At very small active-parameter scales with limited data/compute budgets, MoE may underperform dense models; either increase the budget or prefer a dense model.
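The memory note above can be made concrete with a back-of-the-envelope estimate of how many experts per layer fit in a given memory budget. The sketch below is a rough illustration under stated assumptions: the function name and all parameters are hypothetical, each expert is a two-matrix FFN, and only expert weights are counted (optimizer state, activations, and KV cache are folded into `reserved_bytes`).

```python
def max_experts_per_layer(vram_bytes, reserved_bytes, n_moe_layers,
                          d_model, d_expert, bytes_per_param):
    """Largest per-layer expert count whose weights fit in the remaining memory.

    Assumption: every MoE layer holds the same number of equally sized experts.
    """
    # Weights of one expert: up-projection plus down-projection.
    per_expert_bytes = 2 * d_model * d_expert * bytes_per_param
    available = vram_bytes - reserved_bytes
    if available <= 0:
        return 0
    return int(available // (n_moe_layers * per_expert_bytes))


if __name__ == "__main__":
    GiB = 2 ** 30
    # 16 GiB device, 4 GiB reserved for everything except expert weights.
    n = max_experts_per_layer(16 * GiB, 4 * GiB, n_moe_layers=12,
                              d_model=1024, d_expert=2048, bytes_per_param=2)
    print(n)
```

Lowering `bytes_per_param` (for example through quantization) or offloading experts raises the count that fits, which is the lever the note refers to.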


Glossary

Active expert count: The number of experts activated per token in an MoE layer; affects compute and memory cost independently of expert size. "Active expert count is independent of expert size."
Activation sparsity: A measure of how many total FFN parameters exist relative to the active FFN parameters, defined as $s = n \cdot g$. "We hold MoE activation sparsity $s$ fixed for $s\in\{2, 4, 8, 16, 32, 64\}$."
Affinity: A score measuring how well a token matches an expert, used by routers to choose experts. "compute an affinity between each (hidden state, expert) pair."
Auxiliary loss: An additional loss term added to the primary task loss to encourage behaviors like load balancing. "Auxiliary losses are a direct and explicit tool to induce load balancing."
Block-sparsity: A sparsity pattern that enables efficient computation by operating on blocks; used to avoid dropping tokens. "Dropless routing (MegaBlocks) uses block-sparsity to ensure that no tokens are dropped."
Capacity factor: A multiplicative buffer on each expert’s token capacity to mitigate temporary load imbalance. "A common capacity factor is 2."
Dropless routing: A routing strategy ensuring no tokens are discarded even under load imbalance, typically via block-sparse computation. "Dropless routing (MegaBlocks) uses block-sparsity to ensure that no tokens are dropped."
Expansion factor: An alternative (but overloaded) name for activation sparsity; sometimes also used for FFN width expansion. "This is sometimes referred to as expansion factor, but this term is overloaded and also used to refer to FFN dimension expansion."
Feed-forward network (FFN): The position-wise MLP sublayer in a Transformer that is replaced by experts in MoE layers. "replace the feed-forward network (FFN) in some or all transformer layers."
Fine-grained experts: Experts with reduced intermediate FFN dimension to increase the number of experts at fixed compute. "more recent work has explored fine-grained experts with decreased intermediate FFN dimension"
FLOP-matching: Controlling configurations so they use the same floating-point operations per step, enabling fair comparisons. "To FLOP-match all configurations on a per-timestep basis, we match active FFN parameters at each model scale."
FLOPs-per-token: The computational cost measured in floating-point operations required to process each token. "Increased active expert count incurs a higher FLOPs-per-token compute cost and memory usage"
Generalist (shared expert): An always-active expert whose output is combined with routed experts for all inputs. "Also referred to as shared experts, generalists are experts which are always activated, for all inputs."
Heterogeneous experts: An MoE configuration where experts differ in size or granularity within the same layer. "heterogeneous MoE layers consisting of experts with different granularity"
Load balancing auxiliary loss: A specific auxiliary loss that penalizes uneven routing so tokens are spread across experts. "A simple load balancing auxiliary loss (Shazeer et al., 2017) penalizes overreliance on expert $E_i$ in each batch"
Loss-free load balancing: A mechanism that balances loads by adjusting per-expert biases instead of adding a loss term. "A loss-free mechanism avoids the challenge of balancing loss terms, though it does introduce other inductive biases to the LM."
Mixture of Experts (MoE): An architecture with many expert submodules where a router activates a subset per input to decouple capacity from compute. "Mixture of Experts (MoE) models decouple computational overhead from model capacity by adding experts which are conditionally activated by a learned router."
Router: The component that selects which experts to activate for each token based on affinity scores. "At each timestep, the router for a particular MoE layer selects a subset of experts to activate."
Router Z-loss: An auxiliary loss designed to stabilize router training, particularly at larger scales. "introduce an additional auxiliary loss, the Z-loss, designed to address instability in router training, especially at larger model scales:"
Token choice: A routing algorithm that selects experts independently for each token based on highest affinities. "One of the most common is token choice"
Token dropping: Discarding tokens that exceed an expert’s capacity during routing, passing them to the next layer without expert processing. "These remaining overflow tokens may be routed to the next best available expert, or they may be dropped altogether, passed directly to the next layer."
Top-1 routing: A routing regime where exactly one expert is activated per token, often underperforming compared to multi-expert activation. "Top-1 routing underperforms in FLOP-matched settings"
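Several glossary terms (affinity, token choice, capacity factor, token dropping, dropless routing, load balancing auxiliary loss, router Z-loss) describe steps of one routing pipeline, which the following pure-Python sketch ties together. It is an illustration under assumptions, not the paper's implementation: all function names are hypothetical, affinities are taken to be a softmax over router logits, the balancing loss uses a Switch-Transformer-style form rather than the formula the paper quotes, and dropless routing is modelled simply as "no capacity limit" without the block-sparse kernels a real system would need.

```python
import math


def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def expert_capacity(num_tokens, n_experts, k, capacity_factor):
    # Even share of token slots per expert, scaled by the capacity factor.
    return math.ceil(capacity_factor * num_tokens * k / n_experts)


def token_choice_route(logits, k, capacity=None):
    """Route each token to its k highest-affinity experts.

    capacity=None models dropless routing (no slot is ever discarded);
    otherwise slots beyond an expert's capacity are dropped.
    Returns (assignments, dropped_slots).
    """
    n_experts = len(logits[0])
    load = [0] * n_experts
    assignments, dropped = [], 0
    for row in logits:
        probs = softmax(row)
        top_k = sorted(range(n_experts), key=lambda e: probs[e], reverse=True)[:k]
        kept = []
        for e in top_k:
            if capacity is None or load[e] < capacity:
                load[e] += 1
                kept.append(e)
            else:
                dropped += 1
        assignments.append(kept)
    return assignments, dropped


def load_balancing_loss(logits, k):
    """Switch-Transformer-style loss: n * sum_i f_i * P_i (1.0 when balanced)."""
    n_experts, num_tokens = len(logits[0]), len(logits)
    assignments, _ = token_choice_route(logits, k)  # choices before any capacity limit
    f = [0.0] * n_experts  # fraction of token slots sent to each expert
    p = [0.0] * n_experts  # mean router probability per expert
    for row, chosen in zip(logits, assignments):
        for e in chosen:
            f[e] += 1.0 / (num_tokens * k)
        for e, prob in enumerate(softmax(row)):
            p[e] += prob / num_tokens
    return n_experts * sum(fi * pi for fi, pi in zip(f, p))


def router_z_loss(logits):
    """Mean squared log-sum-exp of the router logits."""
    total = 0.0
    for row in logits:
        m = max(row)
        lse = m + math.log(sum(math.exp(x - m) for x in row))
        total += lse ** 2
    return total / len(logits)
```

Routing a batch in which every token prefers the same expert, once with a capacity limit and once with `capacity=None`, shows the difference between token dropping and dropless routing; the balancing loss for that batch is also larger than for a batch spread evenly across experts.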
