Quantitative metrics and principled selection for optimal specialization trade-off

Develop quantitative metrics to characterize when the number of experts n and the top-K selection in Mixture-of-Experts models should be considered "large" or "small", and determine a principled method to identify the optimal trade-off between expert specialization and collaboration across different model configurations.

Background

The ERC loss introduces a tunable parameter α that controls the strength of expert–router coupling and thereby the degree of expert specialization. Experiments show that overly strict specialization can reduce downstream performance, which indicates a trade-off between specialization and effective collaboration among experts. The authors note that there are currently no quantitative metrics for deciding when the number of experts (n) and the top-K selection count (K) are effectively "large" or "small" across different models, and no automated procedure for determining the optimal degree of specialization, so these choices remain largely empirical.
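To make "largely empirical" concrete, the sketch below shows the kind of brute-force sweep over α that such tuning usually amounts to: evaluate each candidate and keep the best-scoring one. It is only an illustration, not the authors' procedure. The names `select_alpha` and `toy_evaluate`, the candidate α values, and the toy scoring curve are all hypothetical; in practice `evaluate` would train a model with the ERC loss at the given α and return a downstream metric.

```python
from typing import Callable, Dict, Iterable, Tuple


def select_alpha(
    alphas: Iterable[float],
    evaluate: Callable[[float], float],
) -> Tuple[float, Dict[float, float]]:
    """Return the candidate alpha with the highest score, plus all scores.

    `evaluate` is a hypothetical callback: it should train or load a model
    with the given alpha and return a downstream metric (higher is better).
    """
    scores = {alpha: evaluate(alpha) for alpha in alphas}
    if not scores:
        raise ValueError("alphas must contain at least one candidate")
    # Ties are resolved in favor of the first candidate encountered.
    best_alpha = max(scores, key=scores.get)
    return best_alpha, scores


def toy_evaluate(alpha: float) -> float:
    # Hypothetical stand-in for a real training run: a score that peaks at
    # an intermediate alpha, mimicking a specialization/collaboration
    # trade-off. The shape and the peak location are made up.
    return 1.0 - (alpha - 0.5) ** 2


# Placeholder candidate grid; real values depend on how alpha is defined.
best_alpha, scores = select_alpha([0.0, 0.25, 0.5, 0.75, 1.0], toy_evaluate)
print(best_alpha, scores)
```

A sweep like this has to be repeated for every (n, K) configuration, which is exactly the cost that a quantitative characterization of "large" and "small" n and K would help avoid.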

References

However, we currently lack quantitative metrics to characterize "large" or "small" n and K across different models; as a result, determining the optimal trade-off remains largely empirical.

Coupling Experts and Routers in Mixture-of-Experts via an Auxiliary Loss (2512.23447 - Lv et al., 29 Dec 2025) in Section 4.3 (The ERC loss is an effective tool for exploring expert specialization): The optimal specialization degree