Designing an effective MoE load balancing loss
Develop a load balancing loss for mixture-of-experts (MoE) router training with top-K expert selection. The loss should simultaneously encourage efficiency and expert specialization without hindering expressivity, addressing the open design challenge of balancing expert token allocations during training.
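As a point of reference, the sketch below shows the widely used auxiliary load-balancing loss in the style of GShard/Switch Transformers, generalized to top-K routing. It is a baseline illustration rather than a solution to the design challenge posed above; the function and variable names are illustrative assumptions, not from the source.

```python
# Minimal sketch (assumed baseline, not the paper's method): the standard
# auxiliary load-balancing loss L = E * sum_i f_i * P_i, where f_i is the
# fraction of token-expert assignments dispatched to expert i and P_i is the
# mean router probability placed on expert i. It is minimized when routing
# is uniform across experts.
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor, k: int) -> torch.Tensor:
    """router_logits: (num_tokens, num_experts) pre-softmax router scores."""
    num_tokens, num_experts = router_logits.shape
    probs = F.softmax(router_logits, dim=-1)          # (T, E) router probabilities
    topk_idx = probs.topk(k, dim=-1).indices          # (T, K) experts selected per token
    # f_i: fraction of all token-expert assignments routed to expert i.
    dispatch_mask = torch.zeros_like(probs).scatter_(1, topk_idx, 1.0)
    f = dispatch_mask.sum(dim=0) / (num_tokens * k)
    # P_i: mean router probability assigned to expert i (differentiable term).
    P = probs.mean(dim=0)
    return num_experts * torch.sum(f * P)
```

In this formulation the dispatch fractions `f` are non-differentiable counts, so the gradient flows through the mean probabilities `P`; a typical training objective adds this term to the language-modeling loss with a small coefficient.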
References
Devising an effective load balancing loss that simultaneously encourages efficiency and expert specialization, without hindering expressivity, remains an open design challenge that has driven much of the recent progress in MoEs~\citep{moe_sem_2, moe_mod_2_switch_1exp, moe_lbl_evo1, moe_lbl_evo2_zloss, moe_lbl_evo3, moe_deepseek_moe, moe_deepseek_demon, moe_similar_olmoe}.
— ShinkaEvolve: Towards Open-Ended and Sample-Efficient Program Evolution (arXiv:2509.19349, Lange et al., 17 Sep 2025), Appendix, Section “Mixture-of-Experts Load Balancing Loss”