Designing an effective MoE load balancing loss

Develop a load balancing loss for mixture-of-experts router training with top-K expert selection. The loss should simultaneously encourage efficiency and expert specialization without hindering expressivity, addressing the open design challenge of balancing expert token allocations during training.

Background

Mixture-of-experts (MoE) models route tokens to a subset of experts per layer, and the router requires an auxiliary load balancing loss (LBL) to prevent collapse and promote balanced expert usage. Small changes in LBL design can substantially impact efficiency and specialization.
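For concreteness, the sketch below shows one common baseline that such a design would compete against: a Switch-Transformer-style auxiliary LBL extended to top-K routing, written in PyTorch. The function name, the normalization of the dispatch fractions by top_k, and the suggested loss coefficient are illustrative assumptions, not the formulation studied or proposed in the paper.

```python
import torch
import torch.nn.functional as F

def switch_style_lbl(router_logits: torch.Tensor, top_k: int) -> torch.Tensor:
    """Switch-Transformer-style auxiliary load balancing loss for top-K routing.

    router_logits: [num_tokens, num_experts] raw router scores.
    Returns a scalar that equals 1 under perfectly uniform routing and grows
    as expert usage becomes imbalanced.
    """
    num_experts = router_logits.shape[-1]
    probs = torch.softmax(router_logits, dim=-1)         # router probabilities
    _, top_idx = torch.topk(probs, k=top_k, dim=-1)      # top-K expert choices per token
    # f_i: fraction of routed (token, slot) pairs dispatched to expert i (hard counts)
    dispatch = F.one_hot(top_idx, num_experts).sum(dim=1).float()  # [num_tokens, num_experts]
    f = dispatch.mean(dim=0) / top_k
    # P_i: mean router probability mass placed on expert i (soft, differentiable signal)
    p = probs.mean(dim=0)
    # Dot product of hard and soft usage, scaled so perfect balance gives 1
    return num_experts * torch.sum(f * p)
```

In practice such a term is added to the task loss with a small coefficient (for example, `loss = task_loss + 0.01 * switch_style_lbl(logits, top_k=2)`); the coefficient here is only a typical order of magnitude, not a value taken from the paper.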

Despite widespread use of the global-batch LBL, the paper emphasizes that identifying an LBL that achieves balanced routing while preserving model expressivity remains unresolved. This motivates systematic exploration to devise improved LBL formulations that deliver both efficiency gains and expert specialization.
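Because the passage contrasts candidate designs with the global-batch LBL, the hedged sketch below illustrates one way its key ingredient is often implemented: expert-usage counts are aggregated over the whole global batch (across micro-batches and ranks) before the loss is formed, which relaxes the per-micro-batch balance constraint. The use of `torch.distributed.all_reduce` and the exact normalization are assumptions for illustration, not the precise recipe of the cited works.

```python
import torch
import torch.distributed as dist
import torch.nn.functional as F

def global_batch_lbl(router_logits: torch.Tensor, top_k: int) -> torch.Tensor:
    """Global-batch variant: balance statistics are synchronized across ranks.

    Assumes torch.distributed is initialized when running multi-process;
    otherwise it falls back to the local micro-batch statistics.
    """
    num_experts = router_logits.shape[-1]
    probs = torch.softmax(router_logits, dim=-1)
    _, top_idx = torch.topk(probs, k=top_k, dim=-1)
    # Local hard counts of (token, slot) pairs per expert
    counts = F.one_hot(top_idx.reshape(-1), num_experts).float().sum(dim=0)
    total = torch.tensor(float(top_idx.numel()), device=probs.device)
    if dist.is_available() and dist.is_initialized():
        dist.all_reduce(counts)   # sum expert counts over the global batch
        dist.all_reduce(total)    # total routed slots over the global batch
    f = counts / total            # globally aggregated dispatch fractions
    p = probs.mean(dim=0)         # mean router probability, kept local here
    return num_experts * torch.sum(f * p)
```

The design point this sketch highlights is that balance is only enforced in aggregate over the global batch, leaving individual micro-batches free to route domain- or topic-specific tokens unevenly, which is the behavior typically credited with improving expert specialization.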

References

Devising an effective load balancing loss that simultaneously encourages efficiency and expert specialization, without hindering expressivity, remains an open design challenge that has driven much of the recent progress in MoEs~\citep{moe_sem_2, moe_mod_2_switch_1exp, moe_lbl_evo1, moe_lbl_evo2_zloss, moe_lbl_evo3, moe_deepseek_moe, moe_deepseek_demon, moe_similar_olmoe}.

ShinkaEvolve: Towards Open-Ended And Sample-Efficient Program Evolution (2509.19349 - Lange et al., 17 Sep 2025) in Appendix, Section “Mixture-of-Experts Load Balancing Loss”