Attribution of DynaMoE performance gains given missing standard MoE baselines

Ascertain whether the reported performance improvements of DynaMoE are primarily attributable to the proposed layer-wise expert scheduling strategies and dynamic percentile-threshold routing, or whether they instead result from comparisons against Mixture-of-Experts baselines that omit standard capacity constraints and auxiliary load-balancing losses (e.g., those used in Switch Transformer, GShard top-2 routing, and expert-choice routing). Conduct controlled ablations including these established baselines to isolate the contribution of scheduling and dynamic routing from the effects of missing regularization and capacity control.
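The exact form of DynaMoE's dynamic percentile-threshold routing is not specified here, so the following is a hypothetical illustrative sketch (function name and percentile rule are assumptions, not the paper's definition): each token activates every expert whose router logit exceeds a per-token percentile threshold, so the number of active experts varies by token.

```python
import numpy as np

def percentile_route(router_logits, pct=75.0):
    """Illustrative percentile-threshold routing (a sketch, not
    DynaMoE's exact rule): for each token, activate every expert whose
    router logit exceeds the pct-th percentile of that token's logits.

    router_logits: array of shape (tokens, experts).
    Returns a boolean activation mask of the same shape; the number of
    True entries per row (active experts per token) is data-dependent.
    """
    # Per-token threshold: pct-th percentile over the expert dimension.
    thresh = np.percentile(router_logits, pct, axis=1, keepdims=True)
    return router_logits > thresh
```

A fixed top-k router would instead always activate exactly k experts per token; the point of the ablation is to test whether this adaptive activation, rather than the absence of baseline regularization, drives the reported gains.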

Background

DynaMoE introduces dynamic token-level expert activation and layer-wise expert scheduling, and is evaluated against a dense MLP and a uniform MoE variant without auxiliary load-balancing losses. Standard MoE systems such as Switch Transformer and GShard typically include capacity factors and auxiliary balancing losses to prevent load collapse and manage overflow.
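The auxiliary balancing loss mentioned above is, in the Switch Transformer formulation, the product of each expert's dispatch fraction and mean router probability, scaled by the number of experts; it reaches its minimum of 1.0 under a perfectly uniform load. A minimal sketch (function name is an assumption):

```python
import numpy as np

def switch_aux_loss(router_probs, expert_assignments, num_experts):
    """Switch Transformer-style auxiliary load-balancing loss.

    router_probs: (tokens, experts) softmax outputs of the router.
    expert_assignments: (tokens,) top-1 expert index per token.
    Returns num_experts * sum_i f_i * P_i, where f_i is the fraction of
    tokens dispatched to expert i and P_i is the mean router probability
    mass on expert i. Uniform load gives the minimum value 1.0;
    imbalanced routing is penalized with larger values.
    """
    f = np.bincount(expert_assignments, minlength=num_experts) / len(expert_assignments)
    P = router_probs.mean(axis=0)
    return num_experts * float(np.sum(f * P))
```

A baseline MoE trained without this term is free to collapse load onto a few experts, which is exactly the confound the proposed ablations are meant to rule out.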

The authors explicitly note the absence of these standard baselines and regularization mechanisms in their comparisons, which creates ambiguity about the true source of the observed gains. Establishing whether improvements stem from DynaMoE’s scheduling and routing mechanisms or from the lack of baseline regularization requires controlled experiments against standard, load-balanced MoE baselines.
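A controlled comparison would include a capacity-constrained baseline of the kind used in Switch Transformer and GShard, where each expert accepts at most `ceil(capacity_factor * tokens / experts)` tokens and overflow tokens are dropped. A minimal top-1 sketch (function name is an assumption; real systems batch this and add a residual path for dropped tokens):

```python
import numpy as np

def capacity_route(router_probs, capacity_factor=1.25):
    """Top-1 routing with a fixed expert capacity, in the style of
    Switch/GShard. Tokens beyond an expert's capacity are dropped.

    router_probs: (tokens, experts) router softmax outputs.
    Returns a (tokens,) array of assigned expert ids, with -1 marking
    overflow tokens that no expert had room for.
    """
    tokens, experts = router_probs.shape
    capacity = int(np.ceil(capacity_factor * tokens / experts))
    choice = router_probs.argmax(axis=1)       # top-1 expert per token
    counts = np.zeros(experts, dtype=int)       # slots used per expert
    out = np.full(tokens, -1, dtype=int)
    for t, e in enumerate(choice):
        if counts[e] < capacity:                # expert still has room
            out[t] = e
            counts[e] += 1
    return out
```

Comparing DynaMoE against this kind of baseline (with the auxiliary loss enabled) would isolate the contribution of its scheduling and dynamic routing from the effect of simply removing capacity control.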

Without these baselines, it is unclear whether DynaMoE's gains arise from its scheduling and routing mechanisms or simply from comparing against an MoE variant stripped of the regularization that published systems require for stability.

References

DynaMoE: Dynamic Token-Level Expert Activation with Layer-Wise Adaptive Capacity for Mixture-of-Experts Neural Networks  (2603.01697 - Gülmez, 2 Mar 2026) in Subsection "Limitations" (Section 7), Item 2: "Missing standard MoE baselines"