Attribution of DynaMoE performance gains given missing standard MoE baselines
Determine whether DynaMoE's reported performance improvements are primarily attributable to its proposed layer-wise expert scheduling strategies and dynamic percentile-threshold routing, or whether they instead result from comparisons against Mixture-of-Experts baselines that omit standard capacity constraints and auxiliary load-balancing losses. Conduct controlled ablations against established baselines that include such mechanisms (e.g., Switch Transformer, GShard top-2 routing, and expert-choice routing) to isolate the contribution of scheduling and dynamic routing from the effects of missing regularization and capacity control.
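To make the baseline-side controls concrete, the following is a minimal sketch of the two mechanisms a Switch-Transformer-style top-1 baseline would include: the auxiliary load-balancing loss, alpha * N * sum_i(f_i * P_i), and a per-expert capacity of capacity_factor * tokens / experts. It is not DynaMoE's code or any library's API. The function name `switch_baseline_stats`, the list-of-lists input format, and the default values for `capacity_factor` and `alpha` are assumptions chosen for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def switch_baseline_stats(router_logits, capacity_factor=1.25, alpha=0.01):
    """Hypothetical helper: Switch-style top-1 routing statistics.

    router_logits: one list of per-expert logits for each token.
    Returns (aux_loss, capacity, dropped_tokens).
    """
    num_tokens = len(router_logits)
    num_experts = len(router_logits[0])
    probs = [softmax(row) for row in router_logits]

    # Top-1 assignment: each token goes to its highest-probability expert
    # (ties resolve to the lowest expert index).
    assignment = [max(range(num_experts), key=p.__getitem__) for p in probs]

    # f_i: fraction of tokens dispatched to expert i.
    f = [assignment.count(i) / num_tokens for i in range(num_experts)]
    # P_i: mean router probability assigned to expert i.
    P = [sum(p[i] for p in probs) / num_tokens for i in range(num_experts)]

    # Auxiliary load-balancing loss: alpha * N * sum_i f_i * P_i.
    aux_loss = alpha * num_experts * sum(fi * pi for fi, pi in zip(f, P))

    # Expert capacity: tokens beyond this per-expert limit overflow.
    capacity = int(capacity_factor * num_tokens / num_experts)
    dropped = sum(
        max(0, assignment.count(i) - capacity) for i in range(num_experts)
    )
    return aux_loss, capacity, dropped


# Balanced routing versus routing that collapses onto one expert.
balanced = [[5.0, 0.0], [0.0, 5.0], [5.0, 0.0], [0.0, 5.0]]
skewed = [[5.0, 0.0]] * 4
print(switch_baseline_stats(balanced))
print(switch_baseline_stats(skewed))
```

A baseline that omits both terms neither penalizes the collapsed routing in the second call nor drops its overflow tokens, which is the confound the ablations are meant to remove.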
References
Without these baselines, it is unclear whether DynaMoE's gains arise from its scheduling choices or simply from MoE trained without the regularization that published systems require for stability.