Does sparse routing reduce superposition in large-scale MoE LLMs?
Establish whether the decrease in superposition observed with increased routing sparsity in toy Mixture-of-Experts models extends to large-scale MoE language models, where the router must manage many experts and diverse data distributions.
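As a minimal sketch of what "measuring superposition" could mean here (not the paper's method), one can compute standard toy-model diagnostics such as features per dimension and pairwise feature interference from an expert's encoder weights, then compare them across routing sparsity settings. The function and variable names below (measure_superposition, expert_encoders) are hypothetical, and the random matrices stand in for trained toy-MoE expert weights.

```python
# Sketch: superposition diagnostics for toy-MoE expert encoders (illustrative only).
import numpy as np

def measure_superposition(W: np.ndarray) -> dict:
    """Diagnostics for an encoder matrix W of shape (n_features, d_hidden)
    that embeds n_features into a d_hidden-dimensional space."""
    n_features, d_hidden = W.shape
    norms = np.linalg.norm(W, axis=1)            # per-feature embedding norm
    gram = W @ W.T                               # feature-feature overlaps
    off_diag = gram - np.diag(np.diag(gram))     # zero out self-overlaps
    return {
        # > 1.0 indicates more strongly represented features than dimensions
        "features_per_dim": float((norms ** 2).sum() / d_hidden),
        # mean squared interference between distinct features; 0 = no superposition
        "mean_sq_interference": float(
            (off_diag ** 2).sum() / (n_features * (n_features - 1))
        ),
    }

# Illustration with random (untrained) expert encoders; in practice these
# would be taken from toy-MoE experts trained at different top-k sparsities.
rng = np.random.default_rng(0)
n_experts, n_features, d_hidden = 4, 64, 16
expert_encoders = rng.normal(size=(n_experts, n_features, d_hidden)) / np.sqrt(d_hidden)
for e, W in enumerate(expert_encoders):
    print(f"expert {e}: {measure_superposition(W)}")
```

Under this framing, the open question is whether per-expert diagnostics like these continue to fall as routing becomes sparser once the model is a full-scale MoE LLM rather than a toy setup.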
References
However, it remains an open question whether this trend holds in large-scale LLMs, where the router must balance millions of parameters and diverse data distributions.
— The Expert Strikes Back: Interpreting Mixture-of-Experts Language Models at Expert Level
(arXiv:2604.02178, Herbst et al., 2 Apr 2026), Section 2, Related Work (MoE as a Path to Interpretability)