Persistence of the weight-activation gap at scale and across architectures

Determine whether the weight-activation gap persists when Mixture-of-Experts models are scaled to at least 1 billion parameters and when alternative architectures such as Mixtral and DeepSeek-MoE are used. The gap refers to the observed disconnect between weight-space and activation-space geometry: expert weights can be driven toward orthogonality (low weight mean squared overlap) while expert activations retain high mean squared overlap, with no significant correlation between the two quantities.
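
For concreteness, the sketch below shows one common way such an overlap metric is computed; the exact MSO definition used in the study is not reproduced here, so this assumes the mean squared pairwise cosine similarity over distinct experts, applied either to flattened per-expert weight matrices or to per-expert activation summaries.

    import torch

    def mean_squared_overlap(vectors: torch.Tensor) -> torch.Tensor:
        """Mean squared cosine similarity over all distinct pairs of rows.

        vectors: (num_experts, dim), e.g. flattened per-expert weight
        matrices, or per-expert mean activations over a batch.
        (Assumed metric; the paper's exact MSO formula may differ.)
        """
        unit = torch.nn.functional.normalize(vectors, dim=-1)  # unit-norm rows
        overlap = unit @ unit.T                                 # pairwise cosine similarities
        n = vectors.shape[0]
        off_diag = ~torch.eye(n, dtype=torch.bool, device=vectors.device)
        return overlap[off_diag].pow(2).mean()                  # drop the self-overlap diagonal

    # Illustrative usage with hypothetical shapes for 8 experts:
    weight_mso = mean_squared_overlap(torch.randn(8, 4096))     # weight-space MSO
    activation_mso = mean_squared_overlap(torch.randn(8, 512))  # activation-space MSO

Under this reading, the weight-activation gap corresponds to weight_mso being low or reducible while activation_mso stays high and uncorrelated with it.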

Background

The paper investigates orthogonality regularization in Mixture-of-Experts (MoE) models and finds that it fails to reduce weight mean squared overlap (MSO) and does not consistently improve performance. Crucially, the study identifies a weight-activation gap: activation-space MSO remains high (approximately 0.57) and shows no significant correlation with weight-space MSO (Pearson r = -0.293, p = 0.523) across seven regularization strengths, indicating a disconnect between weight and activation geometry.
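
The specific regularizer is not given in this excerpt; a common soft orthogonality penalty, shown here only as a plausible stand-in, penalizes the off-diagonal entries of the experts' Gram matrix and is added to the task loss with a tunable strength.

    import torch

    def orthogonality_penalty(expert_weights: torch.Tensor) -> torch.Tensor:
        """Soft orthogonality penalty on flattened per-expert weights
        (num_experts, dim): sum of squared off-diagonal cosine similarities,
        which directly targets weight-space overlap. Assumed form, not
        necessarily the regularizer used in the paper.
        """
        unit = torch.nn.functional.normalize(expert_weights, dim=-1)
        gram = unit @ unit.T
        eye = torch.eye(gram.shape[0], device=gram.device)
        return ((gram - eye) ** 2).sum()

    # Hypothetical training objective: loss = task_loss + lambda_orth * orthogonality_penalty(W)
    # Sweeping lambda_orth over several strengths and recording weight- and activation-space
    # MSO per run yields the paired values over which a Pearson correlation can be computed.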

Experiments are conducted on a NanoGPT-MoE model (~130M parameters, 8 experts, top-2 routing) across datasets including TinyStories, WikiText-103, and Penn Treebank (PTB). Although the results show high variance and inconsistent effects, the central phenomenon, the weight-activation gap, raises the question of whether it holds beyond this setup. The authors explicitly state uncertainty about whether the gap persists at larger scales (1B+ parameters) and in different architectures such as Mixtral and DeepSeek-MoE, noting that prior work reports router-level gains at larger scales and that their findings may be setup-specific.
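
For readers unfamiliar with the setup, the following is an illustrative top-2 routing forward pass, not the NanoGPT-MoE implementation itself; the router and expert modules are hypothetical placeholders.

    import torch
    import torch.nn.functional as F

    def top2_moe_forward(x: torch.Tensor, router: torch.nn.Linear, experts: list) -> torch.Tensor:
        """Route each token to its two highest-scoring experts and combine
        their outputs with renormalized router probabilities.

        x: (tokens, d_model); router: Linear(d_model, num_experts);
        experts: list of modules mapping d_model -> d_model.
        """
        probs = F.softmax(router(x), dim=-1)             # (tokens, num_experts)
        top_p, top_idx = probs.topk(2, dim=-1)           # top-2 experts per token
        top_p = top_p / top_p.sum(dim=-1, keepdim=True)  # renormalize the selected pair
        out = torch.zeros_like(x)
        for slot in range(2):
            for e, expert in enumerate(experts):
                mask = top_idx[:, slot] == e
                if mask.any():
                    out[mask] += top_p[mask, slot:slot + 1] * expert(x[mask])
        return out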

References

Whether the weight-activation gap persists at larger scales (1B+ parameters) or with different architectures (e.g., Mixtral, DeepSeek-MoE) remains an open question.