Persistence of the weight-activation gap at scale and across architectures
Determine whether the weight-activation gap persists when Mixture-of-Experts models are scaled to at least 1 billion parameters and when alternative architectures such as Mixtral and DeepSeek-MoE are used. The gap refers to the observed disconnect between orthogonality in weight space (low mean squared overlap between expert weights) and its absence in activation space (high mean squared overlap between expert activations), with no significant correlation between the two overlap measures.
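The paper's exact metric is not reproduced here; the following is a minimal sketch assuming "mean squared overlap" means the average squared cosine similarity over all expert pairs within a layer. The arrays `expert_weights` and `expert_activations` are illustrative placeholders (e.g., flattened expert weight matrices and mean expert outputs on a probe batch), not the paper's pipeline.

```python
import numpy as np
from scipy.stats import pearsonr

def pairwise_squared_overlaps(vectors: np.ndarray) -> np.ndarray:
    """Squared cosine similarity for every distinct (i, j) pair of row vectors."""
    normed = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    gram = normed @ normed.T
    i, j = np.triu_indices(len(vectors), k=1)
    return gram[i, j] ** 2

rng = np.random.default_rng(0)
# Illustrative placeholders: one row per expert in a single MoE layer.
expert_weights = rng.standard_normal((8, 512))       # e.g., flattened expert weight matrices
expert_activations = rng.standard_normal((8, 512))   # e.g., mean expert outputs on a probe batch

w_sq = pairwise_squared_overlaps(expert_weights)
a_sq = pairwise_squared_overlaps(expert_activations)

print(f"weight MSO     = {w_sq.mean():.4f}")  # low  -> near-orthogonal expert weights
print(f"activation MSO = {a_sq.mean():.4f}")  # high -> overlapping expert activations
r, p = pearsonr(w_sq, a_sq)
print(f"pair-level correlation r = {r:.3f} (p = {p:.3f})")
# The weight-activation gap: low weight MSO, high activation MSO,
# and no significant correlation between the two at the pair level.
```

Resolving the open question would amount to rerunning a measurement like this on 1B+ parameter MoE models and on architectures such as Mixtral and DeepSeek-MoE, and checking whether the same pattern holds.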
References
Whether the weight-activation gap persists at larger scales (1B+ parameters) or with different architectures (e.g., Mixtral, DeepSeek-MoE) remains an open question.
— Geometric Regularization in Mixture-of-Experts: The Disconnect Between Weights and Activations
(2601.00457 - Kim, 1 Jan 2026) in Limitations, Scale