
Underlying reason for variance adaptation’s effectiveness in deep networks

Determine the underlying reason for the effectiveness of variance adaptation—elementwise normalization of parameter updates by an exponential moving average of the uncentered squared gradients—in training deep neural networks.
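
To make the object of study concrete, here is a minimal sketch of variance adaptation as described above: an elementwise rescaling of an update direction by the square root of an exponential moving average (EMA) of uncentered squared gradients. The function name `variance_adapt` and the hyperparameters `beta2` and `eps` are illustrative choices (Adam-style defaults), not taken from the paper.

```python
import numpy as np

def variance_adapt(update, grad, v, beta2=0.999, eps=1e-8):
    """Elementwise variance adaptation of a parameter update.

    update: raw update direction (e.g., momentum or a whitened gradient)
    grad:   current stochastic gradient
    v:      running EMA of uncentered squared gradients
    """
    v = beta2 * v + (1.0 - beta2) * grad**2   # update second-moment EMA
    adapted = update / (np.sqrt(v) + eps)     # normalize elementwise
    return adapted, v
```

In Adam, `update` is the first-moment EMA of the gradients; the open question concerns why this same elementwise scaling helps when applied to other base updates, such as whitened directions.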


Background

The paper deconstructs matrix-whitening optimizers and argues that performance gains arise from two components: spectral normalization and variance adaptation. Experiments show that variance-adapted methods (e.g., SOAP, AdaMuon) consistently outperform their sign-descent counterparts across matrix-whitening families, suggesting variance adaptation is a crucial ingredient.
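
The contrast at issue can be sketched as follows for the elementwise case: a sign-descent variant keeps only the sign of each update coordinate, whereas its variance-adapted counterpart rescales each coordinate by the gradient's second-moment EMA. This is an illustrative toy comparison, not the paper's experimental setup; it reuses the hypothetical `variance_adapt` helper from the sketch above.

```python
import numpy as np

rng = np.random.default_rng(0)
grad = rng.normal(size=(4, 4))   # stand-in stochastic gradient
m = 0.1 * grad                   # first-moment EMA after one step (beta1 = 0.9)
v = np.zeros_like(grad)          # second-moment EMA, initialized to zero

sign_step = np.sign(m)                   # sign-descent variant: magnitudes discarded
va_step, v = variance_adapt(m, grad, v)  # variance-adapted variant of the same direction
```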

Despite extensive empirical evidence of its benefits, the mechanism by which variance adaptation improves deep network training remains unclear. Prior work offers explanations (e.g., batch stochasticity or oscillatory updates), but a definitive account has not been established.

References

We believe a fruitful open question is to identify the underlying reason behind variance adaptation's effectiveness in deep networks.

Frans et al., "What Really Matters in Matrix-Whitening Optimizers?", arXiv:2510.25000 (28 Oct 2025), Discussion and Conclusion, Future work.