
Underlying reason for variance adaptation’s effectiveness in deep networks

Determine the underlying reason for the effectiveness of variance adaptation—elementwise normalization of parameter updates by an exponential moving average of the uncentered squared gradients—in training deep neural networks.
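
To make the object of study concrete, here is a minimal sketch of variance adaptation as described above: an elementwise rescaling of an update direction by the square root of an exponential moving average (EMA) of uncentered squared gradients. The function name `variance_adapt` and the hyperparameters `beta2` and `eps` are illustrative choices (Adam-style defaults), not taken from the paper.

```python
import numpy as np

def variance_adapt(update, grad, v, beta2=0.999, eps=1e-8):
    """Elementwise variance adaptation of a parameter update.

    update: raw update direction (e.g., momentum or a whitened gradient)
    grad:   current stochastic gradient
    v:      running EMA of uncentered squared gradients
    """
    v = beta2 * v + (1.0 - beta2) * grad**2   # update second-moment EMA
    adapted = update / (np.sqrt(v) + eps)     # normalize elementwise
    return adapted, v
```

In Adam, `update` is the first-moment EMA of the gradients; the open question concerns why this same elementwise scaling helps when applied to other base updates, such as whitened directions.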


Background

The paper deconstructs matrix-whitening optimizers and argues that performance gains arise from two components: spectral normalization and variance adaptation. Experiments show that variance-adapted methods (e.g., SOAP, AdaMuon) consistently outperform their sign-descent counterparts across matrix-whitening families, suggesting variance adaptation is a crucial ingredient.
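
The contrast at issue can be sketched as follows for the elementwise case: a sign-descent variant keeps only the sign of each update coordinate, whereas its variance-adapted counterpart rescales each coordinate by the gradient's second-moment EMA. This is an illustrative toy comparison, not the paper's experimental setup; it reuses the hypothetical `variance_adapt` helper from the sketch above.

```python
import numpy as np

rng = np.random.default_rng(0)
grad = rng.normal(size=(4, 4))   # stand-in stochastic gradient
m = 0.1 * grad                   # first-moment EMA after one step (beta1 = 0.9)
v = np.zeros_like(grad)          # second-moment EMA, initialized to zero

sign_step = np.sign(m)                   # sign-descent variant: magnitudes discarded
va_step, v = variance_adapt(m, grad, v)  # variance-adapted variant of the same direction
```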

Despite extensive empirical evidence of its benefits, the mechanism by which variance adaptation improves deep network training remains unclear. Prior work offers explanations (e.g., batch stochasticity or oscillatory updates), but a definitive account has not been established.

References

We believe a fruitful open question is to identify the underlying reason behind variance adaptation's effectiveness in deep networks.

Frans et al., "What Really Matters in Matrix-Whitening Optimizers?", arXiv:2510.25000 (28 Oct 2025), Discussion and Conclusion, Future work.