Effectiveness of SOAP and Kron second-order momentum at high data-to-model ratios
Establish whether the second-order momentum maintained by the SOAP and Kron (PSGD) optimizers becomes more effective as the data-to-model ratio increases, and whether any such gain translates into larger long-run training speedups over optimizers such as Muon in high data-to-model regimes.
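To make "second-order momentum" concrete, the sketch below shows one common form of it: the Shampoo-style Kronecker-factored statistics that SOAP tracks as exponential moving averages of G Gᵀ and Gᵀ G for a matrix-shaped gradient G. This is only an illustration under stated assumptions, not the setup used in the cited paper. The helper names, the decay value, and the toy gradient are hypothetical, and Kron (PSGD) fits its Kronecker-factored preconditioner with a different update rule that is not shown here.

```python
# Illustrative sketch only: Shampoo-style Kronecker-factored second-moment
# statistics of the kind SOAP maintains. All names and defaults are hypothetical.

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]

def ema_update(stat, new, beta):
    """Elementwise exponential moving average: beta * stat + (1 - beta) * new."""
    return [[beta * s + (1.0 - beta) * n for s, n in zip(srow, nrow)]
            for srow, nrow in zip(stat, new)]

def update_factors(left, right, grad, beta=0.95):
    """Update the left (m x m) and right (n x n) factors from an m x n gradient."""
    left = ema_update(left, matmul(grad, transpose(grad)), beta)
    right = ema_update(right, matmul(transpose(grad), grad), beta)
    return left, right

def zeros(n):
    """Return an n x n matrix of zeros."""
    return [[0.0] * n for _ in range(n)]

# Usage: a 2 x 3 gradient yields a 2 x 2 left factor and a 3 x 3 right factor.
grad = [[1.0, 0.0, 2.0],
        [0.0, 3.0, 0.0]]
left, right = update_factors(zeros(2), zeros(3), grad, beta=0.5)
```

Because the two factors are running averages over gradients seen during training, they accumulate information from more steps as the run gets longer. Whether that accumulation actually yields larger speedups at high data-to-model ratios is the open question stated above.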
References
We conjecture that the second-order momentum maintained by Soap and Kron becomes more effective when the data-to-model ratio increases. In the long run, adaptivity to heterogeneity in parameter directions may lead to a larger speedup.
— Fantastic Pretraining Optimizers and Where to Find Them
(2509.02046 - Wen et al., 2 Sep 2025) in Empirical Findings, High data-to-model Ratio