Effectiveness of SOAP and Kron second-order momentum at high data-to-model ratios

Determine whether the second-order momentum maintained by the SOAP optimizer and the Kron (PSGD) optimizer becomes more effective as the data-to-model ratio increases, and whether any such gain translates into larger long-run training speedups over optimizers such as Muon in high data-to-model regimes.

Background

The paper benchmarks eleven optimizers for LLM pretraining across multiple model sizes and data-to-model ratios. In higher data-to-model regimes (e.g., 16× Chinchilla), the authors observe SOAP and Kron overtaking Muon in performance. Based on these empirical findings, they propose a conjecture regarding the role of second-order momentum in these matrix-based optimizers.
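For context, data-to-model ratios here are multiples of a Chinchilla-optimal token budget. As a rough back-of-the-envelope aid, the sketch below converts a multiple into a token count; it assumes the common ~20 tokens-per-parameter heuristic and an example model size, neither of which is a figure taken from the paper, and the function name is hypothetical.

```python
def tokens_for_ratio(n_params: float, chinchilla_multiple: float,
                     tokens_per_param: float = 20.0) -> float:
    """Token budget for a given multiple of the Chinchilla-optimal ratio.

    Assumes the common ~20 tokens-per-parameter heuristic; the paper may
    use a different base ratio.
    """
    return chinchilla_multiple * tokens_per_param * n_params

# e.g., 16x Chinchilla for a hypothetical 130M-parameter model:
print(f"{tokens_for_ratio(130e6, 16) / 1e9:.1f}B tokens")  # ~41.6B
```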

Specifically, the authors suggest that second-order momentum, the adaptive second-moment statistics that SOAP and Kron maintain to account for curvature and for heterogeneity across parameter directions, may yield growing benefits as more data is used per parameter. This would imply distinct scaling behavior for these methods relative to first-order methods and to matrix-based approaches without explicit second-order momentum. Confirming or refuting the conjecture would clarify optimizer selection in high data-to-model regimes and inform future optimizer design.
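To make the mechanism concrete, here is a minimal, illustrative sketch of a SOAP-style step for a single weight matrix: Kronecker-factored curvature statistics define a rotated basis, and Adam-style first- and second-order momentum are accumulated in that basis. This is not the paper's implementation; the function name soap_like_update and all hyperparameter values are assumptions, real SOAP refreshes the eigenbasis only every few steps and applies bias correction, and Kron (PSGD) fits its preconditioner by a different criterion entirely.

```python
import numpy as np

def soap_like_update(W, G, state, lr=3e-4, beta1=0.9, beta2=0.999,
                     shampoo_beta=0.95, eps=1e-8):
    """One illustrative SOAP-style step for a single weight matrix W.

    Maintains Kronecker-factored second-order statistics (EMAs of G G^T
    and G^T G), rotates the gradient into their eigenbasis, and runs an
    Adam-style update (first- and second-order momentum) in that basis.
    Simplified sketch: bias correction omitted, eigenbasis recomputed
    every step rather than periodically.
    """
    state["L"] = shampoo_beta * state["L"] + (1 - shampoo_beta) * (G @ G.T)
    state["R"] = shampoo_beta * state["R"] + (1 - shampoo_beta) * (G.T @ G)
    _, QL = np.linalg.eigh(state["L"])   # left eigenbasis
    _, QR = np.linalg.eigh(state["R"])   # right eigenbasis

    G_rot = QL.T @ G @ QR                           # gradient in rotated basis
    state["m"] = beta1 * state["m"] + (1 - beta1) * G_rot     # 1st moment
    state["v"] = beta2 * state["v"] + (1 - beta2) * G_rot**2  # 2nd moment
    update_rot = state["m"] / (np.sqrt(state["v"]) + eps)     # Adam-style step
    return W - lr * (QL @ update_rot @ QR.T)                  # rotate back

# Toy usage: a 4x3 weight matrix with random gradients.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))
state = {"L": np.zeros((4, 4)), "R": np.zeros((3, 3)),
         "m": np.zeros((4, 3)), "v": np.zeros((4, 3))}
for _ in range(5):
    G = rng.normal(size=W.shape)
    W = soap_like_update(W, G, state)
```

The conjecture concerns the second-moment accumulator (state["v"] above, and its analogue in Kron): whether the per-direction adaptivity it provides pays off increasingly as the number of tokens seen per parameter grows.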

References

"We conjecture that the second-order momentum maintained by Soap and Kron becomes more effective when the data-to-model ratio increases. In the long run, adaptivity to heterogeneity in parameter directions may lead to a larger speedup."

Fantastic Pretraining Optimizers and Where to Find Them (Wen et al., arXiv:2509.02046, 2 Sep 2025), in Empirical Findings, High Data-to-Model Ratio.