Conjecture: small β1 may cause instability at larger scales
Investigate whether reducing the fast momentum parameter β1 in AdEMAMix (down to and including β1 = 0) leads to noisier training dynamics and loss spikes that become problematic for convergence at larger model scales. A sketch of the update rule follows.
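For concreteness, here is a minimal NumPy sketch of a single AdEMAMix update step, following the update rule in the paper: two first-moment EMAs (a fast one governed by β1, a slow one governed by β3) are combined as m̂1 + α·m2 and normalized by an Adam-style second moment. The β3/α warmup schedulers from the paper are omitted, and the function and variable names here are our own. Setting beta1=0 reduces the fast EMA to the raw gradient, which is the noisier regime this conjecture concerns.

```python
# Minimal sketch of one AdEMAMix update step (Pagliardini et al., 2024).
# Assumptions: beta3/alpha warmup schedulers omitted; names are illustrative.
import numpy as np

def ademamix_step(theta, grad, state, step,
                  lr=1e-4, beta1=0.9, beta2=0.999, beta3=0.9999,
                  alpha=5.0, eps=1e-8, weight_decay=0.0):
    """One AdEMAMix step. With beta1=0 the fast EMA equals the raw gradient."""
    m1, m2, nu = state["m1"], state["m2"], state["nu"]

    # Fast first-moment EMA (bias-corrected, as in Adam).
    m1 = beta1 * m1 + (1.0 - beta1) * grad
    # Slow first-moment EMA with beta3 close to 1 (no bias correction in the paper).
    m2 = beta3 * m2 + (1.0 - beta3) * grad
    # Second-moment EMA (bias-corrected, as in Adam).
    nu = beta2 * nu + (1.0 - beta2) * grad**2

    m1_hat = m1 / (1.0 - beta1**step)
    nu_hat = nu / (1.0 - beta2**step)

    # Combine fast and slow momentum, normalize, apply decoupled weight decay.
    update = (m1_hat + alpha * m2) / (np.sqrt(nu_hat) + eps)
    theta = theta - lr * (update + weight_decay * theta)

    state.update(m1=m1, m2=m2, nu=nu)
    return theta, state

# Usage: run a few steps with beta1=0 on a stand-in gradient.
theta = np.zeros(4)
state = dict(m1=np.zeros(4), m2=np.zeros(4), nu=np.zeros(4))
for t in range(1, 101):
    g = np.random.randn(4)  # placeholder for a real stochastic gradient
    theta, state = ademamix_step(theta, g, state, step=t, beta1=0.0)
```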
References
We observe that smaller β1 values yield noisier curves, with multiple loss spikes. At the 110M parameter scale, those spikes do not significantly impact convergence; we conjecture that they could become a problem at larger scales.
                — The AdEMAMix Optimizer: Better, Faster, Older (arXiv:2409.03137, Pagliardini et al., 2024), Appendix, subsection “Hyperparameter sensitivity” (figure caption and discussion of the β1 sweep)