Conjecture: small β1 may cause instability at larger scales
Investigate whether reducing the fast momentum parameter β1 in AdEMAMix (down to and including β1 = 0) leads to noisier training dynamics and loss spikes that become problematic for convergence at larger model scales. A sketch of the update rule follows.
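For concreteness, here is a minimal NumPy sketch of a single AdEMAMix update step, following the update rule in the paper: two first-moment EMAs (a fast one governed by β1, a slow one governed by β3) are combined as m̂1 + α·m2 and normalized by an Adam-style second moment. The β3/α warmup schedulers from the paper are omitted, and the function and variable names here are our own. Setting beta1=0 reduces the fast EMA to the raw gradient, which is the noisier regime this conjecture concerns.

```python
# Minimal sketch of one AdEMAMix update step (Pagliardini et al., 2024).
# Assumptions: beta3/alpha warmup schedulers omitted; names are illustrative.
import numpy as np

def ademamix_step(theta, grad, state, step,
                  lr=1e-4, beta1=0.9, beta2=0.999, beta3=0.9999,
                  alpha=5.0, eps=1e-8, weight_decay=0.0):
    """One AdEMAMix step. With beta1=0 the fast EMA equals the raw gradient."""
    m1, m2, nu = state["m1"], state["m2"], state["nu"]

    # Fast first-moment EMA (bias-corrected, as in Adam).
    m1 = beta1 * m1 + (1.0 - beta1) * grad
    # Slow first-moment EMA with beta3 close to 1 (no bias correction in the paper).
    m2 = beta3 * m2 + (1.0 - beta3) * grad
    # Second-moment EMA (bias-corrected, as in Adam).
    nu = beta2 * nu + (1.0 - beta2) * grad**2

    m1_hat = m1 / (1.0 - beta1**step)
    nu_hat = nu / (1.0 - beta2**step)

    # Combine fast and slow momentum, normalize, apply decoupled weight decay.
    update = (m1_hat + alpha * m2) / (np.sqrt(nu_hat) + eps)
    theta = theta - lr * (update + weight_decay * theta)

    state.update(m1=m1, m2=m2, nu=nu)
    return theta, state

# Usage: run a few steps with beta1=0 on a stand-in gradient.
theta = np.zeros(4)
state = dict(m1=np.zeros(4), m2=np.zeros(4), nu=np.zeros(4))
for t in range(1, 101):
    g = np.random.randn(4)  # placeholder for a real stochastic gradient
    theta, state = ademamix_step(theta, g, state, step=t, beta1=0.0)
```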
References
We observe that smaller β1 values yield noisier curves, with multiple loss spikes. At the 110M parameter scale, those spikes do not significantly impact convergence; we conjecture that they could become a problem at larger scales.
                — The AdEMAMix Optimizer: Better, Faster, Older (arXiv:2409.03137, Pagliardini et al., 2024), Appendix, subsection “Hyperparameter sensitivity” (figure caption and discussion of the β1 sweep)