Loss landscape and gradient consistency underlying gains from very old gradients

Characterize the properties of the non-convex loss landscape and the temporal consistency of individual batch gradients during training that would explain the empirical gains from averaging very old gradients using the high-β3 slow EMA term in the AdEMAMix optimizer for large neural networks.
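
One way to make the "temporal consistency of individual batch gradients" concrete is to measure how much the gradient of a single fixed batch rotates as training progresses. The sketch below is illustrative only: `checkpoints` stands in for parameter snapshots saved during training, and `batch_gradient` is a hypothetical helper wrapping a forward/backward pass on that batch.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two flattened gradient vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def batch_gradient_consistency(batch, checkpoints, batch_gradient):
    """Compare the gradient of one fixed batch at the earliest checkpoint
    with its gradient at every later checkpoint.

    `checkpoints`: list of parameter snapshots saved during training.
    `batch_gradient(params, batch)`: hypothetical helper returning the
    flattened loss gradient for `batch` at those parameters."""
    g0 = batch_gradient(checkpoints[0], batch)
    return [cosine(g0, batch_gradient(p, batch)) for p in checkpoints[1:]]
```

Cosine similarities that stay high over thousands of steps would be one signature of the gradient consistency that could make a β3 ≈ 0.9999 EMA informative rather than stale.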

Background

AdEMAMix introduces a mixture of two exponential moving averages (EMAs): alongside the usual fast EMA, it keeps a slow EMA with a large coefficient β3 (e.g., β3 ≈ 0.9999) that aggregates gradients over thousands of steps. Empirically, the authors observe that very old gradients can remain relevant and improve optimization outcomes across language modeling and vision tasks.
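
For concreteness, the following is a minimal NumPy sketch of the two-EMA update, with the slow EMA added to the bias-corrected fast EMA via a mixing coefficient α as described in the paper. It omits the paper's warmup schedulers for α and β3 as well as weight decay, so it approximates rather than reproduces the reference implementation.

```python
import numpy as np

def init_state(theta):
    """Optimizer state: fast EMA m1, slow EMA m2, second moment nu, step count t."""
    return {"m1": np.zeros_like(theta), "m2": np.zeros_like(theta),
            "nu": np.zeros_like(theta), "t": 0}

def ademamix_step(theta, grad, state, lr=1e-3,
                  beta1=0.9, beta2=0.999, beta3=0.9999,
                  alpha=5.0, eps=1e-8):
    """One simplified AdEMAMix-style update (schedulers and weight decay omitted)."""
    state["t"] += 1
    t = state["t"]
    state["m1"] = beta1 * state["m1"] + (1 - beta1) * grad        # fast EMA
    state["m2"] = beta3 * state["m2"] + (1 - beta3) * grad        # slow EMA (long horizon)
    state["nu"] = beta2 * state["nu"] + (1 - beta2) * grad ** 2   # second moment
    m1_hat = state["m1"] / (1 - beta1 ** t)                       # bias-corrected fast EMA
    nu_hat = state["nu"] / (1 - beta2 ** t)
    return theta - lr * (m1_hat + alpha * state["m2"]) / (np.sqrt(nu_hat) + eps)
```

With β3 = 0.9999, the slow EMA has an effective averaging horizon on the order of 1/(1 - β3) = 10,000 steps, which is how gradients from thousands of steps earlier still enter the current update.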

This observation challenges common practice that prioritizes recent gradients and raises theoretical questions about the structure of the loss landscape and the stability or consistency of gradients across long training horizons. The authors explicitly note that they do not answer these questions in the present work.

References

From a theoretical standpoint, our work raises several questions. First, given that we gain from averaging very old gradients, what can it reveal of the loss landscape and the consistency of one batch's gradient during training? Second, would our approach not decrease the variance up to a point that is harming generalization [igrheavyball]? While no answer to those questions is given in this work, we provide a toy justification which indicates that large momentums can have a positive impact in noise-free non-convex settings (see the toy Rosenbrock figure in the paper), indicating the improvement of our approach is at least partially explainable without considering variance-reduction effects.

The AdEMAMix Optimizer: Better, Faster, Older (2409.03137 - Pagliardini et al., 5 Sep 2024) in Section 2, Related Work (Works on understanding momentum)
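
The toy justification quoted above can be approximated in a few lines: optimize the Rosenbrock function, a standard noise-free non-convex test problem, with exact full gradients and increasingly large first-moment momentum. The snippet below uses plain Adam as a stand-in for that comparison; it is not the paper's exact experiment, and the learning rate, β values, and step count are untuned, illustrative choices.

```python
import numpy as np

def rosenbrock(p, a=1.0, b=100.0):
    """Rosenbrock function f(x, y) = (a - x)^2 + b (y - x^2)^2, minimum at (a, a^2)."""
    x, y = p
    return (a - x) ** 2 + b * (y - x ** 2) ** 2

def rosenbrock_grad(p, a=1.0, b=100.0):
    """Exact (noise-free) gradient of the Rosenbrock function."""
    x, y = p
    return np.array([-2 * (a - x) - 4 * b * x * (y - x ** 2),
                     2 * b * (y - x ** 2)])

def adam_on_rosenbrock(beta1, beta2=0.999, lr=1e-3, eps=1e-8,
                       steps=30_000, start=(-1.5, 2.0)):
    """Deterministic full-gradient Adam on Rosenbrock with a given beta1."""
    p = np.array(start, dtype=float)
    m, v = np.zeros(2), np.zeros(2)
    for t in range(1, steps + 1):
        g = rosenbrock_grad(p)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        p = p - lr * m_hat / (np.sqrt(v_hat) + eps)
    return p, rosenbrock(p)

# Compare moderate vs. very large first-moment momentum in this noise-free setting.
for beta1 in (0.9, 0.99, 0.999):
    p, loss = adam_on_rosenbrock(beta1)
    print(f"beta1={beta1}: final point {p}, loss {loss:.3e}")
```

Because every gradient here is exact, any benefit observed for large momentum cannot come from variance reduction, which is the point the quoted passage makes with its own Rosenbrock figure.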