Loss landscape and gradient consistency underlying gains from very old gradients
Characterize the properties of the non-convex loss landscape, and the degree of temporal consistency of individual batch gradients during training, that would explain the empirical gains from averaging very old gradients via the high-β3 slow EMA term of the AdEMAMix optimizer when training large neural networks.
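For concreteness, the update this question concerns can be sketched as below. This is a minimal NumPy sketch of a two-EMA, AdEMAMix-style step (bias-corrected fast EMA and second moment as in Adam, plus a slow EMA with β3 close to 1 mixed in with coefficient α); the α/β3 warmup schedulers and other implementation details of the actual optimizer are omitted, the function names (`ademamix_step`, `init_state`) are illustrative, and the default hyperparameter values are only indicative.

```python
import numpy as np

def init_state(shape):
    """Optimizer state: step count, fast EMA, slow EMA, second moment (sketch)."""
    return {"t": 0, "m1": np.zeros(shape), "m2": np.zeros(shape), "nu": np.zeros(shape)}

def ademamix_step(theta, g, state, lr=1e-4, beta1=0.9, beta2=0.999,
                  beta3=0.9999, alpha=5.0, eps=1e-8, weight_decay=0.0):
    """One AdEMAMix-style parameter update (illustrative sketch).

    The slow EMA m2 uses beta3 close to 1, so gradients from many
    thousands of past steps still carry non-negligible weight.
    """
    state["t"] += 1
    t = state["t"]
    # Fast EMA of gradients (as in Adam), with bias correction.
    state["m1"] = beta1 * state["m1"] + (1.0 - beta1) * g
    m1_hat = state["m1"] / (1.0 - beta1 ** t)
    # Slow EMA: the high-beta3 term that averages very old gradients.
    state["m2"] = beta3 * state["m2"] + (1.0 - beta3) * g
    # Second moment (as in Adam), with bias correction.
    state["nu"] = beta2 * state["nu"] + (1.0 - beta2) * g ** 2
    nu_hat = state["nu"] / (1.0 - beta2 ** t)
    # Mix fast and slow momentum; alpha scales the slow-EMA contribution.
    direction = (m1_hat + alpha * state["m2"]) / (np.sqrt(nu_hat) + eps)
    return theta - lr * (direction + weight_decay * theta)
```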
References
From a theoretical standpoint, our work raises several questions. First, given that we gain from averaging very old gradients, what does this reveal about the loss landscape and about how consistent a given batch's gradient remains over the course of training? Second, could our approach reduce the variance to the point of harming generalization \citep{igrheavyball}? While we do not answer those questions in this work, we provide a toy justification indicating that large momentum values can have a positive impact in noise-free non-convex settings (see Fig.~\ref{fig:toy_rosenbrock}), which suggests that the improvement brought by our approach is at least partially explainable without appealing to variance-reduction effects.
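The toy justification in the excerpt refers to a noise-free non-convex setting, and the figure label suggests a Rosenbrock-type objective; the exact setup is not given here. The sketch below is only an illustrative analogue, assuming deterministic (full-gradient) descent on the Rosenbrock function with an EMA of past gradients, so the effect of weighting older gradients more heavily (larger β) can be compared without any gradient noise; the helper names (`rosenbrock_grad`, `ema_momentum_descent`) are hypothetical.

```python
import numpy as np

def rosenbrock_grad(p, a=1.0, b=100.0):
    """Gradient of the non-convex Rosenbrock function f(x, y) = (a - x)^2 + b (y - x^2)^2."""
    x, y = p
    dx = -2.0 * (a - x) - 4.0 * b * x * (y - x ** 2)
    dy = 2.0 * b * (y - x ** 2)
    return np.array([dx, dy])

def ema_momentum_descent(beta, lr=1e-4, steps=50_000, start=(-1.5, 2.0)):
    """Noise-free descent using an EMA of past gradients.

    Larger beta means older gradients retain more weight in the update,
    mimicking the effect of a large momentum / slow EMA term.
    """
    theta = np.array(start, dtype=float)
    m = np.zeros(2)
    for _ in range(steps):
        g = rosenbrock_grad(theta)
        m = beta * m + (1.0 - beta) * g  # EMA of past gradients
        theta = theta - lr * m
    return theta

if __name__ == "__main__":
    # The Rosenbrock minimum is at (1, 1); compare final iterates across beta values.
    for beta in (0.9, 0.99, 0.999, 0.9999):
        print(f"beta={beta}: final iterate = {ema_momentum_descent(beta)}")
```

Because the setting is deterministic, any difference between momentum values in such a toy cannot come from variance reduction, which is the point the excerpt's argument relies on.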