Potential variance-reduction harms to generalization from long-horizon averaging
Determine whether, and under what conditions, averaging very old gradients via the high-β3 slow EMA in AdEMAMix reduces gradient variance to an extent that harms generalization performance, and quantify the trade-offs involved.
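For context on the mechanism the problem statement refers to, below is a minimal NumPy sketch of an AdEMAMix-style step, assuming the two-EMA form described in the paper: an Adam-like fast first moment and second moment, plus a slow EMA whose β3 is close to 1, mixed into the numerator with a weight α. The function name, the default hyperparameter values, and the omission of any warmup scheduling for α and β3 are simplifications for illustration, not the reference implementation.

```python
import numpy as np

def ademamix_step(theta, grads, state, lr=1e-4,
                  betas=(0.9, 0.999, 0.9999), alpha=5.0,
                  eps=1e-8, weight_decay=0.0):
    """One AdEMAMix-style parameter update (illustrative sketch).

    `state` holds m1 (fast EMA of gradients), m2 (EMA of squared gradients),
    m_slow (slow EMA with beta3 close to 1), and the step counter t.
    """
    beta1, beta2, beta3 = betas
    state["t"] += 1
    t = state["t"]

    # Adam-like fast first moment and second moment, with bias correction.
    state["m1"] = beta1 * state["m1"] + (1 - beta1) * grads
    state["m2"] = beta2 * state["m2"] + (1 - beta2) * grads**2
    m1_hat = state["m1"] / (1 - beta1**t)
    m2_hat = state["m2"] / (1 - beta2**t)

    # Slow EMA: with beta3 ~ 0.9999, gradients from tens of thousands of
    # steps ago still carry non-negligible weight in the average.
    state["m_slow"] = beta3 * state["m_slow"] + (1 - beta3) * grads

    # Mix the fast and slow moments in the numerator; decoupled weight decay.
    update = (m1_hat + alpha * state["m_slow"]) / (np.sqrt(m2_hat) + eps)
    return theta - lr * (update + weight_decay * theta)

# Illustrative usage on a toy parameter vector:
theta = np.zeros(4)
state = {"m1": np.zeros(4), "m2": np.zeros(4), "m_slow": np.zeros(4), "t": 0}
theta = ademamix_step(theta, np.ones(4), state)
```

The open question above is whether the very long effective horizon of m_slow averages away useful gradient noise along with nuisance noise, and at what point that begins to hurt generalization.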
References
From a theoretical standpoint, our work raises several questions. First, given that we gain from averaging very old gradients, what does this reveal about the loss landscape and about how consistent a single batch's gradient remains during training? Second, could our approach reduce the variance to a point where it harms generalization \citep{igrheavyball}? While we do not answer these questions in this work, we provide a toy justification showing that large momentum values can have a positive impact in noise-free non-convex settings (see Fig.~\ref{fig:toy_rosenbrock}), which suggests that the improvement from our approach is at least partially explainable without appealing to variance-reduction effects.
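To make the noise-free argument concrete, the sketch below, an illustrative setup rather than necessarily the one behind Fig.~\ref{fig:toy_rosenbrock}, runs classical heavy-ball momentum with several β values on the deterministic Rosenbrock function. Because the gradients are exact, any gap between the runs cannot be a variance-reduction effect: along the curved valley successive gradients point in consistent directions, so a larger β simply lets old gradients keep contributing to progress there.

```python
import numpy as np

def rosenbrock(p):
    """f(x, y) = (1 - x)^2 + 100 (y - x^2)^2, a noise-free non-convex test function."""
    x, y = p
    return (1 - x)**2 + 100.0 * (y - x**2)**2

def rosenbrock_grad(p):
    x, y = p
    return np.array([-2.0 * (1 - x) - 400.0 * x * (y - x**2),
                     200.0 * (y - x**2)])

def heavy_ball(beta, lr=1e-4, steps=100_000):
    """Classical heavy-ball momentum: m <- beta*m + g, step = lr*m.
    Along directions where successive gradients agree (the valley floor),
    the accumulated momentum acts like a ~1/(1-beta) larger step."""
    p = np.array([-1.0, 1.0])
    m = np.zeros(2)
    for _ in range(steps):
        g = rosenbrock_grad(p)
        m = beta * m + g
        p = p - lr * m
    return rosenbrock(p)

if __name__ == "__main__":
    # Gradients are exact (no mini-batch noise), so any difference between
    # these runs cannot be attributed to variance reduction.
    for beta in (0.0, 0.9, 0.99):
        print(f"beta={beta:<4}  final loss = {heavy_ball(beta):.3e}")
```

In this setup larger β tends to make much faster progress along the valley at a fixed learning rate, which is the kind of noise-free benefit the toy justification points to; whether the same mechanism, pushed to the very long horizons of a high-β3 slow EMA in a stochastic setting, also suppresses noise that is useful for generalization is exactly the open problem stated above.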