Potential variance-reduction harms to generalization from long-horizon averaging
Determine whether, and under what conditions, averaging very old gradients via the high-β3 slow EMA in AdEMAMix reduces gradient variance to an extent that harms generalization performance, and quantify the trade-offs involved.
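For context on the mechanism the problem statement refers to, below is a minimal NumPy sketch of an AdEMAMix-style step, assuming the two-EMA form described in the paper: an Adam-like fast first moment and second moment, plus a slow EMA whose β3 is close to 1, mixed into the numerator with a weight α. The function name, the default hyperparameter values, and the omission of any warmup scheduling for α and β3 are simplifications for illustration, not the reference implementation.

```python
import numpy as np

def ademamix_step(theta, grads, state, lr=1e-4,
                  betas=(0.9, 0.999, 0.9999), alpha=5.0,
                  eps=1e-8, weight_decay=0.0):
    """One AdEMAMix-style parameter update (illustrative sketch).

    `state` holds m1 (fast EMA of gradients), m2 (EMA of squared gradients),
    m_slow (slow EMA with beta3 close to 1), and the step counter t.
    """
    beta1, beta2, beta3 = betas
    state["t"] += 1
    t = state["t"]

    # Adam-like fast first moment and second moment, with bias correction.
    state["m1"] = beta1 * state["m1"] + (1 - beta1) * grads
    state["m2"] = beta2 * state["m2"] + (1 - beta2) * grads**2
    m1_hat = state["m1"] / (1 - beta1**t)
    m2_hat = state["m2"] / (1 - beta2**t)

    # Slow EMA: with beta3 ~ 0.9999, gradients from tens of thousands of
    # steps ago still carry non-negligible weight in the average.
    state["m_slow"] = beta3 * state["m_slow"] + (1 - beta3) * grads

    # Mix the fast and slow moments in the numerator; decoupled weight decay.
    update = (m1_hat + alpha * state["m_slow"]) / (np.sqrt(m2_hat) + eps)
    return theta - lr * (update + weight_decay * theta)

# Illustrative usage on a toy parameter vector:
theta = np.zeros(4)
state = {"m1": np.zeros(4), "m2": np.zeros(4), "m_slow": np.zeros(4), "t": 0}
theta = ademamix_step(theta, np.ones(4), state)
```

The open question above is whether the very long effective horizon of m_slow averages away useful gradient noise along with nuisance noise, and at what point that begins to hurt generalization.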
References
From a theoretical standpoint, our work raises several questions. First, given that we gain from averaging very old gradients, what does this reveal about the loss landscape and about how consistent a single batch's gradient remains during training? Second, could our approach reduce the variance to a point where it harms generalization \citep{igrheavyball}? While we do not answer these questions in this work, we provide a toy justification showing that large momentum values can have a positive impact in noise-free non-convex settings (see Fig.~\ref{fig:toy_rosenbrock}), which suggests that the improvement from our approach is at least partially explainable without appealing to variance-reduction effects.
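To make the noise-free argument concrete, the sketch below, an illustrative setup rather than necessarily the one behind Fig.~\ref{fig:toy_rosenbrock}, runs classical heavy-ball momentum with several β values on the deterministic Rosenbrock function. Because the gradients are exact, any gap between the runs cannot be a variance-reduction effect: along the curved valley successive gradients point in consistent directions, so a larger β simply lets old gradients keep contributing to progress there.

```python
import numpy as np

def rosenbrock(p):
    """f(x, y) = (1 - x)^2 + 100 (y - x^2)^2, a noise-free non-convex test function."""
    x, y = p
    return (1 - x)**2 + 100.0 * (y - x**2)**2

def rosenbrock_grad(p):
    x, y = p
    return np.array([-2.0 * (1 - x) - 400.0 * x * (y - x**2),
                     200.0 * (y - x**2)])

def heavy_ball(beta, lr=1e-4, steps=100_000):
    """Classical heavy-ball momentum: m <- beta*m + g, step = lr*m.
    Along directions where successive gradients agree (the valley floor),
    the accumulated momentum acts like a ~1/(1-beta) larger step."""
    p = np.array([-1.0, 1.0])
    m = np.zeros(2)
    for _ in range(steps):
        g = rosenbrock_grad(p)
        m = beta * m + g
        p = p - lr * m
    return rosenbrock(p)

if __name__ == "__main__":
    # Gradients are exact (no mini-batch noise), so any difference between
    # these runs cannot be attributed to variance reduction.
    for beta in (0.0, 0.9, 0.99):
        print(f"beta={beta:<4}  final loss = {heavy_ball(beta):.3e}")
```

In this setup larger β tends to make much faster progress along the valley at a fixed learning rate, which is the kind of noise-free benefit the toy justification points to; whether the same mechanism, pushed to the very long horizons of a high-β3 slow EMA in a stochastic setting, also suppresses noise that is useful for generalization is exactly the open problem stated above.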