Precise Role of Exponential Moving Averages in Deep Learning Optimizers
Determine the precise theoretical role of exponential moving averages (EMA) within the Adam, Shampoo, and Prodigy optimizers. Specifically, characterize how enabling EMA modifies each optimizer's behavior relative to its interpretation as steepest descent under a particular norm, and how EMA affects robustness to mini-batch noise and training dynamics.
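The steepest-descent interpretation is easiest to see in Adam itself. Below is a minimal NumPy sketch (the function name `adam_step` and its signature are illustrative, and Adam's bias correction is omitted): Adam maintains EMAs of the gradient and its elementwise square, and setting both EMA coefficients to zero collapses the update to signed gradient descent, i.e. steepest descent under the infinity norm.

```python
import numpy as np

def adam_step(w, grad, m, v, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # EMA of gradients (first moment); beta1 = 0 disables the smoothing
    m = beta1 * m + (1 - beta1) * grad
    # EMA of squared gradients (second moment); beta2 = 0 disables the smoothing
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Normalized step; with beta1 = beta2 = 0 this is approximately
    # lr * sign(grad), i.e. steepest descent under the infinity norm
    w = w - lr * m / (np.sqrt(v) + eps)
    return w, m, v

# Illustrative check: without EMA, the step direction is the gradient sign
w = np.zeros(3)
g = np.array([0.5, -2.0, 0.1])
w_no_ema, _, _ = adam_step(w, g, np.zeros(3), np.zeros(3), beta1=0.0, beta2=0.0)
print(np.allclose(np.sign(-w_no_ema), np.sign(g)))  # True
```

The sketch also makes the smoothing claim concrete: with nonzero beta1 and beta2, each step averages over many recent mini-batch gradients rather than reacting to the latest one alone, which is the sense in which EMA is said to add robustness to mini-batch noise.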
References
EMA can then be thought of as "smoothing out" the algorithm, or making it more robust to mini-batch noise, although nailing down the precise role of EMA is perhaps still an open problem.
— Old Optimizer, New Norm: An Anthology
(Bernstein et al., arXiv:2409.20325, 30 Sep 2024), Prologue