Precise Role of Exponential Moving Averages in Deep Learning Optimizers
Determine the precise theoretical role of exponential moving averages (EMA) within the Adam, Shampoo, and Prodigy optimizers. Specifically, characterize how enabling EMA modifies each optimizer's behavior relative to its interpretation as steepest descent under a particular norm, and how EMA affects robustness to mini-batch noise and training dynamics.
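The steepest-descent interpretation is easiest to see in Adam itself. Below is a minimal NumPy sketch (the function name `adam_step` and its signature are illustrative, and Adam's bias correction is omitted): Adam maintains EMAs of the gradient and its elementwise square, and setting both EMA coefficients to zero collapses the update to signed gradient descent, i.e. steepest descent under the infinity norm.

```python
import numpy as np

def adam_step(w, grad, m, v, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # EMA of gradients (first moment); beta1 = 0 disables the smoothing
    m = beta1 * m + (1 - beta1) * grad
    # EMA of squared gradients (second moment); beta2 = 0 disables the smoothing
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Normalized step; with beta1 = beta2 = 0 this is approximately
    # lr * sign(grad), i.e. steepest descent under the infinity norm
    w = w - lr * m / (np.sqrt(v) + eps)
    return w, m, v

# Illustrative check: without EMA, the step direction is the gradient sign
w = np.zeros(3)
g = np.array([0.5, -2.0, 0.1])
w_no_ema, _, _ = adam_step(w, g, np.zeros(3), np.zeros(3), beta1=0.0, beta2=0.0)
print(np.allclose(np.sign(-w_no_ema), np.sign(g)))  # True
```

The sketch also makes the smoothing claim concrete: with nonzero beta1 and beta2, each step averages over many recent mini-batch gradients rather than reacting to the latest one alone, which is the sense in which EMA is said to add robustness to mini-batch noise.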
References
EMA can then be thought of as "smoothing out" the algorithm, or making it more robust to mini-batch noise, although nailing down the precise role of EMA is perhaps still an open problem.
— Old Optimizer, New Norm: An Anthology
(Bernstein et al., arXiv:2409.20325, 30 Sep 2024), Prologue