EMA in the modular norm framework

Characterize the theoretical role of exponential moving averages (EMA) within the modular norm optimization framework: derive how EMA-based first- and second-moment accumulation interacts with layer-wise duality maps and induced operator norms, and establish principled guidance for choosing EMA parameters in norm-based optimizers.
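
As background for that last point, and as a standard fact rather than a result from the paper, the EMA coefficient $\beta$ sets an effective averaging window; any principled guidance would need to relate this timescale to layer norms and step sizes:

$$
m_t \;=\; \beta\, m_{t-1} + (1-\beta)\, g_t \;=\; (1-\beta)\sum_{k=0}^{t-1} \beta^{k}\, g_{t-k} \;+\; \beta^{t} m_0,
\qquad
\tau_{\mathrm{eff}} \approx \frac{1}{1-\beta}
\quad(\text{e.g. } \beta = 0.95 \Rightarrow \tau_{\mathrm{eff}} \approx 20 \text{ steps}).
$$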

Background

The modular norm framework treats gradients as dual vectors that must be mapped back to the primal via layer-specific duality maps, yielding norm-respecting steepest-descent updates. In practice, however, optimizers such as Adam and Shampoo rely critically on EMA for stability and performance.
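
As a concrete instance of such a duality map (assuming, purely for illustration, a linear layer whose weight matrix is measured in the spectral norm, with $\langle \cdot, \cdot \rangle$ the trace inner product):

$$
\operatorname{dualize}(G) \;=\; \operatorname*{arg\,max}_{\|T\|_{2 \to 2} \,\le\, 1} \langle G,\, T \rangle \;=\; U V^{\top},
\qquad G = U \Sigma V^{\top} \ \text{(reduced SVD)},
$$

so the norm-respecting update moves along the orthogonalized gradient rather than the raw gradient, and each layer's induced operator norm dictates the shape of its update.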

The authors point out that the framework currently lacks tools to reason about EMA’s influence on geometry, stability, and step size. They cite related work suggesting that EMA plays a deeper role than variance reduction, and they call for a formal theoretical integration of EMA into the modular framework.
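
The following minimal numpy sketch shows the object such a theory would need to analyze, assuming a Muon-style optimizer in which a first-moment EMA is formed before the spectral duality map; the function names, Newton-Schulz coefficients, and hyperparameters are illustrative, not taken from the paper.

```python
import numpy as np

def orthogonalize(G, steps=5, eps=1e-8):
    """Approximate the spectral-norm duality map G -> U V^T with a Newton-Schulz iteration."""
    X = G / (np.linalg.norm(G) + eps)      # Frobenius normalization so the iteration converges
    a, b, c = 3.4445, -4.7750, 2.0315      # quintic coefficients used in Muon-style implementations
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X

def ema_then_dualize(W, G, m, lr=0.02, beta=0.95):
    """One update step: EMA the raw gradient, then apply the layer-wise duality map.

    The gap flagged in this section: the norm-based analysis covers the dualized
    direction, but not how `beta` reshapes the geometry of the accumulated
    gradient that is fed into the duality map.
    """
    m = beta * m + (1.0 - beta) * G        # first-moment EMA, accumulated in the dual (gradient) space
    W = W - lr * orthogonalize(m)          # steepest-descent direction under the spectral norm
    return W, m

# Toy usage on a single linear layer with random stand-in gradients
rng = np.random.default_rng(0)
W = rng.normal(size=(64, 32)) / np.sqrt(32)
m = np.zeros_like(W)
for _ in range(10):
    G = rng.normal(size=W.shape)           # stand-in for a minibatch gradient
    W, m = ema_then_dualize(W, G, m)
```

The EMA sits between the raw gradient and the duality map, so a step-size or stability argument stated for dualize(g_t) alone does not directly cover dualize(m_t); formalizing that composition is the question this section poses.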

References

Bernstein and Newhouse acknowledge this gap in their “Norm Anthology” paper, noting that understanding EMA's role in the framework is “perhaps still an open problem”.

Towards Guided Descent: Optimization Algorithms for Training Neural Networks At Scale (2512.18373 - Nagwekar, 20 Dec 2025), in Subsubsection “Missing Theory on EMA” within Section “Limitations of the Modular Norm Framework”.