Momentum: Look-ahead versus EMA updates

Determine whether evaluating gradients at look-ahead points (as in Nesterov’s accelerated gradient) yields theoretically superior momentum updates compared to tracking prior gradient information via exponential moving averages (Polyak’s heavy ball style) for stochastic optimization of deep neural networks, and identify the conditions under which each momentum formulation is provably preferable.
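
Written out in standard notation (step size \(\eta\), momentum coefficient \(\beta\), and stochastic gradient \(g\); these symbols are conventional choices rather than the paper’s), the two update rules under comparison are:

    % Heavy ball: gradient evaluated at the current iterate \theta_t
    v_{t+1} = \beta v_t - \eta\, g(\theta_t), \qquad \theta_{t+1} = \theta_t + v_{t+1}

    % EMA form common in modern implementations (same family, rescaled buffer)
    m_{t+1} = \beta m_t + (1-\beta)\, g(\theta_t), \qquad \theta_{t+1} = \theta_t - \eta\, m_{t+1}

    % Nesterov's accelerated gradient: gradient evaluated at a look-ahead point
    v_{t+1} = \beta v_t - \eta\, g(\theta_t + \beta v_t), \qquad \theta_{t+1} = \theta_t + v_{t+1}

The heavy-ball and EMA forms differ only by a rescaling of the buffer and the effective step size; both query the gradient at the current iterate, whereas Nesterov’s method queries it at the extrapolated point.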

Background

The paper reviews classical momentum formulations, including Polyak’s heavy ball and Nesterov’s accelerated gradient, and interprets momentum both as inertia and as gradient averaging. It notes that modern practice commonly implements momentum via exponential moving averages (EMA), whereas Nesterov’s method evaluates the gradient at a look-ahead point.
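
To make the contrast concrete, here is a minimal sketch of the two update styles on a toy noisy quadratic (the function names, hyperparameter values, and objective are illustrative choices, not taken from the paper):

    # Minimal sketch (not from the paper) contrasting EMA-style and
    # Nesterov-style momentum. `grad(theta)` is a stochastic gradient oracle;
    # eta and beta are hypothetical hyperparameters.
    import numpy as np

    def ema_momentum_step(theta, m, grad, eta=0.01, beta=0.9):
        """EMA-style momentum (heavy-ball family): the buffer is an
        exponential moving average of gradients taken at the current iterate."""
        m = beta * m + (1.0 - beta) * grad(theta)   # average past gradients
        theta = theta - eta * m
        return theta, m

    def nesterov_step(theta, v, grad, eta=0.01, beta=0.9):
        """Nesterov-style momentum: the gradient is evaluated at the
        look-ahead point theta + beta * v rather than at theta itself."""
        g = grad(theta + beta * v)                  # look-ahead evaluation
        v = beta * v - eta * g
        theta = theta + v
        return theta, v

    # Toy usage on f(x) = 0.5 * ||x||^2 with noisy gradients (grad f(x) = x).
    rng = np.random.default_rng(0)
    grad = lambda x: x + 0.1 * rng.standard_normal(x.shape)

    theta_ema, m = np.ones(3), np.zeros(3)
    theta_nag, v = np.ones(3), np.zeros(3)
    for _ in range(100):
        theta_ema, m = ema_momentum_step(theta_ema, m, grad)
        theta_nag, v = nesterov_step(theta_nag, v, grad)

The only structural difference is where the stochastic gradient is queried: at the current iterate in the EMA case, at the extrapolated point in the Nesterov case.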

Despite the widespread use of momentum, the authors highlight that the theoretical choice between these formulations remains unsettled, motivating a precise analysis of when look-ahead gradient evaluation versus EMA-based tracking should be preferred in deep learning’s stochastic, non-convex settings.

References

While momentum remains crucial to deep learning, its exact usage remains varied and open. Whether to perform momentum updates at look-ahead points or simply to track prior gradient information with a moving average is still theoretically open; modern momentum schemes focus on reducing the computational memory footprint and opt for EMA solutions.

Towards Guided Descent: Optimization Algorithms for Training Neural Networks At Scale (2512.18373 - Nagwekar, 20 Dec 2025) in Subsection “Momentum, Polyak’s Heavy Ball, and Nesterov's Method” within Section “Classical Methods for Neural Network Optimization”