Momentum: Look-ahead versus EMA updates
Determine whether evaluating gradients at look-ahead points (as in Nesterov’s accelerated gradient) yields theoretically superior momentum updates to tracking prior gradient information via exponential moving averages (Polyak’s heavy-ball style) for stochastic optimization of deep neural networks, and identify the conditions under which each momentum formulation is provably preferable.
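For concreteness, a minimal sketch of the formulations in question, with notation chosen here rather than fixed by the problem: let $\eta$ be the step size and $\mu$ (or $\beta$) the momentum coefficient. The heavy-ball and EMA-style updates evaluate the gradient at the current iterate, while Nesterov's update evaluates it at a look-ahead point.

$$\text{Heavy ball:}\quad v_{t+1} = \mu v_t + \nabla f(x_t), \qquad x_{t+1} = x_t - \eta\, v_{t+1}$$

$$\text{EMA form:}\quad m_{t+1} = \beta m_t + (1-\beta)\,\nabla f(x_t), \qquad x_{t+1} = x_t - \eta\, m_{t+1}$$

$$\text{Nesterov (look-ahead):}\quad v_{t+1} = \mu v_t + \nabla f(x_t - \eta\mu\, v_t), \qquad x_{t+1} = x_t - \eta\, v_{t+1}$$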
While momentum remains central to deep learning, how best to apply it is still unsettled. Whether to evaluate gradients at look-ahead points or simply to track prior gradient information with an exponential moving average remains a theoretically open question; modern momentum schemes tend to prioritize a small memory footprint and opt for EMA-style updates.
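The sketch below is an illustration of the contrast, not part of the original problem statement: it compares the three update rules on a toy ill-conditioned quadratic. Function names (heavy_ball_step, nesterov_step, ema_step) and hyperparameters are our own choices.

```python
import numpy as np

def heavy_ball_step(x, v, grad_fn, lr=0.1, mu=0.9):
    """Polyak heavy ball: gradient evaluated at the current iterate;
    the velocity is an (un-normalized) running sum of past gradients."""
    g = grad_fn(x)
    v = mu * v + g
    return x - lr * v, v

def nesterov_step(x, v, grad_fn, lr=0.1, mu=0.9):
    """Nesterov accelerated gradient: gradient evaluated at the
    look-ahead point x - lr * mu * v."""
    g = grad_fn(x - lr * mu * v)
    v = mu * v + g
    return x - lr * v, v

def ema_step(x, m, grad_fn, lr=0.1, beta=0.9):
    """EMA-style momentum (as in Adam's first moment): an exponential
    moving average of past gradients, evaluated at the current iterate."""
    g = grad_fn(x)
    m = beta * m + (1.0 - beta) * g
    return x - lr * m, m

if __name__ == "__main__":
    # Toy ill-conditioned quadratic f(x) = 0.5 * x^T A x with gradient A x.
    A = np.diag([1.0, 25.0])
    grad_fn = lambda x: A @ x

    for name, step in [("heavy ball", heavy_ball_step),
                       ("nesterov", nesterov_step),
                       ("ema", ema_step)]:
        x, buf = np.array([1.0, 1.0]), np.zeros(2)
        for _ in range(100):
            x, buf = step(x, buf, grad_fn, lr=0.03)
        print(f"{name:10s} final |x| = {np.linalg.norm(x):.3e}")
```

All three variants keep a single extra buffer per parameter, so the memory footprint is identical; the open question is whether the extra look-ahead gradient evaluation point buys anything provable in the stochastic, non-convex regime.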