High-probability convergence for momentum NSGD and variance-reduced estimators under heavy-tailed noise
Establish high-probability convergence guarantees, with only logarithmic dependence on the failure probability, for Normalized SGD with momentum (i.e., the variant whose update direction is the momentum estimator g_t = β_t g_{t-1} + (1−β_t)∇f(x_t, ξ_t), used in the normalized step x_{t+1} = x_t − η_t g_t/‖g_t‖) and for Normalized SGD equipped with variance-reduced gradient estimators, under the bounded p-th central moment (p-BCM) noise model with p ∈ (1, 2]. Specifically, show that these algorithms converge, without gradient clipping, to an ε-stationary point with a high-probability bound analogous in form to the minibatch Normalized SGD result proved in this paper.
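For concreteness, here is a minimal runnable sketch of the momentum NSGD iteration described above. It is only an illustration under stated assumptions: the constant momentum parameter, step size, and the heavy-tailed test oracle below are hypothetical choices, not taken from the paper.

import numpy as np

def momentum_nsgd(grad_oracle, x0, T, eta=1e-2, beta=0.9):
    """Momentum NSGD: g_t = beta*g_{t-1} + (1-beta)*grad(x_t, xi_t), x_{t+1} = x_t - eta*g_t/||g_t||.
    A constant beta is used here for simplicity; the problem statement allows a schedule beta_t."""
    x = np.asarray(x0, dtype=float)
    g = np.zeros_like(x)
    for _ in range(T):
        g = beta * g + (1.0 - beta) * grad_oracle(x)  # momentum gradient estimator g_t
        norm = np.linalg.norm(g)
        if norm > 0.0:
            x = x - eta * g / norm                    # normalized step; no gradient clipping
    return x

# Illustrative heavy-tailed oracle: gradient of f(x) = 0.5*||x||^2 corrupted by
# Student-t noise with 1.5 degrees of freedom (infinite variance, finite p-th
# central moment for p < 1.5, so it falls under the p-BCM model with p ∈ (1, 2]).
rng = np.random.default_rng(0)
oracle = lambda x: x + rng.standard_t(df=1.5, size=x.shape)
x_final = momentum_nsgd(oracle, x0=np.ones(10), T=5000)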
References
For instance, it remains unclear whether our high-probability result can be extended to NSGD with momentum or variance-reduced gradient estimators.