High-probability convergence for momentum NSGD and variance-reduced estimators under heavy-tailed noise

Establish high-probability convergence guarantees, with only logarithmic dependence on the failure probability, for Normalized SGD with momentum (i.e., the momentum variant defined by g_t = β_t g_{t-1} + (1 − β_t)∇f(x_t, ξ_t)) and for Normalized SGD equipped with variance-reduced gradient estimators, under the bounded p-th central moment (p-BCM) noise model with p ∈ (1, 2]. Specifically, show that these algorithms, without using gradient clipping, converge to an ε-stationary point with a high-probability bound analogous in form to the minibatch Normalized SGD result proved in this paper.
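
For concreteness, the following is a minimal Python/NumPy sketch of the two estimators the problem refers to, each plugged into the normalized step x_{t+1} = x_t − η g_t/‖g_t‖ without clipping. The names grad_oracle, grad_oracle_pair, and eta, and the use of a constant beta in place of the schedule β_t, are illustrative assumptions; the variance-reduced estimator is written in the STORM-style recursive form, which is one common choice rather than a construction prescribed by the paper.

```python
import numpy as np


def nsgd_momentum(grad_oracle, x0, steps, eta, beta):
    """Normalized SGD with momentum (sketch): g_t = beta*g_{t-1} + (1-beta)*grad f(x_t, xi_t),
    followed by x_{t+1} = x_t - eta * g_t / ||g_t||.  No gradient clipping is used."""
    x = np.asarray(x0, dtype=float)
    g = np.zeros_like(x)
    for _ in range(steps):
        g = beta * g + (1.0 - beta) * grad_oracle(x)  # momentum gradient estimator
        nrm = np.linalg.norm(g)
        if nrm > 0:  # normalized step
            x = x - eta * g / nrm
    return x


def nsgd_storm(grad_oracle_pair, x0, steps, eta, beta):
    """Normalized SGD with a STORM-style recursive variance-reduced estimator (sketch):
    g_t = grad f(x_t, xi_t) + (1-beta)*(g_{t-1} - grad f(x_{t-1}, xi_t)),
    where grad_oracle_pair(x, x_prev) evaluates both points on the SAME sample xi_t."""
    x = np.asarray(x0, dtype=float)
    g, _ = grad_oracle_pair(x, x)  # g_0 is a plain stochastic gradient at x_0
    for _ in range(steps):
        x_prev, x = x, x - eta * g / max(np.linalg.norm(g), 1e-12)  # normalized step
        grad_new, grad_old = grad_oracle_pair(x, x_prev)
        g = grad_new + (1.0 - beta) * (g - grad_old)  # recursive variance reduction
    return x
```

The temporal coupling visible in both recursions (g_t depends on g_{t-1}, and the STORM-style update additionally reuses the same sample at two iterates) is exactly the correlation structure that makes the high-probability analysis harder than in the minibatch case.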

Background

The paper proves the first high-probability convergence guarantee for minibatch Normalized SGD (NSGD) under heavy-tailed noise modeled by bounded p-th central moments (p-BCM), without using gradient clipping. The result achieves a sample complexity matching the parameter-dependent lower bound up to logarithmic factors in the failure probability.
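
To make the setting concrete, here is a small sketch pairing the minibatch NSGD update with a synthetic gradient oracle whose additive noise is symmetrized Pareto-tailed: its variance is infinite, yet bounded p-th central moments hold for every p < α (α = 1.5 below), which is the p-BCM regime. The quadratic objective, the oracle, and the step-size/batch-size choices are illustrative assumptions, not the paper's setup.

```python
import numpy as np


def heavy_tailed_grad(x, rng, alpha=1.5, scale=1.0):
    """Stochastic gradient of f(x) = 0.5*||x||^2 with heavy-tailed additive noise.
    The symmetrized Pareto noise has finite p-th central moment only for p < alpha,
    so with alpha = 1.5 the variance is infinite while p-BCM holds for p in (1, 1.5)."""
    signs = rng.choice([-1.0, 1.0], size=x.shape)
    noise = scale * rng.pareto(alpha, size=x.shape) * signs
    return x + noise  # the true gradient of 0.5*||x||^2 is x


def minibatch_nsgd(x0, steps, eta, batch_size, seed=0):
    """Minibatch Normalized SGD (sketch): average a minibatch of stochastic gradients,
    then move along the normalized average.  No gradient clipping is used."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        g = np.mean([heavy_tailed_grad(x, rng) for _ in range(batch_size)], axis=0)
        nrm = np.linalg.norm(g)
        if nrm > 0:
            x = x - eta * g / nrm
    return x


# Example run: report the final distance to the unique stationary point x* = 0.
print(np.linalg.norm(minibatch_nsgd(np.ones(10), steps=2000, eta=0.01, batch_size=16)))
```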

However, the authors were unable to extend this high-probability analysis to NSGD with momentum and to NSGD variants that use variance-reduced gradient estimators, primarily due to technical challenges arising from temporal correlations introduced by momentum and the structure of variance reduction. This limitation is discussed in the main text and in an appendix analyzing the difficulty of extending the argument.

References

For instance, it remains unclear whether our high-probability result can be extended to NSGD with momentum or variance reduced gradient estimators.

From Gradient Clipping to Normalization for Heavy Tailed SGD (2410.13849 - Hübler et al., 17 Oct 2024) in Section 6 (Conclusion)