
Long-run selection among critical points under SGD

Determine, for stochastic gradient descent with a constant step-size applied to a smooth non-convex objective function f: R^d -> R, which critical points, or connected components of the critical set of f, are more likely to be observed in the long run by the algorithm, and quantify the relative likelihoods of visits (i.e., the asymptotic distribution of the iterates over these components).


Background

After discussing standard non-convex optimization guarantees that only control average gradient norms, the authors emphasize that these do not identify where SGD ultimately concentrates. They highlight that deep non-convex landscapes can contain many saddles and local minima, and that existing saddle-avoidance results do not quantify long-run selection among critical components. This motivates an explicit open question regarding the long-run distribution of SGD across critical points.
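The question can be made concrete on a toy landscape. The following sketch (an illustration only, not the paper's analysis; the double-well objective, noise model, and all parameter values are assumptions chosen for demonstration) runs constant step-size SGD with Gaussian gradient noise on f(x) = (x^2 - 1)^2, which has two minima at x = ±1 separated by a saddle at 0, and tallies how often the iterate occupies each well:

```python
import random

def grad_f(x):
    # Gradient of f(x) = (x^2 - 1)^2: a double-well with minima at x = +/-1
    # and a saddle (local maximum of f restricted to the line) at x = 0.
    return 4 * x * (x**2 - 1)

def sgd_occupancy(eta=0.05, sigma=0.3, n_steps=200_000, burn_in=10_000, seed=0):
    """Run constant step-size SGD with additive Gaussian gradient noise and
    return the fraction of post-burn-in iterations spent in each well.
    All defaults are illustrative choices, not values from the paper."""
    rng = random.Random(seed)
    x = rng.uniform(-2.0, 2.0)
    counts = {"left": 0, "right": 0}
    for t in range(n_steps):
        noisy_grad = grad_f(x) + sigma * rng.gauss(0.0, 1.0)
        x -= eta * noisy_grad
        if t >= burn_in:
            counts["left" if x < 0 else "right"] += 1
    total = n_steps - burn_in
    return {k: v / total for k, v in counts.items()}

print(sgd_occupancy())
```

The empirical occupancy fractions are exactly the quantity the open question asks about: how they depend on the step-size, the noise, and the geometry of each critical component, and what their limit is as the horizon grows.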

References

In particular, the following crucial question remains open: Which critical points of f (or components thereof) are more likely to be observed in the long run, and by how much?

What is the long-run distribution of stochastic gradient descent? A large deviations analysis (2406.09241 - Azizian et al., 13 Jun 2024) in Section 1 (Introduction)