
Long-run selection among critical points under SGD

Determine, for stochastic gradient descent with a constant step-size applied to a smooth non-convex objective function f: R^d -> R, which critical points, or connected components of the critical set of f, are more likely to be observed in the long run by the algorithm, and quantify the relative likelihoods of visits (i.e., the asymptotic distribution of the iterates over these components).


Background

After discussing standard non-convex optimization guarantees that only control average gradient norms, the authors emphasize that these do not identify where SGD ultimately concentrates. They highlight that deep non-convex landscapes can contain many saddles and local minima, and that existing saddle-avoidance results do not quantify long-run selection among critical components. This motivates an explicit open question regarding the long-run distribution of SGD across critical points.
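The question can be made concrete on a toy landscape. The following sketch (an illustration only, not the paper's analysis; the double-well objective, noise model, and all parameter values are assumptions chosen for demonstration) runs constant step-size SGD with Gaussian gradient noise on f(x) = (x^2 - 1)^2, which has two minima at x = ±1 separated by a saddle at 0, and tallies how often the iterate occupies each well:

```python
import random

def grad_f(x):
    # Gradient of f(x) = (x^2 - 1)^2: a double-well with minima at x = +/-1
    # and a saddle (local maximum of f restricted to the line) at x = 0.
    return 4 * x * (x**2 - 1)

def sgd_occupancy(eta=0.05, sigma=0.3, n_steps=200_000, burn_in=10_000, seed=0):
    """Run constant step-size SGD with additive Gaussian gradient noise and
    return the fraction of post-burn-in iterations spent in each well.
    All defaults are illustrative choices, not values from the paper."""
    rng = random.Random(seed)
    x = rng.uniform(-2.0, 2.0)
    counts = {"left": 0, "right": 0}
    for t in range(n_steps):
        noisy_grad = grad_f(x) + sigma * rng.gauss(0.0, 1.0)
        x -= eta * noisy_grad
        if t >= burn_in:
            counts["left" if x < 0 else "right"] += 1
    total = n_steps - burn_in
    return {k: v / total for k, v in counts.items()}

print(sgd_occupancy())
```

The empirical occupancy fractions are exactly the quantity the open question asks about: how they depend on the step-size, the noise, and the geometry of each critical component, and what their limit is as the horizon grows.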

References

In particular, the following crucial question remains open: Which critical points of f (or components thereof) are more likely to be observed in the long run, and by how much?

What is the long-run distribution of stochastic gradient descent? A large deviations analysis (2406.09241 - Azizian et al., 13 Jun 2024) in Section 1 (Introduction)