Efficient attainment of ρ-SOSPs by SGD with random initialization

Determine whether stochastic gradient descent (SGD) with random initialization can efficiently attain a ρ-second-order stationary point (ρ-SOSP) when optimizing the regularized expected risk of a neural network of the form f(W, b; θ) = E[g_θ(Wx + b)] + λ||W||_F^2, under smoothness and Hessian-Lipschitz assumptions and with Gaussian inputs x, as considered in the derandomization framework for structure discovery.
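
For concreteness, the following is a minimal sketch, under placeholder assumptions, of the training procedure the question refers to: plain SGD from a random Gaussian initialization on single-sample estimates of the regularized objective, with Gaussian inputs. The particular instantiation of g_θ (a fixed linear readout of tanh features), the dimensions d and m, the regularization weight lam, and the step size lr are illustrative choices, not taken from the paper.

    import numpy as np

    rng = np.random.default_rng(0)
    d, m = 10, 20                        # input / hidden dimensions (illustrative)
    lam, lr, steps = 1e-3, 1e-2, 5000    # regularization weight, step size, iterations (illustrative)

    theta = rng.normal(size=m)           # fixed parameters of g_theta (placeholder choice)
    W = rng.normal(scale=1.0 / np.sqrt(d), size=(m, d))   # random (Gaussian) initialization
    b = np.zeros(m)

    def g_theta(h):
        # placeholder instantiation of g_theta: a fixed linear readout of tanh features
        return theta @ np.tanh(h)

    for t in range(steps):
        x = rng.normal(size=d)           # Gaussian input, per the distributional assumption
        z = W @ x + b
        # single-sample estimate of f(W, b; theta) = E[g_theta(Wx + b)] + lam * ||W||_F^2
        grad_z = theta * (1.0 - np.tanh(z) ** 2)          # d g_theta / d z for this placeholder g_theta
        grad_W = np.outer(grad_z, x) + 2.0 * lam * W      # gradient of sample loss plus regularizer
        grad_b = grad_z
        W -= lr * grad_W
        b -= lr * grad_b

The open question is whether this kind of iteration, run for polynomially many steps, provably lands at a ρ-SOSP of the population objective.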

Background

The paper’s central mechanism for structure discovery is to analyze solutions that are approximate second-order stationary points (ρ-SOSPs); a standard formalization is recalled below. Perturbed Gradient Descent (PGD) and Hessian Descent come with theoretical guarantees of reaching ρ-SOSPs under smoothness and Hessian-Lipschitz assumptions. Gradient descent and SGD are known to avoid strict saddle points in many settings, but efficient convergence of SGD to ρ-SOSPs has not been established.
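
For reference, a common convention in this literature (e.g., following Nesterov–Polyak and Jin et al.) is that, for a function f whose Hessian is ρ-Lipschitz, a point x is an approximate SOSP when

    ||∇f(x)|| ≤ ε   and   λ_min(∇²f(x)) ≥ -√(ρε).

The exact parametrization behind the paper’s “ρ-SOSP” may differ, but this is the notion that PGD-style guarantees target: a near-zero gradient together with no direction of strongly negative curvature.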

In the introduction, the authors list SGD with random initialization as an example of a method that “attains a ρ-SOSP,” but a footnote explicitly notes that it is an open question whether this can be done efficiently. This highlights a gap between empirical observations of SGD’s behavior and current theory, and it motivates a precise complexity or efficiency guarantee for SGD to reach ρ-SOSPs in the neural-network training regime described.

References

“It is an open question whether this can be done efficiently, but empirical results on NNs strongly support this behavior.”

Tsikouras et al., “A Derandomization Framework for Structure Discovery: Applications in Neural Networks and Beyond,” arXiv:2510.19382, 22 Oct 2025; Section 1 (Introduction), footnote following “SGD with random initialization.”