
Efficient attainment of ρ-SOSPs by SGD with random initialization

Determine whether stochastic gradient descent (SGD) with random initialization can efficiently attain a ρ-second-order stationary point (ρ-SOSP) when optimizing the regularized expected risk of a neural network of the form f(W, b; θ) = E[g_θ(Wx + b)] + λ||W||_F^2, under the smoothness and Hessian-Lipschitz assumptions and with Gaussian inputs, as considered in the paper's derandomization framework for structure discovery.
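
For reference, a standard formalization of approximate second-order stationarity from the nonconvex optimization literature is sketched below; the tolerance ε and the exact way the Hessian-Lipschitz constant ρ enters the paper's definition of a ρ-SOSP are assumptions here rather than details taken from the paper.

```latex
% Common (\epsilon, \rho)-second-order stationarity condition for a twice-differentiable,
% \rho-Hessian-Lipschitz objective f; the paper's precise parameterization may differ.
\[
\|\nabla f(W, b)\| \le \epsilon
\qquad \text{and} \qquad
\lambda_{\min}\!\left(\nabla^{2} f(W, b)\right) \ge -\sqrt{\rho\,\epsilon}.
\]
```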


Background

The paper’s central mechanism for structure discovery is to analyze solutions that are approximate second-order stationary points (ρ-SOSPs). Perturbed Gradient Descent (PGD) and Hessian Descent come with theoretical guarantees for reaching ρ-SOSPs under smoothness and Hessian-Lipschitz assumptions. Gradient descent and SGD are known to avoid strict saddle points in many settings, but an efficient convergence guarantee to ρ-SOSPs for SGD has not been established.

Within the introduction, the authors list SGD with random initialization as an example of a method that “attains a ρ-SOSP,” but explicitly note in a footnote that it is an open question whether this can be done efficiently. This highlights a gap between empirical observations of SGD’s behavior and current theory, motivating a precise complexity or efficiency guarantee for SGD to reach ρ-SOSPs in the neural network training regime described.
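
The following is a minimal, self-contained numerical sketch (PyTorch), not code from the paper: it runs SGD from a random Gaussian initialization on a toy instance of the stated objective with Gaussian inputs, then checks the two approximate second-order stationarity conditions at the final iterate. The choice g_θ(z) = aᵀ tanh(z), the dimensions, the regularization weight, and all optimizer hyperparameters are illustrative assumptions.

```python
# Minimal numerical sketch (not from the paper): run SGD from a random Gaussian
# initialization on a toy instance of f(W, b) = E[g_theta(Wx + b)] + lam * ||W||_F^2
# with Gaussian inputs, then check approximate second-order stationarity at the
# final iterate. The choice g_theta(z) = a^T tanh(z), the dimensions, lam, the
# step size, and the iteration budget are all illustrative assumptions.
import torch

torch.manual_seed(0)
d_in, d_out = 8, 4            # input / output dimensions (assumed)
lam = 1e-3                    # Frobenius regularization weight (assumed)
a = torch.randn(d_out)        # fixed parameters "theta" of g_theta (assumed)

def g_theta(z):
    # Illustrative smooth choice of g_theta; the paper leaves g_theta generic.
    return torch.tanh(z) @ a

def risk(W, b, x):
    # Minibatch surrogate for E[g_theta(Wx + b)] + lam * ||W||_F^2,
    # with the expectation estimated over Gaussian inputs x (one per row).
    return g_theta(x @ W.T + b).mean() + lam * W.pow(2).sum()

# SGD with random (Gaussian) initialization.
W = torch.randn(d_out, d_in, requires_grad=True)
b = torch.randn(d_out, requires_grad=True)
opt = torch.optim.SGD([W, b], lr=1e-2)
for _ in range(5000):
    x = torch.randn(256, d_in)          # fresh Gaussian minibatch
    opt.zero_grad()
    risk(W, b, x).backward()
    opt.step()

# Numerically check the two approximate-SOSP conditions on a large fixed sample:
# small gradient norm and smallest Hessian eigenvalue not too negative.
x_big = torch.randn(50_000, d_in)
params = torch.cat([W.detach().flatten(), b.detach()]).requires_grad_(True)

def risk_flat(p):
    W_ = p[: d_out * d_in].view(d_out, d_in)
    b_ = p[d_out * d_in:]
    return g_theta(x_big @ W_.T + b_).mean() + lam * W_.pow(2).sum()

grad = torch.autograd.grad(risk_flat(params), params)[0]
H = torch.autograd.functional.hessian(risk_flat, params.detach())
eig_min = torch.linalg.eigvalsh(H).min()
print(f"||grad|| = {grad.norm().item():.3e}, lambda_min(Hessian) = {eig_min.item():.3e}")
```

The open question concerns whether SGD provably reaches such a point efficiently (e.g., with polynomial iteration complexity); the sketch only shows how the two stationarity conditions would be verified numerically at whatever iterate SGD produces.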

References

It is an open question whether this can be done efficiently, but empirical results on NNs strongly support this behavior.

Tsikouras et al., “A Derandomization Framework for Structure Discovery: Applications in Neural Networks and Beyond,” arXiv:2510.19382, 22 Oct 2025. Section 1 (Introduction), footnote following “SGD with random initialization.”