- The paper introduces the Correlated Negative Curvature (CNC) assumption, showing how SGD's inherent gradient noise can drive escape from saddle points in non-convex optimization.
- Under the CNC assumption, the paper proves dimension-free convergence to second-order stationary points for both a perturbed gradient descent method and vanilla SGD.
- The CNC assumption is validated for learning half-spaces and empirically supported by variance observations in neural networks, highlighting practical applicability.
Analysis of Escaping Saddles with Stochastic Gradients
This paper investigates the optimization of non-convex functions using stochastic gradient descent (SGD) and proposes an assumption called Correlated Negative Curvature (CNC) to address challenges related to saddle points. The authors aim to provide insights into the convergence properties of SGD towards second-order stationary points by leveraging the inherent noise of stochastic gradients rather than introducing explicit perturbation noise, which is typically isotropic.
Stochastic Gradient Descent and Second-Order Stationarity
SGD is a widely used technique for training large-scale neural networks, well regarded for its practical efficiency and scalability. Although its convergence properties are well understood for convex functions, non-convex functions pose additional difficulties due to saddle points. A saddle point is a stationary point, where the gradient vanishes, that is not a local minimum; at a strict saddle the Hessian has at least one negative eigenvalue, so a descent direction exists but is invisible to first-order information alone. Prior approaches escape saddle points by injecting isotropic noise perturbations that push the iterates away from such points. The authors argue that this added isotropic noise may be unnecessary, because the noise already present in stochastic gradients can play the same role.
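As a standard illustration (a textbook example, not taken from the paper), consider the two-dimensional quadratic

$$
f(x_1, x_2) = x_1^2 - x_2^2, \qquad \nabla f(0,0) = 0, \qquad \nabla^2 f(0,0) = \mathrm{diag}(2,\,-2).
$$

The origin is stationary but not a minimum: the Hessian has a negative eigenvalue, gradient descent started exactly on the x1-axis never moves along x2, and any noise with a component along x2, the negative-curvature direction, is rapidly amplified away from the saddle.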
Correlated Negative Curvature Assumption
The CNC assumption posits that the stochastic gradient has uniformly positive variance (more precisely, a uniformly positive second moment) along directions of negative curvature, i.e. along eigenvectors of the Hessian associated with its most negative eigenvalues. The assumption is compelling because it makes dimension-independent convergence possible without isotropic noise perturbations, which is particularly valuable in the high-dimensional settings typical of deep learning, where guarantees should not degrade with the input dimension.
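In symbols (notation reconstructed for this summary and possibly differing from the paper's): let g(x) denote a stochastic gradient at x and v_x a unit eigenvector of the Hessian ∇²f(x) corresponding to its most negative eigenvalue. CNC requires a uniform lower bound

$$
\mathbb{E}\big[\langle v_x,\, g(x)\rangle^2\big] \;\ge\; \gamma \;>\; 0
$$

with a constant γ that does not depend on the ambient dimension; this second moment is exactly what replaces the injected isotropic perturbation in the analysis.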
Under the CNC assumption, the authors derive convergence guarantees for a perturbed gradient descent method whose perturbation is a single stochastic gradient step (CNC-PGD) and for vanilla SGD (CNC-SGD). Both analyses leverage the stochastic gradient's inherent variance to establish convergence to second-order stationary points, and the resulting complexity for SGD is dimension-free, an improvement over prior methods whose guarantees carry a poly-logarithmic or polynomial dependence on the input dimension.
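A minimal sketch of how such a CNC-PGD-style loop could look, assuming hypothetical `grad` and `stoch_grad` callables and illustrative parameter values (this is not the paper's pseudocode and omits its stopping conditions):

```python
import numpy as np

def cnc_pgd(grad, stoch_grad, x0, eta=1e-3, r=1e-1, g_thresh=1e-3,
            t_thresh=50, max_iters=10_000, seed=0):
    """Sketch of a CNC-PGD-style loop (parameter names are illustrative).

    Plain gradient descent runs until the full gradient is small (a candidate
    saddle or minimum). A single *stochastic* gradient step with a larger step
    size r then replaces the usual isotropic perturbation; under CNC, its
    variance along the negative-curvature direction lets the following
    deterministic steps amplify the escape from a strict saddle.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iters):
        g = grad(x)
        if np.linalg.norm(g) > g_thresh:
            x = x - eta * g                   # ordinary gradient step
        else:
            x = x - r * stoch_grad(x, rng)    # stochastic "perturbation" step
            for _ in range(t_thresh):         # let GD amplify any escape
                x = x - eta * grad(x)
    return x

# Toy usage: f(x) = (x1^2 - 1)^2 + x2^2 has a strict saddle at the origin and
# minima at (+/-1, 0); the Gaussian noise here is a simple stand-in that
# satisfies CNC because it has variance along the negative-curvature direction.
f_grad = lambda x: np.array([4 * x[0] * (x[0] ** 2 - 1), 2 * x[1]])
noisy_grad = lambda x, rng: f_grad(x) + rng.normal(scale=0.1, size=2)
print(cnc_pgd(f_grad, noisy_grad, x0=[0.0, 0.0]))  # lands near (+/-1, 0)
```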
Theoretical Insights and Practical Implications
Contributions include:
- Convergence of CNC-PGD: Under CNC, CNC-PGD reaches a second-order stationary point in polynomial time with high probability, eliminating dimensionality dependence seen in isotropic noise-based perturbation methods.
- Convergence of CNC-SGD: CNC-SGD also achieves convergence to second-order stationary points without noise perturbations, which the authors emphasize is a first-of-its-kind result.
- Validation in Learning Half-Spaces: The authors prove that the CNC assumption holds for learning half-spaces, a fundamental problem in machine learning, giving the assumption concrete grounding beyond an abstract hypothesis.
- Empirical Verification on Neural Networks: Measured variance of stochastic gradients along negative-curvature directions in neural networks supports the CNC assumption, and the observed lower bound appears largely insensitive to network width and depth (see the variance-check sketch after this list).
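The kind of measurement behind that last point can be pictured with a small sketch, assuming access to per-example gradients and a Hessian estimate at a fixed parameter vector (an illustrative check, not the authors' experimental protocol):

```python
import numpy as np

def cnc_second_moment(per_example_grads, hessian):
    """Second moment of stochastic gradients along the most negative Hessian
    direction at one point x; CNC asks this to stay bounded below by some
    gamma > 0 wherever the Hessian has significant negative curvature.

    per_example_grads: (n, d) array of per-example gradients g_i(x)
    hessian:           (d, d) Hessian (or approximation) evaluated at the same x
    """
    eigvals, eigvecs = np.linalg.eigh(hessian)   # eigenvalues in ascending order
    v = eigvecs[:, 0]                            # direction of most negative curvature
    proj = per_example_grads @ v                 # <v, g_i(x)> for each example i
    return float(np.mean(proj ** 2))
```

For realistic network sizes, the eigenvector of the most negative eigenvalue would be obtained with a matrix-free routine (e.g. Lanczos on Hessian-vector products) rather than the dense eigendecomposition used in this toy version.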
The results hold practical significance for the broader field of machine learning, particularly deep learning. They highlight the potential for simpler optimization procedures that remain computationally efficient, providing theoretical support for what has largely been an empirical practice: running plain SGD without injected noise.
Future Directions
Given the promising results under the CNC assumption, further investigations might focus on formally proving the CNC condition in more general machine learning settings, potentially enriching understanding of stochastic optimization techniques in complex, high-dimensional landscapes. Moreover, exploring the CNC condition’s broader implications on the generalization properties of models trained under these regimes could unveil further insights into the role of stochastic noise in model robustness and reliability.
In conclusion, the paper advances theoretical insights into the optimization dynamics of SGD in non-convex settings, providing a foundation for efficient machine learning model training without the computational overhead introduced by isotropic noise perturbations. Through CNC, the authors contribute to a deeper understanding of the stochastic processes involved, warranting further exploration and potential refinement in future research.