- The paper gives a theoretical guarantee that SGD, viewed as gradient descent on a convolved (noise-smoothed) version of the loss, avoids sharp, poor local minima whenever the smoothed gradients are one-point convex with respect to a target solution.
- It explains why a large initial step size, reduced later in training, lets SGD first escape sharp minima and then settle toward a flatter, better solution.
- Empirical evidence in the paper suggests that neural network loss surfaces are locally one-point convex along SGD's trajectory, supporting the practical relevance of the assumption.
An Alternative View: When Does SGD Escape Local Minima?
The paper "An Alternative View: When Does SGD Escape Local Minima?" by Kleinberg, Li, and Yuan investigates the behavior of Stochastic Gradient Descent (SGD) in escaping local minima, a challenge frequently encountered in training modern neural networks. The paper posits that when SGD operates on a convolved version of the loss function, characterized by one-point convexity, it can avoid poor local minima and approach favorable solutions with a constant probability.
Key Contributions
The authors offer a novel perspective on SGD by showing that it effectively performs gradient descent on a smoothed version of the objective. The paper proves that even if the original function f has numerous suboptimal local minima or saddle points, SGD can converge to a neighborhood of a desirable solution x∗, provided the gradients, averaged over the noise in that neighborhood, are one-point convex with respect to x∗.
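To make this view concrete, here is a brief sketch of the reparametrization behind it, using SGD with step size η and zero-mean gradient noise ξ_t; the exact assumptions and constants are in the paper.

```latex
% SGD step on f with stochastic gradient \nabla f(x_t) + \xi_t, where \mathbb{E}[\xi_t] = 0:
x_{t+1} = x_t - \eta \big( \nabla f(x_t) + \xi_t \big).

% Reparametrize with y_t := x_t - \eta \nabla f(x_t). Since x_{t+1} = y_t - \eta \xi_t,
y_{t+1} = y_t - \eta \xi_t - \eta \nabla f(y_t - \eta \xi_t),
\qquad
\mathbb{E}\big[ y_{t+1} \mid y_t \big] = y_t - \eta \nabla F(y_t),
\quad
F(y) := \mathbb{E}_{\xi}\big[ f(y - \eta \xi) \big].

% In expectation, SGD on f is gradient descent on the convolved (noise-smoothed)
% objective F. The key assumption is one-point convexity of the smoothed gradient
% with respect to x^*, for all y in a neighborhood of x^*:
\big\langle -\,\mathbb{E}_{\xi}\big[ \nabla f(y - \eta \xi) \big],\; x^* - y \big\rangle \;\ge\; c \, \| x^* - y \|^2 .
```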
The central theoretical result demonstrates that SGD will avoid sharp local minima, those whose basins have small diameter relative to the smoothing induced by the step size and gradient noise, as long as the averaged gradients in their vicinity still point toward the better solution. This theorem significantly extends the class of functions for which SGD is provably effective beyond those that are merely convex. The paper's empirical observations align with these claims, suggesting that the local surfaces of neural network loss functions exhibit the required one-point convexity.
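As a purely illustrative sketch, not code or an experiment from the paper, the 1-D toy below runs SGD on a hypothetical loss consisting of a wide basin with a sharp spurious dip inside it; the function, constants, and step sizes are all made up for illustration. With a large step size and noticeable gradient noise, the iterate leaves the sharp dip and settles around the wide basin's minimum, while a small, quiet run stays trapped.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D loss: f(x) = 0.5 * x**2 - 0.5 * exp(-50 * (x - 1.5)**2),
# a wide basin around the "good" solution x* = 0 plus a sharp spurious
# local minimum near x ~ 1.47.
def grad_f(x):
    return x + 50.0 * (x - 1.5) * np.exp(-50.0 * (x - 1.5) ** 2)

def run_sgd(x0, step, noise_std, iters=3000, tail=500):
    """SGD with additive zero-mean Gaussian gradient noise.
    Returns the average of the last `tail` iterates."""
    x, history = x0, []
    for _ in range(iters):
        g = grad_f(x) + noise_std * rng.normal()
        x -= step * g
        history.append(x)
    return float(np.mean(history[-tail:]))

# Start inside the sharp dip. The effective smoothing radius scales with
# step * noise_std: once it exceeds the dip's width, the dip is washed out.
print("large step, noisy :", run_sgd(x0=1.46, step=0.1, noise_std=2.0))    # drifts to the wide basin near 0
print("small step, quiet :", run_sgd(x0=1.46, step=0.01, noise_std=0.1))   # stays trapped near 1.47
```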
Implications and Applications
The practical implications of this work touch on the design and training of neural networks, where these findings can guide learning rate schedules that harness gradient noise beneficially. A large initial step size helps escape poor, sharp minima, while a smaller step size introduced later refines convergence to a flatter local minimum. Such schedules are commonplace when training modern architectures such as ResNet and DenseNet, aligning with the authors' insights about the influence of step size.
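As a minimal sketch of such a schedule (the model, milestones, and decay factor below are illustrative placeholders, not settings from the paper), PyTorch's built-in MultiStepLR expresses the large-then-small step size pattern:

```python
import torch

# Placeholder model and optimizer; only the step-size schedule is the point here.
model = torch.nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# Keep the large step size (0.1) early so gradient noise can smooth away sharp minima,
# then decay it by 10x at epochs 150 and 225 to refine convergence.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[150, 225], gamma=0.1)

for epoch in range(300):
    # ... one training epoch (forward pass, loss, backward pass, optimizer.step()) goes here ...
    scheduler.step()
```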
Theoretically, understanding SGD as gradient descent on a convolved, smoothed function opens the door to analyses that rely on weaker geometric assumptions than convexity, enabling more robust convergence guarantees and optimization strategies.
Specifically, the paper provides a main theorem and a corollary giving conditions under which SGD, once close to the optimal solution, remains close over subsequent iterations. This informs training strategies that emphasize dynamic adjustment of hyperparameters such as the learning rate.
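Informally, and omitting the paper's exact constants and probability arguments, the mechanism behind such a statement is a one-step contraction: if the one-point convexity condition above holds with constant c in a ball around x∗ and the stochastic gradient norm is bounded by some B, then, in the notation of the earlier sketch,

```latex
\mathbb{E}\big[ \| y_{t+1} - x^* \|^2 \mid y_t \big]
\;\le\; (1 - 2 \eta c)\, \| y_t - x^* \|^2 \;+\; \eta^2 B^2 ,
```

so the squared distance to x∗ shrinks geometrically until it reaches a noise floor on the order of ηB²/(2c), which is one way to see why shrinking the step size later in training tightens the final neighborhood.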
Future Directions
This inquiry lays the groundwork for further research into how the geometry of loss surfaces affects optimization dynamics. Subsequent work might probe the local and global geometry of loss landscapes more deeply, beyond neural network training, possibly through stochastic differential equations or other probabilistic modeling perspectives, narrowing the gap between empirical success and formal theoretical understanding in machine learning.
Studying adaptive optimization methods such as Adam or RMSProp within this framework could also clarify how they compare to SGD in escaping poor solutions. Additionally, exploring the connection between one-point convexity of the landscape and generalization performance remains a compelling avenue for future inquiry.
In conclusion, this paper contributes vital perspectives and results to the discourse on SGD's efficacy in training neural networks, offering insights that could improve both the theoretical understanding and practical performance of machine learning models.