
Gradient Descent Can Take Exponential Time to Escape Saddle Points (1705.10412v2)

Published 29 May 2017 in math.OC, cs.LG, and stat.ML

Abstract: Although gradient descent (GD) almost always escapes saddle points asymptotically [Lee et al., 2016], this paper shows that even with fairly natural random initialization schemes and non-pathological functions, GD can be significantly slowed down by saddle points, taking exponential time to escape. On the other hand, gradient descent with perturbations [Ge et al., 2015, Jin et al., 2017] is not slowed down by saddle points - it can find an approximate local minimizer in polynomial time. This result implies that GD is inherently slower than perturbed GD, and justifies the importance of adding perturbations for efficient non-convex optimization. While our focus is theoretical, we also present experiments that illustrate our theoretical findings.

Citations (244)

Summary

  • The paper demonstrates that standard GD can take exponential time (e^(Ω(d)) iterations) to escape multiple saddle points in a non-convex landscape.
  • It introduces perturbed GD, an enhanced method that escapes saddle points in polynomial time by adding controlled randomness.
  • The findings emphasize the necessity of perturbations for efficient optimization in machine learning and non-convex problems.

Overview of "Gradient Descent Can Take Exponential Time to Escape Saddle Points"

This paper explores a significant limitation of the gradient descent (GD) method in the context of non-convex optimization problems. The authors demonstrate that GD can take an exponential amount of time to escape saddle points under certain conditions, even with standard initialization techniques and non-pathological objective functions. In contrast, a modified version of GD, known as perturbed gradient descent (PGD), can overcome these limitations and efficiently find an approximate local minimizer in polynomial time. This work underscores the necessity of perturbations in achieving efficient optimization for non-convex problems and provides theoretical backing for the superiority of PGD over GD.

Key Findings

The paper builds on previous work demonstrating that GD with random initialization almost surely escapes saddle points asymptotically; however, that work provided no guarantee on the number of iterations required. The authors extend this understanding by showing that GD can be exceptionally slow around strict saddle points. Through a constructed counterexample, they prove that GD needs exponentially many steps to escape a sequence of d saddle points, while PGD requires only polynomially many. The construction designs a smooth function on ℝ^d whose GD trajectory passes through a chain of saddle points, with the escape time growing exponentially at each one.
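The mechanism can be seen on a two-dimensional toy saddle (an illustrative sketch, not the paper's actual d-dimensional construction): initialized exactly on the stable manifold of f(x, y) = x² − y², GD converges to the saddle and never escapes, while any offset in the unstable direction grows geometrically.

```python
import numpy as np

def grad(p):
    # Gradient of the toy saddle f(x, y) = x**2 - y**2 (saddle at the origin).
    x, y = p
    return np.array([2.0 * x, -2.0 * y])

def gd(p0, eta=0.1, steps=200):
    # Plain gradient descent with a fixed step size.
    p = np.array(p0, dtype=float)
    for _ in range(steps):
        p = p - eta * grad(p)
    return p

# Initialized exactly on the stable manifold (y = 0), the y-coordinate
# stays zero forever and GD converges to the saddle point.
stuck = gd([1.0, 0.0])

# An arbitrarily small offset in the unstable direction is amplified by a
# factor of (1 + 2 * eta) per step, so the iterate eventually escapes --
# but the smaller the offset, the longer escape takes, which is the
# slowdown the paper makes quantitative.
escaped = gd([1.0, 1e-8])
```

With step size 0.1, the stable coordinate contracts by a factor of 0.8 per step while the unstable one expands by 1.2, so the escape time scales with log(1/offset); the paper's construction chains d such saddles to force exponential total time.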

Two critical theoretical results are presented:

  1. Exponential Time of GD: On the proposed function, GD with reasonable random initialization takes exponential time to reach a local minimum, where the duration is characterized as e^(Ω(d)).
  2. Polynomial Time of PGD: PGD is not impeded by saddle points and requires only O(poly(d, 1/ε)) steps to find an approximate local minimizer.
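The perturbation idea behind the second result can be sketched as follows (in the spirit of Ge et al., 2015 and Jin et al., 2017, but not their exact algorithm, thresholds, or parameter choices): whenever the gradient is small, suggesting a possible saddle, inject a small random perturbation so the iterate leaves the stable manifold. The toy objective f(x, y) = x² + (y² − 1)²/4 used below is a hypothetical example with a strict saddle at the origin and minima at (0, ±1).

```python
import numpy as np

def perturbed_gd(grad, p0, eta=0.1, g_thresh=1e-3, radius=1e-2,
                 steps=500, seed=0):
    # Sketch of perturbed gradient descent: when the gradient norm is
    # small (a possible saddle), add a uniform perturbation from a small
    # box.  The actual algorithm uses carefully chosen thresholds, radii,
    # and a termination test; these hyperparameters are illustrative only.
    rng = np.random.default_rng(seed)
    p = np.array(p0, dtype=float)
    for _ in range(steps):
        if np.linalg.norm(grad(p)) < g_thresh:
            p = p + rng.uniform(-radius, radius, size=p.shape)
        p = p - eta * grad(p)
    return p

# Toy objective f(x, y) = x**2 + (y**2 - 1)**2 / 4: strict saddle at the
# origin, local minima at (0, +1) and (0, -1).
toy_grad = lambda p: np.array([2.0 * p[0], p[1] * (p[1] ** 2 - 1.0)])

# Started on the saddle's stable manifold, plain GD would converge to the
# origin; the perturbations push the iterate toward one of the minima.
p = perturbed_gd(toy_grad, [1.0, 0.0])
```

Note that this simplified rule also fires near a minimum (where the gradient is likewise small), so the iterate merely hovers within the perturbation radius of (0, ±1); the full algorithm avoids this with a separate termination condition.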

Implications

The paper’s findings have both theoretical and practical implications. Theoretically, it emphasizes the necessity of considering non-convex landscapes in optimization and suggests that adding randomness in the form of perturbations is crucial for efficient convergence. Practically, it justifies the integration of perturbations into GD-based algorithms, particularly when dealing with non-convex objectives in machine learning applications.

Future Directions

The authors suggest multiple avenues for future exploration:

  • Stochastic Gradient Descent (SGD): Investigating whether similar exponential time limitations apply to SGD and understanding the role of inherent noise in SGD for escaping saddle points.
  • Special Structures: Identifying classes of non-convex functions where GD can still perform efficiently, as GD is known to be effective for certain problems with special structures.
  • Empirical Evaluations: Extending empirical tests across various non-convex landscapes to validate the theoretical conclusions in more practical scenarios.

Conclusion

This work provides a compelling theoretical basis for the modification and enhancement of gradient-based optimization methods, particularly for non-convex problems in machine learning and other domains. It highlights the limitations of conventional GD and establishes the importance of incorporating perturbations to achieve effective and efficient optimization outcomes. The results encourage the exploration and implementation of advanced optimization techniques that better navigate the intricacies of non-convex problem environments.