Equilibrated adaptive learning rates for non-convex optimization (1502.04390v2)

Published 15 Feb 2015 in cs.LG and cs.NA

Abstract: Parameter-specific adaptive learning rate methods are computationally efficient ways to reduce the ill-conditioning problems encountered when training large deep networks. Following recent work that strongly suggests that most of the critical points encountered when training such networks are saddle points, we find how considering the presence of negative eigenvalues of the Hessian could help us design better suited adaptive learning rate schemes. We show that the popular Jacobi preconditioner has undesirable behavior in the presence of both positive and negative curvature, and present theoretical and empirical evidence that the so-called equilibration preconditioner is comparatively better suited to non-convex problems. We introduce a novel adaptive learning rate scheme, called ESGD, based on the equilibration preconditioner. Our experiments show that ESGD performs as well or better than RMSProp in terms of convergence speed, always clearly improving over plain stochastic gradient descent.

Equilibrated Adaptive Learning Rates for Non-Convex Optimization

The paper "Equilibrated Adaptive Learning Rates for Non-Convex Optimization" investigates the ill-conditioning issues inherent in large-scale non-convex optimization scenarios, such as those encountered in training deep neural networks. The research acknowledges the prevalence of saddle points rather than local minima in these landscapes, which presents unique challenges for optimization.

Key Contributions

The authors introduce a novel adaptive learning rate strategy, Equilibrated Stochastic Gradient Descent (ESGD), which uses the equilibration preconditioner to address this ill-conditioning. The traditional Jacobi preconditioner is argued to be inadequate because of its problematic handling of negative curvature, which is routinely present at the saddle points of non-convex functions. In contrast, the equilibration preconditioner is shown, both theoretically and empirically, to be better suited to non-convex problems because it reduces the condition number more consistently across indefinite matrices.
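
Concretely, writing H for the Hessian, the two diagonal preconditioners discussed in the paper and the resulting element-wise update can be summarized as follows (the damping constant λ is standard in such schemes; its exact placement below is a simplification rather than the paper's precise formulation):

```latex
D^{J}_{ii} = \lvert H_{ii} \rvert,
\qquad
D^{E}_{ii} = \lVert H_{i,\cdot} \rVert_2 = \sqrt{\sum\nolimits_j H_{ij}^2},
\qquad
\theta_i \leftarrow \theta_i \;-\; \eta\,\frac{[\nabla f(\theta)]_i}{D_{ii} + \lambda}.
```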

Theoretical Insights

The paper analyzes the spectral behavior of the two preconditioners and shows that the equilibration preconditioner handles indefinite curvature more effectively: its scaling depends only on the magnitude of curvature, acting much like an absolute-value adjustment of the Hessian's eigenvalues, whereas the Jacobi preconditioner can misrepresent curvature magnitude when positive and negative eigenvalue contributions partially cancel on the diagonal.
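
This "absolute value" interpretation can be made concrete with a short identity, stated here as a restatement of the paper's argument: for a symmetric Hessian with eigendecomposition H = QΛQᵀ, the equilibration scaling depends only on the magnitudes of the eigenvalues, not on their signs.

```latex
\bigl(D^{E}_{ii}\bigr)^2
  = \lVert H_{i,\cdot} \rVert_2^2
  = (H^2)_{ii}
  = \bigl(\lvert H \rvert^2\bigr)_{ii},
\qquad
\lvert H \rvert := Q\,\lvert \Lambda \rvert\,Q^{\top}
\quad \text{for } H = Q\,\Lambda\,Q^{\top}.
```

Because only the squared eigenvalues enter, equilibration is blind to the sign of curvature; the Jacobi entry H_{ii} = \sum_k \lambda_k q_{ik}^2, by contrast, can shrink toward zero when positive and negative terms cancel.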

Furthermore, the authors derive an upper bound on the condition number achieved by the equilibration preconditioner and argue that it improves on the corresponding behavior of the Jacobi preconditioner, underscoring its efficacy on poorly conditioned, non-convex problems.
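
As a quick numerical illustration of this comparison (a sketch, not an experiment from the paper), the snippet below draws a random symmetric indefinite matrix as a proxy for a Hessian near a saddle point and reports the 2-norm condition number after each diagonal preconditioning. The matrix size, random seed, and the D⁻¹H convention are arbitrary choices made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100

# Random symmetric indefinite matrix, a stand-in for a Hessian near a saddle point.
A = rng.standard_normal((n, n))
H = (A + A.T) / 2.0

# Jacobi preconditioner: absolute value of the Hessian diagonal.
d_jacobi = np.abs(np.diag(H))

# Equilibration preconditioner: Euclidean norm of each Hessian row.
d_equil = np.linalg.norm(H, axis=1)

# Compare 2-norm condition numbers of D^{-1} H (divide row i by D_ii).
print("no preconditioning :", np.linalg.cond(H))
print("Jacobi             :", np.linalg.cond(H / d_jacobi[:, None]))
print("equilibration      :", np.linalg.cond(H / d_equil[:, None]))

# Jacobi can blow up when a diagonal entry happens to be near zero, while
# equilibration rescales every row to unit norm regardless of curvature sign.
```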

Empirical Findings

Empirically, the paper presents strong results in favor of ESGD, which performs on par with or better than existing methods such as RMSProp, notably on deep autoencoder benchmarks with large numbers of parameters. The experiments show clearly faster convergence than plain stochastic gradient descent, underscoring the practical benefit of equilibrated learning rates in deep learning applications.

In particular, on the MNIST and CURVES deep autoencoder tasks, ESGD matched or outperformed both RMSProp and the Jacobi-preconditioned variant, supporting the theoretical argument that equilibration preconditioning is well suited to non-convex optimization.
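
For readers who want to see the mechanics, below is a minimal sketch of an equilibrated update loop in NumPy. It follows the structure described in the paper: the Hessian row norms are estimated through the identity ‖H_{i,·}‖² = E[(Hv)_i²] for v ~ N(0, I), with (Hv)² accumulated into a running average. The finite-difference Hessian-vector product, the toy objective, the step size, and the damping value are illustrative choices rather than the paper's settings (the paper computes Hv exactly with the R-operator and amortizes the estimate over iterations).

```python
import numpy as np

def esgd(grad_fn, theta0, lr=0.1, damping=1e-2, n_steps=500, fd_eps=1e-6, seed=0):
    """Sketch of equilibrated SGD: scale each gradient coordinate by an estimate
    of the corresponding Hessian row norm, D_i ~ sqrt(mean_t (H v_t)_i^2)."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float).copy()
    d_acc = np.zeros_like(theta)          # running sum of (Hv)^2
    for t in range(1, n_steps + 1):
        g = grad_fn(theta)
        v = rng.standard_normal(theta.shape)
        # Hessian-vector product via finite differences of the gradient
        # (the paper uses an exact R-operator instead).
        hv = (grad_fn(theta + fd_eps * v) - g) / fd_eps
        d_acc += hv ** 2
        d = np.sqrt(d_acc / t)            # estimated equilibration preconditioner
        theta -= lr * g / (d + damping)
    return theta

# Toy non-convex objective f(x) = sum_i (x_i^4 - x_i^2): negative curvature
# near the origin, minima at x_i = +/- 1/sqrt(2) ~ 0.707.
def toy_grad(x):
    return 4 * x**3 - 2 * x

print(esgd(toy_grad, [0.1, -0.2, 0.05]))  # coordinates settle near +/- 0.707
```

The same structure carries over to neural networks, where Hv is computed with automatic differentiation and, as in the paper's experiments, the equilibration estimate is refreshed only periodically to amortize its cost.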

Implications and Future Directions

The introduction of ESGD marks a step towards more robust optimization strategies for deep learning practitioners by offering a theoretically grounded alternative to established methods such as RMSProp. The research also suggests that part of what makes adaptive learning rates effective may be how closely their per-parameter scaling approximates the equilibration preconditioner in indefinite optimization landscapes.

Future research could explore further refinements to ESGD, its application to various neural architectures, and the potential integration of momentum or other enhancements. The interplay between equilibration and other adaptive techniques might offer richer insights, contributing to the development of more sophisticated optimization heuristics for deep learning.

The paper makes a significant theoretical and empirical contribution to the field, offering novel insights into the handling of non-convex optimization challenges, which are pivotal in advancing deep learning applications.

Authors (3)
  1. Yann N. Dauphin (18 papers)
  2. Harm de Vries (29 papers)
  3. Yoshua Bengio (601 papers)
Citations (373)