- The paper introduces a novel approach called Sub-sampled Cubic Regularization (SCR) that reduces computational cost in non-convex optimization.
- It approximates gradients and Hessians by sub-sampling, with sample sizes set through concentration inequalities so that the convergence guarantees of cubic regularization carry over to large-scale learning.
- Empirical results demonstrate significant speed-ups over classical methods, highlighting SCR's potential for scalable machine learning.
Sub-sampled Cubic Regularization for Non-convex Optimization
The paper "Sub-sampled Cubic Regularization for Non-convex Optimization" by Jonas Moritz Kohler and Aurelien Lucchi presents a novel approach to reduce the computational complexity associated with cubic regularization methods for non-convex optimization problems, which frequently arise in machine learning contexts. The authors address the well-known issue of high computational cost in large-scale learning while maintaining strong convergence guarantees intrinsic to cubic regularization techniques.
Problem Context and Background
Non-convex optimization is challenging because of the prevalence of saddle points and local minima that are not global optima. Many machine learning models, deep neural networks in particular, give rise to such non-convex landscapes. First-order methods such as Stochastic Gradient Descent (SGD) come with well-understood guarantees on convex functions, but in non-convex settings they can slow down near saddle points and offer much weaker guarantees.
Cubic regularization methods can escape strict saddle points and converge to second-order critical points, which gives them stronger convergence guarantees than standard first- and second-order methods. Rather than relying on a purely quadratic local model, each iteration minimizes the second-order Taylor expansion of the objective augmented with a cubic penalty on the step length; this penalty controls how far the quadratic approximation is trusted and is what allows the method to cope with non-convex curvature, as sketched below.
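The paper builds on the adaptive cubic regularization (ARC) framework, in which each step minimizes a local model of roughly the following form; here H_k denotes the (possibly approximate) Hessian and sigma_k an adaptively updated penalty weight. The notation follows the standard ARC literature rather than quoting the paper verbatim.

```latex
% Cubic-regularized local model at iterate x_k (standard ARC form):
m_k(s) = f(x_k) + \nabla f(x_k)^{\top} s
       + \tfrac{1}{2}\, s^{\top} H_k\, s
       + \tfrac{\sigma_k}{3}\, \lVert s \rVert^{3},
\qquad
x_{k+1} = x_k + \operatorname*{arg\,min}_{s} \; m_k(s).
```

The cubic term penalizes long steps in directions where the quadratic model may be unreliable, which is what permits guaranteed progress even around saddle points.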
Contributions and Methodology
The authors introduce Sub-sampled Cubic Regularization (SCR) to mitigate the computational overhead inherent in cubic regularization. The approach approximates the gradient and the Hessian from random sub-samples of the data, avoiding the cost of full gradient and Hessian evaluations at every iteration. Accuracy is retained through concentration inequalities that bound, with high probability, the deviation of the sub-sampled quantities from their exact counterparts and thereby dictate how large the sample sizes must be; a minimal illustration follows below.
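As a purely illustrative sketch of the sub-sampling step (not the authors' code), the gradient and Hessian of a finite-sum objective f(w) = (1/n) * sum_i f_i(w) can be estimated from uniformly drawn index sets. The function names grad_fn/hess_fn and the fixed sample sizes are hypothetical placeholders; in the paper, the sample sizes follow from the concentration bounds and are adapted over the course of the iterations.

```python
import numpy as np

def subsampled_gradient_hessian(grad_fn, hess_fn, X, y, w, g_size, h_size, seed=None):
    """Estimate the gradient and Hessian of a finite-sum objective
    f(w) = (1/n) * sum_i f_i(w) by averaging over uniform sub-samples.

    grad_fn(X_batch, y_batch, w) and hess_fn(X_batch, y_batch, w) are assumed
    to return the batch-averaged gradient (shape (d,)) and Hessian (shape (d, d)).
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    g_idx = rng.choice(n, size=min(g_size, n), replace=False)  # indices for the gradient sample
    h_idx = rng.choice(n, size=min(h_size, n), replace=False)  # indices for the Hessian sample
    g = grad_fn(X[g_idx], y[g_idx], w)   # sub-sampled gradient estimate
    H = hess_fn(X[h_idx], y[h_idx], w)   # sub-sampled Hessian estimate
    return g, H

# Example usage with a least-squares loss (illustrative only):
# g, H = subsampled_gradient_hessian(
#     lambda Xb, yb, w: Xb.T @ (Xb @ w - yb) / len(yb),   # batch-averaged gradient
#     lambda Xb, yb, w: Xb.T @ Xb / len(yb),               # batch-averaged Hessian
#     X, y, w, g_size=256, h_size=64)
```

The resulting estimates g and H would then stand in for the exact gradient and Hessian inside the cubic model above.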
The authors position SCR as the first method in the literature to retain the global convergence guarantees of cubic regularization while relying on sub-sampled gradient and Hessian information for non-convex functions, a notable step for optimizing large-scale machine learning models. They supply both theoretical analysis and empirical evidence to support the viability of their approach.
Experimental Validation
The experimental validation illustrates the efficacy of SCR in reducing computation time while maintaining convergence properties. Empirical results reveal significant speed-ups over classical methods for both convex and non-convex objectives across multiple datasets. These results demonstrate SCR's potential as a powerful tool in practical machine learning scenarios, where scalability and computational efficiency are paramount.
Implications and Future Directions
The implications of this research are substantial for the development of optimization techniques in AI. By reducing computational overhead while securing convergence guarantees, SCR facilitates the application of complex models to large datasets, an ever-growing necessity in AI research. This development could influence fields that rely heavily on machine learning models, such as computer vision, natural language processing, and autonomous systems.
Future work could explore applying SCR to other machine learning settings, particularly neural network training, where escaping saddle points is crucial for effective learning. This opens avenues for further efficiency improvements and for adapting the method to different architectures and problem scales.
In conclusion, this paper sets a successful precedent in overcoming computational barriers in non-convex optimization, providing ample opportunity for further exploration and application in various machine learning domains, especially those characterized by large and complex datasets.