A Linearly-Convergent Stochastic L-BFGS Algorithm (1508.02087v2)

Published 9 Aug 2015 in math.OC, cs.LG, math.NA, stat.CO, and stat.ML

Abstract: We propose a new stochastic L-BFGS algorithm and prove a linear convergence rate for strongly convex and smooth functions. Our algorithm draws heavily from a recent stochastic variant of L-BFGS proposed in Byrd et al. (2014) as well as a recent approach to variance reduction for stochastic gradient descent from Johnson and Zhang (2013). We demonstrate experimentally that our algorithm performs well on large-scale convex and non-convex optimization problems, exhibiting linear convergence and rapidly solving the optimization problems to high levels of precision. Furthermore, we show that our algorithm performs well for a wide range of step sizes, often differing by several orders of magnitude.

Citations (222)

Summary

  • The paper introduces a new stochastic L-BFGS algorithm incorporating variance reduction, proving its linear convergence for strongly convex and smooth functions.
  • Unlike many stochastic methods, the algorithm achieves linear convergence without requiring a diminishing step size and is robust to varying step size choices.
  • Numerical experiments show the algorithm outperforms existing methods like SVRG and SGD on standard machine learning benchmarks, demonstrating its practical value for large-scale problems.

An Overview of "A Linearly-Convergent Stochastic L-BFGS Algorithm"

Philipp Moritz, Robert Nishihara, and Michael I. Jordan introduce a new stochastic variant of the L-BFGS optimization algorithm, building upon existing methods in stochastic optimization. The authors provide a rigorous analysis of their algorithm, proving its linear convergence rate for strongly convex and smooth functions. This development is significant for large-scale optimization problems frequently encountered in machine learning.

Key Contributions

  1. Algorithm Design: The paper proposes a stochastic version of the L-BFGS algorithm, incorporating variance reduction methods from stochastic gradient descent. This variant is designed to maintain the linear convergence properties of traditional L-BFGS while handling large datasets and high-dimensional parameter spaces; a minimal sketch of the resulting scheme appears after this list.
  2. Convergence Analysis: The authors prove that the proposed algorithm achieves a linear rate of convergence under the assumptions of strong convexity and smoothness. Notably, the algorithm does not require a diminishing step size to guarantee convergence, which differentiates it from many other stochastic methods.
  3. Practical Performance: Experimentally, the proposed algorithm demonstrates high performance on large-scale optimization tasks, including both convex and non-convex problems. The ability to rapidly reach solutions with high precision across a wide range of step sizes is highlighted as a critical advantage.
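
To make the combination described in items 1 and 2 concrete, here is a minimal NumPy sketch of a variance-reduced stochastic L-BFGS loop of the kind the paper studies: SVRG-style corrected gradients in the inner loop, and curvature pairs built every few iterations from averaged iterates and a subsampled Hessian-vector product. The function names, parameter names, and default values (`grad_i`, `hess_vec_batch`, `full_grad`, `batch`, `batch_H`, `L`, `M`, and the step size `eta`) are illustrative assumptions, not the authors' reference implementation.

```python
import numpy as np

def two_loop_recursion(v, s_list, y_list):
    """Apply the L-BFGS inverse-Hessian approximation (stored as curvature
    pairs) to the vector v via the standard two-loop recursion."""
    q = v.copy()
    alphas = []
    for s, y in zip(reversed(s_list), reversed(y_list)):
        alpha = s.dot(q) / y.dot(s)
        alphas.append(alpha)
        q -= alpha * y
    s, y = s_list[-1], y_list[-1]
    q *= s.dot(y) / y.dot(y)          # initial scaling H0 = (s'y / y'y) I
    for (s, y), alpha in zip(zip(s_list, y_list), reversed(alphas)):
        beta = y.dot(q) / y.dot(s)
        q += (alpha - beta) * s
    return q

def stochastic_lbfgs(grad_i, hess_vec_batch, full_grad, w0, n, eta=0.01,
                     m_inner=100, batch=10, batch_H=50, L=10, M=10, epochs=20):
    """Sketch of a variance-reduced stochastic L-BFGS method.

    grad_i(w, idx):            average gradient of the components in idx at w
    hess_vec_batch(w, idx, v): subsampled Hessian-vector product at w
    full_grad(w):              full gradient over all n components
    """
    rng = np.random.default_rng(0)
    w_tilde = w0.copy()
    s_list, y_list = [], []
    u_prev, x_sum, r = w0.copy(), np.zeros_like(w0), 0
    for _ in range(epochs):
        mu = full_grad(w_tilde)                    # snapshot gradient (SVRG-style)
        x = w_tilde.copy()
        for t in range(m_inner):
            S = rng.choice(n, size=batch, replace=False)
            # Variance-reduced stochastic gradient.
            v = grad_i(x, S) - grad_i(w_tilde, S) + mu
            d = two_loop_recursion(v, s_list, y_list) if s_list else v
            x -= eta * d
            x_sum += x
            if (t + 1) % L == 0:                   # refresh curvature information
                u, x_sum = x_sum / L, np.zeros_like(x)
                if r > 0:
                    s = u - u_prev
                    T = rng.choice(n, size=batch_H, replace=False)
                    y = hess_vec_batch(u, T, s)    # subsampled Hessian-vector product
                    s_list.append(s)
                    y_list.append(y)
                    if len(s_list) > M:            # keep a limited memory of M pairs
                        s_list.pop(0)
                        y_list.pop(0)
                u_prev, r = u, r + 1
        w_tilde = x
    return w_tilde
```

The main departure from plain SVRG is the search direction `d`: rather than stepping along the corrected gradient itself, the sketch applies an L-BFGS inverse-Hessian approximation whose curvature pairs are refreshed only every `L` inner steps from a Hessian-vector product on a separate subsample, keeping the per-iteration cost close to that of a first-order method.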

Numerical Results and Claims

The paper provides numerical experiments on standard machine learning benchmarks such as ridge regression, support vector machines, and matrix completion. The results indicate that the stochastic L-BFGS variant outperforms existing methods such as SVRG, SQN, and traditional SGD in terms of optimization error reduction per data pass. The reported robustness to the choice of step size further supports the algorithm's practicality for real-world applications.
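
As a purely illustrative example of the finite-sum structure these benchmarks share, a ridge-regression problem can be written as an average of component functions plus a regularizer. The helpers below use synthetic data and the hypothetical signatures from the sketch above; they are not the paper's experimental code.

```python
import numpy as np

# Ridge regression as a finite sum:
#   f(w) = (1/n) * sum_i 0.5 * (a_i^T w - b_i)^2  +  0.5 * lam * ||w||^2
rng = np.random.default_rng(1)
n, d, lam = 5000, 50, 1e-3
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

def grad_i(w, idx):
    """Average gradient of the sampled components at w (regularizer included)."""
    Ai, bi = A[idx], b[idx]
    return Ai.T @ (Ai @ w - bi) / len(idx) + lam * w

def full_grad(w):
    return A.T @ (A @ w - b) / n + lam * w

def hess_vec_batch(w, idx, v):
    """Subsampled Hessian-vector product: (A_S^T A_S / |S| + lam * I) v.
    For this quadratic objective the Hessian does not depend on w."""
    Ai = A[idx]
    return Ai.T @ (Ai @ v) / len(idx) + lam * v

# Run the sketch from the previous section on this problem.
w_hat = stochastic_lbfgs(grad_i, hess_vec_batch, full_grad, np.zeros(d), n)
```

Because each inner step touches only a mini-batch of rows while the snapshot gradient is recomputed once per outer pass, progress is naturally measured per pass over the data, which is the metric the comparison above refers to.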

Implications

The proposed algorithm has significant implications for both the theory and practice of optimization in machine learning:

  • Theoretical Implications: The paper contributes to the body of work on stochastic optimization by demonstrating that quasi-Newton methods, particularly L-BFGS, can achieve linear convergence rates similar to deterministic settings, bridging a gap between the stochastic and deterministic optimization literature; the schematic form of this guarantee is shown after this list.
  • Practical Implications: Given its ability to handle large datasets efficiently, the authors' algorithm is well-suited for modern machine learning problems, where both data size and model complexity are increasing rapidly. The robustness to different step sizes simplifies the tuning process, which is a substantial practical benefit.
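
To make the theoretical point above concrete, a linear-convergence guarantee of this kind is, schematically, of the following form. This is a sketch of what such a bound looks like, with constants simplified, not the paper's exact statement.

```latex
\[
  \mathbb{E}\bigl[f(w_k) - f(w_\ast)\bigr]
    \;\le\; \rho^{\,k}\,\bigl(f(w_0) - f(w_\ast)\bigr),
  \qquad 0 < \rho < 1,
\]
```

Here w_* is the minimizer, w_k is the iterate after the k-th outer (full-gradient) pass, and the contraction factor rho depends on the fixed step size, the inner-loop length, the strong-convexity and smoothness constants, and the eigenvalue bounds on the L-BFGS inverse-Hessian approximations; notably, a constant step size suffices, in contrast to plain SGD.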

Future Directions

Potential future research directions include exploring modifications that may yield superlinear convergence rates, akin to certain deterministic quasi-Newton methods. Additionally, further work could investigate reducing the scaling constants associated with the algorithm's convergence bounds, especially in high-dimensional scenarios. Extending the method for more general classes of non-convex problems is another avenue that could broaden its applicability.

This paper contributes a noteworthy addition to stochastic optimization techniques, offering both a theoretically sound and practically valuable tool for large-scale machine learning applications.