A Stochastic Quasi-Newton Method for Large-Scale Optimization (1401.7020v2)

Published 27 Jan 2014 in math.OC, cs.LG, and stat.ML

Abstract: The question of how to incorporate curvature information in stochastic approximation methods is challenging. The direct application of classical quasi-Newton updating techniques for deterministic optimization leads to noisy curvature estimates that have harmful effects on the robustness of the iteration. In this paper, we propose a stochastic quasi-Newton method that is efficient, robust and scalable. It employs the classical BFGS update formula in its limited memory form, and is based on the observation that it is beneficial to collect curvature information pointwise, and at regular intervals, through (sub-sampled) Hessian-vector products. This technique differs from the classical approach that would compute differences of gradients, and where controlling the quality of the curvature estimates can be difficult. We present numerical results on problems arising in machine learning that suggest that the proposed method shows much promise.

Citations (458)

Summary

  • The paper presents a novel stochastic quasi-Newton method that intermittently collects curvature information using sub-sampled Hessian-vector products.
  • It balances computational cost and convergence speed by updating curvature data periodically, outperforming standard SGD in large-scale settings.
  • Empirical tests and theoretical analysis confirm the method's efficiency and robustness, especially in ill-conditioned optimization scenarios.

Overview of "A Stochastic Quasi-Newton Method for Large-Scale Optimization" by R. H. Byrd, S.L. Hansen, J. Nocedal, and Y. Singer

The paper introduces a stochastic quasi-Newton method tailored to large-scale optimization problems common in machine learning, where datasets are immense and continually expanding. These circumstances call for optimization algorithms that operate in the stochastic regime, updating the model from only a small sample of the data at each iteration.

The authors tackle the noise and instability issues that arise when classical quasi-Newton methods such as BFGS are extended to the stochastic setting. Their methodology collects curvature information intermittently, rather than at every iteration, by leveraging sub-sampled Hessian-vector products. This avoids the noise amplification that occurs when differencing stochastic gradients, a departure from the classical approach used in deterministic optimization.
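To make the distinction concrete, here is a minimal sketch, assuming a binary logistic-regression objective; the function names (logistic_hvp, curvature_pair) and the NumPy implementation are illustrative choices, not code from the paper. It forms a curvature pair from a sub-sampled Hessian-vector product applied to a difference of averaged iterates, instead of differencing noisy stochastic gradients.

```python
import numpy as np

def logistic_hvp(w, v, X_sub):
    """Hessian-vector product of the logistic loss on a subsample:
    H v = X^T diag(p*(1-p)) X v / |subsample|, formed without ever
    building the Hessian explicitly."""
    p = 1.0 / (1.0 + np.exp(-(X_sub @ w)))
    return X_sub.T @ ((p * (1.0 - p)) * (X_sub @ v)) / X_sub.shape[0]

def curvature_pair(w_bar_new, w_bar_old, X_H):
    """Curvature pair in the spirit of the paper: s is a difference of
    averaged iterates, y is a sub-sampled Hessian-vector product applied
    to s -- not a difference of noisy stochastic gradients."""
    s = w_bar_new - w_bar_old
    y = logistic_hvp(w_bar_new, s, X_H)
    return s, y
```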

Methodological Contributions

  1. Curvature Information Collection: The authors propose a mechanism for gathering second-order information through limited-memory BFGS updates performed at regular intervals, using sub-sampled Hessian-vector products. The approach is distinguished by its stability and efficiency, sidestepping the difficulties of incorporating curvature information through direct differences of stochastic gradients.
  2. Algorithm Framework: The stochastic quasi-Newton method updates its Hessian approximation only periodically. Each update is based on averages of past iterates, so the curvature estimates draw on effectively larger samples of data than the small batches used in individual SGD steps (see the sketch after this list).
  3. Scalability and Efficiency: By updating curvature information less frequently and controlling the size of the subsamples used for Hessian-vector products, the computational overhead is kept small. The paper strikes a careful balance between gradient and Hessian sample sizes so that the method remains computationally competitive with plain SGD, whose per-iteration cost scales only with the gradient batch size.
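Bringing these pieces together, a minimal end-to-end sketch of the iteration might look as follows. The loop structure (SGD-like steps preconditioned by an L-BFGS two-loop recursion, with curvature pairs refreshed every L iterations from sub-sampled Hessian-vector products at averaged iterates) follows the paper's description, but the function and parameter names (sqn, two_loop, grad_fn, hvp_fn, b, b_H, L, M, beta) and all default values are illustrative assumptions; grad_fn and hvp_fn are user-supplied callbacks, e.g. the logistic_hvp sketch above.

```python
import numpy as np

def two_loop(grad, s_list, y_list):
    """Standard L-BFGS two-loop recursion: approximates H_t @ grad from the
    stored (s, y) curvature pairs."""
    if not s_list:                          # no curvature pairs yet: plain gradient
        return grad
    q = grad.copy()
    alphas = []
    for s, y in zip(reversed(s_list), reversed(y_list)):
        a = (s @ q) / (y @ s)
        alphas.append(a)
        q -= a * y
    s, y = s_list[-1], y_list[-1]
    q *= (s @ y) / (y @ y)                  # initial scaling H_0 = gamma * I
    for (s, y), a in zip(zip(s_list, y_list), reversed(alphas)):
        b = (y @ q) / (y @ s)
        q += (a - b) * s
    return q

def sqn(grad_fn, hvp_fn, w0, X, y, n_iters=1000, b=64, b_H=512,
        L=20, M=10, beta=2.0, seed=0):
    """Sketch of the overall iteration: SGD-like steps preconditioned by an
    L-BFGS matrix whose curvature pairs are refreshed every L iterations
    from sub-sampled Hessian-vector products at averaged iterates."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    w = w0.copy()
    s_list, y_list = [], []
    w_bar = np.zeros_like(w0)
    w_bar_old = None
    for k in range(1, n_iters + 1):
        idx = rng.choice(n, size=b, replace=False)
        g = grad_fn(w, X[idx], y[idx])              # stochastic gradient
        alpha = beta / k                            # diminishing step size
        w -= alpha * two_loop(g, s_list, y_list)    # quasi-Newton step
        w_bar += w
        if k % L == 0:                              # periodic curvature update
            w_bar /= L
            if w_bar_old is not None:
                idx_H = rng.choice(n, size=b_H, replace=False)
                s = w_bar - w_bar_old
                y_vec = hvp_fn(w_bar, s, X[idx_H])  # sub-sampled HVP
                if s @ y_vec > 1e-10:               # keep the BFGS update well defined
                    s_list.append(s)
                    y_list.append(y_vec)
                    if len(s_list) > M:             # limited memory
                        s_list.pop(0)
                        y_list.pop(0)
            w_bar_old = w_bar
            w_bar = np.zeros_like(w0)
    return w
```

In this sketch the per-iteration cost is one mini-batch gradient plus an O(M·d) two-loop recursion, while the cost of each sub-sampled Hessian-vector product is amortized over L iterations; this is the gradient-versus-Hessian sample-size trade-off referred to in item 3 above.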

Empirical Evaluation

The authors present comprehensive numerical experiments on several large-scale machine learning problems, including synthetic datasets and real-world datasets such as RCV1 and speech recognition tasks. Their findings show notable improvements over classical stochastic gradient descent, with the stochastic quasi-Newton method reaching lower objective values more rapidly. In ill-conditioned scenarios in particular, the method proves robust, exploiting curvature information to produce better-scaled steps.

Theoretical Insights

The paper contributes to the theoretical landscape by proving convergence under strong convexity and boundedness assumptions on the eigenvalues of the Hessian approximations. With an appropriately chosen diminishing step-size sequence, the expected optimality gap decreases at a sublinear rate comparable to that of standard SGD; the method's practical advantage stems from better-conditioned search directions rather than an improved asymptotic rate.
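For orientation, the flavor of such a guarantee is sketched below; the notation and the constant Q(β) are illustrative paraphrase, not quoted from the paper.

```latex
% Sketch of the form of the guarantee (notation illustrative):
% F strongly convex, eigenvalues of the quasi-Newton matrices H_k
% bounded above and below, step sizes alpha_k = beta / k with beta
% sufficiently large.
\[
  \mathbb{E}\!\left[ F(w^{k}) - F(w^{\ast}) \right] \;\le\; \frac{Q(\beta)}{k},
\]
% i.e. a sublinear O(1/k) decrease of the expected optimality gap,
% with the constant Q(beta) depending on the strong-convexity and
% eigenvalue bounds.
```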

Implications and Future Perspectives

The implications of this method are multifaceted. Practically, it offers a more efficient avenue for training complex models in streaming-data environments, which is crucial for application areas such as online advertising and sensor networks where data arrive continually and in abundance. Theoretically, it lays groundwork for further exploration of hybrid optimization methods that blend stochastic and deterministic strategies without the drawbacks that traditional quasi-Newton adaptations suffer in stochastic settings.

Looking forward, potential extensions include treating non-convex problems more explicitly, balancing gradient and Hessian sample sizes optimally across application contexts, and integrating parallel computation strategies to reduce wall-clock training time.

The originality and applicability of the stochastic quasi-Newton method presented in this paper make it a significant addition to the computational toolkit for large-scale machine learning optimization, with implications for both academic research and applied machine learning systems.