L-BFGS: Limited Memory Quasi-Newton Method
- L-BFGS is a quasi-Newton optimization algorithm that approximates the inverse Hessian using a limited set of correction pairs for efficient large-scale minimization.
- It employs a two-loop recursion to compute search directions and integrates extensions like regularization, nonmonotone strategies, and step-size learning to improve robustness.
- Advanced adaptations, including stochastic variants and structured compact representations, enhance its practical applicability in scientific computing and deep learning.
The Limited-Memory Broyden–Fletcher–Goldfarb–Shanno (L-BFGS) algorithm is a class of quasi-Newton optimization methods designed for solving large-scale unconstrained minimization problems where the objective function is at least continuously differentiable. Unlike full-memory quasi-Newton methods, L-BFGS avoids the explicit formation and storage of approximations of the Hessian or its inverse by maintaining only a limited set of correction pairs, resulting in reduced computational and memory overhead. Over time, derivatives and enhancements—including regularization, stochastic variants, step-size learning, dense initialization, and structured curvature—have substantially extended the practical reach and robustness of L-BFGS.
1. Algorithmic Foundations and Two-Loop Recursion
L-BFGS iteratively approximates the inverse Hessian matrix to compute search directions for unconstrained minimization of $f$. At iteration $k$, given the iterate $x_k$ and gradient $g_k = \nabla f(x_k)$, the algorithm computes a step $x_{k+1} = x_k + \alpha_k d_k$ with $d_k = -H_k g_k$, where $H_k$ is an implicit inverse Hessian approximation. The update is based on storing the $m$ most recent correction pairs $(s_i, y_i)$, where $s_i = x_{i+1} - x_i$ and $y_i = \nabla f(x_{i+1}) - \nabla f(x_i)$.
The search direction employs the "two-loop recursion," which applies a series of scalar-vector operations to a working vector $q$:
- Set $q = \nabla f(x_k)$.
- For $i = k-1, \dots, k-m$: compute $\alpha_i = \rho_i\, s_i^\top q$ with $\rho_i = 1/(y_i^\top s_i)$, and update $q \leftarrow q - \alpha_i y_i$.
- Initialize $r$ with $r = H_k^0 q$, where $H_k^0 = \gamma_k I$ and $\gamma_k = s_{k-1}^\top y_{k-1} / (y_{k-1}^\top y_{k-1})$.
- For $i = k-m, \dots, k-1$: compute $\beta_i = \rho_i\, y_i^\top r$ and update $r \leftarrow r + s_i(\alpha_i - \beta_i)$.
- The search direction is then $d_k = -r$.
This approach achieves a per-iteration computational cost of $O(mn)$ and storage of $O(mn)$ (with $m \ll n$), making it efficient for high-dimensional problems.
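For concreteness, the following NumPy sketch implements the two-loop recursion exactly as listed above; the container names (`s_list`, `y_list`, oldest pair first) are illustrative rather than taken from any particular reference implementation, and at least one stored pair is assumed.

```python
import numpy as np

def lbfgs_direction(grad, s_list, y_list):
    """Two-loop recursion: return d_k = -H_k * grad using the m most recent
    correction pairs (s_i, y_i), stored oldest-first in s_list / y_list."""
    q = grad.copy()
    rhos = [1.0 / np.dot(y, s) for s, y in zip(s_list, y_list)]
    alphas = []

    # First loop: newest pair to oldest.
    for s, y, rho in zip(reversed(s_list), reversed(y_list), reversed(rhos)):
        alpha = rho * np.dot(s, q)
        q -= alpha * y
        alphas.append(alpha)

    # Initial scaling H_k^0 = gamma_k * I from the most recent pair.
    s_last, y_last = s_list[-1], y_list[-1]
    gamma = np.dot(s_last, y_last) / np.dot(y_last, y_last)
    r = gamma * q

    # Second loop: oldest pair to newest (reversed(alphas) restores that order).
    for (s, y, rho), alpha in zip(zip(s_list, y_list, rhos), reversed(alphas)):
        beta = rho * np.dot(y, r)
        r += s * (alpha - beta)

    return -r  # search direction d_k
```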
2. Globalization, Regularization, and Extensions
Classical Globalization
To ensure convergence, classical L-BFGS applies a line search that selects $\alpha_k$ satisfying the Wolfe or strong Wolfe conditions (a simple backtracking check is sketched after this list):
- Armijo (sufficient decrease): $f(x_k + \alpha_k d_k) \le f(x_k) + c_1 \alpha_k \nabla f(x_k)^\top d_k$
- Curvature: $\nabla f(x_k + \alpha_k d_k)^\top d_k \ge c_2 \nabla f(x_k)^\top d_k$, for constants $0 < c_1 < c_2 < 1$
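A minimal sketch of checking these conditions and of a plain Armijo backtracking search; the constants $c_1$, $c_2$, and the shrink factor are conventional defaults rather than values prescribed by the cited works.

```python
import numpy as np

def wolfe_conditions_hold(f, grad_f, x, d, alpha, c1=1e-4, c2=0.9):
    """Check the (weak) Wolfe conditions for step size alpha along direction d."""
    fx, gx = f(x), grad_f(x)
    slope = np.dot(gx, d)  # directional derivative; should be negative for descent
    armijo = f(x + alpha * d) <= fx + c1 * alpha * slope
    curvature = np.dot(grad_f(x + alpha * d), d) >= c2 * slope
    return armijo and curvature

def backtracking_armijo(f, grad_f, x, d, alpha0=1.0, c1=1e-4, shrink=0.5, max_iter=50):
    """Simple backtracking line search enforcing only the Armijo condition."""
    fx, slope = f(x), np.dot(grad_f(x), d)
    alpha = alpha0
    for _ in range(max_iter):
        if f(x + alpha * d) <= fx + c1 * alpha * slope:
            break
        alpha *= shrink
    return alpha
```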
Regularization
A regularized variant modifies the Hessian approximation as $B_k + \mu_k I$ for a regularization parameter $\mu_k > 0$, correspondingly adjusting the initial diagonal scaling that seeds the two-loop recursion. A trust-region-style ratio
$$\rho_k = \frac{f(x_k) - f(x_k + d_k)}{f(x_k) - m_k(d_k)}$$
of actual to predicted reduction under the local quadratic model $m_k$ guides adaptive control of $\mu_k$, avoiding costly or unstable line searches. Explicit regularization is shown to achieve global convergence under standard assumptions (Tankaria et al., 2021).
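The following sketch illustrates how such a ratio can drive the regularization parameter; the thresholds and update factors (`eta1`, `eta2`, `shrink`, `grow`, `mu_min`) are placeholders, not the specific values of (Tankaria et al., 2021).

```python
def update_regularization(ratio, mu, mu_min=1e-8, eta1=0.25, eta2=0.75,
                          shrink=0.5, grow=2.0):
    """Adjust the regularization parameter mu from a trust-region-style ratio
    of actual to predicted reduction. Thresholds and factors are illustrative."""
    if ratio >= eta2:      # model agrees well with f: relax regularization
        mu = max(mu_min, shrink * mu)
    elif ratio < eta1:     # poor agreement: regularize more strongly
        mu = grow * mu
    return mu
```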
Nonmonotone and Hybrid Strategies
Incorporating nonmonotone ratios, in which $f(x_k)$ in the numerator of the ratio is replaced by the maximum objective value over a window of recent iterates, $\max_{0 \le j \le M} f(x_{k-j})$, permits occasional increases in the objective to reduce over-regularization. Hybrid strategies apply the strong Wolfe line search when regularization is minimal and the curvature condition fails.
Parameter Selection
Good practical performance is reported with a small memory size (e.g., $m = 7$), together with appropriately chosen regularization update factors, ratio thresholds, and a minimum regularization level (Tankaria et al., 2021).
3. Adaptations to Stochastic and Large-Scale Environments
Online/Stochastic L-BFGS
In stochastic settings, online L-BFGS computes gradients on random mini-batches and forms correction pairs only from gradients evaluated on the same mini-batch. Safe-update rules enforce a curvature threshold, e.g., $s_k^\top y_k \ge \epsilon \|s_k\|^2$ for some $\epsilon > 0$, to ensure stability. The two-loop recursion and step-size selection (often via Armijo backtracking) extend directly to the stochastic context.
Practical implementations store queues of curvature pairs and adapt step sizes using running estimates of the gradient mean and variance (Welford's method), maintaining $O(mn)$ cost; a sketch of the pair-update logic is given below. Empirically, oL-BFGS achieves faster convergence and a lower memory footprint than full quasi-Newton methods or RES on problems such as SVMs and logistic regression for click-through-rate estimation (Mokhtari et al., 2014, Yatawatta et al., 2019).
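A minimal sketch of the curvature-pair bookkeeping in the online setting: both gradients of a pair are taken on the same mini-batch, and a safe-update rule with an assumed threshold `eps` discards pairs with insufficient curvature.

```python
from collections import deque

import numpy as np

def update_curvature_pairs(pairs, x_old, x_new, batch_grad, m=10, eps=1e-10):
    """Append (s, y) computed on a single fixed mini-batch; keep at most m pairs.

    `batch_grad(x)` must evaluate the gradient on the *same* mini-batch for both
    x_old and x_new so that the secant condition remains meaningful.
    """
    s = x_new - x_old
    y = batch_grad(x_new) - batch_grad(x_old)
    # Safe-update rule: require sufficient positive curvature before storing.
    if np.dot(s, y) > eps * np.dot(s, s):
        pairs.append((s, y))
        while len(pairs) > m:
            pairs.popleft()  # drop the oldest pair
    return pairs

# Usage: pairs = deque(); update_curvature_pairs(pairs, x_old, x_new, batch_grad)
```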
Application in Radio Interferometric Calibration and Deep Learning
Stochastic L-BFGS enables batchwise calibration of very large datasets (e.g., raw interferometric data), bypassing the need for data reduction via averaging. In deep learning, stochastic L-BFGS (with activations providing suitable curvature, e.g., ELU) achieves validation accuracy comparable to first-order methods, though wall-clock efficiency may still favor SGD and Adam (Yatawatta et al., 2019).
4. Enhancements: Dense Initialization, Compact Representations, and Step-Size Learning
Dense Diagonal Initialization
Splitting $\mathbb{R}^n$ into the subspace generated by the quasi-Newton updates and its orthogonal complement, via eigendecomposition of the compact update matrices, allows distinct spectral estimates to be assigned to each subspace. Trust-region methods using a shape-changing norm together with such dense initialization outperform standard diagonal approaches and hybrid methods on nonconvex unconstrained problems (Brust et al., 2017).
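Schematically, such a dense initialization takes the form below, where $P_{\parallel}$ spans the subspace generated by the stored quasi-Newton updates, $P_{\perp}$ spans its orthogonal complement, and $\gamma_k, \gamma_k^{\perp}$ are the two spectral estimates; the notation is generic rather than that of (Brust et al., 2017):
$$B_0^{(k)} = \gamma_k\, P_{\parallel} P_{\parallel}^{\top} + \gamma_k^{\perp}\, P_{\perp} P_{\perp}^{\top}.$$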
Structured Compact Representations
Structured L-BFGS variants incorporate second-derivative information (when available), storing curvature triplets and exploiting compressed matrix factorizations. Efficient two-loop-style recursions that exploit fast solvers for known Hessian blocks preserve the $O(mn)$ cost and yield significant speedups (e.g., 50% fewer iterations in phase-retrieval imaging) (2208.00057).
Step-Size Learning via Neural Networks
Instead of costly line searches, step sizes can be selected via policies learned from historical optimization runs. Neural architectures trained with truncated backpropagation through time (TBPTT) digest inner products of current and prior gradients and directions; an illustrative feature construction is sketched below. Learned policies outperform conventional optimizers (ADAM, RMSprop, heuristic L-BFGS variants) on tasks like MNIST and, upon warm-start retraining, transfer effectively to new problem classes such as CIFAR-10 (Egidio et al., 2020).
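As a purely illustrative sketch (the architecture, feature set, and TBPTT training loop of (Egidio et al., 2020) differ in detail), a learned policy might map inner products of recent gradients and directions to a positive step size:

```python
import numpy as np

def step_size_features(g, g_prev, d, d_prev):
    """Illustrative features: inner products of current/previous gradients and directions."""
    return np.array([g @ g, g @ g_prev, g @ d, g_prev @ d_prev, d @ d_prev, d @ d])

def policy_step_size(features, W1, b1, w2, b2):
    """Hypothetical tiny MLP policy; softplus keeps the predicted step size positive."""
    h = np.tanh(W1 @ features + b1)
    return float(np.log1p(np.exp(w2 @ h + b2)))
```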
5. Robustness, Failure Modes, and Stabilization
Noise-Induced Instabilities
In computational settings with noisy gradients—such as electronic structure calculations—classical L-BFGS is prone to instability due to spurious or indefinite curvature information. Extraction of the "significant subspace" via diagonalization of overlap matrices and regularization of curvatures (e.g., Weinstein-type residual norms) circumvents noise accumulation, providing resilience against divergence (Schaefer et al., 2014).
Displacement Aggregation
Modified L-BFGS with displacement aggregation (AggMBFGS) detects and removes linearly dependent correction pairs, restructuring the remaining gradient differences so that the inverse Hessian approximation is preserved. This maintains the $O(mn)$ complexity of the two-loop recursion while reducing iteration counts and function evaluations. AggMBFGS yields lower relative errors and enhanced efficiency in large-scale eigenvalue computations and nonconvex optimization (Sahu et al., 2023).
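A minimal sketch of the dependence test that can trigger aggregation: checking whether a new displacement lies numerically in the span of those already stored. The tolerance `tol` is an assumed placeholder, and the subsequent restructuring of the retained pairs follows (Sahu et al., 2023) and is not reproduced here.

```python
import numpy as np

def in_span(S, s_new, tol=1e-8):
    """Return True if the new displacement s_new lies (numerically) in the span
    of the previously stored displacements, given as the columns of S."""
    coeffs, _residuals, *_ = np.linalg.lstsq(S, s_new, rcond=None)
    r = s_new - S @ coeffs  # residual of the best least-squares fit
    return np.linalg.norm(r) <= tol * max(1.0, np.linalg.norm(s_new))
```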
6. Convergence Theory and Empirical Performance
Global convergence of L-BFGS—and its regularized and online extensions—rests on standard assumptions: twice continuous differentiability, compact level sets, strong convexity or positive-definite limit-point Hessian, and appropriate step-size selection (Wolfe or Armijo). Strong numerical results (e.g., superlinear convergence near minimizers, robust loss reduction, rapid curvature adaptation, empirical superiority in CUTEst test sets) support the choice of L-BFGS and its variants for large-scale, high-dimensional optimization in domains including deep learning, computational chemistry, imaging, and eigenvalue problems (Tankaria et al., 2021, Rafati et al., 2019, Sahu et al., 2023).
7. Practical Implementation and Guidelines
For effective deployment:
- Memory parameter $m$ (typically small, e.g., between $5$ and $20$): a tradeoff between curvature fidelity and resource usage; library implementations expose this directly (see the SciPy sketch after this list)
- Initial scaling set, with safeguarding, by $\gamma_k = s_{k-1}^\top y_{k-1} / (y_{k-1}^\top y_{k-1})$
- Regularization parameters (update factors, ratio thresholds, minimum regularization level) and nonmonotone window lengths tuned empirically
- For stochastic applications, ensure curvature pairs are formed from identical mini-batches for secant condition validity
- Structured and dense initializations advantageous in trust-region and hybrid methods, especially in ill-conditioned or partially-known Hessian contexts
- Step-size learning, nonmonotone acceptance, and aggregation strategies further reduce designer burden and improve scalability
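For many applications a well-tested library implementation suffices; for instance, SciPy's L-BFGS-B routine exposes the memory parameter through the `maxcor` option. The strongly convex quadratic below is purely illustrative.

```python
import numpy as np
from scipy.optimize import minimize

# Illustrative strongly convex quadratic: f(x) = 0.5 * x'Ax - b'x.
rng = np.random.default_rng(0)
A = rng.standard_normal((100, 100))
A = A @ A.T + 100 * np.eye(100)   # symmetric positive definite
b = rng.standard_normal(100)

f = lambda x: 0.5 * x @ A @ x - b @ x
grad = lambda x: A @ x - b

res = minimize(f, np.zeros(100), jac=grad, method="L-BFGS-B",
               options={"maxcor": 10, "gtol": 1e-8, "maxiter": 500})
print(res.nit, np.linalg.norm(grad(res.x)))  # iterations and final gradient norm
```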
L-BFGS and its modern variants integrate second-order efficiency with the scalability and robustness needed for today's large-scale scientific and engineering computations.