L-BFGS: Limited Memory Quasi-Newton Method
- L-BFGS is a quasi-Newton optimization algorithm that approximates the inverse Hessian using a limited set of correction pairs for efficient large-scale minimization.
- It employs a two-loop recursion to compute search directions and integrates extensions like regularization, nonmonotone strategies, and step-size learning to improve robustness.
- Advanced adaptations, including stochastic variants and structured compact representations, enhance its practical applicability in scientific computing and deep learning.
The Limited-Memory Broyden–Fletcher–Goldfarb–Shanno (L-BFGS) algorithm is a class of quasi-Newton optimization methods designed for solving large-scale unconstrained minimization problems where the objective function is at least continuously differentiable. Unlike full-memory quasi-Newton methods, L-BFGS avoids the explicit formation and storage of approximations of the Hessian or its inverse by maintaining only a limited set of correction pairs, resulting in reduced computational and memory overhead. Over time, derivatives and enhancements—including regularization, stochastic variants, step-size learning, dense initialization, and structured curvature—have substantially extended the practical reach and robustness of L-BFGS.
1. Algorithmic Foundations and Two-Loop Recursion
L-BFGS iteratively approximates the inverse Hessian matrix to compute search directions for unconstrained minimization of $f$. At iteration $k$, given the iterate $x_k$ and gradient $g_k = \nabla f(x_k)$, the algorithm computes a step $x_{k+1} = x_k + \alpha_k d_k$ with $d_k = -H_k g_k$, where $H_k$ is an implicit inverse Hessian approximation. The update is based on storing the $m$ most recent correction pairs $(s_i, y_i)$, where $s_i = x_{i+1} - x_i$ and $y_i = \nabla f(x_{i+1}) - \nabla f(x_i)$.
The search direction employs the "two-loop recursion," which applies a series of scalar-vector operations to a working vector $q$:
- Set $q = \nabla f(x_k)$.
- For $i = k-1, \dots, k-m$: compute $\alpha_i = \rho_i\, s_i^\top q$ with $\rho_i = 1/(y_i^\top s_i)$, and update $q \leftarrow q - \alpha_i y_i$.
- Initialize $r$ with $r = H_k^0 q$, where $H_k^0 = \gamma_k I$ and $\gamma_k = s_{k-1}^\top y_{k-1} / (y_{k-1}^\top y_{k-1})$.
- For $i = k-m, \dots, k-1$: compute $\beta_i = \rho_i\, y_i^\top r$ and update $r \leftarrow r + s_i(\alpha_i - \beta_i)$.
- The search direction is then $d_k = -r$.
This approach achieves a per-iteration computational cost of $O(mn)$ and storage of $O(mn)$ (with $m \ll n$), making it efficient for high-dimensional problems.
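For concreteness, the following NumPy sketch implements the two-loop recursion exactly as listed above; the container names (`s_list`, `y_list`, oldest pair first) are illustrative rather than taken from any particular reference implementation, and at least one stored pair is assumed.

```python
import numpy as np

def lbfgs_direction(grad, s_list, y_list):
    """Two-loop recursion: return d_k = -H_k * grad using the m most recent
    correction pairs (s_i, y_i), stored oldest-first in s_list / y_list."""
    q = grad.copy()
    rhos = [1.0 / np.dot(y, s) for s, y in zip(s_list, y_list)]
    alphas = []

    # First loop: newest pair to oldest.
    for s, y, rho in zip(reversed(s_list), reversed(y_list), reversed(rhos)):
        alpha = rho * np.dot(s, q)
        q -= alpha * y
        alphas.append(alpha)

    # Initial scaling H_k^0 = gamma_k * I from the most recent pair.
    s_last, y_last = s_list[-1], y_list[-1]
    gamma = np.dot(s_last, y_last) / np.dot(y_last, y_last)
    r = gamma * q

    # Second loop: oldest pair to newest (reversed(alphas) restores that order).
    for (s, y, rho), alpha in zip(zip(s_list, y_list, rhos), reversed(alphas)):
        beta = rho * np.dot(y, r)
        r += s * (alpha - beta)

    return -r  # search direction d_k
```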
2. Globalization, Regularization, and Extensions
Classical Globalization
To ensure convergence, classical L-BFGS applies a line search that selects $\alpha_k$ satisfying the Wolfe or strong Wolfe conditions (a simple backtracking check is sketched after this list):
- Armijo (sufficient decrease): $f(x_k + \alpha_k d_k) \le f(x_k) + c_1 \alpha_k \nabla f(x_k)^\top d_k$
- Curvature: $\nabla f(x_k + \alpha_k d_k)^\top d_k \ge c_2 \nabla f(x_k)^\top d_k$, for constants $0 < c_1 < c_2 < 1$
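A minimal sketch of checking these conditions and of a plain Armijo backtracking search; the constants $c_1$, $c_2$, and the shrink factor are conventional defaults rather than values prescribed by the cited works.

```python
import numpy as np

def wolfe_conditions_hold(f, grad_f, x, d, alpha, c1=1e-4, c2=0.9):
    """Check the (weak) Wolfe conditions for step size alpha along direction d."""
    fx, gx = f(x), grad_f(x)
    slope = np.dot(gx, d)  # directional derivative; should be negative for descent
    armijo = f(x + alpha * d) <= fx + c1 * alpha * slope
    curvature = np.dot(grad_f(x + alpha * d), d) >= c2 * slope
    return armijo and curvature

def backtracking_armijo(f, grad_f, x, d, alpha0=1.0, c1=1e-4, shrink=0.5, max_iter=50):
    """Simple backtracking line search enforcing only the Armijo condition."""
    fx, slope = f(x), np.dot(grad_f(x), d)
    alpha = alpha0
    for _ in range(max_iter):
        if f(x + alpha * d) <= fx + c1 * alpha * slope:
            break
        alpha *= shrink
    return alpha
```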
Regularization
A regularized variant modifies the Hessian approximation as $B_k + \mu_k I$ for a regularization parameter $\mu_k > 0$, correspondingly adjusting the initial diagonal scaling that seeds the two-loop recursion. A trust-region-style ratio
$$\rho_k = \frac{f(x_k) - f(x_k + d_k)}{f(x_k) - m_k(d_k)}$$
of actual to predicted reduction under the local quadratic model $m_k$ guides adaptive control of $\mu_k$, avoiding costly or unstable line searches. Explicit regularization is shown to achieve global convergence under standard assumptions (Tankaria et al., 2021).
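The following sketch illustrates how such a ratio can drive the regularization parameter; the thresholds and update factors (`eta1`, `eta2`, `shrink`, `grow`, `mu_min`) are placeholders, not the specific values of (Tankaria et al., 2021).

```python
def update_regularization(ratio, mu, mu_min=1e-8, eta1=0.25, eta2=0.75,
                          shrink=0.5, grow=2.0):
    """Adjust the regularization parameter mu from a trust-region-style ratio
    of actual to predicted reduction. Thresholds and factors are illustrative."""
    if ratio >= eta2:      # model agrees well with f: relax regularization
        mu = max(mu_min, shrink * mu)
    elif ratio < eta1:     # poor agreement: regularize more strongly
        mu = grow * mu
    return mu
```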
Nonmonotone and Hybrid Strategies
Incorporating nonmonotone ratios, in which $f(x_k)$ in the numerator of the ratio is replaced by the maximum objective value over a window of recent iterates, $\max_{0 \le j \le M} f(x_{k-j})$, permits occasional increases in the objective to reduce over-regularization. Hybrid strategies apply the strong Wolfe line search when regularization is minimal and the curvature condition fails.
Parameter Selection
Good practical performance is reported with a small memory size (e.g., $m = 7$), together with appropriately chosen regularization update factors, ratio thresholds, and a minimum regularization level (Tankaria et al., 2021).
3. Adaptations to Stochastic and Large-Scale Environments
Online/Stochastic L-BFGS
In stochastic settings, online L-BFGS computes gradients on random mini-batches and forms correction pairs only from gradients evaluated on the same mini-batch. Safe-update rules enforce a curvature threshold, e.g., $s_k^\top y_k \ge \epsilon \|s_k\|^2$ for some $\epsilon > 0$, to ensure stability. The two-loop recursion and step-size selection (often via Armijo backtracking) extend directly to the stochastic context.
Practical implementations store queues of curvature pairs and adapt step sizes using running estimates of the gradient mean and variance (Welford's method), maintaining $O(mn)$ cost; a sketch of the pair-update logic is given below. Empirically, oL-BFGS achieves faster convergence and a lower memory footprint than full quasi-Newton methods or RES on problems such as SVMs and logistic regression for click-through-rate estimation (Mokhtari et al., 2014, Yatawatta et al., 2019).
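A minimal sketch of the curvature-pair bookkeeping in the online setting: both gradients of a pair are taken on the same mini-batch, and a safe-update rule with an assumed threshold `eps` discards pairs with insufficient curvature.

```python
from collections import deque

import numpy as np

def update_curvature_pairs(pairs, x_old, x_new, batch_grad, m=10, eps=1e-10):
    """Append (s, y) computed on a single fixed mini-batch; keep at most m pairs.

    `batch_grad(x)` must evaluate the gradient on the *same* mini-batch for both
    x_old and x_new so that the secant condition remains meaningful.
    """
    s = x_new - x_old
    y = batch_grad(x_new) - batch_grad(x_old)
    # Safe-update rule: require sufficient positive curvature before storing.
    if np.dot(s, y) > eps * np.dot(s, s):
        pairs.append((s, y))
        while len(pairs) > m:
            pairs.popleft()  # drop the oldest pair
    return pairs

# Usage: pairs = deque(); update_curvature_pairs(pairs, x_old, x_new, batch_grad)
```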
Application in Radio Interferometric Calibration and Deep Learning
Stochastic L-BFGS enables batchwise calibration of very large datasets (e.g., raw interferometric data), bypassing the need for data reduction via averaging. In deep learning, stochastic L-BFGS (with activations providing suitable curvature, e.g., ELU) achieves validation accuracy comparable to first-order methods, though wall-clock efficiency may still favor SGD and Adam (Yatawatta et al., 2019).
4. Enhancements: Dense Initialization, Compact Representations, and Step-Size Learning
Dense Diagonal Initialization
Splitting $\mathbb{R}^n$ into the subspace generated by the quasi-Newton updates and its orthogonal complement, via eigendecomposition of the compact update matrices, allows distinct spectral estimates to be assigned to each subspace. Trust-region methods using a shape-changing norm together with such dense initialization outperform standard diagonal approaches and hybrid methods on nonconvex unconstrained problems (Brust et al., 2017).
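Schematically, such a dense initialization takes the form below, where $P_{\parallel}$ spans the subspace generated by the stored quasi-Newton updates, $P_{\perp}$ spans its orthogonal complement, and $\gamma_k, \gamma_k^{\perp}$ are the two spectral estimates; the notation is generic rather than that of (Brust et al., 2017):
$$B_0^{(k)} = \gamma_k\, P_{\parallel} P_{\parallel}^{\top} + \gamma_k^{\perp}\, P_{\perp} P_{\perp}^{\top}.$$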
Structured Compact Representations
Structured L-BFGS variants incorporate second-derivative information (when available), storing curvature triplets and exploiting compressed matrix factorizations. Efficient two-loop-style recursions that exploit fast solvers for known Hessian blocks preserve the $O(mn)$ cost and yield significant speedups (e.g., 50% fewer iterations in phase-retrieval imaging) (2208.00057).
Step-Size Learning via Neural Networks
Instead of costly line searches, step sizes can be selected via policies learned from historical optimization runs. Neural architectures trained with truncated backpropagation through time (TBPTT) digest inner products of current and prior gradients and directions; an illustrative feature construction is sketched below. Learned policies outperform conventional optimizers (ADAM, RMSprop, heuristic L-BFGS variants) on tasks like MNIST and, upon warm-start retraining, transfer effectively to new problem classes such as CIFAR-10 (Egidio et al., 2020).
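As a purely illustrative sketch (the architecture, feature set, and TBPTT training loop of (Egidio et al., 2020) differ in detail), a learned policy might map inner products of recent gradients and directions to a positive step size:

```python
import numpy as np

def step_size_features(g, g_prev, d, d_prev):
    """Illustrative features: inner products of current/previous gradients and directions."""
    return np.array([g @ g, g @ g_prev, g @ d, g_prev @ d_prev, d @ d_prev, d @ d])

def policy_step_size(features, W1, b1, w2, b2):
    """Hypothetical tiny MLP policy; softplus keeps the predicted step size positive."""
    h = np.tanh(W1 @ features + b1)
    return float(np.log1p(np.exp(w2 @ h + b2)))
```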
5. Robustness, Failure Modes, and Stabilization
Noise-Induced Instabilities
In computational settings with noisy gradients—such as electronic structure calculations—classical L-BFGS is prone to instability due to spurious or indefinite curvature information. Extraction of the "significant subspace" via diagonalization of overlap matrices and regularization of curvatures (e.g., Weinstein-type residual norms) circumvents noise accumulation, providing resilience against divergence (Schaefer et al., 2014).
Displacement Aggregation
Modified L-BFGS with displacement aggregation (AggMBFGS) detects and removes linearly dependent correction pairs, restructuring the remaining gradient differences so that the inverse Hessian approximation is preserved. This maintains the $O(mn)$ complexity of the two-loop recursion while reducing iteration counts and function evaluations. AggMBFGS yields lower relative errors and enhanced efficiency in large-scale eigenvalue computations and nonconvex optimization (Sahu et al., 2023).
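A minimal sketch of the dependence test that can trigger aggregation: checking whether a new displacement lies numerically in the span of those already stored. The tolerance `tol` is an assumed placeholder, and the subsequent restructuring of the retained pairs follows (Sahu et al., 2023) and is not reproduced here.

```python
import numpy as np

def in_span(S, s_new, tol=1e-8):
    """Return True if the new displacement s_new lies (numerically) in the span
    of the previously stored displacements, given as the columns of S."""
    coeffs, _residuals, *_ = np.linalg.lstsq(S, s_new, rcond=None)
    r = s_new - S @ coeffs  # residual of the best least-squares fit
    return np.linalg.norm(r) <= tol * max(1.0, np.linalg.norm(s_new))
```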
6. Convergence Theory and Empirical Performance
Global convergence of L-BFGS—and its regularized and online extensions—rests on standard assumptions: twice continuous differentiability, compact level sets, strong convexity or positive-definite limit-point Hessian, and appropriate step-size selection (Wolfe or Armijo). Strong numerical results (e.g., superlinear convergence near minimizers, robust loss reduction, rapid curvature adaptation, empirical superiority in CUTEst test sets) support the choice of L-BFGS and its variants for large-scale, high-dimensional optimization in domains including deep learning, computational chemistry, imaging, and eigenvalue problems (Tankaria et al., 2021, Rafati et al., 2019, Sahu et al., 2023).
7. Practical Implementation and Guidelines
For effective deployment:
- Memory parameter $m$ (typically small, e.g., between $5$ and $20$): a tradeoff between curvature fidelity and resource usage; library implementations expose this directly (see the SciPy sketch after this list)
- Initial scaling set, with safeguarding, by $\gamma_k = s_{k-1}^\top y_{k-1} / (y_{k-1}^\top y_{k-1})$
- Regularization parameters (update factors, ratio thresholds, minimum regularization level) and nonmonotone window lengths tuned empirically
- For stochastic applications, ensure curvature pairs are formed from identical mini-batches for secant condition validity
- Structured and dense initializations advantageous in trust-region and hybrid methods, especially in ill-conditioned or partially-known Hessian contexts
- Step-size learning, nonmonotone acceptance, and aggregation strategies further reduce designer burden and improve scalability
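For many applications a well-tested library implementation suffices; for instance, SciPy's L-BFGS-B routine exposes the memory parameter through the `maxcor` option. The strongly convex quadratic below is purely illustrative.

```python
import numpy as np
from scipy.optimize import minimize

# Illustrative strongly convex quadratic: f(x) = 0.5 * x'Ax - b'x.
rng = np.random.default_rng(0)
A = rng.standard_normal((100, 100))
A = A @ A.T + 100 * np.eye(100)   # symmetric positive definite
b = rng.standard_normal(100)

f = lambda x: 0.5 * x @ A @ x - b @ x
grad = lambda x: A @ x - b

res = minimize(f, np.zeros(100), jac=grad, method="L-BFGS-B",
               options={"maxcor": 10, "gtol": 1e-8, "maxiter": 500})
print(res.nit, np.linalg.norm(grad(res.x)))  # iterations and final gradient norm
```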
L-BFGS and its modern variants integrate second-order efficiency with the scalability and robustness needed for today's large-scale scientific and engineering computations.