L-BFGS: Limited Memory Quasi-Newton Method

Updated 17 November 2025
  • L-BFGS is a quasi-Newton optimization algorithm that approximates the inverse Hessian using a limited set of correction pairs for efficient large-scale minimization.
  • It employs a two-loop recursion to compute search directions and integrates extensions like regularization, nonmonotone strategies, and step-size learning to improve robustness.
  • Advanced adaptations, including stochastic variants and structured compact representations, enhance its practical applicability in scientific computing and deep learning.

The Limited-Memory Broyden–Fletcher–Goldfarb–Shanno (L-BFGS) algorithm is a class of quasi-Newton optimization methods designed for solving large-scale unconstrained minimization problems where the objective function $f:\mathbb{R}^n \rightarrow \mathbb{R}$ is at least continuously differentiable. Unlike full-memory quasi-Newton methods, L-BFGS avoids the explicit formation and storage of $n \times n$ approximations of the Hessian or its inverse by maintaining only a limited set of correction pairs, resulting in reduced computational and memory overhead. Over time, derived methods and enhancements (regularization, stochastic variants, step-size learning, dense initialization, and structured curvature) have substantially extended the practical reach and robustness of L-BFGS.

1. Algorithmic Foundations and Two-Loop Recursion

L-BFGS iteratively approximates the inverse Hessian matrix to compute search directions for unconstrained minimization. At iteration $k$, given $x_k$, the algorithm computes a step $d_k = -H_k \nabla f(x_k)$, where $H_k$ is an implicit inverse Hessian approximation. The update is based on storing the $m$ most recent pairs:

  • $s_k = x_{k+1} - x_k$
  • $y_k = \nabla f(x_{k+1}) - \nabla f(x_k)$

The search direction employs the "two-loop recursion," which applies a series of scalar-vector operations to a working vector $q$:

  1. Set $q \leftarrow \nabla f(x_k)$.
  2. For $i = k-1, \ldots, k-m$:
    • Compute $\rho_i = 1 / (y_i^\top s_i)$
    • $\alpha_i \leftarrow \rho_i\, s_i^\top q$
    • $q \leftarrow q - \alpha_i y_i$
  3. Initialize $r \leftarrow \gamma_k q$ with $\gamma_k = (s_{k-1}^\top y_{k-1}) / (y_{k-1}^\top y_{k-1})$
  4. For $i = k-m, \ldots, k-1$:
    • $\beta_i \leftarrow \rho_i\, y_i^\top r$
    • $r \leftarrow r + s_i (\alpha_i - \beta_i)$
  5. The search direction is then $d_k = -r$.

This approach achieves a per-iteration computational cost of $\mathcal{O}(mn)$ and storage of $\mathcal{O}(mn)$, making it efficient for high-dimensional problems.
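
The recursion translates almost line-for-line into code. The following NumPy sketch assumes the correction pairs are stored oldest-first in Python lists; the variable names and the identity fallback for an empty history are illustrative choices rather than part of the algorithm's specification.

```python
import numpy as np

def two_loop_direction(grad, s_list, y_list):
    """L-BFGS two-loop recursion: returns d_k = -H_k * grad.

    s_list, y_list hold up to m recent correction pairs, oldest first.
    """
    q = grad.copy()
    rho = [1.0 / np.dot(y, s) for s, y in zip(s_list, y_list)]
    alpha = [0.0] * len(s_list)

    # First loop: newest pair to oldest.
    for i in reversed(range(len(s_list))):
        alpha[i] = rho[i] * np.dot(s_list[i], q)
        q = q - alpha[i] * y_list[i]

    # Initial scaling gamma_k = (s_{k-1}^T y_{k-1}) / (y_{k-1}^T y_{k-1}).
    if s_list:
        gamma = np.dot(s_list[-1], y_list[-1]) / np.dot(y_list[-1], y_list[-1])
    else:
        gamma = 1.0  # no curvature history yet: fall back to steepest descent scaling
    r = gamma * q

    # Second loop: oldest pair to newest.
    for i in range(len(s_list)):
        beta = rho[i] * np.dot(y_list[i], r)
        r = r + s_list[i] * (alpha[i] - beta)

    return -r
```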

2. Globalization, Regularization, and Extensions

Classical Globalization

To ensure convergence, classical L-BFGS applies a line search—often using Wolfe or strong Wolfe conditions:

  • Armijo: $f(x_k + \alpha_k d_k) \leq f(x_k) + c_1 \alpha_k \nabla f(x_k)^\top d_k$
  • Curvature: $\nabla f(x_k + \alpha_k d_k)^\top d_k \geq c_2 \nabla f(x_k)^\top d_k$ for $0 < c_1 < c_2 < 1$
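
In practice the sufficient-decrease test is often enforced with simple backtracking, with the curvature condition checked separately or handled by a full Wolfe search. A minimal backtracking sketch, assuming NumPy arrays and illustrative defaults for $c_1$ and the shrink factor:

```python
import numpy as np

def backtracking_armijo(f, grad_f, x, d, alpha0=1.0, c1=1e-4, shrink=0.5, max_trials=50):
    """Backtracking line search enforcing the Armijo (sufficient decrease) condition."""
    fx = f(x)
    slope = np.dot(grad_f(x), d)  # directional derivative; negative for a descent direction
    alpha = alpha0
    for _ in range(max_trials):
        if f(x + alpha * d) <= fx + c1 * alpha * slope:
            break
        alpha *= shrink
    return alpha
```

A strong Wolfe search additionally requires $|\nabla f(x_k + \alpha_k d_k)^\top d_k| \leq c_2 |\nabla f(x_k)^\top d_k|$ and is typically implemented with bracketing and interpolation rather than pure backtracking.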

Regularization

A regularized variant modifies $y_k$ as $\hat{y}_k(\mu) = y_k + \mu s_k$ for $\mu > 0$, adjusting the initial diagonal scaling to $\gamma_k/(1 + \gamma_k \mu)$. A trust-region-style ratio

$$r_k(d_k(\mu), \mu) = \frac{f(x_k) - f(x_k + d_k(\mu))}{f(x_k) - q_k(d_k(\mu), \mu)},$$

where $q_k(\cdot, \mu)$ denotes the local quadratic model of $f$ at $x_k$, guides adaptive control of $\mu$ to avoid costly or unstable line searches. Explicit regularization is shown to achieve global convergence under standard assumptions (Tankaria et al., 2021).
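
A hedged sketch of the bookkeeping this implies is given below. The threshold and update-factor names mirror the parameter values quoted under "Parameter Selection"; the acceptance logic is a plausible sketch, not necessarily the referenced method's exact rule.

```python
import numpy as np

def regularized_pair(s, y, mu):
    """Form y_hat(mu) = y + mu * s and the adjusted initial scaling gamma/(1 + gamma*mu)."""
    y_hat = y + mu * s
    gamma = np.dot(s, y) / np.dot(y, y)      # standard initial scaling gamma_k
    return y_hat, gamma / (1.0 + gamma * mu)

def update_mu(ratio, mu, eta1=1e-2, eta2=0.9, gamma1=0.1, gamma2=10.0, mu_min=1e-3):
    """Adapt the regularization parameter from the trust-region-style ratio r_k."""
    if ratio < eta1:        # poor agreement between f and the model: regularize more
        return gamma2 * mu
    if ratio > eta2:        # good agreement: relax regularization
        return max(gamma1 * mu, mu_min)
    return mu               # in between: keep mu unchanged
```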

Nonmonotone and Hybrid Strategies

Incorporating nonmonotone ratios, replacing $f(x_k)$ by $\max_{j=0,\ldots,M} f(x_{k-j})$, permits occasional increases in $f$ to reduce over-regularization. Hybrid strategies apply the strong Wolfe line search when regularization is minimal and the curvature condition fails.
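
The nonmonotone reference value only requires a short history of objective values; a minimal sketch, with the window length $M$ as a tunable choice:

```python
from collections import deque

class NonmonotoneReference:
    """Track the max of the last M+1 objective values for nonmonotone acceptance tests."""

    def __init__(self, M=5):
        self.history = deque(maxlen=M + 1)

    def push(self, f_value):
        self.history.append(f_value)

    def reference(self):
        # Used in place of f(x_k) on the left-hand side of the acceptance test.
        return max(self.history)
```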

Parameter Selection

Optimal performance is obtained with practical parameter choices such as memory $m = 5$ or $7$, regularization update factors $\gamma_1 \approx 0.1$ and $\gamma_2 \approx 10$, ratio thresholds $\eta_1 \approx 10^{-2}$ and $\eta_2 \approx 0.9$, and minimum regularization $\mu_{\min} \approx 10^{-3}$ (Tankaria et al., 2021).

3. Adaptations to Stochastic and Large-Scale Environments

Online/Stochastic L-BFGS

In stochastic settings, online L-BFGS computes gradients on random mini-batches and updates correction pairs only when the mini-batch is unchanged. Safe-update rules enforce $y_k^\top s_k > \epsilon \|s_k\|^2$ to ensure stability. The two-loop recursion and step-size selection (often via Armijo backtracking) are directly extended to the stochastic context.

Practical algorithmic structure involves storing queues of $m$ curvature pairs and updating step sizes using the gradient mean and variance (Welford's method), maintaining $\mathcal{O}(mn)$ cost. Empirically, oL-BFGS achieves faster convergence and lower memory footprints relative to full quasi-Newton or RES on problems such as SVMs and logistic regression for click-through rate estimation (Mokhtari et al., 2014, Yatawatta et al., 2019).
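
A sketch of the safe-update rule and the running gradient statistics, assuming NumPy vectors; the queue length $m$ and tolerance $\epsilon$ are illustrative:

```python
import numpy as np
from collections import deque

def maybe_store_pair(s, y, s_queue, y_queue, eps=1e-8):
    """Accept a stochastic correction pair only if y^T s > eps * ||s||^2."""
    if np.dot(y, s) > eps * np.dot(s, s):
        s_queue.append(s)
        y_queue.append(y)

class WelfordStats:
    """Running mean and variance (Welford's method), e.g. of gradient norms."""

    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, value):
        self.n += 1
        delta = value - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (value - self.mean)

    def variance(self):
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

# Fixed-length queues drop the oldest pair automatically once m pairs are stored.
m = 5
s_queue, y_queue = deque(maxlen=m), deque(maxlen=m)
```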

Application in Radio Interferometric Calibration and Deep Learning

Stochastic L-BFGS enables batchwise calibration of vast datasets (e.g., raw interferometric data), bypassing the need for data reduction via averaging. In deep learning, stochastic L-BFGS (with activations that provide suitable curvature, e.g., ELU) achieves validation accuracy comparable to first-order methods, though wall-clock efficiency may still favor SGD and Adam (Yatawatta et al., 2019).

4. Enhancements: Dense Initialization, Compact Representations, and Step-Size Learning

Dense Diagonal Initialization

Splitting $\mathbb{R}^n$ into learned and orthogonal subspaces via eigendecomposition of the quasi-Newton update matrices allows assignment of distinct spectral estimates $(\gamma_k, \gamma_k^\perp)$ to the two subspaces. Trust-region methods using a shape-changing norm and such dense initialization outperform standard diagonal approaches and hybrid methods on nonconvex unconstrained problems (Brust et al., 2017).
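
As a rough illustration of the idea (not the referenced trust-region machinery), applying an initial matrix with two distinct spectral estimates reduces to a projection. Here the columns of `P` are assumed orthonormal and to span the subspace carrying the stored curvature information:

```python
import numpy as np

def apply_dense_B0(v, P, gamma, gamma_perp):
    """Apply B_0 = gamma * P P^T + gamma_perp * (I - P P^T) to a vector v.

    gamma acts on the 'learned' subspace spanned by P, gamma_perp on its
    orthogonal complement; both are scalar spectral estimates.
    """
    proj = P @ (P.T @ v)                 # component inside the learned subspace
    return gamma * proj + gamma_perp * (v - proj)
```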

Structured Compact Representations

Structured L-BFGS variants incorporate second-derivative information (when available), storing curvature triplets $(s, u, v)$ and exploiting compressed matrix factorizations. Efficient two-loop-style recursions that exploit fast solvers for known Hessian blocks preserve the $\mathcal{O}(mn)$ cost and yield significant speedups (e.g., 50% fewer iterations in phase retrieval imaging) (2208.00057).

Step-Size Learning via Neural Networks

Instead of costly line searches, step sizes can be selected via policies learned from historical optimization runs. Neural architectures trained with truncated backpropagation through time (TBPTT) take as input inner products of current and prior gradients and search directions. Learned policies outperform conventional optimizers (Adam, RMSprop, heuristic L-BFGS variants) on tasks like MNIST and, after warm-start retraining, transfer effectively to new problem classes such as CIFAR-10 (Egidio et al., 2020).
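
The policy's inputs are cheap to form. A hedged sketch of the kind of feature vector such a network might digest; the specific choice and normalization of inner products is illustrative, not the referenced paper's exact parameterization:

```python
import numpy as np

def step_size_features(g, g_prev, d, d_prev):
    """Scale-normalized inner products of current/previous gradients and directions."""
    feats = np.array([
        np.dot(g, d), np.dot(g_prev, d_prev),
        np.dot(g, g_prev), np.dot(d, d_prev),
        np.linalg.norm(g), np.linalg.norm(d),
    ])
    return feats / (np.linalg.norm(feats) + 1e-12)
```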

5. Robustness, Failure Modes, and Stabilization

Noise-Induced Instabilities

In computational settings with noisy gradients—such as electronic structure calculations—classical L-BFGS is prone to instability due to spurious or indefinite curvature information. Extraction of the "significant subspace" via diagonalization of overlap matrices and regularization of curvatures (e.g., Weinstein-type residual norms) circumvents noise accumulation, providing resilience against divergence (Schaefer et al., 2014).

Displacement Aggregation

Modified L-BFGS with displacement aggregation (AggMBFGS) detects and removes linearly dependent correction pairs, restructuring the remaining gradient differences so that the inverse Hessian approximation is preserved. This maintains the $\mathcal{O}(\tau d)$ complexity of the two-loop recursion while reducing iterations and function evaluations. AggMBFGS yields lower relative errors and enhanced efficiency in large-scale eigenvalue computations and nonconvex optimization (Sahu et al., 2023).
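
Detecting the redundancy that triggers aggregation amounts to a linear-dependence test on the stored displacement vectors; a hedged numerical sketch is shown below. The subsequent restructuring of the gradient differences follows the referenced paper and is not reproduced here.

```python
import numpy as np

def is_nearly_dependent(s_new, S, tol=1e-8):
    """Return True if s_new lies (numerically) in the span of the columns of S,
    which makes the newest correction pair a candidate for aggregation."""
    if S.size == 0:
        return False
    coeffs, *_ = np.linalg.lstsq(S, s_new, rcond=None)
    residual = s_new - S @ coeffs
    return np.linalg.norm(residual) <= tol * np.linalg.norm(s_new)
```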

6. Convergence Theory and Empirical Performance

Global convergence of L-BFGS—and its regularized and online extensions—rests on standard assumptions: twice continuous differentiability, compact level sets, strong convexity or positive-definite limit-point Hessian, and appropriate step-size selection (Wolfe or Armijo). Strong numerical results (e.g., superlinear convergence near minimizers, robust loss reduction, rapid curvature adaptation, empirical superiority in CUTEst test sets) support the choice of L-BFGS and its variants for large-scale, high-dimensional optimization in domains including deep learning, computational chemistry, imaging, and eigenvalue problems (Tankaria et al., 2021, Rafati et al., 2019, Sahu et al., 2023).

7. Practical Implementation and Guidelines

For effective deployment:

  • Memory parameter $m \in [3, 20]$ (tradeoff between curvature fidelity and resource usage)
  • Initial scaling $\gamma_k$ via safeguarding, typically set to $s_{k-1}^\top y_{k-1} / y_{k-1}^\top y_{k-1}$
  • Regularization parameters (e.g., $\mu_{\min}$, $\gamma_1$, $\gamma_2$) and nonmonotone windows ($M$) tuned empirically
  • For stochastic applications, ensure curvature pairs are formed from identical mini-batches so that the secant condition remains valid
  • Structured and dense initializations are advantageous in trust-region and hybrid methods, especially in ill-conditioned or partially known Hessian contexts
  • Step-size learning, nonmonotone acceptance, and aggregation strategies further reduce designer burden and improve scalability
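
As a concrete starting point, these guidelines map onto off-the-shelf implementations such as SciPy's L-BFGS-B. In the sketch below the extended Rosenbrock objective and the option values are illustrative, and `maxcor` plays the role of the memory parameter $m$:

```python
import numpy as np
from scipy.optimize import minimize

def rosenbrock(x):
    return np.sum(100.0 * (x[1:] - x[:-1] ** 2) ** 2 + (1.0 - x[:-1]) ** 2)

def rosenbrock_grad(x):
    g = np.zeros_like(x)
    g[:-1] = -400.0 * x[:-1] * (x[1:] - x[:-1] ** 2) - 2.0 * (1.0 - x[:-1])
    g[1:] += 200.0 * (x[1:] - x[:-1] ** 2)
    return g

x0 = np.zeros(1000)
result = minimize(
    rosenbrock, x0, jac=rosenbrock_grad, method="L-BFGS-B",
    options={"maxcor": 10, "gtol": 1e-8, "maxiter": 500},
)
print(result.nit, result.fun)   # iterations used and final objective value
```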

L-BFGS and its modern variants integrate second-order efficiency with the scalability and robustness needed for today's large-scale scientific and engineering computations.
