Limited-Memory BFGS for Large-Scale Optimization
- Limited-Memory BFGS is a quasi-Newton optimization method that approximates the inverse Hessian using a few recent curvature pairs, making it suitable for high-dimensional unconstrained problems.
- It achieves efficient per-iteration performance with O(mn) memory and computational cost, adapting to both deterministic and stochastic gradient settings for applications like machine learning and PDE-constrained optimization.
- Recent enhancements such as adaptive memory strategies, regularization, and displacement aggregation improve convergence properties and robustness, particularly in nonconvex and noisy environments.
The Limited-Memory Broyden–Fletcher–Goldfarb–Shanno (L-BFGS) algorithm is a large-scale unconstrained optimization method in the quasi-Newton family, designed to approximate the Newton or BFGS step efficiently in high-dimensional settings. By storing only a limited number of curvature (iterate and gradient displacement) pairs, L-BFGS achieves $O(mn)$ per-iteration computation and memory, with memory size $m \ll n$, rendering it practical for applications such as machine learning, PDE-constrained optimization, and electronic structure computation. Recent research details its algorithmic structure, convergence properties under various regularity conditions, extensions for noisy or nonconvex problems, memory management, and practical enhancements (Mokhtari et al., 2014, Mannel, 2024, Sahu et al., 2023, Berahas et al., 2019, Rafati et al., 2018, Tankaria et al., 2021, Ji, 13 Aug 2025).
1. Algorithmic Structure and Update Mechanism
L-BFGS is a variant of the BFGS quasi-Newton method, which approximates the inverse Hessian of the objective function using only first-order gradient information. At iteration $k$, it maintains the $m$ most recent pairs of iterate and gradient differences,

$$s_i = x_{i+1} - x_i, \qquad y_i = \nabla f(x_{i+1}) - \nabla f(x_i), \qquad i = k-m, \dots, k-1.$$
The full-memory BFGS update for the inverse Hessian approximation is

$$H_{k+1} = \left(I - \rho_k s_k y_k^\top\right) H_k \left(I - \rho_k y_k s_k^\top\right) + \rho_k s_k s_k^\top,$$

where $\rho_k = 1/(y_k^\top s_k)$. In the limited-memory variant, only the $m$ most recent pairs are retained, and the matrix $H_k$ is never explicitly formed.
The search direction $p_k = -H_k \nabla f(x_k)$ is computed efficiently via the two-loop recursion. Given a "seed" matrix $H_k^0 = \gamma_k I$ (with the scaling $\gamma_k$, typically $\gamma_k = s_{k-1}^\top y_{k-1} / y_{k-1}^\top y_{k-1}$, updated using curvature information), this recursion requires $O(mn)$ operations per iteration.
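To make the two-loop recursion concrete, here is a minimal NumPy sketch; the function name `two_loop_direction` and its interface are illustrative rather than taken from any cited implementation.

```python
import numpy as np

def two_loop_direction(grad, s_list, y_list, gamma):
    """Compute the L-BFGS direction -H_k @ grad via the two-loop recursion.

    s_list, y_list: the m most recent curvature pairs (oldest first).
    gamma: scaling of the seed matrix H_k^0 = gamma * I.
    """
    q = grad.copy()
    alphas, rhos = [], []
    # First loop: traverse pairs from newest to oldest.
    for s, y in zip(reversed(s_list), reversed(y_list)):
        rho = 1.0 / y.dot(s)
        alpha = rho * s.dot(q)
        q -= alpha * y
        alphas.append(alpha)
        rhos.append(rho)
    r = gamma * q  # apply the seed matrix H_k^0 = gamma * I
    # Second loop: traverse pairs from oldest to newest.
    for (s, y), alpha, rho in zip(zip(s_list, y_list),
                                  reversed(alphas), reversed(rhos)):
        beta = rho * y.dot(r)
        r += (alpha - beta) * s
    return -r  # search direction p_k = -H_k @ grad

# Illustrative call on random data:
rng = np.random.default_rng(0)
n = 1000
s_list = [rng.standard_normal(n) for _ in range(5)]
y_list = [s + 0.1 * rng.standard_normal(n) for s in s_list]
p = two_loop_direction(rng.standard_normal(n), s_list, y_list, gamma=1.0)
```

In a full solver, `gamma` would typically be recomputed as $s_{k-1}^\top y_{k-1} / y_{k-1}^\top y_{k-1}$ before each call.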
For stochastic or online settings, as in large-scale learning, L-BFGS adapts by using stochastic gradients and corresponding curvature pairs; in this regime, it is sometimes referred to as oL-BFGS (Mokhtari et al., 2014).
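A minimal sketch of how stochastic curvature pairs might be formed is given below, reusing the `two_loop_direction` sketch above; the helper name `minibatch_grad`, the re-evaluation of the gradient on the same mini-batch, and the pair-management details are illustrative assumptions, not the exact scheme of (Mokhtari et al., 2014).

```python
def stochastic_lbfgs_step(x, batch, minibatch_grad, pairs, step_size, m=10):
    """One illustrative stochastic (online) L-BFGS step.

    pairs: list of (s, y) tuples, oldest first (assumed data structure).
    """
    g = minibatch_grad(x, batch)                      # stochastic gradient at x
    if pairs:
        s_last, y_last = pairs[-1]
        gamma = s_last.dot(y_last) / y_last.dot(y_last)
    else:
        gamma = 1.0
    s_list = [s for s, _ in pairs]
    y_list = [y for _, y in pairs]
    p = two_loop_direction(g, s_list, y_list, gamma)  # sketch above
    x_new = x + step_size * p
    # Form the curvature pair from gradients evaluated on the *same* mini-batch,
    # a common device for keeping s and y consistent in the online setting.
    s, y = x_new - x, minibatch_grad(x_new, batch) - g
    pairs.append((s, y))
    if len(pairs) > m:                                # keep only the m newest pairs
        pairs.pop(0)
    return x_new
```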
2. Memory, Computational Complexity, and Variants
The primary advantage of L-BFGS is its per-iteration scalability:
- Memory: $O(mn)$ (store $m$ $n$-dimensional vectors each for the $s_i$ and the $y_i$).
- Computational cost: $O(mn)$ for the two-loop recursion and $O(n)$ for each single-sample gradient evaluation (or $O(bn)$ for batch size $b$).
This is in stark contrast to full-memory BFGS, which requires $O(n^2)$ memory and computation per step.
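For a rough sense of the memory contrast, the back-of-the-envelope calculation below uses illustrative numbers (not taken from the cited works):

```python
n, m = 10**6, 10                 # problem dimension and memory window (illustrative)
bytes_per_float = 8              # double precision

lbfgs_bytes = 2 * m * n * bytes_per_float   # m pairs (s_i, y_i), each of length n
bfgs_bytes = n * n * bytes_per_float        # dense n-by-n inverse Hessian

print(f"L-BFGS curvature pairs: {lbfgs_bytes / 1e9:.2f} GB")   # ~0.16 GB
print(f"Dense BFGS matrix:      {bfgs_bytes / 1e12:.0f} TB")   # ~8 TB
```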
Several enhancements and practical variants exist:
- Adaptive memory strategies, where the memory parameter $m$ is increased during training as the objective becomes more quadratic (Zocco et al., 2020).
- Regularized L-BFGS (RL-BFGS), which modifies the curvature pairs by a regularization parameter to improve robustness and allow for trust-region-like acceptance rules (Tankaria et al., 2021).
- Dense subspace initialization, splitting the space between the subspace spanned by the curvature vectors and its orthogonal complement for better trust-region geometry (Brust et al., 2017).
- Displacement aggregation: detecting and aggregating redundant steps to further restrict memory requirements while preserving superlinear convergence (Sahu et al., 2023, Berahas et al., 2019).
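As a sketch of the detection step behind displacement aggregation, the snippet below checks numerically whether a new displacement already lies in the span of the stored ones; the aggregation formulas that then modify the retained pairs are given in (Sahu et al., 2023, Berahas et al., 2019) and are not reproduced here.

```python
import numpy as np

def displacement_is_redundant(S, s_new, tol=1e-8):
    """Return True if s_new lies (numerically) in the column span of S, the matrix
    whose columns are the stored displacements. Illustrative detection only."""
    if S.size == 0:
        return False
    coeffs, *_ = np.linalg.lstsq(S, s_new, rcond=None)
    residual = np.linalg.norm(s_new - S @ coeffs)
    return residual <= tol * max(np.linalg.norm(s_new), 1.0)
```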
3. Convergence Theory
L-BFGS retains most convergence properties of full-memory BFGS on smooth objectives, with some caveats for nonconvex or nonsmooth functions.
- Strong Convexity and Smoothness: If each (instantaneous, sampled) objective satisfies uniform Hessian bounds $\mu I \preceq \nabla^2 f \preceq L I$ with $0 < \mu \le L < \infty$, then the L-BFGS or oL-BFGS method converges almost surely to the optimum for stochastic objectives under diminishing step sizes ($\sum_t \epsilon_t = \infty$, $\sum_t \epsilon_t^2 < \infty$) and enjoys an $O(1/t)$ expected convergence rate (Mokhtari et al., 2014); a numerical illustration of such a step-size schedule follows this list.
- Nonconvex Objectives: Globalization strategies, such as cautious updating and adaptive selection of which curvature pairs to retain, allow for global convergence to stationary points under weaker assumptions—requiring only continuity of the gradient (rather than Lipschitz) and Armijo or Wolfe-Powell line searches (Mannel, 2024).
- Superlinear Convergence: With displacement aggregation and under standard Dennis–Moré-type regularity, local superlinear convergence is guaranteed if the memory window covers a basis for $\mathbb{R}^n$ (or, in practice, when the local behavior is sufficiently well captured by the $m$ most recent pairs) (Sahu et al., 2023, Berahas et al., 2019).
- Nonsmooth Functions: On specific nonsmooth convex functions, scaled memoryless BFGS (and more generally scaled L-BFGS) can stall at non-optimal points even when the objective is unbounded below. This failure is more frequent for small memory, larger scaling, and Armijo–Wolfe line searches (Asl et al., 2018).
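The diminishing step-size conditions quoted above can be illustrated numerically with a schedule of the form $\epsilon_t = \epsilon_0 T_0 / (T_0 + t)$; the constants below are illustrative, not values from the cited work.

```python
import numpy as np

eps0, T0 = 0.1, 100.0
t = np.arange(1_000_000)
eps = eps0 * T0 / (T0 + t)
# Partial sums of eps grow without bound (harmonic-like), while those of eps^2
# approach a finite limit, matching the non-summable / square-summable conditions.
print(eps.sum())          # keeps growing as more terms are included
print((eps ** 2).sum())   # levels off near a finite value (close to 1 here)
```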
4. Enhancements for Nonconvexity, Robustness, and Practicalities
Recent work has focused on extending L-BFGS to settings in which classical assumptions fail or in which practical performance requires further robustness:
- Globalized L-BFGS: Adaptive, so-called "cautious updating," in which only stored pairs satisfying a curvature condition such as $s_k^\top y_k \ge \epsilon \|s_k\|^2$ are used, guaranteeing globally bounded inverse Hessian approximations and stationarity of cluster points under weak smoothness assumptions (Mannel, 2024); a minimal sketch of such a filter follows this list.
- Displacement Aggregation: By aggregating pairs when one step is in the span of others, the method retains the theoretical update equivalence to full-memory BFGS—crucial for achieving superlinear rates—while capping storage (Sahu et al., 2023, Berahas et al., 2019).
- Operator-Based Improvements: Image and projection operator modifications of L-BFGS allow for accelerated quadratic termination even without exact line search, reducing iteration count at the cost of minor preprocessing or additional local solves (Ji, 13 Aug 2025).
- Structured L-BFGS: When partial second-derivative structure is available, "structured" compact representations yield further efficiency and improved iteration counts in large-scale PDE and imaging applications (2208.00057).
- Adaptive Memory: Memory can be increased during training to reflect the increased trust in curvature information near minimizers, improving solution quality in deep learning (Zocco et al., 2020).
- Learning-Based Enhancements: Neural network-based step-size policies for L-BFGS eliminate repeated line searches, yielding improved or competitive convergence rates in practice (Egidio et al., 2020).
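A minimal sketch of the cautious-update filter referenced above is shown below; the exact condition and constants in (Mannel, 2024) may differ, so this is only indicative.

```python
def cautious_accept(s, y, grad_norm, eps=1e-6, alpha=1.0):
    """Accept a curvature pair (s, y) only if s^T y is sufficiently positive
    relative to ||s||^2, here additionally scaled by a power of the gradient
    norm (a Li-Fukushima-style cautious condition). Illustrative only."""
    return s.dot(y) >= eps * (grad_norm ** alpha) * s.dot(s)
```

Pairs failing the test are simply not stored, leaving the previous inverse Hessian approximation in effect for that iteration.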
5. Applications in Machine Learning and Scientific Computing
L-BFGS and its variants are foundational for a range of large-scale optimization problems:
- Supervised Learning: Empirical risk minimization (e.g., SVMs, logistic regression) with extremely high-dimensional feature vectors. In sparse large-scale classification problems, L-BFGS reaches a given objective accuracy with a fraction of the samples and runtime required by stochastic gradient descent (Mokhtari et al., 2014).
- Neural Network Training: L-BFGS (including online and mini-batch variants) is employed in deep learning, where it achieves faster and more stable convergence compared to first-order methods, especially in the vicinity of optima (Rafati et al., 2019, Rafati et al., 2018, Zocco et al., 2020).
- Reinforcement Learning: L-BFGS, with or without line search, attains robust convergence and test performance in deep RL (e.g., DQN and policy gradient tasks) using significantly fewer environment steps than classic DQN or TRPO (Rafati et al., 2018, Rafati et al., 2019).
- PDE-Constrained Optimization and Imaging Inverse Problems: Structured and trust-region variants of L-BFGS yield iteration reductions in large-scale systems where partial Hessian information is available (2208.00057).
- Electronic Structure and Quantum Chemistry: Low-rank L-BFGS updates and trust-region integration enable efficient, robust mean-field orbital optimization, outperforming DIIS and augmented-Hessian methods in challenging SCF settings (Slattery et al., 2023).
6. Limitations and Advanced Topics
- Nonsmooth Optimization: L-BFGS with conventional scaling is fragile on nonsmooth functions, with theoretical analysis showing that it may stall even when the gradient method succeeds. This highlights the need for careful scaling/safeguards or alternative strategies when applying L-BFGS to nonsmooth settings (Asl et al., 2018).
- Hyperparameter Selection: The memory window $m$, step size, and initial Hessian scaling all impact efficiency and robustness. Practical heuristics suggest a modest memory window for high-dimensional problems and periodic adaptation based on validation loss for deep learning tasks (Zocco et al., 2020, Rafati et al., 2019).
- Numerical Stability: To ensure positive definiteness, updates with $s_k^\top y_k \le 0$ (or below a small threshold) are typically skipped. Dense subspace initializations and regularization further improve numerical properties in trust-region frameworks (Brust et al., 2017, Tankaria et al., 2021).
7. Numerical and Empirical Performance
Extensive empirical studies confirm that L-BFGS delivers significant reductions in wall-clock time, sample complexity, and iteration counts compared to both stochastic/first-order and full-memory second-order methods across a wide variety of domains (Mokhtari et al., 2014, Sahu et al., 2023, Rafati et al., 2019, Slattery et al., 2023).
Illustrative results:
| Application | Reported L-BFGS performance | Reference |
|---|---|---|
| Large-scale SVMs (n=10³, 10⁵) | Reaches a given objective accuracy with a fraction of the samples and runtime of SGD | (Mokhtari et al., 2014) |
| Deep RL (ATARI, DQN) | Robust learning; faster than TRPO; fewer steps than DQN | (Rafati et al., 2018, Rafati et al., 2019) |
| Large-scale imaging (LIBSVM) | Fewer iterations and time than classical L-BFGS | (2208.00057) |
| SCF in quantum chemistry | 0 failures, robust convergence, fewer gradient builds than DIIS or augmented-Hessian | (Slattery et al., 2023) |
The method's empirical performance is enhanced when domain-specific structure, displacement aggregation, adaptive memory management, or learned step-size policies are incorporated.
References: (Mokhtari et al., 2014, Mannel, 2024, Sahu et al., 2023, Berahas et al., 2019, Rafati et al., 2018, Tankaria et al., 2021, Brust et al., 2017, Ji, 13 Aug 2025, 2208.00057, Zocco et al., 2020, Asl et al., 2018, Slattery et al., 2023, Rafati et al., 2019, Egidio et al., 2020)