Parallel Quasi-Newton Methods

Updated 27 August 2025
  • Parallel quasi-Newton methods are a class of algorithms that extend traditional secant updates using block and multisecant formulations to capture richer curvature information.
  • They employ action-constrained, asynchronous, and distributed strategies to compute Hessian approximations and preconditioners efficiently on modern parallel hardware.
  • Key benefits include reduced CPU time and improved scalability, making these methods effective in applications like machine learning, variational inference, and scientific computing.

Parallel quasi-Newton methods are a class of algorithms for smooth unconstrained optimization and large-scale nonlinear systems that exploit parallel architectures to accelerate the computation and/or application of quasi-Newton matrix updates or preconditioners. These methods generalize the sequential secant equation updates of traditional quasi-Newton schemes to multivector (block, multisecant, or action-constrained) formulations, and reengineer storage, update, and application steps to enable efficient parallelization. The class includes both deterministic and stochastic variants, block-memory and limited-memory structures, and methods for both single-objective and multiobjective problems, as well as line-search, trust-region, and asynchronous or distributed computing environments.

1. Block, Multisecant, and Subspace Quasi-Newton Updates

Traditional quasi-Newton methods such as BFGS, DFP, or SR1 enforce a secant equation along a single direction per iteration, yielding a rank-one or rank-two update to the Hessian or its inverse. Parallel quasi-Newton methods generalize this by enforcing the secant/interpolation condition on a subspace or set of directions, often captured by blocks of iterate and gradient differences. The update seeks a new approximate Hessian $B_{k+1}$ or inverse $H_{k+1}$ such that:

  • For Hessian approximations: $B_{k+1} S_k = Y_k$
  • For inverse approximations: $H_{k+1} Y_k = S_k$

where $S_k$ and $Y_k$ are $n \times q$ matrices containing multiple step and gradient difference vectors from recent iterations. This block multisecant approach populates the curvature model with richer information per update, improving the quality of search directions and preconditioners, and is inherently suitable for parallel computation (e.g., forming $S_k$ and $Y_k$ via batch or multi-threaded gradient computations) (Lee et al., 9 Apr 2025, Gower et al., 2014, Gao et al., 2016).
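As a concrete illustration of how the blocks $S_k$ and $Y_k$ can be assembled with parallel gradient evaluations, the following sketch uses NumPy and a thread pool on a simple quadratic test function (all names and the test problem are illustrative, not taken from the cited papers):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

# Illustrative objective: a convex quadratic f(x) = 0.5 x^T A x - b^T x,
# chosen only because its gradients are cheap to evaluate in a sketch.
rng = np.random.default_rng(0)
n, q = 50, 5                        # problem dimension and block size
A = rng.standard_normal((n, n))
A = A @ A.T + n * np.eye(n)         # symmetric positive definite Hessian
b = rng.standard_normal(n)

def grad(x):
    return A @ x - b

# A short history of q + 1 iterates (random points standing in for the
# optimizer's actual trajectory).
X = [rng.standard_normal(n) for _ in range(q + 1)]

# Gradients at all history points can be evaluated concurrently; this is
# where batch or multi-threaded gradient computation pays off.
with ThreadPoolExecutor() as pool:
    G = list(pool.map(grad, X))

# Step-difference and gradient-difference blocks, each of size n x q.
S = np.column_stack([X[i + 1] - X[i] for i in range(q)])
Y = np.column_stack([G[i + 1] - G[i] for i in range(q)])

# For this quadratic, Y = A S exactly, so any B_{k+1} satisfying the block
# secant condition B_{k+1} S = Y reproduces the Hessian's action on span(S).
print(np.linalg.norm(Y - A @ S))    # ~1e-12
```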

Block update formulas depend on solving a constrained minimization—typically a least-change or minimal-norm adjustment—subject to the subspace secant condition and symmetry:

$$B_{k+1} = B_k + \underbrace{\left( Y_k - B_k S_k \right) M_k^{-1} \left( Y_k - B_k S_k \right)^{\top}}_{\text{low-rank update}},$$

with $M_k$ a carefully chosen block matrix (e.g., $M_k = S_k^\top B_k S_k$). When symmetry or positive definiteness is not preserved automatically, symmetrization and additional perturbations (e.g., diagonal shifts) are applied to restore them. The full-matrix or operator form of these updates is highly amenable to parallel BLAS-3 operations (Lee et al., 9 Apr 2025, Gao et al., 2016).
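A minimal NumPy sketch of the block update itself is given below. It uses the SR1-type choice $M_k = (Y_k - B_k S_k)^\top S_k$ so that the block secant condition holds exactly on the test problem; this is only one of several choices of $M_k$ discussed in the cited works, and the function name and test setup are hypothetical:

```python
import numpy as np

def block_secant_update(B, S, Y, symmetrize=True):
    """One multisecant update B_new = B + R M^{-1} R^T with R = Y - B S.

    Here M = R^T S (an SR1-type choice); alternatives such as M = S^T B S
    trade exactness of the secant condition against positive definiteness.
    """
    R = Y - B @ S                            # block residual, n x q
    M = R.T @ S                              # q x q
    B_new = B + R @ np.linalg.solve(M, R.T)  # low-rank (rank <= q) correction
    if symmetrize:
        B_new = 0.5 * (B_new + B_new.T)      # restore symmetry if needed
    return B_new

# Check on a quadratic with Hessian A, where Y = A S and the block secant
# condition B_new S = Y should hold to roughly machine precision.
rng = np.random.default_rng(1)
n, q = 30, 4
A = rng.standard_normal((n, n)); A = A @ A.T + n * np.eye(n)
S = rng.standard_normal((n, q))
Y = A @ S
B1 = block_secant_update(np.eye(n), S, Y)
print(np.linalg.norm(B1 @ S - Y))            # small; symmetrization is a no-op here
```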

2. Action-Constrained and Automatic Preconditioning Schemes

For large-scale problems solved by Newton-Krylov or Newton-CG methods, the traditional scalar secant update is replaced by an action constraint for iteratively updating a preconditioner. The action-constrained quasi-Newton framework seeks an update such that the preconditioner exactly replicates the action of the system matrix $Q_{k+1}$ on a low-dimensional Krylov or sampling subspace $\mathcal{S}_k$ gathered from the previous Newton-CG solve:

  • For direct approximations: $G_{k+1} \mathcal{S}_k = Q_{k+1} \mathcal{S}_k$
  • For inverse approximations: $H_{k+1} (Q_{k+1} \mathcal{S}_k) = \mathcal{S}_k$

The resulting least-change/symmetrized update possesses a closed-form low-rank structure:

$$G_{k+1} = Q_{k+1} + (I - W_k P_{\mathcal{S}_k})(G_k - Q_{k+1})(I - P_{\mathcal{S}_k} W_k)$$

where $P_{\mathcal{S}_k}$ is a projection onto the sampling subspace under the weight $W_k$ (e.g., $W_k = Q_{k+1}$ or $W_k = I$), and these block operations are efficiently parallelizable. This approach supports both full-memory and limited-memory variants, and parallelizes as a set of independent BLAS-type operations within each update (Gower et al., 2014).
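A small numerical check of this update is sketched below for the unweighted case $W_k = I$, in which $P_{\mathcal{S}_k}$ becomes the orthogonal projector onto the columns of a sampling matrix $S$; the setup and names are illustrative rather than taken from the cited implementation:

```python
import numpy as np

def action_constrained_update(G, Q, S):
    """Least-change-style update with weight W = I; the returned matrix
    satisfies G_new @ S == Q @ S (the action constraint) by construction."""
    # Orthogonal projector onto range(S); a weighted variant would build the
    # projector in the W-inner product instead.
    P = S @ np.linalg.solve(S.T @ S, S.T)
    I = np.eye(G.shape[0])
    return Q + (I - P) @ (G - Q) @ (I - P)

rng = np.random.default_rng(2)
n, q = 40, 6
Q = rng.standard_normal((n, n)); Q = Q @ Q.T + n * np.eye(n)  # current system matrix
G = np.eye(n)                                                 # previous preconditioner
S = rng.standard_normal((n, q))                               # sampled Krylov directions

G_new = action_constrained_update(G, Q, S)
print(np.linalg.norm(G_new @ S - Q @ S))   # ~1e-12: action on span(S) is reproduced
```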

3. Parallelization Strategies: Block Linear Algebra, Asynchrony, and Distribution

Parallel quasi-Newton methods exploit multiple levels of parallelism:

  • Block Linear Algebra: Most update, two-loop recursion, and application steps are implemented as dense matrix-matrix or batched matrix-vector products, benefiting from multithreaded BLAS or distributed-memory linear algebra libraries (e.g., LAPACK, ScaLAPACK, or vendor-optimized GPU kernels). Block secant and action-constrained updates, in particular, require simultaneous operations over all columns of the subspace matrix, maximizing core utilization (Gower et al., 2014, Gao et al., 2016).
  • Asynchronous and Distributed Execution: Recent developments allow asynchrony in stochastic quasi-Newton methods: processors update local or shared parameters with potentially stale information, using lock-free or weakly synchronized protocols to maximize resource usage and minimize waiting. In AsySQN, for instance, each worker maintains local L-BFGS correction pairs and client-server variance-reduced gradients, yielding true asynchronous parallelism with global linear convergence rates (Tong et al., 2020).
  • Distributed Aggregation: For master–worker architectures, each worker computes Hessian-vector products or quasi-Newton updates using local data and transmits compact summaries (e.g., correction vectors or gBroyd steps) to the master. Only $O((\tau+1)n)$ words are communicated per worker per iteration, comparable to the cost of communicating a gradient, while still aggregating enough curvature information to maintain global and local superlinear convergence rates (Du et al., 2023); a schematic sketch follows this list.
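To make the master–worker pattern concrete, the following schematic sketch uses a thread pool as a stand-in for distributed workers on a synthetic least-squares problem; each worker returns only an $O(n)$-sized gradient message, which the master aggregates and turns into curvature pairs (everything here, including the step size, is illustrative and not the algorithm of any single cited paper):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

# Synthetic least-squares problem 0.5 * ||A x - b||^2 split across data shards.
rng = np.random.default_rng(3)
n, m, n_workers = 20, 400, 4
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
shards = np.array_split(np.arange(m), n_workers)

def local_grad(x, idx):
    """Gradient over one worker's shard; the returned length-n vector is the
    only message that needs to travel to the master."""
    Ai, bi = A[idx], b[idx]
    return Ai.T @ (Ai @ x - bi)

def aggregated_grad(x, pool):
    # Master sums the per-worker messages (a reduction of O(n) words each).
    return sum(pool.map(lambda idx: local_grad(x, idx), shards))

x = np.zeros(n)
pairs = []                                    # curvature pairs (s, y) for a QN model
with ThreadPoolExecutor(max_workers=n_workers) as pool:
    g = aggregated_grad(x, pool)
    for _ in range(50):
        x_new = x - 1e-3 * g                  # placeholder first-order step
        g_new = aggregated_grad(x_new, pool)
        pairs.append((x_new - x, g_new - g))  # built from aggregated information only
        x, g = x_new, g_new
print(np.linalg.norm(g))                      # gradient norm after 50 iterations
```

A real method would feed the collected pairs into an L-BFGS-style recursion (see the next section) instead of taking plain gradient steps.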

4. Limited Memory, Robustness, and Stability

Limited-memory extensions (L-BFGS, block L-BFGS, LquNac) are crucial for large-scale problems. They retain only the most recent $L$ vector pairs (with mini-batch or block structure) and use two-loop recursions for efficient application to gradients. The developments for parallel and stochastic settings modify the recursion to group and vectorize block operations, further increasing parallelizability while reducing memory and bandwidth requirements (Gower et al., 2014, Gao et al., 2016).
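For reference, a minimal non-blocked form of the two-loop recursion is sketched below; the blocked variants cited above restructure these inner products into matrix–matrix operations, but the recursion itself is unchanged (names are illustrative, and at least one stored pair is assumed):

```python
import numpy as np

def lbfgs_two_loop(g, s_list, y_list):
    """Apply the implicit inverse-Hessian approximation to the gradient g
    using the stored (s, y) correction pairs (standard two-loop recursion)."""
    q = g.copy()
    alphas, rhos = [], []
    for s, y in zip(reversed(s_list), reversed(y_list)):     # newest to oldest
        rho = 1.0 / (y @ s)
        alpha = rho * (s @ q)
        q -= alpha * y
        alphas.append(alpha)
        rhos.append(rho)
    s, y = s_list[-1], y_list[-1]
    q *= (s @ y) / (y @ y)                                   # initial scaling H_0 = gamma * I
    for (s, y), alpha, rho in zip(zip(s_list, y_list),       # oldest to newest
                                  reversed(alphas), reversed(rhos)):
        beta = rho * (y @ q)
        q += (alpha - beta) * s
    return q   # approximates H_k @ g; the search direction is -q
```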

Robustness to ill-conditioning, non-convexity, and lack of automatic positive definiteness is achieved by:

  • Diagonal or spectral perturbations to ensure the Hessian model or its inverse is positive semidefinite (PSD), as sketched after this list (Lee et al., 9 Apr 2025).
  • Strict vector rejection or filtering strategies during multisecant/block updates to maintain well-conditioned matrices (Lee et al., 9 Apr 2025).
  • Defensive measures (e.g., restarts or “recompute-exact-Hessian” triggers) in cases where the quasi-Newton model degrades or fails to reduce the objective sufficiently (Köhler et al., 15 Aug 2025, Barnafi et al., 2022).
  • Trust-region frameworks with efficient $LDL^\top$ factorization updates, ensuring stability even with indefinite Hessian models (Brust et al., 2023).
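As a minimal illustration of the diagonal-shift safeguard in the first bullet above (a hypothetical sketch; the cited work uses more refined spectral estimates than a full eigendecomposition), an indefinite model can be shifted just enough to become positive semidefinite:

```python
import numpy as np

def psd_shift(B, margin=1e-8):
    """Return B + tau * I with the smallest shift tau >= 0 (up to `margin`)
    that makes the symmetric model matrix positive semidefinite."""
    B = 0.5 * (B + B.T)                        # symmetrize first
    lam_min = np.linalg.eigvalsh(B)[0]         # smallest eigenvalue
    tau = max(0.0, margin - lam_min)
    return B + tau * np.eye(B.shape[0])

# Example: an indefinite quasi-Newton model is shifted before being used
# to generate a search direction.
rng = np.random.default_rng(4)
B = rng.standard_normal((10, 10)); B = 0.5 * (B + B.T)
print(np.linalg.eigvalsh(psd_shift(B))[0] >= 0)   # True
```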

5. Empirical Performance and Applications

Parallel quasi-Newton methods have been systematically benchmarked against first-order, Newton, and non-parallel quasi-Newton alternatives:

  • SVM and Large-Scale Logistic Regression: Action-constrained and parallel block quasi-Newton methods outperform Newton-CG without preconditioning and L-BFGS in wall clock time and variability metrics, often converging in a fraction of the time on moderate-sized datasets (Gower et al., 2014).
  • Classic Benchmark Problems: In nonlinear elasticity, cardiac mechanics, and ill-conditioned quadratic programs, parallel quasi-Newton schemes achieve over 50% reductions in CPU time compared to standard Newton-Krylov methods, especially when the cost of repeated Jacobian/Hessian assembly dominates and parallel reduction in communication is realized (Barnafi et al., 2022).
  • Bayesian Inference and Variational Methods: Pathfinder variational inference achieves one to two orders of magnitude fewer log-density and gradient evaluations compared to ADVI and dynamic HMC, with further efficiency and robustness gains obtained by repurposing the parallelization to independently explore multiple regions of the posterior, selecting the best via importance resampling (Zhang et al., 2021).
  • Distributed and Stochastic Optimization: Distributed adaptive greedy quasi-Newton schemes and asynchronous L-BFGS reach global linear and local superlinear convergence with only per-iteration communication similar to that of first-order methods, and are thus well suited for massive-scale machine learning (Du et al., 2023, Tong et al., 2020).

Performance gains are typically realized when the problem is large enough to amortize the overhead of block and parallel operations, and when curvature structure can be exploited by block or stochastic updates computed across concurrent threads or processes.

6. Theoretical Guarantees: Convergence, Termination, and Flexibility

A range of theoretical results has been established for parallel quasi-Newton methods:

  • Superlinear and Quadratic Convergence: When applied to quadratic or nearly quadratic problems with exact line search, limited-memory block or multisecant quasi-Newton methods generate search directions parallel to those of the conjugate gradient method and hence inherit finite termination in at most $n$ steps; on general convex problems they retain local superlinear convergence as the diagonal perturbation vanishes (see the numerical sketch after this list) (Ek et al., 2018, Forsgren et al., 2015, Lee et al., 9 Apr 2025).
  • Globalization: Action-constrained and trust-region variants guarantee descent directions (or safeguard indefiniteness), enabling global convergence under standard assumptions and robustness to inexactness, nonlinearity, or saddle-points (Gower et al., 2014, Brust et al., 2023).
  • Flexibility and Integration: The subspace- or block-based structure naturally accommodates trust-region and active-set frameworks, inexact or inner-iteration adaptations, and can incorporate manifold or compositional objective features (as in multiobjective problems or distributed SQP/FETI-DP approaches) (Köhler et al., 15 Aug 2025, Peng et al., 2023).
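The finite-termination claim can be checked numerically with a full-memory BFGS iteration and exact line search on a strongly convex quadratic, used here as a representative member of the class (illustrative sketch; the cited methods are limited-memory and block variants):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 8
A = rng.standard_normal((n, n)); A = A @ A.T + n * np.eye(n)  # SPD Hessian
b = rng.standard_normal(n)
grad = lambda x: A @ x - b

x = rng.standard_normal(n)
H = np.eye(n)                          # inverse-Hessian approximation
g = grad(x)
for k in range(n):
    p = -H @ g                         # quasi-Newton direction
    alpha = -(g @ p) / (p @ A @ p)     # exact line search for a quadratic
    s = alpha * p
    x = x + s
    g_new = grad(x)
    y = g_new - g
    g = g_new
    if np.linalg.norm(g) < 1e-10:      # may stop before n iterations
        break
    rho = 1.0 / (y @ s)
    V = np.eye(n) - rho * np.outer(s, y)
    H = V @ H @ V.T + rho * np.outer(s, s)   # BFGS inverse-Hessian update
print(k + 1, np.linalg.norm(g))        # gradient is numerically zero within n steps
```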

Overall, parallel quasi-Newton methods unify the strengths of quasi-Newton curvature modeling with scalable, blockwise, and often distributed updating and application, making them attractive for a broad class of modern large-scale scientific and data-driven optimization problems.


Table: Representative Parallel Quasi-Newton Methods and Variants

| Algorithm/Class | Key Parallel Feature | Primary Application Context |
|---|---|---|
| Action-constrained quNac (Gower et al., 2014) | Block update, BLAS-3, sampling subspace | Newton-Krylov preconditioning, SVM |
| Asynchronous Stochastic L-BFGS (AsySQN) (Tong et al., 2020) | Lock-free thread parallelism, variance reduction | Distributed stochastic optimization |
| Pathfinder (Zhang et al., 2021) | Embarrassingly parallel path evaluation | Variational Bayesian inference |
| Block BFGS (Gao et al., 2016) | Block secant, parallel linear algebra | Quadratic/nonconvex minimization, regression |
| Distributed adaptive GQN (Du et al., 2023) | Master–worker, compressed communication | Large-scale convex minimization |
| Multisecant/PSD-perturbed QN (Lee et al., 9 Apr 2025) | Block/multisecant update, diagonal PSD shift | Ill-conditioned and large-scale convex problems |
| FETI-DP SQP with QN Hessian (Köhler et al., 15 Aug 2025) | Distributed Hessian update/Schur elimination | Parallel nonlinear finite element analysis |

These methods each embody distinct design principles for harnessing parallelism within the broad quasi-Newton framework, with the underlying curvature modeling enriched or stabilized by blockwise, action-constrained, asynchronous, or distributed computational primitives.