Barzilai-Borwein Diagonal Quasi-Newton

Updated 28 December 2025
  • The BB-DQN method is a diagonal-type quasi-Newton approach that uses iterative secant-like updates to efficiently approximate coordinate-wise curvature.
  • It reduces computational cost and storage, accelerating convergence in large-scale, multiobjective, and deep learning optimization problems.
  • Empirical evaluations report up to 50× speedups and lower failure rates than full-matrix quasi-Newton methods, demonstrating its practical effectiveness.

The Barzilai-Borwein Diagonal-Type Quasi-Newton method (BB-DQN) is a family of algorithms that use diagonal approximations inspired by the Barzilai-Borwein (BB) step-size rules to accelerate large-scale unconstrained optimization, convex composite problems, nonconvex multiobjective programs, and deep network training. The BB-DQN approach leverages iterative secant-like updates to construct diagonal or scalar matrix surrogates for curvature, maintaining low per-iteration cost, robust convergence properties, and empirically strong performance on challenging optimization tasks.

1. Mathematical Foundations

BB-DQN methods address optimization problems of the form $\min_{x \in \mathbb{R}^n} f(x)$, or, in multiobjective form,

$$\min_{x \in \mathbb{R}^n} f(x) = (f_1(x), f_2(x), \ldots, f_m(x))^\top,$$

with each $f_i: \mathbb{R}^n \to \mathbb{R}$ continuously differentiable and possibly nonconvex (Liu, 20 Dec 2025). The optimality criterion in the multiobjective setting is weak Pareto-optimality: $x^*$ is Pareto-critical if for every direction $d \in \mathbb{R}^n$ there is an index $i$ such that $\nabla f_i(x^*)^T d \ge 0$.

The BB scalar step-size formulas are

$$\alpha_k^{\text{BB1}} = \frac{s_k^T s_k}{s_k^T y_k}, \qquad \alpha_k^{\text{BB2}} = \frac{s_k^T y_k}{y_k^T y_k},$$

with $s_k = x_{k+1} - x_k$ and $y_k = \nabla f(x_{k+1}) - \nabla f(x_k)$ (Park et al., 2019, Robles-Kelly et al., 2022). Diagonal extensions construct $U^k = \operatorname{diag}(u^k_1, \ldots, u^k_n)$ from component-wise secant equations, preserving $O(n)$ complexity while reflecting coordinate-wise curvature.
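
To make the step-size rules concrete, the following minimal sketch (assuming NumPy and a toy quadratic objective chosen only for illustration; it is not taken from any of the cited implementations) computes BB1 and BB2 from two consecutive iterates and gradients:

```python
import numpy as np

def bb_step_sizes(x_prev, x_curr, g_prev, g_curr, eps=1e-12):
    """Compute the two Barzilai-Borwein scalar step sizes.

    s_k = x_{k+1} - x_k,  y_k = grad f(x_{k+1}) - grad f(x_k);
    BB1 = (s_k^T s_k) / (s_k^T y_k),  BB2 = (s_k^T y_k) / (y_k^T y_k).
    """
    s = x_curr - x_prev
    y = g_curr - g_prev
    sy = float(s @ y)
    if sy <= eps:           # negative curvature or degenerate secant pair:
        return None, None   # caller should fall back to a safeguarded step
    bb1 = float(s @ s) / sy
    bb2 = sy / float(y @ y)
    return bb1, bb2

# Toy usage on f(x) = 0.5 * x^T D x with D = diag(1, 10):
D = np.diag([1.0, 10.0])
grad = lambda x: D @ x
x0, x1 = np.array([1.0, 1.0]), np.array([0.9, 0.5])
print(bb_step_sizes(x0, x1, grad(x0), grad(x1)))
```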

2. Algorithmic Structure and Updates

At each iteration, BB-DQN algorithms execute the following core steps:

  1. Shared Diagonal Update: For multiobjective problems, a single diagonal matrix $B_k = \alpha_k^{-1} I$ is maintained for all objectives (Liu, 20 Dec 2025). In convex composite (proximal-gradient) or deep learning settings, each coordinate $u^k_i$ is estimated via

$$u^k_i = \Pi_{[\beta_1, \beta_2]}\left( \frac{s^k_i y^k_i + \mu u^{k-1}_i}{(s^k_i)^2 + \mu} \right),$$

with projection onto scalar bounds derived from the BB formulas (Park et al., 2019). For multiobjective settings, descent directions are computed from a weighted sum of gradients, $g_k = \sum_{i=1}^m \lambda_i^k \nabla f_i(x_k)$, where $\lambda^k$ solves a dual minimization over the simplex.

  2. Step-Size Selection and Safeguarding: Both scalar and diagonal step-sizes are protected by safeguards to ensure positive definiteness and to mitigate erratic variation, particularly under ill-conditioning. Safeguards include

$$\omega_k = \min\{ c_0, c_1 \|g_k\|^{c_2} \},$$

with $c_0, c_1, c_2 > 0$ and projection of $\alpha_k$ or $u^k$ into $[\omega_k, \omega_k^{-1}]$ (Liu, 20 Dec 2025, Park et al., 2019).

  3. Modified Wolfe Line Search: Step lengths $t_k$ are determined to satisfy sufficient decrease and curvature criteria on all objectives:

$$f_i(x_k + t_k d_k) \le f_i(x_k) + \sigma_1 t_k \max_j \nabla f_j(x_k)^T d_k,$$

$$\max_j \nabla f_j(x_k + t_k d_k)^T d_k \ge \sigma_2 \max_j \nabla f_j(x_k)^T d_k,$$

with $0 < \sigma_1 < \sigma_2 < 1$ (Liu, 20 Dec 2025).

  4. Proximal Subproblem: For composite convex problems,

$$x^{k+1} = \operatorname{prox}_{g, U^k}\!\left(x^k - (U^k)^{-1} \nabla f(x^k)\right),$$

enabling efficient, closed-form updates for most regularizers $g$ (Park et al., 2019); a minimal code sketch of this composite-case iteration is given after this list.
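
The following sketch ties steps 1, 2, and 4 together for a composite problem $\min_x f(x) + \lambda \|x\|_1$. It is an illustrative implementation under assumed parameter choices ($\mu$, $\beta_1$, $\beta_2$, $c_0$, $c_1$, $c_2$ and the $\ell_1$ soft-threshold prox are placeholders chosen for the sketch), not the exact code of the cited papers:

```python
import numpy as np

def soft_threshold(z, tau):
    """Proximal operator of tau * ||.||_1 (closed form for the l1 regularizer)."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def diag_bb_prox_step(x, x_prev, grad, g_prev, u_prev, lam,
                      mu=1e-4, beta=(1e-6, 1e6), c=(1e-3, 1e-3, 2.0)):
    """One diagonal-BB proximal-gradient iteration (illustrative sketch).

    grad   : callable returning the gradient of the smooth part f.
    u_prev : previous diagonal surrogate (length-n vector).
    Returns the new iterate, the new diagonal, and the current gradient.
    """
    g = grad(x)
    s, y = x - x_prev, g - g_prev

    # Step 1: regularized component-wise secant update, projected onto [beta1, beta2].
    u = (s * y + mu * u_prev) / (s * s + mu)
    u = np.clip(u, beta[0], beta[1])

    # Step 2: safeguard omega_k = min{c0, c1 * ||g||^c2}, then project u into
    # [omega, 1/omega] (floored to avoid division by zero at a stationary point).
    omega = max(min(c[0], c[1] * np.linalg.norm(g) ** c[2]), 1e-12)
    u = np.clip(u, omega, 1.0 / omega)

    # Step 4: variable-metric proximal step. With a diagonal metric U = diag(u), the
    # prox of lam * ||.||_1 separates coordinate-wise with threshold lam / u_i.
    x_new = soft_threshold(x - g / u, lam / u)
    return x_new, u, g
```

A driver loop would initialize $u^0 = \mathbf{1}$, take one plain proximal-gradient step to obtain the first $(s, y)$ pair, and then repeat `diag_bb_prox_step` until the gradient-mapping norm falls below a tolerance.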

3. Theoretical Properties

BB-DQN frameworks admit the following analytical guarantees:

  • Global Convergence:

For multiobjective and composite optimization, if level sets are bounded, gradients are Lipschitz continuous, and the diagonal surrogate matrices are uniformly bounded ($aI \preceq B_k \preceq bI$), the sequence $\{x_k\}$ is well-defined, any accumulation point is Pareto-critical (in the multiobjective case), and $\|d_k\| \to 0$ (Liu, 20 Dec 2025).

  • R-linear Convergence (Strong Convexity):

If all objectives $f_i$ are twice continuously differentiable and satisfy $U I \preceq \nabla^2 f_i(x) \preceq L I$ on the level set, then

$$\|x_{k+1} - x^*\| \le \rho \|x_k - x^*\|,$$

with $\rho = \sqrt{1 - \omega U^2/(2L)} \in (0,1)$ (Liu, 20 Dec 2025); see the worked bound after this list.

  • For Convex Composite Problems:

If $f$ is convex and $L$-smooth and $g$ is convex, then global convergence is guaranteed; the gradient-mapping norm converges sublinearly ($O(1/k)$) or linearly under strong convexity (Park et al., 2019).
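
To spell out what the R-linear bound above delivers (a standard unrolling, not an additional result from the cited papers), applying the contraction $k$ times gives a geometric decay and an explicit iteration count for a target accuracy $\varepsilon$:

$$\|x_k - x^*\| \le \rho^{k} \|x_0 - x^*\|, \qquad \text{so } \|x_k - x^*\| \le \varepsilon \ \text{ once } \ k \ge \frac{\log\left(\|x_0 - x^*\|/\varepsilon\right)}{\log(1/\rho)}.$$

Since $\rho = \sqrt{1 - \omega U^2/(2L)}$, a better-conditioned problem (larger $U^2/L$) or a larger safeguard value $\omega$ yields a smaller $\rho$ and therefore fewer iterations.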

4. Practical Applications and Scalability

BB-DQN methods have been deployed in several contexts:

  • Large-scale Nonconvex Multiobjective Problems:

A shared diagonal surrogate dramatically reduces storage from $O(m n^2)$ to $O(n)$ and per-iteration arithmetic from $O(m n^2)$ to $O(m n)$. Comparative experiments show BB-DQN achieves 5–50× speedups over the full quasi-Newton M-BFGSMO on problems with $n = 500$ (3.4 ms vs. 54 ms) and more consistent convergence (lower failure rates) (Liu, 20 Dec 2025).

  • Ill-conditioned Machine Learning Tasks:

In composite problems (quadratic programs, least-squares, logistic regression, and sparse regularized settings), diagonal BB-based metrics reduce iterations by 15–30% relative to scalar BB proximal gradient (PG), and outperform FISTA in certain regimes. For instance, on a QP with condition number $10^4$, the method required 16 iterations versus 22 for PG(BB) (Park et al., 2019).

  • Deep Learning Optimization:

In deep network training, the BB-DQN-style adaptive step-size (a scalar BB update plugged into Adagrad/RMSprop) yields smoother and sometimes faster training-error descent than Adam, Adadelta, and fixed decay schedules; a sketch of this usage follows the table below. On CIFAR-10, BB-Adagrad reduced test error to 20.83% versus 21.85% (Adam), 21.17% (Adadelta), and 24.22% (baseline); ImageNet and MNIST results are competitive or superior in training epochs and generalization (Robles-Kelly et al., 2022).

| Setting | Storage Cost | Per-iteration Cost | Empirical Speed-up |
|---|---|---|---|
| Multiobjective | $O(n+m)$ vs $O(m n^2)$ | $O(m n)$ vs $O(m n^2)$ | 5–50× (for large $n$) |
| Convex PG | $O(n)$ | $O(n)$ | 15–30% fewer iterations |
| Deep Networks | $O(n)$ added | $O(1)$ per layer | Smoother, fewer epochs |
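
The deep-learning usage referenced above can be illustrated with the following toy sketch, which scales a diagonal Adagrad-style update by a safeguarded scalar BB step computed from successive parameter and gradient differences. The class and parameter names, the clipping interval, and the toy objective are assumptions made for illustration, not the exact scheme of Robles-Kelly et al. (2022):

```python
import numpy as np

class BBAdagradSketch:
    """Toy Adagrad-like optimizer whose global scale is a safeguarded scalar BB step."""

    def __init__(self, lr0=0.01, eps=1e-8, bb_bounds=(1e-4, 1.0)):
        self.lr = lr0               # current scalar step size
        self.eps = eps
        self.bb_bounds = bb_bounds  # safeguard interval for the BB step
        self.accum = None           # Adagrad accumulator of squared gradients
        self.prev_w = None
        self.prev_g = None

    def step(self, w, g):
        """Return updated parameters given current parameters w and gradient g."""
        if self.accum is None:
            self.accum = np.zeros_like(w)
        self.accum += g * g

        # Scalar BB1 step from the previous (parameter, gradient) pair, clipped for safety.
        if self.prev_w is not None:
            s, y = w - self.prev_w, g - self.prev_g
            sy = float(s @ y)
            if sy > 1e-12:
                self.lr = float(np.clip(float(s @ s) / sy, *self.bb_bounds))

        self.prev_w, self.prev_g = w.copy(), g.copy()
        return w - self.lr * g / (np.sqrt(self.accum) + self.eps)

# Toy usage: minimize the poorly scaled quadratic f(w) = 0.5 * w^T diag(1, 100) w.
D = np.array([1.0, 100.0])
w = np.array([1.0, 1.0])
opt = BBAdagradSketch()
for _ in range(200):
    w = opt.step(w, D * w)   # gradient of f is diag(1, 100) @ w
print(w)                     # the unique minimizer of f is at the origin
```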

5. Connections with Classic Quasi-Newton and Adaptive Methods

The BB-DQN paradigm generalizes classic quasi-Newton and adaptive gradient methods to high dimensions with minimal computational overhead and improved metric adaptation:

  • Quasi-Newton Context:

Standard quasi-Newton methods maintain $n \times n$ Hessian approximations (e.g., BFGS); scalar BB rules use a single global curvature estimate. BB-DQN diagonalizes the update, addressing coordinate-wise curvature without incurring $O(n^2)$ cost (Park et al., 2019, Liu, 20 Dec 2025).

  • Proximal and Adaptive Methods:

Diagonal BB surrogates can be injected into variable metric proximal gradient iterations. Furthermore, scalar BB-DQN integration into adaptive optimizers (Adagrad, RMSprop) for deep learning exploits the secant approximation to produce layer-wise or coordinate-wise learning rate updates (Robles-Kelly et al., 2022).

  • Safeguarding:

All variants bound their diagonal surrogates to ensure nondegeneracy and preserve stability under ill-conditioning, using closed-form projection and regularization strategies (Park et al., 2019, Liu, 20 Dec 2025).

6. Empirical Evaluation and Performance

Comprehensive numerical studies on synthetic benchmarks, machine learning datasets (MNIST, CIFAR, ImageNet), and large multiobjective test instances confirm the principal advantages of BB-DQN:

  • Iteration Count and Runtime:

Diagonal BB-DQN methods consistently reduce iteration counts by up to 27% (for heavily ill-conditioned QPs), with CPU times scaling linearly in $n$ (Park et al., 2019).

  • Reliability and Robustness:

Failure rates (non-convergence over random initializations) are lower or equal relative to competing full-matrix quasi-Newton methods, notably in multiobjective test suites (Liu, 20 Dec 2025).

  • Generalization in Deep Learning:

When embedded into Adagrad, BB step-sizes yield training error rates competitive with Adam/Adadelta and superior to heuristic decay schedules on deep network tasks, with more stable error trajectories (Robles-Kelly et al., 2022).

7. Limitations and Open Directions

The BB-DQN framework does not maintain full curvature information, which in some cases may sacrifice the fast local (superlinear) convergence of full quasi-Newton schemes, especially where off-diagonal Hessian entries are significant. Additionally, theoretical guarantees outside convex or strongly convex regimes rely on safeguarding and line-search conditions.

A plausible implication is that integrating block-diagonal or low-rank modifications may further bridge the gap between classic quasi-Newton performance and BB-DQN scalability. Empirical findings suggest continuing investigation of BB-DQN-based step-size adaptation in deep learning optimizers and highly nonconvex settings.
