Barzilai-Borwein Diagonal Quasi-Newton

Updated 28 December 2025
  • The BB-DQN method is a diagonal-type quasi-Newton approach that uses iterative secant-like updates to efficiently approximate coordinate-wise curvature.
  • It reduces computational cost and storage, accelerating convergence in large-scale, multiobjective, and deep learning optimization problems.
  • Empirical evaluations report up to 50× speedups and lower failure rates than full-matrix quasi-Newton methods, demonstrating its practical effectiveness.

The Barzilai-Borwein Diagonal-Type Quasi-Newton method (BB-DQN) is a family of algorithms that use diagonal approximations inspired by the Barzilai-Borwein (BB) step-size rules to accelerate large-scale unconstrained optimization, convex composite problems, nonconvex multiobjective programs, and deep network training. The BB-DQN approach leverages iterative secant-like updates to construct diagonal or scalar matrix surrogates for curvature, maintaining low per-iteration cost, robust convergence properties, and empirically strong performance on challenging optimization tasks.

1. Mathematical Foundations

BB-DQN methods address optimization problems of the form $\min_{x \in \mathbb{R}^n} f(x)$, or, in multiobjective form,

$$\min_{x \in \mathbb{R}^n} f(x) = (f_1(x), f_2(x), \ldots, f_m(x))^\top,$$

with each $f_i: \mathbb{R}^n \to \mathbb{R}$ continuously differentiable and possibly nonconvex (Liu, 20 Dec 2025). The optimality criterion in the multiobjective setting is weak Pareto-optimality: $x^*$ is Pareto-critical if for every direction $d \in \mathbb{R}^n$ there is an index $i$ such that $\nabla f_i(x^*)^T d \ge 0$.

The BB scalar step-size formulas are

$$\alpha_k^{\text{BB1}} = \frac{s_k^T s_k}{s_k^T y_k}, \qquad \alpha_k^{\text{BB2}} = \frac{s_k^T y_k}{y_k^T y_k},$$

with $s_k = x_{k+1} - x_k$ and $y_k = \nabla f(x_{k+1}) - \nabla f(x_k)$ (Park et al., 2019, Robles-Kelly et al., 2022). Diagonal extensions construct $U^k = \operatorname{diag}(u^k_1, \ldots, u^k_n)$ from component-wise secant equations, preserving $O(n)$ complexity while reflecting coordinate-wise curvature.
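
To make the step-size rules concrete, the following minimal sketch (assuming NumPy and a toy quadratic objective chosen only for illustration; it is not taken from any of the cited implementations) computes BB1 and BB2 from two consecutive iterates and gradients:

```python
import numpy as np

def bb_step_sizes(x_prev, x_curr, g_prev, g_curr, eps=1e-12):
    """Compute the two Barzilai-Borwein scalar step sizes.

    s_k = x_{k+1} - x_k,  y_k = grad f(x_{k+1}) - grad f(x_k);
    BB1 = (s_k^T s_k) / (s_k^T y_k),  BB2 = (s_k^T y_k) / (y_k^T y_k).
    """
    s = x_curr - x_prev
    y = g_curr - g_prev
    sy = float(s @ y)
    if sy <= eps:           # negative curvature or degenerate secant pair:
        return None, None   # caller should fall back to a safeguarded step
    bb1 = float(s @ s) / sy
    bb2 = sy / float(y @ y)
    return bb1, bb2

# Toy usage on f(x) = 0.5 * x^T D x with D = diag(1, 10):
D = np.diag([1.0, 10.0])
grad = lambda x: D @ x
x0, x1 = np.array([1.0, 1.0]), np.array([0.9, 0.5])
print(bb_step_sizes(x0, x1, grad(x0), grad(x1)))
```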

2. Algorithmic Structure and Updates

At each iteration, BB-DQN algorithms execute the following core steps:

  1. Shared Diagonal Update: For multiobjective problems, a single diagonal matrix $B_k = \alpha_k^{-1} I$ is maintained for all objectives (Liu, 20 Dec 2025). In convex composite (proximal-gradient) or deep learning settings, each coordinate $u^k_i$ is estimated via

$$u^k_i = \Pi_{[\beta_1, \beta_2]}\left( \frac{s^k_i y^k_i + \mu u^{k-1}_i}{(s^k_i)^2 + \mu} \right),$$

with projection onto scalar bounds derived from the BB formulas (Park et al., 2019). For multiobjective settings, descent directions are computed from a weighted sum of gradients, $g_k = \sum_{i=1}^m \lambda_i^k \nabla f_i(x_k)$, where $\lambda^k$ solves a dual minimization over the simplex.

  2. Step-Size Selection and Safeguarding: Both scalar and diagonal step-sizes are protected by safeguards to ensure positive definiteness and to mitigate erratic variation, particularly under ill-conditioning. Safeguards include

$$\omega_k = \min\{ c_0, c_1 \|g_k\|^{c_2} \},$$

with $c_0, c_1, c_2 > 0$ and projection of $\alpha_k$ or $u^k$ into $[\omega_k, \omega_k^{-1}]$ (Liu, 20 Dec 2025, Park et al., 2019).

  3. Modified Wolfe Line Search: Step lengths $t_k$ are determined to satisfy sufficient decrease and curvature criteria on all objectives:

$$f_i(x_k + t_k d_k) \le f_i(x_k) + \sigma_1 t_k \max_j \nabla f_j(x_k)^T d_k,$$

$$\max_j \nabla f_j(x_k + t_k d_k)^T d_k \ge \sigma_2 \max_j \nabla f_j(x_k)^T d_k,$$

with $0 < \sigma_1 < \sigma_2 < 1$ (Liu, 20 Dec 2025).

  4. Proximal Subproblem: For composite convex problems,

$$x^{k+1} = \operatorname{prox}_{g, U^k}\!\left(x^k - (U^k)^{-1} \nabla f(x^k)\right),$$

enabling efficient, closed-form updates for most regularizers $g$ (Park et al., 2019); a minimal code sketch of this composite-case iteration is given after this list.
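
The following sketch ties steps 1, 2, and 4 together for a composite problem $\min_x f(x) + \lambda \|x\|_1$. It is an illustrative implementation under assumed parameter choices ($\mu$, $\beta_1$, $\beta_2$, $c_0$, $c_1$, $c_2$ and the $\ell_1$ soft-threshold prox are placeholders chosen for the sketch), not the exact code of the cited papers:

```python
import numpy as np

def soft_threshold(z, tau):
    """Proximal operator of tau * ||.||_1 (closed form for the l1 regularizer)."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def diag_bb_prox_step(x, x_prev, grad, g_prev, u_prev, lam,
                      mu=1e-4, beta=(1e-6, 1e6), c=(1e-3, 1e-3, 2.0)):
    """One diagonal-BB proximal-gradient iteration (illustrative sketch).

    grad   : callable returning the gradient of the smooth part f.
    u_prev : previous diagonal surrogate (length-n vector).
    Returns the new iterate, the new diagonal, and the current gradient.
    """
    g = grad(x)
    s, y = x - x_prev, g - g_prev

    # Step 1: regularized component-wise secant update, projected onto [beta1, beta2].
    u = (s * y + mu * u_prev) / (s * s + mu)
    u = np.clip(u, beta[0], beta[1])

    # Step 2: safeguard omega_k = min{c0, c1 * ||g||^c2}, then project u into
    # [omega, 1/omega] (floored to avoid division by zero at a stationary point).
    omega = max(min(c[0], c[1] * np.linalg.norm(g) ** c[2]), 1e-12)
    u = np.clip(u, omega, 1.0 / omega)

    # Step 4: variable-metric proximal step. With a diagonal metric U = diag(u), the
    # prox of lam * ||.||_1 separates coordinate-wise with threshold lam / u_i.
    x_new = soft_threshold(x - g / u, lam / u)
    return x_new, u, g
```

A driver loop would initialize $u^0 = \mathbf{1}$, take one plain proximal-gradient step to obtain the first $(s, y)$ pair, and then repeat `diag_bb_prox_step` until the gradient-mapping norm falls below a tolerance.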

3. Theoretical Properties

BB-DQN frameworks admit the following analytical guarantees:

  • Global Convergence:

For multiobjective and composite optimization, if level sets are bounded, gradients are Lipschitz continuous, and the diagonal surrogate matrices are uniformly bounded ($aI \preceq B_k \preceq bI$), the sequence $\{x_k\}$ is well-defined, any accumulation point is Pareto-critical (in the multiobjective case), and $\|d_k\| \to 0$ (Liu, 20 Dec 2025).

  • R-linear Convergence (Strong Convexity):

If all objectives $f_i$ are twice continuously differentiable and satisfy $U I \preceq \nabla^2 f_i(x) \preceq L I$ on the level set, then

$$\|x_{k+1} - x^*\| \le \rho \|x_k - x^*\|,$$

with $\rho = \sqrt{1 - \omega U^2/(2L)} \in (0,1)$ (Liu, 20 Dec 2025); see the worked bound after this list.

  • For Convex Composite Problems:

If $f$ is convex and $L$-smooth and $g$ is convex, then global convergence is guaranteed; the gradient-mapping norm converges sublinearly ($O(1/k)$) or linearly under strong convexity (Park et al., 2019).
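
To spell out what the R-linear bound above delivers (a standard unrolling, not an additional result from the cited papers), applying the contraction $k$ times gives a geometric decay and an explicit iteration count for a target accuracy $\varepsilon$:

$$\|x_k - x^*\| \le \rho^{k} \|x_0 - x^*\|, \qquad \text{so } \|x_k - x^*\| \le \varepsilon \ \text{ once } \ k \ge \frac{\log\left(\|x_0 - x^*\|/\varepsilon\right)}{\log(1/\rho)}.$$

Since $\rho = \sqrt{1 - \omega U^2/(2L)}$, a better-conditioned problem (larger $U^2/L$) or a larger safeguard value $\omega$ yields a smaller $\rho$ and therefore fewer iterations.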

4. Practical Applications and Scalability

BB-DQN methods have been deployed in several contexts:

  • Large-scale Nonconvex Multiobjective Problems:

A shared diagonal surrogate dramatically reduces storage from $O(m n^2)$ to $O(n)$ and per-iteration arithmetic from $O(m n^2)$ to $O(m n)$. Comparative experiments show BB-DQN achieves 5–50× speedups over the full quasi-Newton M-BFGSMO on problems with $n = 500$ (3.4 ms vs. 54 ms) and more consistent convergence (lower failure rates) (Liu, 20 Dec 2025).

  • Ill-conditioned Machine Learning Tasks:

In composite problems (quadratic programs, least-squares, logistic regression, and sparse regularized settings), diagonal BB-based metrics reduce iterations by 15–30% relative to scalar BB proximal gradient (PG), and outperform FISTA in certain regimes. For instance, on a QP with condition number $10^4$, the method required 16 iterations versus 22 for PG(BB) (Park et al., 2019).

  • Deep Learning Optimization:

In deep network training, the BB-DQN-style adaptive step-size (a scalar BB update plugged into Adagrad/RMSprop) yields smoother and sometimes faster training-error descent than Adam, Adadelta, and fixed decay schedules; a sketch of this usage follows the table below. On CIFAR-10, BB-Adagrad reduced test error to 20.83% versus 21.85% (Adam), 21.17% (Adadelta), and 24.22% (baseline); ImageNet and MNIST results are competitive or superior in training epochs and generalization (Robles-Kelly et al., 2022).

| Setting | Storage Cost | Per-iteration Cost | Empirical Speed-up |
|---|---|---|---|
| Multiobjective | $O(n+m)$ vs $O(m n^2)$ | $O(m n)$ vs $O(m n^2)$ | 5–50× (for large $n$) |
| Convex PG | $O(n)$ | $O(n)$ | 15–30% fewer iterations |
| Deep Networks | $O(n)$ added | $O(1)$ per layer | Smoother, fewer epochs |
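
The deep-learning usage referenced above can be illustrated with the following toy sketch, which scales a diagonal Adagrad-style update by a safeguarded scalar BB step computed from successive parameter and gradient differences. The class and parameter names, the clipping interval, and the toy objective are assumptions made for illustration, not the exact scheme of Robles-Kelly et al. (2022):

```python
import numpy as np

class BBAdagradSketch:
    """Toy Adagrad-like optimizer whose global scale is a safeguarded scalar BB step."""

    def __init__(self, lr0=0.01, eps=1e-8, bb_bounds=(1e-4, 1.0)):
        self.lr = lr0               # current scalar step size
        self.eps = eps
        self.bb_bounds = bb_bounds  # safeguard interval for the BB step
        self.accum = None           # Adagrad accumulator of squared gradients
        self.prev_w = None
        self.prev_g = None

    def step(self, w, g):
        """Return updated parameters given current parameters w and gradient g."""
        if self.accum is None:
            self.accum = np.zeros_like(w)
        self.accum += g * g

        # Scalar BB1 step from the previous (parameter, gradient) pair, clipped for safety.
        if self.prev_w is not None:
            s, y = w - self.prev_w, g - self.prev_g
            sy = float(s @ y)
            if sy > 1e-12:
                self.lr = float(np.clip(float(s @ s) / sy, *self.bb_bounds))

        self.prev_w, self.prev_g = w.copy(), g.copy()
        return w - self.lr * g / (np.sqrt(self.accum) + self.eps)

# Toy usage: minimize the poorly scaled quadratic f(w) = 0.5 * w^T diag(1, 100) w.
D = np.array([1.0, 100.0])
w = np.array([1.0, 1.0])
opt = BBAdagradSketch()
for _ in range(200):
    w = opt.step(w, D * w)   # gradient of f is diag(1, 100) @ w
print(w)                     # the unique minimizer of f is at the origin
```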

5. Connections with Classic Quasi-Newton and Adaptive Methods

The BB-DQN paradigm generalizes classic quasi-Newton and adaptive gradient methods to high dimensions with minimal computational overhead and improved metric adaptation:

  • Quasi-Newton Context:

Standard quasi-Newton methods maintain $n \times n$ Hessian approximations (e.g., BFGS); scalar BB rules use a single global curvature estimate. BB-DQN diagonalizes the update, addressing coordinate-wise curvature without incurring $O(n^2)$ cost (Park et al., 2019, Liu, 20 Dec 2025).

  • Proximal and Adaptive Methods:

Diagonal BB surrogates can be injected into variable metric proximal gradient iterations. Furthermore, scalar BB-DQN integration into adaptive optimizers (Adagrad, RMSprop) for deep learning exploits the secant approximation to produce layer-wise or coordinate-wise learning rate updates (Robles-Kelly et al., 2022).

  • Safeguarding:

All variants bound their diagonal surrogates to ensure nondegeneracy and preserve stability under ill-conditioning, using closed-form projection and regularization strategies (Park et al., 2019, Liu, 20 Dec 2025).

6. Empirical Evaluation and Performance

Comprehensive numerical studies on synthetic benchmarks, machine learning datasets (MNIST, CIFAR, ImageNet), and large multiobjective test instances confirm the principal advantages of BB-DQN:

  • Iteration Count and Runtime:

Diagonal BB-DQN methods consistently reduce iteration counts by up to 27% (for heavily ill-conditioned QPs), with CPU times scaling linearly in $n$ (Park et al., 2019).

  • Reliability and Robustness:

Failure rates (non-convergence over random initializations) are lower or equal relative to competing full-matrix quasi-Newton methods, notably in multiobjective test suites (Liu, 20 Dec 2025).

  • Generalization in Deep Learning:

When embedded into Adagrad, BB step-sizes yield training error rates competitive with Adam/Adadelta and superior to heuristic decay schedules on deep network tasks, with more stable error trajectories (Robles-Kelly et al., 2022).

7. Limitations and Open Directions

The BB-DQN framework does not maintain full curvature information, which in some cases may sacrifice the fast local (superlinear) convergence of full quasi-Newton schemes, especially where off-diagonal Hessian entries are significant. Additionally, theoretical guarantees outside convex or strongly convex regimes rely on safeguarding and line-search conditions.

A plausible implication is that integrating block-diagonal or low-rank modifications may further bridge the gap between classic quasi-Newton performance and BB-DQN scalability. Empirical findings suggest continuing investigation of BB-DQN-based step-size adaptation in deep learning optimizers and highly nonconvex settings.
