
Diagonal Preconditioners for Improved Convergence

Updated 21 October 2025
  • Diagonal preconditioners are positive definite diagonal matrices that rescale linear systems to improve convergence by reducing condition numbers.
  • They are computed using methods like subgradient optimization, SDP formulations, and interior point approaches to achieve near-optimal conditioning.
  • Their applications span iterative solvers, stochastic optimization, and large-scale computations, offering scalable, efficient performance in practice.

A diagonal preconditioner is a positive definite diagonal matrix designed to rescale a linear system, optimization problem, or iterative procedure so that convergence rates are improved by reducing the (generalized) condition number of the operator involved. Diagonal preconditioning is a ubiquitous strategy for accelerating convergence in iterative solvers, enhancing robustness in first-order methods, stabilizing numerical optimization, and facilitating large-scale computations where storage or computational constraints preclude the use of dense or full-matrix preconditioners. In modern computational mathematics and data science, diagonal preconditioners are recognized for their scalability, efficiency, and amenability to both theoretical analysis and practical implementation.

1. Mathematical Foundations and Definitions

Let $A \in \mathbb{R}^{n \times n}$ be a symmetric positive definite (SPD) matrix. The classical (worst-case) condition number is $\kappa(A) = \lambda_{\max}(A)/\lambda_{\min}(A)$, where $\lambda_{\max}$ and $\lambda_{\min}$ are the largest and smallest eigenvalues, respectively. Given a positive diagonal matrix $D = \operatorname{Diag}(d)$ with $d > 0$, diagonal preconditioning transforms $A$ to $\tilde{A} = D^{1/2} A D^{1/2}$ or, equivalently, rescales the variables $x \mapsto D^{-1/2} x$ to obtain a better-conditioned system.

The preconditioning objective is to select $D$ such that $\kappa(\tilde{A})$ is minimized. A parallel, "average-case" conditioning measure is the $\omega$-condition number, defined as

$$\omega(A) = \frac{\operatorname{tr}(A)/n}{\det(A)^{1/n}},$$

which is minimized by diagonal preconditioners corresponding to so-called "equilibration" or "log-determinant maximization" (Ghadimi et al., 27 Sep 2025).
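
To make the two measures concrete, the following minimal NumPy sketch evaluates both $\kappa$ and $\omega$ before and after a diagonal rescaling. The test matrix and the column-norm equilibration heuristic are arbitrary illustrative choices, not taken from the cited papers.

```python
import numpy as np

def kappa(A):
    """Worst-case condition number of an SPD matrix: lambda_max / lambda_min."""
    w = np.linalg.eigvalsh(A)
    return w[-1] / w[0]

def omega(A):
    """Average-case (omega-) condition number: (tr(A)/n) / det(A)^(1/n)."""
    n = A.shape[0]
    w = np.linalg.eigvalsh(A)
    # det(A)^(1/n) evaluated stably through the eigenvalue logs.
    return (np.trace(A) / n) / np.exp(np.mean(np.log(w)))

# An arbitrary badly scaled SPD test matrix and a simple positive rescaling d.
rng = np.random.default_rng(0)
n = 50
B = rng.standard_normal((n, n))
S = np.diag(np.logspace(0, 3, n))            # widely varying variable scales
A = S @ (B @ B.T + n * np.eye(n)) @ S

d = 1.0 / np.linalg.norm(A, axis=0)          # one simple equilibration heuristic
A_tilde = A * np.sqrt(d)[:, None] * np.sqrt(d)[None, :]   # D^{1/2} A D^{1/2}

print("kappa:", kappa(A), "->", kappa(A_tilde))
print("omega:", omega(A), "->", omega(A_tilde))
```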

Diagonal preconditioning is not limited to square matrices; in overdetermined least-squares problems ($A \in \mathbb{R}^{m \times n}$, $m \geq n$), one seeks a diagonal $W$ such that $\kappa(A^T W A)$ is minimized ("inner scaling"), or, for SPD matrices, a diagonal $S$ applied on both sides such that $SAS$ is optimally conditioned ("outer scaling").

2. Algorithmic Approaches and Computational Methods

Several algorithmic frameworks have been developed for computing optimal or near-optimal diagonal preconditioners.

2.1 Affine Pseudoconvex Reformulation and Subgradient Methods:

The condition number minimization problem can be reformulated as a pseudoconvex optimization over $d > 0$. Because the eigenvalues of the mapping $\mathcal{D}(d) = A \operatorname{Diag}(d)$ coincide with those of $D^{1/2} A D^{1/2}$, minimizing $\kappa(\mathcal{D}(d))$ is equivalent to minimizing $\kappa(\tilde{A})$ (Ghadimi et al., 27 Sep 2025). The gradient of $\kappa(d)$ is

$$\nabla \kappa(d) = \kappa(d)\left(\frac{1}{x_1^T D x_1}\,(x_1 \bullet x_1) - \frac{1}{x_n^T D x_n}\,(x_n \bullet x_n)\right)$$

where $x_1$ and $x_n$ are eigenvectors corresponding to $\lambda_{\max}$ and $\lambda_{\min}$, and $\bullet$ denotes the Hadamard (elementwise) product, so $x \bullet x$ is the elementwise square. The necessary optimality condition for a $\kappa$-optimal diagonal preconditioner is $x_1 \bullet x_1 = x_n \bullet x_n$.

A projected subgradient method,

$$v_{k+1} = \Pi_{\hat\Omega}\!\left( v_k - t_k\, \frac{g_k}{\|g_k\|} \right)$$

converges to the global minimizer owing to pseudoconvexity, with each iteration requiring only the computation of the extreme (largest and smallest) eigenpairs (Ghadimi et al., 27 Sep 2025).
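
A minimal dense-matrix sketch of such a scheme is given below. It follows the subgradient formula stated above, takes $x_1$ and $x_n$ from the scaled matrix $D^{1/2} A D^{1/2}$, and projects onto the simple box $\hat\Omega = \{d : d \geq \epsilon\}$; the eigenvector convention, step-size rule, and choice of $\hat\Omega$ are illustrative assumptions rather than the exact algorithm of Ghadimi et al.

```python
import numpy as np

def kappa_subgradient_step(A, d, step):
    """One projected-subgradient step for minimizing kappa(Diag(d)^{1/2} A Diag(d)^{1/2}).

    Assumes A is SPD and d > 0; the extreme eigenpairs are taken from the
    symmetrically scaled matrix (an illustrative convention).
    """
    D = np.diag(d)
    Dh = np.diag(np.sqrt(d))
    At = Dh @ A @ Dh
    w, V = np.linalg.eigh(At)              # eigenvalues in ascending order
    x_min, x_max = V[:, 0], V[:, -1]
    kap = w[-1] / w[0]
    # Subgradient per the formula above:
    # kappa(d) * (x_max.x_max / x_max^T D x_max - x_min.x_min / x_min^T D x_min)
    g = kap * (x_max**2 / (x_max @ D @ x_max) - x_min**2 / (x_min @ D @ x_min))
    d_new = d - step * g / np.linalg.norm(g)
    return np.maximum(d_new, 1e-10), kap   # projection onto {d >= eps}

# Usage: start from 1/diag(A) and take normalized steps with a diminishing rule.
rng = np.random.default_rng(1)
n = 30
B = rng.standard_normal((n, n))
A = B @ B.T + np.diag(np.logspace(0, 2, n))
d = 1.0 / np.diag(A)
for k in range(200):
    d, kap = kappa_subgradient_step(A, d, step=0.5 / (k + 1))
print("condition number estimate after 200 steps:", kap)
```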

2.2 SDP and Matrix-Dictionary Methods:

Semidefinite programming offers an exact formulation:

$$\max_{D,\,\tau}\ \tau \quad \text{such that} \quad \tau M \preceq D \preceq M, \quad D \text{ diagonal},\ D > 0,$$

where $M$ denotes the SPD matrix to be preconditioned. This approach, while theoretically optimal, is prohibitively expensive for large $n$.
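
For modest $n$, the exact SDP can be transcribed almost verbatim into a modeling language such as CVXPY. The sketch below assumes CVXPY with an SDP-capable solver (SCS) is available; the function name, tolerance, and test matrix are hypothetical.

```python
import cvxpy as cp
import numpy as np

def optimal_diagonal_sdp(M, eps=1e-8):
    """Exact SDP for optimal diagonal scaling of the SPD matrix M.

    Maximizes tau subject to tau*M <= D <= M in the Loewner order, with D
    diagonal and positive; at the optimum kappa(D^{-1/2} M D^{-1/2}) = 1/tau.
    Interior-point/first-order SDP cost grows quickly with n.
    """
    M = (M + M.T) / 2                       # enforce exact numerical symmetry
    n = M.shape[0]
    d = cp.Variable(n)
    tau = cp.Variable()
    D = cp.diag(d)
    constraints = [D - tau * M >> 0, M - D >> 0, d >= eps]
    prob = cp.Problem(cp.Maximize(tau), constraints)
    prob.solve(solver=cp.SCS)
    return d.value, 1.0 / tau.value         # scaling and achieved condition number

# Usage on a small random SPD matrix:
rng = np.random.default_rng(2)
B = rng.standard_normal((20, 20))
M = B @ B.T + np.eye(20)
d_opt, kappa_opt = optimal_diagonal_sdp(M)
print("optimal diagonal condition number:", kappa_opt)
```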

Efficient alternatives, such as the matrix-dictionary approximation and cutting-plane SIP/column-generation methods, reduce the search space to a low-dimensional subspace (Gao et al., 2023). The resulting semi-infinite program is solved via iterative linear programs enhanced by black-box eigenvalue computations, making the approach practical for very large sparse systems.

2.3 Interior Point & Bisection:

“Optimal Diagonal Preconditioning” (Qu et al., 2022) recasts the problem as a quasi-convex program, enabling efficient bisection or interior point algorithms. The bisection approach iteratively solves

$$\min_{D > 0}\ t \quad \text{such that} \quad \|A D^{-1}\|_2 \leq t$$

and uses the Nesterov–Todd direction for improved convergence. For one-sided preconditioning, dual SDP reformulations are used.

2.4 Randomized and Sampling-based Acceleration:

For matrices available only via matrix-vector products, random projections or sampling techniques can be used to restrict the diagonal search space, with theoretical guarantees that random subspaces of modest size yield constant-factor approximations to the optimal preconditioner (Gao et al., 2023, Qu et al., 2022).
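
One way to realize the subspace-restriction idea is sketched below: the diagonal is parameterized as $d = Bc$ with a random nonnegative basis $B \in \mathbb{R}^{n \times k}$, and only the $k$ coefficients are optimized under the same semidefinite constraints as in the exact formulation. This is an illustrative construction consistent with the idea, not the specific algorithm of the cited papers, and it still accesses $M$ explicitly rather than purely through matrix-vector products.

```python
import cvxpy as cp
import numpy as np

def subspace_diagonal_sdp(M, k=5, eps=1e-8, seed=0):
    """Diagonal scaling optimized over a random k-dimensional subspace d = B @ c.

    Each column of B is a random positive vector, so nonnegative coefficients c
    keep d > 0; only k coefficients are optimized instead of n diagonal entries.
    """
    M = (M + M.T) / 2
    n = M.shape[0]
    rng = np.random.default_rng(seed)
    B = rng.uniform(0.1, 1.0, size=(n, k))    # nonnegative random basis
    c = cp.Variable(k, nonneg=True)
    tau = cp.Variable()
    D = cp.diag(B @ c)
    constraints = [D - tau * M >> 0, M - D >> 0, B @ c >= eps]
    prob = cp.Problem(cp.Maximize(tau), constraints)
    prob.solve(solver=cp.SCS)
    return B @ c.value, 1.0 / tau.value        # diagonal scaling and its condition number
```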

2.5 Classical Heuristics and Jacobi Preconditioning:

The Jacobi preconditioner, $D = \operatorname{diag}(A)$, guarantees a condition number within a quadratic factor of the optimum (Jambulapati et al., 2020). More sophisticated SDP-based or affine methods can recover the optimal scaling, which amounts to as much as a square-root improvement over the Jacobi guarantee; for some matrices, however, the quadratic-factor bound for Jacobi scaling is tight.
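
As a baseline reference, Jacobi scaling is a one-liner; the sketch below measures the condition-number reduction it delivers on an arbitrary badly scaled SPD test matrix (the matrix itself is an illustrative choice).

```python
import numpy as np

def kappa(M):
    """Worst-case condition number of an SPD matrix."""
    w = np.linalg.eigvalsh(M)
    return w[-1] / w[0]

rng = np.random.default_rng(3)
n = 200
B = rng.standard_normal((n, n))
S = np.diag(np.logspace(-2, 2, n))             # wildly different variable scales
A = S @ (B @ B.T + n * np.eye(n)) @ S          # badly scaled SPD matrix

# Jacobi scaling: symmetric rescaling by diag(A)^{-1/2} gives a unit diagonal.
d = 1.0 / np.sqrt(np.diag(A))
A_jac = A * d[:, None] * d[None, :]            # diag(d) @ A @ diag(d)

print("kappa(A)      =", kappa(A))
print("kappa(Jacobi) =", kappa(A_jac))         # guaranteed within (kappa*)^2 of optimal
```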

3. Practical Applications and Empirical Performance

Diagonal preconditioners appear across a spectrum of computational science and engineering domains.

  • Iterative Solvers for SPD Systems: Accelerating convergence of the preconditioned conjugate gradient method (PCG) for $Ax = b$ is a fundamental application. Numerical results demonstrate that optimally constructed diagonal preconditioners can reduce PCG iteration counts by an order of magnitude compared to heuristics (Ghadimi et al., 27 Sep 2025, Gao et al., 2023); a minimal PCG sketch follows this list.
  • Optimization Algorithms: In stochastic gradient descent (SGD), diagonal scaling equates to coordinate-wise adaptive step sizes. AdaGrad and its variants use cumulative squared gradient histories to form the diagonal preconditioner, leading to robust regret bounds and efficient convergence (Xie et al., 13 Mar 2025). Notably, in many realistic regimes, diagonal adaptive methods—despite their computational thrift—can match or even outperform richer, more expensive full-matrix schemes.
  • Interior Point and First-Order Methods: In large convex programs, diagonal scaling (e.g., via the graph projection splitting variant of ADMM) substantially improves primal-dual residual balancing and accelerates convergence across synthetic and real-world problem instances (Takapoui et al., 2016).
  • Eigenvalue Problems and Electronic Structure: In electronic structure calculations with ill-conditioned generalized eigenproblems, hybrid global/local preconditioning strategies are used in which a global diagonal or block preconditioner (e.g., an $LDL^T$ factorization of $H - \epsilon S$) is amortized and refined via local iterative solves (Cai et al., 2013).
  • Large-Scale and Matrix-Free Environments: Recent algorithms utilizing SIP/cutting-plane and matrix-dictionary strategies achieve scalable, near-optimal diagonal preconditioning for matrices with dimension and number of nonzeros in the $10^7$ range, using only black-box access to matrix-vector products (Gao et al., 2023, Jambulapati et al., 2023).
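
The PCG use case from the first bullet can be exercised with SciPy's conjugate gradient solver by passing a diagonal preconditioner through the `M` argument (which should apply an approximation of $A^{-1}$). The test matrix, scaling, and iteration limits below are arbitrary illustrative choices.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# An arbitrary badly scaled sparse SPD test matrix: 2-D Laplacian with rescaled rows/columns.
n = 60
T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n))
L = sp.kron(sp.eye(n), T) + sp.kron(T, sp.eye(n))
s = np.logspace(0, 3, n * n)
A = (sp.diags(s) @ L @ sp.diags(s)).tocsr()
b = np.ones(n * n)

# Diagonal (Jacobi) preconditioner: apply 1/diag(A) as an approximation of A^{-1}.
M = sp.diags(1.0 / A.diagonal())

def make_counter():
    count = {"it": 0}
    def cb(xk):
        count["it"] += 1
    return count, cb

# The unpreconditioned run may hit the iteration cap; the Jacobi run converges quickly.
for prec, label in [(None, "no preconditioner"), (M, "Jacobi diagonal")]:
    count, cb = make_counter()
    x, info = spla.cg(A, b, M=prec, maxiter=5000, callback=cb)
    print(f"{label}: converged={info == 0}, iterations={count['it']}")
```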

4. Theoretical Guarantees and Optimality Conditions

Optimal diagonal preconditioning is possible due to the following facts:

  • For every SPD $A$, there exists a diagonal $D > 0$ achieving the minimum condition number for $D^{1/2} A D^{1/2}$ (Ghadimi et al., 27 Sep 2025, Qu et al., 2022, Jambulapati et al., 2020).
  • Under pseudoconvexity of the reparameterized objective, any stationary point is a global optimum, enabling subgradient methods to avoid local minima traps (Ghadimi et al., 27 Sep 2025).
  • For average-case criteria (e.g., ω\omega-condition number), convexity facilitates tractable optimization.
  • For the classical Jacobi scaling, $\kappa\bigl(\operatorname{diag}(A)^{-1/2} A\, \operatorname{diag}(A)^{-1/2}\bigr) \leq (\kappa^*)^2$, where $\kappa^*$ is the optimal condition number achievable by any diagonal scaling (Jambulapati et al., 2020, Jambulapati et al., 2023). There exist matrices for which this bound is tight.
  • Application of an $\omega$-optimal preconditioner to a matrix already $\kappa$-optimally scaled yields further, sometimes dramatic, convergence gains (Ghadimi et al., 27 Sep 2025).

5. Extensions, Limitations, and Future Directions

Structured Preconditioners:

Diagonal preconditioning is a particular case of structured preconditioning; generalizations include block-diagonal, Kronecker-product, and variable-wise diagonal schemes. Recent unified analyses of adaptive optimization show that more structured (e.g., layerwise, block, or diagonal) preconditioners are not only computationally preferable but can also be competitive with, or superior to, full-matrix schemes in actual convergence (Xie et al., 13 Mar 2025).
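
As one structured generalization, the sketch below assembles a block-Jacobi preconditioner from contiguous index blocks and exposes it as a SciPy LinearOperator suitable for the `M` argument of an iterative solver; the contiguous block partition is an arbitrary illustrative choice.

```python
import numpy as np
import scipy.sparse.linalg as spla

def block_jacobi_preconditioner(A, block_size):
    """Block-diagonal preconditioner: invert each contiguous diagonal block of A.

    A is a dense SPD ndarray; returns a LinearOperator that applies the inverse
    of the block-diagonal part of A.
    """
    A = np.asarray(A)
    n = A.shape[0]
    blocks = []
    for start in range(0, n, block_size):
        stop = min(start + block_size, n)
        blocks.append((slice(start, stop), np.linalg.inv(A[start:stop, start:stop])))

    def apply(v):
        out = np.empty_like(v, dtype=float)
        for sl, Binv in blocks:
            out[sl] = Binv @ v[sl]   # apply the inverse of one diagonal block
        return out

    return spla.LinearOperator((n, n), matvec=apply)

# Usage: pass the returned operator as M to scipy.sparse.linalg.cg(A, b, M=...).
```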

Limitations:

  • Off-diagonal couplings are entirely neglected in pure diagonal schemes; in the presence of strong correlations, block or full-matrix preconditioners may perform substantially better.
  • In some settings, further improvements require incorporating off-diagonal structure, e.g., via incomplete Cholesky or block Jacobi.
  • The quality of diagonal scaling is fundamentally limited if the matrix is nearly singular or has global coupling structure invisible to coordinate-wise rescaling.

Scalability:

Modern developments leverage randomized, black-box algorithms to enable application to matrices of extreme size, including highly sparse and streaming contexts (Gao et al., 2023, Jambulapati et al., 2023).

Research Directions:

Key avenues include plugin approaches for adaptive selection of candidate preconditioner subspaces, extensions to indefinite matrices, hybrid (variable/blockwise) scaling, integration with higher-order, deterministic and stochastic optimization algorithms, and statistical risk-aware preconditioning.

6. Comparative Summary of Approaches

| Approach | Storage/Compute | Structure Used | Optimality |
|---|---|---|---|
| Jacobi (diag) | $O(n)$ | $\operatorname{diag}(A)$ | Within a quadratic factor of optimal (Jambulapati et al., 2020) |
| SDP-based optimal diagonal | $O(n^3)$ | Full matrix | Optimal |
| Subgradient affine, SIP, CG | $O(n)$ | Matrix-vector products | Near-optimal; scalable (Gao et al., 2023, Ghadimi et al., 27 Sep 2025) |
| Block/variable-wise (OVDP) | Low | Operator norms | Variable-level optimal (Naganuma et al., 2023) |
| Randomized/dictionary CG | $O(k)$ | Random subspaces | Constant-factor approximation with $k \ll n$ (Gao et al., 2023) |

These approaches serve different computational and application regimes.

7. Impact and Empirical Findings

Empirical results across diverse contexts—large-scale linear systems, graph Laplacian solvers, first-order optimization, and machine learning—support the efficacy of optimal and near-optimal diagonal preconditioners in reducing both iteration count and wall-clock time. Notably:

  • Random sampling and column-generation strategies rapidly achieve up to $2\times$ improvement over Jacobi preconditioning for real-world sparse matrices (Gao et al., 2023).
  • In PCG for Hessian systems arising from logistic regression or interior-point methods, optimized diagonal scaling drastically reduces the number of matrix-vector products required for convergence (Ghadimi et al., 27 Sep 2025).
  • Diagonal AdaGrad matches or outperforms more sophisticated preconditioners experimentally, especially in high dimensions (Xie et al., 13 Mar 2025).
  • Iterative schemes that sequentially combine $\kappa$- and $\omega$-optimal preconditioners yield order-of-magnitude gains for conjugate gradient approaches (Ghadimi et al., 27 Sep 2025).

Diagonal preconditioners, due to their minimal storage and computational overhead, intrinsic parallelizability, and amenability to rigorous theoretical analysis, have become a mainstay tool in scientific computing, optimization, and data science for enhancing the numerical stability and computational efficiency of iterative algorithms.
