
Pivoted-Cholesky Preconditioner

Updated 14 March 2026
  • The pivoted-Cholesky preconditioner is a deterministic, low-rank approximation method that clusters the spectrum of SPD kernel matrices to improve iterative solver convergence.
  • It employs a greedy, diagonal pivoting strategy to adaptively capture the dominant eigenvalues, reducing the condition number and accelerating PCG iterations.
  • This technique enables scalable and numerically stable computations in Gaussian process models and spatial statistics with proven convergence guarantees.

The pivoted-Cholesky preconditioner is a deterministic, low-rank approximation method for preconditioning large symmetric positive definite (SPD) kernel matrices. It targets acceleration of iterative solvers—primarily preconditioned conjugate gradients (PCG)—by clustering the spectrum of the preconditioned system near 1, thereby improving the convergence behavior. The pivoted-Cholesky construction exploits a greedy, column-wise, diagonal pivoting strategy to adaptively approximate the dominant eigenspace of the kernel matrix, enabling scalable, numerically stable solvers for kernel linear systems and log-determinant estimation in Gaussian process (GP) models and spatial statistics.

1. Mathematical Formulation and Algorithmic Structure

Given a noisy kernel (covariance) matrix $\tilde{K} = K + \sigma^2 I_n$, the pivoted-Cholesky preconditioner of rank $k \ll n$ constructs a factor $L_k \in \mathbb{R}^{n \times k}$ such that

$$\tilde{K} \approx P = L_k L_k^\top + \sigma^2 I.$$

The algorithm proceeds as follows:

  1. Initialization: Set $R^{(0)} \leftarrow \tilde{K}$.
  2. Iterative updates for $i = 1, \dots, k$:
    • Pivot selection: $p_i = \arg\max_j R^{(i-1)}_{jj}$
    • Column computation: $L_k[p_i, i] = \sqrt{R^{(i-1)}_{p_i p_i}}$, and for all $j$, $L_k[j, i] = R^{(i-1)}_{j p_i} / L_k[p_i, i]$
    • Residual update: $R^{(i)} = R^{(i-1)} - L_k[:, i]\, L_k[:, i]^\top$
  3. Termination: After $k$ steps, $L_k$ forms the low-rank factor. The preconditioner matrix is $P = L_k L_k^\top + \sigma^2 I_n$.

The complexity to build $L_k$ is $O(nk^2 + k^3)$, with $O(nk)$ storage. Each application of $P^{-1}$ to a vector during iterative solution requires $O(nk)$ computation via two triangular solves and one diagonal solve (Gyger et al., 2024, Gardner et al., 2018, Liu et al., 2015, Roos et al., 28 Jul 2025).
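As a minimal NumPy sketch of the steps above (the function name `pivoted_cholesky` and the dense-residual update are illustrative; a practical implementation tracks only the residual diagonal and queries kernel columns on demand to reach the $O(nk^2)$ cost quoted above):

```python
import numpy as np

def pivoted_cholesky(K_tilde, k, tol=0.0):
    """Sketch of the rank-k pivoted Cholesky factor of an SPD matrix K_tilde."""
    n = K_tilde.shape[0]
    R = np.array(K_tilde, dtype=float)            # residual R^{(0)} = K_tilde
    L = np.zeros((n, k))
    for i in range(k):
        p = int(np.argmax(np.diag(R)))            # greedy diagonal pivot p_i
        if R[p, p] <= tol:                        # optional early stop (Section 6)
            return L[:, :i]
        L[:, i] = R[:, p] / np.sqrt(R[p, p])      # column computation
        R -= np.outer(L[:, i], L[:, i])           # residual update R^{(i)}
    return L
```

The returned factor, together with the noise variance $\sigma^2$, defines the preconditioner $P = L_k L_k^\top + \sigma^2 I_n$ exactly as above.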

2. Incorporation in Iterative Solvers

In the PCG method, one solves systems of the form

$$\tilde{K} x = b,$$

using $P$ as a left preconditioner, yielding $P^{-1} \tilde{K} x = P^{-1} b$. Each CG iteration requires:

  • A matrix-vector product (MVM) with $\tilde{K}$: $O(n^2)$ for dense matrices, or $O(nm + n n_\gamma)$ in structured settings.
  • Application of $P^{-1}$: $O(nk)$, via two triangular solves involving $L_k$ and a diagonal solve with $\sigma^2 I$.

The preconditioner can also be efficiently inverted via the Woodbury identity, enabling direct application to vectors:

$$P^{-1} = \sigma^{-2} I - \sigma^{-4} L_k \left(I + \sigma^{-2} L_k^\top L_k\right)^{-1} L_k^\top.$$
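A minimal sketch of applying this identity (function and variable names are illustrative): the $k \times k$ inner matrix is factored once, after which each application of $P^{-1}$ costs $O(nk)$.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def make_precond_solve(L, sigma2):
    """Return a function that applies P^{-1} via the Woodbury identity above."""
    k = L.shape[1]
    inner = np.eye(k) + (L.T @ L) / sigma2        # I + sigma^{-2} L^T L  (k x k)
    inner_cf = cho_factor(inner)                  # factored once, O(k^3)
    def apply_pinv(v):
        # sigma^{-2} v - sigma^{-4} L (I + sigma^{-2} L^T L)^{-1} L^T v
        return v / sigma2 - (L @ cho_solve(inner_cf, L.T @ v)) / sigma2**2
    return apply_pinv
```

Wrapped in a `scipy.sparse.linalg.LinearOperator`, such a routine can be supplied as the preconditioner `M` to `scipy.sparse.linalg.cg`.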

Sampling and log-determinant calculations in stochastic Lanczos quadrature (SLQ) exploit this efficient structure (Gardner et al., 2018).

3. Theoretical Convergence and Preconditioning Guarantees

A rank-$k$ pivoted-Cholesky approximation captures the leading $k$ eigenvalues of $\tilde{K}$. The eigenvalues $\hat{\lambda}_j$ of $P^{-1} \tilde{K}$ satisfy:

  • $\hat{\lambda}_1, \dots, \hat{\lambda}_k \approx 1$
  • The remaining $\hat{\lambda}_{k+1}, \dots, \hat{\lambda}_n$ lie in $[\sigma^2 / (\lambda_{k+1} + \sigma^2),\, 1]$

Thus, the condition number

$$\kappa(P^{-1} \tilde{K}) \approx \frac{\lambda_{k+1} + \sigma^2}{\sigma^2}$$

decreases as $k$ increases, improving CG convergence. For exponentially decaying eigenvalues (e.g., analytic kernels), this clustering ensures rapid convergence, often with small $k$ (Gyger et al., 2024, Gardner et al., 2018, Liu et al., 2015).
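For context, the classical CG error bound (a standard result, stated here for the preconditioned system) makes the role of the condition number explicit: after $j$ iterations,

$$\|x_j - x_\ast\|_{\tilde{K}} \le 2 \left(\frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}\right)^{j} \|x_0 - x_\ast\|_{\tilde{K}}, \qquad \kappa = \kappa(P^{-1} \tilde{K}),$$

so driving $\kappa$ toward 1 by increasing $k$ directly shrinks the number of iterations needed to reach a fixed tolerance.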

Convergence rates for the maximum-norm residual have been made precise for kernels of minimal regularity. For SPD, Lipschitz-continuous kernels on a compact domain $\Omega \subset \mathbb{R}^d$ and complete pivoting, the residual satisfies $\|K - L_k L_k^\top\|_\infty = O(k^{-1/d})$, improving to $O(k^{-2/d})$ for kernels with $C^{1,1}$ regularity (Jeong et al., 16 Sep 2025).

4. Implementation Variants and Pivot Selection Strategies

The canonical variant uses greedy diagonal pivoting: at each step, select the pivot maximizing the diagonal of the current residual ("uncertainty sampling", "maximum entropy" in GP terms) (Roos et al., 28 Jul 2025). Recent work has introduced alternative, data-adaptive selection strategies:

  • PCov ("projected covariance"): greedily maximizes the reduction in $\mathrm{tr}(K - L L^\top)$, targeting the A-optimal objective.
  • WPCov ("weighted PCov"): incorporates data fit by monitoring $y$-weighted projections.

Both alternatives preserve the $O(nk^2)$ build cost, and empirical results demonstrate substantial reductions in required CG iterations versus standard pivoting, particularly on GP regression problems (Roos et al., 28 Jul 2025).
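Under the residual update of Section 1, eliminating column $j$ reduces $\mathrm{tr}(R)$ by $\|R[:, j]\|_2^2 / R_{jj}$, so a trace-greedy score ranks candidate pivots by that ratio rather than by the diagonal alone. The sketch below contrasts the two scores on an explicit residual matrix; it illustrates the idea only and is not claimed to reproduce the exact PCov or WPCov rules of Roos et al.

```python
import numpy as np

def diagonal_pivot(R):
    """Canonical rule: largest residual diagonal entry (max variance)."""
    return int(np.argmax(np.diag(R)))

def trace_greedy_pivot(R, eps=1e-12):
    """Trace-greedy rule: largest drop in tr(R) from the rank-1 update."""
    d = np.maximum(np.diag(R), eps)               # guard against tiny pivots
    scores = (R ** 2).sum(axis=0) / d             # ||R[:, j]||^2 / R_jj
    return int(np.argmax(scores))
```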

5. Computational Complexity and Empirical Performance

| Step | Time complexity | Storage |
| --- | --- | --- |
| Build $L_k$ | $O(nk^2 + k^3)$ | $O(nk)$ |
| Apply $P^{-1}$ (per iteration) | $O(nk)$ | $O(nk)$ |
| Apply $\tilde{K}$ (per iteration) | $O(n^2)$ dense, or less in structured settings | — |

Empirical benchmarks consistently show that for $n$ up to $10^5$, $k$ in the range $200$–$500$ is typical. For example, on a dense FSA system of size $n = 10^5$ (Gyger et al., 2024):

  • No preconditioner: 190 iterations, 54 s
  • Pivoted-Cholesky ($k = 200, 500, 1000$): iteration counts 91, 52, 32; wall times 52, 83, 266 s.
  • Custom FITC preconditioner (as a comparator): 14 iterations, $\sim 10$ s.

Increasing $k$ reduces iteration counts but raises overhead as $O(nk^2)$; above $k \approx 500$, diminishing returns are observed. In GPyTorch (Gardner et al., 2018), even $m = 5$ suffices for an order-of-magnitude reduction in CG steps in some kernel regimes.

6. Stopping Criteria, Rank Selection, and Best Practices

Practical stopping criteria involve monitoring the maximum diagonal element of the residual:

  • Stop when $\max_i d_i \le \varepsilon_{\text{tol}}$, where $d_i$ are the diagonal entries of the residual, or when the number of iterations reaches $k_{\max}$ (a usage sketch follows this list)
  • In preconditioning, select $k$ such that $\|R_k\|_\infty \le \delta\, \lambda_{\min}(K)$ for a target $\kappa$
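As a usage sketch, the two criteria combine naturally with the hypothetical `pivoted_cholesky` routine from Section 1, which already accepts a rank cap and a diagonal tolerance:

```python
# Stop at k_max columns, or earlier once max_i d_i <= eps_tol (names illustrative).
L_k = pivoted_cholesky(K_tilde, k=k_max, tol=eps_tol)
print("attained rank:", L_k.shape[1])
```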

For kernels with regularity below $C^2$, complete pivoting guarantees deterministic decay rates for the residual, outperforming unpivoted incomplete Cholesky or random Nyström methods in uniform-norm error (Jeong et al., 16 Sep 2025). Rank selection can be guided by the required target condition number and domain dimension.

7. Practical Impact and Scope of Application

The pivoted-Cholesky preconditioner is broadly used in scalable GP inference, covariance approximation, kernel interpolation, and log-determinant approximation. Its rapid, deterministic convergence in the uniform norm, with provable guarantees under weak regularity, makes it state-of-the-art for kernel methods in scientific computing and machine learning. Nevertheless, for very large ranks or when the low-rank spectrum decays slowly (e.g., high-dimensional or rough kernels), overhead can become prohibitive; alternatives such as structure-exploiting or inducing-point preconditioners (e.g., FITC) can outperform in those regimes (Gyger et al., 2024). Recent improvements in pivot selection and adaptive rank determination further enhance its efficiency and effectiveness in practical settings (Roos et al., 28 Jul 2025, Liu et al., 2015, Gardner et al., 2018).
