Inverse-Free Sparse Variational Gaussian Processes

Published 1 Apr 2026 in stat.ML and cs.LG | (2604.00697v1)

Abstract: Gaussian processes (GPs) offer appealing properties but are costly to train at scale. Sparse variational GP (SVGP) approximations reduce cost yet still rely on Cholesky decompositions of kernel matrices, ill-suited to low-precision, massively parallel hardware. While one can construct valid variational bounds that rely only on matrix multiplications (matmuls) via an auxiliary matrix parameter, optimising them with off-the-shelf first-order methods is challenging. We make the inverse-free approach practical by proposing a better-conditioned bound and deriving a matmul-only natural-gradient update for the auxiliary parameter, markedly improving stability and convergence. We further provide simple heuristics, such as step-size schedules and stopping criteria, that make the overall optimisation routine fit seamlessly into existing workflows. Across regression and classification benchmarks, we demonstrate that our method 1) serves as a drop-in replacement in SVGP-based models (e.g., deep GPs), 2) recovers similar performance to traditional methods, and 3) can be faster than baselines when well tuned.

Authors (4)

Summary

The paper introduces R-SVGP, an inverse-free framework that replaces matrix inversions with efficient matmul operations.
It proposes specialized natural gradient updates and preconditioning to enhance convergence and numerical stability.
Empirical results demonstrate up to 2.5× faster convergence and competitive accuracy on both regression and classification tasks.

Authoritative Summary of "Inverse-Free Sparse Variational Gaussian Processes" (2604.00697)

Introduction and Motivation

Sparse variational Gaussian processes (SVGPs) remain at the forefront of scalable Bayesian nonparametric modeling, but training them involves computational bottlenecks, particularly due to matrix inversions and Cholesky decompositions. Such operations are ill-suited for low-precision, massively parallel hardware typical in modern deep learning ecosystems. The paper addresses this by introducing a practical inverse-free SVGP framework, termed R-SVGP, which replaces all matrix inversions with matmul-based objectives and updates, enabling efficient exploitation of hardware acceleration.

Parameterization and Inverse-Free Bounds

The SVGP framework approximates the GP posterior using inducing points $\mathbf{Z}\in \mathbb{R}^{M\times D}$ with a variational distribution $q(\mathbf{u})$, yielding an ELBO involving predictive means, variances, and KL divergences. Standard approaches (M-SVGP, W-SVGP, L-SVGP) depend on matrix decompositions. R-SVGP instead introduces an auxiliary parameter $\mathbf{T}$, intended to mimic $\tilde{\mathbf{K}}^{-1}$, and optimizes it using only matmuls.

The R-SVGP bound upper-bounds the L-SVGP predictive variance via:

$\sigma_n^{2(\text{L})} \leq k_{nn} - \mathbf{k}_{n\mathbf{u}}(2\mathbf{T}-\mathbf{T}\tilde{\mathbf{K}}\mathbf{T})\mathbf{k}_{\mathbf{u}n}$

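with a corresponding closed-form upper bound on the KL term. When $\mathbf{T}=\tilde{\mathbf{K}}^{-1}$, the right-hand side reduces to $k_{nn} - \mathbf{k}_{n\mathbf{u}}\tilde{\mathbf{K}}^{-1}\mathbf{k}_{\mathbf{u}n}$ and R-SVGP recovers L-SVGP, so the relaxation costs nothing at the optimum.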
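
The inequality behind this bound can be checked numerically. For symmetric $\mathbf{T}$, the matrix $(\tilde{\mathbf{K}}^{-1}-\mathbf{T})\tilde{\mathbf{K}}(\tilde{\mathbf{K}}^{-1}-\mathbf{T})$ is positive semidefinite, and expanding it gives $\tilde{\mathbf{K}}^{-1} \succeq 2\mathbf{T}-\mathbf{T}\tilde{\mathbf{K}}\mathbf{T}$. The NumPy sketch below illustrates this on a toy problem. The RBF kernel, the jitter, the inducing inputs, and all helper names are assumptions made for illustration, and only the term involving $\tilde{\mathbf{K}}^{-1}$ is compared, not the paper's full predictive variance.

```python
import numpy as np

def rbf(X, Y, lengthscale=1.0):
    """Squared-exponential kernel matrix between the rows of X and Y."""
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq / lengthscale**2)

rng = np.random.default_rng(0)
Z = rng.normal(size=(8, 2))            # hypothetical inducing inputs (M=8, D=2)
x = rng.normal(size=(1, 2))            # one test input

K = rbf(Z, Z) + 1e-2 * np.eye(8)       # stand-in for K-tilde (jitter is an assumption)
k_un = rbf(Z, x)                       # shape (M, 1)
k_nn = 1.0                             # rbf(x, x) equals 1 for this kernel

def variance_bound(T):
    """k_nn - k_nu (2T - T K T) k_un, computed with matmuls only."""
    S = 2.0 * T - T @ K @ T
    return k_nn - (k_un.T @ S @ k_un).item()

# Reference values use a solve/inverse purely for checking the bound.
exact = k_nn - (k_un.T @ np.linalg.solve(K, k_un)).item()
T_rough = np.eye(8) / np.trace(K)      # crude symmetric guess for K^{-1}
T_exact = np.linalg.inv(K)

bound_rough = variance_bound(T_rough)  # looser than `exact`
bound_exact = variance_bound(T_exact)  # coincides with `exact`
```

With a crude $\mathbf{T}$ the bound is loose but still valid; it tightens as $\mathbf{T}$ approaches $\tilde{\mathbf{K}}^{-1}$.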

Matmul-Only Optimization: Natural Gradient Updates

Optimizing $\mathbf{T}$ directly with off-the-shelf methods (e.g., Adam) proved empirically unstable. The paper derives a specialized natural gradient (NG) update for $\mathbf{T}$, parameterized via its Cholesky factor $\mathbf{L}$:

$\tilde{\nabla}\ell_{\mathbf{A}} = \mathbf{L}\big[\operatorname{tril}(\mathbf{L}^\top\mathbf{A}\mathbf{L}) - \tfrac12(\mathbf{I} + \operatorname{diag}(\mathbf{L}^\top\mathbf{A}\mathbf{L}))\big]$

This update requires no decompositions or inverses, and it improves both convergence and numerical stability, which is what makes the inverse-free approach usable in practice.
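
As a rough illustration, the expression above can be transcribed into NumPy. Setting the bracket to zero forces $\mathbf{L}^\top\mathbf{A}\mathbf{L}=\mathbf{I}$, so $\mathbf{T}=\mathbf{L}\mathbf{L}^\top$ equals $\mathbf{A}^{-1}$ at a fixed point. The sketch below assumes a plain additive step $\mathbf{L}\leftarrow\mathbf{L}-\gamma\tilde{\nabla}\ell_{\mathbf{A}}$ with a constant step size, a scaled-identity initialization, and a random SPD matrix standing in for $\mathbf{A}$; none of these choices are taken from the paper, whose step-size schedule and stopping rule are not reproduced here.

```python
import numpy as np

def ng_direction(L, A):
    """Transcription of L [tril(L^T A L) - 0.5 (I + diag(L^T A L))]."""
    M = L.T @ A @ L
    I = np.eye(L.shape[0])
    bracket = np.tril(M) - 0.5 * (I + np.diag(np.diag(M)))
    return L @ bracket

def run_ng(A, step=0.25, iters=400):
    """Assumed update rule L <- L - step * direction (not from the source)."""
    n = A.shape[0]
    # Assumed initialization: T = L L^T = I / tr(A), so L^T A L has eigenvalues in (0, 1).
    L = np.eye(n) / np.sqrt(np.trace(A))
    for _ in range(iters):
        L = L - step * ng_direction(L, A)
    return L

rng = np.random.default_rng(0)
B = rng.normal(size=(6, 6))
A = B @ B.T / 6 + np.eye(6)            # hypothetical SPD stand-in for A

L = run_ng(A)
T = L @ L.T                            # should approximate A^{-1}
residual = np.linalg.norm(L.T @ A @ L - np.eye(6))
```

The loop uses only matmuls, `tril`, and `diag`; an explicit inverse is needed only if one wants to verify the result.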

Figure 1: Loss traces on Snelson and banana datasets demonstrating stable convergence for NG-updated R-SVGP variants, matching Cholesky-based baselines.

Preconditioning and Practical Optimization

The paper identifies that L-SVGP parameterizations inherently converge more slowly than W-SVGP. By preconditioning the inducing mean in R-SVGP, the model recovers stability and performance parity with W-SVGP while remaining inverse-free. This preconditioner is equivalent to a single Newton-Schulz iteration for $\tilde{\mathbf{K}}^{-1}$.
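
For reference, the Newton-Schulz iteration for a matrix inverse is $\mathbf{X}_{k+1}=\mathbf{X}_k(2\mathbf{I}-\mathbf{A}\mathbf{X}_k)=2\mathbf{X}_k-\mathbf{X}_k\mathbf{A}\mathbf{X}_k$, which has the same form as the term $2\mathbf{T}-\mathbf{T}\tilde{\mathbf{K}}\mathbf{T}$ in the variance bound. The minimal sketch below runs the generic iteration on a random SPD matrix. The test matrix, the starting guess, and the function name are illustrative assumptions, and the code does not reproduce how the paper applies the preconditioner to the inducing mean.

```python
import numpy as np

def newton_schulz_step(X, A):
    """One Newton-Schulz step toward A^{-1}: X <- X (2I - A X) = 2X - X A X."""
    return 2.0 * X - X @ A @ X

rng = np.random.default_rng(1)
B = rng.normal(size=(5, 5))
A = B @ B.T / 5 + np.eye(5)            # hypothetical SPD matrix

# Starting guess chosen so that the eigenvalues of A X lie in (0, 1),
# which is sufficient for the iteration to converge.
X = np.eye(5) / np.trace(A)

errors = []
for _ in range(12):
    errors.append(np.linalg.norm(np.eye(5) - A @ X))
    X = newton_schulz_step(X, A)
errors.append(np.linalg.norm(np.eye(5) - A @ X))
```

Each step squares the residual $\mathbf{I}-\mathbf{A}\mathbf{X}_k$, so a single step already improves a reasonable guess at the cost of two matmuls.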

Moreover, the authors propose simple heuristics for step-size schedules, stopping criteria (e.g., normalized residuals), and trace estimation. Evaluating the R-SVGP bound uses Hutchinson's method for unbiased trace estimation, which lowers the evaluation cost to quadratic in the number of inducing points $M$.
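
Hutchinson's estimator approximates $\operatorname{tr}(\mathbf{B})$ by averaging $\mathbf{z}^\top\mathbf{B}\mathbf{z}$ over random probes $\mathbf{z}$ with zero mean and identity covariance. When $\mathbf{B}$ is a product of $M\times M$ matrices, each probe needs only matrix-vector products, which is where the quadratic cost comes from. The sketch below is a generic version with Rademacher probes. The particular trace $\operatorname{tr}(\mathbf{T}\mathbf{K}\mathbf{T}\mathbf{K})$, the matrices, and the function names are hypothetical examples, not the exact term estimated in the paper.

```python
import numpy as np

def hutchinson_trace(matvec, dim, num_probes, rng):
    """Estimate tr(B) as the mean of z^T (B z) over Rademacher probes z."""
    Z = rng.choice([-1.0, 1.0], size=(dim, num_probes))   # one probe per column
    BZ = matvec(Z)                                        # apply B to every probe
    return float(np.mean(np.sum(Z * BZ, axis=0)))

rng = np.random.default_rng(0)
M = 20
G = rng.normal(size=(M, M))
K = G @ G.T / M + np.eye(M)            # hypothetical SPD stand-in for K-tilde
T = np.eye(M) / np.trace(K)            # hypothetical auxiliary matrix

def apply_product(V):
    """Apply T K T K to the columns of V without forming the M x M product."""
    return T @ (K @ (T @ (K @ V)))     # O(M^2) work per probe column

estimate = hutchinson_trace(apply_product, M, num_probes=4000, rng=rng)
exact = float(np.trace(T @ K @ T @ K)) # reference only; forms the full product
```

More probes reduce the variance of the estimate at proportionally higher cost, which is the runtime-accuracy tradeoff governed by the probe count.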

Figure 2: NLPD/runtime on elevators and kin40k datasets across inducing point counts, highlighting R-SVGP’s competitive training efficiency.

Figure 3: NLPD/runtime with varying numbers of Hutchinson probes, illustrating the runtime-accuracy tradeoff for trace estimation.

Empirical Results: Efficacy and Efficiency

Experiments were conducted on toy, UCI regression and classification datasets, as well as complex models (deep GPs, convolutional kernels). Key findings:

Efficacy: R-SVGP using NG updates and preconditioning matches or exceeds L/W-SVGP performance on both regression and classification tasks.
Efficiency: With well-tuned heuristics, R-SVGP converged faster than L-SVGP and also achieved a speedup over W-SVGP, especially for large $M$; the largest reported gain is 2.5× faster convergence.
Generality: R-SVGP acts as a drop-in replacement for deep, multi-output, and convolutional GP models, retaining predictive performance even where variational flexibility is somewhat reduced.

Theoretical and Practical Implications

By eliminating decompositions and relying on matmul-only computations, R-SVGP addresses the hardware mismatch that has limited SVGP efficiency on accelerators, especially in low-precision regimes. The framework opens the possibility for further advances:

Fully inverse-free parameterizations targeting W-SVGP (rather than L-SVGP) could offer both flexibility and performance, and remain an active direction.
Randomized matmul techniques may further reduce NG update costs.
Deep and complex models (e.g., large-scale Bayesian deep learning) may benefit from the reduced memory and improved runtime of R-SVGP.

Conclusion

The paper introduces the first practical, stable, inverse-free SVGP, R-SVGP, which achieves competitive statistical and computational results with matmul-only operations. Its integration into both standard and sophisticated SVGP architectures marks a significant advance in scalable Bayesian inference on modern hardware. Future work includes expanding flexibility, tuning for deep models, and exploiting extreme low-precision regimes for maximal speedup.
