Fast Gaussian Process Approximations

Updated 4 December 2025
  • Fast Gaussian Process Approximations are a class of techniques that reduce the standard O(n³) computational cost, enabling scalable analysis of large spatial, temporal, and multidimensional datasets.
  • They leverage methods such as multi-resolution/hierarchical basis functions, random Fourier features, and sparse graph-based models to balance accuracy and speed in inference.
  • These approaches underpin practical applications in spatial statistics, machine learning, and time series analysis, often reducing computation time by orders of magnitude.

Fast Gaussian process (GP) approximations comprise a broad class of computational methods designed to enable scalable inference and prediction with GPs in regimes where the standard $\mathcal{O}(n^3)$ cost of Cholesky-based exact methods is prohibitive. These techniques exploit structured representations, hierarchical decompositions, kernel approximations, and stochastic algorithms to reduce both time and memory complexity, allowing efficient application to large spatial, temporal, and multidimensional datasets. They are now foundational across spatial statistics, machine learning, and time series analysis.

1. Multi-Resolution and Hierarchical Basis Approximations

A prominent and flexible approach deploys multi-resolution or hierarchical basis function expansions. The multi-resolution approximation (MRA) framework constructs a GP approximation as a sum of basis expansions across multiple resolutions. The MRA recursively approximates the target process $y_0(\cdot)\sim\mathcal{GP}(0,C_0)$ by $y_M=\sum_{m=0}^M \tau_m$, where each $\tau_m(s)=b_m(s)'\eta_m$, with $\eta_m \sim N_{r_m}(0,\Lambda_m^{-1})$ and $b_m(s)$ a basis vector at resolution $m$ (Katzfuss et al., 2017). The basis construction leverages orthogonal decompositions and localized "knots" at each resolution, recursively updating the remainder covariance and imposing sparsity through either covariance tapering or domain blocking:

  • MRA-taper: Applies compactly supported tapers to each residual covariance $w_{m+1}$ with decreasing support, ensuring sparsity and continuity of the approximate process.
  • MRA-block: Recursively partitions the domain, yielding a block-diagonal structure in the precision with discontinuities at block boundaries.

Both specializations allow efficient assembly of the global prior and posterior, requiring only sparse-matrix Cholesky factorizations and triangular solves. Complexity is $\mathcal{O}(nM^2 r_0^3)$ (taper) with memory $\mathcal{O}(nMr_0)$, which is quasilinear in $n$ for $M=\mathcal{O}(\log n)$ and $r_0=\mathcal{O}(1)$.

Key theoretical properties include exact variance reproduction at knots, preservation of mean-square differentiability, and controlled locality-versus-smoothness trade-offs (Katzfuss et al., 2017). Empirically, MRA-block maximizes speed and sharp locality, while MRA-taper offers superior map smoothness.
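
The following NumPy sketch conveys the flavor of an MRA-block construction under simplifying assumptions: a single coarse resolution ($M = 1$), an exponential covariance, 15 knots, and 8 blocks are hypothetical choices for illustration, and the full recursive multi-resolution machinery of Katzfuss et al. (2017) is not reproduced here.

```python
import numpy as np

# Two-level sketch: resolution 0 is a knot-based (predictive-process) term,
# and the remainder covariance is kept only within blocks of a domain partition.
def exp_cov(a, b, ell=0.3):
    return np.exp(-np.abs(a[:, None] - b[None, :]) / ell)

rng = np.random.default_rng(0)
n = 500
s = np.sort(rng.uniform(0, 1, size=n))        # observation locations in [0, 1]

# Resolution 0: basis b_0(s) = C_0(s, Q_0), eta_0 ~ N(0, C_0(Q_0, Q_0)^{-1})
knots0 = np.linspace(0, 1, 15)
B0 = exp_cov(s, knots0)
K00 = exp_cov(knots0, knots0)
tau0_cov = B0 @ np.linalg.solve(K00, B0.T)    # covariance of tau_0

# Resolution 1: block-diagonal approximation of the remainder w_1 = C_0 - cov(tau_0)
C = exp_cov(s, s)
remainder = C - tau0_cov
approx = tau0_cov.copy()
for idx in np.array_split(np.arange(n), 8):   # partition the domain into 8 blocks
    approx[np.ix_(idx, idx)] += remainder[np.ix_(idx, idx)]

# The block structure of the remainder is what keeps factorizations sparse;
# here we only check how close the approximation is to the exact covariance.
print("max abs deviation from exact covariance:", np.abs(approx - C).max())
```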

2. Kernel Approximations via Spectral and Random Features

Spectral representations furnish efficient approximations for translation-invariant kernels and for complex structured kernels in latent force models. By expressing a positive-definite kernel $k(x,x')$ via a Mercer eigenexpansion or Bochner's theorem, one can obtain low-rank approximations:

  • Random Fourier Features (RFF): The EQ kernel is written as $k(\tau,\tau')=\int p(\lambda)\, e^{i\lambda(\tau-\tau')}\, d\lambda$, and the GP is approximated using randomly sampled frequencies and corresponding feature maps. In complex latent force kernels, convolution with Green's functions is solved analytically, yielding low-dimensional feature embeddings $v_d(t;\theta_d,\lambda)$ over time and output (Guarnizo et al., 2018).

Inference reduces to low-rank GP regression with kernel $K\approx \Phi_c\Phi_c^{\top}$, where $\Phi_c$ is the feature matrix. This dramatically reduces computational cost to $\mathcal{O}(NDQ^2 S^2)$ compared to $\mathcal{O}((ND)^3)$, with predictive performance approaching that of the exact GP as $S$ increases. Empirical results across LFM and multi-output GPs show speedups from $1.6\times$ (small problems) to $6\times$ (large $N$, complex kernels) with minimal loss in predictive accuracy (Guarnizo et al., 2018).
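
As a concrete illustration of the random-feature idea (for the plain EQ kernel only, not the latent-force feature maps $v_d$ of Guarnizo et al. (2018)), the sketch below fits a low-rank GP regression with $S$ random Fourier features; the lengthscale, noise level, and test function are hypothetical.

```python
import numpy as np

def rff_features(X, num_features, lengthscale, rng):
    # Random Fourier features for the EQ (squared-exponential) kernel
    # k(x, x') = exp(-||x - x'||^2 / (2 * lengthscale^2)): frequencies are
    # drawn from the kernel's spectral density, which is Gaussian.
    d = X.shape[1]
    W = rng.normal(scale=1.0 / lengthscale, size=(d, num_features))
    b = rng.uniform(0.0, 2 * np.pi, size=num_features)
    return np.sqrt(2.0 / num_features) * np.cos(X @ W + b)

rng = np.random.default_rng(0)
n, d, S = 2000, 2, 200
X = rng.uniform(-3, 3, size=(n, d))
f_true = np.sin(X[:, 0]) * np.cos(X[:, 1])
y = f_true + 0.1 * rng.normal(size=n)

Phi = rff_features(X, S, lengthscale=1.0, rng=rng)      # n x S feature matrix
noise_var = 0.1 ** 2
# Bayesian linear regression in feature space: O(n S^2 + S^3) instead of O(n^3).
w_mean = np.linalg.solve(Phi.T @ Phi + noise_var * np.eye(S), Phi.T @ y)
print("RMSE:", np.sqrt(np.mean((Phi @ w_mean - f_true) ** 2)))
```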

In the case of Hilbert-space (Mercer) expansions, as in fast approximate multi-output GPs (Joukov et al., 2020; Mukherjee et al., 22 May 2025), the kernel is approximated by its first $n$ eigenvalues/eigenfunctions, reducing all $N\times N$ kernel operations to $N\times n$ and $n\times n$, with predictive and training complexities $\mathcal{O}(Nn+n^3)$ (single-output) and $\mathcal{O}(NnD+n^3D+ND^2)$ (multi-output).
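
A minimal single-output sketch of the Hilbert-space idea is shown below, assuming Laplacian eigenfunctions on an interval $[-L, L]$ and the spectral density of a squared-exponential kernel; the boundary factor $L$, the number of basis functions, and the data are illustrative choices, and the multi-output extension is omitted.

```python
import numpy as np

def hs_basis(x, n_basis, L):
    # Laplacian eigenfunctions on [-L, L] with Dirichlet boundaries:
    # phi_j(x) = sqrt(1/L) sin(pi j (x + L) / (2L)), sqrt(lambda_j) = pi j / (2L).
    j = np.arange(1, n_basis + 1)
    sqrt_lam = np.pi * j / (2 * L)
    return np.sqrt(1.0 / L) * np.sin(sqrt_lam * (x[:, None] + L)), sqrt_lam

def se_spectral_density(omega, sigma=1.0, ell=0.5):
    # Spectral density of the 1-D squared-exponential kernel.
    return sigma ** 2 * np.sqrt(2 * np.pi) * ell * np.exp(-0.5 * (ell * omega) ** 2)

rng = np.random.default_rng(1)
N, n_basis, L, noise = 5000, 40, 4.0, 0.1
x = rng.uniform(-3, 3, size=N)
y = np.sin(2 * x) + noise * rng.normal(size=N)

Phi, sqrt_lam = hs_basis(x, n_basis, L)          # N x n feature matrix
S = se_spectral_density(sqrt_lam)                # approximate kernel eigenvalues
# Weight-space posterior: O(N n + n^3) instead of O(N^3).
w = np.linalg.solve(Phi.T @ Phi + noise ** 2 * np.diag(1.0 / S), Phi.T @ y)
x_test = np.linspace(-3, 3, 5)
Phi_test, _ = hs_basis(x_test, n_basis, L)
print("predictions:", Phi_test @ w)
```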

3. Graph-based, Sparse, and Block-Diagonal Covariance Structures

Sparse GP approximations exploit compactly supported kernels or conditional independence structures to obtain sparse precision or covariance matrices, facilitating scalable inference:

  • Compactly Supported Kernels: By constructing parametric families of kernels with compact spatial support (e.g., Wendland, polynomial, and Fourier-family kernels), one enforces sparsity in $K$, yielding $\mathcal{O}(N\log N)$ arithmetic and storage for inference and likelihood evaluation (Barber, 2020). Maximum-likelihood estimation and convex projection to approximate target kernels are both supported with rigorous control of estimation error, and empirical accuracy matches or exceeds standard kernels at sub-quadratic cost; a minimal sparse-kernel sketch follows this list.
  • Graph/DAG and Nearest-Neighbour GPs: Approximating the GP via a directed acyclic conditional dependence graph, with each node conditioned only on $q\ll n$ predecessors, reduces complexity to $\mathcal{O}(nq^3)$ for Cholesky factorization and $\mathcal{O}(nq^2)$ for solves (Hoffmann et al., 2023). These correspond to Vecchia or NNGP factorizations. FFT-based exact methods for grid-aligned, stationary kernels provide $\mathcal{O}(n\log n)$ solutions using circulant diagonalization.
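
A minimal sketch of the compact-support route is given below: it assembles a sparse covariance from a Wendland $C^2$ kernel via a KD-tree radius query and factorizes it with a sparse LU solver. The kernel family, support radius, and nugget are hypothetical, and a sparse Cholesky (e.g., CHOLMOD) would typically replace the generic LU factorization.

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import splu
from scipy.spatial import cKDTree

def wendland_c2(r, support):
    # Wendland C^2 correlation: (1 - t)^4 (4t + 1) for t = r/support < 1, else 0.
    t = r / support
    return np.where(t < 1.0, (1.0 - t) ** 4 * (4.0 * t + 1.0), 0.0)

rng = np.random.default_rng(2)
n, support, nugget = 5000, 0.03, 0.01
X = rng.uniform(0, 1, size=(n, 2))

# Only pairs within the compact support contribute nonzero covariance entries.
pairs = cKDTree(X).query_pairs(r=support, output_type="ndarray")
r = np.linalg.norm(X[pairs[:, 0]] - X[pairs[:, 1]], axis=1)
vals = wendland_c2(r, support)
rows = np.concatenate([pairs[:, 0], pairs[:, 1], np.arange(n)])
cols = np.concatenate([pairs[:, 1], pairs[:, 0], np.arange(n)])
data = np.concatenate([vals, vals, np.full(n, 1.0 + nugget)])   # unit diagonal + nugget
K = sparse.csc_matrix((data, (rows, cols)), shape=(n, n))
print("density of K:", K.nnz / n ** 2)

# A sparse factorization replaces the dense O(n^3) Cholesky; a supernodal sparse
# Cholesky would exploit symmetry better than this generic sparse LU.
alpha = splu(K).solve(rng.normal(size=n))
print("solved sparse system of size", n)
```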

4. Vecchia Approximations and Grouping

Vecchia-type approximations decompose the joint GP density into products of conditional densities with small conditioning sets. Orderings and grouping of these factors can yield dramatic accuracy and speed improvements:

  • Permutation and Grouping: Carefully selected orderings (e.g., maximum-minimum-distance, random, or middle-out) in the Vecchia factorization reduce the Kullback-Leibler divergence to the true GP by up to $100\times$ versus default coordinate orderings. Grouping increases effective neighbor sets, further reducing error while allowing blockwise parallel computation (Guinness, 2016).
  • Prediction and Joint Inference: Both theoretical analysis and empirical evidence show that response-first ordering and full conditioning deliver optimal trade-offs of linear complexity and joint predictive accuracy (Katzfuss et al., 2018). Grouped and space-filling orderings outperform covariance tapering and SPDE methods in both KL divergence and computational efficiency; a minimal Vecchia log-likelihood sketch follows this list.
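
The sketch below evaluates a Vecchia approximate log-likelihood with nearest-neighbour conditioning sets of size $q$ under whatever ordering the rows of the input arrive in; the exponential kernel, $q = 10$, and the brute-force neighbour search are simplifying, hypothetical choices (orderings and grouping as discussed above would be supplied upstream).

```python
import numpy as np

def exp_cov(A, B, ell=0.2, sigma2=1.0):
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    return sigma2 * np.exp(-d / ell)

def vecchia_loglik(X, y, q=10, ell=0.2, sigma2=1.0, nugget=1e-4):
    # Vecchia approximation: log p(y) ~ sum_i log p(y_i | y_{c(i)}), where c(i)
    # holds the q nearest previously-ordered points. The ordering is simply the
    # row order of X here; maximin or random orderings would be applied upstream.
    n = X.shape[0]
    ll = 0.0
    for i in range(n):
        if i == 0:
            mean, var = 0.0, sigma2 + nugget
        else:
            prev = np.arange(i)
            dist = np.linalg.norm(X[prev] - X[i], axis=1)     # brute-force search;
            c = prev[np.argsort(dist)[:q]]                    # a KD-tree scales better
            Kcc = exp_cov(X[c], X[c], ell, sigma2) + nugget * np.eye(len(c))
            kic = exp_cov(X[i:i + 1], X[c], ell, sigma2).ravel()
            w = np.linalg.solve(Kcc, kic)                     # O(q^3) per observation
            mean = w @ y[c]
            var = sigma2 + nugget - w @ kic
        ll += -0.5 * (np.log(2 * np.pi * var) + (y[i] - mean) ** 2 / var)
    return ll

rng = np.random.default_rng(3)
X = rng.uniform(0, 1, size=(400, 2))
y = rng.normal(size=400)
print("Vecchia log-likelihood:", vecchia_loglik(X, y, q=10))
```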

5. Iterative Linear Algebra and Log-Determinant Estimation

Iterative algorithms based on matrix-vector multiplications (MVMs) and stochastic estimators have become central tools for scalable GP learning:

  • Stochastic Lanczos Quadrature (SLQ) and Chebyshev Approximations: Log-determinants and their derivatives, essential for marginal likelihood optimization, are estimated using Lanczos/SLQ or Chebyshev polynomial expansions, with only $\mathcal{O}(n)$ cost per probe. Lanczos is superior for kernels with rapidly decaying spectra (e.g., RBF), while surrogate RBF interpolants allow real-time queries in low hyperparameter dimensions (Dong et al., 2017); a minimal SLQ sketch follows this list.
  • Preconditioning and Conjugate Gradient: For large systems arising in full-scale approximations (FSA) and Vecchia–Laplace approximations, matrix-free preconditioned conjugate gradient methods, especially with FITC-based or VADU-type preconditioners, accelerate convergence and mitigate spectral pathologies. These techniques yield $\mathcal{O}(nm^2)$ or $\mathcal{O}(n\log^2 n)$ complexity (with appropriate structure, e.g., hierarchical matrices in FIFA-GP (Moran et al., 2020)), and enable accurate inference at $n$ up to $10^5$–$10^6$ (Gyger et al., 23 May 2024; Kündig et al., 2023).
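
A minimal stochastic Lanczos quadrature sketch is given below: Rademacher probes and a short Lanczos recurrence estimate $\log\det K$ from matrix-vector products only. The probe count, Lanczos depth, and test kernel are hypothetical, and reorthogonalization and preconditioning, which production implementations rely on, are omitted.

```python
import numpy as np

def lanczos(matvec, v0, m):
    # m-step Lanczos three-term recurrence (no reorthogonalization), returning
    # the diagonal (alpha) and off-diagonal (beta) of the tridiagonal matrix T_m.
    n = v0.size
    alpha, beta = np.zeros(m), np.zeros(m - 1)
    q_prev = np.zeros(n)
    q = v0 / np.linalg.norm(v0)
    for j in range(m):
        w = matvec(q)
        alpha[j] = q @ w
        w = w - alpha[j] * q - (beta[j - 1] * q_prev if j > 0 else 0.0)
        if j < m - 1:
            beta[j] = np.linalg.norm(w)
            q_prev, q = q, w / beta[j]
    return alpha, beta

def slq_logdet(matvec, n, num_probes=20, m=50, rng=None):
    # Stochastic Lanczos quadrature: for Rademacher probes z, E[z' log(K) z] =
    # tr(log K) = log det K, and each quadratic form is read off the spectrum of T_m.
    rng = rng or np.random.default_rng()
    total = 0.0
    for _ in range(num_probes):
        z = rng.choice([-1.0, 1.0], size=n)
        alpha, beta = lanczos(matvec, z, m)
        T = np.diag(alpha) + np.diag(beta, 1) + np.diag(beta, -1)
        theta, U = np.linalg.eigh(T)
        total += (z @ z) * np.sum(U[0, :] ** 2 * np.log(theta))
    return total / num_probes

# Small dense check against the exact log-determinant.
rng = np.random.default_rng(4)
x = np.sort(rng.uniform(0, 10, size=1500))
K = np.exp(-np.abs(x[:, None] - x[None, :])) + 0.1 * np.eye(x.size)
print("SLQ estimate :", slq_logdet(lambda v: K @ v, x.size, rng=rng))
print("exact logdet :", np.linalg.slogdet(K)[1])
```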

6. Specialized Approximate Inference for Non-Gaussian Models

When likelihoods are non-Gaussian, fast approximate inference is crucial:

  • Polynomial (Chebyshev) and Laplace-Matching Approximations: For GP factor models with non-conjugate (e.g., Poisson, binomial) likelihoods, second-order polynomial approximations yield closed-form Gaussian posteriors, reducing the problem to tractable computations and providing accurate initializations for variational schemes (Keeley et al., 2019).
  • Laplace Matching (LM): LM schemes transform arbitrary exponential-family likelihoods to approximate conjugate Gaussian terms via basis change and Laplace expansion, furnishing closed-form, two-way mappings between natural parameters and Gaussian moments (Hobbhahn et al., 2021). Implementations incur only negligible overhead ($<0.1$ s for $n=10^5$), and efficient grouping yields natural inducing points; a Laplace-mode sketch for a Poisson likelihood follows this list.
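
As a simple point of reference for these Gaussian approximations (a plain Laplace approximation rather than the Laplace-matching construction itself), the sketch below finds the posterior mode of a GP with a Poisson likelihood by Newton iteration; the kernel, lengthscale, and data are illustrative.

```python
import numpy as np

# Newton iteration for the Laplace approximation of a GP with Poisson counts
# y_i ~ Poisson(exp(f_i)); the converged mode and curvature define a Gaussian
# approximation to the posterior over the latent f.
rng = np.random.default_rng(5)
n = 300
x = np.sort(rng.uniform(0, 5, size=n))
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / 0.5 ** 2) + 1e-5 * np.eye(n)
f_true = np.linalg.cholesky(K) @ rng.normal(size=n)
y = rng.poisson(np.exp(f_true))

f = np.zeros(n)
for _ in range(25):
    W = np.exp(f)                   # negative Hessian of the Poisson log-likelihood
    grad = y - np.exp(f)            # gradient of the Poisson log-likelihood
    # Newton step for the mode of log p(y|f) - 0.5 f' K^{-1} f, using
    # (K^{-1} + W)^{-1} b = K (I + W K)^{-1} b to avoid forming K^{-1}.
    f_new = K @ np.linalg.solve(np.eye(n) + W[:, None] * K, W * f + grad)
    if np.max(np.abs(f_new - f)) < 1e-8:
        f = f_new
        break
    f = f_new
print("mode RMSE vs latent truth:", np.sqrt(np.mean((f - f_true) ** 2)))
```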

7. Applications to Structured and Autocorrelated Data

Fast approximations must adapt when data exhibit temporal or spatial autocorrelation:

  • Thinning/Blocking: For autocorrelated time-series data, "thinning" partitions data into blocks at lags determined by the partial autocorrelation function, ensuring de-correlation within each block. Fast methods (twinGP, laGP, scaled Vecchia) are applied within each block for the spatial component, and a separate local GP models the temporal residual structure. This strategy prevents temporal overfitting and offers 10–100$\times$ computational speedups over full spatio-temporal GPs, with predictive accuracy maintained (Chokhachian et al., 2 Dec 2025); a minimal thinning sketch follows below.
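
The sketch below illustrates the thinning idea on a synthetic AR(1) series: a decorrelation lag is estimated and the series is split into interleaved index blocks whose members are that many steps apart. The sample autocorrelation is used here as a simplified stand-in for the partial-autocorrelation criterion of the original method, and the lag threshold is a hypothetical choice.

```python
import numpy as np

def decorrelation_lag(y, threshold=0.1, max_lag=200):
    # Smallest lag at which the sample autocorrelation drops below `threshold`
    # (a simplified stand-in for the PACF-based rule in the original method).
    y = y - y.mean()
    denom = y @ y
    for lag in range(1, max_lag):
        if abs((y[:-lag] @ y[lag:]) / denom) < threshold:
            return lag
    return max_lag

rng = np.random.default_rng(6)
T = 5000
eps = rng.normal(size=T)
y = np.empty(T)
y[0] = eps[0]
for t in range(1, T):                    # AR(1) series with strong autocorrelation
    y[t] = 0.9 * y[t - 1] + eps[t]

L = decorrelation_lag(y)
# Interleaved blocks: members of a block are L steps apart, so they are
# approximately decorrelated; a fast spatial GP would then be fit per block.
blocks = [np.arange(start, T, L) for start in range(L)]
print("decorrelation lag:", L, "| number of blocks:", len(blocks))
```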

Empirical studies confirm that, across diverse datasets (e.g., atmospheric, motion-capture, satellite, audio), these methods maintain or improve prediction and uncertainty quantification relative to classic GPs at orders-of-magnitude reduced cost.


Summary Table: Fast Gaussian Process Approximation Methods

| Method / Framework | Core Idea / Mechanism | Computational Scaling |
|---|---|---|
| MRA (multi-resolution) (Katzfuss et al., 2017) | Hierarchical basis, taper/block, sparse | $\mathcal{O}(nM^2 r_0^3)$, $M\sim\log n$ |
| RFF & Spectral (Guarnizo et al., 2018) | Random features, low-rank kernel | $\mathcal{O}(NDQ^2S^2)$, $S\ll N$ |
| Compact kernels (Barber, 2020) | Parametric compactly supported kernels | $\mathcal{O}(N\log N)$, sparse |
| Graph/Vecchia DAG (Hoffmann et al., 2023; Katzfuss et al., 2018) | Sparse DAG, nearest-neighbor conditioning | $\mathcal{O}(nq^3)$ |
| SLQ/Lanczos (Dong et al., 2017; Kündig et al., 2023) | Stochastic log-determinant, MVM-based | $\mathcal{O}(n)$ per probe |
| H-matrix/FIFA-GP (Moran et al., 2020) | Hierarchical low-rank matrix | $\mathcal{O}(n\log^2 n)$ (1D) |
| Thinning/Blocking (Chokhachian et al., 2 Dec 2025) | Block decorrelation for autocorrelated data | Linear in $n$ (blockwise) |

These advances furnish a toolkit for GP inference and learning at massive scale, supporting flexible modeling of spatial, spatio-temporal, and high-dimensional datasets, as well as arbitrary non-Gaussian likelihoods. With ongoing developments in iterative methods, preconditioning, and structured kernel design, fast GP approximations continue to extend the boundary of feasible probabilistic modeling.
