KISS-GP: Scalable Structured Gaussian Processes

Updated 20 February 2026
  • The paper introduces KISS-GP, a framework that replaces costly covariance computations with efficient structured kernel interpolation.
  • It achieves linear or near-linear computational complexity by using sparse inducing points on structured grids and local interpolation.
  • The method supports fast online prediction and hyperparameter learning with rigorously quantified error bounds and extensions for high dimensions.

Kernel Interpolation for Scalable Structured Gaussian Processes (KISS-GP) is a scalable framework for Gaussian process (GP) modeling that combines sparse inducing-point approximations with structured matrix algebra to enable fast and accurate GP inference and learning. KISS-GP, instantiated via Structured Kernel Interpolation (SKI), serves as a unifying approach that generalizes inducing-point methods by replacing cost-prohibitive covariance computations with highly efficient kernel interpolation schemes. This method achieves linear or near-linear complexity in the training set size for low to moderate input dimensions and remains state-of-the-art for large-scale, uncertainty-aware regression tasks.

1. Core Formulation and Interpolation Structure

KISS-GP approximates the exact GP covariance matrix $K \in \mathbb{R}^{n\times n}$, defined by $[K]_{ij} = k(x_i, x_j)$ for data $\{x_i\}$, using a set of $m \ll n$ inducing points $U = \{u_j\}$, typically placed on a regular multidimensional grid. The approximation is given by

$$K_{\text{SKI}} = W K_{U,U} W^\top,$$

where $K_{U,U}$ is the $m \times m$ Gram matrix computed on the grid and $W$ is a sparse $n \times m$ interpolation matrix. For each datapoint $x$, the weight vector $w(x)$ contains at most $L$ nonzeros determined by a local interpolation scheme, such as Keys’ cubic convolution (with $L = 4$ in each dimension). In low dimensions, the grid structure of $U$ induces Toeplitz or Kronecker algebra in $K_{U,U}$.

For each $x \in \mathbb{R}^d$, $w(x)$ contains local interpolation weights computed along each axis, and $K_{\text{SKI}}$ approximates $k(x, x')$ for any pair $(x, x')$. This generalizes classic inducing-point Nyström or SoR/FITC approximations, but with strictly local, highly sparse interpolation and structured grids to enable fast algebraic computations (Wilson et al., 2015).

2. Computational and Algorithmic Features

Matrix-Vector Product Complexity

Given the sparsity of $W$ (each row has $O(L)$ nonzeros) and the structured algebra for $K_{U,U}$ (Toeplitz in 1D, Kronecker in higher dimensions), a matrix-vector multiply with $K_{\text{SKI}}$ costs

$$O(nL + m \log m)$$

with further reduction possible if Kronecker or circulant embeddings are used. Solving systems involving $K_{\text{SKI}} + \sigma^2 I$ via iterative methods such as conjugate gradients achieves overall linear scaling in $n$, provided $m$ grows sublinearly with $n$ (Wilson et al., 2015).
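
To make the $m \log m$ term concrete, here is a minimal sketch of the standard circulant-embedding trick for fast Toeplitz MVMs, assuming a symmetric Toeplitz $K_{U,U}$ as arises for a stationary kernel on a regular 1D grid; it is checked against a dense multiply.

```python
import numpy as np
from scipy.linalg import toeplitz

def toeplitz_mvm(c, v):
    """Multiply the symmetric Toeplitz matrix with first column c by v
    in O(m log m): embed it in a circulant matrix and use the FFT."""
    m = len(c)
    emb = np.concatenate([c, c[-2:0:-1]])   # first column of the (2m-2)-circulant
    v_pad = np.concatenate([v, np.zeros(m - 2)])
    out = np.fft.ifft(np.fft.fft(emb) * np.fft.fft(v_pad)).real
    return out[:m]

# K_UU for a stationary kernel on a regular 1D grid is Toeplitz:
grid = np.linspace(0.0, 1.0, 512)
c = np.exp(-0.5 * (grid - grid[0]) ** 2 / 0.1**2)   # first column of K_UU
v = np.random.default_rng(1).normal(size=512)

fast = toeplitz_mvm(c, v)
dense = toeplitz(c) @ v                      # O(m^2) reference
print(np.max(np.abs(fast - dense)))
```

The circulant matrix diagonalizes under the DFT, so the whole multiply is three FFTs of length $2m-2$.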

Exact Inference with Interpolation Approximation

Inference and hyperparameter learning proceed by solving linear systems in the SKI-approximated matrix, evaluating the log-determinant (using FFTs or Kronecker algebra), and optimizing the marginal likelihood (Wilson et al., 2015). In streaming or online settings, the Woodbury identity allows SKI to be updated for new data in $O(1)$ time per step (with respect to $n$), updating only small $m \times m$ statistics and caches. The SKI approximation incurs interpolation error but no further variational or algebraic bias (Stanton et al., 2021).
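
A sketch of the solve step, with the SKI matrix applied only through MVMs and never formed densely: linear interpolation weights stand in for the cubic scheme, and all sizes and hyperparameters below are illustrative.

```python
import numpy as np
from scipy.linalg import toeplitz
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import LinearOperator, cg

rng = np.random.default_rng(0)
n, m, sigma2 = 500, 100, 0.1

# Linear-interpolation weights onto a regular grid (2 nonzeros per row;
# a stand-in for the cubic scheme, which would have 4).
grid = np.linspace(0.0, 1.0, m)
x = rng.uniform(0.0, 1.0, n)
h = grid[1] - grid[0]
j = np.minimum((x / h).astype(int), m - 2)
t = (x - grid[j]) / h
W = csr_matrix((np.concatenate([1 - t, t]),
                (np.tile(np.arange(n), 2), np.concatenate([j, j + 1]))),
               shape=(n, m))

K_uu = toeplitz(np.exp(-0.5 * (grid - grid[0]) ** 2 / 0.2**2))

def mvm(v):
    # (W K_UU W^T + sigma^2 I) v, applying each factor in turn
    return W @ (K_uu @ (W.T @ v)) + sigma2 * v

A = LinearOperator((n, n), matvec=mvm)
y = rng.normal(size=n)
alpha, info = cg(A, y)          # info == 0 on convergence
```

Each CG iteration costs one MVM, i.e. $O(nL + m^2)$ here (or $O(nL + m\log m)$ with the Toeplitz FFT trick from the previous subsection).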

Test-Time Prediction

After training, the predictive mean and variance at a test point $x_*$ are evaluated using only the $m$-dimensional vector $w(x_*)$:

$$\mu(x_*) = w(x_*)^\top K_{U,U} W^\top \alpha, \quad v(x_*) = k(x_*, x_*) - w(x_*)^\top \nu_U w(x_*),$$

where the precomputed terms $\alpha$ and $\nu_U$ require only $O(m)$ storage or less, yielding $O(1)$ test-time complexity per point (Wilson et al., 2015, Bishnoi et al., 2020).
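
The caching behind the $O(1)$ mean prediction can be illustrated directly from the formula. This is a toy sketch: `W`, `alpha`, and the dense stand-in weights below are placeholders for quantities a real training run would produce.

```python
import numpy as np

# Hypothetical training-time quantities (placeholders for a real run):
#   grid, K_uu : inducing points and their m x m Gram matrix
#   W          : n x m training interpolation matrix
#   alpha      : (K_SKI + sigma^2 I)^{-1} y
rng = np.random.default_rng(0)
n, m = 1000, 50
grid = np.linspace(0.0, 1.0, m)
K_uu = np.exp(-0.5 * (grid[:, None] - grid[None, :]) ** 2 / 0.2**2)
W = rng.random((n, m)); W /= W.sum(axis=1, keepdims=True)  # stand-in weights
alpha = rng.normal(size=n)

d = K_uu @ (W.T @ alpha)    # cached once: O(nL + m^2), independent of test size

def predict_mean(w_star):
    return w_star @ d       # O(L) per test point: touches only the nonzeros

w_star = np.zeros(m); w_star[10:12] = 0.5   # sparse interpolation vector
mu = predict_mean(w_star)
# agrees with the unfactored expression w(x_*)^T K_UU W^T alpha
print(np.isclose(mu, w_star @ (K_uu @ (W.T @ alpha))))
```

Once the $m$-vector `d` is cached, no quantity of size $n$ is touched at test time.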

3. Error Bounds, Scalability, and the "Price of Linear Time"

Rigorous error analyses demonstrate that for cubic convolution interpolation with grid size $m$, the element-wise kernel error decays as $O(m^{-3/d})$ and the Gram matrix spectral-norm error as $O(n m^{-3/d})$. To guarantee uniform spectral-norm error $\leq \epsilon$, $m$ must scale as $n^{d/3}/\epsilon^{d/3}$. This leads to two computational regimes (Moreno et al., 1 Feb 2025):

  • For $d \leq 3$, any fixed error tolerance allows $m = O(n^{d/3})$, so $K_{\text{SKI}}$ can be computed in $O(n)$ time for all $n$.
  • For $d > 3$, $m = O(n^{d/3})$ grows faster than $n$, and one must accept increasing error to retain strict linear-time scaling.

These theoretical results establish precise trade-offs between approximation error, grid size, and computational complexity, and quantify downstream effects on hyperparameter gradients and posterior predictions (Moreno et al., 1 Feb 2025).
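
The scaling law can be turned into a back-of-the-envelope grid-size calculator (dropping constants and the domain-diameter factor from the bound):

```python
import math

def grid_size(n, d, eps):
    """Grid size for spectral-norm error <= eps with cubic interpolation,
    per the O(n m^{-3/d}) bound: m ~ (n / eps)^{d/3} (constants dropped)."""
    return math.ceil((n / eps) ** (d / 3))

for d in (1, 2, 3, 4, 5):
    print(d, grid_size(10**6, d, 0.1))   # m <= n only for d <= 3
```

At $n = 10^6$ and $\epsilon = 0.1$, the required $m$ stays below $n$ for $d \leq 3$ but exceeds it for $d = 4$, which is exactly the regime boundary described above.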

4. Extensions to High-Dimensional and Structured Inputs

The canonical Cartesian grid in KISS-GP leads to exponential scaling in $d$, limiting scalability in high dimensions. Several advances mitigate this:

  • Sparse Grids: Sparse grid constructions grow the number of inducing points only polynomially, $|G_{\ell,d}| = O(2^\ell \ell^{d-1})$, and fast matrix-vector multiplication algorithms exploit the structure for near-linear cost in $m$, restoring scalability up to $d \sim 10$ (Yadav et al., 2023).
  • Simplicial (Permutohedral) Lattices: Simplex-GP places inducing points on the permutohedral lattice, interpolating each data point from only $d+1$ simplex corners, replacing the $2^d$ cost with $O(d^2)$ per MVM and reducing memory requirements dramatically (Kapoor et al., 2021).
  • Kernel Product Decomposition: For product kernels, SKIP expresses $K_{XX}$ as a Hadamard product of $d$ one-dimensional SKI matrices, using fast low-rank merges and MVMs to avoid exponential grid size, giving $O(d(n + m \log m))$ scaling (Gardner et al., 2018).

These variants maintain the interpolation-Kronecker paradigm while achieving tractability for moderate-to-high dimensional GPs.
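
The identity behind SKIP, that a product kernel's Gram matrix equals the Hadamard product of the per-dimension Gram matrices, is easy to verify numerically. Below is a standalone check with a unit-lengthscale RBF kernel; SKIP then replaces each 1D factor with its own SKI approximation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 4
X = rng.normal(size=(n, d))

# Full d-dimensional RBF Gram matrix (unit lengthscale) ...
K_full = np.exp(-0.5 * ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))

# ... equals the Hadamard product of the per-dimension 1D Gram matrices;
# SKIP approximates each 1D factor with its own SKI decomposition.
K_prod = np.ones((n, n))
for k in range(d):
    diff = X[:, k][:, None] - X[:, k][None, :]
    K_prod *= np.exp(-0.5 * diff**2)

print(np.max(np.abs(K_full - K_prod)))   # ~ machine precision
```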

5. Implementation and Empirical Performance

KISS-GP is implemented in libraries such as GPyTorch, leveraging efficient GPU routines for structured algebra and sparse interpolation. It supports:

  • Fast training on very large datasets (e.g., $n > 10^8$ with $m \sim 10^5$) with RBF and other stationary kernels (Wilson et al., 2015, Bishnoi et al., 2020).
  • Accurate GP regression and kernel learning, matching or exceeding inducing-point and variational methods in practical benchmarks, with correct uncertainty quantification.
  • Applications such as glass property prediction ($d = 37$, $n \sim 10^5$) (Bishnoi et al., 2020), large-scale 3D magnetic field mapping (Menzen et al., 2023), and rapid online Bayesian optimization (Stanton et al., 2021).
  • Actual wall-clock speedups by orders of magnitude compared to Cholesky-based GP or even standard inducing-point methods, with runtime and memory matching theoretical guarantees.

Empirical results confirm the flexibility of interpolation (cubic, linear, simplicial), robustness to hyperparameter optimization, and suitability for both batch and online learning contexts.

6. Practical Parameter Choices and Usage

  • Grid selection: For cubic convolution interpolation on bounded domains, set $m \sim n^{d/3}$ for a fixed error tolerance (more precisely, $m \gtrsim [(n D^3)/\epsilon]^{d/3}$, where $D$ is the diameter of the data range) (Moreno et al., 1 Feb 2025).
  • Interpolation rule: Use cubic (Keys’) interpolation in 1–3D, simplicial interpolation for higher $d$, or product/decomposed interpolation as appropriate; adapt the grid resolution to the kernel lengthscale.
  • Structured algebra: Always exploit Toeplitz, Kronecker, or BTTB structure in $K_{U,U}$; use circulant embeddings and FFTs to accelerate log-determinants and MVMs when possible.
  • Hyperparameter learning: Optimize the SKI-approximate marginal likelihood by gradient methods; the gradient error is controlled as $m$ increases (Moreno et al., 1 Feb 2025).
  • Online/streaming: Maintain only $O(m^2)$ matrices and apply Woodbury rank-one updates for strictly constant-time incorporation of new data (Stanton et al., 2021).
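
As one concrete instance of the structured-algebra advice, the log-determinant of a Kronecker-structured grid Gram matrix $K_{U,U} = K_1 \otimes K_2$ follows from the small factors alone, via $\log\det(A \otimes B) = q \log\det A + p \log\det B$ for $A \in \mathbb{R}^{p\times p}$, $B \in \mathbb{R}^{q\times q}$. A minimal sketch with random SPD stand-in factors:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_spd(k):
    G = rng.normal(size=(k, k))
    return G @ G.T + k * np.eye(k)

A, B = random_spd(6), random_spd(9)      # per-axis Gram factors K_1, K_2
p, q = A.shape[0], B.shape[0]

# logdet of the grid Gram K_UU = K_1 (x) K_2 from the small factors alone
fast = q * np.linalg.slogdet(A)[1] + p * np.linalg.slogdet(B)[1]

dense = np.linalg.slogdet(np.kron(A, B))[1]   # O((pq)^3) reference
print(abs(fast - dense))
```

The same identity extends factor by factor to $d$-dimensional grids, so the log-determinant term in the marginal likelihood never requires forming the full $m \times m$ matrix.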

7. Limitations, Theoretical Insights, and Future Directions

The fundamental limitation is the curse of dimensionality: Cartesian grids and polynomial interpolation induce cost and storage exponential in $d$. Recent research has focused on sparse grids, low-rank product decompositions, and simplicial lattices to circumvent this (Kapoor et al., 2021, Yadav et al., 2023, Gardner et al., 2018). Nonetheless, the interpolation error decays only algebraically with $m$, and high-$d$ regimes intrinsically require loosening error tolerances, motivating active research on more expressive kernel representations, adaptive grids, and automatic dimension reduction. The rigorous error analysis quantifies the “price of linear time” and the precise scaling laws necessary to achieve linearly scalable, accurate GP inference in structured and unstructured domains (Moreno et al., 1 Feb 2025).
