KISS-GP: Scalable Structured Gaussian Processes

Updated 20 February 2026
  • The paper introduces KISS-GP, a framework that replaces costly covariance computations with efficient structured kernel interpolation.
  • It achieves linear or near-linear computational complexity by using sparse inducing points on structured grids and local interpolation.
  • The method supports fast online prediction and hyperparameter learning with rigorously quantified error bounds and extensions for high dimensions.

Kernel Interpolation for Scalable Structured Gaussian Processes (KISS-GP) is a scalable framework for Gaussian process (GP) modeling that combines sparse inducing-point approximations with structured matrix algebra to enable fast and accurate GP inference and learning. KISS-GP, instantiated via Structured Kernel Interpolation (SKI), serves as a unifying approach that generalizes inducing-point methods by replacing cost-prohibitive covariance computations with highly efficient kernel interpolation schemes. This method achieves linear or near-linear complexity in the training set size for low to moderate input dimensions and remains state-of-the-art for large-scale, uncertainty-aware regression tasks.

1. Core Formulation and Interpolation Structure

KISS-GP approximates the exact GP covariance matrix $K \in \mathbb{R}^{n\times n}$, defined by $[K]_{ij} = k(x_i, x_j)$ for data $\{x_i\}$, using a set of $m \ll n$ inducing points $U = \{u_j\}$, typically placed on a regular multidimensional grid. The approximation is given by

$$K_{\text{SKI}} = W K_{U,U} W^\top,$$

where $K_{U,U}$ is the $m \times m$ Gram matrix computed on the grid and $W$ is a sparse $n \times m$ interpolation matrix. For each datapoint $x$, the weight vector $w(x)$ contains at most $L$ nonzeros determined by a local interpolation scheme, such as Keys’ cubic convolution (with $L = 4$ in each dimension). In low dimensions, the grid structure of $U$ induces Toeplitz or Kronecker algebra in $K_{U,U}$.

For each $x \in \mathbb{R}^d$, $w(x)$ contains local interpolation weights computed along each axis, and $K_{\text{SKI}}$ approximates $k(x, x')$ for any pair $(x, x')$. This generalizes classic inducing-point Nyström or SoR/FITC approximations, but with strictly local, highly sparse interpolation and structured grids to enable fast algebraic computations (Wilson et al., 2015).

2. Computational and Algorithmic Features

Matrix-Vector Product Complexity

Given the sparsity of $W$ (each row has $O(L)$ nonzeros) and the structured algebra for $K_{U,U}$ (Toeplitz in 1D, Kronecker in higher dimensions), a matrix-vector multiply with $K_{\text{SKI}}$ costs

$$O(nL + m \log m)$$

with further reduction possible if Kronecker or circulant embeddings are used. Solving systems involving $K_{\text{SKI}} + \sigma^2 I$ via iterative methods such as conjugate gradients achieves overall linear scaling in $n$, provided $m$ grows sublinearly with $n$ (Wilson et al., 2015).
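
To make the $m \log m$ term concrete, here is a minimal sketch of the standard circulant-embedding trick for fast Toeplitz MVMs, assuming a symmetric Toeplitz $K_{U,U}$ as arises for a stationary kernel on a regular 1D grid; it is checked against a dense multiply.

```python
import numpy as np
from scipy.linalg import toeplitz

def toeplitz_mvm(c, v):
    """Multiply the symmetric Toeplitz matrix with first column c by v
    in O(m log m): embed it in a circulant matrix and use the FFT."""
    m = len(c)
    emb = np.concatenate([c, c[-2:0:-1]])   # first column of the (2m-2)-circulant
    v_pad = np.concatenate([v, np.zeros(m - 2)])
    out = np.fft.ifft(np.fft.fft(emb) * np.fft.fft(v_pad)).real
    return out[:m]

# K_UU for a stationary kernel on a regular 1D grid is Toeplitz:
grid = np.linspace(0.0, 1.0, 512)
c = np.exp(-0.5 * (grid - grid[0]) ** 2 / 0.1**2)   # first column of K_UU
v = np.random.default_rng(1).normal(size=512)

fast = toeplitz_mvm(c, v)
dense = toeplitz(c) @ v                      # O(m^2) reference
print(np.max(np.abs(fast - dense)))
```

The circulant matrix diagonalizes under the DFT, so the whole multiply is three FFTs of length $2m-2$.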

Exact Inference with Interpolation Approximation

Inference and hyperparameter learning proceed by solving linear systems in the SKI-approximated matrix, evaluating the log-determinant (using FFTs or Kronecker algebra), and optimizing the marginal likelihood (Wilson et al., 2015). In streaming or online settings, the Woodbury identity allows SKI to be updated for new data in $O(1)$ time per step (with respect to $n$), updating only small $m \times m$ statistics and caches. The SKI approximation incurs interpolation error but no further variational or algebraic bias (Stanton et al., 2021).
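
A sketch of the solve step, with the SKI matrix applied only through MVMs and never formed densely: linear interpolation weights stand in for the cubic scheme, and all sizes and hyperparameters below are illustrative.

```python
import numpy as np
from scipy.linalg import toeplitz
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import LinearOperator, cg

rng = np.random.default_rng(0)
n, m, sigma2 = 500, 100, 0.1

# Linear-interpolation weights onto a regular grid (2 nonzeros per row;
# a stand-in for the cubic scheme, which would have 4).
grid = np.linspace(0.0, 1.0, m)
x = rng.uniform(0.0, 1.0, n)
h = grid[1] - grid[0]
j = np.minimum((x / h).astype(int), m - 2)
t = (x - grid[j]) / h
W = csr_matrix((np.concatenate([1 - t, t]),
                (np.tile(np.arange(n), 2), np.concatenate([j, j + 1]))),
               shape=(n, m))

K_uu = toeplitz(np.exp(-0.5 * (grid - grid[0]) ** 2 / 0.2**2))

def mvm(v):
    # (W K_UU W^T + sigma^2 I) v, applying each factor in turn
    return W @ (K_uu @ (W.T @ v)) + sigma2 * v

A = LinearOperator((n, n), matvec=mvm)
y = rng.normal(size=n)
alpha, info = cg(A, y)          # info == 0 on convergence
```

Each CG iteration costs one MVM, i.e. $O(nL + m^2)$ here (or $O(nL + m\log m)$ with the Toeplitz FFT trick from the previous subsection).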

Test-Time Prediction

After training, the predictive mean and variance at a test point $x_*$ are evaluated using only the $m$-dimensional vector $w(x_*)$:

$$\mu(x_*) = w(x_*)^\top K_{U,U} W^\top \alpha, \quad v(x_*) = k(x_*, x_*) - w(x_*)^\top \nu_U w(x_*),$$

where the precomputed terms $\alpha$ and $\nu_U$ require only $O(m)$ storage or less, yielding $O(1)$ test-time complexity per point (Wilson et al., 2015, Bishnoi et al., 2020).
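
The caching behind the $O(1)$ mean prediction can be illustrated directly from the formula. This is a toy sketch: `W`, `alpha`, and the dense stand-in weights below are placeholders for quantities a real training run would produce.

```python
import numpy as np

# Hypothetical training-time quantities (placeholders for a real run):
#   grid, K_uu : inducing points and their m x m Gram matrix
#   W          : n x m training interpolation matrix
#   alpha      : (K_SKI + sigma^2 I)^{-1} y
rng = np.random.default_rng(0)
n, m = 1000, 50
grid = np.linspace(0.0, 1.0, m)
K_uu = np.exp(-0.5 * (grid[:, None] - grid[None, :]) ** 2 / 0.2**2)
W = rng.random((n, m)); W /= W.sum(axis=1, keepdims=True)  # stand-in weights
alpha = rng.normal(size=n)

d = K_uu @ (W.T @ alpha)    # cached once: O(nL + m^2), independent of test size

def predict_mean(w_star):
    return w_star @ d       # O(L) per test point: touches only the nonzeros

w_star = np.zeros(m); w_star[10:12] = 0.5   # sparse interpolation vector
mu = predict_mean(w_star)
# agrees with the unfactored expression w(x_*)^T K_UU W^T alpha
print(np.isclose(mu, w_star @ (K_uu @ (W.T @ alpha))))
```

Once the $m$-vector `d` is cached, no quantity of size $n$ is touched at test time.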

3. Error Bounds, Scalability, and the "Price of Linear Time"

Rigorous error analyses demonstrate that for cubic convolution interpolation with grid size $m$, the element-wise kernel error decays as $O(m^{-3/d})$ and the Gram matrix spectral-norm error as $O(n m^{-3/d})$. To guarantee uniform spectral-norm error $\leq \epsilon$, $m$ must scale as $n^{d/3}/\epsilon^{d/3}$. This leads to two computational regimes (Moreno et al., 1 Feb 2025):

  • For $d \leq 3$, any fixed error tolerance allows $m = O(n^{d/3})$, so $K_{\text{SKI}}$ can be computed in $O(n)$ time for all $n$.
  • For $d > 3$, $m = O(n^{d/3})$ grows faster than $n$, and one must accept increasing error to retain strict linear-time scaling.

These theoretical results establish precise trade-offs between approximation error, grid size, and computational complexity, and quantify downstream effects on hyperparameter gradients and posterior predictions (Moreno et al., 1 Feb 2025).
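
The scaling law can be turned into a back-of-the-envelope grid-size calculator (dropping constants and the domain-diameter factor from the bound):

```python
import math

def grid_size(n, d, eps):
    """Grid size for spectral-norm error <= eps with cubic interpolation,
    per the O(n m^{-3/d}) bound: m ~ (n / eps)^{d/3} (constants dropped)."""
    return math.ceil((n / eps) ** (d / 3))

for d in (1, 2, 3, 4, 5):
    print(d, grid_size(10**6, d, 0.1))   # m <= n only for d <= 3
```

At $n = 10^6$ and $\epsilon = 0.1$, the required $m$ stays below $n$ for $d \leq 3$ but exceeds it for $d = 4$, which is exactly the regime boundary described above.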

4. Extensions to High-Dimensional and Structured Inputs

The canonical Cartesian grid in KISS-GP leads to exponential scaling in $d$, limiting scalability in high dimensions. Several advances mitigate this:

  • Sparse Grids: Sparse grid constructions grow the number of inducing points only polynomially, $|G_{\ell,d}| = O(2^\ell \ell^{d-1})$, and fast matrix-vector multiplication algorithms exploit the structure for near-linear cost in $m$, restoring scalability up to $d \sim 10$ (Yadav et al., 2023).
  • Simplicial (Permutohedral) Lattices: Simplex-GP places inducing points on the permutohedral lattice, interpolating each data point from only $d+1$ simplex corners, replacing the $2^d$ cost with $O(d^2)$ per MVM and reducing memory requirements dramatically (Kapoor et al., 2021).
  • Kernel Product Decomposition: For product kernels, SKIP expresses $K_{XX}$ as a Hadamard product of $d$ one-dimensional SKI matrices, using fast low-rank merges and MVMs to avoid exponential grid size, giving $O(d(n + m \log m))$ scaling (Gardner et al., 2018).

These variants maintain the interpolation-Kronecker paradigm while achieving tractability for moderate-to-high dimensional GPs.
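
The identity behind SKIP, that a product kernel's Gram matrix equals the Hadamard product of the per-dimension Gram matrices, is easy to verify numerically. Below is a standalone check with a unit-lengthscale RBF kernel; SKIP then replaces each 1D factor with its own SKI approximation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 4
X = rng.normal(size=(n, d))

# Full d-dimensional RBF Gram matrix (unit lengthscale) ...
K_full = np.exp(-0.5 * ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))

# ... equals the Hadamard product of the per-dimension 1D Gram matrices;
# SKIP approximates each 1D factor with its own SKI decomposition.
K_prod = np.ones((n, n))
for k in range(d):
    diff = X[:, k][:, None] - X[:, k][None, :]
    K_prod *= np.exp(-0.5 * diff**2)

print(np.max(np.abs(K_full - K_prod)))   # ~ machine precision
```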

5. Implementation and Empirical Performance

KISS-GP is implemented in libraries such as GPyTorch, leveraging efficient GPU routines for structured algebra and sparse interpolation. It supports:

  • Fast training on very large datasets (e.g., $n > 10^8$ with $m \sim 10^5$) with RBF and other stationary kernels (Wilson et al., 2015, Bishnoi et al., 2020).
  • Accurate GP regression and kernel learning, matching or exceeding inducing-point and variational methods in practical benchmarks, with correct uncertainty quantification.
  • Applications such as glass property prediction ($d = 37$, $n \sim 10^5$) (Bishnoi et al., 2020), large-scale 3D magnetic field mapping (Menzen et al., 2023), and rapid online Bayesian optimization (Stanton et al., 2021).
  • Actual wall-clock speedups by orders of magnitude compared to Cholesky-based GP or even standard inducing-point methods, with runtime and memory matching theoretical guarantees.

Empirical results confirm the flexibility of interpolation (cubic, linear, simplicial), robustness to hyperparameter optimization, and suitability for both batch and online learning contexts.

6. Practical Parameter Choices and Usage

  • Grid selection: For cubic convolution interpolation on bounded domains, set $m \sim n^{d/3}$ for a fixed error tolerance (more precisely, $m \gtrsim [(n D^3)/\epsilon]^{d/3}$, where $D$ is the diameter of the data range) (Moreno et al., 1 Feb 2025).
  • Interpolation rule: Use cubic (Keys’) interpolation in 1–3D, simplicial interpolation for higher $d$, or product/decomposed interpolation as appropriate; adapt the grid resolution to the kernel lengthscale.
  • Structured algebra: Always exploit Toeplitz, Kronecker, or BTTB structure in $K_{U,U}$; use circulant embeddings and FFTs to accelerate log-determinants and MVMs when possible.
  • Hyperparameter learning: Optimize the SKI-approximate marginal likelihood by gradient methods; the gradient error is controlled as $m$ increases (Moreno et al., 1 Feb 2025).
  • Online/streaming: Maintain only $O(m^2)$ matrices and apply Woodbury rank-one updates for strictly constant-time incorporation of new data (Stanton et al., 2021).
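
As one concrete instance of the structured-algebra advice, the log-determinant of a Kronecker-structured grid Gram matrix $K_{U,U} = K_1 \otimes K_2$ follows from the small factors alone, via $\log\det(A \otimes B) = q \log\det A + p \log\det B$ for $A \in \mathbb{R}^{p\times p}$, $B \in \mathbb{R}^{q\times q}$. A minimal sketch with random SPD stand-in factors:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_spd(k):
    G = rng.normal(size=(k, k))
    return G @ G.T + k * np.eye(k)

A, B = random_spd(6), random_spd(9)      # per-axis Gram factors K_1, K_2
p, q = A.shape[0], B.shape[0]

# logdet of the grid Gram K_UU = K_1 (x) K_2 from the small factors alone
fast = q * np.linalg.slogdet(A)[1] + p * np.linalg.slogdet(B)[1]

dense = np.linalg.slogdet(np.kron(A, B))[1]   # O((pq)^3) reference
print(abs(fast - dense))
```

The same identity extends factor by factor to $d$-dimensional grids, so the log-determinant term in the marginal likelihood never requires forming the full $m \times m$ matrix.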

7. Limitations, Theoretical Insights, and Future Directions

The fundamental limitation is the curse of dimensionality: Cartesian grids and polynomial interpolation induce cost and storage exponential in $d$. Recent research has focused on sparse grids, low-rank product decompositions, and simplicial lattices to circumvent this (Kapoor et al., 2021, Yadav et al., 2023, Gardner et al., 2018). Nonetheless, the interpolation error decays only algebraically with $m$, and high-$d$ regimes intrinsically require loosening error tolerances, motivating active research on more expressive kernel representations, adaptive grids, and automatic dimension reduction. The rigorous error analysis quantifies the “price of linear time” and the precise scaling laws necessary to achieve linearly scalable, accurate GP inference in structured and unstructured domains (Moreno et al., 1 Feb 2025).
