Inducing Variables in Gaussian Processes

Updated 27 May 2026

Inducing variables are latent constructs in Gaussian processes that summarize full GP posteriors using a lower-dimensional set of pseudo-points.
They enable scalable inference by reducing computational complexity from O(N³) to O(NM²) while maintaining predictive accuracy.
Their selection, parametrization, and design—ranging from point evaluations to inter-domain projections—directly impact performance in both standard and deep GP models.

Inducing variables are latent constructs in Gaussian process (GP) modeling introduced to enable scalable inference by summarizing the information present in the full GP with a lower-dimensional set of “pseudo-points.” In standard and deep Gaussian processes, inducing variables can take the form of point-evaluations, inter-domain projections, or more general functionals, and are central to state-of-the-art scalable variational and fully Bayesian inference frameworks. Their selection, parametrization, and inference directly impact both predictive performance and computational complexity.

1. Formulation and Role of Inducing Variables in Sparse GPs

In the standard sparse variational GP (SVGP) approach, a collection of $M$ inducing inputs $Z = \{ z_m\}_{m=1}^M$ is chosen in the input space, and their corresponding function values $u = \{ f(z_m)\}_{m=1}^M$ define the inducing variables. The joint prior over training function values $f$ and $u$ in a GP with kernel $k$ is given by

$p(f, u) = p(f \mid u) \, p(u),$

where

$p(u) = \mathcal{N}(u \mid 0, K_{ZZ}), \ p(f \mid u) = \mathcal{N}(f \mid K_{XZ} K_{ZZ}^{-1} u, K_{XX} - K_{XZ} K_{ZZ}^{-1} K_{ZX}),$

with $K_{AB}$ denoting the matrix $[k(a, b)]_{a \in A,\, b \in B}$ over input sets $A$ and $B$ (Xu et al., 2024, Uhrenholt et al., 2020, Tiao et al., 2023, Tsitsvero et al., 2022, Panos et al., 2018, Rossi et al., 2020).

The inducing variables act as a low-rank summary of the GP posterior, reducing the required computational effort from $O(N^3)$ to $O(NM^2)$, with $N$ the data size and $M \ll N$.
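The conditional $p(f \mid u)$ can be sketched numerically as follows. This is a minimal illustration, not a reference implementation: the squared-exponential kernel, the helper names `rbf` and `sparse_conditional`, the jitter term, and the toy data are all assumptions introduced here.

```python
import numpy as np

def rbf(A, B, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel matrix [k(a, b)] over the rows of A and B."""
    sq = np.sum(A**2, 1)[:, None] - 2.0 * A @ B.T + np.sum(B**2, 1)[None, :]
    return variance * np.exp(-0.5 * np.maximum(sq, 0.0) / lengthscale**2)

def sparse_conditional(X, Z, u, jitter=1e-6):
    """Mean and covariance of p(f | u) at inputs X, given inducing inputs Z."""
    K_zz = rbf(Z, Z) + jitter * np.eye(len(Z))   # jitter: numerical-stability assumption
    K_xz = rbf(X, Z)
    L = np.linalg.cholesky(K_zz)                 # O(M^3)
    A = np.linalg.solve(L, K_xz.T)               # L^{-1} K_ZX, O(N M^2)
    mean = A.T @ np.linalg.solve(L, u)           # K_XZ K_ZZ^{-1} u
    cov = rbf(X, X) - A.T @ A                    # K_XX - K_XZ K_ZZ^{-1} K_ZX
    return mean, cov

rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(200, 1))        # N = 200 inputs (toy data)
Z = np.linspace(-3.0, 3.0, 10)[:, None]          # M = 10 inducing inputs
L_zz = np.linalg.cholesky(rbf(Z, Z) + 1e-6 * np.eye(10))
u = L_zz @ rng.standard_normal(10)               # one draw from p(u)
mean, cov = sparse_conditional(X, Z, u)
```

The full $N \times N$ covariance is formed here only for illustration; scalable implementations typically need just its diagonal, which keeps the cost at $O(NM^2)$.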

2. Variational and Bayesian Inference with Inducing Variables

2.1 Stochastic Variational Inference

The variational framework introduces an approximate posterior, typically Gaussian,

$q(u) = \mathcal{N}(u \mid \mu, \Sigma),$

and seeks to maximize the evidence lower bound (ELBO):

$\mathcal{L} = \mathbb{E}_{q(f)}[\log p(y \mid f)] - \mathrm{KL}[q(u) \,\|\, p(u)], \quad q(f) = \int p(f \mid u)\, q(u)\, du,$

where $y$ denotes the observed targets. Both the inducing locations $Z$ and variational parameters $(\mu, \Sigma)$ are optimized jointly (Tsitsvero et al., 2022, Tiao et al., 2023, Panos et al., 2018).
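For a Gaussian likelihood, both terms of the ELBO are available in closed form. The sketch below evaluates it for a Gaussian $q(u)$ with mean `mu` and covariance `Sigma`; the unit-variance squared-exponential kernel, the noise variance of 0.1, the function name `svgp_elbo`, and the toy data are assumptions made for illustration. It compares two choices of $q(u)$: the prior itself, and the known closed-form optimal Gaussian for this likelihood.

```python
import numpy as np

def rbf(A, B, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel matrix over the rows of A and B."""
    sq = np.sum(A**2, 1)[:, None] - 2.0 * A @ B.T + np.sum(B**2, 1)[None, :]
    return variance * np.exp(-0.5 * np.maximum(sq, 0.0) / lengthscale**2)

def svgp_elbo(X, y, Z, mu, Sigma, noise_var=0.1, jitter=1e-6):
    """ELBO for a unit-variance RBF prior and likelihood N(y_n | f_n, noise_var)."""
    M = len(Z)
    K_zz = rbf(Z, Z) + jitter * np.eye(M)
    K_xz = rbf(X, Z)
    A = np.linalg.solve(K_zz, K_xz.T).T                    # K_XZ K_ZZ^{-1}
    # Marginal means and variances of q(f) = int p(f | u) q(u) du
    f_mean = A @ mu
    f_var = 1.0 - np.sum(A * K_xz, 1) + np.sum((A @ Sigma) * A, 1)  # k(x, x) = 1 here
    # Expected log-likelihood under q(f), in closed form for Gaussian noise
    exp_ll = np.sum(-0.5 * np.log(2.0 * np.pi * noise_var)
                    - 0.5 * ((y - f_mean) ** 2 + f_var) / noise_var)
    # KL[q(u) || p(u)] between two M-dimensional Gaussians
    kl = 0.5 * (np.trace(np.linalg.solve(K_zz, Sigma))
                + mu @ np.linalg.solve(K_zz, mu) - M
                + np.linalg.slogdet(K_zz)[1] - np.linalg.slogdet(Sigma)[1])
    return exp_ll - kl, kl

rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(100, 1))
y = np.sin(X[:, 0]) + np.sqrt(0.1) * rng.standard_normal(100)
Z = np.linspace(-3.0, 3.0, 8)[:, None]
K_zz = rbf(Z, Z) + 1e-6 * np.eye(8)
K_xz = rbf(X, Z)

# Choice 1: q(u) = p(u), so the KL term vanishes but the data are ignored.
elbo_prior, kl_prior = svgp_elbo(X, y, Z, np.zeros(8), K_zz)

# Choice 2: closed-form optimal Gaussian q(u) for a Gaussian likelihood.
B = K_zz + K_xz.T @ K_xz / 0.1
Sigma_opt = K_zz @ np.linalg.solve(B, K_zz)
mu_opt = K_zz @ np.linalg.solve(B, K_xz.T @ y) / 0.1
elbo_opt, kl_opt = svgp_elbo(X, y, Z, mu_opt, Sigma_opt)
```

In stochastic variational inference the sum over data points inside `exp_ll` would be estimated on minibatches and `mu`, `Sigma`, and `Z` updated by gradient steps; the closed-form optimum is used here only to keep the example short.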

2.2 Bayesian Treatments

A fully Bayesian approach places priors on the inducing inputs $Z$ and kernel hyperparameters, treating them as random variables. Stochastic gradient Hamiltonian Monte Carlo (SGHMC) is used to jointly sample from the posterior over $(u, Z, \theta)$, where $\theta$ denotes the kernel hyperparameters, improving uncertainty quantification and predictive accuracy over point-estimate or purely variational approaches (Rossi et al., 2020). This Bayesian treatment also addresses the sensitivity to inducing point placement (Uhrenholt et al., 2020).

2.3 Extensions: Point-Process Priors and Variable-Size Sets

To make the number and selection of inducing points part of the model, point-process priors over the inducing set are used, with the inclusion of each candidate location governed by its own variational parameter. The variational posterior over the inducing set is then fit under the ELBO, allowing the model to determine both the number and the placement of inducing points (Uhrenholt et al., 2020).

3. Generalizations: Inter-Domain and Orthogonal Inducing Variables

Inducing variables are not restricted to point evaluations. Inter-domain inducing variables are defined as linear functionals,

$u_m = \langle f, \phi_m \rangle_{\mathcal{H}}, \quad m = 1, \dots, M,$

where $\phi_m$ are chosen basis functions in the RKHS $\mathcal{H}$ of $k$. These generalize classical inducing points and can be constructed from, for example, spherical harmonics (zonal functions) or neural-style features (Tiao et al., 2023).
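As a short worked illustration, assume the functionals take the formal RKHS inner-product form $u_m = \langle f, \phi_m \rangle_{\mathcal{H}}$ (this form, and the index $m'$, are notational assumptions made here). The reproducing property $\langle k(x, \cdot), \phi \rangle_{\mathcal{H}} = \phi(x)$ then gives the covariances needed for sparse inference:

$\operatorname{Cov}(u_m, f(x)) = \langle k(x, \cdot), \phi_m \rangle_{\mathcal{H}} = \phi_m(x), \qquad \operatorname{Cov}(u_m, u_{m'}) = \langle \phi_m, \phi_{m'} \rangle_{\mathcal{H}}.$

Choosing $\phi_m = k(z_m, \cdot)$ yields $u_m = f(z_m)$, $\operatorname{Cov}(u_m, f(x)) = k(z_m, x)$, and $\operatorname{Cov}(u_m, u_{m'}) = k(z_m, z_{m'})$, which recovers the matrices $K_{ZX}$ and $K_{ZZ}$ of the standard inducing-point construction.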

Orthogonally-decoupled GP frameworks decompose the kernel as $k = k_{\parallel} + k_{\perp}$, with $k_{\parallel}$ spanned by the inducing feature map and $k_{\perp}$ covering the orthogonal complement. Two families of inducing variables, $u$ (principal) and $v$ (orthogonal), are introduced, enabling the simultaneous, independent enrichment of the mean and variance approximations at reduced computational cost (Tiao et al., 2023).

4. Scalable Approximations and Subspace Inducing Inputs

For high-dimensional data, the cost of kernel computations involving full-dimensional inducing inputs $Z \in \mathbb{R}^{M \times D}$ becomes prohibitive. A scalable alternative is to constrain inducing inputs to a low-rank subspace, $Z = \tilde{Z} V^\top$, with $\tilde{Z} \in \mathbb{R}^{M \times d}$, $d \ll D$, and $V \in \mathbb{R}^{D \times d}$ a data-driven orthonormal basis (e.g., top right singular vectors). This changes the per-iteration cost of kernel evaluations from scaling with the input dimension $D$ to scaling with the subspace dimension $d$, while retaining flexibility (Panos et al., 2018). Numerically stable “kernel-preconditioned” parametrizations of the variational distribution that involve a diagonal matrix avoid instabilities and reduce the requirement for explicit regularization.
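The saving from a subspace constraint can be illustrated with squared distances, which stationary kernels depend on. The sketch below is an illustration under stated assumptions rather than the procedure of Panos et al. (2018): the names `Z_tilde`, `V`, and `subspace_sq_dists`, the use of an SVD of the data, and the random toy data are all introduced here. Because $V$ has orthonormal columns, $\|x - V a\|^2 = \|x\|^2 - 2 (V^\top x)^\top a + \|a\|^2$, so only $d$-dimensional products are needed once $XV$ is precomputed.

```python
import numpy as np

def subspace_sq_dists(X_proj, X_sqnorm, Z_tilde):
    """Squared distances ||x_n - V z_m||^2 using only d-dimensional products.

    Valid when V has orthonormal columns, so ||V a|| = ||a|| and
    x . (V a) = (V^T x) . a.
    """
    return (X_sqnorm[:, None]
            - 2.0 * X_proj @ Z_tilde.T
            + np.sum(Z_tilde**2, 1)[None, :])

rng = np.random.default_rng(0)
N, D, d, M = 200, 100, 10, 20
X = rng.standard_normal((N, D))                  # toy high-dimensional inputs

# Data-driven orthonormal basis: top-d right singular vectors of X.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt[:d].T                                     # shape (D, d)

X_proj = X @ V                                   # precomputed once, shape (N, d)
X_sqnorm = np.sum(X**2, 1)                       # precomputed once, shape (N,)
Z_tilde = rng.standard_normal((M, d))            # free subspace coordinates

fast = subspace_sq_dists(X_proj, X_sqnorm, Z_tilde)   # O(N M d) per evaluation

# Reference computation in the full D-dimensional space, for comparison only.
Z = Z_tilde @ V.T                                # shape (M, D)
full = np.sum((X[:, None, :] - Z[None, :, :]) ** 2, axis=-1)  # O(N M D)
```

The two results agree up to floating-point error, while only `Z_tilde` (with $M \times d$ entries) needs to be updated during optimization.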

5. Inducing Variables in Deep Gaussian Processes

Deep Gaussian processes (DGPs) stack multiple GPs, with each hidden layer $\ell$ having its own set of $M_\ell$ inducing inputs $Z_\ell$ and latent outputs $u_\ell$. The full joint prior becomes

$p(\{f_\ell, u_\ell\}_{\ell=1}^L) = \prod_{\ell=1}^L p(f_\ell \mid u_\ell, f_{\ell-1}, Z_\ell)\, p(u_\ell \mid Z_\ell), \quad f_0 = X,$

with each factor Gaussian (Xu et al., 2024). Inducing variables at each layer render otherwise intractable integrals and marginalizations computationally feasible at $O(N M_\ell^2)$ cost per layer.
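The layer-wise structure can be sketched by pushing inputs through a stack of sparse conditionals. This is a deliberately simplified illustration: hidden layers are one-dimensional, the inducing outputs are drawn from their prior rather than from a learned posterior, each point is sampled independently from its marginal, and the names `sample_layer` and `layers` are assumptions introduced here.

```python
import numpy as np

def rbf(A, B, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel matrix over the rows of A and B."""
    sq = np.sum(A**2, 1)[:, None] - 2.0 * A @ B.T + np.sum(B**2, 1)[None, :]
    return variance * np.exp(-0.5 * np.maximum(sq, 0.0) / lengthscale**2)

def sample_layer(F_in, Z, u, rng, jitter=1e-6):
    """Sample f_n ~ p(f_n | u) independently for each row of F_in."""
    K_zz = rbf(Z, Z) + jitter * np.eye(len(Z))
    K_xz = rbf(F_in, Z)
    A = np.linalg.solve(K_zz, K_xz.T).T                # K_XZ K_ZZ^{-1}, O(N M^2)
    mean = A @ u
    # Diagonal of the conditional covariance; k(x, x) = 1 for this kernel.
    var = np.maximum(1.0 - np.sum(A * K_xz, 1), 0.0)
    return mean + np.sqrt(var) * rng.standard_normal(len(F_in))

rng = np.random.default_rng(0)
X = np.linspace(-2.0, 2.0, 50)[:, None]                # N = 50 inputs
num_layers, M = 2, 8

# Per-layer inducing inputs Z_l and inducing outputs u_l ~ p(u_l | Z_l).
layers = []
for _ in range(num_layers):
    Z = np.linspace(-2.0, 2.0, M)[:, None]
    L = np.linalg.cholesky(rbf(Z, Z) + 1e-6 * np.eye(M))
    layers.append((Z, L @ rng.standard_normal(M)))

F = X                                                  # f_0 = X
for Z, u in layers:
    F = sample_layer(F, Z, u, rng)[:, None]            # output of one layer feeds the next
```

Each pass through `sample_layer` costs $O(N M^2)$, so the total cost grows linearly with the number of layers.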

Recent advances address the challenge of accurate posterior inference over these multi-layer inducing variables, including:

Probabilistic selection of per-layer inducing inputs $Z_\ell$ via point-process priors (Uhrenholt et al., 2020);
Fully Bayesian DGPs using SGHMC for the inducing variables $u_\ell$ of all layers (Rossi et al., 2020);
Denoising Diffusion Variational Inference (DDVI), in which the variational posterior over the inducing variables $\{u_\ell\}_{\ell=1}^L$ is represented as the terminal distribution of a reverse-time diffusion SDE parameterized by a neural network score function. This enables expressive, multimodal posteriors and provides a tractable, explicit ELBO (Xu et al., 2024).

6. Algorithmic Details and Empirical Comparisons

6.1 Classical and Advanced Inference Schemes

DSVI (Doubly Stochastic VI): Takes $q(u_\ell) = \mathcal{N}(u_\ell \mid \mu_\ell, \Sigma_\ell)$ independently for each layer; efficient but potentially biased for complex posteriors.
IPVI (Implicit VI): Uses neural-network samplers trained with adversarial losses; does not yield an explicit ELBO.
DDVI: Employs reverse-time diffusion processes, optimizing an explicit path-space KL and corresponding ELBO, judged to outperform DSVI and IPVI in both expressivity and stability (Xu et al., 2024).

6.2 Empirical Findings

DDVI-based DGPs achieve superior test RMSE, NLL, and calibration on UCI regression benchmarks and achieve state-of-the-art accuracy on MNIST, Fashion-MNIST, CIFAR-10, SUSY, and HIGGS datasets (Xu et al., 2024).
Probabilistic selection methods empirically show that as inducing points become less informative, the model prunes unnecessary points, trading off sparsity and predictive fit (Uhrenholt et al., 2020).
Subspace inducing inputs offer computational speed-ups without significant accuracy loss in extreme multi-label and high-dimensional settings (Panos et al., 2018).
Mean-field variational GP approaches, when contrasted with Bayesian and DDVI methods, can exhibit underestimation of uncertainty and higher test errors, especially when inducing variable posteriors are complex (Tsitsvero et al., 2022, Rossi et al., 2020, Xu et al., 2024).

7. Selection, Initialization, and Design of Inducing Variables

Proper initialization and adaptive learning of $Z$ are critical for model performance. Random selection from training points or guided methods (e.g., k-means, farthest-point) can be employed; variationally optimized $Z$ tend to outperform fixed sets, especially in representing out-of-distribution or unseen test points (Tsitsvero et al., 2022). In inter-domain and orthogonally-decoupled setups, expressive basis design (e.g., spherical harmonics, neural network features) is important to cover both mean and covariance structures efficiently (Tiao et al., 2023).
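One of the guided initializations mentioned above, farthest-point selection, can be sketched as follows. This is a generic greedy variant written for illustration; the function name `farthest_point_init` and the toy data are assumptions, and the cited works may use a different procedure.

```python
import numpy as np

def farthest_point_init(X, M, rng):
    """Greedily pick M rows of X, each maximizing distance to those already chosen."""
    first = int(rng.integers(len(X)))            # random starting point
    chosen = [first]
    # Squared distance from every point to its nearest chosen point so far.
    d2 = np.sum((X - X[first]) ** 2, axis=1)
    for _ in range(M - 1):
        nxt = int(np.argmax(d2))                 # farthest point from the current set
        chosen.append(nxt)
        d2 = np.minimum(d2, np.sum((X - X[nxt]) ** 2, axis=1))
    return X[chosen].copy()

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 2))                # toy training inputs
Z0 = farthest_point_init(X, 15, rng)             # initial inducing inputs, shape (15, 2)
```

The returned `Z0` would serve only as a starting value; in the variational setting it is then refined jointly with the other parameters by maximizing the ELBO.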

Point-process-based variational inference provides an additional layer of adaptivity, allowing the model to jointly infer both which locations to include and the number of inducing variables, automatically matching model complexity to data structure (Uhrenholt et al., 2020).

References:

(Xu et al., 2024): "Sparse Inducing Points in Deep Gaussian Processes: Enhancing Modeling with Denoising Diffusion Variational Inference"
(Uhrenholt et al., 2020): "Probabilistic selection of inducing points in sparse Gaussian processes"
(Tiao et al., 2023): "Spherical Inducing Features for Orthogonally-Decoupled Gaussian Processes"
(Tsitsvero et al., 2022): "Learning inducing points and uncertainty on molecular data by scalable variational Gaussian processes"
(Panos et al., 2018): "Fully Scalable Gaussian Processes using Subspace Inducing Inputs"
(Rossi et al., 2020): "Sparse Gaussian Processes Revisited: Bayesian Approaches to Inducing-Variable Approximations"