GPyTorch: Scalable GP Inference Platform
- GPyTorch is an open-source library designed for scalable Gaussian process inference and Bayesian deep learning, leveraging PyTorch’s dynamic computation graphs.
- It employs innovative BBMM and pivoted-Cholesky preconditioning algorithms to reduce computational complexity and accelerate GPU-based matrix operations.
- The modular architecture supports both exact and approximate inference, custom kernels, and seamless integration with Bayesian optimization frameworks like BoTorch and Ax.
GPyTorch is an open-source library for Gaussian process (GP) inference and Bayesian deep learning constructed atop the PyTorch ecosystem. It is designed to enable efficient, scalable, and modular GP modeling that supports both exact and approximate inference, leveraging GPU hardware via specialized matrix-matrix computation routines. GPyTorch's architecture exploits PyTorch’s dynamic computation graphs, automatic differentiation, and tensor abstractions, thereby facilitating easy integration with neural network modules, and providing seamless interoperability for hybrid models, Bayesian optimization, and adaptive experimentation workflows (Gardner et al., 2018, Chang, 2019).
1. Core Architecture and Modular Components
GPyTorch implements its functionality as a collection of PyTorch modules, including subclasses of torch.nn.Module and torch.autograd.Function. Kernel hyperparameters, inducing-point locations, and latent function parameters are all represented as PyTorch Tensor objects and registered as parameters, enabling native use of PyTorch optimizers and autograd.
Key architectural components include:
- Kernel Modules (gpytorch.kernels): wide support for stationary kernels (RBF, Matern, RationalQuadratic, Periodic), as well as combinators (ScaleKernel, AdditiveKernel, ProductKernel) for composite covariance structures. Kernels output symbolic LazyTensor representations rather than forming dense matrices.
- Mean Modules (gpytorch.means): supports ZeroMean, ConstantMean, LinearMean, and arbitrary user-defined mean functions.
- Likelihoods (gpytorch.likelihoods): includes GaussianLikelihood as well as likelihoods for classification and count data (SoftmaxLikelihood, BernoulliLikelihood), compatible with both exact and variational inference procedures.
- Inference Modules: ExactGP and ExactMarginalLogLikelihood for exact inference via BBMM and conjugate gradients; ApproximateGP and VariationalELBO for scalable variational inference with stochastic mini-batching and variational strategies.
- Linear Algebra Utilities: LazyTensor (symbolic representations deferring instantiation and exploiting algebraic structure such as Kronecker and Toeplitz); specialized matmul, solve, and log-determinant routines dispatched through BBMM; CUDA-accelerated kernels for batched and block-structured matrix operations.
- Integration Points: GPyTorch models serve as surrogates in higher-level Bayesian optimization frameworks such as BoTorch and Ax (Chang, 2019).
This modular structure allows complex GPs (deep kernels, multi-task, or hybrid neural-GP models) to be composed efficiently, with GPU-acceleration available for all major inference workflows.
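To make the deferred-evaluation idea concrete, here is a minimal numpy sketch of a LazyTensor-like operator (the class name and structure are illustrative, not GPyTorch's actual implementation): a symbolic low-rank-plus-diagonal operator supports matrix-vector products without ever forming the dense matrix.

```python
import numpy as np

class LowRankPlusDiagOperator:
    """Symbolic n x n operator A = L @ L.T + diag(d), never materialized.
    Hypothetical sketch of the deferred-evaluation idea behind LazyTensor."""
    def __init__(self, L, d):
        self.L, self.d = L, d          # L: (n, k) factor, d: (n,) diagonal

    def matmul(self, v):
        # O(nk) instead of the O(n^2) dense product
        return self.L @ (self.L.T @ v) + self.d * v

    def dense(self):
        # explicit instantiation, only for verification
        return self.L @ self.L.T + np.diag(self.d)

rng = np.random.default_rng(0)
L = rng.standard_normal((500, 5))
d = np.full(500, 0.1)
op = LowRankPlusDiagOperator(L, d)
v = rng.standard_normal(500)
assert np.allclose(op.matmul(v), op.dense() @ v)
```

Because downstream inference only ever calls `matmul`, the same solver code runs unchanged whether the operator is dense, low-rank, Kronecker, or Toeplitz structured.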
2. Blackbox Matrix-Matrix Multiplication (BBMM) Inference Algorithm
BBMM is a key methodological innovation in GPyTorch, reducing the cost of GP inference on an $n \times n$ covariance matrix from $\mathcal{O}(n^3)$ to $\mathcal{O}(n^2)$ per inference call. BBMM supports the three dominant operations in GP inference:
- Linear solves: $\hat{K}_{XX}^{-1}\mathbf{y}$,
- Log-determinant: $\log \lvert \hat{K}_{XX} \rvert$,
- Trace terms: $\mathrm{tr}\big(\hat{K}_{XX}^{-1}\, \partial \hat{K}_{XX} / \partial \theta\big)$ for hyperparameter gradients.
Rather than explicit Cholesky decomposition, BBMM employs a modified, batched conjugate gradient (CG) routine ("mBCG") that solves multiple linear systems in parallel (e.g., for both targets and random probe vectors), while accumulating tridiagonal Lanczos coefficients for stochastic estimation of log-determinants and traces via stochastic Lanczos quadrature (SLQ).
A sketch of the core mBCG routine:
```python
def mBCG(K_matmul, B0, max_iters):
    # Batched CG: solve K U = B0 for all right-hand sides [b0, b1, ..., bt] at once.
    # Pseudocode helpers: colsum = column-wise sum, diag = diagonal matrix.
    U = zeros_like(B0)                        # solution matrix
    R = B0                                    # initial residuals
    P = R                                     # search directions
    alphas, betas = [], []
    for j in range(max_iters):
        KP = K_matmul(P)                      # the only O(n^2) operation per iteration
        alpha = colsum(R * R) / colsum(P * KP)   # per-system step sizes
        U = U + P @ diag(alpha)
        R_new = R - KP @ diag(alpha)
        beta = colsum(R_new * R_new) / colsum(R * R)
        P = R_new + P @ diag(beta)
        R = R_new
        alphas.append(alpha); betas.append(beta)  # Lanczos tridiagonal stats per system
    return U, {"alphas": alphas, "betas": betas}
```
The predictive mean and variance are computed through CG solves against the training targets (yielding $\hat{K}_{XX}^{-1}\mathbf{y}$) and against test-time kernel vectors, with variance estimated via SLQ using the accumulated tridiagonalizations. Gradients and log-marginal-likelihood terms are likewise reduced to stochastic trace estimation with random probe vectors, yielding $\mathcal{O}(p\,n^2)$ overall time for $p$ CG iterations (Gardner et al., 2018).
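The SLQ step can be sketched in plain numpy (a simplified illustration, not GPyTorch's implementation): run Lanczos from random probe vectors, then combine the resulting tridiagonal matrices into a Hutchinson-style estimate of $\log \lvert K \rvert$.

```python
import numpy as np

def lanczos_tridiag(matmul, z, m):
    # m-step Lanczos from start vector z; returns the m x m tridiagonal T
    n = z.shape[0]
    alpha = np.zeros(m)
    beta = np.zeros(m - 1)
    q = z / np.linalg.norm(z)
    q_prev = np.zeros(n)
    b = 0.0
    for j in range(m):
        w = matmul(q) - b * q_prev
        alpha[j] = q @ w
        w = w - alpha[j] * q
        if j < m - 1:
            b = np.linalg.norm(w)
            beta[j] = b
            q_prev, q = q, w / b
    return np.diag(alpha) + np.diag(beta, 1) + np.diag(beta, -1)

def slq_logdet(matmul, n, m, probes, seed=0):
    # SLQ: E_z[ ||z||^2 * e1^T log(T_z) e1 ] estimates tr(log K) = log|K|
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(probes):
        z = rng.standard_normal(n)
        T = lanczos_tridiag(matmul, z, m)
        evals, evecs = np.linalg.eigh(T)
        total += (z @ z) * (evecs[0] ** 2 @ np.log(evals))
    return total / probes

# sanity check against a dense log-determinant
rng = np.random.default_rng(1)
A = rng.standard_normal((40, 40))
K = A @ A.T + 40.0 * np.eye(40)          # well-conditioned SPD matrix
exact = np.linalg.slogdet(K)[1]
approx = slq_logdet(lambda v: K @ v, 40, m=15, probes=50, seed=1)
```

In GPyTorch the Lanczos coefficients come for free from the mBCG recurrences, so no separate Lanczos pass is needed; this sketch separates the two steps only for clarity.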
3. Preconditioning and Pivoted-Cholesky Acceleration
To improve convergence of CG solves within BBMM, GPyTorch implements a low-rank pivoted-Cholesky preconditioner. The preconditioner $\hat{P}_k = L_k L_k^\top + \sigma^2 I$, with $L_k$ constructed from a $k$-step pivoted Cholesky approximation of $K_{XX}$, reduces the effective condition number of $\hat{P}_k^{-1} \hat{K}_{XX}$, dramatically accelerating convergence. For RBF kernels, theoretical bounds indicate that the condition number shrinks exponentially with $k$, and in practice a moderate rank is sufficient for most datasets.
Computational costs for preconditioning steps:
- Construction: $\mathcal{O}(nk^2)$,
- Preconditioner solves (via the Woodbury identity): $\mathcal{O}(nk^2)$,
- Sampling: $\mathcal{O}(nk)$,
- Log-determinant: $\mathcal{O}(nk^2)$.
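A minimal numpy sketch of the rank-$k$ pivoted Cholesky factor and the corresponding Woodbury solve (illustrative only; GPyTorch's internal routines differ):

```python
import numpy as np

def pivoted_cholesky(K, k):
    """Rank-k pivoted Cholesky: greedily pivot on the largest remaining
    diagonal entry, so that K is approximated by L @ L.T with L of shape (n, k)."""
    n = K.shape[0]
    L = np.zeros((n, k))
    d = np.diag(K).copy()            # running Schur-complement diagonal
    for j in range(k):
        p = np.argmax(d)             # pivot: largest residual variance
        L[:, j] = (K[:, p] - L[:, :j] @ L[p, :j]) / np.sqrt(d[p])
        d -= L[:, j] ** 2
    return L

def woodbury_solve(L, sigma2, b):
    """Solve (L L^T + sigma2 I) x = b via the Woodbury identity,
    using only a k x k dense solve."""
    k = L.shape[1]
    inner = sigma2 * np.eye(k) + L.T @ L
    return (b - L @ np.linalg.solve(inner, L.T @ b)) / sigma2

# demo on an RBF kernel matrix
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 100))
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / 0.1 ** 2)
L = pivoted_cholesky(K, 10)
b = rng.standard_normal(100)
x_hat = woodbury_solve(L, 0.1, b)
P = L @ L.T + 0.1 * np.eye(100)
assert np.allclose(x_hat, np.linalg.solve(P, b))
```

Because the RBF spectrum decays rapidly, a rank-10 factor already captures most of the kernel's energy here, which is why such preconditioners are effective at modest $k$.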
Use of the preconditioner is fully automated and can be controlled by user-configurable parameters (Gardner et al., 2018).
4. Extensibility: Scalable Approximations and Custom Kernels
GPyTorch’s BBMM inference algorithm requires only user-provided matrix-matrix multiplication routines for the kernel and its parameter derivatives. As a result, scalable GP approximations and compositional/custom kernels are rapidly integrated with minimal boilerplate.
Examples include:
- Bayesian linear regression: kernel $\hat{K} = X X^\top + \sigma^2 I$, with matrix-multiplies $\hat{K} v = X (X^\top v) + \sigma^2 v$ costing $\mathcal{O}(nd)$.
- Subset of Regressors (SoR)/SGPR: $\hat{K} = K_{XU} K_{UU}^{-1} K_{UX}$; multiplies implemented via small-matrix solves and products.
- Structured Kernel Interpolation (SKI): $\hat{K} = W K_{UU} W^\top$, with fast kernel-matrix multiplies via sparse $W$ and structured $K_{UU}$ (Toeplitz, Kronecker).
- Kernel Composition: Sums, products, and other compositions exploit modular kernel APIs and fuse their matrix-multiplies.
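The Bayesian linear regression case, for example, reduces each kernel matmul to two thin-matrix products; a small numpy sketch, assuming the kernel form $X X^\top + \sigma^2 I$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 2000, 10
X = rng.standard_normal((n, d))
sigma2 = 0.5

def K_matmul(v):
    """(X X^T + sigma2 I) @ v in O(nd), never forming the n x n kernel."""
    return X @ (X.T @ v) + sigma2 * v

v = rng.standard_normal(n)
dense = X @ X.T + sigma2 * np.eye(n)
assert np.allclose(K_matmul(v), dense @ v)
```

Supplying such a closure (plus its parameter derivatives) is all BBMM needs to run solves, log-determinants, and gradients for this model class.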
For both exact and approximate GPs (variational inference), users may inherit from the appropriate model base class and only need to define the relevant matrix-multiply routines (Gardner et al., 2018, Chang, 2019).
5. Performance Benchmarks and GPU Acceleration
GPyTorch exploits CUDA acceleration through direct integration with PyTorch tensor ops. All large linear algebra computations, and in particular BBMM batch matmuls, are offloaded to the GPU. Empirically, GPyTorch delivers substantial runtime improvements compared to conventional CPU and even GPU-Cholesky approaches.
Representative timings (Gardner et al., 2018):
| Model | $n$ | CPU (s) | GPU-BBMM (s) | Speedup |
|---|---|---|---|---|
| Exact (RBF) | 3,000 | 1.0 | 0.03 | 33× |
| SGPR (m=300) | 50,000 | 40 | 4.0 | 10× |
| SKI (m=10k) | 500,000 | 600 | 40 | 15× |
Typical speedups relative to CPU-Cholesky range from 5× (small $n$) up to 32× (large $n$); even GPU-Cholesky is outperformed by a large margin for GP inference workloads (Gardner et al., 2018, Chang, 2019).
6. Public APIs and Code Snippets
GPyTorch models closely follow PyTorch conventions. Users implement GPs as subclasses of gpytorch.models.ExactGP or gpytorch.models.ApproximateGP, define mean and kernel modules, and use standard training loops. CG and BBMM are enabled by default; inference settings (e.g., max CG iterations, number of trace samples) are controlled via PyTorch context managers:
```python
import torch
import gpytorch

class ExactGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super().__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ZeroMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(
            gpytorch.kernels.RBFKernel()
        )

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)

train_x = torch.randn(3000, 10).cuda()
train_y = torch.randn(3000).cuda()
likelihood = gpytorch.likelihoods.GaussianLikelihood().cuda()
model = ExactGPModel(train_x, train_y, likelihood).cuda()

model.train(); likelihood.train()
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)

for epoch in range(50):
    optimizer.zero_grad()
    out = model(train_x)
    loss = -mll(out, train_y)
    loss.backward()
    optimizer.step()
```
Pivoted-Cholesky preconditioners and approximate GP models are exposed via high-level API calls and class inheritance. No explicit mention of CG/BBMM is required in user code (Gardner et al., 2018, Chang, 2019).
7. Limitations and Practical Considerations
Several practical limitations and considerations are noted for large-scale and ill-conditioned problems:
- Memory: BBMM holds a block of right-hand sides and residuals per CG iteration; for a dense kernel matrix, memory usage scales as $\mathcal{O}(n^2)$, and this kernel storage sets the practical limit for exact inference on a single 12GB GPU.
- Convergence: For ill-conditioned kernels or large , more CG iterations or higher preconditioner rank may be required.
- Numerical Stability: CG-based solves often avoid the need for explicit "jitter". If jitter is required for stability, it must be reflected in the kernel multiply routine.
- Non-Gaussian Likelihoods: fully compatible with variational ELBO schemes (e.g., SVGP for classification), but KL terms must likewise be expressed through mBCG-compatible matrix multiplies; standard variational families are shipped with the library.
- Use Case Scope: best suited for regression and classification tasks with moderate $n$; inducing-point or structured approximations are still needed for very large-scale datasets (Gardner et al., 2018).
In summary, GPyTorch enables scalable GP inference and Bayesian modeling via a unifying, GPU-accelerated computational framework, with design choices grounded in LazyTensor representation, BBMM routines, and PyTorch-native modularity. Its integration with BoTorch and Ax extends its applicability to Bayesian optimization and experimental design (Gardner et al., 2018, Chang, 2019).