GPyTorch: Scalable GP Inference Platform
- GPyTorch is an open-source library designed for scalable Gaussian process inference and Bayesian deep learning, leveraging PyTorch’s dynamic computation graphs.
- It employs innovative BBMM and pivoted-Cholesky preconditioning algorithms to reduce computational complexity and accelerate GPU-based matrix operations.
- The modular architecture supports both exact and approximate inference, custom kernels, and seamless integration with Bayesian optimization frameworks like BoTorch and Ax.
GPyTorch is an open-source library for Gaussian process (GP) inference and Bayesian deep learning constructed atop the PyTorch ecosystem. It is designed to enable efficient, scalable, and modular GP modeling that supports both exact and approximate inference, leveraging GPU hardware via specialized matrix-matrix computation routines. GPyTorch's architecture exploits PyTorch’s dynamic computation graphs, automatic differentiation, and tensor abstractions, thereby facilitating easy integration with neural network modules, and providing seamless interoperability for hybrid models, Bayesian optimization, and adaptive experimentation workflows (Gardner et al., 2018, Chang, 2019).
1. Core Architecture and Modular Components
GPyTorch implements its functionality as a collection of PyTorch modules, including subclasses of torch.nn.Module and torch.autograd.Function. Kernel hyperparameters, inducing-point locations, and latent function parameters are all represented as PyTorch Tensor objects and registered as parameters, enabling native use of PyTorch optimizers and autograd.
Key architectural components include:
- Kernel Modules (gpytorch.kernels): wide support for stationary kernels (RBF, Matern, RationalQuadratic, Periodic), as well as combinators (ScaleKernel, AdditiveKernel, ProductKernel) for composite covariance structures. Kernels output symbolic LazyTensor representations rather than forming dense matrices.
- Mean Modules (gpytorch.means): supports ZeroMean, ConstantMean, LinearMean, and arbitrary user-defined mean functions.
- Likelihoods (gpytorch.likelihoods): includes GaussianLikelihood as well as likelihoods for classification and count data (SoftmaxLikelihood, BernoulliLikelihood), compatible with both exact and variational inference procedures.
- Inference Modules: ExactGP and ExactMarginalLogLikelihood for exact inference via BBMM and conjugate gradients; ApproximateGP and VariationalELBO for scalable variational inference with stochastic mini-batching and variational strategies.
- Linear Algebra Utilities: LazyTensor (symbolic representations deferring instantiation and exploiting algebraic structure such as Kronecker and Toeplitz); specialized matmul, solve, and log-determinant routines dispatched through BBMM; CUDA-accelerated kernels for batched and block-structured matrix operations.
- Integration Points: GPyTorch models serve as surrogates in higher-level Bayesian optimization frameworks such as BoTorch and Ax (Chang, 2019).
This modular structure allows complex GPs (deep kernels, multi-task, or hybrid neural-GP models) to be composed efficiently, with GPU-acceleration available for all major inference workflows.
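To make the deferred-evaluation idea concrete, here is a minimal numpy sketch of a LazyTensor-like operator (the class name and structure are illustrative, not GPyTorch's actual implementation): a symbolic low-rank-plus-diagonal operator supports matrix-vector products without ever forming the dense matrix.

```python
import numpy as np

class LowRankPlusDiagOperator:
    """Symbolic n x n operator A = L @ L.T + diag(d), never materialized.
    Hypothetical sketch of the deferred-evaluation idea behind LazyTensor."""
    def __init__(self, L, d):
        self.L, self.d = L, d          # L: (n, k) factor, d: (n,) diagonal

    def matmul(self, v):
        # O(nk) instead of the O(n^2) dense product
        return self.L @ (self.L.T @ v) + self.d * v

    def dense(self):
        # explicit instantiation, only for verification
        return self.L @ self.L.T + np.diag(self.d)

rng = np.random.default_rng(0)
L = rng.standard_normal((500, 5))
d = np.full(500, 0.1)
op = LowRankPlusDiagOperator(L, d)
v = rng.standard_normal(500)
assert np.allclose(op.matmul(v), op.dense() @ v)
```

Because downstream inference only ever calls `matmul`, the same solver code runs unchanged whether the operator is dense, low-rank, Kronecker, or Toeplitz structured.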
2. Blackbox Matrix-Matrix Multiplication (BBMM) Inference Algorithm
BBMM is a key methodological innovation in GPyTorch, reducing the cost of GP inference on an $n \times n$ covariance matrix from $\mathcal{O}(n^3)$ to $\mathcal{O}(n^2)$ per inference call. BBMM supports the three dominant operations in GP inference:
- Linear solves: $\hat{K}_{XX}^{-1}\mathbf{y}$,
- Log-determinant: $\log \lvert \hat{K}_{XX} \rvert$,
- Trace terms: $\mathrm{tr}\big(\hat{K}_{XX}^{-1}\, \partial \hat{K}_{XX} / \partial \theta\big)$ for hyperparameter gradients.
Rather than explicit Cholesky decomposition, BBMM employs a modified, batched conjugate gradient (CG) routine ("mBCG") that solves multiple linear systems in parallel (e.g., for both targets and random probe vectors), while accumulating tridiagonal Lanczos coefficients for stochastic estimation of log-determinants and traces via stochastic Lanczos quadrature (SLQ).
A sketch of the core mBCG routine:
```python
def mBCG(K_matmul, B0, max_iters):
    # Batched CG: solve K U = B0 for all right-hand sides [b0, b1, ..., bt] at once.
    # Pseudocode helpers: colsum = column-wise sum, diag = diagonal matrix.
    U = zeros_like(B0)                        # solution matrix
    R = B0                                    # initial residuals
    P = R                                     # search directions
    alphas, betas = [], []
    for j in range(max_iters):
        KP = K_matmul(P)                      # the only O(n^2) operation per iteration
        alpha = colsum(R * R) / colsum(P * KP)   # per-system step sizes
        U = U + P @ diag(alpha)
        R_new = R - KP @ diag(alpha)
        beta = colsum(R_new * R_new) / colsum(R * R)
        P = R_new + P @ diag(beta)
        R = R_new
        alphas.append(alpha); betas.append(beta)  # Lanczos tridiagonal stats per system
    return U, {"alphas": alphas, "betas": betas}
```
The predictive mean and variance are computed through CG solves against the training targets (yielding $\hat{K}_{XX}^{-1}\mathbf{y}$) and against test-time kernel vectors, with variance estimated via SLQ using the accumulated tridiagonalizations. Gradients and log-marginal-likelihood terms are likewise reduced to stochastic trace estimation with random probe vectors, yielding $\mathcal{O}(p\,n^2)$ overall time for $p$ CG iterations (Gardner et al., 2018).
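The SLQ step can be sketched in plain numpy (a simplified illustration, not GPyTorch's implementation): run Lanczos from random probe vectors, then combine the resulting tridiagonal matrices into a Hutchinson-style estimate of $\log \lvert K \rvert$.

```python
import numpy as np

def lanczos_tridiag(matmul, z, m):
    # m-step Lanczos from start vector z; returns the m x m tridiagonal T
    n = z.shape[0]
    alpha = np.zeros(m)
    beta = np.zeros(m - 1)
    q = z / np.linalg.norm(z)
    q_prev = np.zeros(n)
    b = 0.0
    for j in range(m):
        w = matmul(q) - b * q_prev
        alpha[j] = q @ w
        w = w - alpha[j] * q
        if j < m - 1:
            b = np.linalg.norm(w)
            beta[j] = b
            q_prev, q = q, w / b
    return np.diag(alpha) + np.diag(beta, 1) + np.diag(beta, -1)

def slq_logdet(matmul, n, m, probes, seed=0):
    # SLQ: E_z[ ||z||^2 * e1^T log(T_z) e1 ] estimates tr(log K) = log|K|
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(probes):
        z = rng.standard_normal(n)
        T = lanczos_tridiag(matmul, z, m)
        evals, evecs = np.linalg.eigh(T)
        total += (z @ z) * (evecs[0] ** 2 @ np.log(evals))
    return total / probes

# sanity check against a dense log-determinant
rng = np.random.default_rng(1)
A = rng.standard_normal((40, 40))
K = A @ A.T + 40.0 * np.eye(40)          # well-conditioned SPD matrix
exact = np.linalg.slogdet(K)[1]
approx = slq_logdet(lambda v: K @ v, 40, m=15, probes=50, seed=1)
```

In GPyTorch the Lanczos coefficients come for free from the mBCG recurrences, so no separate Lanczos pass is needed; this sketch separates the two steps only for clarity.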
3. Preconditioning and Pivoted-Cholesky Acceleration
To improve convergence of CG solves within BBMM, GPyTorch implements a low-rank pivoted-Cholesky preconditioner. The preconditioner $\hat{P}_k = L_k L_k^\top + \sigma^2 I$, with $L_k$ constructed from a $k$-step pivoted Cholesky approximation of $K_{XX}$, reduces the effective condition number of $\hat{P}_k^{-1} \hat{K}_{XX}$, dramatically accelerating convergence. For RBF kernels, theoretical bounds indicate that the condition number shrinks exponentially with $k$, and in practice a moderate rank is sufficient for most datasets.
Computational costs for preconditioning steps:
- Construction: $\mathcal{O}(nk^2)$,
- Preconditioner solves (via the Woodbury identity): $\mathcal{O}(nk^2)$,
- Sampling: $\mathcal{O}(nk)$,
- Log-determinant: $\mathcal{O}(nk^2)$.
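A minimal numpy sketch of the rank-$k$ pivoted Cholesky factor and the corresponding Woodbury solve (illustrative only; GPyTorch's internal routines differ):

```python
import numpy as np

def pivoted_cholesky(K, k):
    """Rank-k pivoted Cholesky: greedily pivot on the largest remaining
    diagonal entry, so that K is approximated by L @ L.T with L of shape (n, k)."""
    n = K.shape[0]
    L = np.zeros((n, k))
    d = np.diag(K).copy()            # running Schur-complement diagonal
    for j in range(k):
        p = np.argmax(d)             # pivot: largest residual variance
        L[:, j] = (K[:, p] - L[:, :j] @ L[p, :j]) / np.sqrt(d[p])
        d -= L[:, j] ** 2
    return L

def woodbury_solve(L, sigma2, b):
    """Solve (L L^T + sigma2 I) x = b via the Woodbury identity,
    using only a k x k dense solve."""
    k = L.shape[1]
    inner = sigma2 * np.eye(k) + L.T @ L
    return (b - L @ np.linalg.solve(inner, L.T @ b)) / sigma2

# demo on an RBF kernel matrix
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 100))
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / 0.1 ** 2)
L = pivoted_cholesky(K, 10)
b = rng.standard_normal(100)
x_hat = woodbury_solve(L, 0.1, b)
P = L @ L.T + 0.1 * np.eye(100)
assert np.allclose(x_hat, np.linalg.solve(P, b))
```

Because the RBF spectrum decays rapidly, a rank-10 factor already captures most of the kernel's energy here, which is why such preconditioners are effective at modest $k$.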
Use of the preconditioner is fully automated and can be controlled by user-configurable parameters (Gardner et al., 2018).
4. Extensibility: Scalable Approximations and Custom Kernels
GPyTorch’s BBMM inference algorithm requires only user-provided matrix-matrix multiplication routines for the kernel and its parameter derivatives. As a result, scalable GP approximations and compositional/custom kernels are rapidly integrated with minimal boilerplate.
Examples include:
- Bayesian linear regression: kernel $\hat{K} = X X^\top + \sigma^2 I$, with matrix-multiplies $\hat{K} v = X (X^\top v) + \sigma^2 v$ costing $\mathcal{O}(nd)$.
- Subset of Regressors (SoR)/SGPR: $\hat{K} = K_{XU} K_{UU}^{-1} K_{UX}$; multiplies implemented via small-matrix solves and products.
- Structured Kernel Interpolation (SKI): $\hat{K} = W K_{UU} W^\top$, with fast kernel-matrix multiplies via sparse $W$ and structured $K_{UU}$ (Toeplitz, Kronecker).
- Kernel Composition: Sums, products, and other compositions exploit modular kernel APIs and fuse their matrix-multiplies.
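The Bayesian linear regression case, for example, reduces each kernel matmul to two thin-matrix products; a small numpy sketch, assuming the kernel form $X X^\top + \sigma^2 I$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 2000, 10
X = rng.standard_normal((n, d))
sigma2 = 0.5

def K_matmul(v):
    """(X X^T + sigma2 I) @ v in O(nd), never forming the n x n kernel."""
    return X @ (X.T @ v) + sigma2 * v

v = rng.standard_normal(n)
dense = X @ X.T + sigma2 * np.eye(n)
assert np.allclose(K_matmul(v), dense @ v)
```

Supplying such a closure (plus its parameter derivatives) is all BBMM needs to run solves, log-determinants, and gradients for this model class.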
For both exact and approximate GPs (variational inference), users may inherit from the appropriate model base class and only need to define the relevant matrix-multiply routines (Gardner et al., 2018, Chang, 2019).
5. Performance Benchmarks and GPU Acceleration
GPyTorch exploits CUDA acceleration through direct integration with PyTorch tensor ops. All large linear algebra computations, and in particular BBMM batch matmuls, are offloaded to the GPU. Empirically, GPyTorch delivers substantial runtime improvements compared to conventional CPU and even GPU-Cholesky approaches.
Representative timings (Gardner et al., 2018):
| Model | $n$ | CPU (s) | GPU-BBMM (s) | Speedup |
|---|---|---|---|---|
| Exact (RBF) | 3,000 | 1.0 | 0.03 | 33× |
| SGPR (m=300) | 50,000 | 40 | 4.0 | 10× |
| SKI (m=10k) | 500,000 | 600 | 40 | 15× |
Typical speedups relative to CPU-Cholesky range from 5× (small $n$) up to 32× (large $n$); even GPU-Cholesky is outperformed by a large margin for GP inference workloads (Gardner et al., 2018, Chang, 2019).
6. Public APIs and Code Snippets
GPyTorch models closely follow PyTorch conventions. Users implement GPs as subclasses of gpytorch.models.ExactGP or gpytorch.models.ApproximateGP, define mean and kernel modules, and use standard training loops. CG and BBMM are enabled by default; inference settings (e.g., max CG iterations, number of trace samples) are controlled via PyTorch context managers:
```python
import torch
import gpytorch

class ExactGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super().__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ZeroMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(
            gpytorch.kernels.RBFKernel()
        )

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)

train_x = torch.randn(3000, 10).cuda()
train_y = torch.randn(3000).cuda()
likelihood = gpytorch.likelihoods.GaussianLikelihood().cuda()
model = ExactGPModel(train_x, train_y, likelihood).cuda()

model.train(); likelihood.train()
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)

for epoch in range(50):
    optimizer.zero_grad()
    out = model(train_x)
    loss = -mll(out, train_y)
    loss.backward()
    optimizer.step()
```
Pivoted-Cholesky preconditioners and approximate GP models are exposed via high-level API calls and class inheritance. No explicit mention of CG/BBMM is required in user code (Gardner et al., 2018, Chang, 2019).
7. Limitations and Practical Considerations
Several practical limitations and considerations are noted for large-scale and ill-conditioned problems:
- Memory: BBMM holds a block of right-hand sides and residuals per CG iteration; for a dense kernel matrix, memory usage scales as $\mathcal{O}(n^2)$, and this kernel storage sets the practical limit for exact inference on a single 12GB GPU.
- Convergence: For ill-conditioned kernels or large , more CG iterations or higher preconditioner rank may be required.
- Numerical Stability: CG-based solves often avoid the need for explicit "jitter". If jitter is required for stability, it must be reflected in the kernel multiply routine.
- Non-Gaussian Likelihoods: fully compatible with variational ELBO schemes (e.g., SVGP for classification), but KL terms must likewise be expressed through mBCG-compatible matrix multiplies; standard variational families are shipped with the library.
- Use Case Scope: best suited for regression and classification tasks with moderate $n$; inducing-point or structured approximations are still needed for very large-scale datasets (Gardner et al., 2018).
In summary, GPyTorch enables scalable GP inference and Bayesian modeling via a unifying, GPU-accelerated computational framework, with design choices grounded in LazyTensor representation, BBMM routines, and PyTorch-native modularity. Its integration with BoTorch and Ax extends its applicability to Bayesian optimization and experimental design (Gardner et al., 2018, Chang, 2019).