
Pre-trained Gaussian Processes

Updated 20 February 2026
  • Pre-trained Gaussian Processes are models that set their priors, kernels, or hyperparameters using task-relevant datasets to embed domain knowledge.
  • They integrate methods like prior mean pre-training, deep kernel construction, and hierarchical hyperparameter tuning to boost data efficiency and model calibration.
  • Applications in Bayesian optimization, computer vision, and time-series forecasting illustrate their potential for robust few-shot learning and faster inference.

Pre-trained Gaussian Processes (GPs) are Gaussian process models whose prior (its mean, kernel, or hyperparameters) is set through analysis of large, task-relevant datasets, often to encode domain knowledge or transfer learning from related tasks. Their emergence has enabled more data-efficient Bayesian inference, accelerated model selection, principled uncertainty quantification, and robust few-shot learning across diverse application domains including Bayesian optimization, vision, psychiatry, and time-series analysis.

1. Core Methodologies of Pre-trained Gaussian Processes

Pre-training in the context of GPs denotes the process of constructing a prior mean, kernel, or hyperparameter distribution informed by multi-task or historical data, rather than relying on uninformed initializations or manual specification. Several major approaches are prevalent:

  • Prior Mean Pre-training: The mean function m(x) is taken from a pre-trained model, e.g., a deep neural network or a zero-shot classifier, to inject a strong prior signal into the GP regression or classification task. Examples include using CLIP’s zero-shot classifier as the GP mean (Miao et al., 2024) or using predictions from a pre-trained DNN (Ortega et al., 2024).
  • Kernel Pre-training and Deep Kernels: The kernel k(x, x') may combine features from several pre-trained encoders (e.g., vision transformers, self-supervised networks) via learnable or fixed linear and nonlinear transforms, resulting in ensemble deep kernel GPs (Miao et al., 2024).
  • Hyperparameter Pre-training: Hierarchical or empirical Bayes strategies fit priors p(θ) for kernel (and mean) hyperparameters using a collection of related datasets, e.g., prior tasks or trajectories (Wang et al., 2021, Kenney et al., 2024, Fan et al., 2022).
  • Universal Prior for Transfer Learning: HyperBO+ provides pre-training of hierarchical GP hyper-priors that generalize across task domains or search spaces (Fan et al., 2022).
  • Neural Surrogate Priors: DeepRV and related approaches encode the distribution of GP sample paths as decoder-only neural surrogates, pre-trained to mimic the distribution of function values under a target set of kernel hyperparameters (Navott et al., 27 Mar 2025).
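The prior-mean approach listed above can be sketched in a few lines: the GP conditions on the residuals between the data and a frozen pre-trained predictor, so the pre-trained model supplies the mean and the GP supplies calibrated uncertainty around it. A minimal NumPy illustration, assuming an RBF kernel; the `pretrained_mean` stand-in is hypothetical (in practice it would be a frozen DNN or zero-shot classifier):

```python
import numpy as np

def rbf_kernel(X1, X2, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel k(x, x') on 1-D inputs."""
    d2 = (X1[:, None] - X2[None, :]) ** 2
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

def pretrained_mean(x):
    # Hypothetical stand-in for a pre-trained model's prediction
    # (e.g., a DNN or zero-shot classifier output).
    return np.sin(x)

def gp_posterior(X_train, y_train, X_test, noise=1e-2):
    """GP posterior with a non-zero, pre-trained prior mean m(x)."""
    K = rbf_kernel(X_train, X_train) + noise * np.eye(len(X_train))
    K_s = rbf_kernel(X_train, X_test)
    # Condition on residuals y - m(X): the GP only models what the
    # pre-trained mean gets wrong.
    alpha = np.linalg.solve(K, y_train - pretrained_mean(X_train))
    mu = pretrained_mean(X_test) + K_s.T @ alpha
    cov = rbf_kernel(X_test, X_test) - K_s.T @ np.linalg.solve(K, K_s)
    return mu, cov

X = np.array([0.0, 1.0, 2.0])
y = np.sin(X) + 0.1           # data sitting slightly above the prior mean
mu, cov = gp_posterior(X, y, np.array([1.5]))
```

Because only the residual process is modeled, the posterior mean at a test point reverts to the pre-trained prediction far from the data, which is exactly the inductive bias prior-mean pre-training is meant to provide.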

2. Mathematical Foundations and Pre-training Objectives

The common theoretical basis for pre-trained GPs is the construction of a prior that encodes realistic inductive biases about the function class or task at hand. For instance:

  • KL-based Prior Matching: Given i.i.d. functions f_1, ..., f_N drawn from an unknown GP(μ*, k*), fit a model GP(μ, k) by minimizing

L(μ, k) = D_KL(GP(μ*, k*) ‖ GP(μ, k))

where D_KL is the functional KL divergence. Practical approximations include (a) an empirical KL computed on matched input grids, or (b) the aggregate negative log-likelihood over many tasks (Wang et al., 2021).

  • Hierarchical Empirical Bayes: Hyperpriors for kernel parameters are inferred from N related tasks, typically by maximizing the marginal likelihood across all tasks over the hyperparameter-prior parameters a:

â = argmax_a ∏_{i=1}^{N} p((θ̂_i, σ̂_i); a)

where (θ̂_i, σ̂_i) are the task-wise MLE estimates (Fan et al., 2022, Kenney et al., 2024).
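The two-stage structure above can be sketched concretely: stage 1 computes a per-task hyperparameter MLE, and stage 2 fits the hyper-prior p(θ; a) to those estimates. In this illustrative sketch the hyper-prior family is assumed log-normal, so its MLE reduces to the moments of the per-task log estimates; the simulated tasks and grid search are hypothetical simplifications:

```python
import numpy as np

def rbf(X, ls):
    d2 = (X[:, None] - X[None, :]) ** 2
    return np.exp(-0.5 * d2 / ls**2) + 1e-5 * np.eye(len(X))

def task_mle(X, y, grid):
    """Stage 1: per-task lengthscale MLE by grid search."""
    def nll(ls):
        K = rbf(X, ls)
        _, logdet = np.linalg.slogdet(K)
        return 0.5 * y @ np.linalg.solve(K, y) + 0.5 * logdet
    return min(grid, key=nll)

rng = np.random.default_rng(1)
X = np.linspace(0, 5, 25)
grid = np.linspace(0.3, 2.0, 18)

# Simulate related tasks whose true lengthscales come from a
# (hypothetical) log-normal hyper-prior: log ls_i ~ N(0, 0.2^2).
theta_hat = []
for _ in range(15):
    ls_i = float(np.exp(rng.normal(0.0, 0.2)))
    f_i = np.linalg.cholesky(rbf(X, ls_i)) @ rng.standard_normal(len(X))
    theta_hat.append(task_mle(X, f_i, grid))

# Stage 2: fit the hyper-prior p(theta; a) by MLE. For a log-normal
# family this is just the mean and std of the per-task log estimates.
log_t = np.log(theta_hat)
a_mean, a_std = float(log_t.mean()), float(log_t.std())
```

The recovered (a_mean, a_std) then defines the plug-in hyper-prior used when a new, related task arrives.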

  • Decoder-based Emulation: Deep generative models approximate the Cholesky factorization of the GP covariance, enabling efficient sampling of function draws f = g_θ(z, c), where θ is learned to minimize the L2 error against analytic GP draws for a range of kernel hyperparameters c (Navott et al., 27 Mar 2025).
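The emulation objective can be illustrated in the simplest possible setting: analytic GP draws f = Lz (L the Cholesky factor) serve as regression targets for a decoder trained to minimize L2 error. In this sketch the decoder is linear and the hyperparameter c is fixed, so least squares recovers L exactly; DeepRV replaces the linear map with a neural network conditioned on c, a detail this toy version deliberately omits:

```python
import numpy as np

rng = np.random.default_rng(2)
X = np.linspace(0, 1, 20)
d2 = (X[:, None] - X[None, :]) ** 2
ls = 0.3                                   # fixed kernel hyperparameter c
K = np.exp(-0.5 * d2 / ls**2) + 1e-6 * np.eye(len(X))
L = np.linalg.cholesky(K)

# Paired (z, f) training data: analytic GP draws f = L z.
Z = rng.standard_normal((5000, len(X)))
F = Z @ L.T

# "Decoder" g_theta(z) = z @ W, fit by minimizing L2 error against
# the analytic draws; here the optimum is W = L^T.
W, *_ = np.linalg.lstsq(Z, F, rcond=None)

# New samples from the trained surrogate reproduce the GP covariance.
samples = rng.standard_normal((20000, len(X))) @ W
emp_cov = samples.T @ samples / len(samples)
```

At inference time, drawing from the surrogate costs one matrix-vector product per sample, which is the source of the runtime reductions reported for DeepRV.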

3. Integration with Pre-trained Models and Deep Networks

A crucial axis of development is the unification of GP priors with pre-trained deep models:

  • Fixed Mean Gaussian Processes: The mean function of the GP inherits from the outputs of a pretrained DNN, and the GP kernel is trained to capture the covariance of prediction errors. Only the GP variances are learned, so the mean predictions remain identical to the DNN's while post-hoc uncertainty estimates are delivered at scale (Ortega et al., 2024).
  • Activation-level GPs: The Gaussian Process Activation function (GAPA) constructs a separate one-dimensional GP for each neuron’s activation, with mean fixed to that neuron’s output in a pre-trained network. Both empirical and variational pre-training strategies exist to set the kernel hyperparameters rapidly, yielding tractable and mean-preserving uncertainty quantification (Bergna et al., 28 Feb 2025).
  • Ensemble Deep Kernels: For low-shot image classification, ensemble kernels aggregate features from several distinct pre-trained encoders, with per-feature scaling weights learned to optimize predictive likelihood. The mean is specified by a pre-trained classifier such as CLIP, yielding an analytically tractable vector-valued GP (Miao et al., 2024).
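The ensemble-kernel construction above rests on a standard fact: a nonnegative weighted sum of valid kernels is itself a valid (positive semi-definite) kernel, so features from several frozen encoders can be combined safely. A minimal sketch with two hypothetical stand-in "encoders" and hand-fixed weights (in practice the per-feature weights are learned by maximizing predictive likelihood, and the encoders are frozen foundation models such as CLIP or DINO):

```python
import numpy as np

# Hypothetical stand-ins for two frozen pre-trained encoders, each
# mapping inputs to a feature vector.
def encoder_a(X):
    return np.stack([X, X**2], axis=1)

def encoder_b(X):
    return np.stack([np.sin(X), np.cos(X)], axis=1)

def ensemble_kernel(X1, X2, weights):
    """k(x, x') = sum_j w_j <phi_j(x), phi_j(x')>, with w_j >= 0.
    Each inner product is a valid kernel, so the weighted sum is too."""
    k = np.zeros((len(X1), len(X2)))
    for w, enc in zip(weights, (encoder_a, encoder_b)):
        k += w * enc(X1) @ enc(X2).T
    return k

X = np.linspace(-1, 1, 10)
K = ensemble_kernel(X, X, weights=[0.5, 2.0])
eigvals = np.linalg.eigvalsh(K)   # all >= 0 up to numerical error
```

Pairing such a kernel with a pre-trained classifier as the mean yields the analytically tractable vector-valued GP described above.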

4. Pre-training Algorithms and Modified Inference Workflows

Pre-training of GPs typically follows a two-stage process:

  • Offline Prior Learning: Hyperparameters or prior functions are fitted to a collection of related tasks (multi-task regression, BO trajectories, past time series) via either MLE, variational inference, or direct empirical estimation (e.g., moments of kernel hyperparameters across tasks) (Wang et al., 2021, Fan et al., 2022, Kenney et al., 2024).
  • Plug-in Inference: The learned prior (or hyper-prior) is fixed and used as a plug-in prior in subsequent Bayesian inference or Bayesian optimization, enforcing strict separation between pre-training and task-specific learning (Wang et al., 2021).
  • Variational and Sparse GP Methods: For fixed-mean GPs or activation-level GPs, variational inference over the remaining GP parameters allows for scalability to large datasets, leveraging mini-batch gradients and sparse inducing point representations (Ortega et al., 2024, Bergna et al., 28 Feb 2025).
  • Neural Decoders in Probabilistic Programming: Encoder-less decoder-based surrogates such as DeepRV replace the GP's exact sample path generation; the prior is pre-trained and then used at inference time within probabilistic programming frameworks, yielding significant runtime reductions (Navott et al., 27 Mar 2025).
Approach              | Pre-training Target   | Inference Style
----------------------|-----------------------|--------------------------
KL/likelihood fitting | Mean/kernels          | Plug-in GP regression/BO
HyperBO/HyperBO+      | Hyperpriors           | Hierarchical BO
Fixed-mean GP         | Mean from DNN output  | Sparse VI, fixed mean
GAPA                  | Activation means      | Analytic, per-neuron GP
DeepRV                | Sample path emulator  | Surrogate in PPL, MCMC
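The strict two-stage separation described in this section can be made concrete: the offline stage produces a frozen hyperparameter estimate, and the plug-in stage runs ordinary GP inference on a new task without any re-fitting. An illustrative sketch with hypothetical placeholder values for the per-task estimates (in practice these come from the offline prior-learning stage):

```python
import numpy as np

def rbf(X1, X2, ls):
    return np.exp(-0.5 * (X1[:, None] - X2[None, :]) ** 2 / ls**2)

def posterior_mean(X, y, X_star, ls, noise=1e-2):
    """Standard GP posterior mean with a fixed, pre-trained lengthscale."""
    K = rbf(X, X, ls) + noise * np.eye(len(X))
    return rbf(X, X_star, ls).T @ np.linalg.solve(K, y)

# Stage 1 (offline): pool per-task lengthscale estimates into one
# plug-in value. The numbers below are hypothetical placeholders.
task_estimates = [0.7, 0.9, 0.8, 0.85]
ls_plugin = float(np.mean(task_estimates))

# Stage 2 (plug-in): the pre-trained prior is frozen; a new task is
# handled with ordinary GP inference and no hyperparameter re-fitting,
# enforcing the separation between pre-training and task learning.
X_new = np.array([0.0, 0.5, 1.0])
y_new = np.array([0.1, 0.4, 0.2])
mu = posterior_mean(X_new, y_new, np.array([0.25]), ls_plugin)
```

Keeping the prior frozen is what rules out information leakage from the test task back into the prior, at the cost of slower adaptation when the new task differs sharply from the pre-training collection.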

5. Theoretical Guarantees and Empirical Results

Pre-trained GPs deliver both theoretical and empirical benefits across various metrics:

  • Statistical Consistency: Under standard regularity conditions, the per-dataset GP hyperparameter MLE is consistent, and the second-stage hyper-prior MLE converges to the true hyper-prior as N → ∞ (Fan et al., 2022).
  • Regret Bounds in BO: HyperBO demonstrates that plug-in pre-trained GPs provide bounded posterior error and near-zero simple regret (as task count increases) when used for acquisition in Bayesian optimization (Wang et al., 2021).
  • Uncertainty Quantification and Calibration: Methods including ensemble-GP with CLIP mean (Miao et al., 2024), fixed-mean GPs (Ortega et al., 2024), and GAPA (Bergna et al., 28 Feb 2025) report improved expected calibration error (ECE), thresholded adaptive calibration error (TACE), and area under ROC for OOD detection compared to deterministic or Laplace-based baselines.
  • Computational Speedup: Decoder-based priors (DeepRV) enable 1.5–4× faster MCMC inference in realistic spatial statistics settings, with effective sample size per second up to 5× higher than full GP MCMC, and negligible loss in posterior accuracy (Navott et al., 27 Mar 2025).
  • Few-shot and Transfer Learning Gains: In low-shot image classification, pre-trained mean + deep kernel GP ensembles achieve up to +4.5% accuracy improvement over strong CLIP-based baselines and outperform all deterministic ensemble methods (Miao et al., 2024).
  • Universal Prior Generalization: HyperBO+ enables universal prior learning across different input domains, outperforming non-informative and task-specific prior baselines by sharing statistical strength (Fan et al., 2022).

6. Application Domains and Notable Use Cases

Pre-trained GPs have proven effective in several real-world scenarios:

  • Bayesian Optimization: Pre-trained GP priors and hierarchical hyperpriors yield rapid convergence and significant sample-efficiency over meta-learning, RL-trained acquisition, and hand-tuned priors in hyperparameter search, neural architecture search, and scientific simulation (Wang et al., 2021, Fan et al., 2022).
  • Computer Vision: Few-shot and low-shot classification models utilizing pre-trained means and composite kernels draw on CLIP, DINO, and other foundation models, producing robust accuracy increases and calibrated uncertainties (Miao et al., 2024).
  • Time-series Forecasting: Pre-trained structured GPs (sGP) combine physics-informed means and data-driven residual priors, achieving improved calibration and accurate extrapolation for battery health prediction (Kenney et al., 2024).
  • Medical and Spatial Statistics: DeepRV neural surrogate GPs accelerate and scale Bayesian disease mapping, closely matching gold-standard MCMC accuracy (Navott et al., 27 Mar 2025).
  • Post-hoc Uncertainty for Deep Learning: Both fixed-mean GPs and GAPA deliver scalable, architecture-agnostic uncertainty quantification atop pre-trained DNNs, with empirically validated gains across regression, classification, and OOD detection benchmarks (Ortega et al., 2024, Bergna et al., 28 Feb 2025).

7. Limitations and Considerations

Several critical considerations for pre-trained GPs include:

  • Pre-training Data Requirements: Hierarchical or multi-task pre-training demands access to representative and sufficiently large datasets of related tasks or trajectories, which may not be available in resource-constrained settings (Fan et al., 2022, Kenney et al., 2024).
  • Separation of Training Stages: The plug-in nature of pre-training, while ensuring no information leakage, makes adaptation to radically novel test distributions challenging; approaches such as universal priors and cross-domain adaptation partially address this (Fan et al., 2022).
  • Computation vs. Flexibility: Decoder-based emulators and activation-level GPs significantly reduce computational cost at inference, but may discard fine-grained uncertainty structure or give rise to bias when extrapolating far outside the training regime (Navott et al., 27 Mar 2025, Bergna et al., 28 Feb 2025).
  • Calibration Trade-offs: While pre-trained GPs frequently deliver better-calibrated uncertainty than deterministic or Laplace methods, under- or overconfidence can occur depending on how closely the pre-training set matches the target domain (Kenney et al., 2024).
  • Architectural Choices: Modelers must balance sparsity (number of inducing points), kernel expressivity, and hyperparameter parameterization to avoid underfitting, overfitting, and excessive computational burden (Ortega et al., 2024, Bergna et al., 28 Feb 2025).
