HyperBO: Pre-trained GP for Bayesian Optimization
- HyperBO is a data-driven Bayesian optimization framework that pre-trains GP priors using multi-task evaluations to automate prior selection and improve sample efficiency.
- The framework employs empirical KL divergence or negative log-marginal likelihood losses to accurately learn prior parameters, thereby enhancing hyperparameter tuning performance.
- HyperBO+ generalizes this approach via a hierarchical GP model, enabling efficient optimization across heterogeneous domains while ensuring strong theoretical regret bounds.
HyperBO is a data-driven framework for Bayesian optimization (BO) that pre-trains Gaussian process (GP) priors on observations from related tasks, automating prior selection and substantially improving empirical sample efficiency in black-box function optimization. The original approach targets homogeneous search spaces; HyperBO+ later extends it by learning hierarchical GP priors that transfer across heterogeneous domains. Both HyperBO and HyperBO+ offer strong theoretical guarantees and empirical performance, particularly in large-scale hyperparameter tuning for machine learning models (Wang et al., 2021; Fan et al., 2022).
1. Motivation and Background
Bayesian optimization frameworks require specifying a GP prior—mean function, covariance kernel, and observation noise—that encodes structural assumptions about the target black-box function. Traditional BO methods necessitate expert design for these priors, which is both difficult and prone to misspecification, especially for complex objectives such as the validation error surfaces in neural network hyperparameter tuning. HyperBO addresses this by utilizing repositories of prior evaluations (multi-task data) to empirically learn a suitable GP prior. The key insight is to pre-train the GP's mean and kernel parameters across a set of related tasks, then fix these learned parameters to serve as the prior during optimization on new tasks (Wang et al., 2021).
A limitation of the original HyperBO formulation is the restriction to matched input domains—i.e., all related tasks must operate over the same set of hyperparameters and domain boundaries. HyperBO+ generalizes the pre-training approach using a hierarchical Bayesian model that decouples prior learning from input space homogeneity, thereby supporting BO over a wider range of functional domains (Fan et al., 2022).
2. Methodological Formulation
HyperBO formalizes pre-training as learning a parametric GP prior that approximates the distribution of functions observed across a collection of related tasks. Model fitting minimizes either a finite-dimensional functional KL divergence loss or, in practice, one of its empirical analogues: the empirical KL divergence (EKL) when tasks share a matched input grid, or the average negative log marginal likelihood (NLL) across the observed tasks for generic data. These losses are minimized with respect to the GP parameters via L-BFGS or Adam, often with the mean parameterized by a shallow neural network and the kernel composed with trainable feature warping (Wang et al., 2021).
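The NLL variant of this pre-training objective can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a 1-D input space, a constant mean, and a squared-exponential kernel, and fits the shared prior parameters by minimizing the average per-task NLL with L-BFGS (helper names are hypothetical).

```python
# Minimal sketch of HyperBO-style pre-training: fit ONE shared GP prior
# (constant mean, squared-exponential kernel, homoscedastic noise) by
# minimizing the average negative log marginal likelihood over tasks.
import numpy as np
from scipy.optimize import minimize

def sq_exp_kernel(x1, x2, lengthscale, variance):
    """Squared-exponential kernel on 1-D inputs."""
    d = x1[:, None] - x2[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def task_nll(params, x, y):
    """Negative log marginal likelihood of one task under the shared prior."""
    mean, log_ls, log_var, log_noise = params
    K = sq_exp_kernel(x, x, np.exp(log_ls), np.exp(log_var))
    K = K + np.exp(log_noise) * np.eye(len(x))
    r = y - mean
    L = np.linalg.cholesky(K)
    a = np.linalg.solve(L.T, np.linalg.solve(L, r))
    return 0.5 * r @ a + np.log(np.diag(L)).sum() + 0.5 * len(x) * np.log(2 * np.pi)

def pretrain(tasks):
    """Minimize the average NLL over all tasks w.r.t. the shared parameters."""
    loss = lambda p: np.mean([task_nll(p, x, y) for x, y in tasks])
    bounds = [(-10, 10), (-3, 3), (-3, 3), (-6, 2)]  # keep kernels well-conditioned
    return minimize(loss, np.zeros(4), method="L-BFGS-B", bounds=bounds).x

rng = np.random.default_rng(0)
tasks = []
for _ in range(5):  # related tasks: noisy sinusoids with a shared offset
    x = rng.uniform(0, 5, 20)
    tasks.append((x, 2.0 + np.sin(x) + 0.1 * rng.normal(size=20)))

params = pretrain(tasks)
print("learned (mean, log-lengthscale, log-variance, log-noise):", params)
```

At BO time these learned parameters would be frozen and reused as the prior on a new task, rather than re-fit from that task's few observations.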
HyperBO+ extends the model to a hierarchical GP suitable for multiple, possibly heterogeneous input domains. For each domain, with its own search space, a set of functions is observed. Each function is modeled as an i.i.d. draw from a per-domain GP with its own kernel parameters and observation noise, and these per-domain hyperparameters are themselves assumed to be i.i.d. draws from a global hyperprior (Fan et al., 2022).
The meta-training in HyperBO+ proceeds in two steps:
- Per-domain GP fitting: Estimate each domain's GP hyperparameters by maximizing the joint marginal likelihood over that domain's task data.
- Global hyperprior estimation: Fit the hyperprior parameters by maximum likelihood, treating the per-domain MLEs as i.i.d. samples.
The resulting hierarchical prior is deployed for BO in any domain matching the structure seen in training.
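The two-step procedure can be illustrated with a toy sketch. Step 1 is stubbed out here (the per-domain MLEs are given as data), and step 2 fits a Gamma hyperprior by moment matching as a simple stand-in for MLE; the numbers and helper names are hypothetical.

```python
# Sketch of HyperBO+ meta-training under simplifying assumptions:
# step 1 yields per-domain MLEs of a positive GP hyperparameter (here,
# a lengthscale); step 2 fits a global Gamma hyperprior to those
# estimates by moment matching (a stand-in for full MLE).
import numpy as np

def fit_gamma_by_moments(samples):
    """Match a Gamma(shape, scale) to the sample mean and variance."""
    m, v = np.mean(samples), np.var(samples)
    scale = v / m      # Gamma variance = shape * scale**2
    shape = m / scale  # Gamma mean     = shape * scale
    return shape, scale

# Step 1 (stubbed): pretend each domain's marginal-likelihood fit
# returned these lengthscale MLEs.
per_domain_lengthscales = np.array([0.8, 1.1, 0.9, 1.3, 1.0, 0.7])

# Step 2: fit the global hyperprior over lengthscales.
shape, scale = fit_gamma_by_moments(per_domain_lengthscales)
print(f"Gamma hyperprior: shape={shape:.2f}, scale={scale:.4f}")

# At BO time, candidate hyperparameters for an unseen domain are drawn
# from this hyperprior rather than fixed to a single point estimate.
draws = np.random.default_rng(1).gamma(shape, scale, size=4)
```

The key design point is that the hyperprior, not any single domain's point estimate, is what transfers to a new search space.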
3. BO Adaptation and Algorithmic Implementation
Given a pre-trained (HyperBO) or hierarchical (HyperBO+) prior, BO on a new objective proceeds as follows:
- The GP prior (mean, kernel, noise) remains fixed—no per-task hyperparameter re-optimization occurs.
- Each BO iteration uses either the fixed pre-trained prior (HyperBO) or hyperparameters sampled from the learned hyperprior (HyperBO+), computes the corresponding GP posterior on the observed data, and evaluates an acquisition function (e.g., probability of improvement, expected improvement, UCB).
- For HyperBO+, acquisition functions are aggregated via marginal likelihood weighting over posterior samples to approximate posterior inference over GP hyperparameters.
- The next evaluation point is selected by maximizing this weighted acquisition function, then the process repeats after updating the data (Wang et al., 2021, Fan et al., 2022).
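The HyperBO case of the loop above (fixed prior, no per-task re-fitting) can be sketched on a toy 1-D problem. Everything here is illustrative: the kernel hyperparameters stand in for pre-trained values, candidates are a dense grid, and expected improvement is used as the acquisition.

```python
# Minimal BO loop with a fixed, pre-trained prior (the HyperBO case):
# the GP hyperparameters are frozen, and each iteration maximizes
# expected improvement (EI) over a candidate grid.
import numpy as np
from scipy.stats import norm

def sq_exp(x1, x2, ls=0.5, var=1.0):
    """Squared-exponential kernel; ls/var stand in for pre-trained values."""
    d = x1[:, None] - x2[None, :]
    return var * np.exp(-0.5 * (d / ls) ** 2)

def posterior(x_obs, y_obs, x_cand, noise=1e-4):
    """GP posterior mean/std at candidates, zero prior mean, fixed kernel."""
    K = sq_exp(x_obs, x_obs) + noise * np.eye(len(x_obs))
    Ks = sq_exp(x_obs, x_cand)
    sol = np.linalg.solve(K, Ks)
    mu = sol.T @ y_obs
    var = np.clip(np.diag(sq_exp(x_cand, x_cand) - Ks.T @ sol), 1e-12, None)
    return mu, np.sqrt(var)

def expected_improvement(mu, sigma, best):
    """EI for minimization: expected amount by which we beat `best`."""
    z = (best - mu) / sigma
    return (best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

f = lambda x: (x - 0.6) ** 2  # toy objective, minimum at x = 0.6
cands = np.linspace(0, 1, 201)
x_obs = np.array([0.1, 0.9])
y_obs = f(x_obs)
for _ in range(10):  # BO iterations; the prior stays fixed throughout
    mu, sigma = posterior(x_obs, y_obs, cands)
    x_next = cands[np.argmax(expected_improvement(mu, sigma, y_obs.min()))]
    x_obs = np.append(x_obs, x_next)
    y_obs = np.append(y_obs, f(x_next))

print("best x found:", x_obs[y_obs.argmin()])
```

The HyperBO+ variant would additionally draw several hyperparameter samples from the hyperprior per iteration and aggregate their acquisition values with marginal-likelihood weights.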
Computational Complexity: Pre-training is cubic in the number of data points per task but scales linearly with the number of tasks; BO adaptation scales with the number of posterior samples, and requires no inducing points or other heavy approximations for moderate evaluation budgets (Fan et al., 2022).
4. Theoretical Properties
Both HyperBO and HyperBO+ offer non-asymptotic theoretical guarantees:
- Posterior consistency: Given sufficient multi-task data, the pre-trained GP's posterior mean and covariance approach those of the true underlying GP as the number of tasks and the number of points per task grow.
- Regret bounds: HyperBO achieves simple regret bounds under standard GP-UCB or PI acquisitions that converge to standard minimax rates as the amount of pre-training data increases; analogous results apply to HyperBO+ once the hierarchical prior is consistent (Wang et al., 2021; Fan et al., 2022).
- Asymptotics of hierarchical fitting: ML-II estimation for per-domain GPs and hyperprior parameters is consistent under mild regularity assumptions for stationary Matérn kernels and sufficient coverage of each domain (Fan et al., 2022).
This suggests the utility of HyperBO frameworks persists as more related tasks and data accumulate, leading to near-optimal regret rates in BO.
5. Empirical Evaluations
Benchmark datasets:
- Hyperparameter tuning (PD1): A large multi-task benchmark covering optimizer tuning in deep learning, with 24 tasks each comprising extensive sweeps over learning rates, momentum, and decay parameters, yielding over 50,000 runs.
- Classical ML tuning (HPO-B): 16 diverse hyperparameter tuning domains from OpenML, with dimensionalities spanning 2–18, and 6 million logged evaluations.
- Synthetic benchmarks: Custom super-datasets with controlled domain and hyperprior configurations for ablation (Wang et al., 2021, Fan et al., 2022).
Main findings:
- HyperBO methods are 3–16× more sample efficient (measured by normalized simple regret) than standard or non-informative GP-based BO, and reach small regret thresholds far faster.
- HyperBO+ matches the performance of an oracle GP with the true hyperprior on synthetic domains and outperforms HyperBO, especially on tasks with domains not seen in pre-training. The method generalizes well: ablations (e.g., the z-HyperBO+ variant) confirm that performance holds even when the test domain's data is excluded from pre-training.
- Empirical regret and negative log-likelihood (NLL) correlate with pre-training loss, affirming the effectiveness of empirical KL/NLL-based prior fitting.
- Discrete-mixture variants (x-HyperBO+) perform slightly worse, indicating the value of smoothness in hyperprior modeling.
- Sensitivity analysis shows increasing the number of pre-training tasks reduces estimator variance and improves BO outcomes (Wang et al., 2021, Fan et al., 2022).
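The normalized simple regret metric used in these comparisons can be computed as below; the convention shown (scaling the gap to the task optimum by the task's value range) is one common choice, and the specific numbers are hypothetical.

```python
# One common definition of normalized simple regret: the gap between
# the best objective value found so far and the task optimum, scaled
# by the task's value range so scores are comparable across tasks.
def normalized_simple_regret(best_found, y_min, y_max):
    """0 means the optimum was found; 1 means no better than the worst value."""
    return (best_found - y_min) / (y_max - y_min)

# Hypothetical task: optimum 0.10, worst value 0.50, best found 0.12.
print(normalized_simple_regret(0.12, 0.10, 0.50))
```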
6. Practical Guidance and Limitations
When to use HyperBO or HyperBO+:
- Employ HyperBO when all tasks share identical or aligned domains (matching hyperparameter schemas and bounds); this setting achieves strong empirical results with a simpler pipeline.
- Use HyperBO+ for meta-BO across heterogeneous domains—distinct hyperparameter sets, dimensions, or types—and when a single, universal prior is required for BO in previously unseen spaces (Fan et al., 2022).
Kernel and prior recommendations:
- Stationary Matérn ($3/2$ or $5/2$) and squared-exponential kernels suffice for most hyperparameter tuning tasks.
- Gamma priors are advocated for positive kernel parameters (lengthscales, variances), with Normal priors for offsets.
- Non-Gaussianity or non-stationarity in target functions may require further adaptation, e.g., via BoTorch/Vizier input/output warping.
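A common lightweight mitigation along the lines of the warping mentioned above is to optimize scale-sensitive parameters such as learning rates on a log-warped axis and to standardize objective values before GP fitting. This is a hand-rolled sketch, not the BoTorch/Vizier API.

```python
# Simple input/output warping: map learning rates to [0, 1] on a log
# scale (so the GP sees a roughly stationary axis) and standardize
# objective values before fitting.
import numpy as np

def warp_inputs(lr, low=1e-5, high=1e-1):
    """Map learning rates in [low, high] to [0, 1] on a log10 scale."""
    return (np.log10(lr) - np.log10(low)) / (np.log10(high) - np.log10(low))

def standardize_outputs(y):
    """Zero-mean, unit-variance objective values for GP fitting."""
    return (y - y.mean()) / y.std()

lrs = np.array([1e-5, 1e-4, 1e-3, 1e-2, 1e-1])
print(warp_inputs(lrs))  # log-spaced rates become evenly spaced in [0, 1]
```

Learned warpings (e.g., Kumaraswamy input transforms) generalize this idea by fitting the transform's parameters jointly with the GP.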
Limitations and extensions:
- The current framework presumes constant mean and stationary kernel structures; further research could enable the meta-learning of richer kernel mixtures or neural process priors.
- Acquisition-function and kernel-architecture meta-learning, as well as scaling to high-dimensional spaces, remain active areas for extension, potentially leveraging sparse or embedding GPs (Fan et al., 2022).
7. Relationship to Broader Transfer BO and Impact
HyperBO and HyperBO+ situate within a broader research trajectory seeking to automate and generalize BO through transfer learning and meta-learning, encompassing approaches such as multi-task BO and few-shot BO (Fan et al., 2022). They distinguish themselves by rigorous modeling of prior uncertainty transfer, closed-form or tractable parameter estimation, practical scalability, and strong empirical guarantees. Their introduction of universal hierarchical GP priors in BO, as realized in HyperBO+, marks a substantive advance, enabling systematic sample-efficient optimization across previously disjoint or heterogeneous search spaces. Both frameworks have influenced subsequent work in Bayesian optimization, meta-learning of kernel components, and application-specific BO system design.
Key references:
- HyperBO: "Pre-trained Gaussian Processes for Bayesian Optimization" (Wang et al., 2021)
- HyperBO+: "HyperBO+: Pre-training a universal prior for Bayesian optimization with hierarchical Gaussian processes" (Fan et al., 2022)