
Extensions to Classical Bayesian Optimization

Updated 16 August 2025
  • Extensions to classical Bayesian optimization are methodological enhancements that address constraints, nonstationarity, multi-objectivity, and high evaluation costs.
  • They employ advanced surrogate modeling techniques such as correlated Gaussian processes, hybrid models, and sequential Monte Carlo to improve prediction fidelity.
  • These approaches drive scalability and efficiency through transfer learning, cost-aware planning, and pathwise acquisition functions for optimizing expensive black-box functions.

Bayesian optimization (BO) is a suite of methodologies for global optimization of expensive black-box functions under uncertainty, typically using Gaussian process (GP) surrogates. Classical BO frameworks, however, often prove limited in practical settings due to assumptions of unconstrained spaces, independence of objectives and constraints, stationarity, and cost-insensitivity. Over the past decade, a broad array of extensions has been developed to overcome these limitations. These advances address multi-objective and constrained problems, robustness to dynamic or nonstationary objectives, transfer learning across tasks, complex surrogate/cost structures, and scalability to high-dimensional spaces. Below, key classes of such extensions are organized and detailed, with explicit reference to their mathematical formulations and implementation considerations as found in leading research.

1. Multi-Objective and Constrained Bayesian Optimization

Classical BO primarily aims to optimize a single objective without constraints. In practical scenarios—such as engineering and design problems—one must tackle simultaneous objectives and inequality constraints, often with expensive-to-evaluate functions.

Unified Treatment of Objectives and Constraints

A central challenge in constrained and multi-objective BO is defining a mechanism to balance objectives and enforce constraints. The extended domination rule (Feliot et al., 2015) introduces a unified domination mapping ψ(·) that penalizes infeasible points by assigning +∞ to their objectives, and encodes constraint violations for Pareto comparisons:

$$\psi(y_o, y_c) = \begin{cases} (y_o, 0), & y_c \leq 0 \\ (+\infty,\ \max(y_c, 0)), & \text{otherwise} \end{cases}$$

This rule ensures feasible solutions dominate infeasible ones and enables generalized Pareto or dominance-based approaches in the presence of constraints.
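The mechanics of the rule can be seen in a minimal NumPy sketch (the function names and the two-objective example are ours, for illustration only):

```python
import numpy as np

def psi(y_obj, y_con):
    """Extended domination map psi(y_o, y_c) from Feliot et al. (2015):
    feasible points (all constraints <= 0) keep their objectives with zero
    penalty; infeasible points get +inf objectives plus their positive
    constraint violations."""
    y_obj, y_con = np.asarray(y_obj, float), np.asarray(y_con, float)
    if np.all(y_con <= 0):
        return np.concatenate([y_obj, np.zeros_like(y_con)])
    return np.concatenate([np.full_like(y_obj, np.inf), np.maximum(y_con, 0.0)])

def dominates(a, b):
    """Pareto domination in the extended (objective, violation) space."""
    return bool(np.all(a <= b) and np.any(a < b))

feasible = psi([1.0, 2.0], [-0.5])   # constraint satisfied
infeasible = psi([0.1, 0.1], [0.3])  # constraint violated by 0.3
```

Even though the infeasible point has better raw objectives, its mapped image is dominated by any feasible point, which is exactly what the rule is designed to enforce.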

Extended Hyper-Volume Improvement

The expected hyper-volume improvement (EHVI) criterion measures the expected gain in the dominated region of objective-constraint space upon new evaluation:

$$\text{EHVI}_n(x) = \int_{G_n} P\big(\xi(x) \triangleright y\big)\, dy$$

where $G_n$ is the non-dominated portion of the objective–constraint space and $\triangleright$ is the extended domination relation defined by $\psi(\cdot)$.

Sequential Monte Carlo Estimation

Computing EHVI in high-dimensional objective/constraint spaces is intractable analytically. Sequential Monte Carlo (SMC), particularly the “subset simulation” framework, enables tractable particle-based approximation of these high-dimensional integrals, balancing computational tractability with statistical fidelity. Removal, resampling, and MCMC-based move steps ensure particles remain representative of the ever-shrinking feasible/dominated region as new data accrue (Feliot et al., 2015).
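The remove/resample/move pattern can be illustrated with a generic subset-simulation sketch, here estimating a Gaussian tail probability rather than the EHVI integral itself (the target, step sizes, and population size are our illustrative choices, not those of the cited paper):

```python
import numpy as np

def subset_simulation(n=1000, p0=0.1, t=3.0, seed=0):
    """Estimate P(f(X) > t) for f(x) = x, X ~ N(0, 1), by chaining
    conditional levels. At each level the worst particles are removed,
    survivors are resampled up to n, and a few Metropolis moves keep the
    population representative of the shrinking region."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)            # initial particle population
    log_p = 0.0
    for _ in range(20):               # safety cap on the number of levels
        thresh = np.quantile(x, 1 - p0)
        if thresh >= t:               # final level reached
            break
        log_p += np.log(p0)
        survivors = x[x > thresh]                          # removal
        x = survivors[rng.integers(0, len(survivors), n)]  # resampling
        for _ in range(5):            # Metropolis moves on N(0,1) | x > thresh
            prop = x + 0.5 * rng.normal(size=n)
            accept = (prop > thresh) & (rng.random(n) < np.exp(0.5 * (x**2 - prop**2)))
            x = np.where(accept, prop, x)
    return np.exp(log_p) * np.mean(x > t)

p_hat = subset_simulation()           # true value is P(X > 3), about 1.35e-3
```

Each level multiplies the estimate by the conditional survival probability, so rare events are reached as a product of moderate factors instead of one vanishing Monte Carlo average.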

2. Surrogate Modeling under Correlated Outputs and Constraints

Classical constrained BO assumes independent GPs for objectives and constraints, which fails when correlations are present (frequent in process engineering, manufacturing, or environmental monitoring).

Bivariate Gaussian process models (Li et al., 30 May 2025) explicitly encode the joint distribution:

$$\begin{bmatrix} y(x) \\ z(x) \end{bmatrix} \sim \mathcal{N}\!\left( \begin{bmatrix} \mu_y \\ \mu_z \end{bmatrix}, \begin{bmatrix} \sigma_y^2 & \rho \sigma_y \sigma_z \\ \rho \sigma_y \sigma_z & \sigma_z^2 \end{bmatrix} \right)$$

with covariance between the objective $y(x)$ and the constraint $z(x)$ parameterized by the correlation coefficient $\rho$. Expected constrained improvement then becomes a non-factorizing acquisition function involving both marginal and joint CDFs:

$$\text{ECI}(x) = \big(t_1(x) + t_2(x)\big)\, t_3(x)$$

with $t_1(x)$, $t_2(x)$, $t_3(x)$ incorporating means, variances, and bivariate probabilities. Such surrogates enable exploitation of cross-target statistical dependencies, but at increased computational cost.
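Why the acquisition no longer factorizes is easy to see numerically: a Monte Carlo stand-in for ECI under a bivariate normal posterior (our own brute-force sketch, not the paper's closed form) gives visibly different values at identical marginals once $\rho \neq 0$:

```python
import numpy as np

def eci_mc(mu_y, mu_z, sig_y, sig_z, rho, best, n=200_000, seed=0):
    """Monte Carlo expected constrained improvement: improvement of the
    objective y over `best` (minimization), counted only on draws where
    the correlated constraint z is feasible (z <= 0)."""
    rng = np.random.default_rng(seed)
    cov = [[sig_y**2, rho * sig_y * sig_z],
           [rho * sig_y * sig_z, sig_z**2]]
    y, z = rng.multivariate_normal([mu_y, mu_z], cov, size=n).T
    return float(np.mean(np.maximum(best - y, 0.0) * (z <= 0)))

eci_corr = eci_mc(0.0, 0.0, 1.0, 1.0, rho=-0.8, best=0.0)
eci_ind = eci_mc(0.0, 0.0, 1.0, 1.0, rho=0.0, best=0.0)
```

With $\rho = 0$ the value is simply EI times the feasibility probability (about $0.399 \times 0.5$ here); with $\rho = -0.8$ improvement and feasibility rarely co-occur and the same marginals yield a much smaller value, which is precisely the dependence an independent-GP model would miss.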

3. Handling Dynamics, Nonstationarity, and Robustness

Spatiotemporal Modeling

Dynamic optimization tasks require models that can capture nonstationary objectives, i.e., $f(x, t)$, with time $t$ explicitly encoded as a GP input. By using separable spatiotemporal kernels:

$$K\big((x, t), (x', t')\big) = K_s(x, x') \cdot K_t(t, t')$$

the induced time length-scale informs the planning window for feasible query times (Nyikosa et al., 2018). Acquisition optimization is then restricted to regions where predictions remain informative within the learned temporal horizon.
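A separable kernel of this form is a few lines of NumPy (squared-exponential factors and the length-scales below are our illustrative choices):

```python
import numpy as np

def rbf(a, b, ls):
    """Squared-exponential kernel with length-scale ls; a is (n, d), b is (m, d)."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls**2)

def spatiotemporal_kernel(X, T, X2, T2, ls_x=1.0, ls_t=5.0):
    """Separable kernel K((x,t),(x',t')) = K_s(x,x') * K_t(t,t').
    A long learned temporal length-scale ls_t means the objective drifts
    slowly, so predictions stay informative over a wide planning window;
    a short ls_t shrinks the window of feasible query times."""
    return rbf(X, X2, ls_x) * rbf(T, T2, ls_t)

X = np.random.default_rng(0).normal(size=(4, 2))   # spatial inputs
T = np.arange(4.0)[:, None]                        # query times
K = spatiotemporal_kernel(X, T, X, T)
```

Because the factors multiply, correlation decays whenever either the spatial or the temporal distance grows, which is what lets the learned time length-scale bound the useful prediction horizon.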

Extensions for Robustness

Robust BO incorporates worst-case optimization, for instance through min-max objectives:

$$\min_{\theta \in \Theta} \max_{\zeta \in Z} f(\theta, \zeta)$$

where $\zeta$ models environmental uncertainty. Acquisition functions such as Entropy Search and Knowledge Gradient are redefined to maximize information about the min-max optimum rather than just the overall maximum (Weichert et al., 2021). These modifications directly integrate worst-case and safety considerations into the learning loop.
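The min-max objective itself can be made concrete with a brute-force grid sketch (the quadratic $f$ and the grids are illustrative; robust BO replaces this enumeration with surrogate-based acquisition):

```python
import numpy as np

def minmax_on_grid(f, thetas, zetas):
    """Worst-case optimization on a grid: for each design theta, take the
    worst environment zeta, then pick the theta minimizing that worst case."""
    worst = np.array([max(f(th, z) for z in zetas) for th in thetas])
    i = int(np.argmin(worst))
    return thetas[i], worst[i]

# Design theta should work under either environmental setting zeta = +/- 0.5.
f = lambda th, z: (th - z) ** 2
thetas = np.linspace(-1.0, 1.0, 201)
th_star, val = minmax_on_grid(f, thetas, zetas=[-0.5, 0.5])
```

The robust optimum sits at $\theta = 0$, splitting the difference between the two environments, rather than at either nominal optimum $\theta = \pm 0.5$.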

4. Surrogate-Enriched and Hybrid Bayesian Optimization

Bioprocess engineering and similar domains benefit from hybrid surrogates: rather than using a zero-mean GP, one sets the prior mean $m(\mathbf{x})$ to a mechanistic/empirical model output, with the GP modeling only the residuals:

$$y(\mathbf{x}) = m(\mathbf{x}) + r(\mathbf{x}), \quad r(\mathbf{x}) \sim \mathcal{GP}\big(0, k(\mathbf{x}, \mathbf{x}')\big)$$

This approach enables incorporation of domain knowledge, better uncertainty modeling under data scarcity, and improved extrapolative capabilities (Siska et al., 14 Aug 2025).
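A minimal 1-D sketch of the hybrid posterior mean (the Monod-style mechanistic model and all hyperparameters are hypothetical stand-ins) shows the key behavior: far from data, predictions revert to the mechanistic model instead of to zero.

```python
import numpy as np

def mechanistic_mean(x):
    """Hypothetical domain model m(x), e.g. a Monod-style saturation curve."""
    return 2.0 * x / (0.5 + x)

def hybrid_gp_mean(X, y, Xs, ls=0.3, noise=1e-4):
    """GP fitted to residuals r = y - m(X); predictions add m(Xs) back."""
    def k(a, b):
        return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls**2)
    r = y - mechanistic_mean(X)                   # model the mismatch only
    K = k(X, X) + noise * np.eye(len(X))
    alpha = np.linalg.solve(K, r)
    return mechanistic_mean(Xs) + k(Xs, X) @ alpha

X = np.array([0.2, 0.8, 1.5])
y = mechanistic_mean(X) + 0.05                    # constant model mismatch
pred = hybrid_gp_mean(X, y, np.array([0.5, 5.0]))
```

Near the data (at 0.5) the GP corrects the mechanistic bias; at the extrapolation point (5.0) the kernel weights vanish and the prediction collapses onto $m(\mathbf{x})$, which is the data-scarcity advantage the hybrid construction provides.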

5. Transfer Learning and Multi-Task Extensions

Repeated BO runs across similar but not identical tasks call for transfer learning strategies. The ranking-weighted Gaussian process ensemble (RGPE) (Feurer et al., 2018) aggregates surrogates from historical tasks and the current task into a convex combination:

$$\bar{f}(x \mid \mathcal{D}) = \sum_{i=1}^{t} w_i\, f^i(x \mid \mathcal{D}^i)$$

Weights $w_i$ are determined by bootstrap-estimated ranking loss, ensuring that only high-performing, relevant models contribute. The weight dilution mechanism ensures asymptotic convergence to the standard BO solution even in adversarial (transfer-misleading) settings.

Transfer-driven BO reduces sample complexity in hyperparameter tuning and other sequential experimentation contexts by efficiently exploiting accumulated cross-task experience.
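The ensemble prediction itself is just a convex combination of per-task posterior means; in this sketch the base surrogates are simple callables and the weights are fixed numbers standing in for the bootstrap ranking-loss estimates:

```python
import numpy as np

def rgpe_mean(x, models, weights):
    """RGPE-style prediction: convex combination of per-task surrogate
    means. In the real method the weights come from bootstrap-estimated
    ranking loss over the current task's observations."""
    weights = np.asarray(weights, float)
    assert np.isclose(weights.sum(), 1.0), "weights must form a convex combination"
    return sum(w * m(x) for w, m in zip(weights, models))

# Two historical-task surrogates plus the current-task surrogate.
models = [lambda x: x**2, lambda x: (x - 1) ** 2, lambda x: np.sin(x)]
weights = [0.1, 0.2, 0.7]   # current task dominates once it has enough data
pred = rgpe_mean(0.5, models, weights)
```

Weight dilution corresponds to driving the historical weights toward zero as evidence accumulates, so the ensemble degenerates gracefully to the current-task surrogate alone.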

6. Cost-Aware, Pathwise, and Constrained Sequencing

Many applications exhibit non-uniform costs for function evaluations and for changing input parameters between sequential experiments. Extensions in this direction include:

  • Pathwise Bayesian Optimization / SnAKe (Folch et al., 2022, Folch et al., 2023): Models sequential input changes as a “path”, optimizing both the acquisition function and the cumulative cost, often formalized as a Traveling Salesman Problem over candidate queries.

    $$\min_{\pi} \sum_{t=1}^{T-1} C\big(\pi(t), \pi(t+1)\big)$$

    where $C(\cdot, \cdot)$ denotes the cost of transitioning between input settings, and $\pi$ is a permutation of query points.

  • Pandora’s Box Gittins Index for Cost-Aware Acquisition (Xie et al., 28 Jun 2024): Relates the selection of the next query to the Pandora's Box problem, with the Pandora’s Box Gittins Index (PBGI) acquisition defined by the threshold $g$ satisfying

    $$\text{EI}_{f \mid y_{1:T}}(x; g) = \lambda\, c(x)$$

    where $c(x)$ is the cost at location $x$ and $\lambda$ is a tunable cost-sensitivity parameter.

  • Transition-Constrained BO via MDPs (Folch et al., 13 Feb 2024): Models the feasible next-query set $\mathcal{C}(x_h)$ as a transition constraint, with the BO acquisition function replaced by a policy optimization over an MDP. The policy is optimized in the space of state–action visitation distributions, enabling global, history-aware search strategies that respect the non-arbitrary sequencing limitations of real experiments.
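For a Gaussian posterior the PBGI threshold can be found by root-finding, since expected improvement is strictly decreasing in $g$; the sketch below uses bisection (the closed-form EI for a Gaussian is standard, but the function names and the example numbers are ours):

```python
import math

def normal_ei(mu, sigma, g):
    """Expected improvement of a N(mu, sigma^2) value over threshold g."""
    z = (mu - g) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
    cdf = 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return sigma * pdf + (mu - g) * cdf

def pbgi_index(mu, sigma, cost, lam=1.0, lo=-50.0, hi=50.0):
    """Pandora's Box Gittins index: the threshold g solving
    EI(x; g) = lam * c(x). EI is strictly decreasing in g, so plain
    bisection on a wide bracket suffices."""
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if normal_ei(mu, sigma, mid) > lam * cost:
            lo = mid      # EI still too large: root lies to the right
        else:
            hi = mid
    return 0.5 * (lo + hi)

g = pbgi_index(mu=1.0, sigma=0.5, cost=0.1)
g_cheap = pbgi_index(mu=1.0, sigma=0.5, cost=0.01)
```

Cheaper points receive a higher index (the threshold moves up as $c(x)$ falls), which is how the rule trades off promise against evaluation cost.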

7. Scalability and Algorithmic Efficiency

Bayesian optimization’s complexity, particularly for GP-based surrogates, scales cubically in the number of observations. Extensions have focused on kernel and model approximations such as:

  • Vecchia Approximation (Jimenez et al., 2022): Surrogate GP likelihoods are approximated by conditioning each observation on a small subset of neighbors (size $m \ll n$), inducing sparse Cholesky factors and reducing computational cost to $\mathcal{O}(nm^3)$.
  • Warped GPs and Mini-Batch Training: Warped kernels and mini-batch SGD are employed to further scale model training, with careful neighbor ordering and approximate search ensuring local prediction quality even in high dimensions.

These methods embed efficiently into scalable BO frameworks (e.g., Trust Region Bayesian Optimization, TuRBO), with empirical evidence for improved regret in high-dimensional and noisy settings.
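The conditioning idea behind the Vecchia approximation fits in a short sketch (unit-variance squared-exponential kernel, conditioning sets taken as nearest earlier points in input order; all choices are illustrative):

```python
import numpy as np

def vecchia_loglik(X, y, m=3, ls=1.0, noise=1e-6):
    """Vecchia-style GP log-likelihood: each observation is conditioned on
    at most m earlier neighbors, so every factor needs only an m x m
    solve and the total cost is O(n m^3) instead of the exact O(n^3)."""
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * d2 / ls**2)

    ll = 0.0
    for i in range(len(y)):
        if i == 0:
            mu, var = 0.0, 1.0 + noise
        else:
            d = np.linalg.norm(X[:i] - X[i], axis=1)
            nb = np.argsort(d)[:m]                    # m nearest earlier points
            Knn = k(X[nb], X[nb]) + noise * np.eye(len(nb))
            kin = k(X[i:i + 1], X[nb])[0]
            w = np.linalg.solve(Knn, kin)             # small conditional solve
            mu = w @ y[nb]
            var = 1.0 + noise - w @ kin
        ll += -0.5 * (np.log(2 * np.pi * var) + (y[i] - mu) ** 2 / var)
    return ll
```

When $m \geq n - 1$ the factorization is the exact chain rule of the joint Gaussian, and shrinking $m$ trades likelihood accuracy for the sparse, near-linear cost that scalable BO frameworks exploit.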

Table: Major Classes of Extensions and Representative Approaches

| Extension Type | Representative Technique / Formula | Reference |
| --- | --- | --- |
| Constrained/MO | Extended domination, EHVI, SMC | (Feliot et al., 2015) |
| Correlated Outputs | Bivariate GP, correlated ECI formula | (Li et al., 30 May 2025) |
| Dynamic/Spatiotemporal | Separable GP kernels: $K_s(x,x')\, K_t(t,t')$ | (Nyikosa et al., 2018) |
| Pathwise/Cost-aware | Pathwise SnAKe, PBGI, MDP policy acquisition | (Folch et al., 2022; Xie et al., 28 Jun 2024; Folch et al., 13 Feb 2024) |
| Transfer Learning | RGPE ensemble, ranking loss weighting | (Feurer et al., 2018) |
| Surrogate Scalability | Vecchia GP, mini-batch SGD, input warping | (Jimenez et al., 2022) |

These techniques are applicable across a range of fields (engineering design, bioprocess optimization, machine learning hyperparameter search, robotics), with the selection of extension(s) driven by experimental requirements, cost structure, data complexity, and physical constraints.

Conclusion

Extensions to classical Bayesian optimization have made the methodology robust and broadly applicable in domains facing constraints, multi-objectivity, nonstationarity, dynamic costs, and high-dimensionality. Key innovations include unifying objectives/constraints via domination rules, leveraging surrogate dependence structures, transferring knowledge across tasks, explicitly modeling costs and transition constraints, and integrating scalable surrogate models. These advances not only improve performance under practical restrictions but also provide new computational paradigms (e.g., policy space optimization in MDPs, pathwise planning, ensemble surrogates) that continue to broaden the scope of Bayesian optimization methodologies in real-world applications.