Bayesian Optimization Review
- Bayesian Optimization is a probabilistic strategy for optimizing costly, black-box functions by leveraging surrogate models such as Gaussian processes.
- It employs acquisition functions like Expected Improvement, Probability of Improvement, and Upper Confidence Bound to balance exploration and exploitation.
- Recent advances extend BO to high-dimensional, cost-aware, multi-objective, and preference-based applications, enabling scalable, efficient optimization.
Bayesian Optimization (BO) is an advanced probabilistic approach for global optimization of black-box functions that are expensive or time-consuming to evaluate. It is grounded in the design of sequential decision procedures that leverage surrogate models—most commonly Gaussian processes (GPs)—to balance exploration and exploitation, thereby reducing the number of costly evaluations required to identify an optimum. Recent developments in BO encompass extensions to multi-objective, structured, cost-sensitive, high-dimensional, and preference-based settings, as well as novel integration of expert knowledge and alternative surrogates. This article provides an in-depth technical review of BO, its foundational principles, state-of-the-art methodological advances, and practical considerations in both academic and industrial applications.
1. Mathematical Formulation and Principal Surrogate Models
The canonical BO problem seeks

$$x^\star = \arg\min_{x \in \mathcal{X}} f(x),$$

where $f: \mathcal{X} \to \mathbb{R}$ is an expensive, non-differentiable black-box objective. Evaluations yield $y = f(x) + \varepsilon$ with noise $\varepsilon \sim \mathcal{N}(0, \sigma^2)$.
BO places a Bayesian prior—typically a zero-mean GP—over the unknown $f$: $f \sim \mathcal{GP}(0, k(x, x'))$. Common kernel choices include squared-exponential and Matérn, often with Automatic Relevance Determination (ARD). After observations $\mathcal{D}_n = \{(x_i, y_i)\}_{i=1}^{n}$, the GP posterior at a test point $x$ yields

$$\mu_n(x) = k_n(x)^\top (K_n + \sigma^2 I)^{-1} \mathbf{y}, \qquad \sigma_n^2(x) = k(x, x) - k_n(x)^\top (K_n + \sigma^2 I)^{-1} k_n(x),$$

where $K_n$ is the observed kernel matrix and $k_n(x) = [k(x, x_1), \ldots, k(x, x_n)]^\top$.
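For concreteness, the posterior equations above can be computed in a few lines of NumPy. This is a minimal sketch assuming an isotropic squared-exponential kernel with fixed hyperparameters; the function names are illustrative rather than taken from any particular library.

```python
import numpy as np

def sq_exp_kernel(A, B, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel matrix between row-wise point sets A and B."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return variance * np.exp(-0.5 * sq_dists / lengthscale ** 2)

def gp_posterior(X, y, X_test, noise_var=1e-4):
    """Zero-mean GP posterior mean and variance at X_test (the equations above)."""
    K = sq_exp_kernel(X, X) + noise_var * np.eye(len(X))
    K_s = sq_exp_kernel(X_test, X)
    L = np.linalg.cholesky(K)                       # stable inverse via Cholesky
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mean = K_s @ alpha
    v = np.linalg.solve(L, K_s.T)
    var = sq_exp_kernel(X_test, X_test).diagonal() - (v ** 2).sum(axis=0)
    return mean, var
```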
Surrogate selection is critical: extensions include Bayesian neural networks (BNNs), random forests, and process-based models for combinatorial or categorical domains (Naveiro et al., 19 Jan 2024, Neiswanger et al., 2019).
2. Acquisition Functions and Trade-Offs
Acquisition functions guide sampling by quantifying information gain or expected improvement; the three most common, written here for minimization and sketched in code after the list, are:
- Expected Improvement (EI): $\alpha_{\mathrm{EI}}(x) = \mathbb{E}\big[\max(0, f^{\dagger} - f(x))\big] = (f^{\dagger} - \mu_n(x))\,\Phi(z) + \sigma_n(x)\,\phi(z)$, with $z = (f^{\dagger} - \mu_n(x))/\sigma_n(x)$ and $f^{\dagger}$ the best observed value.
- Probability of Improvement (PI): $\alpha_{\mathrm{PI}}(x) = \Phi\big((f^{\dagger} - \mu_n(x))/\sigma_n(x)\big)$.
- Upper Confidence Bound (UCB): $\alpha_{\mathrm{UCB}}(x) = \mu_n(x) + \sqrt{\beta_t}\,\sigma_n(x)$ for maximization; in the minimization convention used here, the lower confidence bound $\mu_n(x) - \sqrt{\beta_t}\,\sigma_n(x)$ is minimized.
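A minimal sketch of these three closed forms in the minimization convention used here; the function name and the fixed beta value are illustrative.

```python
import numpy as np
from scipy.stats import norm

def acquisition_values(mu, sd, f_best, beta=2.0):
    """Closed-form EI, PI, and LCB for minimization, given the GP posterior
    mean/std at candidate points and the incumbent (lowest) observed value."""
    sd = np.maximum(sd, 1e-12)                # guard against zero predictive std
    z = (f_best - mu) / sd
    ei = (f_best - mu) * norm.cdf(z) + sd * norm.pdf(z)
    pi = norm.cdf(z)
    lcb = mu - np.sqrt(beta) * sd             # minimized; UCB analogue for minimization
    return ei, pi, lcb
```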
These acquisition functions are optimized at each iteration to propose new candidates. Marginalizing GP hyperparameters via fully Bayesian approaches (e.g., MCMC; FBBO) can improve performance, particularly when paired with EI and ARD kernels (Ath et al., 2021).
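The marginalization idea can be sketched as averaging the acquisition over posterior draws of the GP hyperparameters rather than plugging in a single point estimate; the signature below is illustrative and not FBBO's actual interface.

```python
import numpy as np

def marginalized_acquisition(acq_fn, x, hyperparameter_draws):
    """Average an acquisition function over posterior (e.g., MCMC) draws of the GP
    hyperparameters instead of conditioning on a single point estimate."""
    return float(np.mean([acq_fn(x, theta) for theta in hyperparameter_draws]))
```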
Recent advances address cost-aware objectives (Lee et al., 2020), batch sampling, multi-objective scalarizations (Tran et al., 2020), and mutual-information/entropy for preference or binary data (Fauvel et al., 2021).
3. Algorithmic Workflow and Implementation Strategies
The core BO loop follows:
```
1. Initialize dataset D₀ with n₀ samples.
2. For t = n₀+1, …, N:
   a. Fit surrogate model (GP or alternative) to D_{t-1}.
   b. Select x_t = argmax_x α(x | D_{t-1}) using the chosen acquisition.
   c. Evaluate y_t = f(x_t) + noise.
   d. Update D_t = D_{t-1} ∪ {(x_t, y_t)}.
3. Return the x yielding the lowest (best) observed y.
```
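A compact, runnable rendering of this loop under illustrative assumptions: a 1-D toy objective, a scikit-learn GP surrogate, and EI maximized by random candidate search rather than a dedicated inner optimizer.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def f(x):
    """Toy 1-D objective standing in for the expensive black box."""
    return np.sin(3.0 * x) + 0.1 * x ** 2

rng = np.random.default_rng(0)
low, high = -3.0, 3.0
X = rng.uniform(low, high, size=(5, 1))       # n0 initial samples
y = f(X).ravel()

for t in range(20):                           # N - n0 BO iterations
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True).fit(X, y)
    cand = rng.uniform(low, high, size=(2048, 1))
    mu, sd = gp.predict(cand, return_std=True)
    z = (y.min() - mu) / np.maximum(sd, 1e-12)
    ei = (y.min() - mu) * norm.cdf(z) + sd * norm.pdf(z)   # Expected Improvement
    x_next = cand[np.argmax(ei)]              # acquisition maximizer (random search)
    X = np.vstack([X, x_next])
    y = np.append(y, f(x_next))

print("best x:", X[np.argmin(y)].item(), "best y:", y.min())
```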
Hyperparameters (e.g., GP kernel, acquisition type, batch size) are inferred from data or periodically re-optimized. Model fitting and acquisition optimization are computational bottlenecks: exact GP inference scales as $\mathcal{O}(n^3)$ in the number of observations $n$, and the cost of optimizing the acquisition grows with the input dimension $d$.
Key implementation points include parallelization (especially for batch/fidelity extensions), constrained optimization (for feasible regions), surrogate replacement (for discrete, categorical spaces), and multi-fidelity augmentation (Paulson et al., 29 Jan 2024, Neiswanger et al., 2019).
4. Extensions: High-Dimensional, Cost-Aware, Multi-Objective, Preference-Based BO
High-Dimensional BO
Standard GP-BO degrades rapidly as the input dimension grows, due to the curse of dimensionality. Common remedies (one of which is sketched in code after the list) include:
- Low-dimensional embeddings (PCA-BO, KPCA-BO, feature-mapped GP with joint decoder (Moriconi et al., 2019, Antonov et al., 2022)).
- Trust-region local modeling (TuRBO) (Santoni et al., 2023).
- Sparsity-exploring methods (SEBO with L₀ homotopy, multi-objective Pareto-driven frameworks) (Liu et al., 2022).
- Dimension scheduling for parallel, subspace-based updates (Ulmasov et al., 2015).
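As referenced above, a random linear embedding is one of the simplest dimensionality-reduction devices for high-dimensional BO; the sketch below assumes an ambient dimension of 100, an effective dimension of 5, and a unit box, all of which are illustrative.

```python
import numpy as np

# A minimal random-embedding sketch in the spirit of low-dimensional-subspace BO.
D, d = 100, 5                                  # ambient and effective dimensions (assumed)
rng = np.random.default_rng(1)
A = rng.normal(size=(D, d))                    # fixed random projection matrix

def lift(z, low=-1.0, high=1.0):
    """Map a low-dimensional candidate z to the original D-dimensional box by
    projecting through A and clipping to the box bounds."""
    return np.clip(A @ z, low, high)

# A standard BO loop (such as the sketch in Section 3) then searches over z in a
# small box, and the expensive objective is always evaluated at lift(z).
z = rng.uniform(-1.0, 1.0, size=d)
x_high_dim = lift(z)                           # candidate in the 100-D ambient space
```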
Cost-aware BO
Accounting for variable evaluation costs (e.g., time, wall-clock budget, energy), CArBO integrates cost surrogates and decaying cost penalties into the acquisition (Lee et al., 2020). Its key components include a cost-efficient initial design and cost-cooled acquisition scaling for batch selection.
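A hedged sketch of the cost-cooling idea: discount EI by the predicted evaluation cost, with the cost exponent decaying as the budget is consumed. The exact schedule and cost model in Lee et al. (2020) may differ.

```python
import numpy as np

def cost_cooled_acquisition(ei, cost, spent_budget, total_budget):
    """EI discounted by predicted evaluation cost, with the cost exponent decaying
    from 1 to 0 as the budget is consumed (sketch of the cost-cooling idea)."""
    alpha = max(0.0, (total_budget - spent_budget) / total_budget)
    return ei / np.maximum(cost, 1e-12) ** alpha
```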
Preference-based and Discrete Feedback
Preference elicitation and ranking-based surrogates are increasingly deployed in settings where pairwise or ordinal feedback is more reliable or practical than direct measurements (a generic preference-likelihood sketch follows the list below). Notable frameworks:
- Siamese BNNs + Active Learning for expert integration (Huang et al., 2022).
- Poisson Process BO (PoPBO), which models ranks via a nonhomogeneous Poisson process and uses tailored acquisition functions (Rectified LCB, Expected Ranking Improvement), demonstrating superior noise robustness and scalability (Wang et al., 5 Feb 2024).
- Mutual-information–based acquisition for binary/preferential data (Fauvel et al., 2021).
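As noted before the list, a generic probit preference likelihood illustrates how pairwise feedback can drive a latent-utility surrogate; it is a sketch, not the specific model used in any of the frameworks above.

```python
import numpy as np
from scipy.stats import norm

def preference_probability(f_winner, f_loser, comparison_noise=1.0):
    """Probit likelihood that the 'winner' is preferred to the 'loser' under a
    latent utility f (lower is better) observed with Gaussian comparison noise."""
    return norm.cdf((f_loser - f_winner) / (np.sqrt(2.0) * comparison_noise))
```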
Multi-objective and Constrained Optimization
BO for multi-objective problems relies either on scalarizations (e.g., regularized Tchebycheff (Tran et al., 2020)) or on independent surrogate GPs for each objective combined with Pareto-diversity terms. Acquisition functions are extended to optimize hypervolume improvement and Pareto-frontier spread. Hidden and known constraints are modeled via separate feasibility surrogates or probabilistic classifiers.
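A sketch of the scalarization route, using the standard augmented Tchebycheff form; the exact regularization in Tran et al. (2020) may differ.

```python
import numpy as np

def augmented_tchebycheff(F, weights, ideal_point, rho=0.05):
    """Augmented (regularized) Tchebycheff scalarization of an objective matrix
    F with shape (n_points, n_objectives); standard textbook form."""
    diff = weights * np.abs(F - ideal_point)
    return diff.max(axis=1) + rho * diff.sum(axis=1)
```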
5. Practical Surrogate and Acquisition Selection
Table: Surrogate Model & Acquisition Function Preference by Setting
| Scenario | Surrogate Model | Acquisition Function |
|---|---|---|
| Standard continuous | GP, ARD kernel | EI/UCB/PI |
| High-dimensional, low rank | PCA-BO, KPCA-BO | EI in latent z-space |
| Discrete/categorical | Random forest, BNN | MC-based EI/UCB |
| Preference/ranking | Siamese BNN, Poisson | Mutual info, ERI, R-LCB |
| Cost-sensitive evaluation | Dual GP (cost/objective) | Cost-cooled EI, EI per unit cost |
| Multi-objective/constrained | Per-objective GPs, MOGP | Scalarized/composite, hypervolume improvement |
Selection depends on problem structure, available compute resources, regularization and interpretability constraints (e.g., sparsity), and the feedback modality.
6. Empirical Evaluation and Impact
Empirical results across simulated and real-world benchmarks consistently demonstrate:
- Substantial reductions in wall-clock cost, evaluation count, or convergence time by employing cost-aware, high-dimensional, preference-based, and expert-augmented BO (Huang et al., 2022, Lee et al., 2020, Antonov et al., 2022, Wang et al., 5 Feb 2024).
- Robustness to noise and model misspecification by ranking-based surrogates (PoPBO), classifier-based EI estimation (BORE), and preference-based active learning strategies (Wang et al., 5 Feb 2024, Tiao et al., 2021).
- Accelerated Pareto-frontier discovery for multi-objective design applications via multi-GP and scalarized acquisition frameworks (Tran et al., 2020).
- Scalability to hundreds of dimensions using surrogate replacements, embedding, and subspace optimization (Santoni et al., 2023, Ulmasov et al., 2015, Liu et al., 2022).
- Transferable algorithmic principles to domains such as additive manufacturing, hyperparameter tuning, experimental design, and recommendation systems (Zhang et al., 2021, Lee et al., 2020, Liu et al., 2022).
7. Challenges, Limitations, and Ongoing Research Directions
Key technical challenges include:
- Scalability in input dimension, sample count, and surrogate fidelity.
- Surrogate selection and model misspecification, especially as non-Gaussian, nonstationary, or heteroscedastic effects arise.
- Acquisition-optimization overhead for high-dimensional, non-convex, or mixed-variable spaces.
- Active incorporation of expert knowledge or preference feedback without biasing the search, as studied in Siamese BNN architectures (Huang et al., 2022).
- Unified frameworks capable of robustly handling multi-fidelity, multi-objective, constrained, and preference-based settings simultaneously.
Recent work emphasizes the principled integration of full Bayesian hyperparameter marginalization, flexible surrogate architectures, Pareto-/hypervolume-aware acquisitions, and efficient parallel/batch settings. There is growing interest in simulation-based and probabilistic-programming-driven BO, as well as in identifying multiple local optima in multi-modal landscapes (Neiswanger et al., 2019, Mei et al., 2022).
The field is progressing toward theoretically grounded, computationally tractable algorithms that retain sample efficiency and applicability to increasingly complex, noisy, and structurally varied optimization landscapes.