Batch Bayesian Optimization

Updated 19 November 2025
  • Batch Bayesian Optimization is a framework that concurrently selects and evaluates batches of points to optimize expensive, noisy black-box functions.
  • It employs Bayesian surrogates like Gaussian processes and diverse acquisition functions (e.g., ATS, OEI, LP) to balance exploration and diversity.
  • Practical methods include dynamic batch sizing, diversity-promoting heuristics, and rigorous regret analysis to ensure scalable, efficient performance.

Batch Bayesian Optimization (batch BO) is a collection of algorithmic strategies for efficient global optimization of expensive black-box functions, where multiple evaluations can be performed concurrently. The central objective is to select sets ("batches") of points to evaluate in parallel, optimizing both sample efficiency (regret) and total wall-clock time. Batch BO is essential in scientific, engineering, and machine learning settings with parallel experiment or compute resources.

1. Problem Formulation and Core Challenges

The canonical batch BO problem is

$$\min_{x \in \mathcal{X} \subset \mathbb{R}^d} f(x)$$

where $f$ is an expensive, possibly noisy, black-box objective. Evaluation at $x$ returns $y = f(x) + \varepsilon$; data $\mathcal{D}_n = \{(x_i, y_i)\}_{i=1}^n$ accumulates over $n$ function calls.

A Bayesian surrogate, usually a Gaussian process (GP) with prior $f \sim \mathrm{GP}(m(x), k(x, x'))$ and hyperparameters $\theta$, is fit to the data. The GP posterior provides a mean $\mu_n(x \mid \mathcal{D}_n, \theta)$ and variance $\sigma_n^2(x \mid \mathcal{D}_n, \theta)$ at any candidate $x$.

Single-point ("sequential") BO selects $x_{n+1} = \arg\max_x \alpha(x; \theta \mid \mathcal{D}_n)$ for an acquisition function $\alpha$, but this precludes parallelism. Batch BO generalizes this to selecting $M > 1$ points $\{x_{n+1}, \ldots, x_{n+M}\}$ to evaluate in parallel before updating the model.
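
To make this concrete, the following is a minimal sketch of a generic batch BO loop under the definitions above, not the method of any particular paper: a scikit-learn GP surrogate is refit each round, expected improvement is maximized over a fixed candidate grid, and a naive "top-M" rule forms the batch. The toy objective, kernel choice, grid, and noise level are illustrative assumptions.

    # Minimal batch-BO skeleton (illustrative; objective, kernel, and grid are assumptions).
    import numpy as np
    from scipy.stats import norm
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF, ConstantKernel

    rng = np.random.default_rng(0)

    def f(x):                        # toy 1-D stand-in for the expensive black box
        return np.sin(3 * x) + 0.1 * x**2

    def expected_improvement(mu, sigma, y_best):
        # EI for minimization: E[max(y_best - f(x), 0)] under the GP posterior.
        z = (y_best - mu) / np.maximum(sigma, 1e-12)
        return (y_best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

    X = rng.uniform(-3, 3, size=(5, 1))                   # initial design D_n
    y = f(X).ravel() + 0.01 * rng.standard_normal(5)
    cand = np.linspace(-3, 3, 500).reshape(-1, 1)         # candidate grid (a simplification)
    M, n_rounds = 4, 5

    for _ in range(n_rounds):
        gp = GaussianProcessRegressor(ConstantKernel() * RBF(), alpha=1e-4,
                                      normalize_y=True).fit(X, y)
        mu, sigma = gp.predict(cand, return_std=True)
        ei = expected_improvement(mu, sigma, y.min())
        # Naive batch rule: take the M highest-EI candidates. This ignores within-batch
        # redundancy; the strategies in Section 2 exist precisely to do better.
        batch = cand[np.argsort(ei)[-M:]]
        y_batch = f(batch).ravel() + 0.01 * rng.standard_normal(M)   # evaluated "in parallel"
        X, y = np.vstack([X, batch]), np.concatenate([y, y_batch])

    print("best observed value:", y.min())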

Key challenges unique to the batch setting arise:

  • Information staleness: Later points in a batch are chosen without knowledge of the outcomes of earlier points
  • Diversity vs. exploitation: Batch points must balance informativeness (to reduce uncertainty) with exploration of diverse regions, so as to avoid redundant sampling
  • Scalability: High-dimensional domains and large batch sizes render many joint-acquisition criteria computationally intractable

A rich literature has developed techniques to address these challenges via acquisition function design, batch construction heuristics, sampling-based approximations, and regret-theoretic analysis.

2. Batch Acquisition Functions and Constructive Paradigms

Several categories of batch acquisition strategies exist. Each aims to select batches offering high expected gain given the current GP surrogate while maintaining tractable computation.

2.1 Marginalized Acquisition and Acquisition Thompson Sampling

Acquisition Thompson Sampling (ATS) exploits uncertainty in the model hyperparameters to construct diverse batches (Palma et al., 2019). Instead of jointly optimizing a high-dimensional batch acquisition, ATS samples $M$ independent instantiations of the acquisition function, each marginalized over a fresh set of GP hyperparameter samples $\theta_{k,1}, \ldots, \theta_{k,s}$ drawn from $p(\theta \mid \mathcal{D}_n)$. For the $k$-th batch point, a stochastic acquisition $\bar{\alpha}_k(x)$ is computed as

$$\bar{\alpha}_k(x) = \frac{1}{s} \sum_{q=1}^{s} \alpha(x; \theta_{k,q} \mid \mathcal{D}_n)$$

and $x_{n+k} = \arg\max_{x} \bar{\alpha}_k(x)$. This approach generalizes any sequential acquisition (EI, LCB, TS, UCB) to a batch mode with minimal modification, and all points are selected in parallel (Palma et al., 2019).

ATS scales trivially to large batch sizes, requires no model hallucination or penalization machinery, and empirically demonstrates strong performance on both synthetic benchmarks and realistic hyperparameter optimization tasks (e.g., XGBoost with $M = 20$ parallel workers).
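
The core of ATS can be sketched in a few lines. Two simplifications are assumed here: the hyperparameter posterior $p(\theta \mid \mathcal{D}_n)$ is imitated by log-normal jitter of a single length-scale (a faithful implementation would use MCMC or another posterior sampler), and each inner $\arg\max$ is taken over a candidate grid rather than with a continuous optimizer.

    # Acquisition Thompson Sampling (ATS) sketch: one marginalized acquisition per batch point.
    import numpy as np
    from scipy.stats import norm
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF

    rng = np.random.default_rng(1)

    def ei(mu, sigma, y_best):                       # EI for minimization
        z = (y_best - mu) / np.maximum(sigma, 1e-12)
        return (y_best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

    def ats_batch(X, y, cand, M=4, s=8, noise=1e-4):
        batch = []
        for _ in range(M):                           # each batch point gets its own stochastic acquisition
            alpha_bar = np.zeros(len(cand))
            for _ in range(s):
                # Crude stand-in for a draw theta ~ p(theta | D_n): jitter the length-scale.
                ls = float(np.exp(rng.normal(0.0, 0.3)))
                gp = GaussianProcessRegressor(RBF(length_scale=ls), alpha=noise,
                                              optimizer=None,        # keep the sampled hyperparameters fixed
                                              normalize_y=True).fit(X, y)
                mu, sigma = gp.predict(cand, return_std=True)
                alpha_bar += ei(mu, sigma, y.min())
            batch.append(cand[int(np.argmax(alpha_bar / s))])
        return np.array(batch)

Because each batch point sees a differently perturbed acquisition landscape, the $M$ maximizers spread out without any explicit penalization, which is the mechanism the method relies on.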

2.2 Moment-Based and Distributionally Ambiguous Batch Acquisitions

Traditional batch Expected Improvement (EI) computes

$$\alpha_{\mathrm{EI}}(X) = \mathbb{E}_{y \sim \mathcal{N}(\mu, \Sigma)}\left[\max\left(y_{\text{best}} - \min_i y_i,\, 0\right)\right]$$

for a batch $X = \{x_1, \ldots, x_B\}$, but this requires high-dimensional integrals. The Optimistic Expected Improvement (OEI) (Rontsis et al., 2017) replaces the joint Gaussian assumption with a distributionally ambiguous set defined by mean and covariance constraints, yielding a tractable semidefinite program (SDP) whose value upper-bounds the classic batch EI. OEI scales to $B \geq 20$ with modest wall-clock time and gives robust, differentiation-friendly acquisition landscapes, outperforming alternatives for larger batches.
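
For reference, the Gaussian quantity that OEI bounds can be estimated by plain Monte Carlo, as in the sketch below; it assumes a fitted scikit-learn GP surrogate and the minimization convention of Section 1, and it omits the SDP reformulation itself, which requires a conic solver.

    # Monte Carlo estimate of multipoint (batch) EI for a candidate batch X_batch.
    import numpy as np

    def qei_mc(gp, X_batch, y_best, n_samples=4096, seed=0):
        rng = np.random.default_rng(seed)
        mu, cov = gp.predict(X_batch, return_cov=True)         # joint posterior over the batch
        L = np.linalg.cholesky(cov + 1e-9 * np.eye(len(mu)))   # jitter for numerical stability
        z = rng.standard_normal((n_samples, len(mu)))
        y_samp = mu + z @ L.T                                  # draws from N(mu, Sigma)
        improvement = np.maximum(y_best - y_samp.min(axis=1), 0.0)
        return improvement.mean()                              # estimate of alpha_EI(X)

The estimator matches the expression above, $\mathbb{E}[\max(y_{\text{best}} - \min_i y_i, 0)]$; its cost and noisy behavior for larger $B$ are part of what motivates the moment-based OEI relaxation.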

2.3 Parallel Knowledge Gradient

The parallel knowledge gradient (q-KG) (Wu et al., 2016) extends the one-step-lookahead value-of-information approach to batches. It selects $\{x_1, \ldots, x_q\}$ to maximize the expected improvement in the surrogate mean after the batch is observed:

$$q\text{-KG}(x_1, \ldots, x_q) = \mathbb{E}_n\left[\max_x \mu_{n+q}(x)\right] - \max_x \mu_n(x)$$

where $\mu_{n+q}$ is the updated posterior mean after the hypothetical observations. Practical and efficient computation proceeds via Monte Carlo with infinitesimal perturbation analysis (IPA) for gradient estimation. q-KG demonstrates superior batch performance, especially under observation noise (Wu et al., 2016).
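
A hedged sketch of a discrete-domain q-KG estimate follows, written for the minimization convention of Section 1: fantasy outcomes for the batch are drawn from the joint posterior, the GP is refit on each fantasy, and the expected decrease of the posterior-mean minimum over a grid is averaged. It assumes a scikit-learn GP surrogate and omits the paper's IPA gradient estimator and continuous optimization.

    # Monte Carlo estimate of the parallel knowledge gradient on a discrete candidate grid.
    import numpy as np
    from sklearn.base import clone

    def qkg_mc(gp, X_batch, X_grid, X_data, y_data, n_fantasy=64, seed=0):
        rng = np.random.default_rng(seed)
        best_now = gp.predict(X_grid).min()                  # min_x mu_n(x)
        mu_b, cov_b = gp.predict(X_batch, return_cov=True)   # joint posterior at the batch
        L = np.linalg.cholesky(cov_b + 1e-9 * np.eye(len(mu_b)))
        vals = np.empty(n_fantasy)
        for i in range(n_fantasy):
            y_fantasy = mu_b + L @ rng.standard_normal(len(mu_b))   # hypothetical batch outcomes
            gp_i = clone(gp).fit(np.vstack([X_data, X_batch]),
                                 np.concatenate([y_data, y_fantasy]))
            vals[i] = gp_i.predict(X_grid).min()             # min_x mu_{n+q}(x) under this fantasy
        return best_now - vals.mean()                        # expected decrease of the posterior minimum

The refit inside the loop makes this estimator expensive, which is why practical implementations avoid brute-force refitting and use gradient-based optimization of the batch instead.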

2.4 Diversity and Penalization-Based Heuristics

Greedy heuristics iteratively add points, modifying $\alpha(x)$ to prevent clustering. Local Penalization (LP) (González et al., 2015) penalizes the acquisition within a radius $r_j$ of each prior selection, using Lipschitz-constant estimates to approximate exclusion zones. LP builds a batch by maximizing a product of the acquisition and local penalizers, requiring only a single GP retraining per batch and no joint-batch optimization. LP matches or exceeds more involved batch methods in wall-clock speed and regret across low to moderate dimensions (González et al., 2015).
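
The greedy construction can be sketched as follows. The penalizer here is a simplified soft exclusion ball derived from the same Lipschitz argument, not the exact Gaussian-CDF penalizer of the paper, and the Lipschitz constant is estimated crudely from finite differences of the posterior mean over the candidate list; both are assumptions for illustration.

    # Greedy local-penalization batch construction (simplified penalizer).
    import numpy as np

    def lp_batch(acq, cand, gp, y_best, M=4, lipschitz=None):
        # acq: precomputed acquisition values on the candidate grid `cand` (shape (N, d)).
        if lipschitz is None:
            mu = gp.predict(cand)
            diffs = np.abs(mu[1:] - mu[:-1])
            dists = np.linalg.norm(cand[1:] - cand[:-1], axis=1)
            lipschitz = max(1e-8, float(np.max(diffs / np.maximum(dists, 1e-12))))
        penalized = np.array(acq, dtype=float)
        batch = []
        for _ in range(M):
            j = int(np.argmax(penalized))
            x_j = cand[j]
            batch.append(x_j)
            mu_j, sigma_j = gp.predict(x_j.reshape(1, -1), return_std=True)
            # Exclusion radius r_j: roughly how far a ball around x_j could extend and
            # still contain the minimizer, given the Lipschitz bound.
            r_j = (mu_j[0] - y_best + 3.0 * sigma_j[0]) / lipschitz
            d = np.linalg.norm(cand - x_j, axis=1)
            penalized *= np.clip(d / max(float(r_j), 1e-12), 0.0, 1.0)   # shrink acquisition inside the ball
        return np.array(batch)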

3. Batch Construction and Diversity Mechanisms

Many batch BO methods employ explicit diversity-promoting mechanisms to mitigate batch redundancy and over-exploitation.

3.1 Penalization, Hallucination, and Liar-Based Approaches

In LP (González et al., 2015), local penalizers down-weight the acquisition near already-selected points, based on GP-inferred gradients. Kriging-Believer and Constant-Liar methods (Mia et al., 4 Apr 2025) sequentially simulate pseudo-observations for earlier batch points (either setting $y = \mu(x)$ or some fixed value), refitting the GP before each selection, thus inducing diversity.
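
A minimal sketch of the believer/liar loop follows, assuming a fitted scikit-learn GP surrogate, a candidate grid, and an acquisition function acq_fn(mu, sigma, y_best); these names are placeholders rather than any package's API.

    # Kriging-Believer / Constant-Liar batch construction: hallucinate outcomes for points
    # already in the batch, refit, and choose the next point against the hallucinated model.
    import numpy as np
    from sklearn.base import clone

    def believer_batch(gp, X_data, y_data, cand, acq_fn, M=4, lie=None):
        X_aug, y_aug = X_data.copy(), y_data.copy()
        gp_cur, batch = gp, []
        for _ in range(M):
            mu, sigma = gp_cur.predict(cand, return_std=True)
            x_next = cand[int(np.argmax(acq_fn(mu, sigma, y_aug.min())))]
            batch.append(x_next)
            # Kriging Believer uses the posterior mean as the pseudo-observation;
            # Constant Liar plugs in a fixed value (e.g., the current best) instead.
            pseudo_y = gp_cur.predict(x_next.reshape(1, -1))[0] if lie is None else lie
            X_aug = np.vstack([X_aug, x_next.reshape(1, -1)])
            y_aug = np.append(y_aug, pseudo_y)
            gp_cur = clone(gp).fit(X_aug, y_aug)     # refit on real + hallucinated data
        return np.array(batch)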

3.2 Determinantal Point Processes

Combinatorial settings and high-dimensional additive models leverage determinantal point processes (DPPs) (Wang et al., 2017, Oh et al., 2021). Given a kernel $L$, the probability of selecting a batch $X$ is proportional to $\det(L_X)$. DPPs encourage spatially diverse batches. In combinatorial BO tasks (e.g., permutations), the Acquisition Weighted kernel (Oh et al., 2021) $L(x, x') = w(\alpha(x))\, K(x, x')\, w(\alpha(x'))$ fuses acquisition value and diversity, enabling near-optimal batch selection with guarantees (Oh et al., 2021).
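
A hedged sketch of DPP-style batch selection with an acquisition-weighted kernel follows; it substitutes greedy log-determinant maximization for exact DPP sampling, and the RBF similarity kernel and exponential weighting $w(\cdot)$ are illustrative choices rather than the cited construction.

    # Greedy MAP selection from a DPP with kernel L(x, x') = w(a(x)) K(x, x') w(a(x')).
    import numpy as np

    def dpp_batch(cand, acq, M=4, length_scale=0.5):
        sq = np.sum((cand[:, None, :] - cand[None, :, :]) ** 2, axis=-1)
        K = np.exp(-0.5 * sq / length_scale**2)          # similarity kernel on the candidates
        w = np.exp(acq - acq.max())                      # positive weights increasing in acquisition
        L = w[:, None] * K * w[None, :]                  # acquisition-weighted DPP kernel
        selected = []
        for _ in range(M):
            best_j, best_logdet = None, -np.inf
            for j in range(len(cand)):
                if j in selected:
                    continue
                idx = selected + [j]
                sign, logdet = np.linalg.slogdet(L[np.ix_(idx, idx)] + 1e-9 * np.eye(len(idx)))
                if sign > 0 and logdet > best_logdet:    # pick the point with the largest det gain
                    best_j, best_logdet = j, logdet
            selected.append(best_j)
        return cand[selected]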

3.3 Particle Gradient Flows and Measure Optimization

Recent approaches recast batch acquisition maximization as optimization in the space of probability measures. By interpreting multipoint EI as a concave functional of the empirical measure and performing Wasserstein-gradient flow ascent in measure space, a particle-based algorithm is obtained that jointly encourages high-utility and diverse samples (Crovini et al., 2022). Asymptotic convexity on the measure space sidesteps combinatorial intractability for large batches, reliably yielding diverse proposals.

4. Batch Size, Dynamic and Adaptive Schemes

While basic batch BO presumes a fixed batch size, several works address optimal batch size determination and dynamic adjustment.

  • Dynamic Batch BO (Azimi et al., 2011, Azimi et al., 2012) assesses the mutual independence of acquisition candidates under the current GP. Points are only added to the batch if their selection would persist even after simulating the outcomes of others. This yields adaptive batch sizes with near-sequential regret and substantial wall-clock speedup (up to 78%) (Azimi et al., 2012).
  • The Budgeted Batch Bayesian Optimization (B3O) (Nguyen et al., 2017) fits an infinite Gaussian mixture model to the acquisition surface, interpreting the number of mixture components as the appropriate batch size per step and directly tuning parallel resource consumption to acquisition multimodality (a sketch of this step appears after this list).
  • Minimal Terminal Variance (MTV) (Ren et al., 27 Apr 2024) formulates batch acquisition based on experimental design—selecting batches that minimize expected posterior variance at likely-optimal locations, as inferred from the surrogate.
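
The B3O-style batch-sizing step referenced above can be sketched as follows; the acquisition-proportional resampling, the component-weight cutoff, and the use of component means as the batch are simplifications for illustration, not the exact published procedure.

    # Adaptive batch sizing via a Dirichlet-process Gaussian mixture fit to the acquisition surface.
    import numpy as np
    from sklearn.mixture import BayesianGaussianMixture

    def b3o_batch(cand, acq, max_components=10, n_samples=2000, weight_cutoff=0.02, seed=0):
        rng = np.random.default_rng(seed)
        p = np.maximum(acq - acq.min(), 0.0) + 1e-12
        samples = cand[rng.choice(len(cand), size=n_samples, p=p / p.sum())]
        gmm = BayesianGaussianMixture(n_components=max_components,
                                      weight_concentration_prior_type="dirichlet_process",
                                      random_state=seed).fit(samples)
        keep = gmm.weights_ > weight_cutoff              # effectively occupied mixture components
        return gmm.means_[keep]                          # batch = component means; batch size = keep.sum()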

This flexibility in batch size enables BO to navigate the trade-off between computational workloads, evaluation costs, and statistical efficiency.

5. High-Dimensional, Less Expensive, and Nonstandard Domains

Batch BO faces additional statistical and computational challenges in high-dimensional, cheap-to-evaluate, or non-Gaussian settings.

  • KMBBO (Groves et al., 2018) employs batch slice sampling followed by K-Means clustering on the acquisition surface to identify peaks efficiently (a sketch of this sampling-and-clustering step appears after this list). For $d \gg 10$, compressed-sensing projections are used to perform BO in a lower-dimensional subspace, with empirical success on benchmarks up to 167 dimensions.
  • Structural kernel methods (Wang et al., 2017) assume latent additive decompositions, inferring structure via Gibbs sampling and constructing batches within low-dimensional subspaces using DPPs.
  • UCB-DE (Nguyen et al., 2018) targets the cheap-evaluation regime, where acquisition-function optimization dominates total time, splitting each batch into one UCB point and $q - 1$ points chosen to be maximally distant from the observed data (Sobol-guided exploration). This approach is 3–6× faster than standard batch BO when function evaluations are cheap.
  • For combinatorial and permutation spaces, determinantal point process–based and acquisition–weighted batch methods offer sublinear regret and empirical superiority (Oh et al., 2021).
  • Density-ratio estimation frameworks (Oliveira et al., 2022) (BORE++) decouple batch BO from explicit Gaussian-process priors, enabling scalable classifier-based BO with parallelized batch queries and regret guarantees, particularly advantageous in high data or high-dimensional regimes.
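
The KMBBO-style sampling-and-clustering step referenced above can be sketched as follows; importance resampling over a candidate grid stands in for the paper's slice sampler, and K-Means centroids are returned directly as the batch.

    # KMBBO-style batch: sample where the acquisition is high, then cluster.
    import numpy as np
    from sklearn.cluster import KMeans

    def kmbbo_batch(cand, acq, M=4, n_samples=2000, seed=0):
        rng = np.random.default_rng(seed)
        p = np.maximum(acq - acq.min(), 0.0) + 1e-12
        samples = cand[rng.choice(len(cand), size=n_samples, p=p / p.sum())]
        km = KMeans(n_clusters=M, n_init=10, random_state=seed).fit(samples)
        return km.cluster_centers_                       # M diverse, high-acquisition batch points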

6. Regret Analysis and Empirical Performance

Mature batch BO methods are now accompanied by frequentist and Bayesian regret guarantees, quantifying their efficiency relative to the sequential optimal policy.

  • Improved batch UCB (IGP-BUCB) and batch Thompson sampling (GP-BTS) (Chowdhury et al., 2019) offer frequentist cumulative regret bounds that degrade only mildly (polylogarithmically) with batch size, provided initial rounds perform sufficient pure exploration.
  • TS-RSR (Ren et al., 7 Mar 2024) introduces a parameter-free, regret-to-uncertainty-ratio acquisition for batch selection, attaining optimal $\sqrt{T \gamma_T}$-style regret rates (where $\gamma_T$ is the maximum information gain). It penalizes batch redundancy via conditional variances, outperforming standard batch UCB, qEI, and other recent strategies in both synthetic and real tasks.
  • Empirical studies across numerous benchmarks (Ackley, Hartmann, Branin, SVR tuning, PPO hyperparameter optimization) consistently show that state-of-the-art batch BO methods (e.g., ATS, OEI, LP, TS-RSR, q-KG) achieve competitive or superior regret for moderate batch sizes ($M = 5$–$20$) with near-ideal wall-clock speedups (Palma et al., 2019, Rontsis et al., 2017, Ren et al., 7 Mar 2024, González et al., 2015).

7. Practical Considerations and Integration

Most modern batch BO methods are modular and can be integrated with standard GP toolkits. Hyperparameters (e.g., exploration weights, batch size, Lipschitz constant estimation, acquisition transformation) influence the statistical efficiency–speedup trade-off. In high-dimensional or combinatorial spaces, methods based on slice sampling, random subspaces (Zhan et al., 25 Nov 2024), compressed sensing, or DPP-based combinatorial selection are most tractable.

Dynamic or pipelined batch BO (e.g., PipeBO for multi-stage experiments (Taguchi et al., 5 Dec 2024)) is especially effective for resource-constrained or asynchronous experimental settings, reducing wall-clock optimization time by a factor roughly equal to the number of pipeline stages.

In summary, batch Bayesian optimization provides a spectrum of tractable, theoretically justified, and empirically robust algorithms for parallel global optimization of expensive black-box functions, with techniques tailored to the specifics of batch size, problem dimension, landscape structure, scalability, and resource constraints. Contemporary research continues to refine regret analyses, extend applicability to non-Gaussian and non-Euclidean domains, and address parallelization in both hardware and experiment-limited environments (Palma et al., 2019, Chowdhury et al., 2019, Ren et al., 7 Mar 2024, González et al., 2015, Nguyen et al., 2017).
