
Knowledge Gradient Acquisition Functions

Updated 27 December 2025
  • Knowledge Gradient acquisition functions compute the expected one-step improvement in the posterior estimate of the optimum, balancing exploration and exploitation.
  • They employ computational strategies like closed-form, Monte Carlo, and one-shot hybrid methods to handle discrete, continuous, and high-dimensional domains.
  • The approach extends to constrained, multi-objective, and preferential settings, achieving robust sampling efficiency and near-optimal performance across various applications.

The knowledge gradient (KG) family of acquisition functions plays a central role in sequential and batch optimization of expensive black-box objectives, offering a decision-theoretic paradigm for information-directed sampling in Bayesian optimization, ranking & selection, and related optimal learning domains. Rooted in one-step Bayesian value-of-information analysis, the KG provides a principled mechanism to select the next experiment by maximizing the expected improvement—measured directly in terms of the estimated optimum—after conditioning on prospective new data. Its generality admits rigorous extension to batch, noisy, constrained, multi-objective, high-dimensional, and bandit settings.

1. Definition and Underlying Principle

The core knowledge gradient acquisition computes the expected one-step improvement in the posterior estimate of the global optimum, were a candidate experiment (or batch) to be performed. In its canonical form for black-box minimization under a Gaussian process (GP) surrogate, the sequential (q=1) KG is: $\mathrm{KG}(x) = \mathbb{E}_n \left[ \min_{x'\in\Theta} \mu^{(n)}(x') - \min_{x'\in\Theta} \mu^{(n+1)}(x') \mid \text{sample at } x \right]$ where $\mu^{(n)}$ denotes the GP predictive mean at stage $n$ and $\Theta$ is the experimental design domain. For maximization problems, $\min$ is replaced by $\max$.
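The expectation above can be estimated by simple Monte Carlo on a finite grid: draw a fantasy observation at the candidate point, apply the standard rank-one GP mean update, and average the resulting decrement of the posterior minimum. The sketch below is illustrative only (the function name and grid are not taken from the cited papers); production implementations add common random numbers and gradient-based optimization.

```python
import numpy as np

def kg_monte_carlo(mu, K, noise_var, i, n_samples=2000, rng=None):
    """Monte Carlo estimate of the sequential (q=1) KG for sampling
    candidate index i, given a GP posterior over a finite grid with
    mean vector mu and covariance matrix K (minimization convention)."""
    rng = np.random.default_rng(rng)
    s2 = K[i, i] + noise_var                 # predictive variance of the observation
    sigma_tilde = K[:, i] / np.sqrt(s2)      # posterior-mean change per unit z-score
    z = rng.standard_normal(n_samples)
    # Rank-one GP mean update for each fantasy observation y = mu[i] + sqrt(s2)*z:
    mu_new = mu[None, :] + z[:, None] * sigma_tilde[None, :]
    # KG(i) = E[ min_x mu^(n)(x) - min_x mu^(n+1)(x) ]
    return mu.min() - mu_new.min(axis=1).mean()
```

By Jensen's inequality the true KG is nonnegative, so a persistently negative Monte Carlo estimate signals an insufficient sample size.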

Batch settings generalize the sequential form to a $q$-vector of points $z^{(1:q)}$: $\mathrm{KG}(z^{(1:q)}) = \mathbb{E}_n\left[ \min_{x\in\Theta} \mu^{(n)}(x) - \min_{x\in\Theta} \mu^{(n+q)}(x) \,\big|\, z^{(1:q)} \right]$ This expected decrement encapsulates both exploitation (local improvement) and exploration (uncertainty reduction affecting future decisions). In discrete settings (e.g., ranking & selection), it reduces to the expected improvement in the posterior best among arms.
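The same fantasy-update idea extends to batches: the joint rank-$q$ update of the posterior mean can be written as $\mu + A z$ with $z$ a standard normal $q$-vector, so a Monte Carlo estimate of the batch KG needs only one Cholesky factorization per candidate batch. A hedged sketch follows (function name illustrative; the IPA gradient machinery of Wu et al. (2016) is omitted):

```python
import numpy as np

def qkg_monte_carlo(mu, K, noise_var, batch, n_samples=5000, rng=None):
    """Monte Carlo estimate of the batch KG(z^(1:q)) for a GP posterior
    (mean mu, covariance K) on a finite grid; `batch` holds the q
    candidate indices (minimization convention)."""
    rng = np.random.default_rng(rng)
    batch = np.asarray(batch)
    S = K[np.ix_(batch, batch)] + noise_var * np.eye(len(batch))
    L = np.linalg.cholesky(S)               # fantasy y = mu[batch] + L @ z
    A = np.linalg.solve(L, K[batch, :]).T   # joint rank-q mean update: mu + A @ z
    z = rng.standard_normal((n_samples, len(batch)))
    mu_new = mu[None, :] + z @ A.T
    return mu.min() - mu_new.min(axis=1).mean()
```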

For noisy, constrained, multi-objective, or preference-based feedback, KG is adapted by incorporating additional terms or reparameterizing the underlying value function (Wang et al., 2016, Wu et al., 2016, Buckingham et al., 2023, Chen et al., 2021, Astudillo et al., 2023).

2. Mathematical Formulations and Computational Approaches

The computation of KG acquisition values is nontrivial due to the nested maximization/minimization under a future posterior, which itself depends on random observations. Key practical approaches include:

  • Finite discrete domains: For a finite set $\mathcal{X}$ of alternatives, closed-form KG expressions (involving Gaussian CDF/PDF) can be derived, yielding efficient $O(M\log M)$ algorithms for $M=|\mathcal{X}|$ (Wang et al., 2016).
  • Continuous or high-dimensional domains: For continuous inputs, KG is typically approximated using:
    • Grid discretization (epigraph methods)
    • Monte Carlo Sampling (nested optimization per sample)
    • Hybrid approaches (analytic outer, MC inner)
    • One-Shot joint optimization and automatic differentiation for scalable, low-variance estimation (Ungredda et al., 2022).
  • Batch/Parallel KG: The parallel KG (qKG) leverages joint GP update formulas to model the posterior mean after an entire batch, and employs Monte Carlo with Infinitesimal Perturbation Analysis (IPA) to efficiently estimate both value and gradients for batch construction (Wu et al., 2016).
  • Gradient-based optimization: Unbiased stochastic gradients for KG acquisitions are systematically derived via IPA (for the improvement component) and, where constraints are present, by likelihood-ratio (LR) methods for feasibility terms (Chen et al., 2021).
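For intuition, the closed-form discrete case is easiest to state for independent normal beliefs (the classical ranking & selection setting): each arm's KG is $\tilde\sigma_i \, f(-|\mu_i - \max_{j\neq i}\mu_j|/\tilde\sigma_i)$ with $f(z)=z\Phi(z)+\varphi(z)$. The sketch below assumes strictly positive posterior variances; function and variable names are illustrative.

```python
import math

def kg_independent_arms(mu, var, noise_var):
    """Closed-form KG for ranking & selection with independent normal
    beliefs (maximization convention): mu and var are per-arm posterior
    means and variances (var must be strictly positive)."""
    def f(z):
        phi = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)  # standard normal pdf
        Phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2)))         # standard normal cdf
        return z * Phi + phi
    kg = []
    for i in range(len(mu)):
        sigma_tilde = var[i] / math.sqrt(var[i] + noise_var)   # scale of the mean update
        best_other = max(mu[j] for j in range(len(mu)) if j != i)
        kg.append(sigma_tilde * f(-abs(mu[i] - best_other) / sigma_tilde))
    return kg
```

Note how, at equal variance, arms whose means lie closest to the incumbent best receive the largest KG values.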

A summary of computational trade-offs is presented below:

KG Variant              | Main Computational Strategy             | Applicability
------------------------|-----------------------------------------|----------------------------------------
Closed-form (discrete)  | Analytic, grid enumeration              | Finite/low-dimensional domains
Monte Carlo             | Sampling, inner optimizations           | Higher dimensions, general domains
One-Shot Hybrid         | Joint variable optimization, autodiff   | Moderate-to-high D (BoTorch/PyTorch)
Batch / IPA Gradient    | Monte Carlo + IPA for gradients         | Batch/parallel GP Bayesian optimization

3. Generalizations: Constraints, Multi-Objectives, Preferences, and Structure

Knowledge gradient adapts flexibly to complex Bayesian optimization settings:

  • Constrained Bayesian Optimization (c-KG): c-KG combines the exponential of the expected optimality gain (the future minimum of the posterior objective mean) with a product of posterior feasibility probabilities, directly steering batch selection toward feasible, high-value regions. Its stochastic-gradient implementation leverages both IPA and LR techniques, and it empirically demonstrates superior efficiency over unconstrained KG on noisy constrained testbeds (Chen et al., 2021).
  • Multi-objective Optimization: Scalarization-based KG incorporates random linear scalarizations and cost-weighting for decoupled objectives (cost-aware MOKG). The acquisition is defined as the expected improvement in the best predicted scalarized mean across the objectives, divided by the cost per objective to balance information gain and resource use. Analytical epigraph calculations are used on discretized domains (Buckingham et al., 2023).
  • Preference/Pairwise Feedback (Preferential BO): In the preferential setting, qEUBO (expected utility of the best option) is shown to be equivalent to batch knowledge gradient under noiseless feedback, and retains a one-step Bayes optimality guarantee. Its regret rate of $o(1/n)$ notably improves over qEI for minimizing Bayesian simple regret (Astudillo et al., 2023).
  • Sparsity and Structure: Extensions to high-dimensional sparse additive models (KGSpLin/KGSpAM) integrate Bayesian value-of-information calculus with group Lasso for structure recovery, yielding policies that learn both relevant features and functional form with tight finite-sample bounds (Li et al., 2015).
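The scalarization step in the multi-objective case can be sketched independently of the full acquisition: draw random linear weights on the simplex and track the best scalarized posterior mean; MOKG is then the expected increase of this quantity after a candidate evaluation, divided by evaluation cost. A minimal sketch (names illustrative, not taken from Buckingham et al., 2023):

```python
import numpy as np

def scalarized_best_mean(mean_matrix, n_weights=1000, rng=None):
    """Expected best linearly-scalarized posterior mean, E_w[max_x w . mu(x)],
    with weights drawn uniformly from the probability simplex; mean_matrix
    has shape (n_points, n_objectives) and holds per-objective posterior means."""
    rng = np.random.default_rng(rng)
    n_obj = mean_matrix.shape[1]
    # Uniform sampling on the simplex via normalized exponential variates
    w = rng.exponential(size=(n_weights, n_obj))
    w /= w.sum(axis=1, keepdims=True)
    scalarized = w @ mean_matrix.T           # shape (n_weights, n_points)
    return scalarized.max(axis=1).mean()
```

With a single objective this reduces to the plain posterior-best value, and Pareto-dominated points never contribute to the maximum.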

4. Theoretical Guarantees and Optimality

The knowledge gradient methodology reflects the one-step Bayes-optimal rule for maximizing the expected value of information, with key theoretical results including:

  • Finite-time bounds and submodularity: Under adaptive submodularity of the value of information (VOI) objective, KG achieves at least a constant $(1-1/e)\approx 0.632$ fraction of the best achievable VOI in finite budget, via greedy policy analysis (Wang et al., 2016). For the two-arm case, submodularity holds exactly; for larger or correlated arms, sufficient conditions and numerical checks are available.
  • Consistency and convergence: For GP-based KG in differentiable domains, if the acquisition is nonnegative and appropriately maximized each iteration, the sequence of sampled points is dense and the corresponding recommended optimum converges to the global maximizer almost surely and in mean (Ungredda et al., 2022, Buckingham et al., 2023).
  • Asymptotic optimality in ranking & selection: Standard KG is not always asymptotically optimal in fixed-budget best-arm identification; improved variants (iKG) that target expected gain in probability of correct selection (PCS) achieve the exponential rate prescribed by optimal allocation rules (OCBA systems) (Le et al., 2023).
  • Bayesian simple regret decay: In the preference BO setting, KG-based qEUBO achieves simple regret $o(1/n)$, whereas qEI-type acquisitions may stall with strictly positive regret (Astudillo et al., 2023).
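The greedy $(1-1/e)$ guarantee invoked above is the standard submodular-maximization result. A toy coverage function (a canonical submodular function, used here only as a stand-in for VOI) makes the mechanics concrete:

```python
import itertools

def coverage(sets, chosen):
    """Value of a coverage function: number of elements covered by the
    chosen sets (coverage is submodular: marginal gains shrink)."""
    covered = set()
    for i in chosen:
        covered |= sets[i]
    return len(covered)

def greedy(sets, budget):
    """Greedy selection: at each step pick the set with the largest
    marginal gain, mirroring how KG greedily maximizes one-step VOI."""
    chosen = []
    for _ in range(budget):
        best = max((i for i in range(len(sets)) if i not in chosen),
                   key=lambda i: coverage(sets, chosen + [i]))
        chosen.append(best)
    return chosen

sets = [{1, 2, 3}, {3, 4}, {4, 5, 6}, {1, 6}]
budget = 2
greedy_val = coverage(sets, greedy(sets, budget))
best_val = max(coverage(sets, list(c))
               for c in itertools.combinations(range(len(sets)), budget))
# The greedy value is guaranteed to be at least (1 - 1/e) of best_val.
```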

5. Empirical Performance and Application Regimes

Comprehensive empirical studies corroborate that knowledge gradient policies deliver:

  • Near-equivalent or superior sample efficiency to Expected Improvement (EI) and UCB on unimodal and multi-modal synthetic functions and challenging real-world engineering testbeds, especially in higher dimensions and under noise (Herten et al., 2016, Wu et al., 2016, Ungredda et al., 2022).
  • Robustness to model misspecification and improved hyperparameter calibration when combined with fully Bayesian (slice sampling) surrogates for kernel parameters (Herten et al., 2016).
  • Effective handling of parallel/batch evaluations without pathological oversampling or excessive exploitation, outperforming "constant liar" and GP-BUCB heuristics (Wu et al., 2016).
  • Manageable computational overhead across a range of KG variants, achieved by leveraging analytic epigraph techniques, joint optimization, and modern automatic differentiation frameworks (PyTorch/BoTorch) (Ungredda et al., 2022).

A selection of reported benchmark results appears below:

Application               | Acquisition         | Comparative Result
--------------------------|---------------------|---------------------------------------------------
2D Eggholder              | KGCP, EI, UCB       | KGCP converges 2–3x faster than EI
10D Truss Dynamics        | KGCP (SS), EI       | KGCP-SS continues improving when EI stalls
Hyperparam tuning (q=4)   | qKG, pEI, GP-BUCB   | qKG finds better minima more rapidly
High-dim sparse BO        | KGSpLin, KGLin      | KGSpLin achieves lower OC, better support recovery

6. Implementation and Practical Considerations

Implemented variants of the KG acquisition function and its extensions are available in open-source platforms (e.g., BoTorch), supporting both sequential and batch (parallel) modes with automatic differentiation and efficient analytic routines for the epigraph computation (Ungredda et al., 2022). For moderate dimensions ($D \leq 10$), one-shot hybrid KG with a small discretization size ($n_z \sim 5$–$10$) achieves near-optimal performance with minimal computation per acquisition step.

The choice of acquisition variant, discretization granularity, and batch size depends on problem dimensionality, observation noise, and computational budget.

7. Extensions, Limitations, and Future Directions

Active research continues in expanding the KG framework to:

  • Robust constrained or multi-fidelity Bayesian optimization;
  • Further generalizations to non-Gaussian and heteroskedastic observation models;
  • Preference learning with complex user models or composite feedback;
  • Adaptive tuning of exploration versus exploitation via dynamic submodularity checks;
  • Theoretical analysis of regret and large deviations under broader classes of utility.

Important limitations include the curse of dimensionality in analytic KG approaches (a motivation for hybrid/one-shot strategies), possibly myopic behavior under pathological priors (necessitating improved designs such as iKG), and computational scaling challenges in very large batch or ultra-high-dimensional settings.


References:

  • "A New Knowledge Gradient-based Method for Constrained Bayesian Optimization" (Chen et al., 2021)
  • "Fast Calculation of the Knowledge Gradient for Optimization of Deterministic Engineering Simulations" (Herten et al., 2016)
  • "The Parallel Knowledge Gradient Method for Batch Bayesian Optimization" (Wu et al., 2016)
  • "Efficient computation of the Knowledge Gradient for Bayesian Optimization" (Ungredda et al., 2022)
  • "Finite-time Analysis for the Knowledge-Gradient Policy" (Wang et al., 2016)
  • "Knowledge Gradient for Multi-Objective Bayesian Optimization with Decoupled Evaluations" (Buckingham et al., 2023)
  • "qEUBO: A Decision-Theoretic Acquisition Function for Preferential Bayesian Optimization" (Astudillo et al., 2023)
  • "The Knowledge Gradient Policy Using A Sparse Additive Belief Model" (Li et al., 2015)
  • "Improving the Knowledge Gradient Algorithm" (Le et al., 2023)
  • "Optimal Learning for Sequential Decision Making for Expensive Cost Functions with Stochastic Binary Feedbacks" (Wang et al., 2017)
