A Tutorial on Bayesian Optimization by Peter I. Frazier
The paper "A Tutorial on Bayesian Optimization" by Peter I. Frazier offers a comprehensive overview of Bayesian Optimization (BayesOpt), a machine learning-based optimization technique particularly useful for optimizing expensive-to-evaluate functions in continuous domains with dimensionality less than 20. The paper provides a detailed explanation of key concepts, methodologies, and practical considerations, complemented by insightful discussions on advanced topics and emerging research directions in the field.
Bayesian Optimization Overview
Bayesian Optimization is designed to solve optimization problems of the form:

$$\max_{x \in A} f(x),$$

where the objective function $f$ is continuous, lacks known structure (such as concavity or linearity), and is expensive to evaluate. Typically, the feasible set $A$ resides in a low-dimensional space ($A \subseteq \mathbb{R}^d$ with $d \le 20$), and evaluating $f(x)$ might take minutes or hours. BayesOpt is highly suitable for "black-box" derivative-free global optimization and is notable for its versatility, making it applicable in various domains like engineering design, materials science, drug discovery, environmental model calibration, and hyperparameter tuning in machine learning, particularly for deep neural networks.
Key Components
Bayesian Optimization employs two main components:
- Statistical Model: Commonly, Gaussian Process (GP) regression is used to model the objective function $f$.
- Acquisition Function: This function determines the next point to evaluate by balancing exploration and exploitation. Notable acquisition functions include Expected Improvement (EI), Knowledge Gradient (KG), and Entropy Search (ES).
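These two components can be combined into a minimal BayesOpt loop. The sketch below is illustrative, not the paper's implementation: it assumes a 1-D toy objective, a zero-mean GP with a fixed-length-scale RBF kernel, Expected Improvement as the acquisition function, and acquisition maximization over a candidate grid.

```python
import numpy as np
from scipy.stats import norm

def rbf(A, B, ls=0.3):
    """Squared-exponential kernel matrix between 1-D point sets A and B."""
    d = A[:, None] - B[None, :]
    return np.exp(-0.5 * (d / ls) ** 2)

def gp_posterior(Xobs, yobs, Xq, jitter=1e-6):
    """Zero-mean GP posterior mean and std at query points Xq."""
    K = rbf(Xobs, Xobs) + jitter * np.eye(len(Xobs))
    Ks = rbf(Xq, Xobs)
    Kinv = np.linalg.inv(K)
    mu = Ks @ Kinv @ yobs
    var = 1.0 - np.sum(Ks @ Kinv * Ks, axis=1)   # prior variance is 1
    return mu, np.sqrt(np.clip(var, 1e-12, None))

def expected_improvement(mu, sigma, best):
    """Closed-form EI for a Gaussian posterior, noise-free best value."""
    z = (mu - best) / sigma
    return (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)

f = lambda x: -np.sin(3 * x) - x**2 + 0.7 * x   # toy objective (maximized)
Xq = np.linspace(-1.0, 2.0, 200)                # candidate grid
X = np.array([-0.5, 1.5])                       # initial design
y = f(X)
for _ in range(10):                             # sequential BayesOpt loop
    mu, sd = gp_posterior(X, y, Xq)
    ei = expected_improvement(mu, sd, y.max())
    x_next = Xq[np.argmax(ei)]                  # next point: maximize EI
    X, y = np.append(X, x_next), np.append(y, f(x_next))
best = y.max()
```

On this toy problem the loop concentrates evaluations near the global maximum (about 0.50 at $x \approx -0.36$) after a handful of exploratory samples.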
Gaussian Process (GP) Regression
GP regression is a Bayesian method for modeling functions using a mean function $\mu_0$ and a covariance function (kernel) $\Sigma_0$. The GP provides a posterior distribution over $f$ after observing data, incorporating both the mean and uncertainty at each point. This feature is crucial for BayesOpt, allowing it to predict not just the function value but also the uncertainty in unexplored regions.
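Concretely, in the tutorial's notation, after noise-free observations $f(x_1), \ldots, f(x_n)$ the posterior at a point $x$ is normal, with mean and variance given by the standard Gaussian conditioning formulas:

$$\mu_n(x) = \Sigma_0(x, x_{1:n})\,\Sigma_0(x_{1:n}, x_{1:n})^{-1}\bigl(f(x_{1:n}) - \mu_0(x_{1:n})\bigr) + \mu_0(x)$$

$$\sigma_n^2(x) = \Sigma_0(x, x) - \Sigma_0(x, x_{1:n})\,\Sigma_0(x_{1:n}, x_{1:n})^{-1}\,\Sigma_0(x_{1:n}, x)$$

The posterior variance $\sigma_n^2(x)$ shrinks to zero at observed points and grows with distance from the data, which is exactly the uncertainty signal the acquisition functions below exploit.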
Acquisition Functions
Expected Improvement (EI) quantifies the expected gain from evaluating a candidate point $x$ and is most effective for noise-free evaluations. EI is defined as:

$$\mathrm{EI}_n(x) = E_n\!\left[\max\bigl(f(x) - f_n^*,\, 0\bigr)\right],$$

where $f_n^*$ is the best function value observed so far and $E_n$ denotes expectation under the posterior after $n$ evaluations. EI balances high expected values against high uncertainty, driving efficient exploration, and admits a closed form under the GP posterior.
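The closed form of EI under a Gaussian posterior is short enough to write out directly. The snippet below is a sketch (function name and example numbers are illustrative); it shows how EI rewards both a high posterior mean and a high posterior standard deviation:

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best):
    """Closed-form EI for a posterior N(mu, sigma^2) and best observed value f_best."""
    sigma = np.maximum(sigma, 1e-12)         # guard against zero uncertainty
    z = (mu - f_best) / sigma
    return (mu - f_best) * norm.cdf(z) + sigma * norm.pdf(z)

# Same posterior mean, different uncertainty: the more uncertain point
# has a larger chance of a big improvement, so its EI is larger.
ei_low_unc  = expected_improvement(1.0, 0.5, f_best=1.2)
ei_high_unc = expected_improvement(1.0, 2.0, f_best=1.2)
```

Note that EI is strictly positive wherever $\sigma_n(x) > 0$, even when the posterior mean sits below $f_n^*$, which is what keeps the algorithm exploring.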
Knowledge Gradient (KG) measures the expected increase in the maximum of the posterior mean after one additional evaluation; it is considered more versatile than EI, especially in noisy settings or with complex constraints. KG is defined as:

$$\mathrm{KG}_n(x) = E_n\!\left[\mu_{n+1}^* - \mu_n^* \,\middle|\, x_{n+1} = x\right],$$

where $\mu_n^* = \max_{x'} \mu_n(x')$ is the maximum of the posterior mean after $n$ evaluations.
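Unlike EI, KG has no simple closed form in general, but it can be estimated by Monte Carlo: fantasize an observation at $x$ from the current posterior, recompute the posterior mean, and average the resulting gains in $\mu^*$. The sketch below assumes a zero-mean 1-D GP with a fixed RBF kernel and a grid approximation of $\max_{x'} \mu_n(x')$; all names are illustrative:

```python
import numpy as np

def rbf(A, B, ls=0.4):
    """Squared-exponential kernel matrix between 1-D point sets A and B."""
    d = A[:, None] - B[None, :]
    return np.exp(-0.5 * (d / ls) ** 2)

def post_mean_std(Xobs, yobs, Xq, jitter=1e-6):
    """Zero-mean GP posterior mean and std at query points Xq."""
    K = rbf(Xobs, Xobs) + jitter * np.eye(len(Xobs))
    Ks = rbf(Xq, Xobs)
    Kinv = np.linalg.inv(K)
    mu = Ks @ Kinv @ yobs
    var = 1.0 - np.sum(Ks @ Kinv * Ks, axis=1)
    return mu, np.sqrt(np.clip(var, 1e-12, None))

def knowledge_gradient(Xobs, yobs, x, Xgrid, n_samples=200, seed=0):
    """One-step Monte Carlo KG: expected gain in max posterior mean from sampling at x."""
    rng = np.random.default_rng(seed)
    mu_now, _ = post_mean_std(Xobs, yobs, Xgrid)
    mu_x, sd_x = post_mean_std(Xobs, yobs, np.array([x]))
    gains = []
    for _ in range(n_samples):
        y_sim = rng.normal(mu_x[0], sd_x[0])            # fantasized observation at x
        Xf, yf = np.append(Xobs, x), np.append(yobs, y_sim)
        mu_next, _ = post_mean_std(Xf, yf, Xgrid)       # updated posterior mean
        gains.append(mu_next.max() - mu_now.max())
    return float(np.mean(gains))

X = np.array([0.0, 1.0])
y = np.sin(3 * X)
grid = np.linspace(0.0, 2.0, 60)
kg_unexplored = knowledge_gradient(X, y, 1.8, grid)     # far from the data
kg_observed   = knowledge_gradient(X, y, 0.0, grid)     # already evaluated
```

Sampling at an already-observed point changes the posterior mean almost not at all, so its KG is near zero, while an unexplored point with high posterior variance has a substantial chance of raising $\mu^*$.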
Entropy Search (ES) and Predictive Entropy Search (PES) focus on information gain about the location of the global optimum. They aim to reduce the uncertainty (entropy) about the optimum's position. PES improves computational tractability over ES by reformulating the entropy reduction in terms of mutual information.
Advanced Bayesian Optimization Techniques
The paper explores several extended or "exotic" Bayesian Optimization problems, addressing practical challenges and solutions:
- Noisy Evaluations: Extending GP regression to handle noise and adapting acquisition functions like EI and KG appropriately for noisy settings.
- Parallel Evaluations: Strategies for optimizing acquisition functions with multiple simultaneous evaluations, crucial for leveraging modern computational resources.
- Constraints: Handling constraints on the feasible set with methods like expected improvement in feasible regions.
- Multi-Fidelity and Multi-Information Source Optimization: Efficiently utilizing various sources of information with different accuracies and costs.
- Random Environmental Conditions and Multi-task Optimization: Handling objectives that are integrals or sums over random environmental conditions.
- Derivative Observations: Incorporating gradient information to enhance the GP model and improve optimization efficiency.
Implications and Future Directions
The implications of Bayesian Optimization are profound, extending its applicability to numerous scientific and engineering disciplines. Its ability to efficiently optimize expensive black-box functions without derivative information makes it indispensable for high-stakes applications like algorithm hyperparameter tuning, engineering system design, materials discovery, and more.
Future research directions include:
- Developing multi-step optimal strategies to improve sequential decision-making processes.
- Exploring alternative statistical models beyond Gaussian Processes to better capture the characteristics of specific problems.
- Enhancing scalability to tackle high-dimensional optimization problems by identifying and exploiting problem-specific structures.
- Leveraging exotic problem structures for methodological advancements and real-world applications.
In summary, this paper serves as a critical resource for researchers and practitioners in the field of Bayesian Optimization, presenting foundational concepts, practical tools, and insights into future research avenues. The depth and breadth of coverage make it a valuable reference for anyone looking to understand or leverage Bayesian Optimization in their work.