Multi-Task Bayesian Optimization
- Multi-Task Bayesian Optimization is a framework that extends traditional Bayesian optimization by jointly optimizing multiple correlated objectives using multi-task Gaussian process surrogates and coregionalization kernels.
- It leverages scalable inference techniques like Kronecker methods and Matheron’s identity to achieve rapid, sample-efficient convergence even in high-dimensional and multi-output settings.
- MTBO incorporates innovative acquisition functions and transfer learning strategies to ensure robustness, safety, and adaptability in diverse, complex applications.
Multi-Task Bayesian Optimization (MTBO) extends standard Bayesian optimization (BO) to the setting where multiple correlated objectives (tasks or outputs) must be optimized jointly, leveraging inter-task dependencies to accelerate sample-efficient optimization. MTBO encompasses approaches for high-dimensional vector-valued objectives, simultaneous algorithm and hyperparameter selection, multi-fidelity and meta-learning settings, and safety-critical applications. Modern MTBO methodology is built around multi-task Gaussian process (MTGP) surrogates, Kronecker and coregionalization kernels, and scalable acquisition and sampling strategies that exploit the structure among tasks.
1. Core Principles and Multi-Task Surrogate Models
The foundational model for MTBO is the multi-task Gaussian process (MTGP) with intrinsic coregionalization, which enables joint modeling of several outcomes across a shared set of input designs. Consider $n$ input locations and $T$ correlated tasks/outputs. Observations are collected into an $n \times T$ matrix $Y$, vectorized as $\mathbf{y} = \operatorname{vec}(Y)$.
The standard prior used is $\operatorname{vec}(Y) \sim \mathcal{N}\big(\mathbf{0},\, K_X \otimes K_T + \sigma^2 I\big)$, where $K_X$ is the input covariance matrix, $K_T$ encodes task similarities (coregionalization matrix), and $\otimes$ denotes the Kronecker product (Maddox et al., 2021).
Covariances across outcomes are structured by the Intrinsic Coregionalization Model (ICM), $k\big((x,t),(x',t')\big) = k_X(x,x')\,[K_T]_{t,t'}$, allowing transfer of information between tasks: improving the surrogate on one task refines predictions for others according to $K_T$ (Maddox et al., 2021, Manzoni et al., 9 Dec 2025).
Extensions include Linear Model of Coregionalization (LMC), multi-output kernels for meta-learning (Papenmeier et al., 29 Jan 2026), and latent shared embedding strategies to unify heterogeneous hyperparameter spaces (Ishikawa et al., 13 Feb 2025).
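As a concrete illustration of the ICM prior above, the following minimal sketch builds the joint Kronecker covariance $K_X \otimes K_T$ from an assumed RBF input kernel and a low-rank-plus-diagonal coregionalization matrix; the kernel choice and hyperparameters are illustrative, not those of any cited paper.

```python
# Minimal sketch (assumed RBF input kernel): assembling the ICM prior covariance
# K_X (x) K_T from n inputs and a rank-r coregionalization matrix K_T = W W^T + diag(v).
import numpy as np

def rbf_kernel(X, lengthscale=1.0, outputscale=1.0):
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return outputscale * np.exp(-0.5 * sq / lengthscale ** 2)

def icm_prior_cov(X, W, v, sigma2=1e-3):
    """X: (n, d) inputs, W: (T, r) low-rank task factors, v: (T,) task-specific variances."""
    K_X = rbf_kernel(X)                # (n, n) input covariance
    K_T = W @ W.T + np.diag(v)         # (T, T) coregionalization matrix
    n, T = X.shape[0], W.shape[0]
    # Joint prior covariance over vec(Y); observation noise added on the diagonal.
    return np.kron(K_X, K_T) + sigma2 * np.eye(n * T)
```

In practice the full $nT \times nT$ matrix is never formed for large problems; the factor structure is exploited directly, as discussed in the next section.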
2. Scalable Inference and Sampling Schemes
A major challenge arises as $n$ or $T$ increases: the naive $\mathcal{O}(n^3 T^3)$ complexity of exact GP inference on the joint covariance becomes untenable. To address this, efficient computational techniques exploit structured covariance:
- Kronecker methods: The Kronecker structure of the joint covariance $K_X \otimes K_T$ is exploited via factorized eigendecomposition; posterior means and variances, as well as posterior samples, are computed at $\mathcal{O}(n^3 + T^3)$ cost via Kronecker and eigendecomposition algebra (Maddox et al., 2021); a minimal sketch follows at the end of this section.
- Matheron’s identity: Used for conditional simulation of posterior samples, facilitating MC-based acquisition strategies even for tens of thousands of outputs (Maddox et al., 2021).
- Manual grouping/dimensionality reduction: For high-dimensional parameter spaces, domain knowledge can be used to cluster parameters into low-dimensional, correlated sub-tasks, drastically reducing kernel inversion costs (Alabed et al., 2021).
The table below summarizes complexities for different approaches:
| Method | Complexity | Notes |
|---|---|---|
| Naive Cholesky on $K_X \otimes K_T$ | $\mathcal{O}(n^3 T^3)$ | Only feasible for small $nT$ |
| Kronecker + Matheron | $\mathcal{O}(n^3 + T^3)$ | Additive, scalable (Maddox et al., 2021) |
| Grouped low-dimensional tasks | $\mathcal{O}(n^3)$ per group | Each group fit as a small task GP with an ARD kernel (Alabed et al., 2021) |
These methods enable MTBO on problems with tens of thousands of correlated outputs, as in optical interferometer tuning, and in non-trivial multi-UAV task scenarios (Maddox et al., 2021, Manzoni et al., 9 Dec 2025).
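The sketch below illustrates the Kronecker eigendecomposition trick referenced above: the exact MTGP posterior mean is obtained from the eigendecompositions of $K_X$ and $K_T$ separately, at additive rather than multiplicative cubic cost. The helper signature is an assumption for illustration, not an interface from Maddox et al. (2021).

```python
# Minimal sketch (assumed notation): exact MTGP posterior mean via Kronecker
# eigendecomposition, avoiding the O(n^3 T^3) Cholesky of the joint covariance.
import numpy as np

def mtgp_posterior_mean(K_X, K_T, Y, sigma2, K_X_star):
    """Posterior mean at test points under vec(Y) ~ N(0, K_X (x) K_T + sigma2 I).

    K_X: (n, n) input covariance, K_T: (T, T) task covariance,
    Y: (n, T) observations, K_X_star: (m, n) cross-covariance to test inputs.
    """
    # Eigendecompose each factor separately: O(n^3 + T^3) instead of O((nT)^3).
    lam_x, Q_x = np.linalg.eigh(K_X)
    lam_t, Q_t = np.linalg.eigh(K_T)
    # Eigenvalues of the Kronecker sum are products of the factor eigenvalues plus noise.
    D = np.outer(lam_x, lam_t) + sigma2              # (n, T)
    # Solve (K_X (x) K_T + sigma2 I)^{-1} vec(Y) entirely in the Kronecker eigenbasis.
    alpha = Q_x @ ((Q_x.T @ Y @ Q_t) / D) @ Q_t.T    # (n, T)
    # Posterior mean for all tasks at the m test inputs.
    return K_X_star @ alpha @ K_T                    # (m, T)
```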
3. Acquisition Functions and Optimization Strategies
Acquisition functions in MTBO must balance information gain or expected improvement across multiple outputs:
- Multi-task Expected Improvement (EI): For a candidate input, samples are drawn from the MTGP posterior. Multi-task utility functions (e.g., improvement across all outputs, hypervolume) are computed and averaged to yield the acquisition. The Kronecker+Matheron approach enables rapid MC acquisition evaluation even in high dimensions (Maddox et al., 2021, Manzoni et al., 9 Dec 2025); a minimal sketch appears at the end of this section.
- Random Scalarization/UCB: For multi-objective setups, random scalarizations (e.g., linear or Chebyshev) are sampled, and scalarized UCB is maximized. Regret bounds directly reflect task similarity structure (Chowdhury et al., 2020).
- Min-Regret/Entropy Search: Information-theoretic approaches (entropy search or minimum-regret search) can be directly extended to vector-valued or contextual settings by using GPs over joint parameter-task domains (Metzen, 2016).
- Safe Acquisitions: Safety constraints are respected via robust upper confidence bounds or constrained EI computed with inflated multi-task GP confidence intervals that incorporate task correlation uncertainty (Lübsen et al., 2023, Luebsen et al., 11 Mar 2025).
Notably, robust error bounds for safety-critical MTBO require Bayesian or frequentist uncertainty quantification, with guarantees holding uniformly over confidence sets for hyperparameters such as the task covariance matrix (Lübsen et al., 2023, Luebsen et al., 11 Mar 2025).
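For the Monte Carlo multi-task EI described in the first bullet above, a minimal sketch is given below; the summed-improvement utility is one simple illustrative choice (hypervolume or minimum improvement are drop-in alternatives), and the function name and signature are hypothetical.

```python
# Minimal sketch (hypothetical helper): Monte Carlo expected improvement over a
# multi-task posterior, averaging an improvement-style utility across joint
# posterior samples of all T outputs at a single candidate input.
import numpy as np

def mc_multitask_ei(mean, cov, incumbent, num_samples=256, rng=None):
    """mean: (T,) posterior mean at the candidate, cov: (T, T) posterior covariance,
    incumbent: (T,) best observed value per task (maximization convention)."""
    rng = np.random.default_rng() if rng is None else rng
    samples = rng.multivariate_normal(mean, cov, size=num_samples)   # (S, T)
    # Per-sample utility: total improvement summed across tasks (one simple choice).
    utility = np.clip(samples - incumbent, 0.0, None).sum(axis=1)    # (S,)
    return utility.mean()
```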
4. Knowledge Transfer, Meta- and Lifelong Learning
Recent research has advanced meta-learning and transfer in MTBO:
- Sequential/Meta-MTBO: When tasks arrive sequentially, posterior or latent structure is estimated over previous tasks and propagated to new tasks (Zhang et al., 2019, Zhang et al., 2024). Lifelong Bayesian Optimization uses an Indian Buffet Process–driven latent factor model, which adaptively clusters tasks for selective sharing, yielding improvement in both sample-efficiency and robustness to task drift (Zhang et al., 2019).
- Shared Latent Embedding: Different algorithm/hyperparameter spaces are mapped to a shared latent space (MLP embedding, with adversarial-domain overlap), enabling a single MTGP surrogate and facilitating efficient algorithm/hyperparameter selection and transfer (Ishikawa et al., 13 Feb 2025).
- Meta-GP Priors: SMOG constructs a multi-output, multi-task GP where target-task surrogates are explicitly augmented with meta-task posteriors. Linear scaling in the number of meta-tasks is achieved, with principled propagation of task-wise epistemic uncertainty into target predictions (Papenmeier et al., 29 Jan 2026).
Large-scale MTBO can be achieved using pretrained LLMs to propose high-quality initializations for new tasks. Iterative fine-tuning of the LLM on past BO-optimal trajectories creates a positive feedback loop, yielding rapid convergence and scalability across 1,400+ tasks (Zeng et al., 11 Mar 2025).
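To make the meta-prior idea concrete, the following hedged sketch regresses the target task's residual against a prior mean supplied by meta-task posteriors; it illustrates the general mechanism of meta-informed surrogates rather than the exact SMOG construction (Papenmeier et al., 29 Jan 2026), and `meta_mean` and `kernel` are assumed callables.

```python
# Hedged sketch: GP regression for the target task with a prior mean learned on
# meta-tasks, so meta-knowledge shifts predictions where target data is scarce.
import numpy as np

def target_posterior_with_meta_prior(X, y, x_star, meta_mean, kernel, sigma2=1e-4):
    """X: (n, d) target observations, y: (n,), x_star: (m, d) query points,
    meta_mean: callable mapping inputs to a prior mean built from meta-task posteriors,
    kernel: callable k(A, B) -> covariance matrix."""
    K = kernel(X, X) + sigma2 * np.eye(len(X))
    k_star = kernel(x_star, X)
    # Regress only the residual between target data and the meta-informed prior mean.
    resid = y - meta_mean(X)
    alpha = np.linalg.solve(K, resid)
    return meta_mean(x_star) + k_star @ alpha
```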
5. Negative Transfer, Safety, and Robustness
A central concern in MTBO is preventing negative transfer—degradation due to transfer from poorly related tasks. Bayesian Competitive Knowledge Transfer (BCKT) formulates a competition between within-task and transferred solutions, with a Bayesian update on cross-task transferabilities; negative transfer is naturally suppressed as transferability posteriors approach zero (Lu et al., 27 Oct 2025). Similarly, advances in robust safety require uniform error bounds that remain valid over all plausible inter-task correlation parameters, ensuring that safety constraints are respected at all times (Lübsen et al., 2023, Luebsen et al., 11 Mar 2025).
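A hedged illustration of the competitive-transfer idea is sketched below: a Beta posterior per source task tracks how often transferred candidates beat within-task candidates, so sources whose transferability posterior collapses toward zero stop contributing. This conveys the suppression mechanism described above, not the exact BCKT update rules (Lu et al., 27 Oct 2025).

```python
# Hedged illustration: Bayesian bookkeeping of per-source "transferability" as a
# Beta posterior over the probability that a transferred solution wins the competition.
import numpy as np

class TransferabilityTracker:
    def __init__(self, num_sources, a0=1.0, b0=1.0):
        # Beta(a, b) posterior per source task; starts at an uninformative prior.
        self.a = np.full(num_sources, a0)
        self.b = np.full(num_sources, b0)

    def update(self, source, transferred_won):
        """After evaluating both candidates, record whether the transferred one won."""
        if transferred_won:
            self.a[source] += 1.0
        else:
            self.b[source] += 1.0

    def transfer_probability(self, source):
        """Posterior mean transferability; near zero means transfer is effectively disabled."""
        return self.a[source] / (self.a[source] + self.b[source])
```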
Safety-centric methods employ multi-task GPs with unknown or uncertain task-correlation matrices, using posterior hyperpriors (e.g., LKJ) and inflated UCBs to construct safe sets. Strategies such as scheduling real and surrogate task evaluations are shown to achieve significantly faster convergence (up to 3× reduction in expensive real evaluations) compared to single-task baselines (Lübsen et al., 2023, Luebsen et al., 11 Mar 2025).
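A minimal sketch of the safe-set construction described above follows, using an inflated confidence width on the constraint posterior as a stand-in for the robust multi-task bounds; the inflation factor and interface are assumptions for illustration, not the exact bound of Lübsen et al. (2023).

```python
# Minimal sketch (assumed interface): safe candidate set from multi-task GP
# predictions with an inflated confidence width covering correlation uncertainty.
import numpy as np

def safe_set(mean_c, std_c, threshold, beta=2.0, inflation=1.5):
    """mean_c, std_c: (m,) posterior mean/std of the safety constraint at m candidates,
    threshold: the constraint must stay above this value,
    beta: nominal confidence factor, inflation: extra factor for task-correlation uncertainty."""
    lcb = mean_c - inflation * beta * std_c   # pessimistic lower bound on the constraint
    return np.where(lcb >= threshold)[0]      # indices of candidates deemed safe
```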
6. Empirical Highlights and Practical Guidelines
MTBO achieves substantial empirical gains across domains:
- High-dimensional and Many-task Regimes: MTBO with Kronecker+Matheron enables constrained optimization with 50–100 outputs in seconds, and BO over 65,000-D outputs with 2× faster convergence (Maddox et al., 2021).
- Hyperparameter Tuning and CASH: Shared-latent MTBO with PTEM ranking consistently outperforms baselines in low-data, low-evaluation regimes, demonstrating acceleration even in mixed discrete/continuous hyperparameter domains (Ishikawa et al., 13 Feb 2025).
- Sample Efficiency: In pulmonary nodule classification, MTBO reduces required evaluations by half compared to single-task BO (Chi et al., 2024).
- Control and Robotics: Hierarchical multi-task surrogates exploiting dynamics-cost structure yield 30–50% faster sample-efficient learning in sequential control tasks (Hirt et al., 18 Aug 2025). Multi-UAV trajectory tuning shows MTBO achieves competitive or superior performance with lower computational cost, especially as swarm complexity grows (Manzoni et al., 9 Dec 2025).
- Scalability: LBO and SMOG provide linear scaling in numbers of tasks or objectives by leveraging modular priors and parallelizable meta-GP fitting (Zhang et al., 2019, Papenmeier et al., 29 Jan 2026).
Practical implementation highlights include the requirement of a block design (all tasks observed at all inputs) for exact Kronecker methods, small latent/coregionalization dimensions, robust double-precision eigensolvers, and careful hyperparameter initialization to avoid negative transfer or autokrigeability collapse (Maddox et al., 2021, Ishikawa et al., 13 Feb 2025). Approximations (e.g., sparse GPs, Nyström) may be necessary for large $n$ or $T$ (Chowdhury et al., 2020, Ishikawa et al., 13 Feb 2025).
7. Limitations, Open Questions, and Future Directions
Key limitations include:
- Block-design requirements in Kronecker methods—missing data patterns complicate scalable inference (Maddox et al., 2021).
- Sensitivity to task similarities—ICM and LMC models may be insufficient when task relations are nonlinear or highly varying; negative transfer is a persistent risk (Maddox et al., 2021, Lu et al., 27 Oct 2025).
- Computational scaling to very large $n$ and $T$: while Kronecker, grouping, and meta-GP structures relieve cubic scaling, further advances in sparse, approximate inference are required for very large datasets (Chowdhury et al., 2020, Papenmeier et al., 29 Jan 2026).
- Robustness to concept drift and nonstationarity in lifelong settings remains limited, although automated latent clustering offers some adaptivity (Zhang et al., 2019).
- Formal regret analysis for modern, deeply parameterized or LLM-initialized MTBO is still nascent, with most theory available for classical kernel-based models (Chowdhury et al., 2020).
Open research directions target nonstationary or evolving task relationships, integration of batch/parallel acquisitions, and unified transfer-acquisition strategies that provably balance present-task and future-task sample efficiency (Zhang et al., 2019, Zhang et al., 2024, Papenmeier et al., 29 Jan 2026).
In summary, Multi-Task Bayesian Optimization leverages multi-output surrogates, efficient Kronecker-structured inference, and transfer/metamodeling strategies to accelerate optimization across correlated objectives or tasks, enabling scalable, sample-efficient learning in domains ranging from scientific simulators and automated ML to robotics and safe control (Maddox et al., 2021, Ishikawa et al., 13 Feb 2025, Manzoni et al., 9 Dec 2025, Papenmeier et al., 29 Jan 2026, Luebsen et al., 11 Mar 2025).