Ensemble Transfer Learning Bayesian Optimisation
- The paper introduces a method that integrates multiple Gaussian Process surrogates from target and source tasks to accelerate convergence on expensive black-box objectives.
- It employs adaptive weight computation via regularised regression and ranking-based schemes to selectively leverage relevant historical data while mitigating negative transfer.
- Warm-start initialisation combined with non-negative weight constraints effectively reduces early simple regret and the total evaluation budget.
Ensemble-based transfer learning Bayesian optimisation (BO) is an approach within sample-efficient global optimisation for expensive black-box objectives, designed to leverage heterogeneous historical datasets from related tasks to accelerate convergence on a new “target” task. Rather than merging all previous data into a monolithic model, ensemble-based transfer learning fits a separate surrogate—typically a Gaussian Process (GP)—to each source task and the target. These models are combined into an adaptive ensemble whose composition evolves as observations on the target accrue, allowing efficient transfer while controlling for negative transfer.
1. Formalisation and Ensemble Surrogate Construction
In classical BO, a single GP surrogate is iteratively fit to the target’s observations to model the objective and guide evaluation via an acquisition function. In ensemble-based transfer learning BO, assume access to $K$ historical source-task datasets $D_1, \dots, D_K$. Each dataset $D_i$ provides a GP surrogate $f_i$. Simultaneously, a GP $f_t$ is fit to the accumulating data from the target task.
The ensemble surrogate prediction at any candidate $x$ is formed as a weighted combination of the individual surrogate posteriors:

$$\bar{\mu}(x) = \sum_{i} w_i\,\mu_i(x)$$

where $\mu_i$ is the posterior mean of the $i$-th surrogate, with the target GP included as one ensemble member. Here, $w_i \ge 0$ are non-negative weights assigned to each source and target model (typically the normalisation $\sum_i w_i = 1$ is included for interpretability), adaptively re-computed as new target observations arrive, allowing the ensemble to “borrow strength” from relevant source task surrogates while retaining the flexibility to diminish irrelevant ones (Trinkle et al., 22 Jan 2026, Feurer et al., 2018, Bai et al., 2023).
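As a concrete sketch, the ensemble posterior can be combined as follows. The `predict` interface (posterior mean and variance per surrogate) is an assumption for illustration, and the mixture-of-Gaussians variance is one common choice rather than something prescribed by the source:

```python
import numpy as np

def ensemble_predict(x, surrogates, weights):
    """Weighted ensemble prediction from per-task GP surrogates.

    surrogates: objects with .predict(x) -> (mean, var), source models
                first and the target model last (interface assumed)
    weights:    non-negative weights, normalised here to sum to 1
    """
    w = np.asarray(weights, dtype=float)
    assert np.all(w >= 0), "non-negativity guards against negative transfer"
    w = w / w.sum()

    means = np.array([s.predict(x)[0] for s in surrogates])
    varis = np.array([s.predict(x)[1] for s in surrogates])

    mu = np.dot(w, means)                       # ensemble mean
    # Treat the ensemble as a Gaussian mixture: the variance then also
    # captures disagreement between the individual surrogate means.
    var = np.dot(w, varis + means**2) - mu**2
    return mu, var
```

With equal weights, the variance term rewards candidates where the source and target models disagree, which naturally encourages exploration.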
2. Weight Computation Schemes and Regularisation
Several weight computation schemes have been developed, primarily falling into regression-based and ranking-based families.
- Regularised Regression with Non-Negativity ("RiGPE+" and "LaGPE+"): Weights are obtained by solving a regularised linear regression of the observed target values against the surrogate predictions over the current target data, subject to $w_i \ge 0$:

$$\mathbf{w}^* = \arg\min_{\mathbf{w}\,\ge\,0}\ \sum_{(x,\,y) \in D_t} \Big( y - \sum_i w_i\,\mu_i(x) \Big)^2 + \lambda\,\Omega(\mathbf{w})$$

where $\mu_i(x)$ is the $i$-th surrogate's posterior mean and $D_t$ is the current target dataset. The penalty $\Omega$ is the squared L2 norm for ridge regularisation (RiGPE+) or the L1 norm for lasso (LaGPE+), with the parameter $\lambda$ pre-learned via leave-one-task-out cross-validation on source data. The use of non-negative weights guards against negative transfer, and empirical results show improved performance compared to unconstrained or negative-weighted alternatives (Trinkle et al., 22 Jan 2026).
- Ranking-based Methods (RGPE, TSTR): RGPE (Ranking-Weighted Gaussian Process Ensemble) derives $w_i$ as the empirical frequency, across $S$ bootstrapped draws of the target data, with which surrogate $i$ yields the lowest pairwise ranking loss (Feurer et al., 2018):

$$w_i = \frac{1}{S} \sum_{s=1}^{S} \mathbb{1}\!\left[\, i = \arg\min_j \mathcal{L}\big(f_j, D_t^{(s)}\big) \right]$$

where the ranking loss $\mathcal{L}$ counts misranked pairs of target observations in the bootstrap sample $D_t^{(s)}$.
The TSTR approach uses a kernel between order-features of source and target predictions to inform the weights $w_i$ (Bai et al., 2023).
- Other Schemes (WAC, Product of Experts): Unconstrained gradient-based weighting (WAC) and precision-weighted mixtures (as in Product-of-GP-Experts) are also employed, but often do not match the empirical performance of the above families.
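For the regression-based family, non-negative ridge weights can be computed by reducing the constrained ridge problem to non-negative least squares on an augmented system. This is a minimal sketch in the spirit of RiGPE+ using SciPy's NNLS solver, not the authors' implementation:

```python
import numpy as np
from scipy.optimize import nnls

def nonneg_ridge_weights(mu_matrix, y, lam):
    """Non-negative ridge regression for ensemble weights.

    mu_matrix: (n_target_points, n_models) surrogate mean predictions
    y:         (n_target_points,) observed target values
    lam:       ridge penalty, assumed pre-learned via leave-one-task-out CV

    Ridge with non-negativity is equivalent to NNLS on an augmented
    system: minimise ||[A; sqrt(lam) I] w - [y; 0]||^2 s.t. w >= 0.
    """
    n, m = mu_matrix.shape
    A = np.vstack([mu_matrix, np.sqrt(lam) * np.eye(m)])
    b = np.concatenate([y, np.zeros(m)])
    w, _ = nnls(A, b)
    s = w.sum()
    return w / s if s > 0 else w      # normalise for interpretability
```

Models whose predictions explain the target observations receive large weights; irrelevant sources are driven to exactly zero by the non-negativity constraint.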
Implementation involves frequent re-fitting of weights as the target dataset grows, typically leveraging bootstrapping for stability and avoiding overfitting to early or noisy target data (Trinkle et al., 22 Jan 2026, Feurer et al., 2018).
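The bootstrapped ranking-based weights can be sketched as follows. This is an illustration of the RGPE scheme described above; the function names and array layout are my own:

```python
import numpy as np

def ranking_loss(pred, y):
    """Count misranked pairs: orderings where pred disagrees with y."""
    n = len(y)
    loss = 0
    for i in range(n):
        for j in range(n):
            if i != j:
                loss += int((pred[i] < pred[j]) != (y[i] < y[j]))
    return loss

def rgpe_weights(preds, y, n_boot=100, rng=None):
    """RGPE-style weights: the frequency with which each model attains
    the lowest ranking loss across bootstrap resamples of target data.

    preds: (n_models, n_points) posterior means at the target inputs,
           with the target model included as the last row
    y:     (n_points,) observed target values
    """
    rng = np.random.default_rng(rng)
    n_models, n = preds.shape
    wins = np.zeros(n_models)
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)           # bootstrap draw
        losses = np.array([ranking_loss(p[idx], y[idx]) for p in preds])
        best = np.flatnonzero(losses == losses.min())
        wins[rng.choice(best)] += 1                # break ties at random
    return wins / n_boot
```

Because weights depend only on rankings, the scheme is robust to differences in objective scale between source and target tasks.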
3. Detection and Mitigation of Negative Transfer
Negative transfer is explicitly addressed via fallback mechanisms:
- Expert Dropout (for ranking-based schemes): Each source surrogate $f_i$ is dropped from the ensemble with probability

$$p_i^{\text{drop}} = \frac{1}{S} \sum_{s=1}^{S} \mathbb{1}\!\left[\, \mathcal{L}\big(f_i, D_t^{(s)}\big) > \mathcal{L}\big(f_t, D_t^{(s)}\big) \right]$$

i.e., the fraction of bootstrap samples in which its ranking loss exceeds that of the target surrogate $f_t$, preferentially dropping sources whose performance falls behind the target, particularly as BO progresses.
- Alternating-Mode Switching (for regression-based schemes): The method alternates between standard BO (target GP only) and transfer mode (ensemble), governed by a “mode switch” indicator that compares cross-validated mean squared errors of the two modes, guarding against persistently high error from transfer (Trinkle et al., 22 Jan 2026).
Empirical evidence indicates these mechanisms do not degrade performance but offer no consistent improvement over the base positive-weighted ensemble transfer strategies under broad conditions (Trinkle et al., 22 Jan 2026).
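A minimal sketch of the expert-dropout guard, under the assumption that drop probabilities are the per-bootstrap loss-exceedance fractions described above (array shapes and the renormalisation step are illustrative choices):

```python
import numpy as np

def expert_dropout(weights, source_losses, target_losses, rng=None):
    """Stochastically drop source experts that underperform the target.

    weights:       per-model ensemble weights, sources first, target last
    source_losses: (n_sources, n_boot) ranking losses per bootstrap draw
    target_losses: (n_boot,) target model's losses on the same draws
    """
    rng = np.random.default_rng(rng)
    w = np.array(weights, dtype=float)
    # Drop probability: fraction of draws where the source loses to the target.
    p_drop = (source_losses > target_losses).mean(axis=1)
    dropped = rng.random(len(p_drop)) < p_drop
    w[:len(p_drop)][dropped] = 0.0
    s = w.sum()
    return w / s if s > 0 else w      # renormalise the surviving experts
```

The target model itself is never dropped, so as its cross-validated losses improve the ensemble gradually collapses toward standard single-task BO.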
4. Warm-Start Initialisation Using Source Data
Warm-start strategies can substantially accelerate BO. Instead of random sampling for the initial evaluations, warm-start selects candidate points directly from the historic source datasets that the current ensemble predicts to be promising for the target.
Empirical analysis with as few as two warm-start points shows that this approach consistently yields lower early simple regret than random initialisation, particularly over the first 30 iterations (Trinkle et al., 22 Jan 2026). This result is robust across mixed search spaces (continuous, integer, categorical) and multiple real-world and surrogate benchmarks.
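A warm-start selector in this spirit scores historic source inputs with the ensemble mean and keeps the best few (a sketch for a minimisation objective; the `predict` interface is an assumption):

```python
import numpy as np

def warm_start_candidates(source_X, surrogates, weights, n_init=2):
    """Pick initial target-task evaluations from historic source inputs.

    source_X:   (n_hist, d) union of previously evaluated source inputs
    surrogates: models with .predict(x) -> (mean, var) (interface assumed)
    weights:    current ensemble weights, normalised below
    """
    means = np.array([[s.predict(x)[0] for x in source_X]
                      for s in surrogates])
    w = np.asarray(weights, dtype=float) / np.sum(weights)
    scores = w @ means                     # ensemble mean per candidate
    best = np.argsort(scores)[:n_init]     # lowest predicted objective
    return source_X[best]
```

Since these points were already evaluated on related tasks, they tend to lie in regions that are at least plausible for the target, which is what drives the observed early-regret reduction.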
5. Empirical Benchmarks and Findings
Evaluation across nine benchmarks, with up to 15 seeds and a fixed iteration budget per run, shows:
- Warm-start initialisation (2 points) outperforms random initialisation (10 points) in 8/9 cases, especially in early BO.
- Enforcing non-negative weights ($w_i \ge 0$) in the ensemble consistently yields superior performance compared to unconstrained or negative-weighted schemes.
- Ranking-based ensemble weights (RGPE, TSTR) excel in low-dimensional or discretised domains, while regularised regression (RiGPE+, LaGPE+) dominates in higher-dimensional or mixed-variable settings.
- Neither expert-dropping nor alternating-mode switching confers a consistent advantage over standard positive-weighted transfer ensembles.
- WAC (unregularised, unconstrained ensemble) ranks worst in comparative evaluations.
- Practical performance gains consist of reduced early simple regret and reduced function evaluation budgets to reach target accuracy compared to both vanilla BO and competing transfer learning strategies (Trinkle et al., 22 Jan 2026, Feurer et al., 2018, Bai et al., 2023, Tighineanu et al., 2021).
New “real-time” benchmarks introduced for transfer-BO include OpenML-CC18 RandomForest (mixed-variable HPO), LassoBench (CalCOFI regression), and Cartpole simulation (continuous control), spanning diverse variable types and domains (Trinkle et al., 22 Jan 2026).
6. Algorithmic Workflow
A typical ensemble-based transfer learning BO pipeline involves:
- Constructing an ensemble of GPs over all source tasks.
- Warm-start initialisation by evaluating ensemble-predicted promising candidates.
- At each step:
  - Updating the target GP with accumulated target data.
  - Recomputing ensemble weights using one of the described strategies (regularised regression, ranking-loss frequency).
  - Optionally applying bad transfer handling (expert dropping/mode switch).
  - Computing the ensemble mean and variance for acquisition function optimisation (e.g., Lower Confidence Bound).
  - Acquiring and evaluating the next point and updating the target dataset.
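The workflow above can be sketched end-to-end over a finite candidate set. This is a simplified illustration: the model and weight-function interfaces are assumptions for the sketch, not the paper's API, and the weighted-variance LCB is a deliberate simplification:

```python
import numpy as np

def lcb(mu, var, beta=2.0):
    """Lower Confidence Bound for minimisation."""
    return mu - beta * np.sqrt(var)

def transfer_bo_loop(objective, candidates, fit_target, source_models,
                     compute_weights, n_iter=10, warm_idx=(0,)):
    """Generic ensemble transfer-BO loop over a finite candidate set.

    fit_target(X, y) -> surrogate with .predict(x) -> (mean, var)
    source_models:      pre-fitted source surrogates, same interface
    compute_weights(models, X, y) -> weight per model (target last)
    """
    X = [candidates[i] for i in warm_idx]      # warm-start evaluations
    y = [objective(x) for x in X]
    for _ in range(n_iter):
        target = fit_target(np.asarray(X), np.asarray(y))
        models = list(source_models) + [target]
        w = np.maximum(np.asarray(
            compute_weights(models, np.asarray(X), np.asarray(y)),
            dtype=float), 0.0)
        w = w / w.sum()                        # enforce w >= 0, sum to 1
        scores = []
        for x in candidates:                   # ensemble LCB per candidate
            mus, vs = zip(*(m.predict(x) for m in models))
            scores.append(lcb(float(np.dot(w, mus)), float(np.dot(w, vs))))
        x_next = candidates[int(np.argmin(scores))]
        X.append(x_next)
        y.append(objective(x_next))
    return X, y
```

The modularity claimed in the text shows up directly: `compute_weights` can be any of the schemes in Section 2, and the warm-start indices can come from a selector like the one sketched earlier.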
This workflow is modular, allowing for interchangeability of weighting schemes, initialisation strategies, and transfer fallback rules (Trinkle et al., 22 Jan 2026, Feurer et al., 2018).
7. Practical Guidelines and Theoretical Limits
Best practices established in current literature include:
- Inclusion of warm-start initialisation is strongly recommended.
- Imposing non-negativity on ensemble weights is generally crucial for robust performance and for controlling the risk of negative transfer.
- Pre-learn the regularisation parameter via cross-validation on source data to avoid pathological early optimisation states.
- Employ ranking-weighted ensembles in low-dimensional/discrete settings; prefer positive-weighted regularised regression in higher dimensions or when downweighting many irrelevant sources is required.
- Ensemble-based transfer BO cannot degrade asymptotic convergence beyond a calculable constant factor compared to vanilla BO, guaranteeing “safe transfer” properties (Feurer et al., 2018).
- For new BO applications with mixed variables, the default pipeline is warm-start + positive-ridge weights + a robust acquisition function (e.g., LCB), with optional ablation of other components as necessary.
Empirical findings reinforce that these two components, warm-start and positive-weight constraints, yield the most significant and reliable gains in practical settings with diverse variable types and challenging benchmarks (Trinkle et al., 22 Jan 2026, Bai et al., 2023, Tighineanu et al., 2021, Feurer et al., 2018).