
Real-Time Transfer Learning BO Benchmarks

Updated 26 January 2026
  • The article surveys benchmark frameworks that use historical task data to improve sample efficiency and accelerate optimization in real-time, costly evaluation settings.
  • It details pipeline components such as nonnegative surrogate weighting, ranking-based bootstrap ensembles, and affine transformations to reduce negative transfer under strict compute budgets.
  • It compares performance metrics like regret, success rate, and wall-clock cost across diverse domains—from robotics to hyperparameter tuning—to guide future methodological improvements.

Real-time transfer learning Bayesian optimisation (RT-TL BO) benchmarks constitute a critical framework for evaluating methods that exploit historical data from related tasks to accelerate sample-efficient black-box function optimisation under strict wall-clock constraints. These benchmarks formalise both the algorithmic and empirical protocols by which transfer strategies and pipeline components (e.g., surrogate model weighting, warm-start initialisation, cross-task kernel design) are quantitatively compared in "real-time" regimes—i.e., where evaluation costs, compute budgets, or deployment schedules require immediate assimilation and adaptation of new data. This article surveys benchmark definitions, state-of-the-art pipelines, representative testbeds, and the comparative experimental landscape for RT-TL BO, with explicit reference to current arXiv research.

1. Benchmark Objectives and Design Principles

RT-TL BO benchmarks are distinguished by several structural properties. First, they target optimisation tasks where each function evaluation is costly or time-constrained, as is typical in robotics, simulation, hardware tuning, or real-world hyperparameter search. Second, the transfer learning protocol leverages historical datasets from related but non-identical tasks, often exhibiting nontrivial distributional shift, overlapping but not identical variable spaces, or incremental context changes. Third, the benchmark framework specifies data ingestion (source/target splits, warm-start selection), surrogate model architectures (ensemble, kernel, or meta-learned), acquisition strategies, and evaluation schedules under a fixed budget (e.g., 40–100 evaluations). For comprehensive assessment, benchmarks report both sample efficiency (regret, success rate, error metrics) and computational overhead per iteration (Feurer et al., 2018, Trinkle et al., 22 Jan 2026, Hellan et al., 2023).
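
As a concrete illustration of such a protocol, the following sketch runs a fixed-budget loop that warm-starts from historical source runs and records both simple regret and per-iteration wall-clock overhead. It is a minimal sketch assuming hypothetical `objective` and `propose` callables (the latter standing in for the surrogate-plus-acquisition step); it is not the interface of any cited benchmark suite.

```python
import time
import numpy as np

def run_rt_tl_bo(objective, warm_start_points, propose, budget=50, f_opt=None):
    """Minimal fixed-budget RT-TL BO protocol sketch.

    objective         : expensive black-box function to minimise.
    warm_start_points : points selected from historical source runs.
    propose           : hypothetical acquisition step, (X, y) -> next point.
    Returns the evaluation history plus a per-iteration log of simple
    regret and surrogate/acquisition wall-clock overhead.
    """
    X, y, log = [], [], []
    for x in warm_start_points:                      # assimilate warm starts
        X.append(x); y.append(objective(x))
    while len(y) < budget:
        t0 = time.perf_counter()
        x_next = propose(np.array(X), np.array(y))   # surrogate + acquisition
        overhead = time.perf_counter() - t0          # per-iteration compute cost
        X.append(x_next); y.append(objective(x_next))
        best = min(y)
        regret = best - f_opt if f_opt is not None else best
        log.append({"iter": len(y), "regret": regret, "overhead_s": overhead})
    return np.array(X), np.array(y), log
```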

2. Representative Real-Time Transfer Benchmarks

Recent literature provides a spectrum of canonical RT-TL BO testbeds, each parameterised by domain, variable types, and historical task structure:

| Benchmark | Domain | Search Space | Transfer Tasks | Evaluation Cost |
|---|---|---|---|---|
| randomforest | Binary classification (OpenML-CC18) | 10D mixed | 38 datasets | 1–5 s |
| lassobench | Regression (Fisheries) | 10D continuous | 59 locations | 0.1 s |
| cartpole | Simulated LQR control | 2D continuous | 50 simulations | 0.01 s |
| OTHPO | MNIST XGBoost, data drift | 4D mixed | 28 contexts | ~0.1 s |
| BBOB RF | Black-box BBOB suite | 2–10D continuous | 24 functions | <10 s |
| Robotics TL | Kamido vision/grasp tuning | 9D continuous | 11–13 objects | 2 h / 40 trials |

RT-TL BO benchmarks typically initialise with "warm-start" points from historical runs, assimilate new target observations rapidly, and evaluate both transfer and single-task methods under identical conditions (Trinkle et al., 22 Jan 2026, Hellan et al., 2023, Pan et al., 23 Jan 2025, Petit et al., 2018).

3. Pipeline Components: Surrogate Models, Weighting, and Warm-Start

A recurring architecture in RT-TL BO is the ensemble surrogate model. For N historical source tasks and one target task, the pipeline constructs an ensemble of N+1 surrogates (typically Gaussian processes with Matérn or RBF kernels, but also random forests), each providing a predictive mean and uncertainty (Feurer et al., 2018, Trinkle et al., 22 Jan 2026, Tighineanu et al., 2021, Pan et al., 23 Jan 2025):

  • Weighting strategies: A key pipeline component is the assignment of nonnegative, regularised regression weights to each ensemble member, optimised for predictive accuracy on the growing target set. Both ridge (RiGPE) and lasso (LaGPE) losses constrained to w_i ≥ 0 are empirically superior to unconstrained variants, effectively mitigating negative transfer (a minimal weighting sketch follows this list) (Trinkle et al., 22 Jan 2026).
  • Ranking-based and bootstrap ensembles: Ranking-loss bootstrap approaches (RGPE) assign weights by bootstrapped minimisation of pairwise ranking error over target data (a simplified ranking sketch also follows this list) (Feurer et al., 2018). In high-dimensional settings, regression-based weights outperform RGPE; in low-dimensional or categorical domains, RGPE or two-stage transfer surrogates (TSTR) remain competitive (Trinkle et al., 22 Jan 2026).
  • Warm-start initialisation: Warm-start selection algorithms extract the best points from source surrogates, minimising the mean or ranking loss over source evaluations, leading to improved performance in early iterations (Trinkle et al., 22 Jan 2026, Hellan et al., 2023, Petit et al., 2018).
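
A minimal sketch of the nonnegative weighting step referenced in the first bullet above: ridge-regularised, nonnegativity-constrained weights are fitted to the ensemble members' predictions at the observed target points. The ridge penalty is absorbed into an augmented least-squares problem solved with `scipy.optimize.nnls`; the interface is illustrative, not the RiGPE implementation itself.

```python
import numpy as np
from scipy.optimize import nnls

def fit_nonneg_ridge_weights(member_preds, y_target, ridge=1e-2):
    """Nonnegative ridge weights for an ensemble surrogate (RiGPE-style sketch).

    member_preds : (n_obs, n_members) predictions of each surrogate
                   (N source models + 1 target model) at the target points.
    y_target     : (n_obs,) observed target values.
    The ridge term is absorbed by augmenting the design matrix, so a
    single NNLS solve yields w >= 0 minimising
    ||member_preds @ w - y||^2 + ridge * ||w||^2.
    """
    n_obs, n_members = member_preds.shape
    A = np.vstack([member_preds, np.sqrt(ridge) * np.eye(n_members)])
    b = np.concatenate([y_target, np.zeros(n_members)])
    w, _ = nnls(A, b)
    return w

def ensemble_predict(member_means, weights):
    """Weighted predictive mean of the ensemble at new inputs."""
    return member_means @ weights
```

Swapping the ridge penalty for an L1 term gives the LaGPE variant; in either case it is the nonnegativity constraint that curbs negative transfer from poorly matched sources.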
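
The ranking-based alternative from the second bullet can be sketched in the same setting: each member's weight is the fraction of bootstrap resamples in which it best preserves the ordering of the target observations. This simplification (no per-resample GP refitting) follows the spirit of RGPE's ranking-loss weighting, not its exact procedure.

```python
import numpy as np

def rgpe_style_weights(member_preds, y_target, n_boot=100, rng=None):
    """Bootstrap ranking-loss weights (simplified RGPE-style sketch).

    member_preds : (n_obs, n_members) member predictions at target points.
    y_target     : (n_obs,) observed target values.
    A member's ranking loss counts pairs (i, j) it orders differently
    from the observed targets; its weight is the fraction of bootstrap
    resamples in which it attains the minimal loss.
    """
    rng = np.random.default_rng(rng)
    n_obs, n_members = member_preds.shape
    wins = np.zeros(n_members)
    for _ in range(n_boot):
        idx = rng.integers(0, n_obs, size=n_obs)          # bootstrap resample
        losses = np.zeros(n_members)
        for m in range(n_members):
            p, t = member_preds[idx, m], y_target[idx]
            # count discordant pairs between predicted and observed ordering
            losses[m] = np.sum((p[:, None] < p[None, :]) ^ (t[:, None] < t[None, :]))
        best = np.flatnonzero(losses == losses.min())
        wins[best] += 1.0 / len(best)                     # split ties evenly
    return wins / n_boot
```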

4. Transfer Kernels, Meta-Learning, and Domain Adaptation

RT-TL BO benchmarks evaluate not only ensemble pipelines but also advanced kernel and meta-learning approaches:

  • Hierarchical GP and transfer kernels: Hierarchical kernels parameterise cross-task similarity via separable or coregionalisation kernels, with computational complexity scaling as O((N_s + N_t)^3) for a full multi-task GP and O(N_t^3 + N_t^2 N_s + N_t N_s^2) for scalable SHGP/BHGP models (Tighineanu et al., 2021).
  • Quantile-based Copula processes: Gaussian Copula models pool historical cross-task evaluations by mapping responses to normalised quantiles, enabling scale-independent transfer and multi-objective scalarisation (a quantile-mapping sketch follows this list) (Salinas et al., 2019).
  • Meta-learned acquisition functions: RL-driven meta-learning schemes train neural acquisition policies on source task distributions to instantiate custom acquisition functions for new target tasks, adapting to task structure and falling back to general acquisition functions if required (Volpp et al., 2019).
  • Affine domain transformations: For non-differentiable surrogates (e.g., random forests), adaptation via affine input transformations (rotations and translations learned with CMA-ES) yields rapid (<10 s) surrogate adjustment and efficient transfer to new domains, as demonstrated on BBOB and industrial testbed benchmarks (an affine-adaptation sketch also follows this list) (Pan et al., 23 Jan 2025).
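
A minimal sketch of the quantile mapping behind Copula-based pooling (second bullet above), assuming only that each task's responses are transformed independently: ranks become empirical quantiles, which are pushed through the inverse normal CDF so that differently scaled tasks can be pooled. It is a stand-in for, not a reproduction of, the cited Gaussian Copula model.

```python
import numpy as np
from scipy.stats import norm

def gaussian_copula_transform(y):
    """Map one task's responses to scale-free Gaussian scores.

    Ranks -> empirical quantiles in (0, 1) -> inverse normal CDF.
    Applying this per task lets historical evaluations from tasks
    with arbitrary output scales be pooled in a single surrogate.
    """
    y = np.asarray(y, dtype=float)
    ranks = y.argsort().argsort() + 1            # 1..n ranks
    quantiles = ranks / (len(y) + 1.0)           # strictly inside (0, 1)
    return norm.ppf(quantiles)                   # Gaussian scores

# Example: pooling two tasks with very different output scales.
z_a = gaussian_copula_transform([0.91, 0.88, 0.95])       # accuracies
z_b = gaussian_copula_transform([1200.0, 800.0, 1500.0])  # runtimes
```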
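
For the affine-transformation adaptation in the last bullet, a minimal 2D sketch learns a rotation and translation of target inputs so that a frozen source surrogate's predictions match the few available target observations. The cited pipeline optimises this with CMA-ES; scipy's Nelder-Mead stands in here, and `source_model` is any fitted regressor with a `predict` method (a hypothetical interface).

```python
import numpy as np
from scipy.optimize import minimize

def fit_affine_adaptation(source_model, X_target, y_target):
    """Fit a 2D rotation + translation so source predictions match target data.

    Parameterisation: rotation angle theta and a 2D shift (tx, ty).
    The cited work uses CMA-ES; Nelder-Mead is a lightweight stand-in.
    """
    def transform(params, X):
        theta, tx, ty = params
        R = np.array([[np.cos(theta), -np.sin(theta)],
                      [np.sin(theta),  np.cos(theta)]])
        return X @ R.T + np.array([tx, ty])

    def loss(params):
        preds = source_model.predict(transform(params, X_target))
        return float(np.mean((preds - y_target) ** 2))

    res = minimize(loss, x0=np.zeros(3), method="Nelder-Mead")
    return lambda X: transform(res.x, X)   # reusable input adapter
```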

5. Experimental Protocols and Sample Efficiency Metrics

Empirical analyses standardise the following protocols:

  • Leave-one-out transfer: For each target task, use all remaining historical tasks as sources; compare against a standard BO run on the target data alone.
  • Fixed budget and wall-clock accounting: Evaluate up to 100 function calls per target, or as dictated by operational constraints. Record both sample efficiency (regret, error, success rate) and per-iteration wall-clock cost (Feurer et al., 2018, Trinkle et al., 22 Jan 2026, Pan et al., 23 Jan 2025).
  • Performance metrics: Use simple regret, normalised regret, success rate over trials (robotics), SMAPE (regression), Pareto hypervolume (multi-objective), and the functional gap to the optimum (a metrics sketch follows this list).
  • Switching mechanisms: Monitor "bad TL" via cross-validation residuals and switch to single-task mode if the transfer component degrades performance (empirically, such switching provides little systematic gain) (Trinkle et al., 22 Jan 2026).
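
A minimal sketch of the metrics listed above, under the convention that lower objective values are better; the normalisation bounds `f_opt` and `f_worst` are assumed known per task, as is common in benchmark suites.

```python
import numpy as np

def simple_regret(y_history, f_opt):
    """Best observed value minus the known optimum (lower is better)."""
    return np.min(y_history) - f_opt

def normalised_regret(y_history, f_opt, f_worst):
    """Simple regret rescaled to [0, 1] by the task's value range."""
    return (np.min(y_history) - f_opt) / (f_worst - f_opt)

def success_rate(trial_outcomes):
    """Fraction of successful trials (e.g., successful robot grasps)."""
    return float(np.mean(trial_outcomes))

def smape(y_true, y_pred):
    """Symmetric mean absolute percentage error, in percent."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return 100.0 * np.mean(2.0 * np.abs(y_pred - y_true)
                           / (np.abs(y_true) + np.abs(y_pred)))
```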

Table: Sample efficiency results (selected from recent benchmarks; ADTM = average distance to the minimum)

| Method | Best ADTM (%) | Early-iteration gain | Wall-clock/iter (s) |
|---|---|---|---|
| RGPE-TAF | 0.26–0.65 | 1.5–2× over GP | 1.6 |
| RiGPE (POS) | 0.1–0.4 | yes | 1–2 |
| COPULA-GCP | 2–3× faster | yes (multi-task) | <10 |
| SHGP/BHGP | n/a | scalable | ~1 |
| MetaBO | 0.003–0.01 | yes (structured) | 0.02 |

6. Domain-Specific Benchmarks and Ordered Transfer Protocols

Certain RT-TL BO benchmarks explicitly target domain drift or sequential context evolution:

  • Ordered Transfer HPO (OTHPO): Tasks are indexed by a sequential context (training-set size, simulation parameter), with a decaying context kernel modelling task similarity (a kernel sketch follows this list). Warm-starting from the previous best settings yields substantial early gains in realistic sequential HPO scenarios (e.g., MNIST-XGBoost training-size drift, YAHPO Gym scenarios) (Hellan et al., 2023).
  • Robotics and industrial testbeds: Real robot grasp tuning reaches >88% success within 2 h, versus manual expert tuning, with transfer-based BO converging 20–30 trials earlier than a cold start (Petit et al., 2018). BBOB transfer adaptation for random forests yields a 30–80% SMAPE reduction with 50d transfer samples, at seconds per adaptation (Pan et al., 23 Jan 2025).
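
A minimal sketch of a decaying context kernel of the kind OTHPO describes (first bullet above): an RBF input kernel multiplied by an exponential decay in the context index, so neighbouring contexts share more information than distant ones. The product form and the exponential decay are illustrative assumptions, not the exact kernel of the cited paper.

```python
import numpy as np

def decaying_context_kernel(x1, x2, t1, t2, lengthscale=1.0, context_scale=2.0):
    """RBF input kernel times an exponential decay over context indices.

    x1, x2 : input configurations (1D arrays).
    t1, t2 : integer context indices (e.g., training-set-size step).
    Similarity decays as contexts drift apart, down-weighting stale tasks.
    """
    k_input = np.exp(-0.5 * np.sum((np.asarray(x1) - np.asarray(x2)) ** 2)
                     / lengthscale ** 2)
    k_context = np.exp(-abs(t1 - t2) / context_scale)
    return k_input * k_context
```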

7. Comparative Analysis, Limitations, and Future Directions

RT-TL BO benchmarks consistently demonstrate that transfer learning, whether via ensemble surrogates, warm-starting, context-aware kernels, or meta-learned acquisition functions, delivers robust improvements in sample efficiency and wall-clock time. Nonnegativity-constrained regression weights and warm-start initialisation are near-universally beneficial, particularly in high-dimensional or continuous domains (Trinkle et al., 22 Jan 2026). RGPE ensembles offer a theoretical guarantee: performance is never significantly worse than single-task BO, with at most a constant-factor slowdown in the worst case (Feurer et al., 2018). Affine-transformation and Copula-based models extend transfer to random forests and to multi-objective tasks (Pan et al., 23 Jan 2025, Salinas et al., 2019).

Empirical limitations arise in low-overlap source–target regimes, or where transfer is deleterious and adaptation components become uninformative. Meta-learning and hierarchical kernel approaches scale transfer across many tasks, with computational cost amortised in an offline phase or through scalable per-iteration updates (Tighineanu et al., 2021, Volpp et al., 2019). Benchmarks are increasingly available as reproducible software packages (e.g., the Syne Tune OTHPO benchmarks), facilitating rigorous cross-method comparison (Hellan et al., 2023).

In summary, RT-TL BO benchmarks operationalise the evaluation of transfer learning for real-time, sample-efficient optimisation of black-box functions, providing both methodological and empirical touchstones that continue to drive the field's technical progress.
