DeltaBO: Efficient Transfer Bayesian Optimization

Updated 9 November 2025
  • DeltaBO is a Bayesian optimization algorithm that transfers historical source data by modeling the difference between target and source functions in distinct RKHSs.
  • It explicitly quantifies uncertainty using an additive model, leading to provably faster regret rates when the source data is abundant and the discrepancy is smooth.
  • The method employs an upper confidence bound rule and demonstrates superior performance in hyperparameter tuning and benchmark applications compared to conventional GP-UCB.

DeltaBO is a Bayesian optimization (BO) algorithm designed to accelerate search on a new (target) black-box function by transferring historical data from a related source task. Unlike prior transfer Bayesian optimization approaches, DeltaBO quantifies uncertainty by explicitly modeling the difference function between the target and source tasks, allowing each to belong to a different reproducing kernel Hilbert space (RKHS). Under mild regularity assumptions, DeltaBO achieves provably faster regret rates than conventional GP-based BO, particularly when a large sample of source data is available and the source-target discrepancy is smooth or simple.

1. Problem Setting and Notation

Consider a compact input domain $\mathcal D \subset \mathbb R^d$. The goal is to maximize an unknown target function $f:\mathcal D \to \mathbb R$, given access to $N$ historical observations from a source function $g:\mathcal D \to \mathbb R$ and sequential noisy evaluations of $f$.

Let the dataset of source evaluations be

$$\mathcal S^{(0)} = \left\{ \bigl(x_i^{(0)}, y_i^{(0)}\bigr) \right\}_{i=1}^N, \quad y_i^{(0)} = g\bigl(x_i^{(0)}\bigr) + \varepsilon_i^{(0)}, \quad \varepsilon_i^{(0)} \sim \mathcal N(0, \sigma_0^2).$$

DeltaBO posits an additive model

$$f(x) = g(x) + \delta(x),$$

where $\delta(x) = f(x) - g(x)$ denotes the difference (or "delta") function. Both $g$ and $\delta$ are modeled as independent draws from zero-mean GPs, $g \sim \mathcal{GP}(0, k_g)$ and $\delta \sim \mathcal{GP}(0, k_\delta)$, with positive semi-definite, uniformly bounded kernels $k_g(\cdot,\cdot)$ and $k_\delta(\cdot,\cdot)$. This implies $g \in \mathcal H_g$ and $\delta \in \mathcal H_\delta$ (their respective RKHSs) with controlled norms. At each BO iteration $t$, the target evaluation is observed as

$$y_t = f(x_t) + \varepsilon_t, \quad \varepsilon_t \sim \mathcal N(0, \sigma^2).$$

The maximum information gain from $M$ noisy observations of a GP with kernel $k_f$ is defined as

$$\gamma_{f, M} = \max_{A \subset \mathcal D,\, |A| = M} I(y_A; f_A).$$
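For a GP, the mutual information in this definition has the closed form $I(y_A; f_A) = \tfrac{1}{2}\log\det\bigl(I + \sigma^{-2} K_A\bigr)$, where $K_A$ is the kernel matrix of the set $A$; $\gamma_{f,M}$ maximizes this quantity over all size-$M$ subsets. As a minimal illustration (plain NumPy, with an assumed RBF kernel and arbitrary example points, not taken from the paper), the snippet below evaluates the information gain of a single candidate set:

```python
import numpy as np

def rbf_kernel(X1, X2, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel matrix between two sets of points."""
    sq_dists = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return variance * np.exp(-0.5 * sq_dists / lengthscale**2)

def information_gain(X_A, noise_var, kernel=rbf_kernel):
    """I(y_A; f_A) = 0.5 * log det(I + K_A / noise_var) for a GP with the given kernel."""
    K_A = kernel(X_A, X_A)
    _, logdet = np.linalg.slogdet(np.eye(len(X_A)) + K_A / noise_var)
    return 0.5 * logdet

# Illustrative usage: information gain of 20 random points in [0, 1]^2.
X_A = np.random.default_rng(0).uniform(size=(20, 2))
print(information_gain(X_A, noise_var=0.1))
```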

2. Posterior Inference on Source and Difference Functions

DeltaBO leverages access to the source data and the additive model to decompose the BO task efficiently.

2.1 Posterior on Source Function $g$

Given the $N$ source points, the GP regression posterior for $g$ is available in closed form. With
$$K_{g,N} = \bigl[k_g(x_i^{(0)}, x_j^{(0)})\bigr]_{i,j=1}^N, \qquad k_{g,N}(x) = \bigl[k_g(x_i^{(0)}, x)\bigr]_{i=1}^N,$$
the posterior mean and variance are

$$\mu_{g,N}(x) = k_{g,N}(x)^\top \left( K_{g,N} + \sigma_0^2 I_N \right)^{-1} y^{(0)},$$

$$\sigma^2_{g,N}(x) = k_g(x, x) - k_{g,N}(x)^\top \left( K_{g,N} + \sigma_0^2 I_N \right)^{-1} k_{g,N}(x).$$
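This source posterior is plain GP regression and, since $\mathcal S^{(0)}$ is fixed, it only needs to be fitted once. The sketch below is a minimal NumPy/SciPy implementation under assumed names (`X_src`, `y_src`, and a callable `kernel_g` returning cross-covariance matrices); it caches the Cholesky factor of $K_{g,N} + \sigma_0^2 I_N$ so that later evaluations of $\mu_{g,N}$ and $\sigma^2_{g,N}$ are cheap.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def fit_source_posterior(X_src, y_src, kernel_g, noise_var0):
    """Precompute the closed-form GP posterior of g from the N source observations."""
    K = kernel_g(X_src, X_src) + noise_var0 * np.eye(len(X_src))
    chol = cho_factor(K, lower=True)        # Cholesky of K_{g,N} + sigma_0^2 I_N
    alpha = cho_solve(chol, y_src)          # (K_{g,N} + sigma_0^2 I_N)^{-1} y^{(0)}
    return X_src, chol, alpha

def source_posterior(x, fit, kernel_g):
    """Posterior mean mu_{g,N}(x) and variance sigma^2_{g,N}(x) at query points x."""
    X_src, chol, alpha = fit
    k_x = kernel_g(X_src, x)                # N x m cross-covariances k_{g,N}(x)
    mean = k_x.T @ alpha
    var = np.diag(kernel_g(x, x)) - np.sum(k_x * cho_solve(chol, k_x), axis=0)
    return mean, var
```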

2.2 Residual Observations and Posterior on $\delta$

At each BO iteration, upon evaluating $y_t = f(x_t) + \varepsilon_t = g(x_t) + \delta(x_t) + \varepsilon_t$, the source mean prediction $\mu_{g,N}(x_t)$ is subtracted, producing the residual

$$\tilde y_t = y_t - \mu_{g,N}(x_t) = \delta(x_t) + \eta_t,$$

where $\eta_t$ is zero-mean Gaussian noise with $\eta_t \sim \mathcal N\bigl(0, \sigma^2_{g,N}(x_t) + \sigma^2\bigr)$. The residuals therefore serve as unbiased observations of $\delta(x)$ with their own variance structure.

Let $(x_1, \tilde y_1), \dots, (x_{t-1}, \tilde y_{t-1})$ denote all previous residual observations. Define
$$K_{\delta, t-1} = \bigl[k_\delta(x_i, x_j)\bigr]_{i,j=1}^{t-1}, \qquad k_{\delta, t-1}(x) = \bigl[k_\delta(x_i, x)\bigr]_{i=1}^{t-1}.$$
Then the GP posterior for $\delta$ is
$$\mu_{\delta, t-1}(x) = k_{\delta, t-1}(x)^\top \left( K_{\delta, t-1} + \Sigma_{t-1} \right)^{-1} \tilde y_{1:t-1},$$

$$\sigma^2_{\delta, t-1}(x) = k_\delta(x, x) - k_{\delta, t-1}(x)^\top \left( K_{\delta, t-1} + \Sigma_{t-1} \right)^{-1} k_{\delta, t-1}(x),$$

with diagonal noise matrix $\Sigma_{t-1} = \operatorname{diag}\bigl(\sigma^2_{g,N}(x_1) + \sigma^2, \dots, \sigma^2_{g,N}(x_{t-1}) + \sigma^2\bigr)$.
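In code, the only change relative to standard GP regression is that the noise matrix $\Sigma_{t-1}$ is diagonal but not constant. A minimal sketch (with assumed names `X_obs`, `resid`, and `noise_vars` for the query points, the residuals $\tilde y_{1:t-1}$, and the per-point noise levels $\sigma^2_{g,N}(x_i) + \sigma^2$) is:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def delta_posterior(x, X_obs, resid, noise_vars, kernel_d):
    """GP posterior for delta from residual observations with heteroscedastic noise."""
    if len(X_obs) == 0:
        # Prior over delta before any target data (Section 2.2 formulas with t - 1 = 0).
        return np.zeros(len(x)), np.diag(kernel_d(x, x))
    K = kernel_d(X_obs, X_obs) + np.diag(noise_vars)   # K_{delta,t-1} + Sigma_{t-1}
    chol = cho_factor(K, lower=True)
    k_x = kernel_d(X_obs, x)                           # (t-1) x m cross-covariances
    mean = k_x.T @ cho_solve(chol, resid)
    var = np.diag(kernel_d(x, x)) - np.sum(k_x * cho_solve(chol, k_x), axis=0)
    return mean, var
```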

3. Acquisition Function and Algorithmic Structure

The posterior mean and variance for the target $f$ at round $t$ are
$$\mu_t(x) = \mu_{g,N}(x) + \mu_{\delta, t-1}(x), \qquad \sigma_t^2(x) = \sigma^2_{g,N}(x) + \sigma^2_{\delta, t-1}(x).$$

DeltaBO employs an upper confidence bound (UCB) acquisition rule. At each of $T$ rounds, with the source posterior fixed and the residual GP updated online, the next query point is chosen as
$$x_t = \arg\max_{x \in \mathcal D} \left\{ \mu_t(x) + \sqrt{\beta_t}\,\sigma_t(x) \right\},$$
where, for confidence level $1-\rho$ and discrete $\mathcal D$,

$$\beta_t = 2 \log\bigl(|\mathcal D|\, \pi^2 t^2 / (6\rho)\bigr).$$

DeltaBO Algorithm Pseudocode

Step 1: Compute the source GP posterior $(\mu_{g,N}, \sigma^2_{g,N})$ from $\mathcal S^{(0)}$.
Step 2: Initialize the $\delta$-GP with mean $\mu_{\delta,0}(x) = 0$ and variance $\sigma^2_{\delta,0}(x) = \sigma^2_{g,N}(x) + \sigma^2$.
Step 3: For $t = 1, \dots, T$:
Step 3a: Set $\beta_t$ as above.
Step 3b: Select $x_t = \arg\max_{x \in \mathcal D} \bigl\{ \mu_{g,N}(x) + \mu_{\delta, t-1}(x) + \sqrt{\beta_t}\sqrt{\sigma^2_{g,N}(x) + \sigma^2_{\delta, t-1}(x)} \bigr\}$.
Step 3c: Query $y_t = f(x_t) + \varepsilon_t$.
Step 3d: Compute the residual $\tilde y_t = y_t - \mu_{g,N}(x_t)$.
Step 3e: Update the $\delta$-GP with $(x_t, \tilde y_t)$ and observation-noise variance $\sigma^2_{g,N}(x_t) + \sigma^2$.
Step 4: Return the best $x$ found or sample uniformly from $\{x_1, \dots, x_T\}$.
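The following sketch stitches the loop above together over a discrete candidate set, reusing the `source_posterior` and `delta_posterior` helpers sketched in Sections 2.1 and 2.2. It is an illustrative implementation, not the authors' reference code; the objective `f`, the candidate grid `X_cand`, and the kernels are assumptions supplied by the caller.

```python
import numpy as np

def delta_bo(f, X_cand, X_src, y_src, kernel_g, kernel_d,
             noise_var0, noise_var, T, rho=0.05, seed=0):
    """Sketch of DeltaBO: UCB on mu_{g,N} + mu_{delta,t-1} with combined variance."""
    rng = np.random.default_rng(seed)
    fit_g = fit_source_posterior(X_src, y_src, kernel_g, noise_var0)
    mu_g, var_g = source_posterior(X_cand, fit_g, kernel_g)   # fixed source posterior

    X_obs, resid, noise_vars, queries = [], [], [], []
    for t in range(1, T + 1):
        beta_t = 2.0 * np.log(len(X_cand) * np.pi**2 * t**2 / (6.0 * rho))
        mu_d, var_d = delta_posterior(
            X_cand, np.array(X_obs).reshape(-1, X_cand.shape[1]),
            np.array(resid), np.array(noise_vars), kernel_d)
        ucb = mu_g + mu_d + np.sqrt(beta_t) * np.sqrt(np.maximum(var_g + var_d, 0.0))
        i = int(np.argmax(ucb))                                # step 3b
        x_t = X_cand[i]
        y_t = f(x_t) + rng.normal(scale=np.sqrt(noise_var))    # step 3c: noisy query
        X_obs.append(x_t)
        resid.append(y_t - mu_g[i])                            # step 3d: residual
        noise_vars.append(var_g[i] + noise_var)                # step 3e: noise level
        queries.append((x_t, y_t))
    return max(queries, key=lambda q: q[1])[0]                 # best observed point
```

Here `X_cand` is an $(n, d)$ array enumerating the discrete domain $\mathcal D$, and any callable kernels that return cross-covariance matrices (such as the `rbf_kernel` sketched in Section 1) can be plugged in.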

4. Regret Analysis and Theoretical Guarantees

The cumulative regret after $T$ rounds is $R_T = \sum_{t=1}^T [f(x^*) - f(x_t)]$, where $x^* = \arg\max_{x \in \mathcal D} f(x)$. The information gains $\gamma_{g,N}$ and $\gamma_{\delta,T}$ reflect the GP information contraction from the source and difference processes, respectively, and $\tau^2 = \sup_x k_\delta(x,x)$ is the maximal prior variance of $\delta$.

4.1 Main Regret Bound

With high probability (at least $1-\rho$), DeltaBO satisfies
$$R_T \le \sqrt{8\, T \beta_T \left( \frac{T \gamma_{g,N}\, \sigma_0^2}{N - 2\gamma_{g,N}} + C_2\, \gamma_{\delta,T} \left( \frac{2\gamma_{g,N}\, \sigma_0^2}{N - 2\gamma_{g,N}} + \sigma^2 \right) \right)},$$
where $C_2 = (\tau^2/\sigma^2)/\log(1+\tau^2/\sigma^2) \le 1+\tau^2/\sigma^2$.

4.2 Asymptotic and Comparative Results

If $\gamma_{g,N} = o(N)$, $\gamma_{\delta,T} = O(T)$, $\tau^2 = O(\sigma^2)$, and $\gamma_{g,N}/N = O(\gamma_{\delta,T}/T)$, the bound simplifies to

$$R_T = O\!\left(\sqrt{T \beta_T\, \gamma_{\delta,T}}\right) = \widetilde O\!\left(\sqrt{T\,(T/N + \gamma_{\delta,T})}\right).$$
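A brief, informal sketch of why the simplification holds (not the paper's full argument): $\gamma_{g,N} = o(N)$ gives $N - 2\gamma_{g,N} \ge N/2$ for large $N$, and $\tau^2 = O(\sigma^2)$ gives $C_2 = O(1)$, so

$$\frac{T\,\gamma_{g,N}\,\sigma_0^2}{N - 2\gamma_{g,N}} \le \frac{2\sigma_0^2\, T\,\gamma_{g,N}}{N} = O(\gamma_{\delta,T}), \qquad C_2\,\gamma_{\delta,T}\!\left(\frac{2\gamma_{g,N}\,\sigma_0^2}{N - 2\gamma_{g,N}} + \sigma^2\right) = O(\gamma_{\delta,T}),$$

using $\gamma_{g,N}/N = O(\gamma_{\delta,T}/T)$ and treating $\sigma_0^2, \sigma^2$ as constants; the bracketed term in the bound of Section 4.1 is therefore $O(\gamma_{\delta,T})$.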

Standard GP-UCB regret scales as $O(\sqrt{T \beta_T \gamma_{f,T}})$. For $N \gg T$ and $\gamma_{\delta,T} \ll \gamma_{f,T}$ (i.e., when $\delta$ is "simpler" than $f$), DeltaBO therefore provides provable acceleration over conventional BO.

4.3 Sketch of Proof Structure

  • A high-probability confidence argument bounds the deviation $|f(x) - \mu_{g,N}(x) - \mu_{\delta, t-1}(x)|$ by $\sqrt{\beta_t}\,\sigma_t(x)$.
  • The instantaneous regret is then upper bounded by $2\sqrt{\beta_t}\,\sigma_t(x_t)$.
  • The summed variance contributions from $\sigma^2_{g,N}(x_t)$ and $\sigma^2_{\delta, t-1}(x_t)$ are controlled by the information gains $\gamma_{g,N}$ and $\gamma_{\delta,T}$ via Lemmas A.4 and A.6–A.7.
  • The Cauchy–Schwarz inequality yields the total cumulative regret rate.
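Concretely, the regret decomposition and the Cauchy–Schwarz step combine as follows (a standard sketch, using that $\beta_t$ is nondecreasing in $t$):

$$R_T = \sum_{t=1}^T \bigl(f(x^*) - f(x_t)\bigr) \le \sum_{t=1}^T 2\sqrt{\beta_t}\,\sigma_t(x_t) \le 2\sqrt{\beta_T}\sqrt{T \sum_{t=1}^T \sigma_t^2(x_t)},$$

after which the summed predictive variances are bounded in terms of $\gamma_{g,N}$ and $\gamma_{\delta,T}$, yielding the bound of Section 4.1.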

5. Practical Guidance and Experimental Findings

5.1 Kernel Selection

  • Source GP ($g$): A Matérn kernel is typically used when moderate smoothness is expected in $g$.
  • Difference GP ($\delta$): Smoother kernels, such as the squared-exponential (SE) kernel or a Matérn kernel with a long length scale, model $\delta$ as a simple, low-complexity function; a small amplitude $\tau^2$ for $k_\delta$ further reduces $\gamma_{\delta,T}$ (see the sketch after this list).
  • Noise variances: $\sigma_0$ and $\sigma$ should be set from replicate noise estimates.
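As a hedged illustration, these choices map directly onto standard GP libraries. The snippet below builds such kernels with scikit-learn kernel objects; the library, length scales, and amplitudes are assumptions for the sketch, not values prescribed by DeltaBO.

```python
from sklearn.gaussian_process.kernels import ConstantKernel, Matern, RBF

# Source GP: Matern kernel for moderate smoothness in g.
kernel_g = ConstantKernel(1.0) * Matern(length_scale=0.5, nu=2.5)

# Difference GP: smooth SE (RBF) kernel with a long length scale and a small
# amplitude tau^2, keeping delta simple and gamma_{delta,T} small.
kernel_d = ConstantKernel(0.1) * RBF(length_scale=2.0)

# Kernel objects are callable: kernel_g(X1, X2) returns the cross-covariance
# matrix, so they can be plugged into the GP helpers sketched in Section 2.
```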

5.2 Choice of βt\beta_t

  • In continuous domains, a discretization argument is required, which increases $\beta_t$ logarithmically with the discretization size.
  • Empirically, a constant $\beta_t$ (tuned via cross-validation) suffices in many settings.

5.3 Empirical Applications

  • Hyperparameter Tuning (AutoML): Examined on UCI Breast-Cancer classification with Gradient Boosting (11 hyperparameters) and an MLP (8 hyperparameters). With $N \approx 90$, $T = 30$, Matérn kernels for $g$ and $f$, and an SE kernel for $\delta$, DeltaBO achieves lower cumulative regret than GP-UCB, GP-EI/PI/TS, Env-GP, and Diff-GP.
  • Synthetic Benchmarks: On shifted Gaussians (SE kernels), Bohachevsky functions (a $120 \times 120$ grid), and a ground-truth additive construction, DeltaBO demonstrates rapid regret decay with increasing $N$. Competing baselines either fail to fully leverage large $N$ or require the same kernel for $g$ and $\delta$.

5.4 Recommendations

  • Collect a large source sample ($N \gg T$), since the theoretical regret bound improves with $N$.
  • Model $\delta$ with a smooth, low-amplitude kernel to keep $\gamma_{\delta,T}$ small.
  • Tune $\beta_t$ conservatively so that confidence intervals remain valid without over-exploration.

6. Implications and Context Within Transfer Bayesian Optimization

DeltaBO formalizes a principled and computationally efficient framework for combining existing source GP data with sequential target evaluations, explicitly quantifying the informativeness and complexity of both the source and difference functions. The explicit dependence of the regret on $N$ and $\gamma_\delta$ gives sharp guidance on when and how transfer learning is beneficial in BO. Empirical results indicate that DeltaBO consistently outperforms established classical and transfer-BO methods, particularly when source-target alignment is strong, the source dataset is considerably larger than the target, and the difference function is well modeled by a simple GP.

This suggests that in practical Bayesian optimization regimes where related source data is abundant and the transfer gap is small in complexity, DeltaBO should be favored for provably rapid convergence and effective knowledge transfer.
