
Finite-Arm GP-UCB

Updated 23 December 2025
  • Finite-Arm GP-UCB is a bandit algorithm that uses Gaussian process priors to balance exploration and exploitation over a fixed set of arms.
  • It selects arms via an upper confidence bound computed from the GP posterior mean and variance, yielding efficient updates and strong regret guarantees.
  • Enhanced variants like IRGP-UCB use randomized confidence parameters to tighten regret bounds and accelerate convergence.

Finite-Arm GP-UCB refers to Gaussian Process Upper Confidence Bound algorithms for Bayesian optimization and bandit problems where the arm set $\mathcal{X}$ is finite. In this regime, the agent sequentially selects arms from a fixed finite set, and observations are modeled as $y_t = f(x_t) + \epsilon_t$, with $f$ following a Gaussian process (GP) prior and $\epsilon_t$ additive noise. Finite-arm GP-UCB provides strong regret guarantees with efficient computation, and, in recent research, variants with randomized confidence parameters have further improved regret bounds and reduced over-exploration.

1. Problem Setting: Bayesian Optimization with Finite Arms

The setting consists of a finite domain $\mathcal{X} = \{x^1, \dots, x^{|\mathcal{X}|}\} \subset \mathbb{R}^d$. The unknown reward function $f$ is drawn from a GP prior with mean zero and covariance $k(\cdot,\cdot)$ satisfying $k(x,x) \leq 1$ for all $x$, and the learner observes $y_t = f(x_t) + \epsilon_t$ at each round $t$, with i.i.d. noise $\epsilon_t \sim \mathcal{N}(0, \sigma^2)$. The aim is to minimize the cumulative regret $R_T = \sum_{t=1}^T [f(x^*) - f(x_t)]$, where $x^*$ is an arm maximizing $f$ over $\mathcal{X}$ (Takeno et al., 2 Sep 2024; Krishnamurthy, 20 Dec 2025).
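As a concrete illustration of this setting, the following minimal simulation draws $f$ from a GP prior over a finite arm set. All specifics here (the RBF kernel, lengthscale, noise level, and helper names such as `pull` and `regret`) are illustrative assumptions, not taken from the cited papers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Finite arm set: |X| points in R^d (here d = 1), and an RBF kernel with
# k(x, x) = 1, matching the normalization assumed in the problem setting.
X = np.linspace(0.0, 1.0, 30).reshape(-1, 1)   # arms x^1, ..., x^{|X|}
sigma = 0.1                                     # noise standard deviation

def rbf_kernel(A, B, lengthscale=0.2):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq / lengthscale**2)

K = rbf_kernel(X, X)                            # |X| x |X| Gram matrix

# Draw the true reward function f ~ GP(0, k) restricted to the finite arms.
f = rng.multivariate_normal(np.zeros(len(X)), K + 1e-8 * np.eye(len(X)))
x_star = f.argmax()                             # index of the optimal arm x*

def pull(i):
    """Observe y_t = f(x_t) + eps_t with eps_t ~ N(0, sigma^2)."""
    return f[i] + sigma * rng.standard_normal()

# Cumulative regret R_T = sum_t [f(x*) - f(x_t)] for chosen arm indices.
regret = lambda choices: np.sum(f[x_star] - f[np.asarray(choices)])
```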

2. The Standard Finite-Arm GP-UCB Algorithm

GP-UCB is a standard algorithm for maximizing an unknown function under a GP prior. At each round $t$, the posterior mean $\mu_{t-1}(x)$ and variance $\sigma_{t-1}^2(x)$ are computed from past data. The classic GP-UCB rule selects:

$$x_t = \arg\max_{x \in \mathcal{X}} \left[ \mu_{t-1}(x) + \sqrt{\beta_t}\, \sigma_{t-1}(x) \right]$$

where $\beta_t$ is a deterministic confidence parameter, typically chosen so that a concentration inequality holds over all arms and all rounds, e.g. $\beta_t = O(\log(|\mathcal{X}| t / \delta))$ (Krishnamurthy, 20 Dec 2025; Hu et al., 11 Jun 2025). The resulting algorithm is computationally tractable on finite domains, as all required GP quantities can be updated efficiently (Hu et al., 11 Jun 2025); a sketch of one round appears below.
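The following sketch implements one round of this rule on top of the simulation above, computing $\mu_{t-1}$ and $\sigma_{t-1}^2$ from the cached Gram matrix. The specific constant in $\beta_t$ is an assumed choice within the $O(\log(|\mathcal{X}| t / \delta))$ schedule, not a value fixed by the sources.

```python
def gp_ucb_round(t, obs_idx, obs_y, delta=0.05):
    """One round of finite-arm GP-UCB: compute the GP posterior from past
    data and pick the arm maximizing mu + sqrt(beta_t) * sigma."""
    if not obs_idx:
        return 0  # no data yet: any arm works for the first pull
    idx = np.asarray(obs_idx)
    y = np.asarray(obs_y)
    # V_{t-1} = K_{A,A} + sigma^2 I over previously pulled arms A.
    V = K[np.ix_(idx, idx)] + sigma**2 * np.eye(len(idx))
    k_x = K[:, idx]                       # k_{t-1}(x) for every arm x
    mu = k_x @ np.linalg.solve(V, y)      # posterior mean mu_{t-1}(x)
    var = K.diagonal() - np.einsum("ij,ji->i", k_x, np.linalg.solve(V, k_x.T))
    sd = np.sqrt(np.clip(var, 0.0, None))
    # beta_t = O(log(|X| t / delta)); the leading constant 2 is an assumption.
    beta_t = 2.0 * np.log(len(X) * max(t, 1) / delta)
    return int(np.argmax(mu + np.sqrt(beta_t) * sd))

# Run the bandit loop.
obs_idx, obs_y = [], []
for t in range(1, 201):
    i = gp_ucb_round(t, obs_idx, obs_y)
    obs_idx.append(i)
    obs_y.append(pull(i))
print("cumulative regret:", regret(obs_idx))
```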

3. Regret Analysis and Information Gain

A central result is that the concentration inequality

$$\Pr\left(\forall x, t:\ |f(x) - \mu_{t-1}(x)| \leq \sqrt{\beta_t}\, \sigma_{t-1}(x)\right) \geq 1 - \delta$$

holds with a judicious choice of $\beta_t$. Regret can be decomposed into two analytical components:

  • Instance-dependent bound: Each suboptimal arm $x$ is chosen at most $O(\beta_T \sigma^2 / \Delta_x^2)$ times, where $\Delta_x = f(x^*) - f(x)$.
  • Distribution-free bound: The cumulative regret satisfies

$$R_T = O\left(\sqrt{T \beta_T \gamma_T}\right)$$

with $\gamma_T$ the maximum mutual information (information gain) between $f$ and $T$ observations, i.e.

$$\gamma_T = \max_{A \subset \mathcal{X},\, |A| = T} \frac{1}{2}\log\det\left(I + \sigma^{-2} K_{A,A}\right)$$

where $K_{A,A}$ is the Gram matrix restricted to the set $A$ (Krishnamurthy, 20 Dec 2025; Hu et al., 11 Jun 2025). For finite $\mathcal{X}$, the kernel matrix has rank at most $|\mathcal{X}|$, so $\gamma_T = O(|\mathcal{X}| \log T)$.
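Evaluating $\gamma_T$ exactly requires searching over all size-$T$ subsets, which is combinatorial; because the log-det objective is monotone submodular, greedy selection gives a standard $(1 - 1/e)$-factor approximation. The sketch below (reusing `K` and `sigma` from the earlier simulation) is such an illustrative approximation, not an exact computation.

```python
def greedy_info_gain(K, T, sigma):
    """Greedily approximate gamma_T = max_{|A|=T} 0.5 * logdet(I + sigma^-2 K_AA).
    Submodularity of the log-det objective gives a (1 - 1/e) guarantee."""
    A = []
    for _ in range(T):
        best, best_val = None, -np.inf
        for i in range(len(K)):
            if i in A:
                continue
            idx = A + [i]
            M = np.eye(len(idx)) + K[np.ix_(idx, idx)] / sigma**2
            val = 0.5 * np.linalg.slogdet(M)[1]   # log-determinant
            if val > best_val:
                best, best_val = i, val
        A.append(best)
    return best_val, A

gamma_approx, _ = greedy_info_gain(K, T=10, sigma=sigma)
print("greedy info-gain estimate:", gamma_approx)
```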

Table 1: Key analytical ingredients in finite-arm GP-UCB

| Ingredient | Formula/Description | Reference |
| --- | --- | --- |
| Posterior mean | $\mu_{t-1}(x) = k_{t-1}(x)^\top V_{t-1}^{-1} y_{1:t-1}$ | (Krishnamurthy, 20 Dec 2025) |
| Posterior variance | $\sigma_{t-1}^2(x) = k(x,x) - k_{t-1}(x)^\top V_{t-1}^{-1} k_{t-1}(x)$ | (Krishnamurthy, 20 Dec 2025) |
| Regret bound | $R_T = O(\sqrt{T \beta_T \gamma_T})$ | (Krishnamurthy, 20 Dec 2025) |
| Max. info. gain | $\gamma_T = \max_{A \subset \mathcal{X},\, \lvert A \rvert = T} \tfrac{1}{2} \log\det(I + \sigma^{-2} K_{A,A})$ | (Hu et al., 11 Jun 2025) |

4. Improved Randomized GP-UCB (IRGP-UCB)

A recognized limitation of classic GP-UCB is the need for a confidence parameter $\beta_t$ that grows with $t$ to maintain high-probability bounds. This growth is conservative and leads to excessive long-term exploration (Takeno et al., 2 Sep 2024).

IRGP-UCB replaces $\beta_t$ with a random variable $\zeta_t$ sampled from a shifted exponential distribution, $\zeta_t \sim s_t + \mathrm{Exp}(\lambda)$, with $s_t = 2 \log(|\mathcal{X}|/2)$ and $\lambda = 1/2$. This results in:

  • The expected confidence parameter is constant in $t$, so the width of the exploration bonus does not grow.
  • The regret bound (Bayesian cumulative regret and high-probability regret) becomes

$$\mathrm{BCR}_T \leq O\left(\sqrt{T \gamma_T \log |\mathcal{X}|}\right)$$

This is asymptotically sharper in $T$ than standard GP-UCB, where $R_T = O(\sqrt{T \gamma_T \log T})$ (Takeno et al., 2 Sep 2024). Numerical studies on synthetic and real datasets confirm that IRGP-UCB achieves faster convergence and consistently lower regret; a sketch of the randomized rule follows.
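Operationally, IRGP-UCB differs from the GP-UCB round sketched in Section 2 only in how the confidence multiplier is drawn. A minimal sketch of that substitution, reusing the earlier setup:

```python
def irgp_ucb_round(t, obs_idx, obs_y):
    """IRGP-UCB: identical to GP-UCB except beta_t is replaced by a random
    zeta_t ~ s_t + Exp(lambda), s_t = 2*log(|X|/2), lambda = 1/2."""
    if not obs_idx:
        return 0
    idx = np.asarray(obs_idx)
    y = np.asarray(obs_y)
    V = K[np.ix_(idx, idx)] + sigma**2 * np.eye(len(idx))
    k_x = K[:, idx]
    mu = k_x @ np.linalg.solve(V, y)
    var = K.diagonal() - np.einsum("ij,ji->i", k_x, np.linalg.solve(V, k_x.T))
    sd = np.sqrt(np.clip(var, 0.0, None))
    s_t = 2.0 * np.log(len(X) / 2.0)
    # Exp(lambda = 1/2) has mean 1/lambda = 2, so scale = 2.
    zeta_t = s_t + rng.exponential(scale=2.0)
    return int(np.argmax(mu + np.sqrt(zeta_t) * sd))
```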

5. Deterministic Proof Ingredients: Concentration, Radius Collapse, and Forced Deviations

The regret proof for finite-arm GP-UCB is structurally identical to that for finite-armed UCB and linear UCB algorithms (Krishnamurthy, 20 Dec 2025). The analysis consists of:

  • A single uniform concentration bound (for posterior deviation),
  • Radius collapse: After sufficient pulls, a suboptimal arm's posterior variance (and hence its bonus) falls below a threshold proportional to its suboptimality gap,
  • Optimism-forced deviation: The selection rule cannot pick suboptimal arms after enough information is acquired, unless the confidence bound fails,
  • Combining these, one derives both instance-dependent (logarithmic) and distribution-free ("oracle") regret bounds; the radius-collapse step is sketched below.
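To make the radius-collapse and forced-deviation steps concrete, the following short derivation reconstructs the standard argument (it paraphrases the usual UCB analysis rather than quoting the cited papers):

```latex
% On the concentration event
%   |f(x') - \mu_{t-1}(x')| \le \sqrt{\beta_t}\,\sigma_{t-1}(x')  for all x', t,
% suppose the UCB rule selects a suboptimal arm x at round t. Optimism at x^*
% and concentration at x give the chain
f(x) + 2\sqrt{\beta_t}\,\sigma_{t-1}(x)
  \;\ge\; \mu_{t-1}(x) + \sqrt{\beta_t}\,\sigma_{t-1}(x)
  \;\ge\; \mu_{t-1}(x^*) + \sqrt{\beta_t}\,\sigma_{t-1}(x^*)
  \;\ge\; f(x^*),
% so the gap is dominated by the exploration radius (forced deviation):
\Delta_x \;=\; f(x^*) - f(x) \;\le\; 2\sqrt{\beta_t}\,\sigma_{t-1}(x).
% After n_x pulls of arm x the posterior variance obeys
% \sigma_{t-1}^2(x) \le \sigma^2 / n_x (radius collapse), so x can no longer
% be selected once
n_x \;>\; \frac{4\,\beta_t\,\sigma^2}{\Delta_x^2},
% which recovers the instance-dependent pull count O(\beta_T \sigma^2 / \Delta_x^2).
```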

This structural unity extends across most optimism-based finite bandit algorithms (Krishnamurthy, 20 Dec 2025).

6. Practical Considerations and Computational Aspects

With moderate $|\mathcal{X}|$, GP-UCB and IRGP-UCB are computationally efficient. The $K \times K$ Gram matrix (with $K = |\mathcal{X}|$) can be cached, and Cholesky or inverse updates cost $O(K^2)$ per round (Hu et al., 11 Jun 2025). No discretization or approximate maximization is required. Regularization via the GP kernel smooths the estimates, exploiting correlations between arms for more sample-efficient exploration.
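One way to realize the $O(K^2)$ per-round cost is to maintain the posterior over all arms with a rank-one Gaussian conditioning update instead of refitting from scratch. A minimal sketch, reusing the variables from the earlier simulation:

```python
def posterior_update(mu, Sigma, i, y, sigma):
    """Rank-one Bayesian update of the GP posterior over the finite arm set
    after observing y = f(x^i) + eps. Cost O(|X|^2) per round."""
    s = Sigma[:, i]                     # covariance of all arms with arm i
    denom = Sigma[i, i] + sigma**2
    mu_new = mu + s * (y - mu[i]) / denom
    Sigma_new = Sigma - np.outer(s, s) / denom
    return mu_new, Sigma_new

# Usage: start from the prior and fold in observations one at a time.
mu_post, Sigma_post = np.zeros(len(X)), K.copy()
for i, y in zip(obs_idx, obs_y):
    mu_post, Sigma_post = posterior_update(mu_post, Sigma_post, i, y, sigma)
```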

Typical applications include finite-design experimental settings, algorithm configuration, and resource allocation with categorical options (Takeno et al., 2 Sep 2024).

7. Comparison of IRGP-UCB and Classic GP-UCB

IRGP-UCB eliminates the need for time-growing confidence bonuses on finite domains. This removes late-iteration over-exploration and provides tighter bounds. In particular:

  • Exploration bonus: Fixed in IRGP-UCB; grows with time in GP-UCB.
  • Regret rate: $O(\sqrt{T \gamma_T \log |\mathcal{X}|})$ for IRGP-UCB, vs. $O(\sqrt{T \gamma_T \log T})$ for GP-UCB.
  • Empirical performance: In all reported experiments with finite $\mathcal{X}$, IRGP-UCB outperforms GP-UCB, Thompson Sampling, and other state-of-the-art BO algorithms in sample complexity and regret (Takeno et al., 2 Sep 2024).

A plausible implication is that, for the finite-arm setting, randomization in the confidence parameter eliminates the necessity for conservative high-probability guarantees, yielding both theoretical and practical benefits.


References: Takeno et al., 2 Sep 2024; Hu et al., 11 Jun 2025; Krishnamurthy, 20 Dec 2025.
