Finite-Arm GP-UCB
- Finite-Arm GP-UCB is a bandit algorithm that uses Gaussian process priors to balance exploration and exploitation over a fixed set of arms.
- It employs an upper confidence bound rule built from the GP posterior mean and variance, yielding efficient updates and strong regret guarantees.
- Enhanced variants like IRGP-UCB use randomized confidence parameters to tighten regret bounds and accelerate convergence.
Finite-Arm GP-UCB refers to Gaussian Process Upper Confidence Bound algorithms for Bayesian optimization and bandit problems where the arm set is finite. In this regime, the agent sequentially selects arms from a fixed finite set, and observations are modeled as $y_t = f(x_t) + \epsilon_t$, with $f$ following a Gaussian process (GP) prior and $\epsilon_t$ additive noise. Finite-armed GP-UCB provides strong regret guarantees with efficient computation, and, in recent research, variants with randomized confidence parameters have further improved regret bounds and reduced over-exploration.
1. Problem Setting: Bayesian Optimization with Finite Arms
The setting consists of a finite domain $\mathcal{X} = \{x_1, \ldots, x_K\}$. The unknown reward function $f$ is drawn from a GP prior with mean zero and covariance $k$ satisfying $k(x, x) \le 1$ for all $x \in \mathcal{X}$, and the learner observes $y_t = f(x_t) + \epsilon_t$ at each round $t$, with i.i.d. noise $\epsilon_t \sim \mathcal{N}(0, \sigma^2)$. The aim is to minimize cumulative regret $R_T = \sum_{t=1}^{T} \big(f(x^*) - f(x_t)\big)$, where $x^*$ is an arm maximizing $f$ over $\mathcal{X}$ (Takeno et al., 2 Sep 2024; Krishnamurthy, 20 Dec 2025).
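To make the setting concrete, the following is a minimal Python sketch of a finite-arm GP bandit environment. The RBF kernel, $K = 20$ arms on $[0, 1]$, and noise level $\sigma = 0.1$ are illustrative assumptions, not choices prescribed by the cited papers.

```python
import numpy as np

rng = np.random.default_rng(0)

def rbf_kernel(X, Y, lengthscale=0.2):
    """RBF kernel k(x, x') = exp(-(x - x')^2 / (2 l^2)); note k(x, x) = 1."""
    d2 = (X[:, None] - Y[None, :]) ** 2
    return np.exp(-d2 / (2 * lengthscale ** 2))

K_ARMS, SIGMA = 20, 0.1                   # illustrative sizes, not from the source
arms = np.linspace(0.0, 1.0, K_ARMS)      # the finite domain X
K_prior = rbf_kernel(arms, arms)          # prior Gram matrix over the arms

# Draw one reward function f ~ GP(0, k) restricted to the finite arm set
# (small jitter keeps the covariance numerically positive definite).
f = rng.multivariate_normal(np.zeros(K_ARMS), K_prior + 1e-9 * np.eye(K_ARMS))

def pull(arm_idx):
    """Observe y = f(x) + eps with eps ~ N(0, sigma^2)."""
    return f[arm_idx] + SIGMA * rng.standard_normal()
```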
2. The Standard Finite-Arm GP-UCB Algorithm
GP-UCB is established for maximizing an unknown function under GP priors. At each round $t$, the posterior mean $\mu_{t-1}$ and variance $\sigma_{t-1}^2$ are computed from past data. The classic GP-UCB rule selects:

$$x_t = \arg\max_{x \in \mathcal{X}} \; \mu_{t-1}(x) + \beta_t^{1/2} \sigma_{t-1}(x),$$

where $\beta_t$ is a deterministic confidence parameter, typically chosen to ensure a concentration inequality over all arms and all rounds, e.g. $\beta_t = 2 \log\big(|\mathcal{X}|\, t^2 \pi^2 / (6\delta)\big)$ (Krishnamurthy, 20 Dec 2025; Hu et al., 11 Jun 2025). The resulting algorithm is computationally tractable on finite domains, as all required GP quantities can be efficiently updated (Hu et al., 11 Jun 2025).
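A minimal sketch of the resulting loop, reusing the illustrative environment above; the $\beta_t$ schedule is the finite-domain choice quoted above, with $\delta = 0.1$ and $T = 200$ as assumed constants.

```python
T, DELTA = 200, 0.1                        # assumed horizon and failure probability
X_hist, y_hist = [], []

for t in range(1, T + 1):
    if not X_hist:                         # round 1: prior mean and variance
        mu, var = np.zeros(K_ARMS), np.diag(K_prior).copy()
    else:
        idx = np.array(X_hist)
        K_tt = K_prior[np.ix_(idx, idx)] + SIGMA ** 2 * np.eye(len(idx))
        k_t = K_prior[:, idx]              # k_t(x) for every arm x at once
        mu = k_t @ np.linalg.solve(K_tt, np.array(y_hist))   # posterior mean
        var = np.diag(K_prior) - np.einsum(                  # posterior variance
            "ij,ij->i", k_t, np.linalg.solve(K_tt, k_t.T).T)
    beta_t = 2 * np.log(K_ARMS * t ** 2 * np.pi ** 2 / (6 * DELTA))
    x_t = int(np.argmax(mu + np.sqrt(beta_t * np.maximum(var, 0.0))))
    X_hist.append(x_t)
    y_hist.append(pull(x_t))
```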
3. Regret Analysis and Information Gain
A central result is that the concentration inequality

$$|f(x) - \mu_{t-1}(x)| \le \beta_t^{1/2} \sigma_{t-1}(x) \quad \text{for all } x \in \mathcal{X} \text{ and all } t \ge 1$$

holds with probability at least $1 - \delta$ under a judicious choice of $\beta_t$. Regret can be decomposed into two analytical components:
- Instance-dependent bound: Each suboptimal arm $x$ is chosen at most $O\big(\beta_T / \Delta_x^2\big)$ times, where $\Delta_x = f(x^*) - f(x)$ is its suboptimality gap.
- Distribution-free bound: The cumulative regret satisfies

$$R_T = O\big(\sqrt{T \beta_T \gamma_T}\big),$$

with $\gamma_T$ the maximum mutual information (information gain) between $f$ and $T$ observations, i.e.

$$\gamma_T = \max_{A \subseteq \mathcal{X},\, |A| \le T} \tfrac{1}{2} \log\det\big(I + \sigma^{-2} K_A\big),$$

where $K_A$ is the Gram matrix restricted to the set $A$ (Krishnamurthy, 20 Dec 2025; Hu et al., 11 Jun 2025). For finite $\mathcal{X}$, $\gamma_T = O(|\mathcal{X}| \log T)$.
Table 1: Key analytical ingredients in finite-arm GP-UCB
| Ingredient | Formula/Description | Reference |
|---|---|---|
| Posterior mean | $\mu_t(x) = k_t(x)^\top (K_t + \sigma^2 I)^{-1} y_{1:t}$ | (Krishnamurthy, 20 Dec 2025) |
| Posterior variance | $\sigma_t^2(x) = k(x, x) - k_t(x)^\top (K_t + \sigma^2 I)^{-1} k_t(x)$ | (Krishnamurthy, 20 Dec 2025) |
| Regret bound | $R_T = O\big(\sqrt{T \beta_T \gamma_T}\big)$ | (Krishnamurthy, 20 Dec 2025) |
| Max info. gain | $\gamma_T = \max_{|A| \le T} \tfrac{1}{2} \log\det(I + \sigma^{-2} K_A)$ | (Hu et al., 11 Jun 2025) |

Here $k_t(x) = (k(x, x_1), \ldots, k(x, x_t))^\top$ collects covariances with past arms and $K_t$ is the Gram matrix of the arms selected up to round $t$.
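The information-gain quantity in Table 1 can be evaluated directly for any candidate subset, and the maximum over subsets is commonly approximated by a greedy rule, which is near-optimal by submodularity of the information gain. The sketch below reuses `K_prior` and `SIGMA` from the environment above; the greedy surrogate is a standard analytical device, not a step of GP-UCB itself.

```python
def info_gain(subset_idx):
    """(1/2) log det(I + sigma^{-2} K_A) for the arm (multi)set A, via slogdet."""
    K_A = K_prior[np.ix_(subset_idx, subset_idx)]
    _, logdet = np.linalg.slogdet(np.eye(len(subset_idx)) + K_A / SIGMA ** 2)
    return 0.5 * logdet

def greedy_gamma(T_budget):
    """Greedy lower bound on gamma_T: repeatedly add the arm with the largest
    marginal information gain (a (1 - 1/e)-approximation by submodularity)."""
    chosen = []
    for _ in range(T_budget):
        gains = [info_gain(chosen + [j]) for j in range(K_ARMS)]
        chosen.append(int(np.argmax(gains)))
    return info_gain(chosen)
```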
4. Improved Randomized GP-UCB (IRGP-UCB)
A recognized limitation of classic GP-UCB is the need for a confidence parameter $\beta_t$ that grows with $t$ to maintain high-probability bounds. This growth is conservative and leads to excessive long-term exploration (Takeno et al., 2 Sep 2024).
IRGP-UCB replaces the deterministic $\beta_t$ with a random confidence parameter $\zeta_t$ sampled from a shifted (two-parameter) exponential distribution, with density $p(\zeta) = \lambda e^{-\lambda (\zeta - s)}$ for $\zeta \ge s$, shift $s$, and rate $\lambda$, so that $\mathbb{E}[\zeta_t] = s + 1/\lambda$. This results in:
- The expected confidence parameter is constant in $t$, so the width of the exploration bonus does not grow.
- The regret bound (Bayesian cumulative regret and high-probability regret) becomes $R_T = O\big(\sqrt{T \gamma_T}\big)$.
This is asymptotically sharper in $T$ than standard GP-UCB, where $R_T = O\big(\sqrt{T \beta_T \gamma_T}\big)$ with $\beta_T = \Theta(\log T)$ (Takeno et al., 2 Sep 2024). Numerical studies on synthetic and real datasets confirm that IRGP-UCB achieves faster convergence and consistently lower regret.
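In code, the only change relative to the GP-UCB loop above is how the confidence parameter is produced each round. The shift and rate below are illustrative placeholders chosen so that the mean $s + 1/\lambda$ is constant in $t$; the specific values used by Takeno et al. should be taken from the paper.

```python
def sample_zeta(k_arms, lam=0.5, rng=rng):
    """Draw zeta_t from a shifted exponential: density lam * exp(-lam (z - s))
    for z >= s. The shift below is a hypothetical choice, not from the paper."""
    s = 2.0 * np.log(k_arms / 2.0)         # assumed shift parameter
    return s + rng.exponential(scale=1.0 / lam)

# Drop-in change to the GP-UCB loop above:
#   beta_t = sample_zeta(K_ARMS)           # replaces the log(t) schedule
```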
5. Deterministic Proof Ingredients: Concentration, Radius Collapse, and Forced Deviations
The regret proof for finite-arm GP-UCB is structurally identical to that for finite-armed UCB and linear UCB algorithms (Krishnamurthy, 20 Dec 2025). The analysis consists of:
- A single uniform concentration bound (for posterior deviation),
- Radius collapse: After sufficient pulls, a suboptimal arm's posterior variance (and hence its bonus) falls below a threshold proportional to its suboptimality gap,
- Optimism-forced deviation: The selection rule cannot pick suboptimal arms after enough information is acquired, unless the confidence bound fails,
- Combining these, one derives both instance-dependent (logarithmic) and distribution-free ("oracle") regret bounds.
This structural unity extends across most optimism-based finite bandit algorithms (Krishnamurthy, 20 Dec 2025).
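To make the radius-collapse and forced-deviation steps concrete, consider the standard optimism argument (sketched here; not quoted verbatim from the cited papers). On the concentration event, selecting a suboptimal arm $x$ at round $t$ implies

$$f(x^*) \le \mu_{t-1}(x^*) + \beta_t^{1/2} \sigma_{t-1}(x^*) \le \mu_{t-1}(x) + \beta_t^{1/2} \sigma_{t-1}(x) \le f(x) + 2 \beta_t^{1/2} \sigma_{t-1}(x),$$

so $\Delta_x \le 2 \beta_t^{1/2} \sigma_{t-1}(x)$: once enough pulls drive $\sigma_{t-1}(x)$ below $\Delta_x / (2 \beta_t^{1/2})$, arm $x$ can no longer be selected without violating the concentration bound.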
6. Practical Considerations and Computational Aspects
With moderate $|\mathcal{X}|$, GP-UCB and IRGP-UCB are computationally efficient. The Gram matrix can be cached, and Cholesky or inverse updates cost $O(t^2)$ per round (Hu et al., 11 Jun 2025). No discretization or approximate maximization is required. Regularization via the GP kernel smooths the estimates, exploiting correlations between arms for more sample-efficient exploration.
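For illustration, a minimal sketch of the incremental bookkeeping (the structure and names are ours; the cited papers do not prescribe an implementation): growing the Cholesky factor of $K_t + \sigma^2 I$ by one row per observation costs $O(t^2)$, versus $O(t^3)$ for refactorizing from scratch.

```python
from scipy.linalg import solve_triangular

def extend_cholesky(L, K_prior, idx_old, idx_new):
    """Given lower-triangular L with L @ L.T = K_old + sigma^2 I for the arms
    in idx_old, return the factor after appending an observation of idx_new."""
    if L is None:                          # first observation
        return np.array([[np.sqrt(K_prior[idx_new, idx_new] + SIGMA ** 2)]])
    k_new = K_prior[idx_old, idx_new]      # cross-covariances with past arms
    w = solve_triangular(L, k_new, lower=True)   # O(t^2) back-substitution
    d = np.sqrt(K_prior[idx_new, idx_new] + SIGMA ** 2 - w @ w)
    t = L.shape[0]
    L_ext = np.zeros((t + 1, t + 1))
    L_ext[:t, :t] = L
    L_ext[t, :t] = w
    L_ext[t, t] = d
    return L_ext
```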
Typical applications include finite-design experimental settings, algorithm configuration, and resource allocation with categorical options (Takeno et al., 2 Sep 2024).
7. Comparison of IRGP-UCB and Classic GP-UCB
IRGP-UCB eliminates the need for time-growing confidence bonuses on finite domains. This removes late-iteration over-exploration and provides tighter bounds. In particular:
- Exploration bonus: constant in $t$ (in distribution) for IRGP-UCB; grows with $t$ for GP-UCB.
- Regret rate: $O\big(\sqrt{T \gamma_T}\big)$ for IRGP-UCB, vs. $O\big(\sqrt{T \gamma_T \log T}\big)$ for GP-UCB.
- Empirical performance: In all reported experiments with finite $\mathcal{X}$, IRGP-UCB outperforms GP-UCB, Thompson Sampling, and other state-of-the-art BO algorithms in sample complexity and regret (Takeno et al., 2 Sep 2024).
A plausible implication is that, for the finite-arm setting, randomization in the confidence parameter eliminates the necessity for conservative high-probability guarantees, yielding both theoretical and practical benefits.
References: Takeno et al. (2 Sep 2024); Hu et al. (11 Jun 2025); Krishnamurthy (20 Dec 2025).