
Fixed-Confidence Best Arm Identification

Updated 13 October 2025
  • The paper introduces the GLUCB algorithm that minimizes sample complexity in fixed-confidence BAI by efficiently identifying the best arm in linear bandits.
  • It leverages a geometric overlap strategy and confidence ellipsoids to dynamically focus sampling on arms that best differentiate the optimal arm from its competitors.
  • Empirical and theoretical analyses demonstrate that GLUCB outperforms traditional methods, achieving near-optimal sample bounds in both two- and three-arm cases.

Fixed-confidence best arm identification (BAI) is the problem of adaptively sampling arms in a (structured) bandit model in order to identify the arm with maximal mean, such that the probability of error is at most δ. The central challenge is to minimize the expected number of samples required for this fixed-confidence guarantee. Fixed-confidence BAI has been extended from the classical multi-armed (unstructured) model to a broad variety of classes, with recent work focusing on linearly parameterized bandits. In this setting, each arm is encoded as a known feature vector, the mean reward is a linear function of an unknown parameter, and the best arm is the one maximizing this linear expectation.

1. Problem Formulation: Linear Bandit BAI

In the linear bandit BAI setting, the learner is presented with $K$ arms $x_1, \ldots, x_K \in \mathbb{R}^d$. The unknown parameter $\theta^* \in \mathbb{R}^d$ determines the mean reward for arm $a$ as $x_a^\top \theta^*$. At each round $t$, the learner chooses an arm $x_{a_t}$, observes reward $y_t = x_{a_t}^\top \theta^* + \xi_t$ (with $\xi_t$ sub-Gaussian), and, after as few rounds as possible, outputs an arm $a^\dagger$ such that

$$\mathbb{P}_{\theta^*}\left( x_{a^\dagger}^\top \theta^* < \max_{b} x_b^\top \theta^* \right) \leq \delta,$$

while minimizing $\mathbb{E}[\tau]$, the expected stopping time.

The linear structure implies strong correlations between arms: pulling one arm informs the means of others. In contrast to classical settings, the challenge is to design adaptive strategies that “probe” the parameter space efficiently.
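Concretely, the interaction protocol can be sketched as a small simulator. This is illustrative only: the dimensions, feature vectors, and noise scale below are arbitrary choices, not taken from the paper.

```python
import numpy as np

# Minimal simulator for the linear bandit BAI protocol (illustrative values).
rng = np.random.default_rng(0)

d, K = 5, 20
arms = rng.normal(size=(K, d))       # feature vectors x_1, ..., x_K
theta_star = rng.normal(size=d)      # unknown parameter, hidden from the learner
R = 1.0                              # sub-Gaussian noise scale

def pull(a):
    """Observe y_t = x_a^T theta* + xi_t, with Gaussian (hence sub-Gaussian) noise."""
    return arms[a] @ theta_star + rng.normal(scale=R)

best_arm = int(np.argmax(arms @ theta_star))  # the identification target
```

A learner only sees the features `arms` and the rewards returned by `pull`; its goal is to output `best_arm` with probability at least $1-\delta$ using as few pulls as possible.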

2. GLUCB Algorithm Architecture and Geometric Overlap

The GLUCB (“Generalized Lower Upper Confidence Bounds”) algorithm generalizes the classic LUCB of Kalyanakrishnan et al. to the linear structure. Its sequential procedure comprises:

  • Empirical Estimation: After $t$ pulls, compute the regularized least-squares estimate

$$\theta_t = V_t^{-1} b_t, \qquad V_t = \lambda I + \sum_{s=1}^t x_{a_s} x_{a_s}^\top, \qquad b_t = \sum_{s=1}^t x_{a_s} y_s$$

  • Confidence Ellipsoid: Construct a set

$$\mathcal{C}_t = \left\{ \theta \in \mathbb{R}^d : \|\theta - \theta_t\|_{V_t} \leq \beta_t \right\}$$

with $\beta_t = R \sqrt{d \ln(t/\delta)}$ for sub-Gaussian noise of scale $R$.

  • Best Arm and Advantage: Predict the best arm

$$h_t = \arg\max_{a} \theta_t^\top x_a$$

and, for every $a \neq h_t$, compute the “advantage”

$$\mathrm{Adv}(a) = \max_{\theta \in \mathcal{C}_t} \left[ \theta^\top x_a - \theta^\top x_{h_t} \right]$$

  • Stopping Rule: Terminate when $\mathrm{Adv}(a) \leq 0$ for all $a \neq h_t$. This occurs precisely when $\mathcal{C}_t$ is fully contained in the cone $R(x_{h_t})$ on which $x_{h_t}$ is optimal.
  • Sampling Rule (Geometric Overlap): Unlike LUCB, which samples the arm with the second-highest upper confidence bound, GLUCB chooses the arm $a_{t+1}$ that most reduces the “geometric overlap” of $\mathcal{C}_t$ with the non-optimality region $R(x_{h_t})^c$. This is formalized as

$$a_{t+1} \in \arg\max_{a} \frac{|x_a^\top V_t^{-1} (x_{h_t} - x_{l_t})|}{\sqrt{1 + x_a^\top V_t^{-1} x_a}}$$

where $x_{l_t}$ is the arm (other than $h_t$) of maximal advantage.

The geometric overlap encodes the intersection of the ellipsoid $\mathcal{C}_t$ with the complement of the cone corresponding to the current best arm; minimizing this overlap focuses sampling on arms whose feature directions best distinguish $h_t$ from its main competitors.
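One round of this procedure can be sketched in NumPy. This is an illustrative reimplementation, not the authors' code: the function name and interface are invented, and the advantage is evaluated via the standard closed form for maximizing a linear function over an ellipsoid, $\max_{\theta \in \mathcal{C}_t} \theta^\top y = \theta_t^\top y + \beta_t \|y\|_{V_t^{-1}}$.

```python
import numpy as np

def glucb_round(arms, pulls, rewards, lam=1.0, R=1.0, delta=0.05):
    """One GLUCB decision step (sketch): returns (best_arm, None) on stopping,
    else (None, next_arm_to_pull).

    arms: (K, d) feature matrix; pulls: list of pulled arm indices; rewards: list of y_s.
    """
    K, d = arms.shape
    t = len(pulls)
    X = arms[pulls]                                # pulled features x_{a_1}, ..., x_{a_t}
    V = lam * np.eye(d) + X.T @ X                  # V_t
    b = X.T @ np.asarray(rewards)                  # b_t
    V_inv = np.linalg.inv(V)
    theta = V_inv @ b                              # regularized least squares theta_t
    beta = R * np.sqrt(d * np.log(t / delta))      # confidence radius beta_t

    h = int(np.argmax(arms @ theta))               # empirical best arm h_t
    # Adv(a) = theta_t^T (x_a - x_h) + beta_t * ||x_a - x_h||_{V_t^{-1}}
    diffs = arms - arms[h]
    adv = diffs @ theta + beta * np.sqrt(np.einsum('ij,jk,ik->i', diffs, V_inv, diffs))
    adv[h] = -np.inf

    if adv.max() <= 0:                             # stopping rule
        return h, None
    l = int(np.argmax(adv))                        # most-competitive arm x_{l_t}
    gap = arms[h] - arms[l]
    # Geometric-overlap sampling score from the display above.
    scores = np.abs(arms @ (V_inv @ gap)) / np.sqrt(1 + np.einsum('ij,jk,ik->i', arms, V_inv, arms))
    return None, int(np.argmax(scores))
```

A driver would pull each arm once to initialize `pulls` and `rewards`, then loop: stop when the first return value is an arm index, otherwise pull the arm given by the second return value.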

3. Adaptive and Computationally Efficient Design

GLUCB’s adaptivity arises from updating $\theta_t$ and the confidence set $\mathcal{C}_t$ at each round, dynamically allocating samples where they most reduce the worst-case possibility of misidentification. Each decision step leverages:

  • Efficient updates of $V_t$ and $V_t^{-1}$ using rank-one formulas;
  • Simple, closed-form calculations for the advantage and geometric overlap criteria;
  • Selection of sampling actions determined only by current empirical sufficient statistics.

Compared to algorithms that solve complex instance-dependent optimizations (e.g., LinGapE, X–ElimTil–p), GLUCB’s per-round computational cost is limited to basic matrix–vector calculations, making it scalable for large $K$ and $d$.
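The rank-one update referred to above is the standard Sherman–Morrison identity; a minimal sketch (function name illustrative):

```python
import numpy as np

def sherman_morrison_update(V_inv, x):
    """Return (V + x x^T)^{-1} given V^{-1}, in O(d^2) instead of an O(d^3) re-inversion."""
    Vx = V_inv @ x
    return V_inv - np.outer(Vx, Vx) / (1.0 + x @ Vx)
```

After pulling an arm with feature $x_{a_t}$, $V_t^{-1}$ is obtained from $V_{t-1}^{-1}$ in quadratic time, which is what keeps GLUCB’s per-round cost to matrix–vector work.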

4. Sample Complexity in Two- and Three-Arm Cases

The paper presents explicit theoretical guarantees for $K=2$ and $K=3$, illuminating the efficacy of the geometric approach:

Two-arm case

For $x_1, x_2 \in \mathbb{R}^d$,

  • Up to rounding, GLUCB alternates between arms, ensuring near-perfect balance:

$$\left\lfloor \frac{t}{2} \right\rfloor \leq n_k(t) \leq \left\lfloor \frac{t}{2} \right\rfloor + 1$$

  • The “potential function” $\Phi(t) = (x_1 - x_2)^\top V_t^{-1} (x_1 - x_2)$ quantifies uncertainty in the direction of interest and is shown to decrease at least as quickly as under any alternative policy.
  • The expected sample complexity (omitting logarithmic factors) is

$$\mathbb{E}[\tau] \leq \beta_t^2 H_G + 1$$

with $H_G$ arising as an instance-dependent term from a sampling-frequency optimization, matching the lower bound up to dimension-dependent constants.
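The decay of the potential function is easy to check numerically. The sketch below (illustrative, not the paper’s code) tracks $\Phi(t)$ under the alternating allocation that GLUCB maintains with two arms; since $V_t$ only grows in the positive-semidefinite order, $\Phi(t)$ is non-increasing.

```python
import numpy as np

def potential_under_alternation(x1, x2, T, lam=1.0):
    """Return [Phi(1), ..., Phi(T)] when pulls alternate x1, x2, x1, ..."""
    d = len(x1)
    V = lam * np.eye(d)                # V_0 = lambda * I
    diff = x1 - x2
    phis = []
    for t in range(T):
        x = x1 if t % 2 == 0 else x2   # alternating sampling rule
        V += np.outer(x, x)            # rank-one growth of V_t
        phis.append(float(diff @ np.linalg.solve(V, diff)))  # Phi(t)
    return phis
```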

Three-arm case

When, for instance, $x_3$ is “geometrically” dominated (a linear combination of $x_1, x_2$ with a small angle), GLUCB rapidly eliminates dominated arms and focuses sampling only where it is effective for identification. The upper bound becomes

$$O\left( \frac{\beta_t^2}{\Delta_{\min}^2} \sin^2(\omega) + \frac{\beta_t^2}{\Delta_{\min} \sin(\omega)} \right)$$

where $\omega$ is the angular separation and $\Delta_{\min}$ the smallest gap, validating order-optimality up to logarithmic and dimension-dependent slack.

5. Empirical Validation and Advantage over State of the Art

Extensive experiments confirm theoretical findings:

  • For synthetic arms (random or with ambiguous/dominated arms), GLUCB requires fewer samples, often by an order of magnitude, than classical LUCB, LinGapE, XY–static, and X–ElimTil–p.
  • In large diagonal-structured instances ($K \sim 10^4$), GLUCB scales efficiently while maintaining its sample-complexity advantages.
  • On the Yahoo! Webscope dataset (structured real-world arms), GLUCB achieves the fixed confidence guarantee with observed error probability zero across all trials, and consistently matches or outperforms the next-best methods in terms of sample efficiency.

The central qualitative finding is that by exploiting the geometry of the linear model, GLUCB “knows” to cease sampling arms whose differences have already been determined with high statistical power, thus avoiding the waste incurred by unstructured strategies.

6. Mathematical Expressions and Structural Insights

Core formulations underlying GLUCB’s design include:

  • Confidence ellipsoid:

$$\mathcal{C}_t = \left\{ \theta \in \mathbb{R}^d : \|\theta - \theta_t\|_{V_t} \leq \beta_t \right\}$$

  • Advantage of arm $a$:

$$\mathrm{Adv}(a) = \max_{\theta \in \mathcal{C}_t} \left[ \theta^\top x_a - \theta^\top x_{h_t} \right]$$

  • Geometric overlap-driven sampling:

$$a_{t+1} \in \arg\max_a \frac{|x_a^\top V_t^{-1} (x_{h_t} - x_{l_t})|}{\sqrt{1 + x_a^\top V_t^{-1} x_a}}$$

  • Potential function for the two-arm case:

$$\Phi(t) = (x_1 - x_2)^\top V_t^{-1} (x_1 - x_2)$$

  • Sample complexity bound for the three-arm case:

$$O\left( \frac{\beta_t^2}{\Delta_{\min}^2} \sin^2(\omega) + \frac{\beta_t^2}{\Delta_{\min} \sin(\omega)} \right)$$

The use of ellipsoids, cones, and Mahalanobis norms as the fundamental devices for uncertainty quantification and sampling allocation marks a distinct departure from naïve UCB approaches.

7. Significance and Implications

GLUCB provides a principled, fully adaptive, and computationally simple solution for fixed-confidence BAI in linear bandits, with near-optimal sample complexity and practical performance gains over previous art. The geometric overlap framework exposes the key directions to probe in parameter space, leading to immediate and effective reduction in ambiguity about which arm is best, and automatically adapts sampling to the most informative arms. This approach provides a general template for extending “structure-aware” BAI strategies in linearly parameterized or more broadly structured bandit models.
