Variance-Aware UCB (UCB-V) Algorithms

Updated 11 April 2026

Variance-Aware UCB (UCB-V) is a framework that integrates empirical variance into Bernstein-type confidence bounds, thereby improving regret guarantees in stochastic decision processes.
It is applied in multi-armed bandits, active mean estimation, and Monte Carlo tree search, offering adaptive bonus formulations that balance exploration with variance considerations.
The methodology achieves instance-dependent regret rates, enhanced stability in non-stationary environments, and fair sampling by optimally balancing arm pulls according to variance estimates.

Variance-Aware Upper Confidence Bound (UCB-V) algorithms are a framework for adaptive sampling in stochastic decision processes, where the primary objective is to efficiently balance exploration and exploitation by incorporating empirical variance estimates into the arm selection mechanism. These methods generalize classic UCB principles by constructing instance-dependent confidence intervals using Bernstein-type inequalities, resulting in tighter uncertainty quantification and improved regret guarantees, particularly in heterogeneous or non-stationary environments.

1. Theoretical Framework and Problem Settings

Variance-Aware UCB algorithms are most prominently studied in the context of stochastic multi-armed bandits (MAB), active mean estimation with fairness constraints, contextual dueling bandits, and Monte Carlo Tree Search (MCTS).

Multi-Armed Bandits

In the canonical stochastic K-armed bandit model, each arm $a \in \{1, \ldots, K\}$ yields i.i.d. rewards drawn from an unknown distribution with unknown mean $\mu_a$ and variance $\sigma_a^2$ . The learner must sequentially select arms to maximize expected cumulative reward, or, equivalently, minimize regret against the optimal arm.

The UCB-V approach replaces the fixed-width Hoeffding bonus with a data-adaptive bonus involving the sample variance, specifically:

$\mathrm{UCBV}_a(t) = \widehat\mu_{a}(t) + \sqrt{\frac{2\,\widehat\sigma^2_{a}(t)\,\ln T}{n_{a}(t)}} + \frac{3\,\ln T}{n_{a}(t)},$

where $\widehat\mu_{a}(t)$ and $\widehat\sigma^2_{a}(t)$ are the empirical mean and variance, $n_{a}(t)$ is the number of pulls of arm $a$ up to time $t$, and $T$ is the time horizon (Fan et al., 2024, Mukherjee et al., 2017). A related generic construct is to use an upper confidence bound $\mathrm{UCB}_a(n)$ for the variance $\sigma_a^2$ itself, then apply a deterministic selection rule that maximizes a function of this UCB and the sample count (Aznag et al., 20 May 2025).
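The index above can be sketched in a few lines of Python. This is a minimal illustration, assuming rewards bounded in $[0, 1]$ and a known horizon; the function names `ucbv_index` and `select_arm` are hypothetical and do not come from the cited papers.

```python
import math


def ucbv_index(mean: float, var: float, n: int, horizon: int) -> float:
    """Empirical mean plus the Bernstein-type bonus from the UCB-V formula."""
    if n <= 0:
        # Assumption: an arm with no samples is given priority.
        return math.inf
    log_t = math.log(horizon)
    return mean + math.sqrt(2.0 * var * log_t / n) + 3.0 * log_t / n


def select_arm(means, variances, counts, horizon: int) -> int:
    """Return the position of the arm with the largest UCB-V index."""
    scores = [
        ucbv_index(m, v, n, horizon)
        for m, v, n in zip(means, variances, counts)
    ]
    return max(range(len(scores)), key=scores.__getitem__)
```

With equal empirical means and pull counts, the arm with the larger empirical variance receives the larger index, which is the behaviour the variance term is meant to produce.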

Active Mean Estimation and Fair Sampling

In the multi-group mean estimation setting, the goal is not regret minimization per se, but to sample arms (groups) so that the noise in each mean estimator is minimized and fairly distributed. Here, the Variance-UCB algorithm uses empirical-Bernstein bounds for variance estimation and selects arms to minimize the $p$-norm of the vector of mean estimator variances (Aznag et al., 20 May 2025).
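As a hedged illustration of this objective: if group $a$ has been sampled $n_a$ times, its sample mean has variance $\sigma_a^2 / n_a$. For a sampling budget $T$ and a norm exponent $p$ (both symbols are notational assumptions here, not taken from the cited paper), the allocation problem can be written as

$\min_{n_1, \ldots, n_K \,:\, \sum_{a} n_a = T} \; \left\| \left( \frac{\sigma_1^2}{n_1}, \ldots, \frac{\sigma_K^2}{n_K} \right) \right\|_p .$

Because the $\sigma_a^2$ are unknown, a sequential rule has to replace them with estimates, which is where the variance confidence bounds enter.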

Monte Carlo Tree Search

UCB-V variants are extended to the tree search setting, where node selection policies must balance the mean and variance of return estimates. Bernstein-style bonuses with empirical variance appear in tree node selection scores, and recent work systematically derives prior-based tree policies from UCB-V using regularized policy optimization (Weichart, 25 Dec 2025).
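A rough sketch of how a Bernstein-style bonus could enter a node selection score is given below. It assumes returns bounded in $[0, 1]$ and simply reuses the bandit bonus with the parent visit count in the logarithm; it is not the UCT-V-P or PUCT-V rule of the cited work, and the `Node` class and `select_child` function are hypothetical.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    visits: int = 0
    mean: float = 0.0   # running mean of returns through this node
    m2: float = 0.0     # running sum of squared deviations of returns
    children: list = field(default_factory=list)

    def variance(self) -> float:
        return self.m2 / self.visits if self.visits > 0 else 0.0


def select_child(parent: Node) -> Node:
    """Pick the child with the largest variance-aware score."""
    log_n = math.log(max(parent.visits, 1))

    def score(child: Node) -> float:
        if child.visits == 0:
            return math.inf  # assumption: expand unvisited children first
        var = max(child.variance(), 0.0)
        return (
            child.mean
            + math.sqrt(2.0 * var * log_n / child.visits)
            + 3.0 * log_n / child.visits
        )

    return max(parent.children, key=score)
```

A prior-weighted variant would rescale the bonus by a policy prior; the form of that rescaling is left to the cited derivation.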

Contextual and Dueling Bandits

Variance-aware bonuses are generalizable to contextual and dueling bandit models, with recent neural-contextual bandits incorporating estimated uncertainty into last-layer UCB-style exploration (Oh et al., 2 Jun 2025).

2. Algorithmic Details and Selection Strategies

Variance-Aware UCB algorithms operate with the following principal components:

Empirical Estimation: For each arm, maintain the sample mean $\widehat\mu_{a}(t)$ and the empirical variance $\widehat\sigma^2_{a}(t)$ using sequential updates.
Confidence Bound Construction: Build upper confidence bounds for the arm mean using Bernstein-type (variance-aware) inequalities, e.g., $\mathrm{UCBV}_a(t) = \widehat\mu_{a}(t) + \sqrt{\frac{2\,\widehat\sigma^2_{a}(t)\,\ln T}{n_{a}(t)}} + \frac{3\,\ln T}{n_{a}(t)}$.
Selection Rule: At each round, select the arm $a_t$ maximizing $\mathrm{UCBV}_a(t)$. Variants may incorporate prior information, context, or regularization in tree-search or contextual settings.
Fair Sampling/Variance Balancing: In active mean estimation, the index for arm selection is constructed from the variance upper confidence bound $\mathrm{UCB}_a(n)$ and the arm's sample count, and the arm with the largest index is pulled (Aznag et al., 20 May 2025).
Recursive Updates and Efficiency: Modern algorithms (e.g., RAVEN-UCB) adopt constant-time, $O(1)$, recursive mean and variance computations for practical scalability (Fang et al., 3 Jun 2025).
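The sequential estimation step can be sketched with Welford's online algorithm, which updates mean and variance in constant time per observation. This is a generic technique chosen for illustration, not necessarily the recursion used by RAVEN-UCB; the class name `ArmStats` is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ArmStats:
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the mean

    def update(self, reward: float) -> None:
        """Fold one reward into the running statistics in O(1) time."""
        self.n += 1
        delta = reward - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (reward - self.mean)

    @property
    def variance(self) -> float:
        """Biased (divide-by-n) empirical variance; 0.0 before any sample."""
        return self.m2 / self.n if self.n > 0 else 0.0
```

Only three numbers are stored per arm, so no reward history needs to be kept.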

A summary of selection rules across settings:

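Setting	UCB-V Bonus Formulation	Selection Rule
MAB	$\sqrt{2\,\widehat\sigma^2_{a}(t)\,\ln T / n_{a}(t)} + 3\,\ln T / n_{a}(t)$	$a_t = \arg\max_a \mathrm{UCBV}_a(t)$
Active Mean Estim.	UCB on $\sigma_a^2$, power-based index	Pull the arm with the largest index
Tree Search	Bernstein-style bonus using the empirical variance of node returns	Child with the highest node score, optionally prior-weighted
Contextual/Dueling	Variance-aware bonus computed on last-layer features	Optimistic argmax with bonus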

3. Regret Analysis and Theoretical Guarantees

Variance-Aware UCB methods achieve tighter (often instance-dependent) regret guarantees than classic UCB, particularly in the presence of heterogeneous variances.

Classical Regret Bounds

UCB1: $O\!\left(\frac{K \log T}{\Delta}\right)$ (gap-dependent), $O\!\left(\sqrt{K T \log T}\right)$ (gap-independent) (Mukherjee et al., 2017).
UCB-V (Audibert et al. 2009): $O\!\left(\frac{K \sigma_{\max}^2 \log T}{\Delta}\right)$ (gap-dependent), $O\!\left(\sqrt{K T \log T}\right)$ (gap-independent). Here, $\Delta$ is the minimal gap between the optimal and suboptimal arm means, and $\sigma_{\max}^2$ is the maximal arm variance (Mukherjee et al., 2017).
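To see where the variance dependence comes from, the two bonus widths can be compared numerically. The sketch below assumes rewards in $[0, 1]$ and, for a like-for-like comparison, uses the same $\ln T$ term in both bonuses; the function names are hypothetical.

```python
import math


def hoeffding_bonus(n: int, horizon: int) -> float:
    """Variance-free exploration bonus of the UCB1 type."""
    return math.sqrt(2.0 * math.log(horizon) / n)


def bernstein_bonus(var: float, n: int, horizon: int) -> float:
    """Variance-aware bonus matching the UCB-V index in this article."""
    log_t = math.log(horizon)
    return math.sqrt(2.0 * var * log_t / n) + 3.0 * log_t / n


if __name__ == "__main__":
    horizon = 10_000
    for n in (10, 100, 1000):
        print(n, hoeffding_bonus(n, horizon), bernstein_bonus(0.01, n, horizon))
```

For a low-variance arm the Bernstein bonus becomes the narrower of the two once $n$ is large enough that the $3 \ln T / n$ term is small, whereas for small $n$ that additive term dominates and the Bernstein bonus is the wider one.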

Improved Rates and Instability

Refined Regret: Recent analyses prove that, e.g., in two-armed cases, UCB-V achieves a refined, variance-dependent regret bound that interpolates between rates as the relationship between the arm variances and the horizon changes; the explicit rates and regime conditions are stated in Fan et al. (2024).
Stability and Phase Transitions: UCB-V exhibits stable asymptotic arm-pulling rates under mild conditions, but under critical "signal-to-noise" regimes the distribution of pull counts may not concentrate, resulting in instability. This phenomenon is absent from UCB1 (Fan et al., 2024).
Fair Mean Estimation: Variance-UCB achieves regret guarantees for the $p$-norm objective without pathological dependence on the minimal variance, by explicitly balancing pulls in proportion to variance (Aznag et al., 20 May 2025).

Modern Descendants

EUCBV: Incorporates arm elimination and achieves $O\!\left(\frac{K \sigma_{\max}^2 \log (T \Delta^2 / K)}{\Delta}\right)$ gap-dependent and $O\!\left(\sqrt{K T}\right)$ gap-independent bounds, improving on UCB1, UCB-V, and other variants (Mukherjee et al., 2017).
RAVEN-UCB: Further tightens constants, removes additive terms from the regret bound, and extends to non-stationary environments, while matching minimax rates (Fang et al., 3 Jun 2025).

4. Extensions, Variants, and Generalizations

Variance-aware confidence bounds are now widely adopted and generalized:

Adaptive Active Learning: Algorithms in group mean estimation adapt confidence widths to any tail class, not only to sub-Gaussian arms, by accepting any admissible procedure for UCB construction. This modularity allows incorporating improved concentration bounds as they arise (Aznag et al., 20 May 2025).
Monte Carlo Tree Search: Variance-aware bonuses are extended to prior-based tree selection via a systematic Inverse-RPO methodology, yielding UCT-V-P and PUCT-V, which inherit Bernstein-optimal guarantees (Weichart, 25 Dec 2025).
Contextual Dueling Bandits: Deep learning-based approaches incorporate variance-aware bonuses computed on last-layer features, ensuring dimensionally efficient exploration without full parameter-matrix inversion (Oh et al., 2 Jun 2025).

A plausible implication is that similar variance adaptation principles could further benefit other structured exploration domains where estimator heteroscedasticity is critical.

5. Practical Implementation and Experimental Evidence

Variance-Aware UCB and its derivatives are both theoretically grounded and empirically validated across classical and modern benchmarks.

Efficiency: Recursive updates for mean and variance keep per-arm storage and per-round update cost at $O(1)$, allowing deployment in high-frequency or large-scale settings (Fang et al., 3 Jun 2025).
Empirical Evaluations:
- EUCBV and RAVEN-UCB demonstrably outperform UCB-V and classical UCB1 across regimes with variable arm variances, small mean gaps, and in non-stationary reward distributions (Mukherjee et al., 2017, Fang et al., 3 Jun 2025).
- In MCTS, variance-aware UCT methods consistently outperform PUCT, with negligible computational overhead, particularly in environments with intrinsic stochasticity (Weichart, 25 Dec 2025).
- In contextual dueling applications, neural variance-aware approaches achieve sublinear regret with competitive computational demands (Oh et al., 2 Jun 2025).

Key implementation considerations include correct initialization to prevent under-sampling, keeping variance estimates non-negative (and covariance matrices positive-definite in contextual variants), and, where required, efficient matrix handling for contextual variants.
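The first two considerations can be illustrated with a small simulation loop. This is a sketch under stated assumptions: Bernoulli rewards, one initial pull per arm, a horizon at least as large as the number of arms, and a variance estimate clamped at zero to guard against floating-point round-off. The function `run_ucbv` is hypothetical and is not taken from any cited implementation.

```python
import math
import random


def run_ucbv(arm_means, horizon: int, seed: int = 0):
    """Simulate UCB-V on Bernoulli arms and return the pull counts."""
    rng = random.Random(seed)
    k = len(arm_means)
    counts = [0] * k
    means = [0.0] * k
    m2 = [0.0] * k  # running sums of squared deviations per arm
    log_t = math.log(horizon)

    def index(a: int) -> float:
        var = max(m2[a] / counts[a], 0.0)  # clamp to stay non-negative
        return (
            means[a]
            + math.sqrt(2.0 * var * log_t / counts[a])
            + 3.0 * log_t / counts[a]
        )

    for t in range(horizon):
        # Initialization: pull every arm once before using the index.
        arm = t if t < k else max(range(k), key=index)
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        delta = reward - means[arm]
        means[arm] += delta / counts[arm]
        m2[arm] += delta * (reward - means[arm])
    return counts
```

Calling `run_ucbv([0.2, 0.8], 2000)` returns a list of per-arm pull counts that sums to the horizon, with the higher-mean arm receiving most of the pulls.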

6. Comparative Analysis and Open Directions

Variance-Aware UCB fundamentally contrasts with classical UCB in both theoretical and practical respects:

Method	Gap-Dependent Regret	Gap-Independent Regret	Variance-Adaptivity	Stability	Comments
UCB1	$O\!\left(\frac{K \log T}{\Delta}\right)$	$O\!\left(\sqrt{K T \log T}\right)$	No	Asymptotically deterministic pull rates	Hoeffding-based; variance ignored
UCB-V	$O\!\left(\frac{K \sigma_{\max}^2 \log T}{\Delta}\right)$	$O\!\left(\sqrt{K T \log T}\right)$	Yes	May be unstable at critical signal-to-noise regimes	Bernstein-based, tighter when variances differ
EUCBV	$O\!\left(\frac{K \sigma_{\max}^2 \log (T \Delta^2 / K)}{\Delta}\right)$	$O\!\left(\sqrt{K T}\right)$	Yes	Stable	Arm-elimination, optimal constants
RAVEN-UCB	See Fang et al. (3 Jun 2025)	See Fang et al. (3 Jun 2025)	Yes	Stable, robust in nonstat.	Decaying exploration, strictly reduced regret

Despite clear empirical and theoretical advantages, UCB-V is subject to statistical instability at critical parameter regimes, where arm pull counts fail to concentrate at a single scale (Fan et al., 2024). Designing variance-aware policies that retain stability across all regimes remains an open challenge. Additionally, adoption in structured domains, such as active medical trials with fairness constraints (Aznag et al., 20 May 2025), large-scale tree search (Weichart, 25 Dec 2025), and deep contextual bandits (Oh et al., 2 Jun 2025), continues to be an active area of theoretical and applied research.