Upper Confidence Bounds with Curiosity (UCC)

Updated 28 October 2025

Upper Confidence Bounds with Curiosity (UCC) is a framework that augments classic UCB algorithms with dynamic, curiosity-driven exploration to balance exploitation and adaptive exploration.
It synthesizes theory and practice by integrating information-theoretic measures and rational models to adjust exploration bonuses based on uncertainty and learning progress.
UCC methods improve sample efficiency and robustness in complex settings such as stochastic bandits, reinforcement learning, and heavy-tailed environments.

Upper Confidence Bounds with Curiosity (UCC) designates a class of exploration strategies for sequential decision making—most notably in stochastic multi-armed bandits and reinforcement learning—where classical Upper Confidence Bound (UCB) algorithms are augmented by mechanisms that instantiate curiosity-driven or adaptive exploration. UCC-type methods synthesize the classic optimism-in-the-face-of-uncertainty principle with more dynamic or information-theoretically inspired bonuses drawn from empirical or theoretical measures of uncertainty, information gain, or learning progress. The resulting framework encompasses a wide range of instantiations in bandit problems, contextual bandits, reinforcement learning, and more general stochastic process optimization. Approaches to UCC are informed by both lower bound theory and rational models of curiosity.

1. Mathematical Structure and Foundational UCB Forms

The archetypal UCB policy for stochastic bandits assigns, at each round $t$ , to arm $k$ an index

$B_k(s, t) = X_{k,s} + \sqrt{\frac{c \log t}{s}}$

where $X_{k,s}$ is the empirical mean after $s$ pulls, and $c$ is a tunable constant. This index achieves a high-probability upper bound for the unknown mean reward $\mu_k$ and balances exploitation (via $X_{k,s}$ ) with exploration (via $\sqrt{(c \log t)/s}$ ). Extensions such as UCB $(\rho)$ replace the constant $c$ by a parameter $\rho > 0$ :

$B_k(s, t) = X_{k,s} + \sqrt{\frac{\rho \log t}{s}}$

The parameter $\rho$ modulates the amount of forced exploration: larger $\rho$ increases exploration, crucial for worst-case optimality. Generalized policies consider $B_k(s,t) = X_{k,s} + \sqrt{\frac{f_k(t)}{s}}$ for any increasing function $f_k(t)$ , capturing a trade-off between sample efficiency in "hard" versus "simple" environments (Salomon et al., 2011).

In KL-UCB algorithms, the bonus is based on inverting the Kullback-Leibler divergence between the empirical distribution $\widehat{\nu}_a$ and plausible means $\nu$ ,

$U_a(t) = \sup\left\{\mu : d(\widehat{\mu}_a(t), \mu) \leq \frac{f(t)}{N_a(t)}\right\}$

where $d(\cdot,\cdot)$ is the appropriate divergence for the underlying parametric family (Cappé et al., 2012). This tightens confidence intervals and aligns the bonus with the statistical geometry of the model.

When ported to reinforcement learning, especially in high-dimensional or continuous spaces, the UCB principle is operationalized through model ensembles, random feature embeddings (e.g., via neural tangent kernel gradients), or by optimizing with respect to uncertainty in latent model predictions (Chen et al., 2017, Zhou et al., 2019, Seyde et al., 2020).

2. Regret Lower Bounds and Adaptive Exploration

UCB methods are characterized by their regret scaling properties. Consistent policies, as defined by Lai and Robbins, guarantee

$\mathbb{E}[R_n] = o(n^a), \quad \forall a > 0$

and incur a minimal number of explorative pulls per suboptimal arm, set by the instance-dependent complexity $D_k(\theta)$ ,

$\liminf_n \frac{\mathbb{E}[T_k(n)]}{\log n} \geq \frac{1}{D_k(\theta)}$

where $D_k(\theta)$ is a min-KL subject to mean constraints. For generalized policies (so-called $\alpha$ -consistent), the bound is relaxed to

$\liminf_n \frac{\mathbb{E}[T_k(n)]}{\log n} \geq \frac{1-\alpha}{D_k(\theta)}$

highlighting the exploration-sample complexity trade-off (Salomon et al., 2011).

Crucially, more aggressive adaptations—in which $f_k(t)$ grows sub-logarithmically—may minimize regret in "easy" (e.g., deterministic) environments down to $O(\log\log n)$ , but at the cost of potentially catastrophic regret in "hard" settings. Theoretical impossibility results preclude the existence of adaptive meta-algorithms that optimally select the best exploration rate online without environmental side information (Salomon et al., 2011).

3. Curiosity-Driven Bonuses: Rational Models and Information-Theoretic Extensions

Curiosity in UCC extends exploration strategies by dynamically tuning the exploration term using either empirical uncertainty, theoretical information gain estimates, or intrinsic learning progress. Rational models (Dubey et al., 2017) formalize curiosity $\Omega_k$ as the expected change in knowledge utility $V$ ,

$\Omega_k = p_k \, \frac{d c_k}{d h_k}$

where $p_k$ is the probability of need ("need probability"), $h_k$ the exposure, and $c_k$ the confidence in stimulus $k$ . Mathematical forms such as $\Omega_k \propto -\ln(1 - c_k)(1 - c_k)$ (inverted-U w.r.t. confidence) or $\Omega_k \propto 1 - c_k$ (novelty-driven) provide dynamic curiosity bonuses as a function of confidence and sampling history.

Mechanistically, this bonus can be integrated into classical UCB by augmenting the confidence bound, yielding policies of the form:

$\text{UCC}(s, a) = Q(s, a) + \beta \cdot \Omega(s, a)$

where $\Omega$ quantifies local curiosity and $\beta$ balances exploitation, uncertainty, and exploration according to context.

In information-theoretic UCB extensions, such as KL-UCB, the width of the confidence region already quantifies the "difficulty" of the learning problem through divergence. This approach can be merged with curiosity by, for instance, adaptively increasing the exploration parameter for arms or states where learning progress is maximal or statistical surprise is high (Cappé et al., 2012).

4. UCC in Reinforcement Learning and Contextual Bandits

In deep RL, UCC is operationalized through ensemble methods and context-dependent variance estimation:

Q-ensemble approaches maintain an ensemble of $K$ independently learned $Q$ -functions $\{ Q_k \}$ , computing

$\widetilde{\mu}(s, a) = \frac{1}{K} \sum_k Q_k(s, a), \quad \widetilde{\sigma}(s, a) = \sqrt{\frac{1}{K} \sum_k (Q_k(s, a) - \widetilde{\mu}(s, a))^2}$

and selecting actions via

$a_t \in \arg\max_a \{ \widetilde{\mu}(s_t, a) + \lambda \cdot \widetilde{\sigma}(s_t, a) \}$

which encodes optimism and directs exploration to uncertain state-action pairs (Chen et al., 2017).

In context-dependent UCBs for function approximation (e.g., UCLS), the local covariance of value estimates is tracked, yielding state-action specific confidence bounds and focusing exploration where uncertainty or prediction error is large (Kumaraswamy et al., 2018).
NeuralUCB leverages neural tangent kernel embeddings: for each context $x$ , define $\phi(x) = \nabla_\theta f(x; \theta)/\sqrt{m}$ , and build UCBs of the form

$U_{t,a} = f(x_{t,a}; \theta_{t-1}) + \gamma_{t-1} \sqrt{ \phi(x_{t,a})^\top Z_{t-1}^{-1} \phi(x_{t,a}) }$

thus quantifying uncertainty as a function of learned distributed representations (Zhou et al., 2019).

In contextual bandits with very large or infinite action spaces, UCCB (Upper Confidence Counterfactual Bounds) constructs confidence intervals in policy space rather than just in action space, leveraging counterfactual divergence and context complexity to achieve optimal trade-offs without dependency on context cardinality (Xu et al., 2020).

5. Implications of Prior Information, Optimality, and Limiting Theorems

Bayesian perspectives on UCC (e.g., (Russo, 2019, Atsidakou et al., 2023)) show that Upper Confidence Bound rules can be formally equivalent to Gittins indices in certain regimes, especially in Gaussian bandits with large time horizons (discount factor $\gamma \to 1$ ):

$\lambda_\gamma(\mu, \sigma^2) = \mu + \Phi^{-1}(\gamma) \sigma + o(1)$

This establishes that proper tuning of the optimism parameter (via the linkage $q = \gamma$ for the quantile) can achieve the same exploration-exploitation characteristics as dynamically optimal solutions.

Recent results further establish that, under certain conditions, UCB-type procedures with exploration bonuses tuned by prior informativeness (in terms of variance and mean separation) can achieve finite-time regret that is logarithmic or even constant in favorable regimes,

$R(n) = O(c_\Delta \log n), \quad R(n) = O(c_h \log^2 n)$

with explicit dependency on gap surprise and the prior's information content (Atsidakou et al., 2023).

The impossibility of oracle selectivity imposes that UCC designs must trade-off between uniform regret guarantees and specialization to narrow classes of environments—no algorithm can "meta-optimize" exploration adaptively to the unknown environment class (Salomon et al., 2011).

6. Robustness, Heavy-Tailed Distributions, and Data-Driven UCB

Standard UCBs rely on moment assumptions (variance, sub-Gaussianity) that may not hold in practical, heavy-tailed environments. Data-driven UCBs using robust estimators such as the resampled median-of-means (RMM) circumvent this limitation by constructing parameter-free, distribution-free confidence bounds without recourse to unknown moment parameters (Tamás et al., 9 Jun 2024). The procedure for RMM-UCB constructs bounds directly from empirical data using randomized sign flips and median aggregation, supporting robust operation even when rewards are heavy-tailed, thereby enabling curiosity-based bonuses to be reliably computed in non-sub-Gaussian regimes.

The associated regret bounds (for symmetric heavy-tailed settings) remain near optimal:

$R_n \leq \sum_{i:\Delta_i > 0} [\max\{c_i \left(\frac{M_i}{\Delta_i^{1+a_i}}\right)^{1/a_i}, 17^2\} \log^2(n) + C ]\Delta_i$

independent of unknown distributional moments, providing strong guarantees in adversarial noise.

7. Practical Deployment and Application Scenarios

Applications of UCC-type algorithms are diverse:

In global optimization of continuous domains, the use of generic chaining UCBs (with discretization trees) extends optimism- and curiosity-driven methods to Gaussian or broader stochastic processes (Contal et al., 2016). These approaches adapt exploration granularity to intrinsic problem complexity (e.g., the metric entropy of the domain).
In robot planning or point-goal navigation, model uncertainty from an ensemble of predictors is integrated via UCB objectives, selecting actions or paths that are both promising and uncertain—an instantiation of curiosity at the planning level (Georgakis et al., 2022). For example, path costs may be minimized as

$\text{argmin}_{s \in S} (\mu_s - \alpha_1 \sigma_s + \alpha_2 d_s)$

where $\mu_s$ is mean traversability, $\sigma_s$ is epistemic uncertainty, and $d_s$ is path length.

In model-based RL, latent ensemble models forecast long-term returns, and a UCB objective prioritizes both high-mean and high-variance (curious) exploratory actions, yielding substantial improvements in sample efficiency and robust learning (Seyde et al., 2020).

Conclusion

UCC, as an umbrella for Upper Confidence Bounds with Curiosity, encapsulates the progressive refinement of optimism-based exploration methods through principled and dynamic uncertainty quantification, adaptation to structural properties of the environment, and robust, data-driven estimation. While universal adaptivity is theoretically precluded, approaches integrating information-theoretic, rational, or empirical curiosity can sharply improve sample efficiency and robustness within target environment classes. Leading algorithmic instantiations span bandits, contextual bandits with deep representations, reinforcement learning with ensembles, robust statistics, and planning under epistemic uncertainty, each demonstrating how curiosity—mathematically formalized—augments the power and nuance of exploration in learning agents.