Upper Confidence Bounds with Curiosity (UCC)

Updated 28 October 2025
  • Upper Confidence Bounds with Curiosity (UCC) is a framework that augments classic UCB algorithms with dynamic, curiosity-driven exploration to balance exploitation and adaptive exploration.
  • It synthesizes theory and practice by integrating information-theoretic measures and rational models to adjust exploration bonuses based on uncertainty and learning progress.
  • UCC methods improve sample efficiency and robustness in complex settings such as stochastic bandits, reinforcement learning, and heavy-tailed environments.

Upper Confidence Bounds with Curiosity (UCC) designates a class of exploration strategies for sequential decision making—most notably in stochastic multi-armed bandits and reinforcement learning—where classical Upper Confidence Bound (UCB) algorithms are augmented by mechanisms that instantiate curiosity-driven or adaptive exploration. UCC-type methods synthesize the classic optimism-in-the-face-of-uncertainty principle with more dynamic or information-theoretically inspired bonuses drawn from empirical or theoretical measures of uncertainty, information gain, or learning progress. The resulting framework encompasses a wide range of instantiations in bandit problems, contextual bandits, reinforcement learning, and more general stochastic process optimization. Approaches to UCC are informed by both lower bound theory and rational models of curiosity.

1. Mathematical Structure and Foundational UCB Forms

The archetypal UCB policy for stochastic bandits assigns, at each round $t$, to arm $k$ the index

$$B_k(s, t) = X_{k,s} + \sqrt{\frac{c \log t}{s}}$$

where $X_{k,s}$ is the empirical mean after $s$ pulls, and $c$ is a tunable constant. This index is a high-probability upper bound on the unknown mean reward $\mu_k$ and balances exploitation (via $X_{k,s}$) with exploration (via $\sqrt{(c \log t)/s}$). Extensions such as UCB($\rho$) replace the constant $c$ by a parameter $\rho > 0$:

$$B_k(s, t) = X_{k,s} + \sqrt{\frac{\rho \log t}{s}}$$

The parameter $\rho$ modulates the amount of forced exploration: larger $\rho$ increases exploration, which is crucial for worst-case optimality. Generalized policies consider $B_k(s,t) = X_{k,s} + \sqrt{f_k(t)/s}$ for any increasing function $f_k(t)$, capturing a trade-off between sample efficiency in "hard" versus "simple" environments (Salomon et al., 2011).
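
As a concrete illustration, the following Python sketch implements the UCB($\rho$) index on a Bernoulli bandit; the function name, default $\rho$, and the simulation loop are illustrative, not drawn from any cited paper.

```python
import numpy as np

def ucb_rho(means_hat, pulls, t, rho=0.5):
    """Index of the arm maximizing the UCB(rho) score at round t."""
    # Unpulled arms get an infinite score so every arm is tried once first.
    bonus = np.sqrt(rho * np.log(t) / np.maximum(pulls, 1))
    scores = np.where(pulls == 0, np.inf, means_hat + bonus)
    return int(np.argmax(scores))

# Illustrative run on a 3-armed Bernoulli bandit.
rng = np.random.default_rng(0)
true_means = np.array([0.3, 0.5, 0.7])
means_hat = np.zeros(3)
pulls = np.zeros(3)
for t in range(1, 1001):
    k = ucb_rho(means_hat, pulls, t)
    reward = rng.binomial(1, true_means[k])
    pulls[k] += 1
    means_hat[k] += (reward - means_hat[k]) / pulls[k]  # incremental mean
```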

In KL-UCB algorithms, the bonus is based on inverting the Kullback-Leibler divergence between the empirical estimate $\widehat{\mu}_a(t)$ and plausible means $\mu$,

$$U_a(t) = \sup\left\{\mu : d(\widehat{\mu}_a(t), \mu) \leq \frac{f(t)}{N_a(t)}\right\}$$

where $d(\cdot,\cdot)$ is the appropriate divergence for the underlying parametric family (Cappé et al., 2012). This tightens confidence intervals and aligns the bonus with the statistical geometry of the model.
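
Because $d(\widehat{\mu}_a(t), \mu)$ is increasing in $\mu$ above the empirical mean, the index can be computed by bisection. A minimal sketch for Bernoulli rewards follows, using the common exploration function $f(t) = \log t + c \log\log t$; the helper names and default $c$ are assumptions for illustration.

```python
import math

def kl_bernoulli(p, q, eps=1e-12):
    """KL divergence d(p, q) between Bernoulli(p) and Bernoulli(q)."""
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))

def kl_ucb_index(mu_hat, n_a, t, c=3.0, iters=50):
    """Largest mu with d(mu_hat, mu) <= f(t) / n_a, f(t) = log t + c log log t."""
    level = (math.log(t) + c * math.log(max(math.log(t), 1.0))) / n_a
    lo, hi = mu_hat, 1.0
    for _ in range(iters):  # bisection: d(mu_hat, .) increases above mu_hat
        mid = 0.5 * (lo + hi)
        if kl_bernoulli(mu_hat, mid) <= level:
            lo = mid
        else:
            hi = mid
    return lo
```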

When ported to reinforcement learning, especially in high-dimensional or continuous spaces, the UCB principle is operationalized through model ensembles, random feature embeddings (e.g., via neural tangent kernel gradients), or by optimizing with respect to uncertainty in latent model predictions (Chen et al., 2017, Zhou et al., 2019, Seyde et al., 2020).

2. Regret Lower Bounds and Adaptive Exploration

UCB methods are characterized by their regret scaling properties. Consistent policies, as defined by Lai and Robbins, guarantee

$$\mathbb{E}[R_n] = o(n^a), \quad \forall\, a > 0$$

and incur a minimal number of exploratory pulls per suboptimal arm, set by the instance-dependent complexity $D_k(\theta)$,

$$\liminf_{n} \frac{\mathbb{E}[T_k(n)]}{\log n} \geq \frac{1}{D_k(\theta)}$$

where $D_k(\theta)$ is a minimum KL divergence subject to mean constraints. For generalized (so-called $\alpha$-consistent) policies, the bound is relaxed to

$$\liminf_{n} \frac{\mathbb{E}[T_k(n)]}{\log n} \geq \frac{1-\alpha}{D_k(\theta)}$$

highlighting the exploration-sample complexity trade-off (Salomon et al., 2011).

Crucially, more aggressive adaptations, in which $f_k(t)$ grows sub-logarithmically, may reduce regret in "easy" (e.g., deterministic) environments down to $O(\log\log n)$, but at the cost of potentially catastrophic regret in "hard" settings. Theoretical impossibility results preclude the existence of adaptive meta-algorithms that optimally select the best exploration rate online without environmental side information (Salomon et al., 2011).

3. Curiosity-Driven Bonuses: Rational Models and Information-Theoretic Extensions

Curiosity in UCC extends exploration strategies by dynamically tuning the exploration term using empirical uncertainty, theoretical information-gain estimates, or intrinsic learning progress. Rational models (Dubey et al., 2017) formalize curiosity $\Omega_k$ as the expected change in knowledge utility $V$,

$$\Omega_k = p_k \, \frac{d c_k}{d h_k}$$

where $p_k$ is the probability of need ("need probability"), $h_k$ the exposure, and $c_k$ the confidence in stimulus $k$. Functional forms such as $\Omega_k \propto -(1 - c_k)\ln(1 - c_k)$ (inverted-U with respect to confidence) or $\Omega_k \propto 1 - c_k$ (novelty-driven) provide dynamic curiosity bonuses as a function of confidence and sampling history.

Mechanistically, this bonus can be integrated into classical UCB by augmenting the confidence bound, yielding policies of the form:

$$\text{UCC}(s, a) = Q(s, a) + \beta \cdot \Omega(s, a)$$

where $\Omega$ quantifies local curiosity and $\beta$ balances exploitation, uncertainty, and exploration according to context.
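
A minimal sketch of such a curiosity-augmented index follows, assuming a scalar confidence signal $c \in [0, 1)$ per action (e.g., derived from visit counts); the function names and the confidence proxy are assumptions for illustration, not the method of any single cited paper.

```python
import numpy as np

def curiosity_bonus(confidence, kind="inverted_u"):
    """Rational-model curiosity Omega as a function of confidence c in [0, 1)."""
    c = np.clip(confidence, 0.0, 1.0 - 1e-9)
    if kind == "inverted_u":
        # Omega ∝ -(1 - c) ln(1 - c): peaks at intermediate confidence.
        return -(1.0 - c) * np.log(1.0 - c)
    # Novelty-driven alternative: Omega ∝ 1 - c.
    return 1.0 - c

def ucc_scores(q_values, confidence, beta=1.0):
    """UCC(s, a) = Q(s, a) + beta * Omega(s, a), over a vector of actions."""
    return q_values + beta * curiosity_bonus(confidence)
```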

In information-theoretic UCB extensions, such as KL-UCB, the width of the confidence region already quantifies the "difficulty" of the learning problem through divergence. This approach can be merged with curiosity by, for instance, adaptively increasing the exploration parameter for arms or states where learning progress is maximal or statistical surprise is high (Cappé et al., 2012).

4. UCC in Reinforcement Learning and Contextual Bandits

In deep RL, UCC is operationalized through ensemble methods and context-dependent variance estimation:

  • Q-ensemble approaches maintain an ensemble of $K$ independently learned $Q$-functions $\{Q_k\}$, computing

$$\widetilde{\mu}(s, a) = \frac{1}{K} \sum_k Q_k(s, a), \qquad \widetilde{\sigma}(s, a) = \sqrt{\frac{1}{K} \sum_k \left(Q_k(s, a) - \widetilde{\mu}(s, a)\right)^2}$$

and selecting actions via

$$a_t \in \arg\max_a \left\{ \widetilde{\mu}(s_t, a) + \lambda \cdot \widetilde{\sigma}(s_t, a) \right\}$$

which encodes optimism and directs exploration toward uncertain state-action pairs (Chen et al., 2017); a minimal sketch of this selection rule appears after this list.

  • In context-dependent UCBs for function approximation (e.g., UCLS), the local covariance of value estimates is tracked, yielding state-action specific confidence bounds and focusing exploration where uncertainty or prediction error is large (Kumaraswamy et al., 2018).
  • NeuralUCB leverages neural tangent kernel embeddings: for each context $x$, define $\phi(x) = \nabla_\theta f(x; \theta)/\sqrt{m}$, and build UCBs of the form

$$U_{t,a} = f(x_{t,a}; \theta_{t-1}) + \gamma_{t-1} \sqrt{\phi(x_{t,a})^\top Z_{t-1}^{-1} \phi(x_{t,a})}$$

thus quantifying uncertainty as a function of learned distributed representations (Zhou et al., 2019).

  • In contextual bandits with very large or infinite action spaces, UCCB (Upper Confidence Counterfactual Bounds) constructs confidence intervals in policy space rather than just in action space, leveraging counterfactual divergence and context complexity to achieve optimal trade-offs without dependency on context cardinality (Xu et al., 2020).
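
Returning to the Q-ensemble rule above, the following sketch shows mean-plus-deviation action selection; representing the ensemble as a list of callables is an assumption for illustration.

```python
import numpy as np

def ensemble_ucb_action(q_ensemble, state, lam=1.0):
    """Pick the action maximizing ensemble mean + lam * ensemble std.

    q_ensemble: list of K callables, each mapping a state to a vector of
    action values (one independently trained Q-head per callable).
    """
    qs = np.stack([q(state) for q in q_ensemble])  # shape [K, num_actions]
    mu = qs.mean(axis=0)    # exploitation signal
    sigma = qs.std(axis=0)  # disagreement as epistemic uncertainty
    return int(np.argmax(mu + lam * sigma))
```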

5. Implications of Prior Information, Optimality, and Limiting Theorems

Bayesian perspectives on UCC (e.g., (Russo, 2019, Atsidakou et al., 2023)) show that Upper Confidence Bound rules can be formally equivalent to Gittins indices in certain regimes, especially in Gaussian bandits with large time horizons (discount factor $\gamma \to 1$):

$$\lambda_\gamma(\mu, \sigma^2) = \mu + \Phi^{-1}(\gamma)\,\sigma + o(1)$$

This establishes that proper tuning of the optimism parameter (setting the quantile $q = \gamma$) can achieve the same exploration-exploitation characteristics as dynamically optimal solutions.
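
A quick numerical illustration of this limiting index, assuming a Gaussian posterior $N(\mu, \sigma^2)$ and using SciPy's standard-normal quantile for $\Phi^{-1}$; the function name is illustrative.

```python
from scipy.stats import norm

def gaussian_ucb_index(mu, sigma, gamma=0.99):
    """Limiting Gittins-style index: mu + Phi^{-1}(gamma) * sigma."""
    return mu + norm.ppf(gamma) * sigma

# Posterior N(0.4, 0.05^2) with quantile gamma = 0.99:
print(gaussian_ucb_index(0.4, 0.05))  # ~0.4 + 2.33 * 0.05 ≈ 0.516
```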

Recent results further establish that, under certain conditions, UCB-type procedures with exploration bonuses tuned by prior informativeness (in terms of variance and mean separation) can achieve finite-time regret that is logarithmic or even constant in favorable regimes,

$$R(n) = O(c_\Delta \log n), \qquad R(n) = O(c_h \log^2 n)$$

with explicit dependency on gap surprise and the prior's information content (Atsidakou et al., 2023).

The impossibility of oracle selectivity means that UCC designs must trade off uniform regret guarantees against specialization to narrow classes of environments: no algorithm can "meta-optimize" exploration adaptively to the unknown environment class (Salomon et al., 2011).

6. Robustness, Heavy-Tailed Distributions, and Data-Driven UCB

Standard UCBs rely on moment assumptions (variance, sub-Gaussianity) that may not hold in practical, heavy-tailed environments. Data-driven UCBs using robust estimators such as the resampled median-of-means (RMM) circumvent this limitation by constructing parameter-free, distribution-free confidence bounds without recourse to unknown moment parameters (Tamás et al., 2024). The RMM-UCB procedure builds bounds directly from empirical data using randomized sign flips and median aggregation, supporting robust operation even when rewards are heavy-tailed and enabling curiosity-based bonuses to be reliably computed in non-sub-Gaussian regimes.
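
The full RMM construction additionally calibrates distribution-free confidence levels via the randomized sign perturbations mentioned above; the sketch below shows only the core median-of-means estimator that such bounds build on, with an illustrative block count, and is not the paper's exact procedure.

```python
import numpy as np

def median_of_means(rewards, n_blocks=5, rng=None):
    """Simplified median-of-means estimate (not the full RMM bound).

    Shuffles the sample, averages within blocks, and returns the median of
    the block means, which resists heavy-tailed outliers.
    """
    rng = rng or np.random.default_rng()
    x = rng.permutation(np.asarray(rewards, dtype=float))
    blocks = np.array_split(x, n_blocks)
    return float(np.median([b.mean() for b in blocks]))
```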

The associated regret bounds (for symmetric heavy-tailed settings) remain near optimal:

$$R_n \leq \sum_{i:\,\Delta_i > 0} \left[\max\left\{c_i \left(\frac{M_i}{\Delta_i^{1+a_i}}\right)^{1/a_i},\ 17^2\right\} \log^2(n) + C\right] \Delta_i$$

independent of unknown distributional moments, providing strong guarantees in adversarial noise.

7. Practical Deployment and Application Scenarios

Applications of UCC-type algorithms are diverse:

  • In global optimization of continuous domains, the use of generic chaining UCBs (with discretization trees) extends optimism- and curiosity-driven methods to Gaussian or broader stochastic processes (Contal et al., 2016). These approaches adapt exploration granularity to intrinsic problem complexity (e.g., the metric entropy of the domain).
  • In robot planning or point-goal navigation, model uncertainty from an ensemble of predictors is integrated via UCB objectives, selecting actions or paths that are both promising and uncertain—an instantiation of curiosity at the planning level (Georgakis et al., 2022). For example, path costs may be minimized as

$$\operatorname*{argmin}_{s \in S} \left(\mu_s - \alpha_1 \sigma_s + \alpha_2 d_s\right)$$

where $\mu_s$ is mean traversability, $\sigma_s$ is epistemic uncertainty, and $d_s$ is path length; a small selection sketch follows this list.

  • In model-based RL, latent ensemble models forecast long-term returns, and a UCB objective prioritizes both high-mean and high-variance (curious) exploratory actions, yielding substantial improvements in sample efficiency and robust learning (Seyde et al., 2020).
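
A minimal sketch of the path objective above, with candidate paths given as $(\mu_s, \sigma_s, d_s)$ triples; the function name and weight defaults are illustrative.

```python
import numpy as np

def select_path(candidates, alpha1=1.0, alpha2=0.1):
    """Pick the candidate minimizing mu - alpha1 * sigma + alpha2 * d.

    candidates: list of (mu, sigma, d) triples matching the objective above;
    subtracting alpha1 * sigma favors paths with high epistemic uncertainty,
    while alpha2 * d penalizes long paths.
    """
    costs = [mu - alpha1 * sigma + alpha2 * d for mu, sigma, d in candidates]
    return int(np.argmin(costs))
```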

Conclusion

UCC, as an umbrella for Upper Confidence Bounds with Curiosity, encapsulates the progressive refinement of optimism-based exploration methods through principled and dynamic uncertainty quantification, adaptation to structural properties of the environment, and robust, data-driven estimation. While universal adaptivity is theoretically precluded, approaches integrating information-theoretic, rational, or empirical curiosity can sharply improve sample efficiency and robustness within target environment classes. Leading algorithmic instantiations span bandits, contextual bandits with deep representations, reinforcement learning with ensembles, robust statistics, and planning under epistemic uncertainty, each demonstrating how curiosity—mathematically formalized—augments the power and nuance of exploration in learning agents.
