
UCBZero: Reward-Free Exploration in RL

Updated 20 November 2025
  • UCBZero is a task-agnostic exploration algorithm for RL that collects reward-free trajectories and uses UCB bonuses in a two-phase protocol to build near-optimal policies.
  • It achieves a sample complexity of $O((\log N + \iota)\, H^5 S A/\epsilon^2)$ with a provable logarithmic dependence on the number of tasks $N$, nearly matching the information-theoretic lower bound, which already requires the $\log N$ factor.
  • The algorithm decouples exploration from reward signals, enabling efficient policy optimization across diverse tasks and accurate transition-model estimation from reward-free episodic data.

UCBZero is a task-agnostic exploration algorithm designed for reinforcement learning (RL) in the absence of reward supervision during exploration. It operates within a tabular, fixed-horizon, episodic Markov Decision Process (MDP) and enables efficient reuse of exploration data to generate near-optimal policies for multiple tasks with different, a priori unknown reward functions. UCBZero achieves near-optimal sample complexity in terms of the number of exploration episodes required to guarantee $\epsilon$-optimality simultaneously across $N$ tasks, with a provably unavoidable logarithmic dependence on $N$ (Zhang et al., 2020).

1. Task-Agnostic RL: Problem Formulation

The setting is a finite, tabular, episodic MDP defined by $(\mathcal{S}, \mathcal{A}, H, P, r)$: $\mathcal{S}$ is the state space ($|\mathcal{S}| = S$), $\mathcal{A}$ is the action space ($|\mathcal{A}| = A$), $H$ is the horizon, $P_h(\cdot \mid s,a)$ is the transition kernel at step $h$, and $r_h(\cdot \mid s,a)$ is the reward distribution with support in $[0,1]$, with all episodes beginning at a fixed initial state $s_1$.

Learning proceeds in two distinct phases:

  • Exploration phase: Over $K$ episodes, the agent explores the MDP without reward feedback, collecting state-action trajectories $D = \{(s_h^k, a_h^k)\}_{h=1,\dots,H,\; k=1,\dots,K}$.
  • Policy-optimization phase: Presented with $N$ tasks, each with an associated unknown reward function $r^{(n)}$, the stored trajectories are augmented with sampled rewards to form $D^{(n)}$. The objective is to identify, for each $n = 1,\dots,N$, a policy $\pi^{(n)}$ such that with probability at least $1-p$,

$$V_1^{*(n)}(s_1) - V_1^{\pi^{(n)}}(s_1) \leq \epsilon$$

for all $N$ tasks, using as few exploration episodes $K$ as possible.
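
For concreteness, here is a minimal Python sketch of this setting; the class name, array layout, and Bernoulli reward sampling are illustrative assumptions made for this summary, not constructs from the paper.

```python
import numpy as np

class TabularEpisodicMDP:
    """Finite episodic MDP (S, A, H, P, r) with a fixed initial state s1 = 0."""

    def __init__(self, S, A, H, P, R, seed=0):
        # P[h, s, a] is a length-S probability vector over next states;
        # R[h, s, a] is a mean reward in [0, 1] (sampled as a Bernoulli below).
        self.S, self.A, self.H = S, A, H
        self.P, self.R = P, R
        self.rng = np.random.default_rng(seed)

    def reset(self):
        self.h, self.s = 0, 0  # every episode starts from the fixed initial state s1
        return self.s

    def step(self, a, with_reward=False):
        # Reward is only revealed when with_reward=True (policy-optimization phase);
        # during task-agnostic exploration the agent observes transitions only.
        r = float(self.rng.random() < self.R[self.h, self.s, a]) if with_reward else None
        s_next = int(self.rng.choice(self.S, p=self.P[self.h, self.s, a]))
        self.h, self.s = self.h + 1, s_next
        return s_next, r
```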

2. UCBZero Algorithmic Principles and Workflow

UCBZero executes an exploration-first, optimization-after protocol based on optimism-driven Q-learning with upper confidence bound (UCB) bonuses.

Notation

  • $N$: number of tasks
  • $K$: number of exploration episodes
  • $p$: confidence parameter
  • $\iota = \log(SAHK/p)$: log factor used in the bonus definition
  • $b_t = c\sqrt{H^3 (\log N + \iota)/t}$: Hoeffding-style UCB bonus
  • $\alpha_t = (H+1)/(H+t)$: learning rate
  • $Q_h(s,a)$: Q-value at step $h$, optimistically initialized to $H$
  • $N_h(s,a)$: state-action visitation count
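
For reference, the bonus and learning rate as they would be computed in code; a hedged sketch, with the absolute constant $c$ left as a parameter since it is not specified here.

```python
import math

def learning_rate(t, H):
    """Step size alpha_t = (H + 1) / (H + t)."""
    return (H + 1) / (H + t)

def bonus(t, H, N, iota, c=1.0):
    """Hoeffding-style bonus b_t = c * sqrt(H^3 (log N + iota) / t); c is an unspecified absolute constant."""
    return c * math.sqrt(H**3 * (math.log(N) + iota) / t)
```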

Pseudocode Outline

Exploration Phase (Zero-reward Q-learning with UCB bonuses)

  • Initialize $Q_h(s,a) \leftarrow H$ and $N_h(s,a) \leftarrow 0$ for all $(s,a,h)$.
  • For each exploration episode $k = 1,\dots,K$:
    • For $h = 1,\dots,H$:
      • Choose $a_h^k = \arg\max_a Q_h(s_h^k, a)$ and observe the next state $s_{h+1}^k$.
      • Set $t \leftarrow N_h(s_h^k, a_h^k) + 1$ and update the count.
      • Set $V_{h+1} \leftarrow \min\{H, \max_{a'} Q_{h+1}(s_{h+1}^k, a')\}$.
      • Update

$$Q_h(s_h^k, a_h^k) \leftarrow (1-\alpha_t)\, Q_h(s_h^k, a_h^k) + \alpha_t\,\bigl[V_{h+1} + 2 b_t\bigr].$$

  • No rewards are used; exploration is driven entirely by the bonus (a runnable sketch follows below).
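
A minimal runnable sketch of the exploration phase, assuming the hypothetical `TabularEpisodicMDP` environment and the `bonus`/`learning_rate` helpers sketched above; it illustrates the stated update rule and is not the authors' reference implementation.

```python
import numpy as np

def ucbzero_explore(env, K, N_tasks, p):
    """Reward-free exploration: zero-reward Q-learning driven only by the UCB bonus."""
    S, A, H = env.S, env.A, env.H
    iota = np.log(S * A * H * K / p)
    Q = np.full((H + 1, S, A), float(H))  # optimistic initialization; layer H is a zero terminal layer
    Q[H] = 0.0
    counts = np.zeros((H, S, A), dtype=int)
    dataset = []  # reward-free transitions (h, s, a, s'), stored episode by episode

    for _ in range(K):
        s = env.reset()
        for h in range(H):
            a = int(np.argmax(Q[h, s]))                 # greedy in the optimistic Q-values
            s_next, _ = env.step(a, with_reward=False)  # no reward observed during exploration
            counts[h, s, a] += 1
            t = counts[h, s, a]
            alpha, b = learning_rate(t, H), bonus(t, H, N_tasks, iota)
            V_next = min(H, Q[h + 1, s_next].max())
            Q[h, s, a] = (1 - alpha) * Q[h, s, a] + alpha * (V_next + 2 * b)
            dataset.append((h, s, a, s_next))
            s = s_next
    return dataset
```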

Policy-Optimization Phase (for each task $n$)

  • For each task $n = 1,\dots,N$: input $D^{(n)} = \{(s_h^k, a_h^k, r_h^k)\}$, the stored trajectories augmented with that task's sampled rewards.
  • Initialize $Q_h(s,a) \leftarrow H$ and $N_h(s,a) \leftarrow 0$.
  • For $k = 1,\dots,K$ (replaying the stored episodes):
    • For $h = 1,\dots,H$:
      • Set $t \leftarrow N_h(s_h^k, a_h^k) + 1$.
      • Set $V_{h+1} \leftarrow \min\{H, \max_{a'} Q_{h+1}(s_{h+1}^k, a')\}$.
      • Update

$$Q_h(s_h^k, a_h^k) \leftarrow (1-\alpha_t)\, Q_h(s_h^k, a_h^k) + \alpha_t\,\bigl[r_h^k + V_{h+1} + b_t\bigr].$$

  • Output: the uniform mixture over the sequence of greedy policies (a sketch follows below).
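
A matching sketch of the per-task planning phase, replaying the stored transitions with that task's sampled rewards; the `rewards` list, its alignment with `dataset`, and the reuse of the helpers above are hypothetical interface choices.

```python
import numpy as np

def ucbzero_plan(dataset, rewards, S, A, H, K, N_tasks, p):
    """Per-task planning by replaying the stored reward-free trajectories.

    `rewards[i]` is the sampled reward attached to the i-th stored transition for this
    task (a hypothetical interface). Returns the greedy policies computed after each
    replayed episode; the output policy is the uniform mixture over them."""
    iota = np.log(S * A * H * K / p)
    Q = np.full((H + 1, S, A), float(H))  # optimistic initialization, zero terminal layer
    Q[H] = 0.0
    counts = np.zeros((H, S, A), dtype=int)
    greedy_policies = []

    for k in range(K):
        for i in range(k * H, (k + 1) * H):   # the H transitions of stored episode k
            h, s, a, s_next = dataset[i]
            counts[h, s, a] += 1
            t = counts[h, s, a]
            alpha, b = learning_rate(t, H), bonus(t, H, N_tasks, iota)  # same (log N + iota) bonus
            V_next = min(H, Q[h + 1, s_next].max())
            Q[h, s, a] = (1 - alpha) * Q[h, s, a] + alpha * (rewards[i] + V_next + b)
        greedy_policies.append(Q[:H].argmax(axis=2))  # greedy policy pi_k: shape (H, S)
    return greedy_policies
```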

Explanation

In the exploration phase, the algorithm uses only the confidence bonus $2b_t$, which instills optimism. In the policy-optimization phase, sampled rewards plus a standard bonus $b_t$ are used, recovering optimistic Q-learning for each task (Zhang et al., 2020).

3. Theoretical Sample Complexity and Optimality Results

UCBZero's theoretical guarantees quantify both its efficiency and the inherent difficulty of the task-agnostic RL problem.

Main Results

Guarantee type | Episodes required | Dependence on $N$
Upper bound | $O((\log N + \iota)\, H^5 S A/\epsilon^2)$ | $\log N$
Lower bound | $\Omega(\log N \cdot H^2 S A/\epsilon^2)$ | $\log N$ (provably necessary)

  • Upper Bound: With probability at least $1-p$, after $K = O((\log N + \iota)\, H^5 S A/\epsilon^2)$ exploration episodes, UCBZero delivers $\epsilon$-optimal policies for all $N$ tasks.
  • Lower Bound: Any $(\epsilon, p)$-correct algorithm must use at least $\Omega(\log N \cdot H^2 S A/\epsilon^2)$ episodes. The logarithmic dependence on $N$ is shown to be information-theoretically unavoidable.

Notation: $\tilde{O}(f) = O(f\,\mathrm{polylog}(f, 1/p, H, S, A))$; $\Omega(\cdot)$ denotes an asymptotic lower bound up to constants (Zhang et al., 2020).
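
For rough intuition only, the upper-bound expression can be evaluated numerically; the snippet below ignores the unspecified absolute constant and uses illustrative problem sizes that are not taken from the paper.

```python
import math

def ucbzero_episode_bound(S, A, H, N, eps, p, K_guess=10**6):
    """Evaluate (log N + iota) * H^5 * S * A / eps^2 with iota = log(S*A*H*K/p); constants ignored."""
    iota = math.log(S * A * H * K_guess / p)
    return (math.log(N) + iota) * H**5 * S * A / eps**2

# Example: S=10, A=5, H=10, N=100 tasks, eps=0.1, p=0.01  ->  about 1.5e10 (up to constants)
print(f"{ucbzero_episode_bound(10, 5, 10, 100, 0.1, 0.01):.2e}")
```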

4. Technical Proof Sketch: Regret Decomposition and Coverage

The sample complexity results rest on principles of optimism, Q-regret decomposition, and careful control of empirical value estimates.

  • Optimism & Q-Regret Decomposition: Define a pseudo-MDP with zero rewards to compare the Q-updates under pure exploration ($Q_h^k$) and under per-task learning ($Q_h^{k,(n)}$). Using induction over $h$ and Azuma–Hoeffding concentration,

$$Q_h^{k,(n)}(s,a) - Q_h^{\pi_k,(n)}(s,a) \leq Q_h^k(s,a) - Q_h^{\pi_k}(s,a),$$

where $Q_h^{\pi_k}$ is identically zero under the zero-reward MDP.

  • Aggregate Regret Control: The aggregate Q-regret across episodes,

$$\sum_{k=1}^K \bigl[V_1^k - V_1^{\pi_k}\bigr],$$

is upper-bounded by $O\bigl(\sqrt{(\log N + \iota)\, H^5 S A K}\bigr)$. By the regret-difference lemma, the same bound transfers to each of the $N$ tasks simultaneously.

  • Implication: Averaging over the $K$ episodes and solving for $\epsilon$ yields $K = O((\log N + \iota)\, H^5 S A/\epsilon^2)$. This matches the lower bound in its dependence on $S$, $A$, $\epsilon$, and $\log N$, leaving an $H^3$ gap in the horizon dependence (Zhang et al., 2020).
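
To make the last step explicit (constants suppressed, and using the per-task form of the aggregate bound above): the output policy $\bar\pi^{(n)}$ is the uniform mixture of the greedy policies $\pi_1,\dots,\pi_K$, so its suboptimality equals the average episode regret,

$$V_1^{*(n)}(s_1) - V_1^{\bar\pi^{(n)}}(s_1) \;=\; \frac{1}{K}\sum_{k=1}^K \bigl[V_1^{*(n)}(s_1) - V_1^{\pi_k,(n)}(s_1)\bigr] \;\leq\; \sqrt{\frac{(\log N + \iota)\, H^5 S A}{K}},$$

and requiring the right-hand side to be at most $\epsilon$ gives $K = O\bigl((\log N + \iota)\, H^5 S A/\epsilon^2\bigr)$.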

5. Reward-Free RL: The Known-Reward Variant

When the reward functions $r^{(n)}$ for the $N$ tasks are known at the planning stage (the reward-free RL framework), the sample complexity becomes independent of $N$.

  • Key Result: By covering the space of reward functions with an $\epsilon$-net, the complexity reduces to

$$K = O\bigl(H^6 S^2 A^2 (\log(3H/\epsilon) + \iota)/\epsilon^2\bigr),$$

uniformly over all reward functions, so the explicit $N$-dependence disappears.

  • Conceptual Mechanism: If two reward functions differ by at most $\epsilon/H$ on every transition, then a single near-optimal policy suffices for both (by the Simulation Lemma). Discretizing the reward space therefore yields an effective task number $N \approx M^{HSA}$ with $M = O(H/\epsilon)$, and substituting $\log N \approx HSA \log(H/\epsilon)$ into the task-agnostic bound gives the stated $N$-independent complexity (Zhang et al., 2020); see the calculation below.
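
Written out, the substitution behind this reduction (net granularity and constants suppressed):

$$\log N \;\approx\; HSA\,\log(3H/\epsilon) \quad\Longrightarrow\quad (\log N + \iota)\,\frac{H^5 S A}{\epsilon^2} \;\lesssim\; \frac{H^6 S^2 A^2 \bigl(\log(3H/\epsilon) + \iota\bigr)}{\epsilon^2},$$

which recovers the bound stated above.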

6. Coverage, Model Estimation, and Practical Behavior

  • Visitation Guarantee: UCBZero offers a uniform lower bound on (state, action, timestep) coverage; every triplet is visited at least

$$\Omega\!\left(\frac{K\,\delta_h(s)^2}{H^2 S A}\right)$$

times, where $\delta_h(s) = \max_\pi \Pr^\pi(s_h = s)$, ensuring sufficient empirical coverage.

  • Transition Model Estimation: From the zero-reward trajectories, it is possible to estimate a transition model $\hat P_h$ such that

$$|\hat P_h(s' \mid s,a) - P_h(s' \mid s,a)| \leq \epsilon/\delta_h(s)$$

after $O(H^5 S A\, \iota/\epsilon^2)$ episodes, validating the use of any off-policy or batch RL method on the collected data (an estimation sketch follows after this list).

  • Conceptual Insights: UCBZero demonstrates that a combination of exhaustive exploration (without using reward signals) and Q-learning optimism suffices to match the best-reported sample efficiency for task-specific exploration, up to a $\log N$ factor. This $\log N$ is proved to be unavoidable when optimizing for $N$ tasks from a single pool of exploration data.
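
A minimal sketch of estimating $\hat P_h$ by empirical frequencies from the reward-free dataset produced by the exploration sketch above; the `(h, s, a, s')` dataset format and the uniform fallback for unvisited triples are illustrative conventions, not prescribed by the paper.

```python
import numpy as np

def estimate_transition_model(dataset, S, A, H):
    """Empirical estimate P_hat[h, s, a] from reward-free transitions (h, s, a, s')."""
    counts = np.zeros((H, S, A, S))
    for h, s, a, s_next in dataset:
        counts[h, s, a, s_next] += 1
    totals = counts.sum(axis=-1, keepdims=True)
    # Unvisited (h, s, a) triples fall back to a uniform distribution as a simple convention.
    P_hat = np.where(totals > 0, counts / np.maximum(totals, 1), 1.0 / S)
    return P_hat
```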

These insights establish UCBZero as a model-free, UCB-based approach for efficient, task-agnostic exploration, enabling a single data collection effort to be leveraged across multiple downstream RL tasks without reward-guided exploration and with near-optimal sample complexity in all core problem dimensions (Zhang et al., 2020).
