
UCBZero: Reward-Free Exploration in RL

Updated 20 November 2025
  • UCBZero is a task-agnostic exploration algorithm for RL that collects reward-free trajectories and uses UCB bonuses in a two-phase protocol to build near-optimal policies.
  • It achieves a sample complexity of $O((\log N + \iota)\, H^5 S A/\epsilon^2)$ with a provable logarithmic dependence on the number of tasks $N$, nearly matching the information-theoretic lower bound, which already requires the $\log N$ factor.
  • The algorithm decouples exploration from reward signals, enabling efficient policy optimization across diverse tasks and accurate transition-model estimation from reward-free episodic data.

UCBZero is a task-agnostic exploration algorithm designed for reinforcement learning (RL) in the absence of reward supervision during exploration. It operates within a tabular, fixed-horizon, episodic Markov Decision Process (MDP) and enables efficient reuse of exploration data to generate near-optimal policies for multiple tasks with different, a priori unknown reward functions. UCBZero achieves near-optimal sample complexity in terms of the number of exploration episodes required to guarantee $\epsilon$-optimality simultaneously across $N$ tasks, with a provably unavoidable logarithmic dependence on $N$ (Zhang et al., 2020).

1. Task-Agnostic RL: Problem Formulation

The setting is a finite, tabular, episodic MDP defined by $(\mathcal{S}, \mathcal{A}, H, P, r)$: $\mathcal{S}$ is the state space ($|\mathcal{S}| = S$), $\mathcal{A}$ is the action space ($|\mathcal{A}| = A$), $H$ is the horizon, $P_h(\cdot \mid s,a)$ is the transition kernel at step $h$, and $r_h(\cdot \mid s,a)$ is the reward distribution with support in $[0,1]$, with all episodes beginning at a fixed initial state $s_1$.

Learning proceeds in two distinct phases:

  • Exploration phase: Over $K$ episodes, the agent explores the MDP without reward feedback, collecting state-action trajectories $D = \{(s_h^k, a_h^k)\}_{h=1,\dots,H,\; k=1,\dots,K}$.
  • Policy-optimization phase: Presented with $N$ tasks, each with an associated unknown reward function $r^{(n)}$, the stored trajectories are augmented with sampled rewards to form $D^{(n)}$. The objective is to identify, for each $n = 1,\dots,N$, a policy $\pi^{(n)}$ such that with probability at least $1-p$,

$$V_1^{*(n)}(s_1) - V_1^{\pi^{(n)}}(s_1) \leq \epsilon$$

for all $N$ tasks, using as few exploration episodes $K$ as possible.
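
For concreteness, here is a minimal Python sketch of this setting; the class name, array layout, and Bernoulli reward sampling are illustrative assumptions made for this summary, not constructs from the paper.

```python
import numpy as np

class TabularEpisodicMDP:
    """Finite episodic MDP (S, A, H, P, r) with a fixed initial state s1 = 0."""

    def __init__(self, S, A, H, P, R, seed=0):
        # P[h, s, a] is a length-S probability vector over next states;
        # R[h, s, a] is a mean reward in [0, 1] (sampled as a Bernoulli below).
        self.S, self.A, self.H = S, A, H
        self.P, self.R = P, R
        self.rng = np.random.default_rng(seed)

    def reset(self):
        self.h, self.s = 0, 0  # every episode starts from the fixed initial state s1
        return self.s

    def step(self, a, with_reward=False):
        # Reward is only revealed when with_reward=True (policy-optimization phase);
        # during task-agnostic exploration the agent observes transitions only.
        r = float(self.rng.random() < self.R[self.h, self.s, a]) if with_reward else None
        s_next = int(self.rng.choice(self.S, p=self.P[self.h, self.s, a]))
        self.h, self.s = self.h + 1, s_next
        return s_next, r
```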

2. UCBZero Algorithmic Principles and Workflow

UCBZero executes an exploration-first, optimization-after protocol based on optimism-driven Q-learning with upper confidence bound (UCB) bonuses.

Notation

  • $N$: number of tasks
  • $K$: number of exploration episodes
  • $p$: confidence parameter
  • $\iota = \log(SAHK/p)$: log factor used in the bonus definition
  • $b_t = c\sqrt{H^3 (\log N + \iota)/t}$: Hoeffding-style UCB bonus
  • $\alpha_t = (H+1)/(H+t)$: learning rate
  • $Q_h(s,a)$: Q-value at step $h$, optimistically initialized to $H$
  • $N_h(s,a)$: state-action visitation count
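
For reference, the bonus and learning rate as they would be computed in code; a hedged sketch, with the absolute constant $c$ left as a parameter since it is not specified here.

```python
import math

def learning_rate(t, H):
    """Step size alpha_t = (H + 1) / (H + t)."""
    return (H + 1) / (H + t)

def bonus(t, H, N, iota, c=1.0):
    """Hoeffding-style bonus b_t = c * sqrt(H^3 (log N + iota) / t); c is an unspecified absolute constant."""
    return c * math.sqrt(H**3 * (math.log(N) + iota) / t)
```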

Pseudocode Outline

Exploration Phase (Zero-reward Q-learning with UCB bonuses)

  • Initialize $Q_h(s,a) \leftarrow H$ and $N_h(s,a) \leftarrow 0$ for all $(s,a,h)$.
  • For each exploration episode $k = 1,\dots,K$:
    • For $h = 1,\dots,H$:
      • Choose $a_h^k = \arg\max_a Q_h(s_h^k, a)$ and observe the next state $s_{h+1}^k$.
      • Set $t \leftarrow N_h(s_h^k, a_h^k) + 1$ and update the count.
      • Set $V_{h+1} \leftarrow \min\{H, \max_{a'} Q_{h+1}(s_{h+1}^k, a')\}$.
      • Update

$$Q_h(s_h^k, a_h^k) \leftarrow (1-\alpha_t)\, Q_h(s_h^k, a_h^k) + \alpha_t\,\bigl[V_{h+1} + 2 b_t\bigr].$$

  • No rewards are used; exploration is driven entirely by the bonus (a runnable sketch follows below).
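
A minimal runnable sketch of the exploration phase, assuming the hypothetical `TabularEpisodicMDP` environment and the `bonus`/`learning_rate` helpers sketched above; it illustrates the stated update rule and is not the authors' reference implementation.

```python
import numpy as np

def ucbzero_explore(env, K, N_tasks, p):
    """Reward-free exploration: zero-reward Q-learning driven only by the UCB bonus."""
    S, A, H = env.S, env.A, env.H
    iota = np.log(S * A * H * K / p)
    Q = np.full((H + 1, S, A), float(H))  # optimistic initialization; layer H is a zero terminal layer
    Q[H] = 0.0
    counts = np.zeros((H, S, A), dtype=int)
    dataset = []  # reward-free transitions (h, s, a, s'), stored episode by episode

    for _ in range(K):
        s = env.reset()
        for h in range(H):
            a = int(np.argmax(Q[h, s]))                 # greedy in the optimistic Q-values
            s_next, _ = env.step(a, with_reward=False)  # no reward observed during exploration
            counts[h, s, a] += 1
            t = counts[h, s, a]
            alpha, b = learning_rate(t, H), bonus(t, H, N_tasks, iota)
            V_next = min(H, Q[h + 1, s_next].max())
            Q[h, s, a] = (1 - alpha) * Q[h, s, a] + alpha * (V_next + 2 * b)
            dataset.append((h, s, a, s_next))
            s = s_next
    return dataset
```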

Policy-Optimization Phase (for each task $n$)

  • For each task $n = 1,\dots,N$: input $D^{(n)} = \{(s_h^k, a_h^k, r_h^k)\}$, the stored trajectories augmented with that task's sampled rewards.
  • Initialize $Q_h(s,a) \leftarrow H$ and $N_h(s,a) \leftarrow 0$.
  • For $k = 1,\dots,K$ (replaying the stored episodes):
    • For $h = 1,\dots,H$:
      • Set $t \leftarrow N_h(s_h^k, a_h^k) + 1$.
      • Set $V_{h+1} \leftarrow \min\{H, \max_{a'} Q_{h+1}(s_{h+1}^k, a')\}$.
      • Update

$$Q_h(s_h^k, a_h^k) \leftarrow (1-\alpha_t)\, Q_h(s_h^k, a_h^k) + \alpha_t\,\bigl[r_h^k + V_{h+1} + b_t\bigr].$$

  • Output: the uniform mixture over the sequence of greedy policies (a sketch follows below).
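
A matching sketch of the per-task planning phase, replaying the stored transitions with that task's sampled rewards; the `rewards` list, its alignment with `dataset`, and the reuse of the helpers above are hypothetical interface choices.

```python
import numpy as np

def ucbzero_plan(dataset, rewards, S, A, H, K, N_tasks, p):
    """Per-task planning by replaying the stored reward-free trajectories.

    `rewards[i]` is the sampled reward attached to the i-th stored transition for this
    task (a hypothetical interface). Returns the greedy policies computed after each
    replayed episode; the output policy is the uniform mixture over them."""
    iota = np.log(S * A * H * K / p)
    Q = np.full((H + 1, S, A), float(H))  # optimistic initialization, zero terminal layer
    Q[H] = 0.0
    counts = np.zeros((H, S, A), dtype=int)
    greedy_policies = []

    for k in range(K):
        for i in range(k * H, (k + 1) * H):   # the H transitions of stored episode k
            h, s, a, s_next = dataset[i]
            counts[h, s, a] += 1
            t = counts[h, s, a]
            alpha, b = learning_rate(t, H), bonus(t, H, N_tasks, iota)  # same (log N + iota) bonus
            V_next = min(H, Q[h + 1, s_next].max())
            Q[h, s, a] = (1 - alpha) * Q[h, s, a] + alpha * (rewards[i] + V_next + b)
        greedy_policies.append(Q[:H].argmax(axis=2))  # greedy policy pi_k: shape (H, S)
    return greedy_policies
```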

Explanation

In the exploration phase, the algorithm uses only the confidence bonus $2b_t$, which instills optimism. In the policy-optimization phase, sampled rewards plus a standard bonus $b_t$ are used, recovering optimistic Q-learning for each task (Zhang et al., 2020).

3. Theoretical Sample Complexity and Optimality Results

UCBZero's theoretical guarantees quantify both its efficiency and the inherent difficulty of the task-agnostic RL problem.

Main Results

Guarantee type | Episodes required | Dependence on $N$
Upper bound | $O((\log N + \iota)\, H^5 S A/\epsilon^2)$ | $\log N$
Lower bound | $\Omega(\log N \cdot H^2 S A/\epsilon^2)$ | $\log N$ (provably necessary)

  • Upper Bound: With probability at least $1-p$, after $K = O((\log N + \iota)\, H^5 S A/\epsilon^2)$ exploration episodes, UCBZero delivers $\epsilon$-optimal policies for all $N$ tasks.
  • Lower Bound: Any $(\epsilon, p)$-correct algorithm must use at least $\Omega(\log N \cdot H^2 S A/\epsilon^2)$ episodes. The logarithmic dependence on $N$ is shown to be information-theoretically unavoidable.

Notation: $\tilde{O}(f) = O(f\,\mathrm{polylog}(f, 1/p, H, S, A))$; $\Omega(\cdot)$ denotes an asymptotic lower bound up to constants (Zhang et al., 2020).
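
For rough intuition only, the upper-bound expression can be evaluated numerically; the snippet below ignores the unspecified absolute constant and uses illustrative problem sizes that are not taken from the paper.

```python
import math

def ucbzero_episode_bound(S, A, H, N, eps, p, K_guess=10**6):
    """Evaluate (log N + iota) * H^5 * S * A / eps^2 with iota = log(S*A*H*K/p); constants ignored."""
    iota = math.log(S * A * H * K_guess / p)
    return (math.log(N) + iota) * H**5 * S * A / eps**2

# Example: S=10, A=5, H=10, N=100 tasks, eps=0.1, p=0.01  ->  about 1.5e10 (up to constants)
print(f"{ucbzero_episode_bound(10, 5, 10, 100, 0.1, 0.01):.2e}")
```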

4. Technical Proof Sketch: Regret Decomposition and Coverage

The sample complexity results rest on principles of optimism, Q-regret decomposition, and careful control of empirical value estimates.

  • Optimism & Q-Regret Decomposition: Define a pseudo-MDP with zero rewards to compare the Q-updates under pure exploration ($Q_h^k$) and under per-task learning ($Q_h^{k,(n)}$). Using induction over $h$ and Azuma–Hoeffding concentration,

$$Q_h^{k,(n)}(s,a) - Q_h^{\pi_k,(n)}(s,a) \leq Q_h^k(s,a) - Q_h^{\pi_k}(s,a),$$

where $Q_h^{\pi_k}$ is identically zero under the zero-reward MDP.

  • Aggregate Regret Control: The aggregate Q-regret across episodes,

$$\sum_{k=1}^K \bigl[V_1^k - V_1^{\pi_k}\bigr],$$

is upper-bounded by $O\bigl(\sqrt{(\log N + \iota)\, H^5 S A K}\bigr)$. By the regret-difference lemma, the same bound transfers to each of the $N$ tasks simultaneously.

  • Implication: Averaging over the $K$ episodes and solving for $\epsilon$ yields $K = O((\log N + \iota)\, H^5 S A/\epsilon^2)$. This matches the lower bound in its dependence on $S$, $A$, $\epsilon$, and $\log N$, leaving an $H^3$ gap in the horizon dependence (Zhang et al., 2020).
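
To make the last step explicit (constants suppressed, and using the per-task form of the aggregate bound above): the output policy $\bar\pi^{(n)}$ is the uniform mixture of the greedy policies $\pi_1,\dots,\pi_K$, so its suboptimality equals the average episode regret,

$$V_1^{*(n)}(s_1) - V_1^{\bar\pi^{(n)}}(s_1) \;=\; \frac{1}{K}\sum_{k=1}^K \bigl[V_1^{*(n)}(s_1) - V_1^{\pi_k,(n)}(s_1)\bigr] \;\leq\; \sqrt{\frac{(\log N + \iota)\, H^5 S A}{K}},$$

and requiring the right-hand side to be at most $\epsilon$ gives $K = O\bigl((\log N + \iota)\, H^5 S A/\epsilon^2\bigr)$.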

5. Reward-Free RL: The Known-Reward Variant

When the reward functions $r^{(n)}$ for the $N$ tasks are known at the planning stage (the reward-free RL framework), the sample complexity becomes independent of $N$.

  • Key Result: By covering the space of reward functions with an $\epsilon$-net, the complexity reduces to

$$K = O\bigl(H^6 S^2 A^2 (\log(3H/\epsilon) + \iota)/\epsilon^2\bigr),$$

uniformly over all reward functions, so the explicit $N$-dependence disappears.

  • Conceptual Mechanism: If two reward functions differ by at most $\epsilon/H$ on every transition, then a single near-optimal policy suffices for both (by the Simulation Lemma). Discretizing the reward space therefore yields an effective task number $N \approx M^{HSA}$ with $M = O(H/\epsilon)$, and substituting $\log N \approx HSA \log(H/\epsilon)$ into the task-agnostic bound gives the stated $N$-independent complexity (Zhang et al., 2020); see the calculation below.
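
Written out, the substitution behind this reduction (net granularity and constants suppressed):

$$\log N \;\approx\; HSA\,\log(3H/\epsilon) \quad\Longrightarrow\quad (\log N + \iota)\,\frac{H^5 S A}{\epsilon^2} \;\lesssim\; \frac{H^6 S^2 A^2 \bigl(\log(3H/\epsilon) + \iota\bigr)}{\epsilon^2},$$

which recovers the bound stated above.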

6. Coverage, Model Estimation, and Practical Behavior

  • Visitation Guarantee: UCBZero offers a uniform lower bound on (state, action, timestep) coverage; every triplet is visited at least

$$\Omega\!\left(\frac{K\,\delta_h(s)^2}{H^2 S A}\right)$$

times, where $\delta_h(s) = \max_\pi \Pr^\pi(s_h = s)$, ensuring sufficient empirical coverage.

  • Transition Model Estimation: From the zero-reward trajectories, it is possible to estimate a transition model $\hat P_h$ such that

$$|\hat P_h(s' \mid s,a) - P_h(s' \mid s,a)| \leq \epsilon/\delta_h(s)$$

after $O(H^5 S A\, \iota/\epsilon^2)$ episodes, validating the use of any off-policy or batch RL method on the collected data (an estimation sketch follows after this list).

  • Conceptual Insights: UCBZero demonstrates that a combination of exhaustive exploration (without using reward signals) and Q-learning optimism suffices to match the best-reported sample efficiency for task-specific exploration, up to a $\log N$ factor. This $\log N$ is proved to be unavoidable when optimizing for $N$ tasks from a single pool of exploration data.
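
A minimal sketch of estimating $\hat P_h$ by empirical frequencies from the reward-free dataset produced by the exploration sketch above; the `(h, s, a, s')` dataset format and the uniform fallback for unvisited triples are illustrative conventions, not prescribed by the paper.

```python
import numpy as np

def estimate_transition_model(dataset, S, A, H):
    """Empirical estimate P_hat[h, s, a] from reward-free transitions (h, s, a, s')."""
    counts = np.zeros((H, S, A, S))
    for h, s, a, s_next in dataset:
        counts[h, s, a, s_next] += 1
    totals = counts.sum(axis=-1, keepdims=True)
    # Unvisited (h, s, a) triples fall back to a uniform distribution as a simple convention.
    P_hat = np.where(totals > 0, counts / np.maximum(totals, 1), 1.0 / S)
    return P_hat
```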

These insights establish UCBZero as a model-free, UCB-based approach for efficient, task-agnostic exploration, enabling a single data collection effort to be leveraged across multiple downstream RL tasks without reward-guided exploration and with near-optimal sample complexity in all core problem dimensions (Zhang et al., 2020).
