UCBZero: Reward-Free Exploration in RL
- UCBZero is a task-agnostic exploration algorithm for RL that collects reward-free trajectories and uses UCB bonuses in a two-phase protocol to build near-optimal policies.
- It achieves a sample complexity of $O\big((\log N + \iota)\, H^5 S A / \varepsilon^2\big)$ exploration episodes, with a logarithmic dependence on the number of tasks $N$ that matches the information-theoretic lower bound.
- The algorithm decouples exploration from reward signals, enabling efficient policy optimization across diverse tasks and accurate transition-model estimation from reward-free episodic data.
UCBZero is a task-agnostic exploration algorithm designed for reinforcement learning (RL) in the absence of reward supervision during exploration. It operates within a tabular, fixed-horizon, episodic Markov Decision Process (MDP) and enables efficient reuse of exploration data to generate near-optimal policies for multiple tasks with different, a priori unknown reward functions. UCBZero achieves near-optimal sample complexity in terms of the number of exploration episodes required to guarantee $\varepsilon$-optimality simultaneously across all $N$ tasks, with a provably unavoidable logarithmic dependence on $N$ (Zhang et al., 2020).
1. Task-Agnostic RL: Problem Formulation
The setting is a finite, tabular, episodic MDP defined by $M = (\mathcal{S}, \mathcal{A}, H, P, r)$: $\mathcal{S}$ is the state space ($|\mathcal{S}| = S$), $\mathcal{A}$ is the action space ($|\mathcal{A}| = A$), $H$ is the horizon, $P_h(\cdot \mid s, a)$ is the transition kernel at step $h$, and $r_h(s, a)$ is the reward distribution, with all episodes beginning at a fixed initial state $s_1$.
Learning proceeds in two distinct phases:
- Exploration phase: Over $K$ episodes, the agent explores the MDP without reward feedback, collecting state-action trajectories $\{(s_h^k, a_h^k)\}_{h \in [H],\, k \in [K]}$.
- Policy-optimization phase: Presented with $N$ tasks, each with an associated unknown reward function $r^n = \{r_h^n\}_{h \in [H]}$, the trajectories are augmented with sampled rewards to form per-task datasets $\mathcal{D}^n = \{(s_h^k, a_h^k, r_h^{k,n})\}$. The objective is to identify, for each task $n$, a policy $\pi^n$ such that, with high probability (at least $1 - p$),

$$V_1^{\pi^n}(s_1; r^n) \;\ge\; V_1^{*}(s_1; r^n) - \varepsilon$$

for all $N$ tasks, using as few exploration episodes $K$ as possible (a minimal protocol skeleton is sketched below).
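The data flow of this two-phase protocol can be pinned down in a short Python skeleton. Everything here (the `Trajectory` alias, the `explore`/`plan_for_task` callables, and the `reward_fns` signature) is an illustrative interface chosen for this summary, not notation from the paper:

```python
from typing import Callable, List, Tuple

# One reward-free trajectory: the visited states s_1..s_{H+1} and actions a_1..a_H.
Trajectory = Tuple[List[int], List[int]]

def task_agnostic_rl(
    explore: Callable[[int], List[Trajectory]],                     # phase 1: reward-free exploration
    plan_for_task: Callable[[List[Trajectory], Callable], object],  # phase 2: per-task planning
    reward_fns: List[Callable[[int, int, int], float]],             # r^n(h, s, a) for each task n
    num_episodes: int,
) -> List[object]:
    """Run the two-phase protocol: explore once without rewards, then plan per task."""
    # Phase 1: collect K reward-free trajectories; rewards are never observed here.
    dataset = explore(num_episodes)
    # Phase 2: the same dataset, augmented with each task's sampled rewards inside
    # plan_for_task, yields one (near-optimal) policy per task.
    return [plan_for_task(dataset, r_n) for r_n in reward_fns]
```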
2. UCBZero Algorithmic Principles and Workflow
UCBZero executes an exploration-first, optimization-after protocol based on optimism-driven Q-learning with upper confidence bound (UCB) bonuses.
Notation
- $N$: Number of tasks
- $K$: Number of exploration episodes
- $p$: Confidence (failure-probability) parameter
- $\iota = \log(SAKH/p)$, used in the bonus definition (see the helper sketch after this list)
- $b_t = c\sqrt{H^3 (\log N + \iota)/t}$: Hoeffding-style UCB bonus, with $c$ an absolute constant
- $\alpha_t = \frac{H+1}{H+t}$: Learning rate
- $Q_h(s, a)$: Q-value at step $h$, optimistically initialized to $H$
- $N_h(s, a)$: State-action visitation count
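A minimal Python sketch of these quantities under the conventions above; the absolute constant `c` is left unspecified in the analysis and is given a placeholder value here:

```python
import math

def iota(S: int, A: int, K: int, H: int, p: float) -> float:
    """Log factor iota = log(SAKH / p) used inside the confidence bonus."""
    return math.log(S * A * K * H / p)

def learning_rate(t: int, H: int) -> float:
    """alpha_t = (H + 1) / (H + t): decays like 1/t but weights recent targets more."""
    return (H + 1) / (H + t)

def bonus(t: int, H: int, iota_val: float, log_N: float, c: float = 1.0) -> float:
    """Hoeffding-style UCB bonus b_t = c * sqrt(H^3 (log N + iota) / t).
    c is an unspecified absolute constant; 1.0 is a placeholder, not the paper's value."""
    return c * math.sqrt(H ** 3 * (log_N + iota_val) / t)
```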
Pseudocode Outline
Exploration Phase (Zero-reward Q-learning with UCB bonuses)
- Initialize $Q_h(s, a) \leftarrow H$ and $N_h(s, a) \leftarrow 0$ for all $(h, s, a)$; set $V_{H+1}(\cdot) \equiv 0$
- For $k = 1$ to $K$ exploration episodes:
  - For $h = 1$ to $H$:
    - Choose $a_h \leftarrow \arg\max_a Q_h(s_h, a)$
    - Observe next state $s_{h+1}$
    - $t \leftarrow N_h(s_h, a_h) + 1$; update count $N_h(s_h, a_h) \leftarrow t$
    - Update: $Q_h(s_h, a_h) \leftarrow (1 - \alpha_t)\, Q_h(s_h, a_h) + \alpha_t \big[V_{h+1}(s_{h+1}) + b_t\big]$, where $V_{h+1}(s) = \min\{H, \max_a Q_{h+1}(s, a)\}$
- No rewards are used; exploration is driven entirely by the bonus $b_t$ (see the Python sketch below).
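A compact NumPy sketch of the exploration phase under stated assumptions: states and actions are integer indices, `step(h, s, a)` is a hypothetical environment callback returning only the next state, and `b(t)` is the bonus from the helper sketch above.

```python
import numpy as np

def ucbzero_explore(step, S, A, H, K, b):
    """Zero-reward Q-learning with UCB bonuses (exploration phase sketch).

    step(h, s, a) -> next-state index (hypothetical environment interface).
    b(t) -> UCB bonus for the t-th visit of a (h, s, a) triple.
    Returns the collected trajectories for later per-task planning.
    """
    Q = np.full((H + 1, S, A), float(H))  # optimistic initialization Q_h <- H
    Q[H] = 0.0                            # enforces V_{H+1} = 0
    counts = np.zeros((H, S, A), dtype=int)
    trajectories = []

    for k in range(K):
        s = 0  # fixed initial state s_1 (index 0 by convention in this sketch)
        states, actions = [s], []
        for h in range(H):
            a = int(np.argmax(Q[h, s]))               # greedy w.r.t. optimistic Q
            s_next = step(h, s, a)                    # no reward is observed
            counts[h, s, a] += 1
            t = counts[h, s, a]
            alpha = (H + 1) / (H + t)                 # learning rate alpha_t
            v_next = min(H, Q[h + 1, s_next].max())   # V_{h+1}(s') = min(H, max_a Q_{h+1})
            Q[h, s, a] = (1 - alpha) * Q[h, s, a] + alpha * (v_next + b(t))
            states.append(s_next)
            actions.append(a)
            s = s_next
        trajectories.append((states, actions))
    return trajectories
```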
Policy-Optimization Phase (for each task $n$)
- For $n = 1, \dots, N$: Input the dataset $\mathcal{D}^n = \{(s_h^k, a_h^k, r_h^{k,n})\}$
- Initialize $Q_h(s, a) \leftarrow H$ and $N_h(s, a) \leftarrow 0$ for all $(h, s, a)$
- For $k = 1$ to $K$:
  - For $h = 1$ to $H$:
    - Update: $t \leftarrow N_h(s_h^k, a_h^k) + 1$; $\;Q_h(s_h^k, a_h^k) \leftarrow (1 - \alpha_t)\, Q_h(s_h^k, a_h^k) + \alpha_t \big[r_h^{k,n} + V_{h+1}(s_{h+1}^k) + b_t\big]$
- Output: a uniform mixture over the sequence of greedy policies induced by the successive Q-estimates.
Explanation
In the exploration phase, the algorithm uses only the confidence bonus $b_t$, instilling optimism. In the policy-optimization phase, sampled rewards plus the same Hoeffding-style bonus are used, recovering standard optimistic Q-learning for each task (Zhang et al., 2020); a code sketch of this replay step follows.
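A matching sketch of the policy-optimization phase: the stored trajectories are replayed once with the task's rewards. The `reward(h, s, a)` callback is an illustrative stand-in for the sampled rewards $r_h^{k,n}$ stored in $\mathcal{D}^n$, and `b(t)` is again the bonus helper from above.

```python
import numpy as np

def ucbzero_plan(trajectories, reward, S, A, H, b):
    """Replay stored trajectories with task-specific rewards (optimistic Q-learning).

    reward(h, s, a) -> sampled reward for this task (hypothetical interface).
    Returns the list of greedy policies, one per replayed episode; the output
    policy is a uniform mixture over this sequence.
    """
    Q = np.full((H + 1, S, A), float(H))
    Q[H] = 0.0
    counts = np.zeros((H, S, A), dtype=int)
    greedy_policies = []

    for states, actions in trajectories:
        for h in range(H):
            s, a, s_next = states[h], actions[h], states[h + 1]
            counts[h, s, a] += 1
            t = counts[h, s, a]
            alpha = (H + 1) / (H + t)
            v_next = min(H, Q[h + 1, s_next].max())
            Q[h, s, a] = (1 - alpha) * Q[h, s, a] + alpha * (reward(h, s, a) + v_next + b(t))
        greedy_policies.append(Q[:H].argmax(axis=2))  # pi_k(h, s) = argmax_a Q_h(s, a)
    return greedy_policies
```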
3. Theoretical Sample Complexity and Optimality Results
UCBZero's theoretical guarantees quantify both its efficiency and the inherent difficulty of the task-agnostic RL problem.
Main Results
| Guarantee Type | Episodes Required | Dependence on $N$ |
|---|---|---|
| Upper bound (UCBZero) | $O\big((\log N + \iota)\, H^5 S A / \varepsilon^2\big)$ | $\log N$ |
| Lower bound (any algorithm) | $\Omega\big(H^2 S A \log N / \varepsilon^2\big)$ | $\log N$ (provably necessary) |
- Upper Bound: With probability at least $1-p$, after $K = O\big((\log N + \iota)\, H^5 S A / \varepsilon^2\big)$ exploration episodes, UCBZero delivers $\varepsilon$-optimal policies for all $N$ tasks.
- Lower Bound: Any $(\varepsilon, p)$-correct task-agnostic algorithm must use at least $\Omega\big(H^2 S A \log N / \varepsilon^2\big)$ episodes. The logarithmic dependence on $N$ is shown to be information-theoretically unavoidable.
Notation: $\iota = \log(SAKH/p)$; $\Omega(\cdot)$ denotes an asymptotic lower bound up to constants (Zhang et al., 2020).
4. Technical Proof Sketch: Regret Decomposition and Coverage
The sample complexity results rest on principles of optimism, Q-regret decomposition, and careful control of empirical value estimates.
- Optimism & Q-Regret Decomposition: Define a pseudo-MDP with zero rewards and compare the Q-updates performed under pure exploration (bonus only) with those of per-task learning (sampled reward plus bonus). Induction over steps combined with Azuma–Hoeffding concentration shows that both sets of estimates remain optimistic and that their difference is controlled by the bonus terms, since the optimal value function of the zero-reward pseudo-MDP is identically zero.
- Aggregate Regret Control: The aggregate Q-regret across the $K$ exploration episodes,
  $$\sum_{k=1}^{K}\Big(V_1^{*}(s_1; r^n) - V_1^{\pi^k}(s_1; r^n)\Big),$$
  is upper-bounded by $\tilde{O}\big(\sqrt{H^5 S A\, K\, (\log N + \iota)}\big)$. By the regret-difference lemma, this bound holds for every one of the $N$ tasks individually.
- Implication: Averaging over the $K$ episodes and solving for $K$, the $O\big((\log N + \iota)\, H^5 S A / \varepsilon^2\big)$ bound emerges; a schematic version of this step appears below. This matches the lower bound in its $\log N$, $SA$, and $\varepsilon^{-2}$ dependence, apart from the gap in the exponent of $H$ (Zhang et al., 2020).
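A schematic version of this averaging step, assuming the aggregate regret bound stated above (absolute constants suppressed):

```latex
% Assumed aggregate Q-regret bound over K exploration episodes (constants suppressed):
\[
  \sum_{k=1}^{K}\Big(V_1^{*}(s_1; r^n) - V_1^{\pi^k}(s_1; r^n)\Big)
  \;\lesssim\; \sqrt{H^{5} S A \, K \, (\log N + \iota)}.
\]
% The output policy is a uniform mixture over the K greedy policies, so its
% suboptimality is the average of the per-episode regrets; requiring this
% average to be at most epsilon and solving for K:
\[
  \frac{1}{K}\sqrt{H^{5} S A \, K \, (\log N + \iota)} \;\le\; \varepsilon
  \quad\Longleftrightarrow\quad
  K \;\ge\; \frac{(\log N + \iota)\, H^{5} S A}{\varepsilon^{2}}.
\]
```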
5. Reward-Free RL: The Known-Reward Variant
When the reward functions for the tasks are known at the planning stage (the reward-free RL framework), the sample complexity becomes independent of $N$.
- Key Result: By covering the reward-function space with an $\varepsilon$-net, the $\log N$ term in the UCBZero bound is replaced by the log-covering number $\log N_\varepsilon = \tilde{O}(HSA)$, yielding a guarantee of the form $\tilde{O}\big((\log N_\varepsilon + \iota)\, H^5 S A / \varepsilon^2\big)$ that holds simultaneously for all reward functions and hence for all policies. The explicit $N$-dependence disappears.
- Conceptual Mechanism: If two reward functions differ by at most $\varepsilon/H$ on every (state, action, step), then a single near-optimal policy suffices for both (by the Simulation Lemma). Discretizing the reward space at this granularity yields an effective task number $N_\varepsilon$, with $\log N_\varepsilon = \tilde{O}(HSA)$. This permits absorbing the task dependence into a description-length factor scaling with $HSA$ (Zhang et al., 2020); the covering computation is sketched below.
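A sketch of the covering count behind this mechanism, assuming entrywise discretization of the $[0,1]$-valued reward table at resolution $\varepsilon/H$:

```latex
% Reward tables r = {r_h(s,a)} take values in [0,1] on S x A x H entries.
% Discretizing each entry at resolution eps/H gives the effective task count:
\[
  N_{\varepsilon} \;\le\; \Big(\tfrac{H}{\varepsilon}\Big)^{SAH}
  \quad\Longrightarrow\quad
  \log N_{\varepsilon} \;\le\; SAH \log\!\Big(\tfrac{H}{\varepsilon}\Big) \;=\; \tilde{O}(HSA).
\]
% Substituting log N_eps for log N in the UCBZero bound removes the explicit
% dependence on the number of tasks N.
```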
6. Coverage, Model Estimation, and Practical Behavior
- Visitation Guarantee: UCBZero offers a lower bound on (state, action, timestep) coverage: every triplet $(s, a, h)$ is visited a number of times that scales with the number of exploration episodes $K$ and the maximal probability with which any policy can reach $(s, a)$ at step $h$ (up to polynomial factors in $H$ and logarithmic terms), ensuring sufficient empirical coverage.
- Transition Model Estimation: From zero-reward trajectories, it is possible to estimate a transition model $\hat{P}$ whose error is small enough for accurate planning after polynomially many episodes, validating the use of any off-policy or batch RL method on the collected data (a simple estimator is sketched after this list).
- Conceptual Insights: UCBZero demonstrates that a combination of exhaustive exploration (without using reward signals) and Q-learning optimism suffices to match the best-reported sample efficiency for task-specific exploration, up to a $\log N$ factor. This factor is proved to be unavoidable when optimizing for $N$ tasks from a single pool of exploration data.
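A minimal sketch of such an empirical transition-model estimate built from the stored reward-free trajectories (same trajectory format as the exploration sketch above); the uniform fallback for unvisited triples is an assumption of this sketch, not part of the paper's estimator:

```python
import numpy as np

def estimate_transitions(trajectories, S, A, H):
    """Empirical transition model P_hat[h, s, a, s'] from reward-free trajectories.

    Unvisited (h, s, a) triples default to a uniform distribution as a simple
    (assumed) fallback; the coverage guarantee above makes such triples rare.
    """
    counts = np.zeros((H, S, A, S))
    for states, actions in trajectories:
        for h in range(H):
            counts[h, states[h], actions[h], states[h + 1]] += 1

    totals = counts.sum(axis=3, keepdims=True)  # visit counts per (h, s, a)
    P_hat = np.where(totals > 0, counts / np.maximum(totals, 1), 1.0 / S)
    return P_hat
```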
These insights establish UCBZero as a model-free, UCB-based approach for efficient, task-agnostic exploration, enabling a single data collection effort to be leveraged across multiple downstream RL tasks without reward-guided exploration and with near-optimal sample complexity in all core problem dimensions (Zhang et al., 2020).