Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
143 tokens/sec
GPT-4o
7 tokens/sec
Gemini 2.5 Pro Pro
46 tokens/sec
o3 Pro
4 tokens/sec
GPT-4.1 Pro
38 tokens/sec
DeepSeek R1 via Azure Pro
28 tokens/sec
2000 character limit reached

Q-learning with UCB Exploration is Sample Efficient for Infinite-Horizon MDP (1901.09311v2)

Published 27 Jan 2019 in cs.LG and stat.ML

Abstract: A fundamental question in reinforcement learning is whether model-free algorithms are sample efficient. Recently, Jin et al. \cite{jin2018q} proposed a Q-learning algorithm with UCB exploration policy, and proved it has nearly optimal regret bound for finite-horizon episodic MDP. In this paper, we adapt Q-learning with UCB-exploration bonus to infinite-horizon MDP with discounted rewards \emph{without} accessing a generative model. We show that the \textit{sample complexity of exploration} of our algorithm is bounded by $\tilde{O}({\frac{SA}{\epsilon2(1-\gamma)7}})$. This improves the previously best known result of $\tilde{O}({\frac{SA}{\epsilon4(1-\gamma)8}})$ in this setting achieved by delayed Q-learning \cite{strehl2006pac}, and matches the lower bound in terms of $\epsilon$ as well as $S$ and $A$ except for logarithmic factors.

Citations (93)

Summary

We haven't generated a summary for this paper yet.