Is Q-Learning Minimax Optimal? A Tight Sample Complexity Analysis (2102.06548v4)

Published 12 Feb 2021 in stat.ML, cs.IT, cs.LG, math.IT, math.OC, math.ST, and stat.TH

Abstract: Q-learning, which seeks to learn the optimal Q-function of a Markov decision process (MDP) in a model-free fashion, lies at the heart of reinforcement learning. When it comes to the synchronous setting (such that independent samples for all state-action pairs are drawn from a generative model in each iteration), substantial progress has been made towards understanding the sample efficiency of Q-learning. Consider a $\gamma$-discounted infinite-horizon MDP with state space $\mathcal{S}$ and action space $\mathcal{A}$: to yield an entrywise $\varepsilon$-approximation of the optimal Q-function, state-of-the-art theory for Q-learning requires a sample size exceeding the order of $\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^{5}\varepsilon^{2}}$, which fails to match existing minimax lower bounds. This gives rise to natural questions: what is the sharp sample complexity of Q-learning? Is Q-learning provably sub-optimal? This paper addresses these questions for the synchronous setting: (1) when $|\mathcal{A}|=1$ (so that Q-learning reduces to TD learning), we prove that the sample complexity of TD learning is minimax optimal and scales as $\frac{|\mathcal{S}|}{(1-\gamma)^{3}\varepsilon^{2}}$ (up to log factor); (2) when $|\mathcal{A}|\geq 2$, we settle the sample complexity of Q-learning to be on the order of $\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^{4}\varepsilon^{2}}$ (up to log factor). Our theory unveils the strict sub-optimality of Q-learning when $|\mathcal{A}|\geq 2$, and rigorizes the negative impact of over-estimation in Q-learning. Finally, we extend our analysis to accommodate asynchronous Q-learning (i.e., the case with Markovian samples), sharpening the horizon dependency of its sample complexity to be $\frac{1}{(1-\gamma)^4}$.
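
To make the synchronous setting in the abstract concrete, here is a minimal sketch of tabular Q-learning with a generative model: in every iteration, one independent next-state sample is drawn for each state-action pair, and all entries of the Q-table are updated at once. The step-size schedule $\eta_t = 1/(1+(1-\gamma)t)$, the function names, and the value-iteration helper used as a reference are assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np

def synchronous_q_learning(P, r, gamma, num_iters, rng):
    """Synchronous tabular Q-learning with a generative model (illustrative sketch).

    P: transition kernel of shape (S, A, S); r: rewards of shape (S, A) in [0, 1].
    """
    S, A = r.shape
    Q = np.zeros((S, A))
    cdf = np.cumsum(P, axis=2)  # per-(s, a) CDF used for inverse-CDF sampling
    for t in range(1, num_iters + 1):
        # Assumed step size (a rescaled linear schedule), not prescribed by the source.
        eta = 1.0 / (1.0 + (1.0 - gamma) * t)
        # Generative model: one independent next state for every (s, a) pair.
        u = rng.random((S, A, 1))
        next_states = np.minimum((u > cdf).sum(axis=2), S - 1)
        # Empirical Bellman target built from the greedy value at the sampled states.
        V = Q.max(axis=1)
        target = r + gamma * V[next_states]
        Q = (1.0 - eta) * Q + eta * target
    return Q

def optimal_q(P, r, gamma, tol=1e-10):
    """Reference optimal Q-function via value iteration (requires knowing P)."""
    Q = np.zeros_like(r)
    while True:
        Q_new = r + gamma * (P @ Q.max(axis=1))
        if np.abs(Q_new - Q).max() < tol:
            return Q_new
        Q = Q_new
```

Each iteration consumes $|\mathcal{S}||\mathcal{A}|$ samples, so the total sample size after $T$ iterations is $T|\mathcal{S}||\mathcal{A}|$; the entrywise error can be measured as `np.abs(Q - optimal_q(P, r, gamma)).max()` on a small random MDP.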

Authors (5)

Gen Li (143 papers)
Changxiao Cai (11 papers)
Yuxin Chen (195 papers)
Yuting Wei (47 papers)
Yuejie Chi (109 papers)

Citations (68)
