Model Selection for Generic Reinforcement Learning (2107.05849v2)

Published 13 Jul 2021 in stat.ML, cs.IT, cs.LG, and math.IT

Abstract: We address the problem of model selection for the finite horizon episodic Reinforcement Learning (RL) problem where the transition kernel $P^*$ belongs to a family of models $\mathcal{P}^*$ with finite metric entropy. In the model selection framework, instead of $\mathcal{P}^*$, we are given $M$ nested families of transition kernels $\mathcal{P}_1 \subset \mathcal{P}_2 \subset \ldots \subset \mathcal{P}_M$. We propose and analyze a novel algorithm, namely \emph{Adaptive Reinforcement Learning (General)} (\texttt{ARL-GEN}) that adapts to the smallest such family where the true transition kernel $P^*$ lies. \texttt{ARL-GEN} uses the Upper Confidence Reinforcement Learning (\texttt{UCRL}) algorithm with value targeted regression as a blackbox and puts a model selection module at the beginning of each epoch. Under a mild separability assumption on the model classes, we show that \texttt{ARL-GEN} obtains a regret of $\tilde{\mathcal{O}}(d_{\mathcal{E}}^* H^2+\sqrt{d_{\mathcal{E}}^* \mathbb{M}^* H^2 T})$, with high probability, where $H$ is the horizon length, $T$ is the total number of steps, $d_{\mathcal{E}}^*$ is the Eluder dimension and $\mathbb{M}^*$ is the metric entropy corresponding to $\mathcal{P}^*$. Note that this regret scaling matches that of an oracle that knows $\mathcal{P}^*$ in advance. We show that the cost of model selection for \texttt{ARL-GEN} is an additive term in the regret having a weak dependence on $T$. Subsequently, we remove the separability assumption and consider the setup of linear mixture MDPs, where the transition kernel $P^*$ has a linear function approximation. With this low rank structure, we propose novel adaptive algorithms for model selection, and obtain (order-wise) regret identical to that of an oracle with knowledge of the true model class.
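The abstract describes a model selection module that runs at the beginning of each epoch and picks the smallest nested family containing the true kernel. The toy sketch below illustrates only that control flow and is not the paper's procedure: the per-class fit losses, the threshold rule, the doubling epoch schedule, and the names `select_smallest_class` and `doubling_epochs` are all assumptions made for illustration.

```python
from typing import List, Sequence


def select_smallest_class(fit_losses: Sequence[float], threshold: float) -> int:
    """Return the index of the smallest nested class that fits well enough.

    fit_losses[m] is a hypothetical goodness-of-fit statistic for class m,
    with classes ordered from smallest to largest. The threshold test is an
    assumed rule, not the statistic used in the paper.
    """
    for m, loss in enumerate(fit_losses):
        if loss <= threshold:
            return m
    # No class passed the test: fall back to the largest family.
    return len(fit_losses) - 1


def doubling_epochs(total_episodes: int, first_len: int = 1) -> List[int]:
    """Split total_episodes into epochs whose lengths double (assumed schedule)."""
    lengths: List[int] = []
    remaining = total_episodes
    length = first_len
    while remaining > 0:
        step = min(length, remaining)  # the last epoch may be truncated
        lengths.append(step)
        remaining -= step
        length *= 2
    return lengths
```

In a full loop, one would call `select_smallest_class` once per epoch on statistics computed from the data gathered so far, then run the base learner restricted to the chosen class for that epoch's length.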
