Policy Zooming: Adaptive Discretization-based Infinite-Horizon Average-Reward Reinforcement Learning (2405.18793v3)
Abstract: We study Lipschitz MDPs in the infinite-horizon average-reward reinforcement learning (RL) setting, in which an agent can play policies from a given set $\Phi$. The proposed algorithms ``zoom'' into ``promising'' regions of the policy space, thereby achieving adaptivity gains. We upper bound their regret as $\tilde{\mathcal{O}}\big(T^{1 - d_{\text{eff.}}^{-1}}\big)$, where $d_{\text{eff.}} = d^\Phi_z + 2$ for the model-free algorithm~\textit{PZRL-MF} and $d_{\text{eff.}} = 2d_{\mathcal{S}} + d^\Phi_z + 3$ for the model-based algorithm~\textit{PZRL-MB}. Here, $d_{\mathcal{S}}$ is the dimension of the state space, and $d^\Phi_z$ is the zooming dimension, a problem-dependent quantity that depends not only on the underlying MDP but also on the class $\Phi$. This yields low regret when the agent competes against a low-complexity $\Phi$ (one with a small $d^\Phi_z$). We note that preexisting notions of zooming dimension are ill-suited to non-episodic RL and do not yield adaptivity gains. The current work shows how to capture adaptivity gains for infinite-horizon average-reward RL in terms of $d^\Phi_z$. When specialized to the case of a finite-dimensional policy space, we obtain that $d_{\text{eff.}}$ scales as the dimension of this space under mild technical conditions; we also obtain $d_{\text{eff.}} = 0$, or equivalently $\tilde{\mathcal{O}}(\sqrt{T})$ regret for \textit{PZRL-MF}, under a curvature condition on the average-reward function that is commonly used in the multi-armed bandit (MAB) literature. Simulation experiments validate the gains arising from adaptivity.
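As a quick restatement of these guarantees (notation as in the abstract, with $d^\Phi_z$ the zooming dimension and $d_{\mathcal{S}}$ the state-space dimension), substituting the stated values of $d_{\text{eff.}}$ into the regret exponent $1 - d_{\text{eff.}}^{-1}$ gives
\[
\textit{PZRL-MF}:\ \tilde{\mathcal{O}}\Big(T^{1 - \frac{1}{d^\Phi_z + 2}}\Big) = \tilde{\mathcal{O}}\Big(T^{\frac{d^\Phi_z + 1}{d^\Phi_z + 2}}\Big),
\qquad
\textit{PZRL-MB}:\ \tilde{\mathcal{O}}\Big(T^{1 - \frac{1}{2 d_{\mathcal{S}} + d^\Phi_z + 3}}\Big).
\]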