Policy Zooming: Adaptive Discretization-based Infinite-Horizon Average-Reward Reinforcement Learning (2405.18793v3)
Abstract: We study Lipschitz MDPs in the infinite-horizon average-reward reinforcement learning (RL) setting, in which an agent can play policies from a given set $\Phi$. The proposed algorithms ``zoom'' into ``promising'' regions of the policy space, thereby achieving adaptivity gains. We upper bound their regret as $\tilde{\mathcal{O}}\big(T^{1 - d_{\text{eff.}}^{-1}}\big)$, where $d_{\text{eff.}} = d^\Phi_z + 2$ for the model-free algorithm~\textit{PZRL-MF} and $d_{\text{eff.}} = 2d_{\mathcal{S}} + d^\Phi_z + 3$ for the model-based algorithm~\textit{PZRL-MB}. Here, $d_{\mathcal{S}}$ is the dimension of the state space, and $d^\Phi_z$ is the zooming dimension, a problem-dependent quantity that depends not only on the underlying MDP but also on the class $\Phi$. This yields low regret when the agent competes against a low-complexity $\Phi$ (one with a small $d^\Phi_z$). We note that preexisting notions of zooming dimension are ill-suited to non-episodic RL and do not yield adaptivity gains. The current work shows how to capture adaptivity gains for infinite-horizon average-reward RL in terms of $d^\Phi_z$. When specialized to the case of a finite-dimensional policy space, we obtain that $d_{\text{eff.}}$ scales as the dimension of this space under mild technical conditions; we also obtain $d_{\text{eff.}} = 0$, or equivalently $\tilde{\mathcal{O}}(\sqrt{T})$ regret for \textit{PZRL-MF}, under a curvature condition on the average-reward function that is commonly used in the multi-armed bandit (MAB) literature. Simulation experiments validate the gains arising from adaptivity.
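As a quick restatement of these guarantees (notation as in the abstract, with $d^\Phi_z$ the zooming dimension and $d_{\mathcal{S}}$ the state-space dimension), substituting the stated values of $d_{\text{eff.}}$ into the regret exponent $1 - d_{\text{eff.}}^{-1}$ gives
\[
\textit{PZRL-MF}:\ \tilde{\mathcal{O}}\Big(T^{1 - \frac{1}{d^\Phi_z + 2}}\Big) = \tilde{\mathcal{O}}\Big(T^{\frac{d^\Phi_z + 1}{d^\Phi_z + 2}}\Big),
\qquad
\textit{PZRL-MB}:\ \tilde{\mathcal{O}}\Big(T^{1 - \frac{1}{2 d_{\mathcal{S}} + d^\Phi_z + 3}}\Big).
\]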