Provably Efficient Infinite-Horizon Average-Reward Reinforcement Learning with Linear Function Approximation (2409.10772v2)
Abstract: This paper proposes a computationally tractable algorithm for learning infinite-horizon average-reward linear Markov decision processes (MDPs) and linear mixture MDPs under the Bellman optimality condition. While guaranteeing computational efficiency, our algorithm for linear MDPs achieves the best-known regret upper bound of $\widetilde{\mathcal{O}}(d^{3/2}\mathrm{sp}(v^*)\sqrt{T})$ over $T$ time steps, where $\mathrm{sp}(v^*)$ is the span of the optimal bias function $v^*$ and $d$ is the dimension of the feature mapping. For linear mixture MDPs, our algorithm attains a regret bound of $\widetilde{\mathcal{O}}(d\cdot\mathrm{sp}(v^*)\sqrt{T})$. The algorithm applies novel techniques to control the covering number of the value function class and the span of optimistic estimators of the value function, which are of independent interest.
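For context, $\mathrm{sp}(v^*)$ in the regret bounds is the standard bias-span quantity from average-reward reinforcement learning; under the usual definition (with $\mathcal{S}$ denoting the state space, which the abstract does not introduce explicitly),
$$\mathrm{sp}(v^*) \;:=\; \max_{s \in \mathcal{S}} v^*(s) \;-\; \min_{s \in \mathcal{S}} v^*(s).$$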