Settle online RL in discounted infinite-horizon MDPs
Establish minimax-optimal algorithms and regret/sample-complexity guarantees for online tabular reinforcement learning in discounted infinite-horizon Markov decision processes, resolving the current theoretical gap relative to the finite-horizon nonstationary setting.
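For concreteness, the objects named in this statement can be written as follows. This is a sketch in standard notation that does not appear in the entry itself: the symbols $\gamma$, $r$, $P$, and $\pi_t$ are assumptions, and the regret definition shown is only one of several notions in use for the discounted setting.

```latex
% Discounted value of a policy \pi started from state s,
% with discount factor \gamma \in [0, 1) (notation assumed, not from the source)
V^{\pi}(s) = \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)
  \,\middle|\, s_0 = s,\ a_t \sim \pi(\cdot \mid s_t),\ s_{t+1} \sim P(\cdot \mid s_t, a_t) \right]

% Optimal value function
V^{\star}(s) = \max_{\pi} V^{\pi}(s)

% One possible regret notion over T online steps along a single trajectory,
% where \pi_t is the policy the learner uses at step t
\mathrm{Regret}(T) = \sum_{t=1}^{T} \left( V^{\star}(s_t) - V^{\pi_t}(s_t) \right)
```

Under this reading, a sample-complexity guarantee would instead bound the number of steps at which $V^{\star}(s_t) - V^{\pi_t}(s_t)$ exceeds a target accuracy $\varepsilon$.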
References
Its counterpart in discounted infinite-horizon MDPs has not been fully settled (see, e.g., ) and calls for further investigation.
— Statistical and Algorithmic Foundations of Reinforcement Learning
(2507.14444 - Chi et al., 19 Jul 2025) in Section 4.1 (Problem formulation)