Settle online RL in discounted infinite-horizon MDPs

Establish minimax-optimal algorithms and regret/sample-complexity guarantees for online tabular reinforcement learning in discounted infinite-horizon Markov decision processes, resolving the current theoretical gap relative to the finite-horizon nonstationary setting.

Background

The tutorial develops minimax lower bounds and regret-optimal model-based algorithms for online episodic RL in finite-horizon nonstationary MDPs, culminating in the MVP algorithm that achieves full-range optimal regret. By contrast, the authors note that the analogous problem in discounted infinite-horizon MDPs does not yet have a fully settled theory.
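For concreteness, here is a sketch of the episodic regret benchmark typically used in that finite-horizon setting (the notation below is a standard choice, not quoted from the tutorial): with $K$ episodes, executed policies $\pi_k$, and initial states $s_1^k$,

$$\mathrm{Regret}(K) \;=\; \sum_{k=1}^{K} \Big( V_1^{\star}\big(s_1^k\big) - V_1^{\pi_k}\big(s_1^k\big) \Big),$$

where $V_1^{\star}$ and $V_1^{\pi_k}$ denote the optimal and executed-policy value functions at step $1$. "Full-range optimal" means the regret bound matches the minimax lower bound, up to logarithmic factors, for every sample size $K$ rather than only after a burn-in period.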

The open question is to extend the finite-horizon understanding to the discounted infinite-horizon setting, including identifying algorithms and tight regret/sample-complexity bounds that match the information-theoretic limits of the online tabular regime.
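To make the target concrete, one common formalization of the discounted infinite-horizon online setting is as follows (the symbols $\gamma$, $\mathcal{S}$, $\mathcal{A}$ and the regret definition below are standard conventions, not taken verbatim from the tutorial). For a discount factor $\gamma \in (0,1)$ and a policy $\pi$,

$$V^{\pi}(s) \;=\; \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t) \,\Big|\, s_0 = s\Big], \qquad V^{\star}(s) \;=\; \max_{\pi} V^{\pi}(s),$$

and a commonly used notion of regret over $T$ interaction steps along a single trajectory is

$$\mathrm{Regret}(T) \;=\; \sum_{t=1}^{T} \Big( V^{\star}(s_t) - V^{\pi_t}(s_t) \Big),$$

where $\pi_t$ is the policy executed at step $t$. Settling the problem would mean exhibiting an algorithm whose regret (or, equivalently, sample complexity for finding an $\varepsilon$-optimal policy) matches an information-theoretic lower bound in its dependence on $|\mathcal{S}|$, $|\mathcal{A}|$, the effective horizon $1/(1-\gamma)$, and $T$ (or $\varepsilon$), over the full range of these parameters.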

References

Its counterpart in discounted infinite-horizon MDPs has not been fully settled (see, e.g., ) and calls for further investigation.

Statistical and Algorithmic Foundations of Reinforcement Learning (2507.14444 - Chi et al., 19 Jul 2025) in Section 4.1 (Problem formulation)