Sample-efficient Learning of Infinite-horizon Average-reward MDPs with General Function Approximation
Abstract: We study infinite-horizon average-reward Markov decision processes (AMDPs) in the context of general function approximation. Specifically, we propose a novel algorithmic framework named Local-fitted Optimization with OPtimism (LOOP), which incorporates both model-based and value-based incarnations. In particular, LOOP features a novel construction of confidence sets and a low-switching policy updating scheme, both tailored to the average-reward and function approximation setting. Moreover, for AMDPs, we propose a novel complexity measure -- the average-reward generalized eluder coefficient (AGEC) -- which captures the challenge of exploration in AMDPs with general function approximation. This complexity measure encompasses almost all previously known tractable AMDP models, such as linear AMDPs and linear mixture AMDPs, and also covers newly identified cases such as kernel AMDPs and AMDPs with low Bellman eluder dimension. Using AGEC, we prove that LOOP achieves a sublinear $\tilde{\mathcal{O}}(\mathrm{poly}(d, \mathrm{sp}(V^*)) \sqrt{T\beta})$ regret, where $d$ and $\beta$ correspond to AGEC and the log-covering number of the hypothesis class respectively, $\mathrm{sp}(V^*)$ is the span of the optimal state bias function, $T$ denotes the number of steps, and $\tilde{\mathcal{O}}(\cdot)$ omits logarithmic factors. When specialized to concrete AMDP models, our regret bounds are comparable to those established by existing algorithms designed specifically for these special cases. To the best of our knowledge, this paper presents the first comprehensive theoretical framework capable of handling nearly all AMDPs.
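For reference, the quantities in the regret bound follow standard average-reward conventions; the display below is a minimal sketch of those definitions under the assumed conventions, not quoted verbatim from the paper. The gain of a policy $\pi$ and the optimal gain are

$$J^{\pi} = \liminf_{T \to \infty} \frac{1}{T}\, \mathbb{E}\Big[\sum_{t=1}^{T} r(s_t, a_t)\Big], \qquad J^{*} = \sup_{\pi} J^{\pi},$$

and the regret over $T$ steps and the span of the optimal bias function $V^{*}$ are

$$\mathrm{Reg}(T) = \sum_{t=1}^{T} \big(J^{*} - r(s_t, a_t)\big), \qquad \mathrm{sp}(V^{*}) = \max_{s} V^{*}(s) - \min_{s} V^{*}(s).$$

Since the bound grows as $\sqrt{T\beta}$, the per-step regret $\mathrm{Reg}(T)/T$ vanishes as $T \to \infty$, which is the sense in which the regret is sublinear.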