
Regret Analysis of Policy Gradient Algorithm for Infinite Horizon Average Reward Markov Decision Processes (2309.01922v3)

Published 5 Sep 2023 in cs.LG and cs.AI

Abstract: In this paper, we consider an infinite horizon average reward Markov Decision Process (MDP). Unlike existing works in this setting, our approach uses a general policy gradient-based algorithm and does not assume a linear MDP structure. We propose a policy gradient-based algorithm and show its global convergence property. We then prove that the proposed algorithm has $\tilde{\mathcal{O}}(T^{3/4})$ regret. To the best of our knowledge, this is the first regret bound for a general parameterized policy gradient algorithm in the average reward setting.
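For context, regret in the infinite horizon average reward setting is typically measured as the gap between $T$ times the optimal long-run average reward and the reward actually collected over $T$ steps. The following is the standard definition used in this literature; the paper's exact notation may differ:

\[
  \mathrm{Reg}(T) = T\,J^{*} - \sum_{t=1}^{T} r(s_t, a_t),
  \qquad
  J^{*} = \max_{\pi}\; \lim_{T \to \infty} \frac{1}{T}\,
  \mathbb{E}_{\pi}\!\Big[\sum_{t=1}^{T} r(s_t, a_t)\Big].
\]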

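To make the algorithmic setting concrete, below is a minimal, self-contained sketch of a REINFORCE-style policy gradient update for the average reward criterion on a hypothetical two-state MDP. Everything here (the toy transition tensor P, reward table R, the step size, and the use of the running average reward as a crude proxy for a differential-value baseline) is an illustrative assumption, not the paper's algorithm, which covers general parameterization and carries the regret guarantee above.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy MDP: 2 states, 2 actions. P[s, a] is the next-state
# distribution and R[s, a] the expected reward -- illustrative values only.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[0.0, 1.0],
              [0.5, 2.0]])
n_states, n_actions = R.shape

theta = np.zeros((n_states, n_actions))  # tabular softmax parameters

def policy(s):
    # Softmax over the parameters of state s (shifted for numerical stability).
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

alpha, epoch_len, s = 0.05, 200, 0
for epoch in range(300):
    traj, avg_r = [], 0.0
    for t in range(epoch_len):
        pi = policy(s)
        a = rng.choice(n_actions, p=pi)
        r = R[s, a]
        avg_r += (r - avg_r) / (t + 1)   # running estimate of the average reward
        traj.append((s, a, r))
        s = rng.choice(n_states, p=P[s, a])
    # Score-function gradient: for a softmax policy,
    # grad_theta[s,:] log pi(a|s) = e_a - pi(.|s); each sample is weighted by
    # (r - avg_r), a crude single-sample stand-in for the differential value.
    grad = np.zeros_like(theta)
    for st, at, r in traj:
        score = -policy(st)
        score[at] += 1.0
        grad[st] += (r - avg_r) * score
    theta += alpha * grad / epoch_len

print("policy per state:", [policy(si).round(2) for si in range(n_states)])

In this toy instance both states should end up strongly preferring action 1, which yields both a higher immediate reward and a better induced state distribution.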