
Fast Policy Learning for Linear Quadratic Control with Entropy Regularization (2311.14168v3)

Published 23 Nov 2023 in math.OC and cs.LG

Abstract: This paper proposes and analyzes two new policy learning methods, regularized policy gradient (RPG) and iterative policy optimization (IPO), for a class of discounted linear-quadratic control (LQC) problems over an infinite time horizon with entropy regularization. Assuming access to exact policy evaluation, both proposed approaches are proven to converge linearly to the optimal policy of the regularized LQC problem. Moreover, the IPO method achieves a super-linear convergence rate once it enters a local region around the optimal policy. Finally, when the optimal policy of an RL problem with a known environment is appropriately transferred as the initial policy for an RL problem with an unknown environment, the IPO method attains a super-linear convergence rate provided the two environments are sufficiently close. The performance of the proposed algorithms is illustrated by numerical examples.
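
The abstract describes RPG and IPO only at a high level. As a rough, hedged illustration of the kind of update the paper analyzes, the Python sketch below performs exact policy evaluation for a Gaussian policy u ~ N(-Kx, Σ) via a discounted Lyapunov solve, then applies a policy-iteration-style improvement of the gain together with the entropy-optimal exploration covariance. The function names (`policy_eval`, `ipo_step`), the toy system, and the specific update form are assumptions made for illustration, not the authors' exact algorithms.

```python
import numpy as np

def policy_eval(A, B, Q, R, K, gamma):
    """Exact policy evaluation for the gain K: solve the discounted Lyapunov
    equation P = Q + K'RK + gamma * (A - BK)' P (A - BK) by vectorization.
    Only the quadratic part P of the value function is needed for the gain update."""
    n = A.shape[0]
    M = A - B @ K
    rhs = (Q + K.T @ R @ K).reshape(-1)
    lhs = np.eye(n * n) - gamma * np.kron(M.T, M.T)
    return np.linalg.solve(lhs, rhs).reshape(n, n)

def ipo_step(A, B, Q, R, K, gamma, tau):
    """One policy-iteration-style update (illustrative, not the paper's exact IPO):
    greedy gain for the mean plus the entropy-optimal Gaussian exploration covariance."""
    P = policy_eval(A, B, Q, R, K, gamma)
    G = R + gamma * B.T @ P @ B                 # curvature of the Q-function in u
    K_new = gamma * np.linalg.solve(G, B.T @ P @ A)
    Sigma_new = 0.5 * tau * np.linalg.inv(G)    # covariance maximizing entropy bonus
    return K_new, Sigma_new

# Toy example (assumed for illustration): 2-state, 1-input system, zero initial gain.
A = np.array([[1.0, 0.2], [0.0, 1.0]])
B = np.array([[0.0], [0.2]])
Q, R = np.eye(2), np.eye(1)
K = np.zeros((1, 2))
for _ in range(20):
    K, Sigma = ipo_step(A, B, Q, R, K, gamma=0.9, tau=0.1)
print("gain K:", K)
print("exploration covariance:", Sigma)
```

Starting from a gain for which the discounted closed loop is stable, each iteration costs one Lyapunov solve and one small matrix inverse; the local super-linear rate reported for IPO is consistent with the Newton-like character of policy-iteration updates for LQR, though the paper's precise method and assumptions should be consulted for details.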

