Risk-Sensitive Reinforcement Learning with Exponential Criteria (2212.09010v6)

Published 18 Dec 2022 in eess.SY, cs.AI, cs.LG, and cs.SY

Abstract: While reinforcement learning has shown experimental success in a number of applications, it is known to be sensitive to noise and perturbations in the parameters of the system, leading to high variance in the total reward amongst different episodes in slightly different environments. To introduce robustness, as well as sample efficiency, risk-sensitive reinforcement learning methods are being thoroughly studied. In this work, we provide a definition of robust reinforcement learning policies and formulate a risk-sensitive reinforcement learning problem to approximate them, by solving an optimization problem with respect to a modified objective based on exponential criteria. In particular, we study a model-free risk-sensitive variation of the widely-used Monte Carlo Policy Gradient algorithm and introduce a novel risk-sensitive online Actor-Critic algorithm based on solving a multiplicative Bellman equation using stochastic approximation updates. Analytical results suggest that the use of exponential criteria generalizes commonly used ad-hoc regularization approaches, improves sample efficiency, and introduces robustness with respect to perturbations in the model parameters and the environment. The implementation, performance, and robustness properties of the proposed methods are evaluated in simulated experiments.
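The following is a minimal sketch of the two ideas named in the abstract: a REINFORCE-style update weighted by an exponential criterion J(theta) = (1/beta) * log E[exp(beta * R)], and a stochastic-approximation step on a multiplicative Bellman equation. It assumes a tabular softmax policy and a tabular value function; the function names, hyperparameters, and the undiscounted return are illustrative assumptions and not the authors' implementation.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax."""
    z = logits - logits.max()
    p = np.exp(z)
    return p / p.sum()

def risk_sensitive_reinforce_update(theta, episode, beta=-0.5, lr=0.01):
    """One Monte Carlo policy-gradient step on the exponential criterion
    J(theta) = (1/beta) * log E[exp(beta * R)] for a tabular softmax policy
    theta[state, action]. `episode` is a list of (state, action, reward)
    tuples collected under the current policy; beta < 0 is risk-averse."""
    theta = theta.copy()
    rewards = np.array([r for (_, _, r) in episode], dtype=float)
    # undiscounted return-to-go G_t at every time step (illustrative choice)
    returns_to_go = np.flip(np.cumsum(np.flip(rewards)))
    for t, (s, a, _) in enumerate(episode):
        p = softmax(theta[s])
        grad_log_pi = -p
        grad_log_pi[a] += 1.0  # gradient of log softmax w.r.t. theta[s, :]
        # standard REINFORCE would weight grad_log_pi by G_t; the exponential
        # criterion replaces it with exp(beta * G_t) / beta, which penalizes
        # low-return outcomes much more heavily when beta < 0
        theta[s] += lr * (np.exp(beta * returns_to_go[t]) / beta) * grad_log_pi
    return theta

def multiplicative_bellman_td_update(V, s, r, s_next, done, beta=-0.5, lr=0.05):
    """Stochastic-approximation step on the multiplicative Bellman equation
    V(s) = E[exp(beta * r) * V(s')], with V fixed to 1 at terminal states
    (since exp(beta * 0) = 1). V is a 1-D array indexed by state."""
    target = np.exp(beta * r) * (1.0 if done else V[s_next])
    V[s] += lr * (target - V[s])
    return V
```

In an actor-critic arrangement of this kind, (1/beta) * log V(s) plays the role of a risk-sensitive state value, and the critic update above can be run online alongside the policy update; the exact form of the authors' algorithm should be taken from the paper itself.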
