A policy gradient approach for optimization of smooth risk measures (2202.11046v4)
Published 22 Feb 2022 in cs.LG
Abstract: We propose policy gradient algorithms for solving a risk-sensitive reinforcement learning (RL) problem in both on-policy and off-policy settings. We consider episodic Markov decision processes and model the risk using the broad class of smooth risk measures of the cumulative discounted reward. We propose two template policy gradient algorithms that optimize a smooth risk measure in the on-policy and off-policy RL settings, respectively. We derive non-asymptotic bounds that quantify the rate of convergence of the proposed algorithms to a stationary point of the smooth risk measure. As special cases, we establish that our algorithms apply to the optimization of mean-variance and distortion risk measures.
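The abstract does not spell out the update rule, but a standard way to instantiate a risk-sensitive policy gradient for the mean-variance special case is a REINFORCE-style likelihood-ratio estimator for the smooth objective rho(theta) = E[R] - lambda * Var(R), where R is the cumulative discounted reward of an episode. The sketch below is illustrative only and is not taken from the paper; the toy environment, the softmax policy class, the plug-in variance gradient, and all hyperparameters (`lam`, batch size, step size) are assumptions.

```python
import numpy as np

# Illustrative sketch (not the paper's exact algorithm): REINFORCE-style
# policy gradient for the smooth mean-variance objective
#   rho(theta) = E[R] - lam * Var(R),
# where R is the cumulative discounted reward of an episode.
# Uses the likelihood-ratio identities
#   grad E[R]   = E[R * score],
#   grad Var(R) = E[(R^2 - 2 E[R] R) * score],
# with score = sum_t grad_theta log pi(a_t | s_t).

rng = np.random.default_rng(0)
n_states, n_actions, gamma, lam = 4, 2, 0.95, 0.1  # assumed toy settings

def softmax_policy(theta, s):
    logits = theta[s]
    p = np.exp(logits - logits.max())
    return p / p.sum()

def rollout(theta, horizon=20):
    """Sample one episode from a toy MDP; return the discounted return R
    and the score (gradient of the episode's log-likelihood w.r.t. theta)."""
    s, R, disc = 0, 0.0, 1.0
    score = np.zeros_like(theta)
    for _ in range(horizon):
        p = softmax_policy(theta, s)
        a = rng.choice(n_actions, p=p)
        grad = -p
        grad[a] += 1.0                     # grad of log softmax w.r.t. logits
        score[s] += grad
        R += disc * rng.normal(loc=(s + a) / 8.0, scale=0.5)  # toy reward
        disc *= gamma
        s = (s + a + 1) % n_states         # toy deterministic transition
    return R, score

theta = np.zeros((n_states, n_actions))
for it in range(200):
    batch = [rollout(theta) for _ in range(32)]
    Rs = np.array([r for r, _ in batch])
    mean_R = Rs.mean()                     # plug-in estimate of E[R]
    g = np.zeros_like(theta)
    for R, score in batch:
        # gradient of E[R] - lam * Var(R), via the identities above
        g += (R - lam * (R * R - 2.0 * mean_R * R)) * score
    theta += 0.05 * g / len(batch)         # ascent step on rho(theta)
```

A distortion risk measure would replace the mean-variance weighting of returns with weights derived from a distortion of the empirical return distribution, and an off-policy variant would additionally multiply each score term by importance-sampling ratios; both changes leave the overall batch-and-update structure above intact.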