
Bridging Distributional and Risk-sensitive Reinforcement Learning with Provable Regret Bounds (2210.14051v3)

Published 25 Oct 2022 in cs.LG, cs.AI, and stat.ML

Abstract: We study the regret guarantee for risk-sensitive reinforcement learning (RSRL) via distributional reinforcement learning (DRL) methods. In particular, we consider finite episodic Markov decision processes whose objective is the entropic risk measure (EntRM) of return. By leveraging a key property of the EntRM, the independence property, we establish the risk-sensitive distributional dynamic programming framework. We then propose two novel DRL algorithms that implement optimism through two different schemes, including a model-free one and a model-based one. We prove that they both attain an $\tilde{\mathcal{O}}(\frac{\exp(|\beta| H)-1}{|\beta|}H\sqrt{S^2AK})$ regret upper bound, where $S$, $A$, $K$, and $H$ represent the number of states, actions, episodes, and the time horizon, respectively. It matches the bound of RSVI2 proposed in \cite{fei2021exponential}, with a novel distributional analysis. To the best of our knowledge, this is the first regret analysis that bridges DRL and RSRL in terms of sample complexity. Acknowledging the computational inefficiency of the model-free DRL algorithm, we propose an alternative DRL algorithm with distribution representation. This approach not only maintains the established regret bounds but also significantly improves computational efficiency. We also prove a tighter minimax lower bound of $\Omega(\frac{\exp(\beta H/6)-1}{\beta H}H\sqrt{SAT})$ for the $\beta>0$ case, which recovers the tight lower bound $\Omega(H\sqrt{SAT})$ in the risk-neutral setting.
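
For context, the objective above is stated in terms of the entropic risk measure; as a reminder of the usual convention (not quoted from the paper), for risk parameter $\beta \neq 0$ the EntRM of a random return $X$ is

$$\mathrm{EntRM}_{\beta}(X) \;=\; \frac{1}{\beta}\log \mathbb{E}\left[\exp(\beta X)\right], \qquad \lim_{\beta \to 0} \mathrm{EntRM}_{\beta}(X) \;=\; \mathbb{E}[X],$$

so the risk-neutral expected return is recovered as $\beta \to 0$. This is consistent with the stated lower bound: since $\frac{\exp(\beta H/6)-1}{\beta H} \to \frac{1}{6}$ as $\beta \to 0$, the bound $\Omega(\frac{\exp(\beta H/6)-1}{\beta H}H\sqrt{SAT})$ reduces to $\Omega(H\sqrt{SAT})$ in the risk-neutral limit.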

References (50)
  1. M. Achab and G. Neu. Robustness and risk management via distributional dynamic programming. arXiv preprint arXiv:2112.15430, 2021.
  2. Minimax regret bounds for reinforcement learning. In International Conference on Machine Learning, pages 263–272. PMLR, 2017.
  3. Distributed distributional deterministic policy gradients. arXiv preprint arXiv:1804.08617, 2018.
  4. N. Bäuerle and U. Rieder. More risk-sensitive Markov decision processes. Mathematics of Operations Research, 39(1):105–120, 2014.
  5. A distributional perspective on reinforcement learning. In International Conference on Machine Learning, pages 449–458. PMLR, 2017.
  6. Distributional Reinforcement Learning. MIT Press, 2023. http://www.distributional-rl.org.
  7. D. P. Bertsekas et al. Dynamic programming and optimal control: Vol. 1. Athena Scientific, Belmont, 2000.
  8. Risk sensitive asset allocation. Journal of Economic Dynamics and Control, 24(8):1145–1177, 2000.
  9. V. S. Borkar. A sensitivity formula for risk-sensitive cost and the actor–critic algorithm. Systems & Control Letters, 44(5):339–346, 2001.
  10. V. S. Borkar. Q-learning for risk-sensitive control. Mathematics of Operations Research, 27(2):294–311, 2002.
  11. V. S. Borkar. Learning algorithms for risk-sensitive control. In Proceedings of the 19th International Symposium on Mathematical Theory of Networks and Systems–MTNS, volume 5, 2010.
  12. Risk-sensitive optimal control for Markov decision processes with monotone cost. Mathematics of Operations Research, 27(1):192–209, 2002.
  13. R. Cavazos-Cadena and D. Hernández-Hernández. Discounted approximations for risk-sensitive average criteria in Markov decision chains with finite state space. Mathematics of Operations Research, 36(1):133–146, 2011.
  14. Implicit quantile networks for distributional reinforcement learning. In International Conference on Machine Learning, pages 1096–1105. PMLR, 2018a.
  15. Distributional reinforcement learning with quantile regression. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018b.
  16. M. Davis and S. Lleo. Risk-sensitive benchmarked asset management. Quantitative Finance, 8(4):415–426, 2008.
  17. E. Delage and S. Mannor. Percentile optimization for Markov decision processes with parameter uncertainty. Operations Research, 58(1):203–213, 2010.
  18. D. Dentcheva and A. Ruszczynski. Common mathematical foundations of expected utility and dual utility theories. SIAM Journal on Optimization, 23(1):381–405, 2013.
  19. G. B. Di Masi and Ł. Stettner. Infinite horizon risk sensitive control of discrete time Markov processes under minorization property. SIAM Journal on Control and Optimization, 46(1):231–252, 2007.
  20. G. B. Di Masi et al. Infinite horizon risk sensitive control of discrete time Markov processes with small risk. Systems & Control Letters, 40(1):15–20, 2000.
  21. Episodic reinforcement learning in finite MDPs: Minimax lower bounds revisited. In Algorithmic Learning Theory, pages 578–598. PMLR, 2021.
  22. Provably efficient risk-sensitive reinforcement learning: Iterated CVaR and worst path. In The Eleventh International Conference on Learning Representations, 2022.
  23. Clinical data based optimal STI strategies for HIV: a reinforcement learning approach. In Proceedings of the 45th IEEE Conference on Decision and Control, pages 667–672. IEEE, 2006.
  24. Risk-sensitive reinforcement learning: Near-optimal risk-sample tradeoff in regret. arXiv preprint arXiv:2006.13827, 2020.
  25. Exponential Bellman equation and improved regret bounds for risk-sensitive reinforcement learning. Advances in Neural Information Processing Systems, 34, 2021.
  26. H. Föllmer and A. Schied. Stochastic Finance. de Gruyter, 2016.
  27. Explore first, exploit next: The true shape of regret in bandit problems. Mathematics of Operations Research, 44(2):377–399, 2019.
  28. Robustness. Princeton University Press, 2011.
  29. Risk-sensitive Markov decision processes. Management Science, 18(7):356–369, 1972.
  30. A. Jaśkiewicz. Average optimality for risk-sensitive control with general state space. The Annals of Applied Probability, 17(2):654–675, 2007.
  31. Being optimistic to be conservative: Quickly learning a CVaR policy. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 4436–4443, 2020.
  32. M. Kupper and W. Schachermayer. Representation results for law invariant time consistent functions. Mathematics and Financial Economics, 2:189–210, 2009.
  33. H. Liang and Z.-q. Luo. A distribution optimization framework for confidence bounds of risk measures. arXiv preprint arXiv:2306.07059, 2023.
  34. A comparative analysis of expected and distributional reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 4504–4511, 2019.
  35. DSAC: Distributional soft actor critic for risk-sensitive reinforcement learning. arXiv preprint arXiv:2004.14547, 2020.
  36. Conservative offline distributional reinforcement learning. Advances in Neural Information Processing Systems, 34, 2021.
  37. O. Mihatsch and R. Neuneier. Risk-sensitive reinforcement learning. Machine Learning, 49(2):267–290, 2002.
  38. Entropic risk measure in policy search. In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1101–1106. IEEE, 2019.
  39. T. Osogami. Robustness and risk-sensitivity in Markov decision processes. Advances in Neural Information Processing Systems, 25:233–241, 2012.
  40. S. D. Patek. On terminating Markov decision processes with a risk-averse objective function. Automatica, 37(9):1379–1386, 2001.
  41. An analysis of categorical distributional reinforcement learning. In International Conference on Artificial Intelligence and Statistics, pages 29–37. PMLR, 2018.
  42. Lectures on stochastic programming: modeling and theory. SIAM, 2021.
  43. Risk-sensitive Markov control processes. SIAM Journal on Control and Optimization, 51(5):3652–3672, 2013.
  44. Risk-sensitive reinforcement learning. Neural Computation, 26(7):1298–1328, 2014.
  45. Improving robustness via risk averse distributional reinforcement learning. In Learning for Dynamics and Control, pages 958–968. PMLR, 2020.
  46. Reinforcement learning: An introduction. MIT Press, 2018.
  47. J. Von Neumann and O. Morgenstern. Theory of games and economic behavior, 2nd rev. 1947.
  48. Inequalities for the L1 deviation of the empirical distribution. Hewlett-Packard Labs, Tech. Rep., 2003.
  49. Fully parameterized quantile function for distributional reinforcement learning. Advances in Neural Information Processing Systems, 32, 2019.
  50. Distributional reinforcement learning for multi-dimensional reward functions. Advances in Neural Information Processing Systems, 34:1519–1529, 2021.