Reinforcement learning with non-ergodic reward increments: robustness via ergodicity transformations (2310.11335v3)
Abstract: Envisioned application areas for reinforcement learning (RL) include autonomous driving, precision agriculture, and finance, all of which require RL agents to make decisions in the real world. A significant challenge hindering the adoption of RL methods in these domains is the non-robustness of conventional algorithms. In particular, RL typically focuses on the expected value of the return. The expected value is the average over the statistical ensemble of infinitely many trajectories, which can be uninformative about the performance of a typical individual trajectory. For instance, when the return distribution is heavy-tailed, the ensemble average can be dominated by rare extreme events. Consequently, optimizing the expected value can lead to policies that yield exceptionally high returns with a probability approaching zero but almost surely result in catastrophic outcomes along single long trajectories. In this paper, we develop an algorithm that lets RL agents optimize the long-term performance of individual trajectories. The algorithm enables the agents to learn robust policies, which we demonstrate in an instructive example with a heavy-tailed return distribution and on standard RL benchmarks. The key element of the algorithm is a transformation that we learn from data. This transformation turns the time series of collected returns into one for whose increments the expected value and the time average over a single long trajectory coincide. Optimizing these increments yields robust policies.
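As an informal illustration of the ergodicity issue described in the abstract, the sketch below is a minimal Python example, not the paper's learned transformation. It simulates a multiplicative gamble in which the expected per-step growth factor exceeds one, yet almost every individual trajectory decays; for these dynamics the logarithm plays the role of an ergodicity transformation, since the expected log increment and its time average along a single long trajectory coincide. All names and parameters here are illustrative assumptions.

```python
import numpy as np

# Minimal sketch (assumption, not the paper's algorithm): multiplicative
# dynamics where each step multiplies wealth by 1.5 or 0.6 with equal
# probability. The expected growth factor per step is 1.05 > 1, so the
# ensemble average grows, yet a typical trajectory shrinks.
rng = np.random.default_rng(0)
trajectories, steps = 10_000, 1_000
factors = rng.choice([1.5, 0.6], size=(trajectories, steps))
wealth = np.cumprod(factors, axis=1)

print("expected growth factor per step:", 0.5 * 1.5 + 0.5 * 0.6)  # 1.05 > 1
print("median final wealth:", np.median(wealth[:, -1]))           # ~0: typical path decays

# Ergodicity transformation for these multiplicative dynamics: the logarithm.
# For the log-transformed increments, the expected value and the time average
# along one long trajectory (approximately) coincide, and both are negative,
# correctly signalling that an individual agent should avoid this gamble.
log_inc = np.log(factors)
print("E[log increment]:", 0.5 * np.log(1.5) + 0.5 * np.log(0.6))  # ~ -0.053
print("time-average log increment (one path):", log_inc[0].mean())
```

Optimizing the raw expected return in this toy setting would favor the gamble, whereas optimizing the transformed increments reflects what an individual agent experiences over time, which is the behavior the paper's algorithm targets by learning such a transformation from data.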