
How to Choose a Reinforcement-Learning Algorithm (2407.20917v1)

Published 30 Jul 2024 in cs.LG, cs.AI, cs.CV, and stat.ML

Abstract: The field of reinforcement learning offers a large variety of concepts and methods to tackle sequential decision-making problems. This variety has become so large that choosing an algorithm for a task at hand can be challenging. In this work, we streamline the process of choosing reinforcement-learning algorithms and action-distribution families. We provide a structured overview of existing methods and their properties, as well as guidelines for when to choose which methods. An interactive version of these guidelines is available online at https://rl-picker.github.io/.

Authors (8)
  1. Fabian Bongratz
  2. Vladimir Golkov
  3. Lukas Mautner
  4. Luca Della Libera
  5. Frederik Heetmeyer
  6. Felix Czaja
  7. Julian Rodemann
  8. Daniel Cremers

Summary

  • The paper's main contribution is a systematic guide for choosing RL algorithms, built by consolidating algorithmic properties and environmental factors.
  • It categorizes methods such as on/off-policy and model-based/model-free approaches to streamline the selection process for specific tasks.
  • It provides practical insights on neural network architectures and stabilization techniques to improve training performance in diverse settings.

Overview of "How to Choose a Reinforcement-Learning Algorithm"

The paper "How to Choose a Reinforcement-Learning Algorithm" by Fabian Bongratz et al. undertakes the formidable task of providing a systematic guide to selecting appropriate reinforcement learning (RL) algorithms for various tasks. The field of RL, known for its plethora of algorithms and varying methodologies, often makes the decision process for selecting the right algorithm cumbersome. The authors address this by offering structured guidelines, taking into account algorithmic properties and environmental conditions.

Abstract and Introduction

Reinforcement Learning (RL) involves a broad spectrum of methods designed to tackle sequential decision-making problems. Recently, deep RL has shown exceptional performance across various applications, such as game-playing, robotics, and finance. Despite this progress, selecting the right algorithm remains a challenge due to the dispersed nature of information across textbooks and research papers. This paper aims to consolidate this information and provide a comprehensive guide to streamline the algorithm selection process, thereby aiding researchers and practitioners alike.

Reinforcement Learning Algorithms

This section presents an extensive catalog of RL algorithms, categorized by key properties such as on-policy vs. off-policy learning, value-based vs. policy-based approaches, and model-free vs. model-based methods. Tables catalog the specific attributes of each algorithm, such as its suitability for different kinds of environments, helping readers make an informed selection.
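To make the idea of such a property catalog concrete, here is a minimal sketch of how the classifications could be represented in code. The `AlgorithmProfile` structure, field names, and query are hypothetical illustrations rather than the paper's actual tables; the listed properties of DQN, PPO, and SAC are their standard textbook classifications.

```python
from dataclasses import dataclass

@dataclass
class AlgorithmProfile:
    """Illustrative record of the kind of properties the paper's tables catalog."""
    name: str
    on_policy: bool        # requires fresh data from the current policy?
    model_based: bool      # learns/uses a model of the environment dynamics?
    action_space: str      # "discrete", "continuous", or "both"

# A few well-known algorithms with their standard classifications.
CATALOG = [
    AlgorithmProfile("DQN", on_policy=False, model_based=False, action_space="discrete"),
    AlgorithmProfile("PPO", on_policy=True,  model_based=False, action_space="both"),
    AlgorithmProfile("SAC", on_policy=False, model_based=False, action_space="continuous"),
]

# Example query: off-policy algorithms that handle continuous actions.
print([a.name for a in CATALOG
       if not a.on_policy and a.action_space in ("continuous", "both")])  # ['SAC']
```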

Action-Distribution Families

The authors examine the action-distribution families that can be used within RL algorithms. Each family has its own parameters and expressive power, which can significantly influence both the learning process and final performance. The structured guidelines simplify the selection of an appropriate distribution family based on the specific requirements of the task at hand.
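As a concrete illustration of what different action-distribution families look like in code, the snippet below samples from a categorical, a diagonal Gaussian, and a Beta distribution using PyTorch's `torch.distributions`. The parameter values are arbitrary and the snippet is an assumption-laden sketch, not code from the paper.

```python
import torch
from torch.distributions import Categorical, Normal, Beta

# Discrete actions: a categorical distribution over four actions.
discrete_pi = Categorical(logits=torch.tensor([0.1, 0.5, -0.3, 0.0]))

# Unbounded continuous actions: a diagonal Gaussian, the most common choice.
gaussian_pi = Normal(loc=torch.zeros(2), scale=torch.ones(2))

# Bounded continuous actions in [0, 1]: a Beta distribution, which avoids
# the boundary effects of clipping or squashing a Gaussian.
beta_pi = Beta(torch.tensor([2.0]), torch.tensor([3.0]))

for pi in (discrete_pi, gaussian_pi, beta_pi):
    action = pi.sample()            # draw an action
    log_prob = pi.log_prob(action)  # needed for policy-gradient updates
    print(type(pi).__name__, action, log_prob)
```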

Selection Criteria

The core strength of the paper lies in its detailed decision tables. These tables provide a nuanced view of when to employ specific algorithms based on various properties; a toy selector sketch follows the list:

  • Model-Free vs. Model-Based RL: The decision pivots on factors like training stability, data efficiency, and whether the environment dynamics are known and learnable.
  • Hierarchical RL: Suitable for tasks requiring complex action sequences that can be broken down into sub-routines.
  • Imitation Learning: Recommended when expert demonstrations are available, offering substantial benefits in terms of training stability and performance.
  • Distributed Algorithms: Beneficial in settings where computational resources allow parallel execution, significantly reducing training time.
  • Distributional Algorithms: Useful when risk estimation of actions is critical.
  • On-Policy vs. Off-Policy Learning: Dependent on factors such as the importance of training stability, sample efficiency, and the need for good exploration.
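The interactive guidelines at https://rl-picker.github.io/ essentially encode criteria like these as a decision procedure. As a rough, hypothetical illustration only (the function, its inputs, and its suggestions are not the paper's actual decision logic), such a rule-based selector might look like this:

```python
def suggest_algorithm_family(discrete_actions: bool,
                             sample_efficiency_critical: bool,
                             expert_demos_available: bool,
                             dynamics_learnable: bool) -> str:
    """Toy decision procedure loosely mirroring the criteria listed above."""
    if expert_demos_available:
        return "imitation learning, possibly combined with RL fine-tuning"
    if sample_efficiency_critical and dynamics_learnable:
        return "model-based RL"
    if sample_efficiency_critical:
        return "off-policy model-free RL (value-based or actor-critic)"
    if discrete_actions:
        return "on-policy or value-based model-free RL"
    return "on-policy policy-gradient methods"

print(suggest_algorithm_family(discrete_actions=False,
                               sample_efficiency_critical=True,
                               expert_demos_available=False,
                               dynamics_learnable=False))
# -> off-policy model-free RL (value-based or actor-critic)
```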

Practical Considerations and Neural Network Architectures

The paper also offers practical advice on the implementation and optimization of neural networks for RL:

  • Fully Connected NNs: Ideal for environments with small state spaces.
  • Time-Recurrent NNs: Suitable for partially observable MDPs or fully observable MDPs requiring temporal dependencies.
  • CNNs: Recommended for tasks involving Euclidean-grid data like images or videos.
  • Dueling NNs: Improve value-function learning by separating the state-value and action-advantage streams (see the sketch after this list).
  • Parameter Sharing and Broadcasting: Efficient techniques to handle complex state and action representations, reducing the overall network parameter load.
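To illustrate the dueling idea mentioned above, here is a minimal PyTorch sketch of a dueling Q-network head. The layer sizes are arbitrary and the class is an illustrative reconstruction of the standard dueling construction, not code from the paper.

```python
import torch
import torch.nn as nn

class DuelingQNetwork(nn.Module):
    """Q(s, a) = V(s) + A(s, a) - mean_a A(s, a), computed from shared features."""

    def __init__(self, obs_dim: int, num_actions: int, hidden: int = 128):
        super().__init__()
        self.features = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.value_head = nn.Linear(hidden, 1)                 # state value V(s)
        self.advantage_head = nn.Linear(hidden, num_actions)   # advantages A(s, a)

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        h = self.features(obs)
        v = self.value_head(h)
        a = self.advantage_head(h)
        # Subtract the mean advantage so that V and A are identifiable.
        return v + a - a.mean(dim=-1, keepdim=True)

q_net = DuelingQNetwork(obs_dim=8, num_actions=4)
print(q_net(torch.randn(2, 8)).shape)  # torch.Size([2, 4])
```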

Training Techniques

The paper emphasizes stabilizing the RL training process through methods such as:

  • Trust Regions and Clipping: Constrain each policy update to stay close to the previous policy, ensuring stable updates (clipping and the target-network update are sketched after this list).
  • Target Networks: Updating parameters less frequently to stabilize bootstrapping targets.
  • Weight Averaging: Mitigating overfitting and enhancing training stability.
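For concreteness, the snippet below sketches two of these stabilizers: a PPO-style clipped surrogate loss and a Polyak (soft) target-network update. The clipping coefficient and τ value are common defaults, assumed here rather than prescribed by the paper.

```python
import torch

def clipped_surrogate_loss(log_prob, old_log_prob, advantage, clip_eps=0.2):
    """PPO-style clipped policy loss: keeps each update close to the old policy."""
    ratio = torch.exp(log_prob - old_log_prob)
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    return -torch.min(unclipped, clipped).mean()

@torch.no_grad()
def soft_update(target_net, online_net, tau=0.005):
    """Let the target network slowly track the online network,
    which stabilizes the bootstrapping targets."""
    for t_param, o_param in zip(target_net.parameters(), online_net.parameters()):
        t_param.mul_(1.0 - tau).add_(tau * o_param)
```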

To improve test performance, the paper discusses:

  • Learning a Diverse Set of Policies: Leveraging multiple policies to enhance robustness and adaptability.
  • Intrinsic Motivation: Augmenting external rewards with intrinsic motivation to improve exploration.
  • Data Augmentation: Enhancing the diversity of training samples to generalize better across tasks (sketched below).
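As one example of the data-augmentation idea for image-based RL, a common trick is to randomly shift observation frames by a few pixels before feeding them to the network. The sketch below (the pad size and the (B, C, H, W) tensor layout are assumptions) shows one straightforward way to do this in PyTorch:

```python
import torch
import torch.nn.functional as F

def random_shift(obs: torch.Tensor, pad: int = 4) -> torch.Tensor:
    """Randomly shift each image in a batch by up to `pad` pixels.

    obs: float tensor of shape (B, C, H, W); returns a tensor of the same shape.
    """
    b, c, h, w = obs.shape
    padded = F.pad(obs, (pad, pad, pad, pad), mode="replicate")
    shifted = torch.empty_like(obs)
    for i in range(b):
        top = int(torch.randint(0, 2 * pad + 1, (1,)))
        left = int(torch.randint(0, 2 * pad + 1, (1,)))
        shifted[i] = padded[i, :, top:top + h, left:left + w]
    return shifted

# Augmenting observations during training helps the agent generalize
# beyond the exact pixels it has already seen.
augmented = random_shift(torch.rand(8, 3, 84, 84))
print(augmented.shape)  # torch.Size([8, 3, 84, 84])
```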

Conclusion

The paper offers a comprehensive, structured guide for choosing RL algorithms based on the needs and constraints of a given task. While no single algorithm can address all RL challenges, the guidelines help practitioners select the methods best suited to their specific situation, making efficient use of resources and getting the most out of RL agents. The paper is a valuable resource for researchers and practitioners navigating the complex landscape of RL algorithm selection and optimization.
