
Efficient Reinforcement Learning via Decoupling Exploration and Utilization (2312.15965v4)

Published 26 Dec 2023 in cs.LG

Abstract: Reinforcement Learning (RL), recognized as an efficient learning approach, has achieved remarkable success across many fields and applications, including gaming, robotics, and autonomous vehicles. Classical single-agent reinforcement learning struggles with the imbalance between exploration and exploitation and with limited generalization, which frequently leads algorithms to settle for suboptimal solutions tailored only to specific datasets. In this work, we aim to train agents more efficiently by decoupling exploration from utilization, so that the agent can escape the trap of suboptimal solutions. In reinforcement learning, previously imposed pessimistic penalties deprive the model of its exploratory potential, diminishing its exploration capability. To address this, we introduce an additional optimistic actor to enhance the model's exploration ability, while employing a more constrained pessimistic actor for performance evaluation. This idea is implemented in the proposed OPARL (Optimistic and Pessimistic Actor Reinforcement Learning) algorithm. Combining the two within a single reinforcement learning agent yields a more balanced and efficient approach: it optimizes policies that concentrate on high-reward actions via pessimistic exploitation while ensuring extensive state coverage through optimistic exploration. Empirical and theoretical investigations demonstrate that OPARL improves the agent's capabilities in both utilization and exploration. On most tasks of the DMControl benchmark and the MuJoCo environments, OPARL outperformed state-of-the-art methods. Our code is released at https://github.com/yydsok/OPARL
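The abstract's core mechanism is a pair of actors sharing one critic family: an optimistic actor used only to collect data, and a pessimistic actor used for evaluation. The sketch below illustrates that decoupling in PyTorch. It is a minimal illustration, not the authors' implementation: the class and method names are hypothetical, and the choice of max/min over twin critics as the optimistic/pessimistic value estimates is an assumption; consult the official repository for the actual OPARL code.

```python
# Minimal sketch of a dual-actor agent: an optimistic actor for exploration and a
# pessimistic actor for exploitation. Names and the max/min-over-twin-critics
# construction are assumptions, not the authors' implementation.
import torch
import torch.nn as nn


def mlp(in_dim, out_dim, hidden=256):
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim),
    )


class TwinCritic(nn.Module):
    """Two independent Q-networks; their max/min give optimistic/pessimistic values."""

    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.q1 = mlp(state_dim + action_dim, 1)
        self.q2 = mlp(state_dim + action_dim, 1)

    def forward(self, state, action):
        sa = torch.cat([state, action], dim=-1)
        return self.q1(sa), self.q2(sa)


class DualActorAgent:
    def __init__(self, state_dim, action_dim, max_action=1.0):
        self.critic = TwinCritic(state_dim, action_dim)
        # Optimistic actor: used only to collect data (exploration).
        self.actor_opt = nn.Sequential(mlp(state_dim, action_dim), nn.Tanh())
        # Pessimistic actor: used for evaluation/deployment (exploitation).
        self.actor_pes = nn.Sequential(mlp(state_dim, action_dim), nn.Tanh())
        self.max_action = max_action
        self.opt_optim = torch.optim.Adam(self.actor_opt.parameters(), lr=3e-4)
        self.pes_optim = torch.optim.Adam(self.actor_pes.parameters(), lr=3e-4)

    def explore_action(self, state):
        return self.max_action * self.actor_opt(state)

    def evaluate_action(self, state):
        return self.max_action * self.actor_pes(state)

    def update_actors(self, states):
        # Optimistic actor climbs an upper-bound value estimate (max of the twin
        # critics), encouraging visits to states whose value is still uncertain.
        a_opt = self.max_action * self.actor_opt(states)
        q1, q2 = self.critic(states, a_opt)
        loss_opt = -torch.max(q1, q2).mean()
        self.opt_optim.zero_grad()
        loss_opt.backward()
        self.opt_optim.step()

        # Pessimistic actor climbs a lower-bound estimate (min of the twin critics),
        # giving a conservative policy for evaluation, as in clipped double-Q methods.
        a_pes = self.max_action * self.actor_pes(states)
        q1, q2 = self.critic(states, a_pes)
        loss_pes = -torch.min(q1, q2).mean()
        self.pes_optim.zero_grad()
        loss_pes.backward()
        self.pes_optim.step()
```

In such a setup, environment transitions would be gathered with explore_action while the deployed policy uses evaluate_action; training of the critic itself (e.g., with standard TD targets) is omitted from the sketch.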

