
Guided Exploration in Reinforcement Learning via Monte Carlo Critic Optimization (2206.12674v2)

Published 25 Jun 2022 in cs.LG and cs.AI

Abstract: Deep deterministic off-policy algorithms are effective at solving challenging continuous control problems. Current approaches commonly rely on random noise for exploration, which has several drawbacks: the noise must be tuned manually for each task, and the amount of exploration is not calibrated during training. We address these issues with a novel guided exploration method that uses an ensemble of Monte Carlo critics to compute an exploratory action correction, enhancing the traditional exploration scheme by adjusting exploration dynamically. Building on this module, we present an algorithm that uses it to modify both the policy and the critic. The proposed algorithm outperforms modern reinforcement learning algorithms across a variety of problems in the DMControl suite.
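The abstract describes the mechanism only at a high level, so the following is a minimal illustrative sketch rather than the authors' implementation. It assumes a PyTorch setting, an ensemble of Q-networks standing in for the Monte Carlo critics, and a hypothetical correction rule that nudges the deterministic policy action along the gradient of the ensemble-mean Q value, scaled by ensemble disagreement so that exploration adapts as training progresses. The names (MCCritic, guided_action) and the specific correction formula are assumptions made for illustration only.

```python
# Hypothetical sketch: critic-guided action correction replacing fixed random
# noise during exploration. The correction rule below is an assumption; the
# abstract does not specify the exact formulation used in the paper.
import torch
import torch.nn as nn


class MCCritic(nn.Module):
    """One member of the critic ensemble: Q(s, a) -> scalar estimate."""

    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([state, action], dim=-1))


def guided_action(policy: nn.Module,
                  ensemble: list[MCCritic],
                  state: torch.Tensor,
                  step_size: float = 0.1,
                  max_action: float = 1.0) -> torch.Tensor:
    """Correct the deterministic policy action using the critic ensemble.

    Instead of adding fixed Gaussian noise, nudge the action along the
    gradient of the ensemble-mean Q value, scaled by ensemble disagreement
    so that exploration shrinks as the critics come to agree.
    """
    action = policy(state).detach().requires_grad_(True)
    q_values = torch.stack([critic(state, action) for critic in ensemble])
    q_mean = q_values.mean(dim=0)                 # (batch, 1) ensemble mean
    disagreement = q_values.std(dim=0).detach()   # (batch, 1) exploration scale
    grad, = torch.autograd.grad(q_mean.sum(), action)
    corrected = action + step_size * disagreement * grad.sign()
    return corrected.clamp(-max_action, max_action).detach()
```

In a DDPG/TD3-style collection loop, a function like guided_action would replace the usual "action = policy(state) + Gaussian noise" step when gathering transitions; how the paper additionally uses the exploratory module to modify the critic is not detailed in the abstract and is not sketched here.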
