Sample Efficient Deep Reinforcement Learning via Local Planning (2301.12579v2)

Published 29 Jan 2023 in cs.LG and cs.AI

Abstract: The focus of this work is sample-efficient deep reinforcement learning (RL) with a simulator. One useful property of simulators is that it is typically easy to reset the environment to a previously observed state. We propose an algorithmic framework, named uncertainty-first local planning (UFLP), that takes advantage of this property. Concretely, in each data collection iteration, with some probability, our meta-algorithm resets the environment to an observed state which has high uncertainty, instead of sampling according to the initial-state distribution. The agent-environment interaction then proceeds as in the standard online RL setting. We demonstrate that this simple procedure can dramatically improve the sample cost of several baseline RL algorithms on difficult exploration tasks. Notably, with our framework, we can achieve super-human performance on the notoriously hard Atari game, Montezuma's Revenge, with a simple (distributional) double DQN. Our work can be seen as an efficient approximate implementation of an existing algorithm with theoretical guarantees, which offers an interpretation of the positive empirical results.


Summary

  • The paper introduces UFLP, a framework that uses state-history buffers and uncertainty-based resets to enhance sample efficiency in simulators.
  • It demonstrates super-human performance on tasks like Montezuma’s Revenge by integrating UFLP with a distributional double DQN.
  • Empirical results across benchmarks such as Deep Sea and Cartpole Swingup validate UFLP’s capacity to reduce sample costs in sparse-reward environments.

Sample Efficient Deep Reinforcement Learning via Local Planning

The paper presents a framework named Uncertainty-First Local Planning (UFLP) to improve the sample efficiency of deep reinforcement learning (RL) when a simulator is available. The authors exploit a practical property of simulators: the environment can easily be reset to a previously observed state. By leveraging this property, UFLP reduces sample cost through uncertainty-driven revisitation of stored states.

UFLP modifies the agent-environment interaction at the start of each data-collection episode: rather than always sampling from the initial-state distribution, the agent may restart from a state stored in a history buffer. With probability 1 − ε, UFLP resets the simulator to a previously observed state with high estimated uncertainty; otherwise the episode begins from the initial-state distribution as usual, and the interaction then proceeds as in standard online RL. Starting exploration from uncertain states curbs wasted samples, particularly in environments that require hard exploration.
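To make the procedure concrete, below is a minimal Python sketch of one UFLP-style data-collection episode. The `env.reset_to`, `uncertainty`, and agent interfaces are illustrative assumptions, not the paper's actual implementation; `reset_prob` plays the role of 1 − ε above.

```python
import random

def collect_episode(env, agent, state_buffer, uncertainty,
                    reset_prob=0.5, max_steps=1000):
    """One UFLP-style data-collection episode (illustrative sketch).

    With probability `reset_prob`, restart the simulator from the
    highest-uncertainty state stored in the buffer; otherwise sample
    from the usual initial-state distribution. The `env.reset_to` and
    `uncertainty` interfaces are assumed placeholders.
    """
    if state_buffer and random.random() < reset_prob:
        # Restart from the stored state the agent is most uncertain about.
        start_state = max(state_buffer, key=uncertainty)
        obs = env.reset_to(start_state)
    else:
        obs = env.reset()

    trajectory = []
    for _ in range(max_steps):
        action = agent.act(obs)
        next_obs, reward, done, info = env.step(action)
        trajectory.append((obs, action, reward, next_obs, done))
        # Store the simulator state so later episodes can restart from it.
        state_buffer.append(info.get("sim_state", next_obs))
        obs = next_obs
        if done:
            break
    return trajectory
```

Picking the single highest-uncertainty state is one design choice; sampling buffered states with probability proportional to their uncertainty scores is an equally plausible variant of the same idea.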

A pivotal example of UFLP's efficacy is the notoriously hard Atari game Montezuma's Revenge. The authors report super-human performance when UFLP is combined with a simple distributional double DQN. This result supports the paper's main goal: turning theoretically grounded ideas such as core-set state revisitation into a practical implementation with tangible empirical success.

The paper also contrasts local access with the traditional online-access setting. Standard online RL mirrors real-world learning: the agent interacts with the environment sequentially and cannot revisit arbitrary states directly. UFLP's local-access protocol, which restarts from previously observed states, is instead a pragmatic adaptation to simulator-based learning.

Several baseline RL algorithms were augmented with UFLP, showing a notable reduction in sample cost on hard-exploration environments such as the Deep Sea and Cartpole Swingup tasks from bsuite, as well as challenging Atari games like PrivateEye and Venture. The paper also details the implementation: the base agents include Double DQN, bootstrapped DDQN, and policy iteration, paired with uncertainty measures such as ensemble disagreement, feature covariance, and random network distillation.
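As an illustration of two of these uncertainty measures, the sketch below scores a state either by disagreement among an ensemble of Q-networks or by a feature-covariance bonus. The estimators and interfaces are assumptions for exposition, not the paper's exact code.

```python
import numpy as np

def ensemble_uncertainty(q_ensemble, state):
    """Score a state by ensemble disagreement (illustrative sketch).

    `q_ensemble` is assumed to be a list of Q-networks, each mapping a
    state to a vector of per-action values. The standard deviation of
    their greedy values serves as the uncertainty signal used to rank
    buffered states.
    """
    greedy_values = np.array([np.max(q(state)) for q in q_ensemble])
    return float(greedy_values.std())

def covariance_uncertainty(phi, state, cov_inv):
    """Score a state by feature-space novelty (illustrative sketch).

    `phi` maps a state to a feature vector and `cov_inv` is the inverse
    of the (regularized) covariance matrix of previously seen features,
    so the score is the elliptical bonus phi(s)^T cov_inv phi(s).
    """
    x = phi(state)
    return float(x @ cov_inv @ x)
```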

Empirical results show that local access can markedly improve sample efficiency across these agents, since restarting from high-uncertainty states concentrates exploration where it is most needed in sparse-reward environments. The ability to reset the simulator to informative checkpoints proves especially beneficial for tasks with non-trivial exploration.

The implications are twofold. Practically, UFLP offers a straightforward way to improve the sample efficiency of existing RL agents; theoretically, it broadens the role of simulators as an analytical tool in RL research. This opens directions for future work, particularly in refining uncertainty estimation and extending UFLP to partially observable or stochastic domains.

More broadly, this work positions simulators not just as testbeds but as active enablers of efficient exploration. By combining uncertainty-driven restarts with standard deep RL agents, UFLP makes a substantive contribution to both RL theory and practice.
