Sample Efficient Deep Reinforcement Learning via Local Planning (2301.12579v2)
Abstract: The focus of this work is sample-efficient deep reinforcement learning (RL) with a simulator. One useful property of simulators is that it is typically easy to reset the environment to a previously observed state. We propose an algorithmic framework, named uncertainty-first local planning (UFLP), that takes advantage of this property. Concretely, in each data collection iteration, with some probability, our meta-algorithm resets the environment to an observed state with high uncertainty, instead of sampling according to the initial-state distribution. The agent-environment interaction then proceeds as in the standard online RL setting. We demonstrate that this simple procedure can dramatically reduce the sample cost of several baseline RL algorithms on difficult exploration tasks. Notably, with our framework, we achieve super-human performance on the notoriously hard Atari game Montezuma's Revenge with a simple (distributional) double DQN. Our work can be seen as an efficient approximate implementation of an existing algorithm with theoretical guarantees, which offers an interpretation of the positive empirical results.
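The procedure described in the abstract is simple enough to sketch in code. Below is a minimal illustration of the UFLP collection loop. The simulator interface (`reset`, `reset_to`, `step`, `get_state`), the agent interface (`act`, `observe`), and the `uncertainty_fn` estimator are all hypothetical names introduced here for illustration; `reset_to` and `get_state` in particular are not standard Gym API, and the paper leaves the base agent and the uncertainty estimate (e.g., ensemble disagreement or an RND bonus) pluggable.

```python
import random


class UFLP:
    """Sketch of the uncertainty-first local planning (UFLP) meta-algorithm.

    In each data collection iteration, with probability `reset_prob`, the
    simulator is reset to a previously observed state with high estimated
    uncertainty; otherwise it is reset via the initial-state distribution.
    Interaction then proceeds as in standard online RL.
    """

    def __init__(self, env, agent, uncertainty_fn,
                 reset_prob=0.5, buffer_size=10_000):
        self.env = env                        # simulator with reset-to-state support (assumed API)
        self.agent = agent                    # any base RL agent, e.g. a (distributional) double DQN
        self.uncertainty_fn = uncertainty_fn  # maps a stored state to a scalar uncertainty
        self.reset_prob = reset_prob          # probability of an uncertainty-first reset
        self.buffer_size = buffer_size
        self.state_buffer = []                # restorable simulator states observed so far

    def _high_uncertainty_state(self):
        # Greedy choice: the stored state with the highest current uncertainty.
        return max(self.state_buffer, key=self.uncertainty_fn)

    def collect_episode(self):
        if self.state_buffer and random.random() < self.reset_prob:
            # Reset to a high-uncertainty observed state (the "local planning" step).
            obs = self.env.reset_to(self._high_uncertainty_state())
        else:
            # Otherwise sample from the initial-state distribution as usual.
            obs = self.env.reset()

        done = False
        while not done:
            action = self.agent.act(obs)
            obs_next, reward, done = self.env.step(action)  # assumed 3-tuple return
            self.agent.observe(obs, action, reward, obs_next, done)
            if len(self.state_buffer) < self.buffer_size:
                # Record a restorable simulator state for future resets.
                self.state_buffer.append(self.env.get_state())
            obs = obs_next
```

Whether the reset state is picked greedily, as above, or sampled in proportion to its uncertainty is a design choice this sketch leaves open; the framework itself is agnostic to both the base agent and the uncertainty estimator.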