Dr. Strategy: Model-Based Generalist Agents with Strategic Dreaming (2402.18866v2)
Abstract: Model-based reinforcement learning (MBRL) has been a primary approach to improving sample efficiency and to building generalist agents. However, little effort has gone into improving the dreaming strategy itself, which raises the question of whether and how an agent can "dream better" in a more structured and strategic way. In this paper, inspired by observations from cognitive science suggesting that humans use a spatial divide-and-conquer strategy in planning, we propose a new MBRL agent, called Dr. Strategy, equipped with a novel Dreaming Strategy. The agent realizes a divide-and-conquer-like strategy in dreaming: it learns a set of latent landmarks and uses them to train a landmark-conditioned highway policy. With the highway policy, the agent first learns in the dream to move to a landmark, and from there tackles exploration and goal achievement in a more focused way. In experiments, we show that the proposed model outperforms prior pixel-based MBRL methods on various visually complex and partially observable navigation tasks.
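To make the dreaming strategy concrete, below is a minimal Python sketch of how latent landmarks and a landmark-conditioned highway policy could fit together. It is an illustration under stated assumptions, not the paper's implementation: the names (`discover_landmarks`, `strategic_dream`, `highway_policy`, `explore_policy`, `world_model_step`) are hypothetical, plain k-means stands in for the paper's landmark-learning mechanism, and the world model and policies are reduced to toy callables.

```python
import numpy as np

rng = np.random.default_rng(0)

def discover_landmarks(latents, n_landmarks, iters=50):
    """Cluster replay-buffer latent states into landmark centroids.
    Plain k-means is a stand-in for the paper's landmark-learning scheme."""
    centroids = latents[rng.choice(len(latents), n_landmarks, replace=False)]
    for _ in range(iters):
        # Assign each latent state to its nearest centroid, then re-center.
        dists = np.linalg.norm(latents[:, None] - centroids[None], axis=-1)
        assign = dists.argmin(axis=1)
        for k in range(n_landmarks):
            members = latents[assign == k]
            if len(members):
                centroids[k] = members.mean(axis=0)
    return centroids

def imagined_rollout(state, policy, world_model_step, horizon):
    """Roll a policy forward inside the learned dynamics model ("dreaming")."""
    traj = [state]
    for _ in range(horizon):
        state = world_model_step(state, policy(state))
        traj.append(state)
    return traj

def strategic_dream(start, landmarks, highway_policy, explore_policy,
                    world_model_step, travel_horizon=15, explore_horizon=10):
    """Divide-and-conquer dreaming: first travel to a sampled landmark with the
    landmark-conditioned highway policy, then act locally from there."""
    goal = landmarks[rng.integers(len(landmarks))]  # landmark that anchors this dream
    travel = imagined_rollout(start, lambda s: highway_policy(s, goal),
                              world_model_step, travel_horizon)
    local = imagined_rollout(travel[-1], explore_policy,
                             world_model_step, explore_horizon)
    return travel + local[1:]

# Toy usage with 2-D latents, just to show how the interfaces line up.
latents = rng.normal(size=(500, 2))
landmarks = discover_landmarks(latents, n_landmarks=8)
step = lambda s, a: s + 0.1 * a                                   # stand-in world model
highway = lambda s, g: (g - s) / (np.linalg.norm(g - s) + 1e-8)   # steer toward the landmark
explore = lambda s: rng.normal(size=s.shape)                      # random local exploration
dream = strategic_dream(latents[0], landmarks, highway, explore, step)
print(len(dream), "imagined latent states")
```

The design point the sketch tries to capture is that landmarks partition the latent space, so each dream can be anchored to a reachable region: the highway policy amortizes long-range travel, leaving the exploration or goal-achievement policy with only a short, focused horizon.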