Reward-Free Curricula for Training Robust World Models (2306.09205v2)
Abstract: There has been a recent surge of interest in developing generally capable agents that can adapt to new tasks without additional training in the environment. Learning world models from reward-free exploration is a promising approach that enables policies to be trained on imagined experience for new tasks. However, achieving a general agent requires robustness across different environments. In this work, we address the novel problem of generating curricula in the reward-free setting to train robust world models. We consider robustness in terms of minimax regret over all environment instantiations, and we show that minimax regret can be connected to minimising the maximum error in the world model across environment instances. This result informs our algorithm, WAKER: Weighted Acquisition of Knowledge across Environments for Robustness. WAKER selects environments for data collection based on the estimated error of the world model for each environment. Our experiments demonstrate that WAKER outperforms several baselines, resulting in improved robustness, efficiency, and generalisation.
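To make the acquisition step concrete, below is a minimal sketch of error-weighted environment selection in the spirit of WAKER. The softmax weighting, the `select_environment` helper, and the toy error-decay update are illustrative assumptions for this sketch, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def select_environment(error_estimates, temperature=1.0):
    """Sample an environment index with probability increasing in its
    estimated world-model error (softmax weighting; the paper's exact
    acquisition rule may differ)."""
    logits = np.asarray(error_estimates, dtype=float) / temperature
    probs = np.exp(logits - logits.max())  # numerically stable softmax
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

# Toy loop: repeatedly collect data in the environment where the world
# model is estimated to be worst, then shrink that environment's error to
# mimic the model improving from the newly collected data.
error_estimates = np.array([0.2, 0.9, 0.5])  # one entry per environment
for _ in range(20):
    idx = select_environment(error_estimates)
    error_estimates[idx] *= 0.95  # stand-in for re-estimating model error
print(error_estimates)
```

In a full pipeline, the decay update would be replaced by re-estimating the world model's prediction error on trajectories gathered reward-free from the selected environment, so that data collection concentrates on the instances where the model is weakest.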