
Reward-Free Curricula for Training Robust World Models (2306.09205v2)

Published 15 Jun 2023 in cs.LG

Abstract: There has been a recent surge of interest in developing generally-capable agents that can adapt to new tasks without additional training in the environment. Learning world models from reward-free exploration is a promising approach, and enables policies to be trained using imagined experience for new tasks. However, achieving a general agent requires robustness across different environments. In this work, we address the novel problem of generating curricula in the reward-free setting to train robust world models. We consider robustness in terms of minimax regret over all environment instantiations and show that the minimax regret can be connected to minimising the maximum error in the world model across environment instances. This result informs our algorithm, WAKER: Weighted Acquisition of Knowledge across Environments for Robustness. WAKER selects environments for data collection based on the estimated error of the world model for each environment. Our experiments demonstrate that WAKER outperforms several baselines, resulting in improved robustness, efficiency, and generalisation.


Summary

  • The paper presents WAKER, a novel algorithm that leverages reward-free exploration and minimax regret to enhance world model robustness.
  • The approach uses a unified recurrent neural network to learn latent representations across underspecified POMDPs for improved policy generalization.
  • Experimental results indicate that WAKER significantly outperforms baseline domain randomization, especially in out-of-distribution continuous control tasks.

Efficient Learning of Robust World Models in Reward-Free Settings

Introduction

The capability of agents to generalize across various tasks and to quickly adapt to new ones without further training is fundamental for developing generally-capable AI systems. One promising approach towards this goal is to leverage reward-free exploration for learning world models. A world model encapsulates an agent's understanding of its environment's dynamics, such that it can "imagine" and plan for future scenarios without additional data collection. The challenge arises in ensuring these models are robust across diverse environments, particularly under the reward-free paradigm where explicit task objectives are absent during the learning phase. This work introduces Weighted Acquisition of Knowledge across Environments for Robustness (WAKER), a novel algorithm targeting the efficient learning of robust world models without reliance on reward signals. Our approach significantly enhances the robustness and generality of learned policies, especially when facing out-of-distribution (OOD) environments.

Preliminaries

In the reward-free setting, an exploratory phase precedes task-specific learning: the agent accumulates knowledge of the environment without access to any reward signal. We formalize this with a reward-free Partially Observable Markov Decision Process (POMDP) and extend it to the underspecified POMDP (UPOMDP), which introduces variability through a set of parameters defining different environment instantiations. The world model is central in this setting, as it aims to capture the environment dynamics accurately within a learned latent-space representation.
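
As a rough, hedged sketch (the notation below is ours and may not match the paper's formalisation exactly), a reward-free UPOMDP can be viewed as a POMDP with the reward function removed and an additional set of free parameters $\Theta$ that condition the dynamics:

$$\mathcal{M} = \langle S, A, O, \Theta, \mathcal{T}, \Omega, \gamma \rangle, \qquad \mathcal{T}: S \times A \times \Theta \rightarrow \Delta(S), \qquad \Omega: S \rightarrow \Delta(O),$$

where fixing a parameter $\theta \in \Theta$ yields a single environment instantiation $\mathcal{M}_\theta$.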

Approach

The crux of our approach is to frame reward-free world model training as a minimax regret problem: minimize regret across all possible environment instantiations and downstream tasks. We show that this objective translates into minimizing the world model's maximum expected latent dynamics error across environment instantiations. Our proposed solution, WAKER, optimizes this surrogate by biasing environment sampling towards the instantiations with the highest estimated error, prioritizing data collection where the model is most uncertain.
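
As an informal rendering of this surrogate objective (notation ours), the learned latent dynamics $\hat{T}$ should minimize the worst-case expected prediction error over environment parameters:

$$\min_{\hat{T}} \, \max_{\theta \in \Theta} \; \mathbb{E}_{(s,a) \sim d_\theta}\!\left[ D\!\left( T_\theta(\cdot \mid s, a) \,\|\, \hat{T}(\cdot \mid s, a) \right) \right],$$

where $T_\theta$ is the true dynamics of instantiation $\theta$, $d_\theta$ is the state-action distribution under which that instantiation is explored, and $D$ is a suitable divergence between next-state distributions.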

World Models for Underspecified POMDPs

A single, unified world model, represented as $W = \{q, T\}$, is utilized across different environmental settings, leveraging a recurrent neural network to predict environment dynamics in a compact latent space. This configuration facilitates the learning of a generalized representation applicable across varied environmental parameters, aiding in robust policy formation.
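
To make this decomposition concrete, below is a minimal, illustrative PyTorch sketch of a single latent world model shared across environment settings, with an encoder playing the role of $q$ and a recurrent latent dynamics model playing the role of $T$. The specific architecture is an assumption for illustration rather than the paper's implementation.

```python
# Minimal, illustrative sketch (not the paper's implementation) of a single
# world model W = {q, T} shared across environment settings: q encodes
# observations into a latent state, T is a recurrent latent dynamics model.
import torch
import torch.nn as nn


class LatentWorldModel(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int, latent_dim: int = 32, hidden_dim: int = 128):
        super().__init__()
        # q: observation encoder into the compact latent space.
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim, hidden_dim),
            nn.ELU(),
            nn.Linear(hidden_dim, latent_dim),
        )
        # T: recurrent latent dynamics, conditioned on the current latent and action.
        self.dynamics = nn.GRUCell(latent_dim + act_dim, latent_dim)

    def encode(self, obs: torch.Tensor) -> torch.Tensor:
        return self.encoder(obs)

    def imagine_step(self, latent: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        # One step of "imagined" rollout entirely in latent space,
        # with no further data collection from the environment.
        return self.dynamics(torch.cat([latent, action], dim=-1), latent)
```

Downstream policies can then be trained on imagined rollouts produced by repeatedly calling `imagine_step`, which is what allows task-specific policies to be learned from the reward-free world model without further environment interaction.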

Reward-Free Minimax Regret

We extend the concept of minimax regret, commonly used in robust optimization, to the reward-free world model training context. Here, the objective shifts towards learning a world model that minimizes regret across all possible reward functions and environment configurations. This novel perspective underscores the goal of achieving near-optimal policy performance for any given task within an underspecified environment, without prior knowledge of specific reward functions during the learning phase.
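
One way to write this objective (again with our own illustrative notation) is as minimax regret over environment parameters $\theta$ and reward functions $R$, where $\hat{\pi}_R$ denotes the policy trained for task $R$ inside the learned world model $\hat{W}$:

$$\min_{\hat{W}} \, \max_{\theta \in \Theta,\; R} \left( \max_{\pi} V^{\pi}_{\theta, R} \;-\; V^{\hat{\pi}_R}_{\theta, R} \right),$$

so that, for every environment instantiation and every task, the policy derived from $\hat{W}$ should be near-optimal relative to the best policy for that instantiation and task.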

Weighted Acquisition of Knowledge across Environments for Robustness (WAKER)

WAKER specifically addresses how to select environments for data collection to train the world model most effectively. By estimating the error associated with each environment using an ensemble of neural networks and then sampling more frequently from those with higher estimated errors, WAKER intuitively pushes the learning process towards scenarios where the model's predictions are least accurate, thus driving improvement in model robustness. This method stands in contrast to naive domain randomization, showcasing superior performance in developing policies that generalize well across both seen and unseen environments.
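
The sketch below illustrates this kind of error-weighted environment selection. The disagreement-based error proxy and the softmax weighting are assumptions for illustration, not the paper's exact scheme.

```python
# Illustrative sketch of error-weighted environment selection in the spirit of
# WAKER. The disagreement proxy and the softmax weighting below are assumptions
# for illustration, not the paper's exact scheme.
import numpy as np


def ensemble_disagreement(next_latent_preds: np.ndarray) -> float:
    """Proxy for world-model error in one environment.

    next_latent_preds: array of shape (n_ensemble_members, n_samples, latent_dim)
    holding each ensemble member's predicted next latent states.
    """
    # Variance across ensemble members, averaged over samples and latent
    # dimensions: high disagreement suggests the dynamics are poorly modelled.
    return float(next_latent_preds.var(axis=0).mean())


def sample_environment(error_estimates: dict, temperature: float = 1.0, rng=None):
    """Sample an environment id, favouring those with higher estimated error."""
    if rng is None:
        rng = np.random.default_rng()
    env_ids = list(error_estimates)
    errors = np.array([error_estimates[e] for e in env_ids], dtype=float)
    # Softmax weighting: environments where the model is least accurate are
    # sampled most often for the next round of reward-free data collection.
    weights = np.exp((errors - errors.max()) / temperature)
    probs = weights / weights.sum()
    return env_ids[rng.choice(len(env_ids), p=probs)]
```

In a full training loop, the error estimates would be refreshed periodically from the ensemble's predictions on recent trajectories, and the sampled environment parameters would be handed to the exploration policy for the next batch of reward-free data collection.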

Experiments

Our evaluation spans multiple continuous control tasks within pixel-based simulation environments, highlighting tasks with varying dynamics and complexity. The results demonstrate that WAKER significantly outstrips the performance of baseline domain randomization techniques, particularly in OOD scenarios and across different exploration policies. These findings not only underscore the efficacy of our approach in enhancing the robustness and generalization of learned policies but also spotlight the potential of reward-free exploration strategies in cultivating broadly capable agents.

Concluding Remarks

This work lays theoretical and empirical groundwork for advancing the robustness of world models learned in a reward-free setting. By innovatively applying the minimax regret principle to unsupervised environment exploration and learning, we provide a methodology that systematically improves world model accuracy and policy robustness. Future directions include scaling WAKER to more complex domains and integrating more advanced generative modeling techniques to further push the boundaries of general-purpose, adaptive AI systems.
