
Reward-Free Curricula for Training Robust World Models (2306.09205v2)

Published 15 Jun 2023 in cs.LG

Abstract: There has been a recent surge of interest in developing generally-capable agents that can adapt to new tasks without additional training in the environment. Learning world models from reward-free exploration is a promising approach, and enables policies to be trained using imagined experience for new tasks. However, achieving a general agent requires robustness across different environments. In this work, we address the novel problem of generating curricula in the reward-free setting to train robust world models. We consider robustness in terms of minimax regret over all environment instantiations and show that the minimax regret can be connected to minimising the maximum error in the world model across environment instances. This result informs our algorithm, WAKER: Weighted Acquisition of Knowledge across Environments for Robustness. WAKER selects environments for data collection based on the estimated error of the world model for each environment. Our experiments demonstrate that WAKER outperforms several baselines, resulting in improved robustness, efficiency, and generalisation.


Summary

  • The paper presents WAKER, a novel algorithm that leverages reward-free exploration and minimax regret to enhance world model robustness.
  • The approach uses a unified recurrent neural network to learn latent representations across underspecified POMDPs for improved policy generalization.
  • Experimental results indicate that WAKER significantly outperforms baseline domain randomization, especially in out-of-distribution continuous control tasks.

Efficient Learning of Robust World Models in Reward-Free Settings

Introduction

The capability of agents to generalize across various tasks and to quickly adapt to new ones without further training is fundamental for developing generally-capable AI systems. One promising approach towards this goal is to leverage reward-free exploration for learning world models. A world model encapsulates an agent's understanding of its environment's dynamics, such that it can "imagine" and plan for future scenarios without additional data collection. The challenge arises in ensuring these models are robust across diverse environments, particularly under the reward-free paradigm where explicit task objectives are absent during the learning phase. This work introduces Weighted Acquisition of Knowledge across Environments for Robustness (WAKER), a novel algorithm targeting the efficient learning of robust world models without reliance on reward signals. Our approach significantly enhances the robustness and generality of learned policies, especially when facing out-of-distribution (OOD) environments.

Preliminaries

In the reward-free setting, an exploratory phase precedes task-specific learning: the agent accumulates knowledge of the environment without access to any reward signal. We formalize this with a reward-free Partially Observable Markov Decision Process (POMDP) and extend it to the underspecified POMDP (UPOMDP), which introduces variability through a set of parameters defining different environment instantiations. The world model is central in this setting, as it aims to capture the environment dynamics accurately within a learned latent-space representation.
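
As a rough, hedged sketch (the notation below is ours and may not match the paper's formalisation exactly), a reward-free UPOMDP can be viewed as a POMDP with the reward function removed and an additional set of free parameters $\Theta$ that condition the dynamics:

$$\mathcal{M} = \langle S, A, O, \Theta, \mathcal{T}, \Omega, \gamma \rangle, \qquad \mathcal{T}: S \times A \times \Theta \rightarrow \Delta(S), \qquad \Omega: S \rightarrow \Delta(O),$$

where fixing a parameter $\theta \in \Theta$ yields a single environment instantiation $\mathcal{M}_\theta$.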

Approach

The crux of our approach is to frame reward-free world model training as a minimax regret problem: minimize regret across all possible environment instantiations and downstream tasks. We show that this objective translates into minimizing the world model's maximum expected latent dynamics error across environment instantiations. Our proposed solution, WAKER, optimizes this surrogate by biasing environment sampling towards the instantiations with the highest estimated error, prioritizing data collection where the model is most uncertain.
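
As an informal rendering of this surrogate objective (notation ours), the learned latent dynamics $\hat{T}$ should minimize the worst-case expected prediction error over environment parameters:

$$\min_{\hat{T}} \, \max_{\theta \in \Theta} \; \mathbb{E}_{(s,a) \sim d_\theta}\!\left[ D\!\left( T_\theta(\cdot \mid s, a) \,\|\, \hat{T}(\cdot \mid s, a) \right) \right],$$

where $T_\theta$ is the true dynamics of instantiation $\theta$, $d_\theta$ is the state-action distribution under which that instantiation is explored, and $D$ is a suitable divergence between next-state distributions.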

World Models for Underspecified POMDPs

A single, unified world model, represented as $W = \{q, T\}$, is utilized across different environmental settings, leveraging a recurrent neural network to predict environment dynamics in a compact latent space. This configuration facilitates the learning of a generalized representation applicable across varied environmental parameters, aiding in robust policy formation.
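
To make this decomposition concrete, below is a minimal, illustrative PyTorch sketch of a single latent world model shared across environment settings, with an encoder playing the role of $q$ and a recurrent latent dynamics model playing the role of $T$. The specific architecture is an assumption for illustration rather than the paper's implementation.

```python
# Minimal, illustrative sketch (not the paper's implementation) of a single
# world model W = {q, T} shared across environment settings: q encodes
# observations into a latent state, T is a recurrent latent dynamics model.
import torch
import torch.nn as nn


class LatentWorldModel(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int, latent_dim: int = 32, hidden_dim: int = 128):
        super().__init__()
        # q: observation encoder into the compact latent space.
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim, hidden_dim),
            nn.ELU(),
            nn.Linear(hidden_dim, latent_dim),
        )
        # T: recurrent latent dynamics, conditioned on the current latent and action.
        self.dynamics = nn.GRUCell(latent_dim + act_dim, latent_dim)

    def encode(self, obs: torch.Tensor) -> torch.Tensor:
        return self.encoder(obs)

    def imagine_step(self, latent: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        # One step of "imagined" rollout entirely in latent space,
        # with no further data collection from the environment.
        return self.dynamics(torch.cat([latent, action], dim=-1), latent)
```

Downstream policies can then be trained on imagined rollouts produced by repeatedly calling `imagine_step`, which is what allows task-specific policies to be learned from the reward-free world model without further environment interaction.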

Reward-Free Minimax Regret

We extend the concept of minimax regret, commonly used in robust optimization, to the reward-free world model training context. Here, the objective shifts towards learning a world model that minimizes regret across all possible reward functions and environment configurations. This novel perspective underscores the goal of achieving near-optimal policy performance for any given task within an underspecified environment, without prior knowledge of specific reward functions during the learning phase.
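
One way to write this objective (again with our own illustrative notation) is as minimax regret over environment parameters $\theta$ and reward functions $R$, where $\hat{\pi}_R$ denotes the policy trained for task $R$ inside the learned world model $\hat{W}$:

$$\min_{\hat{W}} \, \max_{\theta \in \Theta,\; R} \left( \max_{\pi} V^{\pi}_{\theta, R} \;-\; V^{\hat{\pi}_R}_{\theta, R} \right),$$

so that, for every environment instantiation and every task, the policy derived from $\hat{W}$ should be near-optimal relative to the best policy for that instantiation and task.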

Weighted Acquisition of Knowledge across Environments for Robustness (WAKER)

WAKER specifically addresses how to select environments for data collection to train the world model most effectively. By estimating the error associated with each environment using an ensemble of neural networks and then sampling more frequently from those with higher estimated errors, WAKER intuitively pushes the learning process towards scenarios where the model's predictions are least accurate, thus driving improvement in model robustness. This method stands in contrast to naive domain randomization, showcasing superior performance in developing policies that generalize well across both seen and unseen environments.
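
The sketch below illustrates this kind of error-weighted environment selection. The disagreement-based error proxy and the softmax weighting are assumptions for illustration, not the paper's exact scheme.

```python
# Illustrative sketch of error-weighted environment selection in the spirit of
# WAKER. The disagreement proxy and the softmax weighting below are assumptions
# for illustration, not the paper's exact scheme.
import numpy as np


def ensemble_disagreement(next_latent_preds: np.ndarray) -> float:
    """Proxy for world-model error in one environment.

    next_latent_preds: array of shape (n_ensemble_members, n_samples, latent_dim)
    holding each ensemble member's predicted next latent states.
    """
    # Variance across ensemble members, averaged over samples and latent
    # dimensions: high disagreement suggests the dynamics are poorly modelled.
    return float(next_latent_preds.var(axis=0).mean())


def sample_environment(error_estimates: dict, temperature: float = 1.0, rng=None):
    """Sample an environment id, favouring those with higher estimated error."""
    if rng is None:
        rng = np.random.default_rng()
    env_ids = list(error_estimates)
    errors = np.array([error_estimates[e] for e in env_ids], dtype=float)
    # Softmax weighting: environments where the model is least accurate are
    # sampled most often for the next round of reward-free data collection.
    weights = np.exp((errors - errors.max()) / temperature)
    probs = weights / weights.sum()
    return env_ids[rng.choice(len(env_ids), p=probs)]
```

In a full training loop, the error estimates would be refreshed periodically from the ensemble's predictions on recent trajectories, and the sampled environment parameters would be handed to the exploration policy for the next batch of reward-free data collection.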

Experiments

Our evaluation spans multiple continuous control tasks within pixel-based simulation environments, highlighting tasks with varying dynamics and complexity. The results demonstrate that WAKER significantly outstrips the performance of baseline domain randomization techniques, particularly in OOD scenarios and across different exploration policies. These findings not only underscore the efficacy of our approach in enhancing the robustness and generalization of learned policies but also spotlight the potential of reward-free exploration strategies in cultivating broadly capable agents.

Concluding Remarks

This work lays theoretical and empirical groundwork for advancing the robustness of world models learned in a reward-free setting. By innovatively applying the minimax regret principle to unsupervised environment exploration and learning, we provide a methodology that systematically improves world model accuracy and policy robustness. Future directions include scaling WAKER to more complex domains and integrating more advanced generative modeling techniques to further push the boundaries of general-purpose, adaptive AI systems.
