- The paper demonstrates that using a generative recurrent world model with a VAE for visual encoding and an MDN-RNN for latent dynamics facilitates efficient RL policy evolution.
- The experiments in CarRacing-v0 and DoomTakeCover-v0 show robust performance, with an average score of 906 ± 21 in CarRacing and successful transfer of a policy trained entirely inside the world model in DoomTakeCover.
- The approach minimizes dependence on actual environment interactions by training in a simulated latent space, paving the way for more efficient and resilient RL algorithms.
Recurrent World Models Facilitate Policy Evolution
In the paper "Recurrent World Models Facilitate Policy Evolution" by David Ha and Jürgen Schmidhuber, the authors describe an approach that uses generative recurrent neural networks to facilitate the evolution of policies in reinforcement learning (RL). A world model is trained in an unsupervised manner to model popular RL environments, and its extracted features are fed into a compact controller (C) whose parameters are trained by evolution. The agent's architecture comprises three components: a visual sensory component (V), a memory component (M), and a decision-making component (C).
Conceptual Framework
The proposed model-building strategy is reminiscent of human cognition, in which abstract representations of sensory inputs are formed and used to predict future events. V encodes each high-dimensional observation into a low-dimensional latent vector using a Variational Autoencoder (VAE). M, implemented as a Mixture Density Network combined with a Recurrent Neural Network (MDN-RNN), predicts the probability distribution of the next latent vector given the current latent vector, the action, and its hidden state. C, a simple linear model, maps the latent vector and M's hidden state to an action. Notably, keeping C small confines the credit assignment problem to a parameter count that evolution strategies handle well, while most of the agent's capacity resides in V and M, which are trained efficiently with backpropagation.
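To make the per-timestep loop concrete, here is a minimal Python sketch (not the authors' code): `vae.encode` and `rnn.step` are hypothetical placeholders for a pretrained V and M, and the dimensions follow those reported for CarRacing-v0 (a 32-dimensional latent, a 256-unit hidden state, and three actions).

```python
import numpy as np

Z_DIM, H_DIM, A_DIM = 32, 256, 3          # sizes reported for CarRacing-v0

W_c = np.zeros((A_DIM, Z_DIM + H_DIM))    # controller weights (in practice evolved by CMA-ES)
b_c = np.zeros(A_DIM)                     # controller bias

def act(obs, h, vae, rnn):
    """One control step: encode the frame, pick an action, advance the memory."""
    z = vae.encode(obs)                           # V: frame -> latent vector z_t
    a = W_c @ np.concatenate([z, h]) + b_c        # C: a_t = W_c [z_t, h_t] + b_c (squash to valid ranges as needed)
    h = rnn.step(z, a, h)                         # M: update hidden state from (z_t, a_t)
    return a, h
```

Note the division of labor: everything learned by gradient descent lives in `vae` and `rnn`, while the only parameters the evolution strategy has to search over are `W_c` and `b_c`.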
Experimental Validation
The efficacy of this approach is demonstrated through experiments in two popular RL environments: CarRacing-v0 and DoomTakeCover-v0. The training process can be summarized in four steps: collect rollouts from a random policy, train V to encode observations into a latent space, train M to model the dynamics in that latent space, and finally optimize C using the Covariance Matrix Adaptation Evolution Strategy (CMA-ES), sketched below.
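The final step can be illustrated with the pycma library. Here `rollout` is a hypothetical helper that would run one CarRacing-v0 episode through the V/M/C loop sketched above and return the cumulative reward; the population size of 64 follows the paper, while the initial step size of 0.1 is an illustrative choice.

```python
import numpy as np
import cma  # the pycma package


def rollout(params):
    """Hypothetical helper: run one CarRacing-v0 episode using `params` as the
    flattened controller weights (W_c, b_c) and return the cumulative reward."""
    raise NotImplementedError


n_params = 3 * (32 + 256) + 3                      # 867 controller parameters for CarRacing-v0
es = cma.CMAEvolutionStrategy(n_params * [0.0], 0.1, {"popsize": 64})

while not es.stop():
    candidates = es.ask()                          # sample a population of candidate controllers
    fitness = [-rollout(np.asarray(c)) for c in candidates]  # CMA-ES minimizes, so negate reward
    es.tell(candidates, fitness)                   # update the search distribution
    es.disp()
```

In the paper, each candidate's fitness is averaged over multiple rollouts to reduce the variance of the evaluation.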
- CarRacing-v0 Experiment:
- V and M supplied spatial and temporal features (the latent vector z_t and M's hidden state h_t), which C combined to drive the car; the controller is a single linear layer producing the three continuous actions (steering, acceleration, brake).
- The agent achieved an average score of 906 ± 21 over 100 random trials, surpassing scores reported for algorithms such as DQN and A3C and meeting the 900-point average required to solve the task.
- DoomTakeCover-v0 Experiment:
- A world model was trained to simulate the VizDoom environment, and the controller was trained entirely inside this generated environment (see the sketch after this list); the resulting policies transferred successfully back to the actual environment.
- The best agent survived for 1092 ± 556 time steps on average when tested in the actual DoomTakeCover-v0 environment, above the 750-step average required to solve the task and indicating robust policy transfer.
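Training inside the generated environment can be sketched as follows; `controller`, `mdnrnn.initial_state`, and `mdnrnn.predict` are hypothetical placeholders rather than the authors' API, the 2100-step cap matches the DoomTakeCover episode limit, and the temperature value of 1.15 corresponds to the paper's best-performing setting.

```python
import numpy as np


def dream_rollout(controller, mdnrnn, z0, max_steps=2100, tau=1.15):
    """Roll out the controller inside the learned world model instead of VizDoom."""
    z, h = z0, mdnrnn.initial_state()
    steps = 0
    for _ in range(max_steps):
        a = controller(z, h)                          # C acts only on the latent state and memory
        pi, mu, sigma, done_p, h = mdnrnn.predict(z, a, h, temperature=tau)
        k = np.random.choice(len(pi), p=pi)           # pick a mixture component
        z = mu[k] + sigma[k] * np.random.randn(*mu[k].shape)   # sample z_{t+1} from that Gaussian
        steps += 1
        if np.random.rand() < done_p:                 # in Doom, M also predicts whether the agent dies
            break
    return steps                                      # survival time is the reward in DoomTakeCover
```

Because no pixels are rendered during these rollouts, generating experience in the dream avoids the cost of running the actual game engine.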
Implications and Future Directions
This research highlights several significant implications and future pathways:
- Model-Based RL: By training in a generated environment, the method reduces dependencies on the actual environment, offering a computationally efficient training mechanism.
- Adversarial Robustness: Increasing the temperature parameter of the MDN-RNN makes the generated environment more stochastic, which prevents C from exploiting the world model's imperfections (see the sampling sketch after this list) and suggests an avenue for developing more robust RL agents.
- Automated Feature Learning: The use of VAE and MDN-RNN allows the system to autonomously learn relevant features and dynamics, which could generalize to different tasks and environments.
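As an illustration of how the temperature enters the picture, below is a sketch of drawing the next latent vector from an MDN output, following the common recipe of dividing the mixture logits by τ and scaling the Gaussian noise by √τ; the variable names are illustrative rather than the authors' API.

```python
import numpy as np


def sample_latent(logits, mu, log_sigma, tau=1.0):
    """Sample the next latent vector from a mixture of diagonal Gaussians;
    larger tau spreads the mixture weights and inflates the noise,
    producing a more stochastic (and harder to exploit) dream."""
    scaled = logits / tau                             # temperature-adjusted mixture logits
    pi = np.exp(scaled - scaled.max())
    pi /= pi.sum()                                    # softmax over mixture components
    k = np.random.choice(len(pi), p=pi)               # choose a component
    sigma = np.exp(log_sigma[k]) * np.sqrt(tau)       # scale the standard deviation with tau
    return mu[k] + sigma * np.random.randn(*mu[k].shape)
```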
Future research could focus on scaling this approach to more complex environments, integrating artificial curiosity and intrinsic motivation to encourage exploration of new and useful states. Additionally, hierarchical planning and leveraging external memory could refine the model's ability to simulate long-term dependencies and intricate dynamics.
Conclusion
The paper demonstrates that recurrent world models can effectively facilitate policy evolution in RL by breaking down the problem into more manageable components. The use of generative models to simulate environments and the application of evolution strategies for policy optimization provide a promising framework for efficient and flexible training of RL agents. As AI continues to advance, approaches that blend model-based learning with neural network architectures are likely to play a crucial role in developing sophisticated, autonomous agents capable of handling complex, real-world tasks.