This paper introduces Dreamer, a model-based reinforcement learning agent designed to learn complex behaviors from high-dimensional image inputs by leveraging "latent imagination." The core idea is to learn a world model from past experience and then train an actor-critic agent entirely within the compact latent space of this model, allowing for efficient learning of long-horizon tasks.
1. Agent Architecture and Workflow
Dreamer consists of three main components that operate concurrently:
- Dynamics Learning: A world model is learned from a dataset of past experiences $\mathcal{D}$ (see the code sketch after this list). This model learns to:
- Encode observations and previous states/actions into a compact latent state (Representation Model: $p_\theta(s_t \mid s_{t-1}, a_{t-1}, o_t)$).
- Predict future latent states without seeing observations (Transition Model: $q_\theta(s_t \mid s_{t-1}, a_{t-1})$).
- Predict rewards from latent states (Reward Model: $q_\theta(r_t \mid s_t)$).
- The paper explores different ways to train this world model, primarily using image reconstruction (like PlaNet) or contrastive objectives. The Recurrent State Space Model (RSSM) architecture is used for the transition model.
- Behavior Learning: An actor-critic algorithm operates purely on imagined trajectories generated by the learned world model.
- Starting from latent states sampled from real experience sequences, the agent "imagines" trajectories of length $H$ using the transition model $q_\theta(s_\tau \mid s_{\tau-1}, a_{\tau-1})$, the reward model $q_\theta(r_\tau \mid s_\tau)$, and an action model $q_\phi(a_\tau \mid s_\tau)$.
- An Action Model (actor, $q_\phi(a_\tau \mid s_\tau)$) learns a policy within the latent space. It typically outputs the parameters of a distribution (e.g., a Tanh-transformed Gaussian for continuous actions).
- A Value Model (critic, $v_\psi(s_\tau)$) learns to predict the expected future rewards (the value) obtainable from a given latent state under the current action model within the imagination.
- The key innovation is training the action and value models using analytic gradients propagated back through the learned dynamics model over the imagination horizon $H$.
- Environment Interaction: The learned action model is used to select actions in the real environment. The agent first computes the current latent state from the history of observations and actions, then samples an action from the action model (adding exploration noise), executes it, and adds the resulting experience to the dataset $\mathcal{D}$.
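As a concrete illustration of the three model components above, here is a minimal PyTorch sketch. The class, method, and argument names are hypothetical; the deterministic GRU path of the RSSM, the image encoder/decoder, and the training objective are omitted. Treat it as an interface sketch rather than the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.distributions as td

class LatentWorldModel(nn.Module):
    """Sketch of the representation, transition, and reward models (RSSM-style,
    but without the deterministic recurrent state)."""

    def __init__(self, state_dim=30, action_dim=6, embed_dim=256, hidden=200):
        super().__init__()
        # Representation model p(s_t | s_{t-1}, a_{t-1}, o_t): infers the
        # posterior latent state from the encoded observation.
        self.repr_net = nn.Sequential(
            nn.Linear(state_dim + action_dim + embed_dim, hidden), nn.ELU(),
            nn.Linear(hidden, 2 * state_dim))
        # Transition model q(s_t | s_{t-1}, a_{t-1}): predicts the next latent
        # state without seeing the observation (used during imagination).
        self.trans_net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ELU(),
            nn.Linear(hidden, 2 * state_dim))
        # Reward model q(r_t | s_t): predicts the reward from the latent state.
        self.reward_net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ELU(), nn.Linear(hidden, 1))

    @staticmethod
    def _diag_gaussian(params):
        mean, raw_std = params.chunk(2, dim=-1)
        return td.Independent(td.Normal(mean, F.softplus(raw_std) + 0.1), 1)

    def posterior(self, prev_state, prev_action, obs_embed):
        """Representation model: latent state distribution given the observation."""
        x = torch.cat([prev_state, prev_action, obs_embed], dim=-1)
        return self._diag_gaussian(self.repr_net(x))

    def imagine_step(self, prev_state, prev_action):
        """Transition model: latent state distribution without the observation."""
        x = torch.cat([prev_state, prev_action], dim=-1)
        return self._diag_gaussian(self.trans_net(x))

    def reward(self, state):
        return self.reward_net(state).squeeze(-1)
```

During behavior learning, only `imagine_step` and `reward` are needed: actions from the action model are fed back in, so whole trajectories can be rolled out without touching pixels.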
2. Learning Behaviors by Latent Imagination
- Addressing the Finite Horizon: Model-based RL often suffers from shortsightedness due to the finite imagination horizon $H$. Dreamer addresses this by learning the value function $v_\psi(s_\tau)$, which estimates the sum of future rewards obtainable beyond the imagination horizon.
- Value Estimation: To train the actor and critic, the paper uses $\lambda$-returns $V_\lambda(s_\tau)$ calculated over the imagined trajectories. These combine multi-step imagined reward sums with value-function bootstrap estimates $v_\psi(s_\tau)$ to balance bias and variance:
  $$V_\lambda(s_\tau) = (1-\lambda) \sum_{n=1}^{H-1} \lambda^{n-1} V_N^n(s_\tau) + \lambda^{H-1} V_N^H(s_\tau),$$
  where $V_N^k(s_\tau) = \mathbb{E}_{q_\theta, q_\phi}\big[\sum_{n=\tau}^{h-1} \gamma^{n-\tau} r_n + \gamma^{h-\tau} v_\psi(s_h)\big]$ and $h = \min(\tau + k, t + H)$.
- Learning Objectives:
- Value Model: Updated via a mean squared error loss to match the computed $V_\lambda$ targets (with stopped gradients on the targets):
  $$\min_\psi \; \mathbb{E}_{q_\theta, q_\phi}\Big[\sum_{\tau=t}^{t+H} \tfrac{1}{2} \big\| v_\psi(s_\tau) - V_\lambda(s_\tau) \big\|^2\Big]$$
- Action Model: Updated to maximize the expected value estimates by backpropagating gradients through the value estimates and the learned dynamics:
  $$\max_\phi \; \mathbb{E}_{q_\theta, q_\phi}\Big[\sum_{\tau=t}^{t+H} V_\lambda(s_\tau)\Big]$$
  This backpropagation through the learned transition model is efficient because it operates entirely in the low-dimensional latent space; a code sketch of the $\lambda$-return and both objectives follows.
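Concretely, the following PyTorch sketch assumes imagined rewards and values of shape `[H, batch]` produced by a differentiable rollout of the world model; the function and variable names are illustrative and not the paper's reference code.

```python
import torch

def lambda_returns(rewards, values, bootstrap, gamma=0.99, lam=0.95):
    """Compute V_lambda targets for an imagined trajectory.

    rewards, values: tensors of shape [H, batch]; bootstrap: [batch], the value
    estimate for the state after the last imagined step. Uses the backward
    recursion V[t] = r[t] + gamma * ((1 - lam) * v[t+1] + lam * V[t+1]).
    """
    next_values = torch.cat([values[1:], bootstrap.unsqueeze(0)], dim=0)
    outputs, last = [], bootstrap
    for t in reversed(range(rewards.shape[0])):
        last = rewards[t] + gamma * ((1 - lam) * next_values[t] + lam * last)
        outputs.append(last)
    return torch.stack(outputs[::-1], dim=0)

# Stand-in tensors; in Dreamer these come from a differentiable imagined rollout
# of the transition, reward, action, and value models over horizon H.
H, batch = 15, 8
rewards = torch.randn(H, batch, requires_grad=True)
values = torch.randn(H, batch, requires_grad=True)
targets = lambda_returns(rewards, values, bootstrap=values[-1])

# Value model objective: mean squared error against stop-gradient targets.
value_loss = 0.5 * (values - targets.detach()).pow(2).mean()

# Action model objective: maximize the value estimates. Because the imagined
# rewards, values, and latent states are differentiable functions of the
# sampled actions, this gradient flows back through the learned dynamics.
actor_loss = -targets.mean()
```

Setting $\lambda = 0$ reduces the recursion to one-step bootstrapping, while $\lambda = 1$ recovers the full sum of imagined rewards plus a terminal value at the horizon.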
3. Implementation Details and Considerations
- World Model Training: The paper evaluates three objectives for training the world model:
- Reconstruction: Maximize ELBO with image reconstruction loss (like PlaNet). Works well empirically.
- Contrastive: Use Noise Contrastive Estimation (NCE) to maximize mutual information between states and observations, avoiding pixel generation. Performs decently but less consistently than reconstruction.
- Reward Prediction Only: Train only on reward prediction. Insufficient on its own in these experiments.
- Architecture: Uses CNNs for image encoding/decoding, an RSSM for the latent dynamics, and MLPs for the reward, value, and action models. Latent states are 30-dimensional diagonal Gaussians.
- Optimization: Adam optimizer is used. Gradient clipping is applied.
- Computational Efficiency: Learning in the latent space is much faster than planning or learning directly in image space. Dreamer trains significantly faster than PlaNet (online planning) and model-free methods like D4PG. Training takes ~3 hours per million steps on a V100 GPU.
- Hyperparameters: A single set of hyperparameters is used across all continuous control tasks (e.g., batch size 50, sequence length 50, imagination horizon $H=15$, $\gamma=0.99$, $\lambda=0.95$); a schematic optimization setup is sketched below.
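A schematic of the optimization setup implied by this list, as a minimal PyTorch sketch: the modules are trivial stand-ins, and the learning rates and clipping norm are illustrative assumptions rather than values stated in this summary.

```python
import torch
import torch.nn as nn

# Trivial stand-ins for the world model, action model, and value model; only
# the shared optimizer and gradient-clipping pattern is of interest here.
world_model = nn.Linear(30, 30)
actor = nn.Linear(30, 6)
critic = nn.Linear(30, 1)

# Single hyperparameter set (learning rates and grad_clip are assumptions).
config = dict(batch_size=50, seq_length=50, horizon=15, gamma=0.99, lam=0.95,
              grad_clip=100.0, model_lr=6e-4, actor_lr=8e-5, critic_lr=8e-5)

optimizers = {
    'model': torch.optim.Adam(world_model.parameters(), lr=config['model_lr']),
    'actor': torch.optim.Adam(actor.parameters(), lr=config['actor_lr']),
    'critic': torch.optim.Adam(critic.parameters(), lr=config['critic_lr']),
}

def optimize(loss, module, opt, clip=config['grad_clip']):
    """One Adam step with gradient-norm clipping, applied to each component."""
    opt.zero_grad()
    loss.backward()
    nn.utils.clip_grad_norm_(module.parameters(), clip)
    opt.step()

# Example: one critic update on a dummy regression loss over random latent states.
states = torch.randn(config['batch_size'], 30)
optimize(critic(states).pow(2).mean(), critic, optimizers['critic'])
```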
4. Experiments and Results
- Tasks: Evaluated on 20 challenging continuous control tasks from the DeepMind Control Suite using pixel inputs, plus some discrete Atari and DeepMind Lab tasks.
- Performance: Dreamer achieves state-of-the-art results on the continuous control benchmark, surpassing the final performance of strong model-free agents like D4PG ($823$ vs $786$ average score) while using far less data ($5 \times 10^6$ environment steps) and computation time. It maintains the data efficiency of PlaNet while significantly improving asymptotic performance.
- Long Horizon: Experiments show that learning the value function $v_\psi$ makes Dreamer robust to the choice of imagination horizon $H$, outperforming alternatives such as online planning (PlaNet) or learning only an action model without a value estimate, especially on tasks requiring long-term credit assignment.
- Representation Learning: Results confirm that the quality of the learned world model significantly impacts performance, with reconstruction yielding the best results among the tested methods.
5. Conclusion and Practical Implications
Dreamer demonstrates that learning behaviors entirely within the latent space of a learned world model, using analytic gradients backpropagated through the model dynamics, is a highly effective and efficient approach for solving complex visual control tasks. It combines the data efficiency of model-based methods with the strong asymptotic performance often associated with model-free methods. For practitioners, Dreamer offers a promising framework that is computationally efficient and achieves high performance, particularly for tasks with long horizons. The choice of representation learning objective for the world model is crucial and remains an area for future improvement. The method can be implemented using standard deep learning frameworks and requires careful tuning of the world model and behavior learning components.