This paper introduces Dreamer, a model-based reinforcement learning agent designed to learn complex behaviors from high-dimensional image inputs by leveraging "latent imagination." The core idea is to learn a world model from past experience and then train an actor-critic agent entirely within the compact latent space of this model, allowing for efficient learning of long-horizon tasks.
1. Agent Architecture and Workflow
Dreamer consists of three main components that operate concurrently:
- Dynamics Learning: A world model is learned from a dataset of past experiences $\mathcal{D}$ (see the code sketch after this list). This model learns to:
- Encode observations and previous states/actions into a compact latent state (Representation Model: $p_\theta(s_t \mid s_{t-1}, a_{t-1}, o_t)$).
- Predict future latent states without seeing observations (Transition Model: $q_\theta(s_t \mid s_{t-1}, a_{t-1})$).
- Predict rewards from latent states (Reward Model: $q_\theta(r_t \mid s_t)$).
- The paper explores different ways to train this world model, primarily using image reconstruction (like PlaNet) or contrastive objectives. The Recurrent State Space Model (RSSM) architecture is used for the transition model.
- Behavior Learning: An actor-critic algorithm operates purely on imagined trajectories generated by the learned world model.
- Starting from latent states sampled from real experience sequences, the agent "imagines" trajectories of length $H$ using the transition model $q_\theta(s_\tau \mid s_{\tau-1}, a_{\tau-1})$, the reward model $q_\theta(r_\tau \mid s_\tau)$, and an action model $q_\phi(a_\tau \mid s_\tau)$.
- An Action Model (actor, $q_\phi(a_\tau \mid s_\tau)$) learns a policy within the latent space. It typically outputs the parameters of a distribution (e.g., a Tanh-transformed Gaussian for continuous actions).
- A Value Model (critic, $v_\psi(s_\tau)$) learns to predict the expected future rewards (the value) obtainable from a given latent state under the current action model within the imagination.
- The key innovation is training the action and value models using analytic gradients propagated back through the learned dynamics model over the imagination horizon $H$.
- Environment Interaction: The learned action model is used to select actions in the real environment. The agent first computes the current latent state from the history of observations and actions, then samples an action from the action model (adding exploration noise), executes it, and adds the resulting experience to the dataset $\mathcal{D}$.
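As a concrete illustration of the three model components above, here is a minimal PyTorch sketch. The class, method, and argument names are hypothetical; the deterministic GRU path of the RSSM, the image encoder/decoder, and the training objective are omitted. Treat it as an interface sketch rather than the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.distributions as td

class LatentWorldModel(nn.Module):
    """Sketch of the representation, transition, and reward models (RSSM-style,
    but without the deterministic recurrent state)."""

    def __init__(self, state_dim=30, action_dim=6, embed_dim=256, hidden=200):
        super().__init__()
        # Representation model p(s_t | s_{t-1}, a_{t-1}, o_t): infers the
        # posterior latent state from the encoded observation.
        self.repr_net = nn.Sequential(
            nn.Linear(state_dim + action_dim + embed_dim, hidden), nn.ELU(),
            nn.Linear(hidden, 2 * state_dim))
        # Transition model q(s_t | s_{t-1}, a_{t-1}): predicts the next latent
        # state without seeing the observation (used during imagination).
        self.trans_net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ELU(),
            nn.Linear(hidden, 2 * state_dim))
        # Reward model q(r_t | s_t): predicts the reward from the latent state.
        self.reward_net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ELU(), nn.Linear(hidden, 1))

    @staticmethod
    def _diag_gaussian(params):
        mean, raw_std = params.chunk(2, dim=-1)
        return td.Independent(td.Normal(mean, F.softplus(raw_std) + 0.1), 1)

    def posterior(self, prev_state, prev_action, obs_embed):
        """Representation model: latent state distribution given the observation."""
        x = torch.cat([prev_state, prev_action, obs_embed], dim=-1)
        return self._diag_gaussian(self.repr_net(x))

    def imagine_step(self, prev_state, prev_action):
        """Transition model: latent state distribution without the observation."""
        x = torch.cat([prev_state, prev_action], dim=-1)
        return self._diag_gaussian(self.trans_net(x))

    def reward(self, state):
        return self.reward_net(state).squeeze(-1)
```

During behavior learning, only `imagine_step` and `reward` are needed: actions from the action model are fed back in, so whole trajectories can be rolled out without touching pixels.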
2. Learning Behaviors by Latent Imagination
- Addressing the Finite Horizon: Model-based RL often suffers from shortsightedness due to the finite imagination horizon $H$. Dreamer addresses this by learning the value function $v_\psi(s_\tau)$, which estimates the sum of future rewards obtainable beyond the imagination horizon.
- Value Estimation: To train the actor and critic, the paper uses $\lambda$-returns $V_\lambda(s_\tau)$ calculated over the imagined trajectories. These combine multi-step imagined reward sums with value-function bootstrap estimates $v_\psi(s_\tau)$ to balance bias and variance:
  $$V_\lambda(s_\tau) = (1-\lambda) \sum_{n=1}^{H-1} \lambda^{n-1} V_N^n(s_\tau) + \lambda^{H-1} V_N^H(s_\tau),$$
  where $V_N^k(s_\tau) = \mathbb{E}_{q_\theta, q_\phi}\big[\sum_{n=\tau}^{h-1} \gamma^{n-\tau} r_n + \gamma^{h-\tau} v_\psi(s_h)\big]$ and $h = \min(\tau + k, t + H)$.
- Learning Objectives:
- Value Model: Updated via a mean squared error loss to match the computed $V_\lambda$ targets (with stopped gradients on the targets):
  $$\min_\psi \; \mathbb{E}_{q_\theta, q_\phi}\Big[\sum_{\tau=t}^{t+H} \tfrac{1}{2} \big\| v_\psi(s_\tau) - V_\lambda(s_\tau) \big\|^2\Big]$$
- Action Model: Updated to maximize the expected value estimates by backpropagating gradients through the value estimates and the learned dynamics:
  $$\max_\phi \; \mathbb{E}_{q_\theta, q_\phi}\Big[\sum_{\tau=t}^{t+H} V_\lambda(s_\tau)\Big]$$
  This backpropagation through the learned transition model is efficient because it operates entirely in the low-dimensional latent space; a code sketch of the $\lambda$-return and both objectives follows.
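Concretely, the following PyTorch sketch assumes imagined rewards and values of shape `[H, batch]` produced by a differentiable rollout of the world model; the function and variable names are illustrative and not the paper's reference code.

```python
import torch

def lambda_returns(rewards, values, bootstrap, gamma=0.99, lam=0.95):
    """Compute V_lambda targets for an imagined trajectory.

    rewards, values: tensors of shape [H, batch]; bootstrap: [batch], the value
    estimate for the state after the last imagined step. Uses the backward
    recursion V[t] = r[t] + gamma * ((1 - lam) * v[t+1] + lam * V[t+1]).
    """
    next_values = torch.cat([values[1:], bootstrap.unsqueeze(0)], dim=0)
    outputs, last = [], bootstrap
    for t in reversed(range(rewards.shape[0])):
        last = rewards[t] + gamma * ((1 - lam) * next_values[t] + lam * last)
        outputs.append(last)
    return torch.stack(outputs[::-1], dim=0)

# Stand-in tensors; in Dreamer these come from a differentiable imagined rollout
# of the transition, reward, action, and value models over horizon H.
H, batch = 15, 8
rewards = torch.randn(H, batch, requires_grad=True)
values = torch.randn(H, batch, requires_grad=True)
targets = lambda_returns(rewards, values, bootstrap=values[-1])

# Value model objective: mean squared error against stop-gradient targets.
value_loss = 0.5 * (values - targets.detach()).pow(2).mean()

# Action model objective: maximize the value estimates. Because the imagined
# rewards, values, and latent states are differentiable functions of the
# sampled actions, this gradient flows back through the learned dynamics.
actor_loss = -targets.mean()
```

Setting $\lambda = 0$ reduces the recursion to one-step bootstrapping, while $\lambda = 1$ recovers the full sum of imagined rewards plus a terminal value at the horizon.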
3. Implementation Details and Considerations
- World Model Training: The paper evaluates three objectives for training the world model:
- Reconstruction: Maximize ELBO with image reconstruction loss (like PlaNet). Works well empirically.
- Contrastive: Use Noise Contrastive Estimation (NCE) to maximize mutual information between states and observations, avoiding pixel generation. Performs decently but less consistently than reconstruction.
- Reward Prediction Only: Train only on reward prediction. Insufficient on its own in these experiments.
- Architecture: Uses CNNs for image encoding/decoding, an RSSM for the latent dynamics, and MLPs for the reward, value, and action models. Latent states are 30-dimensional diagonal Gaussians.
- Optimization: Adam optimizer is used. Gradient clipping is applied.
- Computational Efficiency: Learning in the latent space is much faster than planning or learning directly in image space. Dreamer trains significantly faster than PlaNet (online planning) and model-free methods like D4PG. Training takes ~3 hours per million steps on a V100 GPU.
- Hyperparameters: A single set of hyperparameters is used across all continuous control tasks (e.g., batch size 50, sequence length 50, imagination horizon $H=15$, $\gamma=0.99$, $\lambda=0.95$); a schematic optimization setup is sketched below.
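A schematic of the optimization setup implied by this list, as a minimal PyTorch sketch: the modules are trivial stand-ins, and the learning rates and clipping norm are illustrative assumptions rather than values stated in this summary.

```python
import torch
import torch.nn as nn

# Trivial stand-ins for the world model, action model, and value model; only
# the shared optimizer and gradient-clipping pattern is of interest here.
world_model = nn.Linear(30, 30)
actor = nn.Linear(30, 6)
critic = nn.Linear(30, 1)

# Single hyperparameter set (learning rates and grad_clip are assumptions).
config = dict(batch_size=50, seq_length=50, horizon=15, gamma=0.99, lam=0.95,
              grad_clip=100.0, model_lr=6e-4, actor_lr=8e-5, critic_lr=8e-5)

optimizers = {
    'model': torch.optim.Adam(world_model.parameters(), lr=config['model_lr']),
    'actor': torch.optim.Adam(actor.parameters(), lr=config['actor_lr']),
    'critic': torch.optim.Adam(critic.parameters(), lr=config['critic_lr']),
}

def optimize(loss, module, opt, clip=config['grad_clip']):
    """One Adam step with gradient-norm clipping, applied to each component."""
    opt.zero_grad()
    loss.backward()
    nn.utils.clip_grad_norm_(module.parameters(), clip)
    opt.step()

# Example: one critic update on a dummy regression loss over random latent states.
states = torch.randn(config['batch_size'], 30)
optimize(critic(states).pow(2).mean(), critic, optimizers['critic'])
```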
4. Experiments and Results
- Tasks: Evaluated on 20 challenging continuous control tasks from the DeepMind Control Suite using pixel inputs, plus some discrete Atari and DeepMind Lab tasks.
- Performance: Dreamer achieves state-of-the-art results on the continuous control benchmark, surpassing the final performance of strong model-free agents like D4PG ($823$ vs $786$ average score) while using far less data ($5 \times 10^6$ environment steps) and computation time. It maintains the data efficiency of PlaNet while significantly improving asymptotic performance.
- Long Horizon: Experiments show that learning the value function $v_\psi$ makes Dreamer robust to the choice of imagination horizon $H$, outperforming alternatives such as online planning (PlaNet) or learning only an action model without a value estimate, especially on tasks requiring long-term credit assignment.
- Representation Learning: Results confirm that the quality of the learned world model significantly impacts performance, with reconstruction yielding the best results among the tested methods.
5. Conclusion and Practical Implications
Dreamer demonstrates that learning behaviors entirely within the latent space of a learned world model, using analytic gradients backpropagated through the model dynamics, is a highly effective and efficient approach for solving complex visual control tasks. It combines the data efficiency of model-based methods with the strong asymptotic performance often associated with model-free methods. For practitioners, Dreamer offers a promising framework that is computationally efficient and achieves high performance, particularly for tasks with long horizons. The choice of representation learning objective for the world model is crucial and remains an area for future improvement. The method can be implemented using standard deep learning frameworks and requires careful tuning of the world model and behavior learning components.