Dream to Control: Learning Behaviors by Latent Imagination (1912.01603v3)

Published 3 Dec 2019 in cs.LG, cs.AI, and cs.RO

Abstract: Learned world models summarize an agent's experience to facilitate learning complex behaviors. While learning world models from high-dimensional sensory inputs is becoming feasible through deep learning, there are many potential ways for deriving behaviors from them. We present Dreamer, a reinforcement learning agent that solves long-horizon tasks from images purely by latent imagination. We efficiently learn behaviors by propagating analytic gradients of learned state values back through trajectories imagined in the compact state space of a learned world model. On 20 challenging visual control tasks, Dreamer exceeds existing approaches in data-efficiency, computation time, and final performance.

This paper introduces Dreamer, a model-based reinforcement learning agent designed to learn complex behaviors from high-dimensional image inputs by leveraging "latent imagination." The core idea is to learn a world model from past experience and then train an actor-critic agent entirely within the compact latent space of this model, allowing for efficient learning of long-horizon tasks.

1. Agent Architecture and Workflow

Dreamer consists of three main components that operate concurrently:

  • Dynamics Learning: A world model is learned from a dataset $\mathcal{D}$ of past experiences $(o_t, a_t, r_t)$. This model learns to:
    • Encode observations $o_t$ and previous states/actions into a compact latent state $s_t$ (Representation Model: $p_\theta(s_t \mid s_{t-1}, a_{t-1}, o_t)$).
    • Predict future latent states without seeing observations (Transition Model: $q_\theta(s_t \mid s_{t-1}, a_{t-1})$).
    • Predict rewards from latent states (Reward Model: $q_\theta(r_t \mid s_t)$).
    • The paper explores different ways to train this world model, primarily using image reconstruction (as in PlaNet) or contrastive objectives. The Recurrent State Space Model (RSSM) architecture is used for the transition model.
  • Behavior Learning: An actor-critic algorithm operates purely on imagined trajectories generated by the learned world model (a minimal rollout sketch follows this list).
    • Starting from latent states $s_t$ sampled from real experience sequences, the agent "imagines" trajectories of length $H$ using the transition model $q_\theta(s_\tau \mid s_{\tau-1}, a_{\tau-1})$, the reward model $q_\theta(r_\tau \mid s_\tau)$, and an action model $q_\phi(a_\tau \mid s_\tau)$.
    • An Action Model (actor, $q_\phi(a_\tau \mid s_\tau)$) learns a policy within the latent space. It typically outputs the parameters of a distribution (e.g., a Tanh-transformed Gaussian for continuous actions).
    • A Value Model (critic, $v_\psi(s_\tau)$) learns to predict the expected future rewards (value) obtainable from a given latent state $s_\tau$ under the current action model within the imagination.
    • The key innovation is training the action and value models using analytic gradients propagated back through the learned dynamics model over the imagination horizon $H$.
  • Environment Interaction: The learned action model $q_\phi(a_t \mid s_t)$ is used to select actions in the real environment. The agent first computes the current latent state $s_t$ based on the history of observations and actions, then samples an action from the action model (adding exploration noise), executes it, and adds the resulting experience $(o_{t+1}, a_t, r_t)$ to the dataset $\mathcal{D}$.
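
A minimal sketch of the latent imagination rollout is shown below, written in PyTorch rather than the paper's TensorFlow. The modules are simplified stand-ins for the components above (a single stochastic latent instead of the full RSSM with its deterministic GRU path), and the action/hidden sizes are illustrative rather than taken from the paper.

```python
# Simplified latent imagination rollout (PyTorch sketch; the original Dreamer
# implementation uses TensorFlow and a full RSSM with a deterministic GRU state).
import torch
import torch.nn as nn
import torch.distributions as td

# 30-dim latent and H=15 follow the paper; ACTION and HIDDEN sizes are illustrative.
LATENT, ACTION, HIDDEN, H = 30, 6, 200, 15

class TransitionModel(nn.Module):
    """q_theta(s_t | s_{t-1}, a_{t-1}): Gaussian over the next latent state."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(LATENT + ACTION, HIDDEN), nn.ELU(),
                                 nn.Linear(HIDDEN, 2 * LATENT))
    def forward(self, s, a):
        mean, std = self.net(torch.cat([s, a], -1)).chunk(2, -1)
        return td.Normal(mean, nn.functional.softplus(std) + 0.1)

class RewardModel(nn.Module):
    """q_theta(r_t | s_t): predicts the reward from the latent state."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(LATENT, HIDDEN), nn.ELU(), nn.Linear(HIDDEN, 1))
    def forward(self, s):
        return self.net(s).squeeze(-1)

class ActionModel(nn.Module):
    """q_phi(a_t | s_t): tanh-transformed Gaussian policy over latent states."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(LATENT, HIDDEN), nn.ELU(),
                                 nn.Linear(HIDDEN, 2 * ACTION))
    def forward(self, s):
        mean, std = self.net(s).chunk(2, -1)
        base = td.Normal(mean, nn.functional.softplus(std) + 1e-4)
        return td.TransformedDistribution(base, [td.TanhTransform(cache_size=1)])

def imagine(start_states, transition, reward, actor, horizon=H):
    """Roll out imagined trajectories purely in latent space, starting from states
    inferred from real experience. Reparameterized samples keep the rollout
    differentiable, so value gradients can later flow back through the dynamics
    into the action model."""
    s, states, rewards = start_states, [], []
    for _ in range(horizon):
        a = actor(s).rsample()          # reparameterized action sample
        s = transition(s, a).rsample()  # reparameterized next latent state
        states.append(s)
        rewards.append(reward(s))
    return torch.stack(states), torch.stack(rewards)

# Usage with random stand-ins for posterior states from replayed sequences.
transition, reward, actor = TransitionModel(), RewardModel(), ActionModel()
start = torch.randn(50, LATENT)
imag_states, imag_rewards = imagine(start, transition, reward, actor)
print(imag_states.shape, imag_rewards.shape)  # (15, 50, 30), (15, 50)
```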

2. Learning Behaviors by Latent Imagination

  • Addressing Finite Horizon: Model-based RL often suffers from shortsightedness due to finite imagination horizons ($H$). Dreamer addresses this by learning the value function $v_\psi(s_\tau)$, which estimates the sum of future rewards beyond the imagination horizon.
  • Value Estimation: To train the actor and critic, the paper uses $\lambda$-returns ($V_\lambda$) calculated over the imagined trajectories (see the sketch after this list). This combines multi-step imagined reward sums with the value function bootstrap estimates ($v_\psi(s_{t+H})$) to balance bias and variance:

    $$V_\lambda(s_\tau) = (1-\lambda) \sum_{n=1}^{H-1} \lambda^{n-1} V_{N^n}(s_\tau) + \lambda^{H-1} V_{N^H}(s_\tau)$$

    where $V_{N^k}(s_\tau) \approx \sum_{n=\tau}^{h-1} \gamma^{n-\tau} r_n + \gamma^{h-\tau} v_\psi(s_h)$ with $h = \min(\tau+k,\, t+H)$.

  • Learning Objectives:

    • Value Model: Updated via a mean squared error loss to match the computed $V_\lambda$ targets (with stopped gradients on the targets):

      $$\min_\psi \, \mathbb{E}_{q_\theta, q_\phi} \left[ \sum_{\tau=t}^{t+H} \tfrac{1}{2} \big\| v_\psi(s_\tau) - \text{stop\_grad}\big(V_\lambda(s_\tau)\big) \big\|^2 \right]$$

    • Action Model: Updated to maximize the expected value estimates by backpropagating gradients through the value estimates and the learned dynamics:

      $$\max_\phi \, \mathbb{E}_{q_\theta, q_\phi} \left[ \sum_{\tau=t}^{t+H} V_\lambda(s_\tau) \right]$$

    This backpropagation through the dynamics model ($q_\theta$) is efficient because it operates entirely in the low-dimensional latent space.
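
Below is a minimal sketch of the $\lambda$-return targets and the resulting actor/critic losses, again in PyTorch and simplified relative to the paper's implementation; the tensors standing in for imagined rewards and value predictions are assumed to come from a rollout like the one sketched in Section 1.

```python
# Lambda-return targets and behavior-learning losses (PyTorch sketch).
import torch

def lambda_returns(rewards, values, gamma=0.99, lam=0.95):
    """V_lambda targets for imagined steps 0..H-2, via the backward recursion
    V[t] = r[t] + gamma * ((1 - lam) * v[t+1] + lam * V[t+1]),
    bootstrapping beyond the horizon with the last value prediction.
    rewards, values: tensors of shape (H, B)."""
    inputs = rewards[:-1] + gamma * (1 - lam) * values[1:]
    last, outputs = values[-1], []
    for t in reversed(range(rewards.shape[0] - 1)):
        last = inputs[t] + gamma * lam * last
        outputs.append(last)
    return torch.stack(outputs[::-1])                  # shape (H-1, B)

def behavior_losses(imag_rewards, values):
    """Actor loss maximizes the lambda-return targets (in Dreamer the gradient
    flows back through the differentiable imagined rollout into q_phi); the
    value loss regresses v_psi onto the same targets with stopped gradients."""
    targets = lambda_returns(imag_rewards, values)
    actor_loss = -targets.mean()                       # maximize E[sum_tau V_lambda(s_tau)]
    value_loss = 0.5 * (values[:-1] - targets.detach()).pow(2).mean()
    return actor_loss, value_loss

# Toy usage with random tensors standing in for rollout outputs.
H, B = 15, 50
imag_rewards = torch.randn(H, B)
values = torch.randn(H, B, requires_grad=True)         # stand-in for v_psi(s_tau)
actor_loss, value_loss = behavior_losses(imag_rewards, values)
print(actor_loss.item(), value_loss.item())
```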

3. Implementation Details and Considerations

  • World Model Training: The paper evaluates three objectives for training the world model:

    1. Reconstruction: Maximize ELBO with image reconstruction loss (like PlaNet). Works well empirically.
    2. Contrastive: Use Noise Contrastive Estimation (NCE) to maximize mutual information between states and observations, avoiding pixel generation. Performs decently but less consistently than reconstruction.
    3. Reward Prediction Only: Train only on reward prediction. Insufficient on its own in these experiments.
  • Architecture: Uses CNNs for image encoding/decoding, RSSM for latent dynamics, and MLPs for reward, value, and action models. Latent states are typically 30-dimensional Gaussians.

  • Optimization: Adam optimizer is used. Gradient clipping is applied.
  • Computational Efficiency: Learning in the latent space is much faster than planning or learning directly in image space. Dreamer trains significantly faster than PlaNet (online planning) and model-free methods like D4PG. Training takes ~3 hours per million steps on a V100 GPU.
  • Hyperparameters: A single set of hyperparameters is used across all continuous control tasks (e.g., batch size = 50, sequence length = 50, imagination horizon $H=15$, $\lambda=0.95$, $\gamma=0.99$).
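
For reference, the shared settings mentioned above can be collected into a single configuration dictionary; this is a hypothetical grouping for illustration, and the key names are not taken from the paper's code.

```python
# Hypothetical config dict collecting the hyperparameters listed above.
DREAMER_CONFIG = {
    "batch_size": 50,           # training sequences per batch
    "sequence_length": 50,      # time steps per training sequence
    "imagination_horizon": 15,  # H, imagined steps for behavior learning
    "lambda": 0.95,             # lambda-return mixing coefficient
    "gamma": 0.99,              # discount factor
    "latent_size": 30,          # dimensionality of the stochastic latent state
    "optimizer": "adam",        # Adam, with gradient clipping
}
```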

4. Experiments and Results

  • Tasks: Evaluated on 20 challenging continuous control tasks from the DeepMind Control Suite using pixel inputs, plus some discrete Atari and DeepMind Lab tasks.
  • Performance: Dreamer achieves state-of-the-art results on the continuous control benchmark, surpassing the final performance of strong model-free agents like D4PG ($823$ vs $786$ average score) while using far less data ($5 \times 10^6$ vs $10^8$ environment steps) and computation time. It maintains the data efficiency of PlaNet while significantly improving asymptotic performance.
  • Long Horizon: Experiments show that learning the value function ($v_\psi$) makes Dreamer robust to the choice of imagination horizon $H$, outperforming alternatives like online planning (PlaNet) or learning only an action model without value estimation, especially on tasks requiring long-term credit assignment.
  • Representation Learning: Results confirm that the quality of the learned world model significantly impacts performance, with reconstruction yielding the best results among the tested methods.

5. Conclusion and Practical Implications

Dreamer demonstrates that learning behaviors entirely within the latent space of a learned world model, using analytic gradients backpropagated through the model dynamics, is a highly effective and efficient approach for solving complex visual control tasks. It combines the data efficiency of model-based methods with the strong asymptotic performance often associated with model-free methods. For practitioners, Dreamer offers a promising framework that is computationally efficient and achieves high performance, particularly for tasks with long horizons. The choice of representation learning objective for the world model is crucial and remains an area for future improvement. The method can be implemented using standard deep learning frameworks and requires careful tuning of the world model and behavior learning components.

Authors (4)
  1. Danijar Hafner
  2. Timothy Lillicrap
  3. Jimmy Ba
  4. Mohammad Norouzi
Citations (1,168)