This paper introduces DreamerV2, a reinforcement learning agent based on a world model that achieves human-level performance on the Atari benchmark (55 games, 200M environment steps). Notably, DreamerV2 learns its behavior (an actor-critic policy) entirely from imagined rollouts in the compact latent space of its learned world model; the collected experience is used only to train the world model, never to update the policy directly. This demonstrates the high fidelity of the learned world model.
Core Components:
- World Model Learning:
- Architecture: Builds upon the Recurrent State-Space Model (RSSM) used in PlaNet and DreamerV1. It consists of an image encoder (CNN), a recurrent model (GRU) computing deterministic states $h_t$, a representation model inferring posterior stochastic states $z_t \sim q_\phi(z_t \mid h_t, x_t)$ from images, and a transition predictor estimating prior stochastic states $\hat{z}_t \sim p_\phi(\hat{z}_t \mid h_t)$ without seeing the current image. Predictors for image reconstruction, reward, and discount factor are attached. A minimal sketch of one RSSM step is given after the loss below.
- Key Innovation 1: Discrete Latent States: Unlike its predecessors, which used Gaussian latents, DreamerV2 uses a vector of categorical variables for the stochastic state $z_t$, optimized with straight-through gradients. This change is hypothesized to better capture the multi-modal or non-smooth dynamics common in Atari and proved crucial for performance. The discrete state is represented as 32 categorical variables, each with 32 classes.
- Key Innovation 2: KL Balancing: The KL divergence term in the world model's loss, $\mathrm{KL}\big[q_\phi(z_t \mid h_t, x_t) \,\|\, p_\phi(z_t \mid h_t)\big]$, serves both to train the prior $p_\phi$ towards the posterior $q_\phi$ and to regularize the posterior towards the prior. To prioritize learning an accurate prior (essential for imagination), KL balancing applies a higher weight ($\alpha = 0.8$) to the prior component and a lower weight ($1 - \alpha = 0.2$) to the posterior component within the KL term.
- Loss Function: The world model is trained end-to-end by maximizing the evidence lower bound (ELBO), i.e., minimizing the reconstruction losses (image, reward, discount) together with the KL divergence between the posterior and prior latent distributions, as written below.
World model loss:

$$
\mathcal{L}(\phi) = \mathbb{E}_{q_\phi(z_{1:T} \mid a_{1:T}, x_{1:T})} \Big[ \sum_{t=1}^{T} \Big(
\underbrace{-\ln p_\phi(x_t \mid h_t, z_t)}_{\text{image loss}}
\underbrace{-\ln p_\phi(r_t \mid h_t, z_t)}_{\text{reward loss}}
\underbrace{-\ln p_\phi(\gamma_t \mid h_t, z_t)}_{\text{discount loss}}
+ \underbrace{\beta \, \mathrm{KL}\big[ q_\phi(z_t \mid h_t, x_t) \,\|\, p_\phi(z_t \mid h_t) \big]}_{\text{KL loss}}
\Big) \Big]
$$

KL balancing implementation:

```python
# q = posterior distribution, p = prior distribution
kl_loss = alpha * compute_kl(stop_grad(q), p) \
        + (1 - alpha) * compute_kl(q, stop_grad(p))
```
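For concreteness, here is a minimal PyTorch sketch of a single RSSM step with the 32 × 32 categorical latents and straight-through sampling described above. The layer sizes, module names, and this particular straight-through formulation are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RSSMStep(nn.Module):
    """One step of a simplified RSSM with 32 categorical latents of 32 classes each.
    Layer sizes and names are illustrative, not the paper's exact architecture."""

    def __init__(self, embed_dim=1024, action_dim=18, deter_dim=600,
                 num_vars=32, num_classes=32):
        super().__init__()
        self.num_vars, self.num_classes = num_vars, num_classes
        stoch_dim = num_vars * num_classes                               # 32 * 32 = 1024
        self.gru = nn.GRUCell(stoch_dim + action_dim, deter_dim)         # recurrent model
        self.prior_logits = nn.Linear(deter_dim, stoch_dim)              # transition predictor
        self.post_logits = nn.Linear(deter_dim + embed_dim, stoch_dim)   # representation model

    def _sample_straight_through(self, logits):
        # Sample one-hot categorical latents and pass gradients straight
        # through the non-differentiable sample to the logits.
        logits = logits.view(-1, self.num_vars, self.num_classes)
        probs = F.softmax(logits, dim=-1)
        index = torch.distributions.Categorical(probs=probs).sample()
        onehot = F.one_hot(index, self.num_classes).float()
        sample = onehot + probs - probs.detach()                         # straight-through
        return sample.flatten(1), logits

    def forward(self, prev_z, prev_action, prev_h, embed):
        # Deterministic recurrent state h_t from the previous latent and action.
        h = self.gru(torch.cat([prev_z, prev_action], dim=-1), prev_h)
        # Prior z_t is predicted from h_t alone; the posterior also sees the
        # CNN embedding of the current image.
        prior_z, prior_logits = self._sample_straight_through(self.prior_logits(h))
        post_z, post_logits = self._sample_straight_through(
            self.post_logits(torch.cat([h, embed], dim=-1)))
        return h, post_z, prior_z, prior_logits, post_logits
```

During imagination only the prior branch is used: the sampled prior latent is fed back as `prev_z` for the next step, so no images are required.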
- Behavior Learning:
- Imagination MDP: An actor and critic are trained entirely on imagined trajectories generated by rolling out the learned world model's transition predictor $p_\phi(\hat{z}_t \mid h_t)$ together with its reward and discount predictors, starting from model states encountered during real experience collection. The imagination horizon is set to 15 steps (see the rollout sketch after this list).
- Actor-Critic: Both are MLPs operating on the compact latent states (the deterministic and stochastic components together). The critic estimates the expected sum of future discounted rewards, and the actor outputs probabilities over the discrete actions.
- Critic Training: Uses λ-returns (λ = 0.95) as targets, computed over the imagined trajectories, minimizing a squared-error loss. A target network is used for stability.
- Actor Training: Maximizes the expected λ-returns. For Atari, it primarily uses Reinforce gradients (policy gradients) with the critic's value estimate as a baseline. For continuous control, it primarily uses dynamics backpropagation (backpropagating value gradients through the world model using straight-through estimators for the discrete latents and actions). An entropy bonus encourages exploration. The mixing between Reinforce (ρ = 1 for Atari) and dynamics backpropagation (ρ = 0 for continuous control) is a key hyperparameter; sketches of the imagined rollout and of the λ-return and actor losses follow this list.
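To make the imagination MDP concrete, here is a hedged sketch of rolling out imagined trajectories with the transition predictor. The `world_model` and `actor` objects and their attribute names (`img_step`, `reward`, `discount`) are hypothetical wrappers around the components above, not the paper's API.

```python
import torch

def imagine_trajectory(world_model, actor, start_h, start_z, horizon=15):
    """Roll out the learned transition predictor for `horizon` steps,
    starting from model states encountered on real data. No images are
    generated or observed; everything happens in latent space."""
    h, z = start_h, start_z
    feats, log_probs, entropies = [], [], []
    for _ in range(horizon):
        feat = torch.cat([h, z], dim=-1)            # compact model state
        dist = actor(feat)                          # categorical over actions
        action = dist.sample()
        feats.append(feat)
        log_probs.append(dist.log_prob(action))
        entropies.append(dist.entropy())
        h, z = world_model.img_step(h, z, action)   # prior-only transition
    feats = torch.stack(feats)                      # [horizon, batch, feat_dim]
    rewards = world_model.reward(feats)             # predicted rewards
    discounts = world_model.discount(feats)         # predicted discount factors
    return feats, torch.stack(log_probs), torch.stack(entropies), rewards, discounts
```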
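The next sketch shows how the λ-return targets and the mixed Reinforce/dynamics-backprop actor objective could then be computed on such an imagined trajectory. The tensor shape conventions, helper names, and the default values for `lam`, `rho`, and `eta` are assumptions for illustration, not the paper's exact code.

```python
import torch

def lambda_returns(rewards, discounts, values, lam=0.95):
    """Compute lambda-return targets over an imagined trajectory.

    rewards, discounts, values: tensors of shape [H, batch], where values[t]
    is the critic estimate v(z_t) and discounts[t] the predicted discount."""
    last = values[-1]                      # bootstrap from the final value
    outputs = []
    for t in reversed(range(len(rewards) - 1)):
        last = rewards[t] + discounts[t] * ((1 - lam) * values[t + 1] + lam * last)
        outputs.append(last)
    return torch.stack(outputs[::-1])      # shape [H-1, batch]

def actor_critic_losses(log_probs, entropies, values, targets, rho=1.0, eta=1e-3):
    """Mixed Reinforce / dynamics-backprop actor objective plus critic loss.

    rho = 1 corresponds to pure Reinforce (Atari), rho = 0 to pure dynamics
    backpropagation (continuous control); eta scales the entropy bonus."""
    advantage = (targets - values[:-1]).detach()       # baseline with stop-gradient
    objective = rho * log_probs[:-1] * advantage + (1 - rho) * targets
    actor_loss = -(objective + eta * entropies[:-1]).mean()
    critic_loss = 0.5 * (values[:-1] - targets.detach()).pow(2).mean()
    return actor_loss, critic_loss
```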
Experiments and Results:
- Atari Benchmark: DreamerV2 was evaluated on 55 Atari games with sticky actions for 200M environment steps, using a single GPU.
- Performance: It surpasses the performance of strong single-GPU model-free agents like Rainbow and IQN, achieving a median human-normalized score above 100%.
- Evaluation Metrics: The paper discusses limitations of standard metrics (Gamer Median, Gamer Mean) and proposes the "Clipped Record Mean" (normalizing scores by human world records and clipping at 100% before averaging) as a more robust measure; a small example of this metric follows the list. DreamerV2 excels across all of these metrics.
- Ablations: Experiments confirm the significant benefits of using discrete latent variables and KL balancing compared to Gaussian latents and standard KL regularization. They also show that image reconstruction gradients are vital for learning useful representations, while reward prediction gradients are less critical and sometimes detrimental. Reinforce policy gradients were found superior for Atari compared to dynamics backpropagation.
- Continuous Control: DreamerV2 was also shown to solve the challenging Humanoid walking task from pixel inputs by adapting the actor output to a continuous distribution and using dynamics backpropagation for policy learning.
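As a small illustration of the Clipped Record Mean metric referenced above, the sketch below normalizes per-game scores by human world records (relative to a random policy), clips at 100%, and averages. The agent scores, random baselines, and record values are made-up numbers, purely for the example.

```python
import numpy as np

def clipped_record_mean(agent_scores, random_scores, record_scores):
    """Normalize each game by the human world record (relative to random),
    clip at 1.0 so no single game dominates, then average across games."""
    fractions = []
    for game, score in agent_scores.items():
        rand, record = random_scores[game], record_scores[game]
        fractions.append(min((score - rand) / (record - rand), 1.0))
    return float(np.mean(fractions))

# Made-up numbers, purely for illustration.
agent = {"pong": 20.0, "breakout": 350.0}
rand_base = {"pong": -20.7, "breakout": 1.7}
records = {"pong": 21.0, "breakout": 864.0}
print(clipped_record_mean(agent, rand_base, records))  # ~0.69
```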
Significance:
DreamerV2 demonstrates that learning accurate world models from high-dimensional inputs like images is feasible and that behaviors learned entirely within these models can achieve state-of-the-art performance on complex benchmarks like Atari, rivaling highly optimized model-free methods. It highlights the effectiveness of discrete latent representations and KL balancing for improving world model accuracy and provides a computationally efficient (single-GPU) framework for model-based RL.