This paper introduces DreamerV2, a reinforcement learning agent based on a world model that achieves human-level performance on the Atari benchmark (55 games, 200M environment steps). Notably, DreamerV2 learns its behavior (an actor-critic policy) entirely from imagined rollouts in the compact latent space of its learned world model; the collected experience is used only to train the world model, never to update the policy directly. This demonstrates the high fidelity of the learned world model.
Core Components:
- World Model Learning:
- Architecture: Builds upon the Recurrent State-Space Model (RSSM) used in PlaNet and DreamerV1. It consists of an image encoder (CNN), a recurrent model (GRU) computing deterministic states $h_t$, a representation model inferring posterior stochastic states $z_t \sim q_\phi(z_t \mid h_t, x_t)$ from images, and a transition predictor estimating prior stochastic states $\hat{z}_t \sim p_\phi(\hat{z}_t \mid h_t)$ without seeing the current image. Predictors for image reconstruction, reward, and discount factor are attached. A minimal sketch of one RSSM step is given after the loss below.
- Key Innovation 1: Discrete Latent States: Unlike its predecessors, which used Gaussian latents, DreamerV2 uses a vector of categorical variables for the stochastic state $z_t$, optimized with straight-through gradients. This change is hypothesized to better capture the multi-modal or non-smooth dynamics common in Atari and proved crucial for performance. The discrete state is represented as 32 categorical variables, each with 32 classes.
- Key Innovation 2: KL Balancing: The KL divergence term in the world model's loss, $\mathrm{KL}\big[q_\phi(z_t \mid h_t, x_t) \,\|\, p_\phi(z_t \mid h_t)\big]$, serves both to train the prior $p_\phi$ towards the posterior $q_\phi$ and to regularize the posterior towards the prior. To prioritize learning an accurate prior (essential for imagination), KL balancing applies a higher weight ($\alpha = 0.8$) to the prior component and a lower weight ($1 - \alpha = 0.2$) to the posterior component within the KL term.
- Loss Function: The world model is trained end-to-end by maximizing the evidence lower bound (ELBO), i.e., minimizing the reconstruction losses (image, reward, discount) together with the KL divergence between the posterior and prior latent distributions, as written below.
World model loss:

$$
\mathcal{L}(\phi) = \mathbb{E}_{q_\phi(z_{1:T} \mid a_{1:T}, x_{1:T})} \Big[ \sum_{t=1}^{T} \Big(
\underbrace{-\ln p_\phi(x_t \mid h_t, z_t)}_{\text{image loss}}
\underbrace{-\ln p_\phi(r_t \mid h_t, z_t)}_{\text{reward loss}}
\underbrace{-\ln p_\phi(\gamma_t \mid h_t, z_t)}_{\text{discount loss}}
+ \underbrace{\beta \, \mathrm{KL}\big[ q_\phi(z_t \mid h_t, x_t) \,\|\, p_\phi(z_t \mid h_t) \big]}_{\text{KL loss}}
\Big) \Big]
$$

KL balancing implementation:

```python
# q = posterior distribution, p = prior distribution
kl_loss = alpha * compute_kl(stop_grad(q), p) \
        + (1 - alpha) * compute_kl(q, stop_grad(p))
```
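For concreteness, here is a minimal PyTorch sketch of a single RSSM step with the 32 × 32 categorical latents and straight-through sampling described above. The layer sizes, module names, and this particular straight-through formulation are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RSSMStep(nn.Module):
    """One step of a simplified RSSM with 32 categorical latents of 32 classes each.
    Layer sizes and names are illustrative, not the paper's exact architecture."""

    def __init__(self, embed_dim=1024, action_dim=18, deter_dim=600,
                 num_vars=32, num_classes=32):
        super().__init__()
        self.num_vars, self.num_classes = num_vars, num_classes
        stoch_dim = num_vars * num_classes                               # 32 * 32 = 1024
        self.gru = nn.GRUCell(stoch_dim + action_dim, deter_dim)         # recurrent model
        self.prior_logits = nn.Linear(deter_dim, stoch_dim)              # transition predictor
        self.post_logits = nn.Linear(deter_dim + embed_dim, stoch_dim)   # representation model

    def _sample_straight_through(self, logits):
        # Sample one-hot categorical latents and pass gradients straight
        # through the non-differentiable sample to the logits.
        logits = logits.view(-1, self.num_vars, self.num_classes)
        probs = F.softmax(logits, dim=-1)
        index = torch.distributions.Categorical(probs=probs).sample()
        onehot = F.one_hot(index, self.num_classes).float()
        sample = onehot + probs - probs.detach()                         # straight-through
        return sample.flatten(1), logits

    def forward(self, prev_z, prev_action, prev_h, embed):
        # Deterministic recurrent state h_t from the previous latent and action.
        h = self.gru(torch.cat([prev_z, prev_action], dim=-1), prev_h)
        # Prior z_t is predicted from h_t alone; the posterior also sees the
        # CNN embedding of the current image.
        prior_z, prior_logits = self._sample_straight_through(self.prior_logits(h))
        post_z, post_logits = self._sample_straight_through(
            self.post_logits(torch.cat([h, embed], dim=-1)))
        return h, post_z, prior_z, prior_logits, post_logits
```

During imagination only the prior branch is used: the sampled prior latent is fed back as `prev_z` for the next step, so no images are required.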
- Behavior Learning:
- Imagination MDP: An actor and critic are trained entirely on imagined trajectories generated by rolling out the learned world model's transition predictor $p_\phi(\hat{z}_t \mid h_t)$ together with its reward and discount predictors, starting from model states encountered during real experience collection. The imagination horizon is set to 15 steps (see the rollout sketch after this list).
- Actor-Critic: Both are MLPs operating on the compact latent states (the deterministic and stochastic components together). The critic estimates the expected sum of future discounted rewards, and the actor outputs probabilities over the discrete actions.
- Critic Training: Uses λ-returns (λ = 0.95) as targets, computed over the imagined trajectories, minimizing a squared-error loss. A target network is used for stability.
- Actor Training: Maximizes the expected λ-returns. For Atari, it primarily uses Reinforce gradients (policy gradients) with the critic's value estimate as a baseline. For continuous control, it primarily uses dynamics backpropagation (backpropagating value gradients through the world model using straight-through estimators for the discrete latents and actions). An entropy bonus encourages exploration. The mixing between Reinforce (ρ = 1 for Atari) and dynamics backpropagation (ρ = 0 for continuous control) is a key hyperparameter; sketches of the imagined rollout and of the λ-return and actor losses follow this list.
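To make the imagination MDP concrete, here is a hedged sketch of rolling out imagined trajectories with the transition predictor. The `world_model` and `actor` objects and their attribute names (`img_step`, `reward`, `discount`) are hypothetical wrappers around the components above, not the paper's API.

```python
import torch

def imagine_trajectory(world_model, actor, start_h, start_z, horizon=15):
    """Roll out the learned transition predictor for `horizon` steps,
    starting from model states encountered on real data. No images are
    generated or observed; everything happens in latent space."""
    h, z = start_h, start_z
    feats, log_probs, entropies = [], [], []
    for _ in range(horizon):
        feat = torch.cat([h, z], dim=-1)            # compact model state
        dist = actor(feat)                          # categorical over actions
        action = dist.sample()
        feats.append(feat)
        log_probs.append(dist.log_prob(action))
        entropies.append(dist.entropy())
        h, z = world_model.img_step(h, z, action)   # prior-only transition
    feats = torch.stack(feats)                      # [horizon, batch, feat_dim]
    rewards = world_model.reward(feats)             # predicted rewards
    discounts = world_model.discount(feats)         # predicted discount factors
    return feats, torch.stack(log_probs), torch.stack(entropies), rewards, discounts
```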
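The next sketch shows how the λ-return targets and the mixed Reinforce/dynamics-backprop actor objective could then be computed on such an imagined trajectory. The tensor shape conventions, helper names, and the default values for `lam`, `rho`, and `eta` are assumptions for illustration, not the paper's exact code.

```python
import torch

def lambda_returns(rewards, discounts, values, lam=0.95):
    """Compute lambda-return targets over an imagined trajectory.

    rewards, discounts, values: tensors of shape [H, batch], where values[t]
    is the critic estimate v(z_t) and discounts[t] the predicted discount."""
    last = values[-1]                      # bootstrap from the final value
    outputs = []
    for t in reversed(range(len(rewards) - 1)):
        last = rewards[t] + discounts[t] * ((1 - lam) * values[t + 1] + lam * last)
        outputs.append(last)
    return torch.stack(outputs[::-1])      # shape [H-1, batch]

def actor_critic_losses(log_probs, entropies, values, targets, rho=1.0, eta=1e-3):
    """Mixed Reinforce / dynamics-backprop actor objective plus critic loss.

    rho = 1 corresponds to pure Reinforce (Atari), rho = 0 to pure dynamics
    backpropagation (continuous control); eta scales the entropy bonus."""
    advantage = (targets - values[:-1]).detach()       # baseline with stop-gradient
    objective = rho * log_probs[:-1] * advantage + (1 - rho) * targets
    actor_loss = -(objective + eta * entropies[:-1]).mean()
    critic_loss = 0.5 * (values[:-1] - targets.detach()).pow(2).mean()
    return actor_loss, critic_loss
```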
Experiments and Results:
- Atari Benchmark: DreamerV2 was evaluated on 55 Atari games with sticky actions for 200M environment steps, using a single GPU.
- Performance: It surpasses the performance of strong single-GPU model-free agents like Rainbow and IQN, achieving a median human-normalized score above 100%.
- Evaluation Metrics: The paper discusses limitations of standard metrics (Gamer Median, Gamer Mean) and proposes the "Clipped Record Mean" (normalizing scores by human world records and clipping at 100% before averaging) as a more robust measure; a small example of this metric follows the list. DreamerV2 excels across all of these metrics.
- Ablations: Experiments confirm the significant benefits of using discrete latent variables and KL balancing compared to Gaussian latents and standard KL regularization. They also show that image reconstruction gradients are vital for learning useful representations, while reward prediction gradients are less critical and sometimes detrimental. Reinforce policy gradients were found superior for Atari compared to dynamics backpropagation.
- Continuous Control: DreamerV2 was also shown to solve the challenging Humanoid walking task from pixel inputs by adapting the actor output to a continuous distribution and using dynamics backpropagation for policy learning.
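As a small illustration of the Clipped Record Mean metric referenced above, the sketch below normalizes per-game scores by human world records (relative to a random policy), clips at 100%, and averages. The agent scores, random baselines, and record values are made-up numbers, purely for the example.

```python
import numpy as np

def clipped_record_mean(agent_scores, random_scores, record_scores):
    """Normalize each game by the human world record (relative to random),
    clip at 1.0 so no single game dominates, then average across games."""
    fractions = []
    for game, score in agent_scores.items():
        rand, record = random_scores[game], record_scores[game]
        fractions.append(min((score - rand) / (record - rand), 1.0))
    return float(np.mean(fractions))

# Made-up numbers, purely for illustration.
agent = {"pong": 20.0, "breakout": 350.0}
rand_base = {"pong": -20.7, "breakout": 1.7}
records = {"pong": 21.0, "breakout": 864.0}
print(clipped_record_mean(agent, rand_base, records))  # ~0.69
```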
Significance:
DreamerV2 demonstrates that learning accurate world models from high-dimensional inputs like images is feasible and that behaviors learned entirely within these models can achieve state-of-the-art performance on complex benchmarks like Atari, rivaling highly optimized model-free methods. It highlights the effectiveness of discrete latent representations and KL balancing for improving world model accuracy and provides a computationally efficient (single-GPU) framework for model-based RL.