
DreamerV3: Model-Based RL Algorithm

Updated 10 December 2025
  • DreamerV3 is a model-based reinforcement learning algorithm that utilizes compact latent world models to enable efficient policy optimization through imagined rollouts.
  • It employs an RSSM-based world model together with actor and critic networks trained in parallel, ensuring robustness across continuous control, visual tasks, and sparse-reward environments.
  • Empirical results in applications like traffic signal control and pixel-based tasks highlight its sample efficiency and generality using minimal, domain-invariant hyperparameter tuning.

DreamerV3 is a general-purpose model-based reinforcement learning (RL) algorithm that learns compact latent world models to enable efficient policy optimization via imagined rollouts in latent space. The architecture is designed to be robust across a wide range of high-dimensional RL tasks, including continuous control, visual domains, and sparse-reward challenges such as Minecraft diamond collection, all without domain-specific hyperparameter tuning (Hafner et al., 2023).

1. Architectural Overview: World Model and Latent Imagination

DreamerV3 comprises three primary neural modules trained in parallel: a world model (typically a Recurrent State-Space Model, RSSM), an actor for policy learning, and a critic for value estimation (Hafner et al., 2023). At each time step $t$, an encoder maps observation $x_t$ to a stochastic latent state $z_t$, while a deterministic recurrent hidden state $h_t$ is updated from the previous hidden state, the previous action $a_{t-1}$, and the latent state. The model state is denoted $s_t = (h_t, z_t)$.

  • World model:
    • Encoder: $q_{\phi}(z_t \mid h_{t-1}, x_t)$
    • Transition/prior: $p_{\theta}(z_t \mid h_{t-1}, a_{t-1})$
    • RNN update: $h_t = f_{\mathrm{RNN}}(h_{t-1}, z_t, a_{t-1})$
  • Observation, reward, and continuation heads:
    • $\hat{x}_t = \mathrm{dec}_x(h_t, z_t)$
    • $\hat{r}_t = \mathrm{dec}_r(h_t, z_t)$
    • $\hat{c}_t = \mathrm{dec}_c(h_t, z_t)$
  • Actor–critic:
    • Actor $\pi_\theta(a \mid s)$ and critic $V_\psi(s)$ operate in latent space and are trained on trajectories generated by rollouts (imagination) under the learned world model.

Imagination-based training enables actor–critic learning to proceed entirely on simulated experience, reducing the sample complexity compared to model-free RL.
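The following is a minimal PyTorch sketch of the components listed above. The module names, Gaussian latent parameterization, and layer sizes are illustrative assumptions for this sketch; the reference implementation uses $32\times32$ discrete categorical latents (see Section 4) and is written in JAX.

```python
# Minimal PyTorch sketch of the RSSM-style world model described above.
# Sizes, Gaussian latents, and module names are illustrative assumptions.
import torch
import torch.nn as nn

class TinyRSSM(nn.Module):
    def __init__(self, obs_dim=64, act_dim=4, latent_dim=32, hidden_dim=256):
        super().__init__()
        # Posterior q_phi(z_t | h_{t-1}, x_t)
        self.encoder = nn.Linear(hidden_dim + obs_dim, 2 * latent_dim)
        # Prior p_theta(z_t | h_{t-1}, a_{t-1})
        self.prior = nn.Linear(hidden_dim + act_dim, 2 * latent_dim)
        # Deterministic update h_t = f_RNN(h_{t-1}, z_t, a_{t-1})
        self.rnn = nn.GRUCell(latent_dim + act_dim, hidden_dim)
        # Heads: observation, reward, and continuation predictors
        self.dec_x = nn.Linear(hidden_dim + latent_dim, obs_dim)
        self.dec_r = nn.Linear(hidden_dim + latent_dim, 1)
        self.dec_c = nn.Linear(hidden_dim + latent_dim, 1)

    @staticmethod
    def _sample(stats):
        mean, log_std = stats.chunk(2, dim=-1)
        return mean + log_std.exp() * torch.randn_like(mean)

    def observe(self, h, x, a_prev):
        """One posterior step on real data: returns (h_t, z_t) and head outputs."""
        z = self._sample(self.encoder(torch.cat([h, x], -1)))
        h = self.rnn(torch.cat([z, a_prev], -1), h)
        feat = torch.cat([h, z], -1)
        return h, z, self.dec_x(feat), self.dec_r(feat), self.dec_c(feat)

    def imagine(self, h, a_prev):
        """One prior step without observations, used for latent imagination."""
        z = self._sample(self.prior(torch.cat([h, a_prev], -1)))
        h = self.rnn(torch.cat([z, a_prev], -1), h)
        return h, z
```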

2. Mathematical Formulation and Optimization Objectives

The DreamerV3 world model is trained via variational inference, minimizing a weighted sum of losses over batches of experience:

  • Observation reconstruction loss:

$$\mathcal{L}_{\rm recon} = \sum_{t=1}^T \mathbb{E}_{q_\phi}\!\left[ -\log p_{\mathrm{dec}}(x_t \mid h_t, z_t) \right]$$

  • KL divergence (latent regularization):

$$\mathcal{L}_{\rm KL} = \sum_{t=1}^T D_{\mathrm{KL}}\!\left[\, q_{\phi}(z_t \mid h_{t-1}, x_t) \;\big\|\; p_{\theta}(z_t \mid h_{t-1}, a_{t-1}) \,\right]$$

  • Reward and continuation prediction loss:

$$\mathcal{L}_{\rm reward} = \sum_{t=1}^T \|\hat{r}_t - r_t\|^2, \qquad \mathcal{L}_{\rm cont} = \sum_{t=1}^T \mathrm{CE}(\hat{c}_t, c_t)$$

  • Total world-model loss:

$$\mathcal{L}_{\rm world} = \mathcal{L}_{\rm recon} + \beta_{\rm KL}\,\mathcal{L}_{\rm KL} + \beta_{r}\,\mathcal{L}_{\rm reward} + \beta_{c}\,\mathcal{L}_{\rm cont}$$
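A hedged sketch of how this weighted objective could be assembled from the outputs of the TinyRSSM sketch above; the unit-variance Gaussian reconstruction likelihood, Bernoulli continuation loss, and default weights of 1.0 are assumptions, not the reference settings.

```python
# Illustrative assembly of the world-model loss above; loss weights and
# Gaussian/Bernoulli likelihood choices are assumptions for this sketch.
import torch
import torch.nn.functional as F

def world_model_loss(x, r, c, x_hat, r_hat, c_hat, post_stats, prior_stats,
                     beta_kl=1.0, beta_r=1.0, beta_c=1.0):
    # Reconstruction: negative log-likelihood, here a unit-variance Gaussian (MSE).
    recon = F.mse_loss(x_hat, x, reduction="none").sum(-1)
    # KL between posterior q_phi(z_t | h_{t-1}, x_t) and prior p_theta(z_t | h_{t-1}, a_{t-1}).
    kl = torch.distributions.kl_divergence(
        torch.distributions.Normal(*post_stats),
        torch.distributions.Normal(*prior_stats),
    ).sum(-1)
    # Reward regression and continuation (episode-not-done) classification.
    reward = (r_hat.squeeze(-1) - r) ** 2
    cont = F.binary_cross_entropy_with_logits(c_hat.squeeze(-1), c, reduction="none")
    return (recon + beta_kl * kl + beta_r * reward + beta_c * cont).mean()
```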

Imagination rollouts start from model states $s_t$ obtained on real data, with the model iteratively generating future latents and actions: sample $a_t \sim \pi_\theta(a_t \mid s_t)$, $z_{t+1} \sim p_\theta(z_{t+1} \mid h_t, a_t)$, and $h_{t+1} = f_{\mathrm{RNN}}(h_t, z_{t+1}, a_t)$. The actor and critic are optimized against value targets computed as $\lambda$-returns over the imagined trajectories.
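The sketch below illustrates latent imagination and $\lambda$-return targets, assuming the TinyRSSM sketch above together with hypothetical actor and critic modules; the horizon of 15 matches the defaults quoted in Section 4, while $\gamma$ and $\lambda$ are commonly reported values treated here as assumptions.

```python
# Sketch of latent imagination and lambda-return targets; `rssm`, `actor`, and
# `critic` are the hypothetical modules from the sketches above.
import torch

def imagine_and_targets(rssm, actor, critic, h, z, horizon=15, gamma=0.997, lam=0.95):
    feats, rewards, conts = [], [], []
    for _ in range(horizon):
        feat = torch.cat([h, z], -1)
        a = actor(feat).sample()                     # a_t ~ pi_theta(a_t | s_t)
        h, z = rssm.imagine(h, a)                    # z_{t+1} ~ prior, h_{t+1} = f_RNN(...)
        feat = torch.cat([h, z], -1)
        feats.append(feat)
        rewards.append(rssm.dec_r(feat).squeeze(-1))
        conts.append(torch.sigmoid(rssm.dec_c(feat)).squeeze(-1))
    values = [critic(f) for f in feats]
    # Lambda-returns computed backwards over the imagined trajectory.
    targets = [values[-1]]
    for t in reversed(range(horizon - 1)):
        boot = (1 - lam) * values[t + 1] + lam * targets[0]
        targets.insert(0, rewards[t] + gamma * conts[t] * boot)
    return feats, targets
```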

3. Robustness Mechanisms and Empirical Stability

DreamerV3 employs a suite of domain-agnostic architectural and optimization strategies to ensure stability and generality across domains (Hafner et al., 2023):

  • Symlog Transformation:

Scalar prediction targets are mapped as $\operatorname{symlog}(x) = \operatorname{sign}(x)\log(1+|x|)$, reducing large target magnitudes and balancing gradients.
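A direct implementation of the transform and its inverse (a minimal sketch; exactly where it is applied to inputs, targets, and predictions is not spelled out here):

```python
# Symlog squashing and its exact inverse, as described above.
import torch

def symlog(x):
    return torch.sign(x) * torch.log1p(torch.abs(x))

def symexp(x):
    return torch.sign(x) * (torch.exp(torch.abs(x)) - 1.0)
```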

  • Discrete two-hot value regression:

Reward and value predictions are regressed onto a fixed grid of discrete bins via a two-hot encoding of the target, keeping training stable under widely varying value scales.
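A minimal sketch of two-hot encoding onto a fixed bin grid; the bin range and count below are illustrative assumptions, and in practice the scalar target would typically be symlog-transformed first.

```python
# Two-hot encoding of a scalar target onto a fixed grid of bin centers.
import torch

def two_hot(target, bins):
    """Distribute each scalar over its two nearest bins, with weights summing to 1."""
    # bins: 1-D sorted tensor of bin centers; target: arbitrary-shaped tensor.
    idx_hi = torch.searchsorted(bins, target.clamp(bins[0], bins[-1]))
    idx_hi = idx_hi.clamp(1, len(bins) - 1)
    idx_lo = idx_hi - 1
    lo, hi = bins[idx_lo], bins[idx_hi]
    w_hi = ((target - lo) / (hi - lo)).clamp(0, 1)
    encoding = torch.zeros(*target.shape, len(bins))
    encoding.scatter_(-1, idx_lo.unsqueeze(-1), (1 - w_hi).unsqueeze(-1))
    encoding.scatter_(-1, idx_hi.unsqueeze(-1), w_hi.unsqueeze(-1))
    return encoding

# Example: the head predicts a softmax over the bins and is trained with
# cross-entropy against this soft two-hot target.
bins = torch.linspace(-20.0, 20.0, 255)
soft_target = two_hot(torch.tensor([1.31]), bins)
```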

  • KL balancing and free bits:

The KL regularizer is split into dynamics and representation terms with stop-gradients (KL balancing) and clipped below a free-bits threshold, so it adapts without manual scheduling and the model retains informative representations in both simple and complex environments.
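A sketch of KL balancing with free bits under a Gaussian latent assumption; the 0.5/0.1 weights and the 1-nat threshold follow commonly reported DreamerV3 settings and should be treated as assumptions here.

```python
# Sketch of KL balancing with free bits; weights and threshold are assumptions.
import torch
from torch.distributions import Normal, kl_divergence

def balanced_kl(post_mean, post_std, prior_mean, prior_std,
                free_bits=1.0, w_dyn=0.5, w_rep=0.1):
    post = Normal(post_mean, post_std)
    prior = Normal(prior_mean, prior_std)
    # Dynamics loss: train the prior toward the (stop-gradient) posterior.
    dyn = kl_divergence(Normal(post_mean.detach(), post_std.detach()), prior).sum(-1)
    # Representation loss: train the posterior toward the (stop-gradient) prior.
    rep = kl_divergence(post, Normal(prior_mean.detach(), prior_std.detach())).sum(-1)
    # Free bits: no gradient pressure once the KL drops below the threshold.
    dyn = torch.clamp(dyn, min=free_bits)
    rep = torch.clamp(rep, min=free_bits)
    return w_dyn * dyn + w_rep * rep
```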

  • Unimix categorical distributions:

All categorical outputs mix $1\%$ uniform probability with $99\%$ network output, mitigating deterministic collapse in discrete distributions.
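A minimal sketch of the unimix parameterization:

```python
# Mix 1% uniform with 99% of the network's softmax output before building
# the categorical distribution.
import torch

def unimix_probs(logits, unimix=0.01):
    probs = torch.softmax(logits, dim=-1)
    uniform = torch.ones_like(probs) / probs.shape[-1]
    return (1.0 - unimix) * probs + unimix * uniform
```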

These mechanisms enable DreamerV3 to operate robustly without the domain-specific normalization and KL-annealing schedules that previous Dreamer variants depended on.

4. Implementation and Hyperparameterization

DreamerV3 is designed for minimal, domain-invariant hyperparameter tuning. Key adjustable parameters are:

  • Model size:

Controls layer widths/hidden sizes in the encoder, recurrent cell, decoders, actor, and critic.

  • Training ratio ($\rho$):

Number of gradient updates per environment step, balancing data reuse and overfitting risk.

For example, in the traffic signal control (TSC) domain, the paper found that:

  • Model size “S” achieves strong stability and data efficiency.
  • Training ratios $\rho$ in $[64, 512]$ are viable, with $\rho = 128$ strongly recommended.
  • Larger models (M, L) offer only modest gains and tolerate a narrower range of $\rho$ values (Li et al., 4 Mar 2025).

Default settings in the DreamerV3 codebase use: replay capacity $10^6$, batch size $16$, sequence length $64$, imagination horizon $15$, $32\times32$ discrete RSSM latents, Adam optimizer, LayerNorm + SiLU activations, and no dropout or weight decay. The same configuration solves over 150 tasks without adjustment (Hafner et al., 2023).
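For illustration, these defaults can be summarized as a configuration dictionary; the key names below are invented for this sketch and do not correspond to the official DreamerV3 config schema.

```python
# Illustrative configuration mirroring the defaults listed above; key names
# are made up for this sketch, not taken from the official codebase.
dreamerv3_defaults = {
    "replay_capacity": 1_000_000,      # replay buffer size
    "batch_size": 16,
    "sequence_length": 64,
    "imagination_horizon": 15,
    "latent": "32x32 discrete categorical",
    "optimizer": "adam",
    "norm_activation": "LayerNorm + SiLU",
    "dropout": 0.0,
    "weight_decay": 0.0,
}
```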

Summary of the TSC sweep over model size and training ratio (Li et al., 4 Mar 2025):

Parameter              XS             S        M          L
Viable $\rho$ values   64, 128, 512   64–512   128, 256   128 only
Time to stabilize      ~3 h           ~2.5 h   ~2.2 h     ~2.0 h
Best $\rho$            128            128      128        128

5. Applications and Empirical Performance

DreamerV3 has demonstrated state-of-the-art results in diverse tasks and domains:

  • Traffic Signal Control:

DreamerV3 trains a corridor TSC model in SUMO using queue lengths and signal phases as state, piecewise penalties for congestion, and discrete actions for split changes. Peak queue reductions from $>100$ vehicles to $<50$ were observed. Sample efficiency is confirmed, particularly with medium-size models and intermediate training ratios (Li et al., 4 Mar 2025).

  • Pixel-based RL (e.g., Minecraft, DeepMind Control, Atari):

DreamerV3 was the first algorithm to collect diamonds in Minecraft from scratch, outperforming competitors that rely on expert data, and attained new state-of-the-art results on DeepMind Control vision tasks and Crafter (Hafner et al., 2023).

  • Stability and Generality:

The algorithm shows stable convergence after an initial exploration phase, with larger models stabilizing fastest, and strong data efficiency compared with pure model-free baselines in control and multi-agent domains.

6. Comparative and Theoretical Analysis

DreamerV3 is distinguished from previous Dreamer variants by its robust, single-configuration training across domains, categorical/discrete value heads, and improved normalization and balancing. Notably, in the TSC study, the accelerated convergence expected from an increased training ratio $\rho$ (which holds in other environments) did not materialize; excessively high or low $\rho$ instead introduced instability (Li et al., 4 Mar 2025). The findings suggest that, for structured control domains, medium model and ratio choices are optimal and generalize across scenario changes.

A plausible implication is that the practical sample efficiency of DreamerV3 is problem-dependent, with configuration sweet-spots dictated by the complexity and smoothness of the task environment.

7. Significance and Future Considerations

DreamerV3 exemplifies a new class of world-model-based RL agents capable of generalizing across domain boundaries without manual reconfiguration. The capacity to learn effective policies with far fewer real-environment interactions, enabled by RSSM-based latent imagination, makes it suitable for large-scale and real-time applications where sample efficiency is paramount.

Current limitations include pronounced early-episode reward fluctuations, narrow viable hyperparameter ranges for large models, and problem-dependent data-efficiency characteristics. Further work may investigate domain-adaptive scheduling for the training ratio, improved latent representation learning under distractors, and formal guarantees for convergence times across classes of environments (Li et al., 4 Mar 2025, Hafner et al., 2023).
