TD-MPC2: Robust Model-Based RL

Updated 22 August 2025
  • TD-MPC2 is a model-based reinforcement learning algorithm that utilizes latent trajectory planning, SimNorm normalization, and modern activation functions for robust continuous control.
  • It employs a discrete regression loss via cross-entropy objectives over rewards and value targets while using an ensemble of Q-functions to reduce overestimation bias.
  • Scalability is achieved through a single hyperparameter configuration, with normalized scores improving near-linearly with log model size when a single agent is trained on 80 tasks spanning multiple domains with large models and datasets.

TD-MPC2 is a model-based reinforcement learning algorithm designed for robust, scalable continuous control via local trajectory optimization in the latent space of an implicit, decoder-free world model. It systematically revises architectural and training facets from its predecessor, TD-MPC, delivering strong empirical performance across numerous task domains and facilitating the training of large multi-task agents with performance improving nearly monotonically with both model and dataset size.

1. Algorithmic Foundations and Architectural Innovations

TD-MPC2 replaces the original TD-MPC’s bare multilayer perceptron modules and ELU activations with networks utilizing LayerNorm and the Mish activation function throughout. Central to the stability of its deep latent world model is SimNorm, a normalization that partitions the latent state $z$ into multiple groups and projects each onto a fixed-dimensional simplex via softmax, enforcing a sparse, low-$\ell_2$-norm representation that mitigates gradient instability and prevents the explosion phenomena observed in previous variants.
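
As a rough illustration of the backbone change, the following PyTorch sketch shows a Linear → LayerNorm → Mish block of the kind described above; layer sizes and module names are illustrative assumptions rather than the reference implementation.

```python
import torch
import torch.nn as nn

class NormedMLPBlock(nn.Module):
    """Linear -> LayerNorm -> Mish building block (sizes are illustrative)."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, out_dim),
            nn.LayerNorm(out_dim),
            nn.Mish(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


# Example: a small latent encoder built from these blocks (dimensions are arbitrary).
encoder = nn.Sequential(NormedMLPBlock(39, 256), NormedMLPBlock(256, 512))
```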

Further architectural mechanisms include:

  • Expanding the Q-function ensemble (from 2 to 5 or more members, depending on scale);
  • Application of Dropout ($1\%$) after the first linear layer in each Q estimator;
  • TD-target computation using the minimum over two randomly sampled Q-functions from the ensemble, reducing overestimation bias (see the sketch after this list).
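
The ensemble value target in the last bullet can be sketched as follows; tensor shapes, discount handling, and function names are simplifying assumptions, not the official code.

```python
import torch

def td_target(reward, next_z, q_ensemble, policy, discount=0.99):
    """TD target using the min over two randomly sampled Q-functions.

    `q_ensemble` is a list of Q-networks and `policy` maps latents to actions;
    shapes and discounting are simplified for illustration.
    """
    with torch.no_grad():
        next_a = policy(next_z)
        # Pick two distinct ensemble members at random.
        idx = torch.randperm(len(q_ensemble))[:2]
        qs = torch.stack([q_ensemble[i](next_z, next_a) for i in idx], dim=0)
        # Minimum over the two sampled members reduces overestimation bias.
        return reward + discount * qs.min(dim=0).values
```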

In multi-task settings, each network component is conditioned on a learnable, normalized task embedding (bounded by $\|\cdot\|_2 \leq 1$), with zero-padding and action masking to accommodate different observation and action spaces, thereby minimizing domain-specific tuning.
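
A minimal sketch of this multi-task conditioning follows, with a norm-bounded task embedding, zero-padded observations, and masked actions; the dimension handling and the renormalization trick are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TaskConditioning(nn.Module):
    """Learnable task embeddings with ||e||_2 <= 1, plus padding/masking helpers."""

    def __init__(self, num_tasks: int, embed_dim: int, max_obs_dim: int, max_act_dim: int):
        super().__init__()
        self.embed = nn.Embedding(num_tasks, embed_dim)
        self.max_obs_dim = max_obs_dim
        self.max_act_dim = max_act_dim

    def task_vec(self, task_id: torch.Tensor) -> torch.Tensor:
        e = self.embed(task_id)
        # Renormalize so the embedding never exceeds unit L2 norm.
        return e / e.norm(dim=-1, keepdim=True).clamp(min=1.0)

    def pad_obs(self, obs: torch.Tensor) -> torch.Tensor:
        # Zero-pad observations up to the largest observation dimension.
        return F.pad(obs, (0, self.max_obs_dim - obs.shape[-1]))

    def mask_action(self, action: torch.Tensor, valid_dims: int) -> torch.Tensor:
        # Zero out action dimensions that a given task does not use.
        mask = torch.zeros_like(action)
        mask[..., :valid_dims] = 1.0
        return action * mask
```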

2. Training Objectives and Loss Formulation

TD-MPC2 further diverges from its predecessor by employing discrete regression via cross-entropy objectives, rather than continuous regression (MSE), for both reward and value targets. Rewards and value predictions are logarithmically transformed and discretized into bins, granting robustness to large variation in reward magnitude across heterogeneous tasks. The regularization from SimNorm and the small dropout rate further supports optimization stability.
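
The discrete-regression idea can be sketched as a symmetric log transform followed by a two-hot encoding over fixed bins and a cross-entropy loss; the bin count and value range below are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def symlog(x: torch.Tensor) -> torch.Tensor:
    """Symmetric log transform that compresses large reward/value magnitudes."""
    return torch.sign(x) * torch.log1p(torch.abs(x))

def two_hot(x: torch.Tensor, num_bins: int = 101, vmin: float = -10.0, vmax: float = 10.0) -> torch.Tensor:
    """Encode a scalar target as a 'two-hot' distribution over fixed bins (illustrative bin range)."""
    x = x.clamp(vmin, vmax)
    pos = (x - vmin) / (vmax - vmin) * (num_bins - 1)  # continuous bin index
    lo = pos.floor().long()
    hi = (lo + 1).clamp(max=num_bins - 1)
    w_hi = pos - lo.float()
    target = torch.zeros(*x.shape, num_bins, device=x.device)
    target.scatter_(-1, lo.unsqueeze(-1), (1.0 - w_hi).unsqueeze(-1))
    target.scatter_add_(-1, hi.unsqueeze(-1), w_hi.unsqueeze(-1))
    return target

def discrete_regression_loss(logits: torch.Tensor, scalar_target: torch.Tensor) -> torch.Tensor:
    """Cross-entropy between predicted bin logits and the two-hot (symlog) target."""
    target = two_hot(symlog(scalar_target))
    return -(target * F.log_softmax(logits, dim=-1)).sum(-1).mean()
```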

The policy prior is trained with a maximum entropy objective similar to Soft Actor-Critic, replacing TD-MPC’s deterministic prior plus Gaussian noise. This facilitates exploration and maintains consistency across varied tasks and dimensionalities. The policy objective is:

$$L_p(\theta) = \mathbb{E}_{(s,a)_{0:H} \sim \mathcal{B}}\left[\sum_{t=0}^{H} \lambda^{t}\left[\alpha\, Q(z_t, p(z_t)) - \beta\, \mathcal{H}\big(p(\cdot \mid z_t)\big)\right]\right]$$

where the entropy is computed only over valid action dimensions, which is crucial for multi-task setups.
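
A hedged sketch of an SAC-style maximum-entropy policy update over a latent rollout is shown below; the Gaussian policy head, coefficient values, and sign convention (maximizing Q plus entropy) are assumptions for illustration.

```python
import torch

def policy_loss(z_seq, policy, q_fn, valid_action_dims, alpha=1.0, beta=1e-4, lam=0.95):
    """SAC-style maximum-entropy policy objective over a latent rollout z_0..z_H.

    `policy(z)` is assumed to return a torch.distributions.Normal over actions;
    coefficients and the distribution type are illustrative.
    """
    loss = 0.0
    for t, z in enumerate(z_seq):
        dist = policy(z)
        a = dist.rsample()  # reparameterized sample so gradients flow through Q
        # Entropy is summed only over the task's valid action dimensions.
        entropy = dist.entropy()[..., :valid_action_dims].sum(-1)
        # Maximize alpha*Q + beta*H  ->  minimize the negative.
        loss = loss + (lam ** t) * (-(alpha * q_fn(z, a) + beta * entropy)).mean()
    return loss
```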

3. Planning via Latent-Space Model Predictive Path Integral Control

TD-MPC2 performs planning in the latent space using Model Predictive Path Integral (MPPI) optimization. The routine fits the parameters $(\mu^*, \sigma^*)$ of multivariate Gaussians over action sequences:

$$\mu^{*}, \sigma^{*} = \arg\max_{(\mu, \sigma)} \ \mathbb{E}_{a_{t:t+H} \sim \mathcal{N}(\mu, \sigma^{2})}\left[\gamma^{H} Q(z_{t+H}, a_{t+H}) + \sum_{h=t}^{t+H-1} \gamma^{h} R(z_h, a_h)\right]$$

Planning is warm-started from the previous solution, shifted temporally, and augmented by samples from the policy prior $p$ for improved convergence. Vectorization of sampling, elimination of momentum between iterations, and uniform replay buffer sampling increase throughput by $\approx 2\times$ with no loss of accuracy.
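
The planning loop can be sketched as follows; sample counts, elite fraction, temperature, and the interfaces of the world-model components are illustrative assumptions, and the policy-prior rollouts mentioned above are omitted for brevity.

```python
import torch

@torch.no_grad()
def plan_mppi(z0, dynamics, reward_fn, q_fn, policy_prior, horizon=3, num_samples=512,
              num_elites=64, iterations=6, gamma=0.99, temperature=0.5, prev_mu=None):
    """MPPI-style planning in latent space: iteratively refit a diagonal Gaussian
    over action sequences to the highest-return sampled trajectories.

    `dynamics(z, a)`, `reward_fn(z, a)`, `q_fn(z, a)` are learned world-model heads
    and `policy_prior(z)` returns a mean action; all constants are illustrative.
    """
    act_dim = policy_prior(z0).shape[-1]
    # Warm-start from the previous (time-shifted) solution when available.
    mu = prev_mu if prev_mu is not None else torch.zeros(horizon, act_dim)
    sigma = torch.ones(horizon, act_dim)

    for _ in range(iterations):
        # Sample candidate action sequences from the current Gaussian.
        actions = (mu + sigma * torch.randn(num_samples, horizon, act_dim)).clamp(-1, 1)

        # Score each candidate by rolling the latent dynamics forward.
        z = z0.expand(num_samples, -1)
        ret = torch.zeros(num_samples)
        for h in range(horizon):
            ret = ret + (gamma ** h) * reward_fn(z, actions[:, h])
            z = dynamics(z, actions[:, h])
        # Bootstrap beyond the horizon with the terminal value estimate.
        ret = ret + (gamma ** horizon) * q_fn(z, policy_prior(z))

        # Refit the Gaussian to a softmax-weighted set of elite sequences.
        elite_idx = ret.topk(num_elites).indices
        elite, elite_ret = actions[elite_idx], ret[elite_idx]
        w = torch.softmax(elite_ret / temperature, dim=0).view(-1, 1, 1)
        mu = (w * elite).sum(dim=0)
        sigma = ((w * (elite - mu) ** 2).sum(dim=0)).sqrt().clamp(min=0.05)

    # Execute the first action of the final plan; return mu for warm-starting.
    return mu[0], mu
```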

4. Scalability Across Diverse Domains and Tasks

TD-MPC2 demonstrates marked improvements over baselines (SAC, DreamerV3, TD-MPC) on 104 online RL tasks spanning DMControl, Meta-World, ManiSkill2, and MyoSuite domains. One hyperparameter configuration suffices for all tasks, indicating considerable robustness to reward scale, transition dynamics, and dimensionality (including high DoF Dog/Humanoid locomotion and complex manipulation). Scaling experiments show near-linear improvement in normalized score with log-model size; a single 317M-parameter agent trained on 545M transitions performs 80 distinct tasks without domain-specific tuning.

Model Size | Tasks | Normalized Score (trend)
---------- | ----- | ------------------------
5M         | 1     | Strong
317M       | 80    | Stronger

This scaling behavior, enabled by the described architectural and loss refinements, positions TD-MPC2 as a viable generalist agent framework.

5. Lessons, Risks, and Opportunities

Lessons learned include confirmation that robust normalization, discrete loss formulation, and multitask conditioning collectively minimize hyperparameter sensitivity and streamline “plug-and-play” deployment. This generality democratizes RL research by lowering the computational and engineering barrier for new applications.

Risks identified include reward misspecification, which can lead to unintended behavior in multi-task deployment. Physical deployment of large, generalist agents (e.g., for robotics) mandates strict safety checks, since unconstrained extrapolation beyond valid, seen data can cause catastrophic failures. The reliance on large, heterogeneous datasets also risks limiting such training to resource-rich teams.

Opportunities lie in extensions to vision-language models, higher-level compositional planning, and zero-shot transfer and adaptation. The framework is well suited to scaling to new tasks or domains with minimal per-task tuning.

6. Technical Details and Supplementary Resources

Key formulas central to TD-MPC2 include the planning policy:

$$\pi(s_t) = \arg\max_{a_{t:t+H}} \mathbb{E}\left[\sum_{i=0}^{H} \gamma^{i} R(s_{t+i}, a_{t+i})\right]$$

with latent-space rollouts estimating the expectation with the learned world model.

SimNorm is implemented as:

$$z^{\circ} = [g_1, \ldots, g_L] \quad \text{where} \quad g_i = \mathrm{softmax}\!\left(\frac{z_{i:i+V}}{\tau}\right)$$
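
A minimal PyTorch sketch of this normalization follows, assuming a group size $V$ and temperature $\tau$ chosen for illustration rather than taken from the reference implementation.

```python
import torch
import torch.nn as nn

class SimNorm(nn.Module):
    """Simplicial normalization: split the latent into groups of size V and
    apply a temperature-scaled softmax within each group.

    Group size and temperature are illustrative defaults, not the official values.
    """

    def __init__(self, group_size: int = 8, tau: float = 1.0):
        super().__init__()
        self.group_size = group_size  # V in the formula above
        self.tau = tau

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        *batch, dim = z.shape
        assert dim % self.group_size == 0, "latent dim must be divisible by group size"
        # Reshape into groups, softmax within each group, then flatten back.
        g = z.view(*batch, dim // self.group_size, self.group_size)
        g = torch.softmax(g / self.tau, dim=-1)
        return g.view(*batch, dim)
```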

Researchers can access models, code, and large-scale datasets at https://tdmpc2.com, enabling replication and further experimentation. Detailed appendices include ablation studies, multitask results, few-shot finetuning, and pseudocode for the model components and normalization layers.

7. Comparative Perspective and Outlook

When contrasting TD-MPC2’s continuous implicit world model with discrete stochastic latent models such as those in DC-MPC (Scannell et al., 1 Mar 2025), TD-MPC2 favors architectural and objective simplicity and stability at scale. Discrete world models may offer higher sample efficiency and richer uncertainty modeling in certain regimes, particularly high-dimensional manipulation and locomotion, but the log-linear scaling and robust performance of TD-MPC2 across 104 tasks under a single hyperparameter configuration are unmatched among continuous latent-space approaches.

A plausible implication is that future work integrating stochastic discrete latent transitions within the TD-MPC2 architectural skeleton—for example, replacing MSE regression objectives on the world model with cross-entropy on quantized codebooks—may combine the best of both algorithmic strands, further improving generalist RL capabilities.

TD-MPC2 is thus a robust model-based RL baseline for scalable continuous control, supporting deployment on a vast spectrum of embodied tasks, with explicit lessons for stability, scalability, and generalization. Its extensions and analysis frame ongoing research on generalist agents and world model design.
