TD-MPC2: Scalable Model-Based Control

Updated 20 November 2025
  • TD-MPC2 is a model-based RL algorithm that integrates an implicit decoder-free world model with short-horizon MPC and off-policy TD learning to achieve robust continuous control.
  • It employs modular MLP architectures with SimNorm, ensemble Q-networks, and a unified task embedding to enhance stability, scalability, and multi-task performance.
  • TD-MPC2 demonstrates superior performance in diverse domains such as robotic manipulation and medical navigation, achieving high sample efficiency and reliable operation.

TD-MPC2 is a model-based reinforcement learning (RL) algorithm designed to achieve scalable, robust, and sample-efficient continuous control by integrating implicit world models with short-horizon planning and off-policy temporal-difference (TD) learning. It extends previous model-based approaches by combining local trajectory optimization in the latent space of a learned world model with value and policy priors, unified multi-task architectures, and several stabilizing design innovations. TD-MPC2 has demonstrated strong performance across diverse domains and applications, including robotic control and medical navigation, and supports multitask and multi-embodiment learning with a single set of hyperparameters (Hansen et al., 2023, Lin et al., 5 Feb 2025, Robertshaw et al., 29 Sep 2025).

1. Algorithmic Framework and World Model Architecture

TD-MPC2 jointly learns an implicit, decoder-free world model and a value/policy backbone within a unified, multitask setting. The architecture is modular and conditioned on a learnable task embedding $e$, enabling broad generalization and scalability. The principal components, all parameterized as multi-layer perceptrons (MLPs) with LayerNorm and Mish activations, are listed below (a minimal code sketch follows the list):

  • State Encoder $\phi_{\theta_h}(s_t, e) \rightarrow z_t \in \mathbb{R}^d$: maps observations $s_t$ and task embedding $e$ to a latent state $z_t$.
  • Latent Dynamics $f_{\theta_d}(z_t, a_t, e) \rightarrow z_{t+1}$: models environment transitions in latent space.
  • Reward Predictor $r_{\theta_r}(z_t, a_t, e) \rightarrow \hat{r}_t$: estimates instantaneous rewards.
  • Value Head $q_{\theta_Q}(z_t, a_t, e) \rightarrow \hat{R}_t$: predicts the discounted cumulative return.
  • Policy Prior $p_{\theta_\pi}(z_t, e)$: stochastic Gaussian policy trained with a maximum-entropy objective.
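
The following is a minimal PyTorch sketch of this modular layout, not the authors' implementation; the `WorldModel`/`mlp` names and all sizes (`latent_dim`, `hidden`, `num_tasks`) are illustrative assumptions, and the scalar reward/value heads stand in for the discrete log-bin heads discussed in Section 3.

```python
# Minimal PyTorch sketch of the TD-MPC2-style module layout described above.
# All names and sizes are illustrative, not taken from the authors' code.
import torch
import torch.nn as nn

def mlp(in_dim, hidden_dim, out_dim, n_layers=2):
    """MLP with LayerNorm and Mish activations, the building block of every module."""
    layers, d = [], in_dim
    for _ in range(n_layers):
        layers += [nn.Linear(d, hidden_dim), nn.LayerNorm(hidden_dim), nn.Mish()]
        d = hidden_dim
    layers.append(nn.Linear(d, out_dim))
    return nn.Sequential(*layers)

class WorldModel(nn.Module):
    def __init__(self, obs_dim, act_dim, latent_dim=512, task_dim=96, num_tasks=80, hidden=512):
        super().__init__()
        self.task_emb = nn.Embedding(num_tasks, task_dim)                         # learnable task embedding e
        self.encoder = mlp(obs_dim + task_dim, hidden, latent_dim)                # phi: (s_t, e) -> z_t
        self.dynamics = mlp(latent_dim + act_dim + task_dim, hidden, latent_dim)  # f: (z_t, a_t, e) -> z_{t+1}
        self.reward = mlp(latent_dim + act_dim + task_dim, hidden, 1)             # r_hat (scalar here for simplicity)
        self.q = mlp(latent_dim + act_dim + task_dim, hidden, 1)                  # value head (single member shown)
        self.pi = mlp(latent_dim + task_dim, hidden, 2 * act_dim)                 # Gaussian policy prior (mean, log_std)

    def encode(self, obs, task_id):
        # SimNorm (sketched in the next paragraph's code) would be applied to this latent in practice.
        e = self.task_emb(task_id)
        return self.encoder(torch.cat([obs, e], dim=-1))

    def next(self, z, a, task_id):
        e = self.task_emb(task_id)
        return self.dynamics(torch.cat([z, a, e], dim=-1))
```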

The model is trained by minimizing a multi-term loss over sampled transitions and their associated task embeddings, with TD targets computed via an exponential moving average (EMA) target Q-network. To prevent latent instability, TD-MPC2 applies Simplicial Normalization ("SimNorm"): each latent vector is partitioned into fixed-size groups and a softmax at fixed temperature is applied within each group, biasing representations toward sparse, bounded codes.
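
A minimal sketch of SimNorm as just described; the group size and temperature below are illustrative defaults, not necessarily the published values:

```python
# Group-wise softmax over the latent vector, as described above.
import torch
import torch.nn.functional as F

def simnorm(z: torch.Tensor, group_size: int = 8, temperature: float = 1.0) -> torch.Tensor:
    """Partition the last dimension into groups and softmax-normalize each group."""
    *batch, d = z.shape
    assert d % group_size == 0, "latent dim must be divisible by group size"
    z = z.view(*batch, d // group_size, group_size)
    z = F.softmax(z / temperature, dim=-1)   # each group becomes a point on a simplex
    return z.view(*batch, d)

# Example: a 512-dim latent becomes 64 concatenated 8-dim simplices.
z = torch.randn(32, 512)
z_norm = simnorm(z)
assert torch.allclose(z_norm.view(32, 64, 8).sum(-1), torch.ones(32, 64))
```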

2. Latent-Space Trajectory Optimization via Short-Horizon MPC

Action selection in TD-MPC2 is governed by short-horizon model-predictive control (MPC) in the latent space using Model Predictive Path Integral (MPPI) sampling or the cross-entropy method (CEM). At each step, a Gaussian distribution over action sequences $\{a_{t:t+H}\}$ is iteratively refined to maximize the predicted sum of future rewards plus a bootstrapped value estimate:

$$
(\mu^*, \sigma^*) = \arg\max_{\mu, \sigma}\; \mathbb{E}_{a_{t:t+H} \sim \mathcal{N}(\mu, \sigma^2)} \left[ \sum_{h=0}^{H-1} \gamma^h\, r_{\theta_r}(z_{t+h}, a_{t+h}) + \gamma^H\, Q_{\theta_Q}(z_{t+H}, a_{t+H}) \right]
$$

with latent rollouts $z_{t+h+1} = f_{\theta_d}(z_{t+h}, a_{t+h})$ and $z_t = \phi_{\theta_h}(s_t)$. Planning is warm-started with solutions from the learned policy prior, which enhances sample efficiency. Upon convergence, the first action of the best sequence is executed in the environment.
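
A hedged sketch of such an MPPI-style latent planner, reusing the `WorldModel` sketch from Section 1; the sample count, iteration count, elite fraction, and softmax temperature are illustrative rather than the published hyperparameters, and warm-starting from the policy prior is omitted for brevity:

```python
# Illustrative MPPI-style planner over the latent space of the WorldModel sketch above.
# Hyperparameter defaults here are placeholders, not the values used in the paper.
import torch

@torch.no_grad()
def plan(model, obs, task_id, act_dim, horizon=3, num_samples=512, num_iters=6,
         elite_frac=0.1, gamma=0.99, mppi_temp=0.5):
    """Refine a Gaussian over action sequences a_{t:t+H}; return the first action."""
    e = model.task_emb(task_id).expand(num_samples, -1)                  # task_id: LongTensor of shape (1,)
    z0 = model.encode(obs.unsqueeze(0), task_id).expand(num_samples, -1)
    mu = torch.zeros(horizon + 1, act_dim)                               # H+1 actions: a_t ... a_{t+H}
    sigma = torch.ones(horizon + 1, act_dim)
    for _ in range(num_iters):
        actions = (mu + sigma * torch.randn(num_samples, horizon + 1, act_dim)).clamp(-1, 1)
        z, ret = z0, torch.zeros(num_samples)
        for h in range(horizon):                                         # sum of predicted rewards
            za = torch.cat([z, actions[:, h], e], dim=-1)
            ret = ret + (gamma ** h) * model.reward(za).squeeze(-1)
            z = model.dynamics(za)
        ret = ret + (gamma ** horizon) * model.q(                        # bootstrapped terminal value
            torch.cat([z, actions[:, horizon], e], dim=-1)).squeeze(-1)
        k = max(1, int(elite_frac * num_samples))                        # MPPI-style elite reweighting
        elite = ret.topk(k).indices
        w = torch.softmax(ret[elite] / mppi_temp, dim=0)[:, None, None]
        mu = (w * actions[elite]).sum(0)
        sigma = (w * (actions[elite] - mu) ** 2).sum(0).sqrt().clamp_min(1e-3)
    return mu[0]                                                         # execute a_t, then replan
```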

3. Enhancements over Previous Approaches

Relative to the original TD-MPC, TD-MPC2 incorporates several critical enhancements for stability, generalization, and scalability (Hansen et al., 2023):

  • Latent Stability: SimNorm prevents latent blowup by group-wise normalization.
  • Ensemble Value Estimation: Uses an ensemble of 5 Q-networks with dropout; TD targets are derived from the minimum of two randomly chosen EMA target heads.
  • Discrete Log-Bin Targets: Reward and value are modeled as categorical distributions over log bins and trained using cross-entropy (see the sketch after this list), enhancing robustness in sparse or heterogeneous reward scenarios.
  • Unified, Learnable Task Embedding: Every module is conditioned on a single end-to-end-learned vector $e \in \mathbb{R}^{96}$, facilitating single-network multitask learning across domains and across action/observation dimensionalities via zero-padding and masking.
  • Maximum-Entropy Policy Prior: Learns a stochastic prior via a soft actor-critic (SAC)-style maximum-entropy loss rather than a deterministic actor, supporting improved exploration and planning initialization.
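
One common way to realize the discrete log-bin regression mentioned above is a symlog transform combined with soft two-hot targets (as in DreamerV3-style discrete regression); the sketch below assumes that encoding, with the bin count and value range chosen purely for illustration:

```python
# Hedged sketch of discrete regression over log-transformed bins: scalar rewards/values
# are encoded as a categorical target and trained with cross-entropy.
import torch
import torch.nn.functional as F

NUM_BINS, VMIN, VMAX = 101, -10.0, 10.0
BINS = torch.linspace(VMIN, VMAX, NUM_BINS)

def symlog(x):
    return torch.sign(x) * torch.log1p(torch.abs(x))

def two_hot(x):
    """Encode a scalar (in symlog space) as a weighted pair of adjacent bins."""
    x = symlog(x).clamp(VMIN, VMAX)
    idx = torch.clamp(torch.bucketize(x, BINS) - 1, 0, NUM_BINS - 2)
    lo, hi = BINS[idx], BINS[idx + 1]
    w_hi = (x - lo) / (hi - lo)
    target = torch.zeros(*x.shape, NUM_BINS)
    target.scatter_(-1, idx.unsqueeze(-1), (1 - w_hi).unsqueeze(-1))
    target.scatter_(-1, (idx + 1).unsqueeze(-1), w_hi.unsqueeze(-1))
    return target

def discrete_regression_loss(logits, scalar_target):
    """Cross-entropy between predicted bin logits and the two-hot encoded target."""
    return -(two_hot(scalar_target) * F.log_softmax(logits, dim=-1)).sum(-1).mean()

# Example: train reward-head logits against a batch of scalar reward targets.
logits = torch.randn(256, NUM_BINS, requires_grad=True)
loss = discrete_regression_loss(logits, torch.randn(256) * 5)
loss.backward()
```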

Code efficiency improvements enable vectorized Q-ensemble computation and planning speedups, while support for zero-padding, masking, and multitask representation allows a single model to scale up to hundreds of tasks with variable dimensionalities.
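
A minimal sketch of one way to vectorize the Q-ensemble: all heads' weights are batched into one tensor so a single `einsum` evaluates every member at once. The `EnsembleLinear` helper and layer sizes are illustrative assumptions, not the repository's implementation:

```python
# Evaluate all ensemble members in one batched matmul instead of a Python loop.
import torch
import torch.nn as nn

class EnsembleLinear(nn.Module):
    def __init__(self, in_dim, out_dim, num_members=5):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_members, in_dim, out_dim) * in_dim ** -0.5)
        self.bias = nn.Parameter(torch.zeros(num_members, 1, out_dim))

    def forward(self, x):                       # x: (B, in_dim) or (E, B, in_dim)
        if x.dim() == 2:
            x = x.unsqueeze(0).expand(self.weight.shape[0], -1, -1)
        return torch.einsum('ebi,eio->ebo', x, self.weight) + self.bias

# All 5 Q-values for a batch of concatenated (z, a, e) inputs in one call:
q_ensemble = nn.Sequential(EnsembleLinear(512 + 6 + 96, 512), nn.Mish(), EnsembleLinear(512, 1))
q_values = q_ensemble(torch.randn(256, 512 + 6 + 96))   # -> (5, 256, 1)
```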

4. Training and Evaluation Regimes

TD-MPC2 employs uniform replay sampling, moderate buffer sizes (e.g., $10^6$–$10^7$ transitions), and batch sizes of 256–1024, with seed steps scaled to task episode length. Short planning horizons are used ($H=3$ in online RL, $H=12$ in surgical applications), and model/policy updates are performed with Adam optimizers, standard learning rates ($3\times10^{-4}$), distributional value-network targets, and automatic entropy tuning. Task embeddings can be rapidly adapted to novel tasks with few-shot fine-tuning, supporting generalization without altering the main network weights.
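
An illustrative configuration dictionary collecting the hyperparameters quoted above; values not stated in the text are assumptions and would need to be checked against the released code:

```python
# Illustrative training configuration; entries marked "assumed" are not from the text.
config = {
    "buffer_size": 1_000_000,        # 1e6–1e7 transitions depending on the setting
    "batch_size": 256,               # 256–1024 in the reported regimes
    "horizon": 3,                    # 3 for online RL; 12 reported for surgical navigation
    "lr": 3e-4,                      # Adam learning rate for model and policy
    "discount": 0.99,                # assumed default; scaled to episode length in practice
    "num_q_heads": 5,                # ensemble of Q-networks
    "task_embedding_dim": 96,        # unified learnable task embedding e
    "seed_steps": "scaled_to_episode_length",
    "entropy_tuning": "automatic",
}
```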

For evaluation, standardized benchmarks (DMControl, Meta-World, ManiSkill2, MyoSuite) and application domains (e.g., autonomous endovascular navigation) are used. The agent is evaluated on both sample efficiency and final returns, with comparison to state-of-the-art model-free (SAC) and model-based (DreamerV3, original TD-MPC) methods (Hansen et al., 2023, Robertshaw et al., 29 Sep 2025).

Empirical results demonstrate that TD-MPC2 attains higher or comparable returns with much greater sample efficiency and stability, particularly in complex, high-dimensional, and multi-task settings.

5. Applications and Empirical Benchmarks

TD-MPC2 is broadly applicable to both simulated and real-world continuous control and robotic planning tasks. Notable empirical findings include:

  • Robotic Manipulation & Locomotion: On 104 online RL tasks, TD-MPC2 outperforms SAC and DreamerV3 in both data efficiency and achievable performance, succeeding in all 39 DMControl tasks (>90% expert performance) and achieving 60%+ success on difficult object manipulation benchmarks such as Pick YCB (Hansen et al., 2023).
  • Multi-Task and Multi-Domain Scaling: TD-MPC2 supports a single 317M-parameter agent covering 80 tasks (multiple embodiments, domains, and action spaces) with a unified architecture. Scaling results indicate monotonic improvement with increased model and data size, with performance not saturated at the largest tested scales.
  • Medical Navigation: In autonomous endovascular navigation across 50 task–geometry combinations, TD-MPC2 achieved a mean success rate of 65% (vs. SAC’s 37%) and higher path efficiency, at the cost of longer average procedure times (15.2s for TD-MPC2 vs. 8.2s for SAC), illustrating a trade-off between success and execution speed (Robertshaw et al., 29 Sep 2025).

Selected Results Table

| Domain | TD-MPC2 Score | SAC Score | Notes |
|---|---|---|---|
| DMControl (39 tasks) | >90% expert | Lower | Robustness on dense locomotion tasks |
| Pick YCB (objects) | ~60% success | Baseline fails | Challenging object manipulation |
| Medical navigation (mean) | 65% success rate | 37% success rate | Based on navigation in real anatomies |

6. Algorithmic and Practical Trade-Offs

TD-MPC2 achieves robustness and sample efficiency by leveraging world-model-based lookahead and ensemble value estimation, but at the expense of increased per-step planning overhead and, consequently, longer decision times. For domains where real-time execution speed is critical (e.g., medical robotics), a trade-off emerges:

  • Planning Horizon vs. Speed: Higher planning horizons and extensive sampling increase success and safety but lengthen episode durations.
  • Safety/Efficiency: Model-based planning can favor conservative maneuvers, particularly in safety-critical navigation, yet may be suboptimal under strict time constraints (Robertshaw et al., 29 Sep 2025).
  • Policy Distillation: A plausible implication is that policies distilled from TD-MPC2's planner may offer improved runtime speed at marginally reduced performance, a direction meriting further study.

7. Risks, Limitations, and Extensions

Potential risks include persistent value overestimation due to policy mismatch between planner and policy prior, as highlighted by follow-up work which proposes policy constraints to mitigate out-of-distribution value errors (Lin et al., 5 Feb 2025). Practical challenges involve tuning of planning horizon and policy entropy, managing latency and compute costs at scale, and ensuring safe deployment in real-world domains with stringent execution requirements.

Ongoing research targets adaptive planning schemes, risk-sensitive or cost-aware loss modifications, and distillation of fast reactive controllers from slower but more accurate TD-MPC2 policies. Scaling studies suggest continuing gains with improved architecture, larger datasets, and more diverse tasks; performance has not plateaued as of models with 317 million parameters (Hansen et al., 2023).

