Temporal Difference MPC
- Temporal Difference Model Predictive Control (TD-MPC) is a model-based reinforcement learning approach that fuses latent trajectory optimization with off-policy TD learning for robust continuous control.
- It leverages a learned world model with state encoding, latent dynamics, reward prediction, and value estimation to achieve high sample efficiency and stability in complex domains.
- Recent extensions incorporate multitask learning, uncertainty-aware planning, offline RL, and imitation learning, significantly enhancing performance in high-dimensional and stochastic settings.
Temporal Difference Model Predictive Control (TD-MPC) is a model-based reinforcement learning (MBRL) algorithm characterized by the integration of local trajectory optimization in the latent space of a learned world model with off-policy value function learning via temporal-difference (TD) updates. TD-MPC and its variants achieve high sample efficiency, stability, and strong asymptotic performance in continuous control and high-dimensional robotic domains by unifying planning, learning, and representation in a cohesive framework. The approach has subsequently inspired extensions to massive multitask setups, offline RL, imitation learning, and robust uncertainty-aware systems (Hansen et al., 2022, Hansen et al., 2023, Lin et al., 5 Feb 2025, Nguyen et al., 12 Jun 2025, Chitnis et al., 2023, Hassan et al., 2024, Matthies et al., 2023).
1. Core Principles and Model Architecture
TD-MPC centralizes four components in a learned latent space:
- State encoder $h_\theta$: maps observation $s_t$ to latent $z_t = h_\theta(s_t)$
- Latent dynamics $d_\theta$: models $z_{t+1} = d_\theta(z_t, a_t)$
- Reward predictor $R_\theta$: $\hat{r}_t = R_\theta(z_t, a_t)$
- Value (Q-) function $Q_\theta$: $\hat{q}_t = Q_\theta(z_t, a_t)$
The approach eschews explicit reconstruction of states in favor of a compact, task-centric representation. A deterministic policy (used as a proposal or for regularization) and a slowly updated target network are maintained for TD learning stability. Architecturally, all modules are parameterized as MLPs or convolutional networks (for image input). In some variants, self-supervised decoders augment the architecture with a reconstruction loss for increased representation robustness (Matthies et al., 2023).
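A minimal sketch of these four modules (plus the policy prior) in PyTorch is given below; the class name `TDMPCModel`, the layer sizes, and the shared MLP builder are illustrative assumptions rather than the reference implementation.

```python
# Minimal sketch of the TD-MPC latent-space components (PyTorch).
# Names and sizes are illustrative, not the authors' reference implementation.
import torch
import torch.nn as nn

def mlp(in_dim, hidden_dim, out_dim):
    return nn.Sequential(
        nn.Linear(in_dim, hidden_dim), nn.ELU(),
        nn.Linear(hidden_dim, hidden_dim), nn.ELU(),
        nn.Linear(hidden_dim, out_dim),
    )

class TDMPCModel(nn.Module):
    def __init__(self, obs_dim, act_dim, latent_dim=50, hidden_dim=512):
        super().__init__()
        self.encoder = mlp(obs_dim, hidden_dim, latent_dim)                # h: s -> z
        self.dynamics = mlp(latent_dim + act_dim, hidden_dim, latent_dim)  # d: (z, a) -> z'
        self.reward = mlp(latent_dim + act_dim, hidden_dim, 1)             # R: (z, a) -> r_hat
        self.q = mlp(latent_dim + act_dim, hidden_dim, 1)                  # Q: (z, a) -> q_hat
        self.pi = mlp(latent_dim, hidden_dim, act_dim)                     # policy prior

    def next(self, z, a):
        """Predict the next latent and reward for a latent-action pair."""
        za = torch.cat([z, a], dim=-1)
        return self.dynamics(za), self.reward(za)
```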
2. Model Learning and Temporal Difference Objective
The backbone of TD-MPC training is the joint optimization of dynamics, reward, and value estimation over short latent rollouts, using a compound loss

$$\mathcal{L}(\theta) = \sum_{i=t}^{t+H} \lambda^{i-t}\left( c_1\,\mathcal{L}_{\mathrm{cons}} + c_2\,\mathcal{L}_{\mathrm{rew}} + c_3\,\mathcal{L}_{\mathrm{val}} \right),$$

where:
- $\mathcal{L}_{\mathrm{cons}} = \lVert d_\theta(z_i, a_i) - h_{\theta^-}(s_{i+1}) \rVert_2^2$ (latent consistency)
- $\mathcal{L}_{\mathrm{rew}} = \lVert R_\theta(z_i, a_i) - r_i \rVert_2^2$ (reward prediction)
- $\mathcal{L}_{\mathrm{val}} = \lVert Q_\theta(z_i, a_i) - \big( r_i + \gamma\, Q_{\theta^-}(z_{i+1}, \pi_\theta(z_{i+1})) \big) \rVert_2^2$ (single-step Bellman residual)
- $\mathcal{L}_{\mathrm{rec}}$: mean-squared error between the decoded observation $\hat{s}_i$ and the ground-truth observation $s_i$, if a decoder and reconstruction loss are used.
Multi-step rollouts of length $H$ are used for both learning and planning, with per-step terms weighted by the decay factor $\lambda$. Target networks and ensemble methods are adopted to stabilize value estimation (Hansen et al., 2023, Nguyen et al., 12 Jun 2025).
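The sketch below evaluates this compound loss on a single $H$-step trajectory segment, reusing the hypothetical `TDMPCModel` from Section 1; the coefficient values, the decay `lam`, and the use of a slowly updated `target_model` for the consistency and Bellman targets follow the description above, but the specific numbers are assumptions.

```python
# Illustrative single-trajectory version of the TD-MPC training objective:
# latent consistency, reward prediction, and a one-step TD residual, summed
# over an H-step rollout with decay lam. Coefficients c1-c3 are placeholders.
import torch
import torch.nn.functional as F

def tdmpc_loss(model, target_model, obs, acts, rews,
               lam=0.5, gamma=0.99, c1=2.0, c2=0.5, c3=0.1):
    """obs: (H+1, obs_dim), acts: (H, act_dim), rews: (H, 1)."""
    H = acts.shape[0]
    z = model.encoder(obs[0])                                        # initial latent
    loss = 0.0
    for t in range(H):
        z_next_pred, r_pred = model.next(z, acts[t])
        with torch.no_grad():
            z_next_tgt = target_model.encoder(obs[t + 1])            # consistency target
            a_next = model.pi(z_next_tgt)
            td_target = rews[t] + gamma * target_model.q(
                torch.cat([z_next_tgt, a_next], dim=-1))             # Bellman target
        q_pred = model.q(torch.cat([z, acts[t]], dim=-1))
        loss = loss + (lam ** t) * (
            c1 * F.mse_loss(z_next_pred, z_next_tgt)                 # latent consistency
            + c2 * F.mse_loss(r_pred, rews[t])                       # reward prediction
            + c3 * F.mse_loss(q_pred, td_target)                     # TD residual
        )
        z = z_next_pred                                              # roll latent forward
    return loss
```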
3. Planning Loop: Model Predictive Control in Latent Space
At each control step, TD-MPC solves a finite-horizon planning problem starting from the current state encoding $z_t = h_\theta(s_t)$. The planning objective maximizes (or minimizes, in the cost-optimizing case) a sum of predicted rewards plus a terminal value estimate,

$$J(a_{t:t+H}) = \sum_{i=t}^{t+H-1} \gamma^{i-t} R_\theta(z_i, a_i) + \gamma^{H} Q_\theta(z_{t+H}, a_{t+H}),$$

with latent rollouts $z_{i+1} = d_\theta(z_i, a_i)$.
Planning is accomplished by a sampling-based optimizer (typically CEM or MPPI), iteratively updating a Gaussian distribution over action sequences using weighted returns. A fraction of candidates may be drawn from a learned policy prior. The first action of the final best plan is executed in the environment. This approach achieves high data- and planning-efficiency, and is robust to rollout model error when the planning horizon is correctly tuned (Hansen et al., 2022, Hansen et al., 2023, Nguyen et al., 12 Jun 2025).
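A condensed CEM-style planner over latent rollouts might look as follows; the population size, elite count, horizon, and the omission of policy-prior candidate samples are simplifications of the full CEM/MPPI procedure described above, and all hyperparameter values are assumptions.

```python
# Sketch of the sampling-based planner: CEM over action sequences in latent
# space, scoring each candidate by predicted rewards plus a terminal value.
import torch

@torch.no_grad()
def plan(model, obs, horizon=5, iters=6, n_samples=512, n_elites=64,
         act_dim=6, gamma=0.99):
    z0 = model.encoder(obs)
    mean = torch.zeros(horizon, act_dim)
    std = torch.ones(horizon, act_dim)
    for _ in range(iters):
        # Sample candidate action sequences: (n_samples, horizon, act_dim)
        acts = (mean + std * torch.randn(n_samples, horizon, act_dim)).clamp(-1, 1)
        z = z0.expand(n_samples, -1)
        returns = torch.zeros(n_samples, 1)
        discount = 1.0
        for t in range(horizon):
            z, r = model.next(z, acts[:, t])                 # latent rollout
            returns += discount * r
            discount *= gamma
        # Terminal value bootstrap using the policy prior
        a_term = model.pi(z)
        returns += discount * model.q(torch.cat([z, a_term], dim=-1))
        # Refit the sampling distribution to the top-k candidates
        elite_idx = returns.squeeze(-1).topk(n_elites).indices
        elites = acts[elite_idx]
        mean, std = elites.mean(0), elites.std(0) + 1e-6
    return mean[0]  # MPC: execute only the first action of the plan
```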
4. Algorithmic Advances and Extensions
TD-MPC has rapidly evolved, with key contributions:
- TD-MPC2 introduces architectural improvements (LayerNorm + Mish activations), a softmax-based "SimNorm" latent normalization (see the sketch after this list), an ensemble of Q-networks with randomized target-minimization, and log-space soft cross-entropy regression for handling reward/value scale variability. It supports single-agent multitask training with up to 317M parameters and exhibits a scaling law: performance rises approximately linearly with log(parameter count) (Hansen et al., 2023).
- Policy constraint regularization in TD-M(PC) addresses the structural policy mismatch between replay-buffer data generated by the MPC planner and the nominal SAC-style policy prior $\pi_\theta$. A KL penalty in the actor loss reduces out-of-distribution (OOD) value queries, mitigating persistent value overestimation and improving return by up to 100% on high-DoF humanoid tasks (Lin et al., 5 Feb 2025).
- Offline and hierarchical RL: IQL-TD-MPC combines implicit Q-learning regularization with TD-MPC, supporting advantage-weighted regression and expectile value functions for offline robustness. As a hierarchical Manager, it emits intent embeddings that substantially boost the efficacy of off-the-shelf offline RL agents on long-horizon AntMaze tasks (Chitnis et al., 2023).
- Uncertainty-aware extensions: The DoublyAware framework introduces explicit decompositions of aleatoric ("planning") and epistemic ("policy") uncertainty. Conformal quantile trajectory filtering and group-relative policy constraints (GRPC) increase robustness and exploration efficiency in humanoid locomotion (Nguyen et al., 12 Jun 2025).
- Inverse RL for visuomotor imitation: TD-MPC principles have been adapted to learn cost functions from visual demonstrations via IRL, jointly optimizing dynamics, cost, and critic in a single TD loop for sample-efficient robotic arm manipulation (Hassan et al., 2024).
- Self-supervised reconstruction: Augmenting TD-MPC with a latent decoder and reconstruction loss yields improved robustness and sample efficiency, especially for image-based and sparse/noisy-reward environments (Matthies et al., 2023).
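As referenced in the TD-MPC2 item above, a minimal sketch of the "SimNorm" latent normalization follows: the latent vector is partitioned into fixed-size groups and a softmax is applied within each group, so the latent lies on a product of simplices. The group size of 8 used here is an assumed default.

```python
# Sketch of simplicial ("SimNorm") latent normalization: split the latent
# into groups of fixed size and apply a softmax within each group.
import torch
import torch.nn.functional as F

def simnorm(z: torch.Tensor, group_size: int = 8) -> torch.Tensor:
    """Apply simplicial normalization to the last dimension of z."""
    shape = z.shape
    z = z.view(*shape[:-1], -1, group_size)   # (..., n_groups, group_size)
    z = F.softmax(z, dim=-1)                  # softmax within each group
    return z.view(*shape)                     # flatten back to original shape

# Example: a 512-dim latent becomes 64 groups of 8 that each sum to 1.
z = simnorm(torch.randn(32, 512))
assert torch.allclose(z.view(32, 64, 8).sum(-1), torch.ones(32, 64))
```

Constraining latents to a bounded simplex in this way keeps their magnitude fixed, which is credited with stabilizing joint training of the dynamics and value modules across tasks of very different reward scales.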
5. Empirical Results and Benchmarks
TD-MPC and its descendants have been evaluated on the DeepMind Control Suite, Meta-World, D4RL, HumanoidBench, ManiSkill2, and MyoSuite:
- State-based tasks: Solved in under 1M steps with a 10–15% mean return improvement from auxiliary reconstruction (Matthies et al., 2023).
- Image-based control: Matches or outperforms model-free SAC, DrQ-v2, and Dreamer baselines on difficult benchmarks (Hansen et al., 2022, Hansen et al., 2023).
- Large-scale multitask: TD-MPC2 achieves a normalized score of 70.6 across 80 tasks with a single 317M-parameter model, scaling performance with total data and model size (Hansen et al., 2023).
- Hierarchical RL and offline ant maze: IQL-TD-MPC raises zero-score workers (e.g., CQL, TD3-BC) to average normalized scores >40 on AntMaze-medium and -large (Chitnis et al., 2023).
- Robustness and sample efficiency: Uncertainty-aware planning (DoublyAware) delivers up to 40% faster convergence relative to TD-MPC2 in 26-DoF humanoid locomotion (Nguyen et al., 12 Jun 2025).
- Robotic imitation: Over 90% success rate in simulated 7-DoF Panda pick-and-place after ∼5000 gradient steps, with bi-level IRL-TD-MPC outperforming feature-matching baselines (Hassan et al., 2024).
- High-DoF tasks: Policy-constrained TD-M(PC) surpasses vanilla TD-MPC2 by large margins (especially on 61-DoF humanoid), correcting value estimation drift prevalent in earlier iterations (Lin et al., 5 Feb 2025).
6. Limitations, Open Problems, and Frontiers
TD-MPC’s reliance on the quality of the latent world model and accurate TD value learning makes it sensitive to model bias, reward scale, and planner-policy alignment. Persistent value overestimation can arise from the mismatch between the planning policy (the MPC planner that generates environment data) and the off-policy learned policy prior; policy regularizers ameliorate but do not fully resolve this. Offline and multitask extensions expose challenges in compounding model errors and scaling across diverse regime boundaries, though techniques such as SimNorm and cross-entropy regression increase robustness (Hansen et al., 2023, Lin et al., 5 Feb 2025). Hierarchical approaches depend on the temporal abstraction hyperparameter, and are not universally optimal across fine-grained and temporally extended tasks (Chitnis et al., 2023).
Continued research aims to further unify model-based RL planning, world-model representation learning, and offline robustness, with emerging approaches focusing on uncertainty calibration, task-agnostic adaptation, and modular hierarchical abstractions. TD-MPC and its variants remain a focal point for sample-efficient and robust control in high-dimensional, stochastic, and multi-embodiment domains.
Selected References
| Paper Title | Reference | Key Topic |
|---|---|---|
| Temporal Difference Learning for Model Predictive Control | (Hansen et al., 2022) | Foundational TD-MPC, latent world models |
| TD-MPC2: Scalable, Robust World Models for Continuous Control | (Hansen et al., 2023) | Large-scale multitask, TD-MPC2 improvements |
| TD-M(PC): Improving Temporal Difference MPC Through Policy Constraint | (Lin et al., 5 Feb 2025) | Policy constraint regularization, OOD mitigation |
| DoublyAware: Dual Planning and Policy Awareness for TD Learning... | (Nguyen et al., 12 Jun 2025) | Uncertainty decomposition, high-DoF RL |
| IQL-TD-MPC: Implicit Q-Learning for Hierarchical MPC | (Chitnis et al., 2023) | Offline/Hierarchical RL, intent embeddings |
| Robotic Arm Manipulation with IRL & TD-MPC | (Hassan et al., 2024) | Bi-level IRL, visual control |
| Model Predictive Control with Self-supervised Representation Learning | (Matthies et al., 2023) | Self-supervised reconstruction extension |