TD-MPC: Temporal Difference Learning for Model Predictive Control
- TD-MPC is a model-based reinforcement learning method that integrates short-horizon MPC with long-horizon temporal difference learning using latent state representations.
- It leverages predictive rollouts, Q-function bootstrapping, and multi-step planning to bridge the gap between model-based and model-free paradigms.
- Recent advancements address value overestimation and distribution shift with KL-regularization, uncertainty quantification, and self-supervised representation learning.
Temporal Difference Learning for Model Predictive Control (TD-MPC) refers to a class of model-based reinforcement learning algorithms that fuse the strengths of trajectory-centric model predictive control with long-horizon value estimation via temporal-difference (TD) learning. By integrating a learned latent world model for short-term planning with a Q-function or value function for return bootstrapping, TD-MPC delivers improved performance and sample efficiency across a range of continuous control benchmarks. Recent advances address distributional shift and value overestimation through KL regularization, uncertainty quantification, and self-supervised representation learning, providing principled pathways to robust and efficient control in high-dimensional, stochastic domains.
1. Core TD-MPC Architecture
TD-MPC fundamentally marries a learned, task-oriented latent world model with online planning and temporal-difference bootstrapping for long-term reward estimation. The world model comprises an encoder $h_\theta$, a latent dynamics model $d_\theta$, a reward predictor $R_\theta$, a Q-function $Q_\theta$, and a policy head $\pi_\theta$ (for state or image observations) (Hansen et al., 2022).
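The component structure can be made concrete with a minimal PyTorch sketch. The module names mirror the notation above ($h_\theta$, $d_\theta$, $R_\theta$, $Q_\theta$, $\pi_\theta$), while the MLP architecture, layer widths, and latent dimension are illustrative assumptions rather than the reference implementation.

```python
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=256):
    # Small two-layer MLP used for every component in this sketch.
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ELU(),
                         nn.Linear(hidden, out_dim))

class TDMPCModel(nn.Module):
    """Illustrative TD-MPC components: encoder h, dynamics d, reward R, value Q, policy pi."""
    def __init__(self, obs_dim, act_dim, latent_dim=50):
        super().__init__()
        self.h = mlp(obs_dim, latent_dim)               # encoder: s_t -> z_t
        self.d = mlp(latent_dim + act_dim, latent_dim)  # latent dynamics: (z_t, a_t) -> z_{t+1}
        self.R = mlp(latent_dim + act_dim, 1)           # reward predictor
        self.Q = mlp(latent_dim + act_dim, 1)           # Q-function
        self.pi = mlp(latent_dim, act_dim)              # policy head (deterministic, for brevity)

    def encode(self, s):
        return self.h(s)

    def next(self, z, a):
        za = torch.cat([z, a], dim=-1)
        return self.d(za), self.R(za)
```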
At each step:
- The state $s_t$ is encoded as $z_t = h_\theta(s_t)$.
- Future latent transitions are rolled out via $z_{t+1} = d_\theta(z_t, a_t)$.
- The MPC planner solves for an optimal sequence of actions $a_{t:t+H}$ over a short horizon $H$, maximizing the predicted return $\sum_{i=t}^{t+H-1} \gamma^{i-t} R_\theta(z_i, a_i) + \gamma^{H} Q_\theta(z_{t+H}, a_{t+H})$ subject to the latent dynamics.
- TD learning is used to train $Q_\theta$ via the target $y_i = r_i + \gamma\, Q_{\theta^-}\big(z_{i+1}, \pi_\theta(z_{i+1})\big)$, where $\theta^-$ denotes a lagged target network.
The loss function combines reward error, TD error, and latent consistency loss over the rollout:
$$\mathcal{L}(\theta) = \sum_{i=t}^{t+H} \lambda^{i-t} \Big[\, c_1\, \| R_\theta(z_i, a_i) - r_i \|_2^2 \;+\; c_2\, \| Q_\theta(z_i, a_i) - y_i \|_2^2 \;+\; c_3\, \| d_\theta(z_i, a_i) - h_{\theta^-}(s_{i+1}) \|_2^2 \,\Big],$$
where $c_1, c_2, c_3$ weight the three terms and $\lambda \in (0, 1)$ emphasizes near-term predictions. Weighted multi-step rollouts accelerate convergence and improve representation learning.
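A minimal single-rollout version of this objective, reusing the component sketch above, might look as follows; the batch layout, coefficient values, and temporal weight are placeholder assumptions, not the published hyperparameters.

```python
import torch
import torch.nn.functional as F

def tdmpc_loss(model, target_model, batch, horizon=5, gamma=0.99,
               lam=0.5, c_reward=0.5, c_value=0.1, c_consistency=2.0):
    """Multi-step TD-MPC-style loss: reward error + TD error + latent consistency.

    Assumes `batch` holds tensors of shape [horizon+1, B, obs_dim] for "obs",
    [horizon, B, act_dim] for "act", and [horizon, B, 1] for "rew", and that
    `model`/`target_model` expose h, d, R, Q, pi as in the component sketch.
    Coefficients are placeholders, not the published hyperparameters.
    """
    obs, act, rew = batch["obs"], batch["act"], batch["rew"]
    z = model.h(obs[0])
    total = 0.0
    for i in range(horizon):
        za = torch.cat([z, act[i]], dim=-1)
        reward_pred = model.R(za)
        q_pred = model.Q(za)
        z_next_pred = model.d(za)

        with torch.no_grad():
            # TD target: y_i = r_i + gamma * Q_target(z_{i+1}, pi(z_{i+1})).
            z_next_tgt = target_model.h(obs[i + 1])
            a_next = model.pi(z_next_tgt)
            td_target = rew[i] + gamma * target_model.Q(
                torch.cat([z_next_tgt, a_next], dim=-1))

        total = total + (lam ** i) * (
            c_reward * F.mse_loss(reward_pred, rew[i])
            + c_value * F.mse_loss(q_pred, td_target)
            + c_consistency * F.mse_loss(z_next_pred, z_next_tgt))
        z = z_next_pred  # roll the latent forward for the next step
    return total
```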
2. Short-Horizon MPC with Latent World Models
Online planning leverages the learned world model through Model Predictive Path Integral (MPPI) control or the Cross-Entropy Method (CEM): a population of $H$-step action sequences is sampled, evaluated under the model, and refined by importance weighting of their returns. A Gaussian parameterization over each action in the horizon is iteratively optimized, and the executed action at each environment step is sampled from the optimized Gaussian for the first step. This mechanism balances exploitation of learned dynamics with exploration via noise and policy proposals (Hansen et al., 2022, Matthies et al., 2023).
The terminal value function $Q_\theta(z_{t+H}, a_{t+H})$ enables long-horizon information to propagate into short-horizon plans, closing the gap between purely model-based and model-free approaches by efficiently exploiting both trajectory optimization and value learning.
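A simplified CEM-style planning loop with this terminal bootstrap is sketched below, again reusing the component sketch from Section 1; the population size, elite count, iteration budget, and action bounds are illustrative assumptions, and the policy-proposal and MPPI-style soft-weighting terms of the full planner are omitted.

```python
import torch

@torch.no_grad()
def plan(model, z0, act_dim, horizon=5, iters=6, pop=512, elites=64, gamma=0.99):
    """CEM-style latent planning with a terminal Q bootstrap (illustrative sketch).

    `z0` is the current latent state with shape [1, latent_dim]; actions are
    assumed to live in [-1, 1].
    """
    mean = torch.zeros(horizon, act_dim)
    std = torch.ones(horizon, act_dim)
    for _ in range(iters):
        # Sample a population of H-step action sequences from the current Gaussian.
        actions = (mean + std * torch.randn(pop, horizon, act_dim)).clamp(-1, 1)
        z = z0.expand(pop, -1)
        returns, discount = torch.zeros(pop, 1), 1.0
        for t in range(horizon):
            za = torch.cat([z, actions[:, t]], dim=-1)
            returns += discount * model.R(za)   # accumulate predicted rewards
            z = model.d(za)
            discount *= gamma
        # Terminal value propagates long-horizon information into the short plan.
        a_T = model.pi(z)
        returns += discount * model.Q(torch.cat([z, a_T], dim=-1))
        # Refit the sampling Gaussian to the elite action sequences.
        elite = actions[returns.squeeze(-1).topk(elites).indices]
        mean, std = elite.mean(0), elite.std(0) + 1e-6
    # Execute a sample from the optimized Gaussian for the first step only.
    return (mean[0] + std[0] * torch.randn(act_dim)).clamp(-1, 1)
```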
3. Advanced Value Learning, Distributional Robustness, and Policy Constraints
Despite substantial empirical progress, TD-MPC architectures are vulnerable to persistent value overestimation, a phenomenon primarily arising from policy distribution mismatch. In standard TD-MPC, the data distribution is generated by the closed-loop MPC planner, while value updates are computed for the nominal policy $\pi_\theta$. This mismatch results in the Q-function being evaluated on out-of-distribution (OOD) actions, amplifying estimation error, especially in high-dimensional problems (Lin et al., 5 Feb 2025).
To mitigate this, TD-M(PC) introduces a KL-regularized policy update that constrains the nominal policy $\pi_\theta$ toward the planner's action distribution; the KL penalty discourages OOD queries to the Q-function, stabilizing value estimation. Theoretical analysis provides contraction bounds on the policy gap and demonstrates the close linkage between value error, model error, and policy divergence. Large gains are observed on complex, high-DoF tasks, including over 100% improvement in episode returns in 61-DoF humanoid simulation (Lin et al., 5 Feb 2025).
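A generic KL-constrained policy loss of this flavor is sketched below; it is not the exact TD-M(PC) objective, and the Gaussian policy parameterization, fixed standard deviation, and coefficient `beta` are assumptions for illustration.

```python
import torch
import torch.distributions as D

def kl_regularized_policy_loss(model, z, planner_mean, planner_std, beta=0.1):
    """KL-constrained policy update (generic sketch, not the exact TD-M(PC) objective).

    The policy maximizes Q while staying close to the planner's action
    distribution, which discourages out-of-distribution queries to the
    Q-function. `beta` trades off return maximization against the constraint.
    """
    policy_mean = model.pi(z)
    pi_dist = D.Normal(policy_mean, 0.2 * torch.ones_like(policy_mean))  # fixed std (assumption)
    planner_dist = D.Normal(planner_mean, planner_std)

    a = pi_dist.rsample()                                   # reparameterized action sample
    q = model.Q(torch.cat([z, a], dim=-1))
    kl = D.kl_divergence(pi_dist, planner_dist).sum(-1, keepdim=True)
    return (-q + beta * kl).mean()
```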
Further, uncertainty-aware extensions such as DoublyAware decompose aleatoric and epistemic uncertainty via conformal prediction and group-relative policy constraints, yielding robustness to stochastic dynamics and systematic trust-region regularization in the latent action space (Nguyen et al., 12 Jun 2025).
4. Representation Learning and Self-Supervision
Recent work augments TD-MPC with self-supervised representation learning (SRL), appending a reconstruction head to the latent state to provide a dense, reward-independent learning signal. The decoder $D_\phi$ is trained to minimize the reconstruction loss
$$\mathcal{L}_{\text{rec}} = \| D_\phi(z_t) - s_t \|_2^2,$$
which is combined with the standard model and value losses as $\mathcal{L} = \mathcal{L}_{\text{TD-MPC}} + c_{\text{rec}}\, \mathcal{L}_{\text{rec}}$. Empirically, this auxiliary reconstruction loss improves sample efficiency and learning stability, especially in pixel-based or noisy settings. Gains of up to 20% in early learning and lower performance variance are documented on DeepMind Control Suite benchmarks (Matthies et al., 2023). Careful tuning of the reconstruction weight $c_{\text{rec}}$ is required to avoid over-regularizing the representation.
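A sketch of how such an auxiliary term could be attached to the existing loss is shown below; the decoder architecture, dimensions, and weight `c_rec` are assumptions.

```python
import torch.nn as nn
import torch.nn.functional as F

def srl_augmented_loss(model, decoder, obs, base_loss, c_rec=1.0):
    """Add an auxiliary reconstruction term to the standard TD-MPC loss (sketch).

    `decoder` maps latents back to observations; `c_rec` is the reconstruction
    weight that, as noted above, must be tuned to avoid over-regularization.
    """
    z = model.h(obs)
    rec_loss = F.mse_loss(decoder(z), obs)
    return base_loss + c_rec * rec_loss

# An illustrative decoder head mirroring the encoder (dimensions are assumptions):
def make_decoder(latent_dim=50, obs_dim=24, hidden=256):
    return nn.Sequential(nn.Linear(latent_dim, hidden), nn.ELU(),
                         nn.Linear(hidden, obs_dim))
```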
5. Hierarchical and Offline Extensions
In offline and long-horizon sparse-reward settings, flat TD-MPC may fail to provide effective exploration or credit assignment. Hierarchical variants such as IQL-TD-MPC employ temporally abstract planning, where a Manager trained via Implicit Q-Learning plans subgoal ("intent") embeddings over k-step intervals. These intent vectors are then concatenated to the state inputs of a Worker agent, substantially boosting offline RL performance on challenging benchmarks such as D4RL AntMaze and Maze2D, with normalized evaluation scores rising from near zero to over 40 for many standard offline agents (Chitnis et al., 2023). Randomized-intent ablations confirm the specificity and importance of the learned abstractions.
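The Manager–Worker interface can be illustrated with a short sketch; the `plan_intent` method, the caching scheme, and the replanning interval `k` are hypothetical stand-ins for the paper's actual interfaces.

```python
import torch

def worker_observation(state, manager, k, step, cached_intent=None):
    """Condition the Worker on a temporally abstract "intent" from the Manager (sketch).

    The Manager replans an intent embedding only every k environment steps
    (`manager.plan_intent` is a hypothetical interface); in between, the Worker
    reuses the cached intent, concatenated to its state input.
    """
    if step % k == 0 or cached_intent is None:
        cached_intent = manager.plan_intent(state)
    return torch.cat([state, cached_intent], dim=-1), cached_intent
```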
6. Bootstrapped Policy Learning and Hybridization
Bootstrapped Model Predictive Control (BMPC) introduces explicit expert imitation into the TD-MPC loop. Here, the policy network is trained to match the MPC planner's action distribution using KL divergence, while also serving as a prior for future planning rollouts. A "lazy reanalyze" mechanism keeps policy updates aligned with the evolving planner at low computational cost, amortizing the expense of generating expert labels. This improves policy learning, value-estimation fidelity, and training stability, especially on high-dimensional locomotion tasks (Wang et al., 24 Mar 2025).
This tight coupling between MPC and policy distillation enables rapid policy extraction from planning and improved end-to-end training for closed-loop control.
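One way to realize such planner-to-policy distillation with lazy reanalysis is sketched below; the planner interface, reanalysis fraction, fixed policy standard deviation, and KL direction are assumptions rather than the BMPC specification.

```python
import torch
import torch.distributions as D

def lazy_reanalyze_policy_loss(model, planner, replay_latents, reanalyze_frac=0.1):
    """Planner-to-policy distillation with lazy reanalysis (sketch with assumed interfaces).

    Only a small fraction of replayed latents is re-planned per update, amortizing
    the cost of generating fresh expert action distributions; the policy is then
    trained to match those distributions via a KL imitation term.
    """
    n = max(1, int(reanalyze_frac * replay_latents.shape[0]))
    idx = torch.randperm(replay_latents.shape[0])[:n]
    z = replay_latents[idx]
    with torch.no_grad():
        expert_mean, expert_std = planner(z)   # hypothetical planner returning a Gaussian
    policy_mean = model.pi(z)
    pi_dist = D.Normal(policy_mean, 0.2 * torch.ones_like(policy_mean))  # fixed std (assumption)
    expert_dist = D.Normal(expert_mean, expert_std)
    return D.kl_divergence(expert_dist, pi_dist).sum(-1).mean()
```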
7. Experimental Results and Empirical Trends
TD-MPC and its recent derivatives consistently set state-of-the-art performance on a variety of continuous control suites, both for low-dimensional states and image-based observations:
- TD-MPC achieves near 10x improvement in sample efficiency on DMControl Humanoid and Dog, outperforming SAC and matching Dreamer-v2/DrQ-v2 (Hansen et al., 2022).
- TD-M(PC) achieves over 100% return gains and reduces value estimation error from 700–2000% to under 100% on 61-DoF humanoid tasks (Lin et al., 5 Feb 2025).
- SRL-augmented TD-MPC enhances initial learning speed and delivers more stable convergence, especially in vision-based settings (Matthies et al., 2023).
- Hierarchical IQL-TD-MPC as a Manager universally improves performance of offline RL agents on sparse-reward benchmarks (Chitnis et al., 2023).
- Bootstrapped policy learning and explicit planner-policy KL coupling (BMPC) further increase robustness and data efficiency on high-dimensional and complex manipulation tasks (Wang et al., 24 Mar 2025).
8. Challenges, Limitations, and Future Directions
Current limitations include:
- Persistent policy–planner mismatch without explicit KL regularization causes overestimation and suboptimal long-horizon value propagation (Lin et al., 5 Feb 2025).
- Over-regularization from auxiliary losses (e.g., reconstruction) can dampen asymptotic performance if not tuned (Matthies et al., 2023).
- Fixed abstraction scales in hierarchical TD-MPC may not align with task structure, motivating work on variable or learned timescales (Chitnis et al., 2023).
- Increased computation from MPC rollouts remains a practical cost, though it can be mitigated with amortized "lazy reanalyze" updates and parallelization.
Future research seeks adaptive constraint scheduling, richer self-supervision (contrastive, multi-step prediction), hierarchical planners with learned abstraction granularity, and scalable uncertainty quantification in high-dimensional systems (Matthies et al., 2023, Chitnis et al., 2023, Lin et al., 5 Feb 2025, Nguyen et al., 12 Jun 2025, Wang et al., 24 Mar 2025).
In summary, TD-MPC unifies latent world modeling, trajectory-centric MPC, and TD value learning, and continues to evolve through principled regularization, self-supervision, hierarchical abstraction, and robust uncertainty modeling. These innovations collectively advance data-efficient, high-dimensional, and robust continuous control in modern reinforcement learning.