TD-MPC Agent: Data-Efficient Model-Based RL
- TD-MPC Agent is a model-based reinforcement learning method that leverages an implicit latent-space world model to integrate MPC planning with temporal difference value learning.
- It employs a decoder-free architecture with latent dynamics, reward prediction, and a policy prior; recent variants add explicit policy constraints to mitigate policy mismatch and out-of-distribution value-estimation errors.
- TD-MPC algorithms have demonstrated scalability and improved sample efficiency in complex continuous control tasks across simulated and robotic high-dimensional environments.
TD-MPC Agent refers to a class of model-based reinforcement learning (MBRL) algorithms that integrate online trajectory optimization (model-predictive control, MPC) in the latent space of a learned world model with temporal difference (TD) value learning, producing highly data-efficient agents for continuous control. The TD-MPC framework and its modern descendants, such as TD-MPC2 and TD-M(PC), iteratively update a latent world model together with value and policy-prior networks in a learned embedding space, enabling fast planning and improved robustness to high-dimensional dynamics. Recent work emphasizes the importance of controlling policy mismatch and out-of-distribution errors in value estimation, introducing explicit policy constraints to regularize actor updates. TD-MPC algorithms have been empirically shown to scale to tens of millions of parameters and to excel in both single-task and multi-task continuous-control domains, including high-DoF simulated and robotic environments (Hansen et al., 2023, Lin et al., 5 Feb 2025).
1. Architectural Foundations
At the core of TD-MPC is an implicit, decoder-free world model trained entirely in a learned latent space. This model consists of:
- State encoder $h$: maps the observed state $s_t$ to a latent vector $z_t = h(s_t)$.
- Latent dynamics $d$: predicts $z_{t+1} = d(z_t, a_t)$ for action $a_t$.
- Reward model $R$: outputs the predicted reward $\hat{r}_t = R(z_t, a_t)$.
- Action-value ensemble $\{Q_i\}$: approximates $Q(z_t, a_t)$, typically as an ensemble of categorical critics.
- Policy-prior network $p$: a diagonal Gaussian $p(a_t \mid z_t)$ used in value bootstrapping and as a proposal distribution in MPC planning.
The architecture relies on Simplicial Normalization (SimNorm) in latent layers to control norm explosion, LayerNorm+Mish activations for stability, and a shared latent space for state, transition, and reward/value functions (Hansen et al., 2023, Lin et al., 5 Feb 2025).
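A minimal PyTorch-style sketch of these components is given below; the layer widths, the SimNorm chunk size, the default ensemble/bin counts, and the module names are illustrative assumptions rather than the reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimNorm(nn.Module):
    """Simplicial normalization: softmax over fixed-size chunks of the latent vector."""
    def __init__(self, chunk_size: int = 8):
        super().__init__()
        self.chunk_size = chunk_size

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        *batch, d = z.shape
        z = z.view(*batch, d // self.chunk_size, self.chunk_size)
        return F.softmax(z, dim=-1).view(*batch, d)

def mlp(in_dim: int, hidden: int, out_dim: int, out_act: nn.Module = None) -> nn.Sequential:
    """LayerNorm + Mish hidden layers, as described above."""
    layers = [nn.Linear(in_dim, hidden), nn.LayerNorm(hidden), nn.Mish(),
              nn.Linear(hidden, hidden), nn.LayerNorm(hidden), nn.Mish(),
              nn.Linear(hidden, out_dim)]
    if out_act is not None:
        layers.append(out_act)
    return nn.Sequential(*layers)

class WorldModel(nn.Module):
    """Decoder-free latent world model: encoder, dynamics, reward, Q-ensemble, policy prior."""
    def __init__(self, obs_dim: int, act_dim: int, latent_dim: int = 512,
                 hidden: int = 512, num_q: int = 5, num_bins: int = 101):
        super().__init__()
        self.h = mlp(obs_dim, hidden, latent_dim, SimNorm())                 # z = h(s)
        self.d = mlp(latent_dim + act_dim, hidden, latent_dim, SimNorm())    # z' = d(z, a)
        self.R = mlp(latent_dim + act_dim, hidden, num_bins)                 # categorical reward head
        self.Qs = nn.ModuleList([mlp(latent_dim + act_dim, hidden, num_bins)
                                 for _ in range(num_q)])                     # categorical critic ensemble
        self.pi = mlp(latent_dim, hidden, 2 * act_dim)                       # Gaussian policy prior (mean, log-std)
```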
2. Model-Predictive Planning in Latent Space
At each environment interaction step, the agent solves a short-horizon trajectory optimization (horizon $H$, typically only a few steps) entirely in latent space to select the next action:
$$\max_{a_{0:H-1}} \; \mathbb{E}\left[\sum_{t=0}^{H-1} \gamma^{t} R(z_t, a_t) \;+\; \gamma^{H} Q\big(z_H, p(z_H)\big)\right] \quad \text{s.t.} \quad z_{t+1} = d(z_t, a_t),\; z_0 = h(s_0).$$
This is solved using a sample-based optimizer such as Model Predictive Path Integral (MPPI) control or the Cross-Entropy Method (CEM): candidate action sequences are proposed (with a subset drawn from the policy prior $p$), unrolled in the latent world model, and weighted by their estimated returns. Elite sequences are used to update the trajectory distribution, and the first action of the best plan is executed (Hansen et al., 2023, Lin et al., 5 Feb 2025).
Bootstrapping the terminal value with $\gamma^{H} Q\big(z_H, p(z_H)\big)$ ensures that planning incorporates expected future value beyond the planning horizon.
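A minimal sketch of the MPPI-style planner under this notation follows; it assumes a `model` object exposing `next(z, a)`, `reward(z, a)`, `Q(z, a)`, a policy-prior sampler `pi(z)`, and `act_dim`, and the defaults (temperature, prior-sample count, noise floor) are illustrative.

```python
import torch

@torch.no_grad()
def plan(model, z0: torch.Tensor, horizon: int = 3, iters: int = 6,
         samples: int = 512, elites: int = 64, pi_samples: int = 24,
         gamma: float = 0.99, temperature: float = 0.5) -> torch.Tensor:
    """Sample-based latent-space planning (MPPI-style); returns the first action of the best plan."""

    def rollout_prior(n: int) -> torch.Tensor:
        # Seed a subset of candidate sequences with the policy prior.
        z, acts = z0.expand(n, -1), []
        for _ in range(horizon):
            a = model.pi(z)
            acts.append(a)
            z = model.next(z, a)
        return torch.stack(acts, dim=1)                        # (n, horizon, act_dim)

    mean = torch.zeros(horizon, model.act_dim)
    std = torch.ones(horizon, model.act_dim)
    for _ in range(iters):
        # Propose candidate action sequences (Gaussian samples + prior rollouts).
        noise = torch.randn(samples - pi_samples, horizon, model.act_dim)
        actions = torch.cat([(mean + std * noise).clamp(-1, 1),
                             rollout_prior(pi_samples)], dim=0)
        # Evaluate candidates by unrolling the latent model, bootstrapping with Q at the horizon.
        z, G, discount = z0.expand(actions.shape[0], -1), 0.0, 1.0
        for t in range(horizon):
            G = G + discount * model.reward(z, actions[:, t])
            z = model.next(z, actions[:, t])
            discount *= gamma
        G = G + discount * model.Q(z, model.pi(z))
        # Re-fit the sampling distribution to the exponentially weighted elite sequences.
        idx = G.topk(elites).indices
        w = torch.softmax(G[idx] / temperature, dim=0)[:, None, None]
        mean = (w * actions[idx]).sum(0)
        std = (w * (actions[idx] - mean) ** 2).sum(0).sqrt().clamp_min(0.05)
    return actions[G.argmax(), 0]
```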
3. Temporal Difference Value Learning
TD-MPC algorithms train their critic ensembles via TD bootstrapping using both real and model rollouts:
$$y_t = r_t + \gamma\, \bar{Q}\big(z_{t+1}, p(z_{t+1})\big),$$
where $\bar{Q}$ denotes a slowly updated target critic. The supervised loss for $Q$ adopts categorical or quantile target regression, formulated as a cross-entropy between the predicted and bootstrapped target distributions. Critic targets are computed via ensemble averaging, often with the "min of two" trick to mitigate overestimation.
The world model optimizes a sum of latent-consistency (prediction), reward, and value losses, weighted by coefficients $c_{\text{dyn}}$, $c_{\text{rew}}$, and $c_{\text{val}}$, respectively (Hansen et al., 2023, Lin et al., 5 Feb 2025). Model and value losses are aggregated over $H$-step rollouts drawn from the replay buffer.
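A sketch of the TD target and categorical critic loss under these conventions is shown below; the helper methods `Q_pair`, `Q_logits`, `two_hot`, and `num_q`, as well as the min-of-two target, are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def td_targets(target_model, rewards: torch.Tensor, next_z: torch.Tensor,
               gamma: float = 0.99) -> torch.Tensor:
    """Bootstrapped TD target y = r + gamma * Q_bar(z', p(z')), computed with the target critic."""
    with torch.no_grad():
        a_next = target_model.pi(next_z)                # action from the policy prior
        q1, q2 = target_model.Q_pair(next_z, a_next)    # two (sub-sampled) target heads
        return rewards + gamma * torch.min(q1, q2)      # "min of two" to curb overestimation

def critic_loss(model, z: torch.Tensor, a: torch.Tensor,
                targets: torch.Tensor) -> torch.Tensor:
    """Categorical (discrete-regression) critic loss, averaged over the ensemble."""
    target_dist = model.two_hot(targets)                # project scalar targets onto the bin support
    loss = 0.0
    for q_logits in model.Q_logits(z, a):               # one (batch, num_bins) logit tensor per member
        loss = loss + F.cross_entropy(q_logits, target_dist)
    return loss / model.num_q
```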
4. Policy Mismatch and KL-Regularized Actor Updates
A principal challenge in TD-MPC is policy mismatch: the planner ($\pi^{\mathrm{MPC}}$) generating buffer experience may systematically differ from the policy prior ($p$) used in value learning. As a consequence, value targets are often evaluated on out-of-distribution (OOD) state-action pairs, leading to persistent value overestimation and error accumulation. Theoretical analysis shows that the performance gap grows with both model errors and planner–prior divergence unless sufficiently corrected (Lin et al., 5 Feb 2025).
TD-M(PC) introduces KL-divergence regularization between the learned policy prior $p$ and the buffer's mixture of planner policies $\beta$, enforcing
$$D_{\mathrm{KL}}\big(\beta(\cdot \mid z) \,\|\, p(\cdot \mid z)\big) \le \epsilon$$
within policy improvement. The actor loss becomes
$$\mathcal{L}_{\pi} = \mathbb{E}_{z \sim \mathcal{B}}\Big[ -Q\big(z, p(z)\big) \;-\; \alpha\, \mathcal{H}\big(p(\cdot \mid z)\big) \;+\; \lambda\, D_{\mathrm{KL}}\big(\beta(\cdot \mid z) \,\|\, p(\cdot \mid z)\big) \Big],$$
where $\alpha$ controls the entropy regularization and $\lambda$ the KL penalty. This suppresses OOD action queries, stabilizing value learning, and can be introduced smoothly via a curriculum on $\lambda$ (Lin et al., 5 Feb 2025).
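A sketch of the resulting actor objective follows; the stored planner mean/std, the KL direction, the squashing, and the coefficients `alpha` and `lam` are illustrative assumptions, not the reference implementation.

```python
import torch
from torch.distributions import Normal, kl_divergence

def actor_loss(model, z: torch.Tensor, planner_mean: torch.Tensor,
               planner_std: torch.Tensor, alpha: float = 1e-4,
               lam: float = 0.1) -> torch.Tensor:
    """Entropy- and KL-regularized actor loss: -Q(z, p(z)) - alpha*H(p) + lam*KL(beta || p)."""
    mean, log_std = model.pi_params(z)              # diagonal-Gaussian policy prior p(.|z)
    pi = Normal(mean, log_std.exp())
    beta = Normal(planner_mean, planner_std)        # planner (behavior) distribution stored in the buffer
    a = torch.tanh(pi.rsample())                    # reparameterized, squashed action sample
    q = model.Q(z, a)                               # critic evaluation of the policy action
    entropy = pi.entropy().sum(-1)                  # entropy of the (pre-squash) Gaussian
    kl = kl_divergence(beta, pi).sum(-1)            # penalize divergence from the planner distribution
    return (-q - alpha * entropy + lam * kl).mean()
```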
5. Training Protocol and Hyperparameterization
TD-M(PC) is trained by alternating model/policy updates with online environment interaction (a loop sketch follows this list):
- Data collection: At each step, encode the observation, compute an action via MPC planning (initialized with proposals from the policy prior $p$), step the environment, and store the transition together with the planner's action distribution in the buffer.
- Model/value updates: Every $k$ environment steps, sample an $H$-step batch, compute bootstrapped targets, minimize the joint loss over the dynamics, reward, and critic heads, and update the target critics via Polyak averaging.
- Policy update: Minimize the actor loss $\mathcal{L}_{\pi}$, including both the entropy and KL penalty terms, using the Adam optimizer with batch size $256$ and gradient norm clip $20$.
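The interaction/update loop can be summarized as in the sketch below; the `agent` interface (plan, buffer, and update methods) and the Gym-style environment API are assumptions for illustration.

```python
def train(env, agent, total_steps: int = 1_000_000, horizon: int = 3,
          update_every: int = 1) -> None:
    """High-level TD-M(PC)-style training loop (illustrative)."""
    obs = env.reset()
    for step in range(total_steps):
        # 1) Data collection: plan in latent space, store the transition together
        #    with the planner's action distribution.
        action, planner_dist = agent.plan(obs)
        next_obs, reward, done, _ = env.step(action)
        agent.buffer.add(obs, action, reward, next_obs, done, planner_dist)
        obs = env.reset() if done else next_obs

        if step % update_every == 0 and agent.buffer.ready():
            # 2) Model/value update on an H-step batch; Polyak-average the target critics.
            batch = agent.buffer.sample(horizon=horizon)
            agent.update_world_model(batch)      # consistency + reward + value losses
            agent.update_target_critics()        # Polyak averaging

            # 3) Policy update: entropy- and KL-regularized actor loss.
            agent.update_actor(batch)
```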
Canonical hyperparameters include: a short planning horizon $H$, $6$ MPPI iterations with $512$ samples ($64$ elites), a large replay buffer, a critic ensemble with $101$-bin categorical heads over a fixed value support, and a KL-penalty weight $\lambda$ ramped in as value quality improves (Lin et al., 5 Feb 2025). Architecturally, the encoder is an MLP, and the policy/value heads use $512$-unit Mish + LayerNorm networks.
6. Empirical Performance and Comparative Analysis
TD-M(PC) demonstrates significant improvement over baseline TD-MPC2 and related methods across high-degree-of-freedom tasks, such as 61-DoF simulated humanoid control. The addition of the KL constraint term in actor updates eliminates persistent value overestimation, particularly in areas of state space only visited by the planner. Empirically, performance gains appear most pronounced in scenarios where extrapolation errors otherwise dominate, highlighting the benefit of explicit OOD penalty in actor optimization (Lin et al., 5 Feb 2025).
When extended further (e.g., TD-GRPC (Nguyen et al., 19 May 2025)), group-based policy constraints and trust-region KL penalties can enhance stability and performance in highly unstable or distribution-shift-prone settings, such as humanoid locomotion.
7. Related Methods and Evolution
The TD-MPC family has evolved through:
- Original TD-MPC: trajectory optimization in latent space using an implicit world model, with value learning via TD backups (Hansen et al., 2023, Kuzmenko et al., 2 Jul 2025).
- TD-MPC2: architectural improvements (SimNorm, larger ensembles, robust categorical value heads), large-scale multitask capabilities (Hansen et al., 2023).
- Policy-constrained variants: TD-M(PC) (minimal KL actor regularization to address policy mismatch) (Lin et al., 5 Feb 2025), and TD-GRPC (group-relative weighted actor and strict trust regions) (Nguyen et al., 19 May 2025).
- Extensions to hierarchical RL (IQL-TD-MPC) and knowledge distillation (TD-MPC-Opt) (Chitnis et al., 2023, Kuzmenko et al., 2 Jul 2025).
A plausible implication is that the KL-regularized actor update introduced in TD-M(PC) will become a standard component for achieving consistent value estimation and sample efficiency in off-policy, model-based MPC agents operating at scale. This regularization aligns the policy-learning distribution with the data-collection policy, bounding extrapolation errors and improving both learning stability and asymptotic performance.