
TD-MPC Agent: Data-Efficient Model-Based RL

Updated 3 February 2026
  • TD-MPC Agent is a model-based reinforcement learning method that leverages an implicit latent-space world model to integrate MPC planning with temporal difference value learning.
  • It employs a decoder-free architecture with stochastic dynamics, reward prediction, and policy priors to mitigate policy mismatch and out-of-distribution estimation errors.
  • TD-MPC algorithms have demonstrated scalability and improved sample efficiency in complex continuous control tasks across simulated and robotic high-dimensional environments.

TD-MPC Agent refers to a class of model-based reinforcement learning (MBRL) algorithms that integrate online trajectory optimization, i.e., model-predictive control (MPC), in the latent space of a learned world model with temporal difference (TD) value learning to produce highly data-efficient agents for continuous control. The TD-MPC framework and its modern descendants, such as TD-MPC2 and TD-M(PC)$^2$, operate by iteratively updating both a stochastic world model and a value-predictive policy in a learned embedding space, enabling fast planning and improved robustness to high-dimensional dynamics. Recent work emphasizes the importance of controlling policy mismatch and out-of-distribution errors in the value estimation process, introducing explicit policy constraints to regularize actor updates. TD-MPC algorithms have been empirically validated to scale to tens of millions of parameters and excel at both single-task and multi-task continuous-control domains, including high-DoF simulated or robotic environments (Hansen et al., 2023; Lin et al., 5 Feb 2025).

1. Architectural Foundations

At the core of TD-MPC is an implicit, decoder-free world model trained entirely in a learned latent space. This model consists of:

  • State encoder $h_e$: maps the observed state $s_t$ to a latent vector $z_t = h_e(s_t) \in \mathbb{R}^n$.
  • Latent dynamics $d_\psi$: predicts $z_{t+1} = d_\psi(z_t, a_t)$ for action $a_t$.
  • Reward model $R_\theta$: outputs the predicted reward $R_\theta(z_t, a_t)$.
  • Action-value ensemble $Q_\phi$: approximates $Q^\pi(z_t, a_t)$, typically as an ensemble of $K$ categorical critics.
  • Policy-prior network $\pi_\omega$: a diagonal Gaussian $\pi_\omega(a \mid z_t)$ used in value bootstrapping and as a proposal distribution in MPC planning.

The architecture relies on Simplicial Normalization (SimNorm) in latent layers to control norm explosion, LayerNorm + Mish activations for stability, and a shared latent space for state, transition, and reward/value functions (Hansen et al., 2023; Lin et al., 5 Feb 2025).
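The following is a minimal PyTorch-style sketch of these components, assuming an illustrative latent dimension of 128, a SimNorm group size of 8, and discretized reward/value heads; the specific sizes and head parameterizations are assumptions rather than the papers' exact settings.

```python
# Sketch of the implicit, decoder-free TD-MPC-style world model; sizes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimNorm(nn.Module):
    """Simplicial normalization: softmax over fixed-size groups of latent units."""
    def __init__(self, group_size: int = 8):
        super().__init__()
        self.group_size = group_size

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        *batch, d = x.shape
        x = x.view(*batch, d // self.group_size, self.group_size)
        return F.softmax(x, dim=-1).view(*batch, d)

def mlp(in_dim, hidden, out_dim, out_act=None):
    """Small MLP with LayerNorm + Mish, as described in the text."""
    layers = [nn.Linear(in_dim, hidden), nn.LayerNorm(hidden), nn.Mish(),
              nn.Linear(hidden, out_dim)]
    if out_act is not None:
        layers.append(out_act)
    return nn.Sequential(*layers)

class WorldModel(nn.Module):
    def __init__(self, obs_dim, act_dim, latent_dim=128, hidden=256, num_q=5, num_bins=101):
        super().__init__()
        self.encoder = mlp(obs_dim, hidden, latent_dim, SimNorm())                # h_e
        self.dynamics = mlp(latent_dim + act_dim, hidden, latent_dim, SimNorm())  # d_psi
        self.reward = mlp(latent_dim + act_dim, hidden, num_bins)                 # R_theta (categorical)
        self.qs = nn.ModuleList(                                                  # Q_phi ensemble
            [mlp(latent_dim + act_dim, hidden, num_bins) for _ in range(num_q)])
        self.policy = mlp(latent_dim, hidden, 2 * act_dim)                        # pi_omega (mean, log_std)

    def encode(self, obs):
        return self.encoder(obs)

    def next(self, z, a):
        return self.dynamics(torch.cat([z, a], dim=-1))
```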

2. Model-Predictive Planning in Latent Space

At each environment interaction step, the agent solves a short-horizon trajectory optimization (typically $H = 3$) entirely in latent space to select the next action:

$$J(\tau) = \sum_{i=0}^{H-1} \gamma^i R_\theta(z_i, a_i) + \gamma^H V_\phi(z_H)$$

subject to $z_0 = z_t$ and $z_{i+1} = d_\psi(z_i, a_i)$. This is achieved using a sample-based optimizer such as Model Predictive Path Integral (MPPI) control or the Cross-Entropy Method (CEM): candidate action sequences are proposed (with a subset drawn from $\pi_\omega$), unrolled in the latent world model, and weighted by their estimated return. Elite sequences are used to update the trajectory distribution, and the first action of the best plan is executed (Hansen et al., 2023; Lin et al., 5 Feb 2025).

Bootstrapping the terminal value with $V_\phi(z_H) = \mathbb{E}_{a \sim \pi_\omega(\cdot \mid z_H)}[Q_\phi(z_H, a)]$ ensures that planning incorporates expected future value beyond the planning horizon.
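A schematic of this planning loop, written as a standalone function, is sketched below. It assumes callables `next_fn`, `reward_fn`, and `value_fn` that wrap the latent dynamics, scalar reward estimate, and terminal value estimate; the noise floor, softmax temperature, and the omission of policy-prior proposals are simplifications relative to the full MPPI procedure.

```python
# Illustrative MPPI-style planner over a learned latent model; not the reference implementation.
import torch

@torch.no_grad()
def plan_latent(next_fn, reward_fn, value_fn, z0, act_dim,
                horizon=3, num_samples=512, num_elites=64,
                num_iters=6, gamma=0.99, temperature=0.5):
    """Returns the first action of the optimized latent-space plan."""
    mean = torch.zeros(horizon, act_dim)
    std = torch.ones(horizon, act_dim)
    for _ in range(num_iters):
        # Sample candidate action sequences around the current distribution.
        noise = torch.randn(horizon, num_samples, act_dim)
        actions = (mean.unsqueeze(1) + std.unsqueeze(1) * noise).clamp(-1.0, 1.0)
        # Roll out each candidate in latent space and accumulate the discounted return.
        z = z0.unsqueeze(0).expand(num_samples, -1)
        returns = torch.zeros(num_samples)
        discount = 1.0
        for t in range(horizon):
            returns = returns + discount * reward_fn(z, actions[t])   # R_theta(z_t, a_t)
            z = next_fn(z, actions[t])                                # z_{t+1} = d_psi(z_t, a_t)
            discount *= gamma
        returns = returns + discount * value_fn(z)                    # gamma^H * V_phi(z_H)
        # Re-fit the sampling distribution to the elite sequences (MPPI weighting).
        elite_idx = returns.topk(num_elites).indices
        weights = torch.softmax(returns[elite_idx] / temperature, dim=0)
        elite = actions[:, elite_idx]                                 # (H, E, act_dim)
        mean = (weights[None, :, None] * elite).sum(dim=1)
        var = (weights[None, :, None] * (elite - mean.unsqueeze(1)) ** 2).sum(dim=1)
        std = var.sqrt().clamp_min(0.05)
    return mean[0]  # execute only the first action; re-plan at the next step
```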

3. Temporal Difference Value Learning

TD-MPC algorithms train their critic ensembles via TD bootstrapping using both real and model rollouts:

$$\mathcal{T}^{\pi_\omega} Q(z, a) = R_\theta(z, a) + \gamma\, \mathbb{E}_{z' \sim d_\psi(z, a),\, a' \sim \pi_\omega(\cdot \mid z')}\big[ Q_\phi(z', a') \big]$$

The supervised loss for $Q_\phi$ adopts categorical or quantile target regression, formulated as a cross-entropy between the predicted and bootstrapped target distributions. Critic targets are computed via ensemble averaging, often with the "min of two" trick to mitigate overestimation.
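A scalar version of this bootstrapped target might look as follows; the categorical cross-entropy regression is elided, and taking the minimum over two randomly chosen ensemble members (rather than averaging all heads) is shown as one common variant, not the papers' exact rule.

```python
# Sketch of the bootstrapped TD target; interfaces for the policy prior and Q heads are assumed.
import torch

@torch.no_grad()
def td_target(reward, z_next, sample_action_fn, q_heads, gamma=0.99):
    """reward: (B,); z_next: (B, latent); q_heads: list of callables (z, a) -> (B,)."""
    a_next = sample_action_fn(z_next)                              # a' ~ pi_omega(.|z')
    q_all = torch.stack([q(z_next, a_next) for q in q_heads])      # (K, B)
    # "Min of two" over randomly chosen ensemble members to curb overestimation;
    # averaging all K heads is the other common choice mentioned in the text.
    idx = torch.randperm(q_all.shape[0])[:2]
    q_next = torch.minimum(q_all[idx[0]], q_all[idx[1]])
    return reward + gamma * q_next
```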

The world model optimizes a summed latent-consistency (prediction) loss, reward loss, and value loss weighted by coefficients $c_d$, $c_r$, and $c_q$, respectively (Hansen et al., 2023; Lin et al., 5 Feb 2025). Model and value losses are aggregated over $H$-step rollouts drawn from the replay buffer.
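A sketch of this aggregated objective is below, assuming hypothetical `reward_value` and `q_value` helpers that return scalar predictions; MSE terms stand in for the categorical cross-entropy losses described above, and the coefficients and per-step discount `rho` are illustrative defaults.

```python
# Sketch of the H-step model objective (consistency + reward + value), not the exact reference loss.
import torch
import torch.nn.functional as F

def model_loss(model, obs, actions, rewards, td_targets,
               c_d=20.0, c_r=0.1, c_q=0.1, rho=0.5):
    """obs: (H+1, B, obs_dim); actions: (H, B, act_dim); rewards, td_targets: (H, B)."""
    z = model.encode(obs[0])
    total = 0.0
    for t in range(actions.shape[0]):
        z_target = model.encode(obs[t + 1]).detach()         # latent-consistency target
        z_pred = model.next(z, actions[t])
        r_pred = model.reward_value(z, actions[t])           # assumed scalar reward helper
        q_pred = model.q_value(z, actions[t])                # assumed scalar Q helper
        step_loss = (c_d * F.mse_loss(z_pred, z_target)
                     + c_r * F.mse_loss(r_pred, rewards[t])
                     + c_q * F.mse_loss(q_pred, td_targets[t]))
        total = total + (rho ** t) * step_loss               # discount later rollout steps
        z = z_pred                                           # propagate the predicted latent
    return total
```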

4. Policy Mismatch and KL-Regularized Actor Updates

A principal challenge in TD-MPC is policy mismatch: the planner ($\pi_H$) generating buffer experience may systematically differ from the policy prior ($\pi_\omega$) used in value learning. As a consequence, value targets are often evaluated on out-of-distribution (OOD) state-action pairs, leading to persistent value overestimation and error accumulation. Theoretical analysis shows the performance gap $\delta_k = \| V^{\pi_{H,k}} - V^{\pi_\omega,k} \|_\infty$ grows with both model errors and planner-prior divergence, unless sufficiently corrected (Lin et al., 5 Feb 2025).

TD-M(PC)$^2$ introduces KL-divergence regularization between the learned policy $\pi$ and the buffer's mixture of planner policies $\mu$, enforcing

$$D_{\mathrm{KL}}\big[\pi(\cdot \mid z) \,\|\, \mu(\cdot \mid z)\big] \le \epsilon$$

within policy improvement. The actor loss becomes

$$\mathcal{L}_\pi = -\,\mathbb{E}_{z \sim \mathcal{B},\, a \sim \pi}\Big[Q_\phi(z, a) - \alpha \log \pi(a \mid z) + \beta \log \mu(a \mid z)\Big]$$

where $\alpha$ controls the entropy regularization and $\beta$ the KL penalty. This suppresses OOD action queries, stabilizing value learning, and can be introduced smoothly via a curriculum on $\beta$ (Lin et al., 5 Feb 2025).
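The actor loss above translates fairly directly into code. The sketch below assumes the policy returns a reparameterizable `torch.distributions` object and that `log_mu_fn` evaluates the log-density of the buffered planner mixture; both interfaces are assumptions for illustration.

```python
# Sketch of the KL-regularized actor objective (negated so it can be minimized by gradient descent).
import torch

def actor_loss(z, policy, q_fn, log_mu_fn, alpha=1e-4, beta=0.1):
    """z: (B, latent) latents sampled from the replay buffer."""
    dist = policy(z)
    a = dist.rsample()                        # reparameterized sample, a ~ pi(.|z)
    log_pi = dist.log_prob(a).sum(-1)         # log pi(a|z)
    log_mu = log_mu_fn(z, a)                  # log mu(a|z), from stored planner distributions
    q = q_fn(z, a)                            # Q_phi(z, a)
    # Maximize Q with an entropy bonus (alpha) and a penalty (beta) that keeps
    # pi close to the data-collection mixture mu, suppressing OOD action queries.
    return -(q - alpha * log_pi + beta * log_mu).mean()
```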

5. Training Protocol and Hyperparameterization

TD-M(PC)$^2$ is trained by alternating model/policy updates with online environment interaction (a skeletal loop is sketched after the list):

  • Data collection: at each step, encode the observation, compute an action via MPC planning (using $\pi_\omega$ as the proposal/initialization), step the environment, and store the transition along with the planner's action distribution in the replay buffer.
  • Model/value updates: every $N$ steps, sample an $H$-step batch, compute bootstrapped targets, minimize $\mathcal{L}_\text{model}$ over the dynamics, reward, and critic heads, and update target critics via Polyak averaging.
  • Policy update: minimize $\mathcal{L}_\pi$, including both the entropy and KL penalties, using the Adam optimizer with learning rate $3 \times 10^{-4}$, batch size $256$, and gradient-norm clipping at $20$.
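The skeletal loop referenced above could look like the following; the environment, buffer, planner, and update-function interfaces are assumptions for illustration only.

```python
# Skeletal training loop following the protocol above; all interfaces are assumed.
def train(env, model, buffer, planner, update_model_and_value, update_policy,
          total_steps=1_000_000, update_every=1):
    obs = env.reset()
    for step in range(total_steps):
        # --- Data collection: plan in latent space, act, store the transition. ---
        z = model.encode(obs)
        action, planner_dist = planner(z)            # MPC with pi_omega as proposal
        next_obs, reward, done = env.step(action)    # assumed (obs, reward, done) interface
        buffer.add(obs, action, reward, next_obs, planner_dist)
        obs = env.reset() if done else next_obs
        # --- Model/value and policy updates from H-step replay batches. ---
        if step % update_every == 0 and len(buffer) > 1_000:
            batch = buffer.sample_sequences(horizon=3, batch_size=256)
            update_model_and_value(model, batch)     # dynamics, reward, critics + Polyak targets
            update_policy(model, batch)              # actor loss with entropy and KL penalties
    return model
```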

Canonical hyperparameters include: $H = 3$, $6$ MPPI iterations, $512$ samples ($64$ elites), a replay buffer of $10^6$ transitions, an ensemble of $K = 5$ critics with $101$-bin categorical outputs on $[-10, 10]$, $\alpha \approx 10^{-4}$, and $\beta$ ramped in as value quality improves (Lin et al., 5 Feb 2025). Architecturally, the encoder is a $2 \times 256$-unit MLP, and the policy/value heads use $512$-unit Mish + LayerNorm networks.
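For reference, these canonical values can be collected into a single configuration object; the field names below are illustrative rather than the papers' own.

```python
# Canonical hyperparameters from the text, gathered into one (assumed) config object.
from dataclasses import dataclass

@dataclass
class TDMPCConfig:
    horizon: int = 3                 # planning horizon H
    mppi_iterations: int = 6
    num_samples: int = 512
    num_elites: int = 64
    buffer_size: int = 1_000_000
    num_q: int = 5                   # critic ensemble size K
    num_bins: int = 101              # categorical critic bins
    v_min: float = -10.0             # categorical support lower bound
    v_max: float = 10.0              # categorical support upper bound
    lr: float = 3e-4
    batch_size: int = 256
    grad_clip_norm: float = 20.0
    alpha: float = 1e-4              # entropy coefficient
    beta: float = 0.0                # KL coefficient, ramped in as value quality improves
```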

6. Empirical Performance and Comparative Analysis

TD-M(PC)$^2$ demonstrates significant improvement over baseline TD-MPC2 and related methods across high-degree-of-freedom tasks, such as 61-DoF simulated humanoid control. Adding the KL constraint term to the actor updates eliminates persistent value overestimation, particularly in regions of the state space visited only by the planner. Empirically, performance gains are most pronounced in scenarios where extrapolation errors would otherwise dominate, highlighting the benefit of an explicit OOD penalty in actor optimization (Lin et al., 5 Feb 2025).

When extended further (e.g., TD-GRPC (Nguyen et al., 19 May 2025)), group-based policy constraints and trust-region KL penalties can enhance stability and performance in highly unstable or distribution-shift-prone settings, such as humanoid locomotion.

The TD-MPC family has evolved from the original TD-MPC through TD-MPC2 (Hansen et al., 2023) and TD-M(PC)$^2$ (Lin et al., 5 Feb 2025), to extensions such as TD-GRPC (Nguyen et al., 19 May 2025).

A plausible implication is that the KL-regularized actor update introduced in TD-M(PC)$^2$ is now regarded as a required component for consistent value estimation and sample efficiency in off-policy, model-based MPC agents operating at scale. This regularization aligns the policy-learning distribution with the data-collection policy, bounding extrapolation errors and improving learning stability and asymptotic performance.
