
MAD-TD: Stabilizing Off-Policy TD Learning

Updated 20 April 2026
  • MAD-TD is a reinforcement learning methodology that integrates model-generated on-policy synthetic rollouts to stabilize high update-to-data temporal difference learning.
  • It builds a one-step world model using neural embeddings and tailored predictors for rewards and state transitions to bridge the coverage gap in unseen actions.
  • Empirical results show that incorporating as little as 5% model-generated transitions reduces Q overestimation and prevents divergence in deep RL control tasks.

Model-Augmented Data for Temporal Difference Learning (MAD-TD) is a reinforcement learning (RL) methodology designed to stabilize off-policy temporal-difference (TD) learning under high update-to-data (UTD) regimes by blending real and model-generated transitions. It addresses instability and Q-function overestimation issues that arise when function approximators are updated many times per environment sample—scenarios common in deep RL with replay buffers and modern control tasks. MAD-TD integrates a learned one-step world model to generate on-policy synthetic rollouts, mixing these with environment transitions during value updates to close the generalization gap to unseen actions and eliminate the need for parameter resets or ensembles (Voelcker et al., 2024).

1. Motivation and Problem Formulation

The core problem addressed by MAD-TD is the instability observed in high-UTD off-policy TD learning: updating neural value function approximators multiple times per collected transition can lead to misgeneralization, Q-function overestimation, and catastrophic divergence. In off-policy deep RL algorithms such as DDPG, TD3, and SAC, the value function is trained using transitions $(s, a, r, s')$ from a replay buffer $D_{\mathrm{env}}$. The UTD ratio $K$ denotes the number of critic gradient updates per new environment step. While higher $K$ can enhance sample efficiency, it creates a scenario where the target policy $\pi$ evolves much faster than the data in the buffer, resulting in TD updates computed on $(s', a')$ pairs (with $a' \sim \pi(\cdot \mid s')$) that may not be present in collected data. This breaks the “coverage” necessary for stable value learning, leading to overestimation and critical instability (Voelcker et al., 2024).

2. World Model Construction and Supervision

To counter the generalization gap, MAD-TD employs a one-step world model $p_\phi(s', r \mid s, a)$. Raw states $s$ are encoded via a neural embedding $\phi(s) \in \mathbb{R}^d$ with a SimNorm nonlinearity. The reward $r$ is predicted by a shallow MLP, while the next latent state $\phi(s')$ is predicted by a categorical/softmax output head. Model training minimizes the negative log-likelihood of observed transitions from $D_{\mathrm{env}}$:

$$\mathcal{L}_{\mathrm{model}} = -\mathbb{E}_{(s, a, r, s') \sim D_{\mathrm{env}}}\left[\log p_\phi(s', r \mid s, a)\right],$$

which decomposes into reward MSE and state prediction cross-entropy. An additional VAML-style penalty aligns the value predicted at the observed next state with the value predicted for model rollouts, further regularizing the learned environment dynamics (Voelcker et al., 2024).
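The decomposition of the model loss into a reward MSE term and a state-prediction cross-entropy term can be sketched in NumPy. This is a minimal illustration, not the authors' implementation: the function names `simnorm` and `model_loss`, the group size of 8, and the choice of the SimNorm-encoded real next state as the cross-entropy target are all assumptions made for the example, and the VAML-style penalty is omitted.

```python
import numpy as np

def simnorm(z, group_size=8):
    """Simplicial normalization: a softmax applied within fixed-size groups
    of the latent vector (group_size=8 is an assumed value)."""
    z = z.reshape(*z.shape[:-1], -1, group_size)
    z = z - z.max(axis=-1, keepdims=True)          # numerical stability
    p = np.exp(z)
    p = p / p.sum(axis=-1, keepdims=True)
    return p.reshape(*p.shape[:-2], -1)

def model_loss(pred_reward, reward, pred_next_logits, target_next_latent, group_size=8):
    """Reward MSE plus per-group cross-entropy on the next latent state.

    Assumption: target_next_latent is the SimNorm embedding of the real next
    state, so every group of it is a probability vector.
    """
    reward_mse = np.mean((pred_reward - reward) ** 2)
    batch = len(pred_next_logits)
    logits = pred_next_logits.reshape(batch, -1, group_size)
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    target = target_next_latent.reshape(batch, -1, group_size)
    cross_entropy = -np.mean(np.sum(target * log_probs, axis=-1))
    return reward_mse + cross_entropy, reward_mse, cross_entropy

# Toy usage with random stand-ins for encoder and predictor outputs.
rng = np.random.default_rng(0)
target_latent = simnorm(rng.normal(size=(4, 16)))   # embedding of real s'
pred_logits = rng.normal(size=(4, 16))              # output of the softmax head
total, mse, ce = model_loss(rng.normal(size=4), rng.normal(size=4),
                            pred_logits, target_latent)
print(total, mse, ce)
```

Because each group of the SimNorm latent sums to one, the latent can be treated as a set of categorical distributions, which is what makes a cross-entropy loss on the next-state head well defined.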

3. Model-Augmented Data Mixing in TD Learning

For each update, MAD-TD generates a batch by sampling real transitions from $D_{\mathrm{env}}$ and a small fraction of synthetic transitions from the model (as little as 5% of the batch). Model rollouts are constructed by sampling states $s$ from previously collected data, computing $a = \pi(s)$, and sampling $(s', r) \sim p_\phi(\cdot \mid s, a)$. The critic (value function) is trained to minimize the squared TD error over the combined batch of real and synthetic transitions, that is, the squared difference between the current value estimate and the bootstrapped target $r + \gamma V(s')$ with discount factor $\gamma$, or, for Q-functions, using the standard clipped double-Q target. This Dyna-style augmentation allows the critic to see transitions for current on-policy actions that have not yet been observed from the real environment, closing the coverage gap and improving stability (Voelcker et al., 2024).
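The batch construction described above can be sketched as follows. This is a hedged illustration rather than the paper's code: `make_mixed_batch`, `td_loss`, the dictionary-based buffer layout, the toy `policy` and `model` callables, and the batch size of 256 are hypothetical names and values chosen for the example. Terminal-state handling and the clipped double-Q variant are left out for brevity.

```python
import numpy as np

def make_mixed_batch(buffer, policy, model, batch_size=256, model_frac=0.05, rng=None):
    """Return a batch whose last rows are model-generated on-policy transitions."""
    rng = np.random.default_rng() if rng is None else rng
    n_model = int(round(model_frac * batch_size))
    n_real = batch_size - n_model
    idx = rng.integers(len(buffer["s"]), size=batch_size)
    real, start = idx[:n_real], idx[n_real:]

    s_m = buffer["s"][start]        # start states taken from replay data
    a_m = policy(s_m)               # current-policy action, not the stored one
    s2_m, r_m = model(s_m, a_m)     # one-step world-model prediction

    return {
        "s": np.concatenate([buffer["s"][real], s_m]),
        "a": np.concatenate([buffer["a"][real], a_m]),
        "r": np.concatenate([buffer["r"][real], r_m]),
        "s2": np.concatenate([buffer["s2"][real], s2_m]),
        "is_model": np.concatenate([np.zeros(n_real, bool), np.ones(n_model, bool)]),
    }

def td_loss(value_fn, target_value_fn, batch, gamma=0.99):
    """Mean squared TD error with a bootstrapped target r + gamma * V_target(s')."""
    target = batch["r"] + gamma * target_value_fn(batch["s2"])
    return np.mean((value_fn(batch["s"]) - target) ** 2)

# Toy usage: random replay data, a hypothetical policy and one-step model.
rng = np.random.default_rng(0)
N = 1000
buffer = {
    "s": rng.normal(size=(N, 3)),
    "a": rng.uniform(-1, 1, size=(N, 1)),
    "r": rng.normal(size=N),
    "s2": rng.normal(size=(N, 3)),
}
policy = lambda s: np.tanh(s.sum(axis=1, keepdims=True))
model = lambda s, a: (s + 0.1 * a, -np.sum(s ** 2, axis=1))
value = lambda s: s.sum(axis=1)

batch = make_mixed_batch(buffer, policy, model, rng=rng)
print(int(batch["is_model"].sum()), td_loss(value, value, batch))
```

The key design point is that the synthetic rows use `policy(s)` rather than the action stored in the buffer, so the critic is trained on exactly the state-action pairs its bootstrapped targets will query.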

4. Algorithmic Workflow and Pseudocode

The MAD-TD workflow is as follows:

  • Initialize the replay buffer $D_{\mathrm{env}}$, world model $p_\phi$, value network, policy network $\pi$, and target networks.
  • For each environment time step:
    • Collect an $(s, a, r, s')$ transition and append it to $D_{\mathrm{env}}$.
    • For $K$ critic updates (where $K$ is the UTD ratio):
      • Sample real transitions from $D_{\mathrm{env}}$.
      • Generate model transitions: for each sampled state $s$, compute $a = \pi(s)$ and sample $(s', r) \sim p_\phi(\cdot \mid s, a)$.
      • Form the combined batch of real and model transitions.
      • For each transition in the batch, compute the TD target $r + \gamma V(s')$ (or its Q-value variant) and minimize the squared TD error.
    • Update the actor via policy gradient on the critic.
    • Perform the model update by minimizing the world-model training loss.
    • Execute soft target network updates as required (Voelcker et al., 2024).

This regime maintains a high UTD ratio while stabilizing value learning.
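The ordering of the steps in this workflow can be made concrete with a control-flow skeleton. It is only a scheduling sketch: `RecordingAgent` is a hypothetical stand-in whose methods log calls instead of training networks, `run_mad_td` is an illustrative name, and placing the actor, model, and target updates once per environment step (outside the inner critic loop) is an assumption, not something the source confirms.

```python
from dataclasses import dataclass, field

@dataclass
class RecordingAgent:
    """Hypothetical stand-in that only logs which update was requested."""
    buffer: list = field(default_factory=list)
    log: list = field(default_factory=list)

    def sample_real(self, n):
        return [("real", i) for i in range(n)]

    def sample_model(self, n):
        # In MAD-TD these would be (s, pi(s), r, s') tuples from the world model.
        return [("model", i) for i in range(n)]

    def update_critic(self, batch):
        self.log.append(("critic", len(batch)))

    def update_actor(self):
        self.log.append(("actor",))

    def update_model(self):
        self.log.append(("model",))

    def update_targets(self):
        self.log.append(("targets",))


def run_mad_td(agent, env_step, num_env_steps, utd_ratio=8,
               batch_size=256, model_frac=0.05):
    n_model = int(round(model_frac * batch_size))
    n_real = batch_size - n_model
    for _ in range(num_env_steps):
        agent.buffer.append(env_step())      # collect (s, a, r, s')
        for _ in range(utd_ratio):           # K critic updates per env step
            batch = agent.sample_real(n_real) + agent.sample_model(n_model)
            agent.update_critic(batch)
        agent.update_actor()                 # assumed: once per env step
        agent.update_model()
        agent.update_targets()
    return agent


agent = run_mad_td(RecordingAgent(), env_step=lambda: (0.0, 0.0, 0.0, 0.0),
                   num_env_steps=3)
critic_updates = sum(1 for entry in agent.log if entry[0] == "critic")
print(critic_updates)  # 3 environment steps x UTD ratio 8 = 24
```

Swapping the stub methods for real network updates would leave the loop itself unchanged, which is the point of separating the schedule from the learning components.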

5. Theoretical Underpinnings of Stability

The necessity for coverage of on-policy state-action pairs is rooted in classical TD learning theory. Given a parametric value function, TD learning converges stably only if the data distribution matches the occupancy measure of the target policy. The “key matrix” of the expected TD update is positive definite when the data-collecting policies coincide with the target policy, but may acquire negative eigenvalues as policy drift increases, precipitating divergence. The accompanying “remainder” term is zero only under perfect coverage; otherwise, unseen actions break positivity. A single on-policy occupancy measure ensures positive definiteness and stable TD. By supplementing each batch with even a small amount of on-policy, model-generated data, MAD-TD restores the requisite positive definiteness, directly repairing the main cause of instability at high UTD (Voelcker et al., 2024).
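A small numerical check illustrates how the sign of the key matrix depends on the data distribution. The sketch uses the classical linear-TD form $A = \Phi^\top D (I - \gamma P_\pi) \Phi$ over states, which is an assumption standing in for the paper's state-action formulation, and a standard two-state toy problem; the function names and all numbers are illustrative. In this toy the on-policy share needed to restore positivity is one third, so it demonstrates the mechanism only and does not reproduce the 5% figure reported for deep RL tasks.

```python
import numpy as np

def key_matrix(features, state_dist, transition, gamma):
    """Classical linear-TD key matrix A = Phi^T D (I - gamma * P_pi) Phi."""
    D = np.diag(state_dist)
    n = len(state_dist)
    return features.T @ D @ (np.eye(n) - gamma * transition) @ features

def min_sym_eig(A):
    """Smallest eigenvalue of the symmetric part; positive means A is positive definite."""
    return np.linalg.eigvalsh((A + A.T) / 2).min()

# Two-state toy problem: one feature per state, target policy always moves to state 2.
Phi = np.array([[1.0], [2.0]])
P_pi = np.array([[0.0, 1.0],
                 [0.0, 1.0]])
gamma = 0.9

d_off = np.array([0.5, 0.5])   # off-policy data: both states equally often
d_on = np.array([0.0, 1.0])    # on-policy stationary distribution

def mixed(alpha):
    """Data distribution with a fraction alpha of on-policy samples."""
    return (1 - alpha) * d_off + alpha * d_on

eig_off = min_sym_eig(key_matrix(Phi, d_off, P_pi, gamma))
eig_on = min_sym_eig(key_matrix(Phi, d_on, P_pi, gamma))
eig_mix = min_sym_eig(key_matrix(Phi, mixed(0.5), P_pi, gamma))
print(eig_off, eig_on, eig_mix)  # negative, positive, positive
```

Here the key matrix is a scalar equal to $-0.2 + 0.6\alpha$, so it changes sign at $\alpha = 1/3$: below that share of on-policy data the expected TD update pushes the weight away from its fixed point, above it the update is contractive.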

6. Empirical Results and Analysis

Experiments on challenging DeepMind Control Suite tasks—specifically, dog (walk, trot, run) and humanoid (stand, walk, run) environments—demonstrate several phenomena:

  • As UTD increases, standard TD3 suffers from markedly increased Q-overestimation, especially for unseen on-policy actions; by UTD=8–16, this causes critical instability.
  • Adding as little as 5% model-generated on-policy data reduces on-policy TD error to the held-out validation error level, eliminates initial Q-overshoot, and preserves stability up to UTD=16 and beyond.
  • Baselines such as BRONet (which requires periodic network resets to avoid collapse at high UTD) fail catastrophically without resets, whereas MAD-TD achieves equivalent or superior performance continuously, removing the need for resets.
  • With action repeat 2 and 2M steps, MAD-TD slightly outperforms both BRONet and TD-MPC2 on mean and interquartile-mean metrics; with 1M steps it outperforms both.
  • Ablation studies show that using random, rather than on-policy, actions in the model eliminates the stability gains; shrinking the model below 64 hidden units degrades performance below the model-free baseline, confirming the importance of model accuracy.
  • The Q-surface produced by MAD-TD is smoother and more robust to small perturbations of the action $a$, indicating mitigated value exploitation by the actor (Voelcker et al., 2024).

MAD-TD is a modern deep RL analog of Dyna-style methods, combining model-based and model-free updates by using synthetic rollouts to assist TD learning. The method is related to classical Dyna with imagined transitions, experience replay, and successor representation–based backups. Unlike prior work, MAD-TD focuses on the stabilization provided by even a small proportion of on-policy model data in high-UTD regimes. Notably, it avoids the need for network resets, critic ensembles, or complex regularizers. The empirical findings support the hypothesis that missing coverage of on-policy actions, a problem also noted in the analysis of classic off-policy TD divergence, is the central factor behind instability under function approximation, and that this coverage can be restored efficiently through simple model-augmented data mixtures (Voelcker et al., 2024; Pitis, 2019).


MAD-TD thus establishes a simple, theoretically sound, and empirically validated paradigm for leveraging model-generated on-policy data to stabilize deep TD learning in the presence of high update-to-data ratios, closing the value generalization gap and enabling robust, efficient learning in high-dimensional continuous-control environments.
