Off-Policy TD Methods Overview

Updated 9 April 2026

Off-policy TD methods are reinforcement learning algorithms that estimate target policy value functions using samples from an alternative behavior policy with importance sampling corrections.
They address the divergence issues inherent in bootstrapping and function approximation by employing gradient-based, emphatic, and penalty approaches.
Recent advances improve stability and convergence through dual-timescale updates, regularization techniques, and adaptive variance control mechanisms.

Off-policy temporal-difference (TD) methods comprise a class of algorithms designed to estimate value functions of a target policy using sample trajectories generated by a different “behavior” policy, especially in the presence of function approximation. Off-policy TD methods play an essential role in settings with limited exploration control, counterfactual policy evaluation, batch reinforcement learning, and parallel agent architectures. However, the combination of bootstrapping, function approximation, and off-policy sampling leads to substantial algorithmic challenges—most notably, the possibility of catastrophic divergence due to non-contractive projected fixed-point operators. This entry systematically details the mathematical principles, classes of methods, convergence guarantees, and recent developments in off-policy TD learning.

1. Mathematical Foundations and the Divergence Problem

Let $(\mathcal{S},\mathcal{A}, P, r, \gamma)$ denote a Markov reward process or MDP, with $\pi$ the target policy and $\mu$ the behavior policy. The central objective is to estimate the value function $v^\pi(s) = \mathbb{E}_\pi[\sum_{t\ge0}\gamma^t r_t\mid s_0=s]$, typically using a linear approximation $v(s;\theta) = \phi(s)^\top\theta$.

The core update for off-policy TD(0) with linear function approximation and importance sampling is

$\theta_{t+1} = \theta_t + \alpha_t \rho_t \left( r_t + \gamma \phi(s_{t+1})^\top \theta_t - \phi(s_t)^\top \theta_t \right) \phi(s_t)$

where $\rho_t = \pi(a_t|s_t)/\mu(a_t|s_t)$ is the importance sampling ratio and $\alpha_t$ is the step size.
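The update above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function name `off_policy_td0_step` and the toy feature vectors are assumptions made for the example.

```python
import numpy as np

def off_policy_td0_step(theta, phi_s, phi_next, reward, rho, alpha, gamma):
    """One importance-weighted TD(0) update for a linear value function."""
    # TD error: r + gamma * v(s') - v(s), with v(s) = phi(s)^T theta
    td_error = reward + gamma * (phi_next @ theta) - (phi_s @ theta)
    # The importance ratio rho scales the whole update
    return theta + alpha * rho * td_error * phi_s

# Toy transition (hypothetical numbers): two one-hot features
theta = np.zeros(2)
theta = off_policy_td0_step(
    theta,
    phi_s=np.array([1.0, 0.0]),
    phi_next=np.array([0.0, 1.0]),
    reward=1.0,
    rho=1.5,      # pi(a|s) / mu(a|s) for the sampled action
    alpha=0.1,
    gamma=0.9,
)
```

With zero initial weights the TD error equals the reward, so only the weight of the visited feature moves, by `alpha * rho * reward`.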

A fundamental pathology, the "deadly triad", arises because the projected Bellman operator can fail to be a contraction in the $\mu$-weighted norm when $\mu\neq\pi$ and function approximation is used. As a result, the mean update matrix $A = \Phi^\top D_\mu [I - \gamma P_\pi] \Phi$ is no longer guaranteed to be positive definite. This is demonstrated constructively in the Baird counterexample and the two-state $\theta \to 2\theta$ toy case, where standard off-policy TD(0) produces divergent iterates (Overmars et al., 29 Oct 2025, Diddigi et al., 2019, Lim et al., 2023, Ghiassian et al., 2018).
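The loss of positive definiteness can be checked numerically on a small two-state problem. The sketch below uses assumptions chosen for illustration (one feature with values 1 and 2, a target policy that always moves to the second state, uniform behavior state weighting, zero rewards); it is not taken from any of the cited papers.

```python
import numpy as np

gamma = 0.9
Phi = np.array([[1.0], [2.0]])        # v(s1) = theta, v(s2) = 2 * theta
P_pi = np.array([[0.0, 1.0],          # target policy always moves to state 2
                 [0.0, 1.0]])
D_mu = np.diag([0.5, 0.5])            # assumed uniform behavior state weighting

# Mean update matrix A = Phi^T D_mu (I - gamma P_pi) Phi
A = Phi.T @ D_mu @ (np.eye(2) - gamma * P_pi) @ Phi

# Expected TD(0) update with zero rewards: theta <- theta - alpha * A * theta
theta, alpha = 1.0, 0.1
history = [theta]
for _ in range(50):
    theta = theta - alpha * A[0, 0] * theta
    history.append(theta)

print("A =", A[0, 0])                 # negative for gamma > 5/6
print("theta after 50 expected updates:", history[-1])
```

Here $A = 2.5 - 3\gamma$, which is negative whenever $\gamma > 5/6$, so the expected update pushes the weight away from the solution $\theta = 0$ at every step.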

2. Classical and Modern Algorithmic Approaches

Four broad families of off-policy TD methods have been developed to address divergence:

2.1 Importance Sampling and Projection-Based Directions

Naïve importance-sampling reweighting of every TD update guarantees unbiasedness but incurs prohibitively high variance in the presence of products of IS ratios, and still does not resolve non-definiteness of the mean operator (Overmars et al., 29 Oct 2025).

2.2 Gradient-TD and Saddle-Point Methods

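Gradient-based TD algorithms (GTD, TDC, GTD2, etc.) recast value estimation as stochastic gradient descent on the mean-squared projected Bellman error (MSPBE), typically realized with two weight vectors and two timescales: a primary vector $\theta_t$ updated with step size $\alpha_t$ and an auxiliary vector $w_t$ updated with step size $\beta_t$, with $\alpha_t/\beta_t \to 0$ so that the auxiliary vector evolves on the faster timescale (Lim et al., 2023, Yu, 2017, Xu et al., 2019, Ghiassian et al., 2018).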
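As a rough illustration of the two-vector structure, the sketch below implements one common textbook form of the TDC update with linear features. The placement of the importance ratio varies between papers, so this should be read as one plausible variant; the function name `tdc_step` and the numbers are assumptions.

```python
import numpy as np

def tdc_step(theta, w, phi_s, phi_next, reward, rho, alpha, beta, gamma):
    """One TDC update: theta is the value weights, w the auxiliary weights."""
    delta = reward + gamma * (phi_next @ theta) - (phi_s @ theta)
    # Primary update: TD step plus a gradient-correction term using w
    theta_new = theta + alpha * rho * (delta * phi_s - gamma * (phi_s @ w) * phi_next)
    # Auxiliary update: w tracks the expected TD error given the features
    w_new = w + beta * rho * (delta - phi_s @ w) * phi_s
    return theta_new, w_new

theta, w = np.zeros(2), np.zeros(2)
theta, w = tdc_step(
    theta, w,
    phi_s=np.array([1.0, 0.0]),
    phi_next=np.array([0.0, 1.0]),
    reward=1.0, rho=1.0,
    alpha=0.01,   # slow timescale
    beta=0.1,     # fast timescale
    gamma=0.9,
)
```

When `w` is zero the correction term vanishes and the primary update coincides with importance-weighted TD(0), which is why the auxiliary vector is usually given the larger step size.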

These methods provably converge under standard assumptions but increase algorithmic complexity due to dual variables and step-size coordination.

2.3 Emphatic Temporal-Difference Learning

Emphatic TD(λ) algorithms introduce a dynamic follow-on trace $F_t$ and emphatic weighting $M_t$, shown here for unit interest in every state:

$F_t = \gamma \rho_{t-1} F_{t-1} + 1, \qquad M_t = \lambda + (1-\lambda) F_t$

This reweighting ensures convergence by restoring positive definiteness to the expected update operator under the follow-on distribution, even under strong off-policy mismatch (Sutton et al., 2015, Jiang et al., 2021, Ghiassian et al., 2021). Emphatic TD uses a single parameter vector and step-size.
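A single emphatic TD(λ) step can be sketched as follows, assuming constant $\gamma$ and $\lambda$ and a scalar interest that defaults to 1. The function name `etd_step` and its argument layout are hypothetical; the recursions follow the standard follow-on and emphasis definitions.

```python
import numpy as np

def etd_step(theta, e, F, rho_prev, rho, phi_s, phi_next, reward,
             alpha, gamma, lam, interest=1.0):
    """One emphatic TD(lambda) update with linear features."""
    F = gamma * rho_prev * F + interest           # follow-on trace
    M = lam * interest + (1.0 - lam) * F          # emphasis
    e = rho * (gamma * lam * e + M * phi_s)       # emphatic eligibility trace
    delta = reward + gamma * (phi_next @ theta) - (phi_s @ theta)
    theta = theta + alpha * delta * e
    return theta, e, F

# First step of an episode: F and e start at zero (assumed initialization)
theta, e, F = np.zeros(2), np.zeros(2), 0.0
theta, e, F = etd_step(
    theta, e, F,
    rho_prev=1.0, rho=2.0,
    phi_s=np.array([1.0, 0.0]),
    phi_next=np.array([0.0, 1.0]),
    reward=1.0, alpha=0.1, gamma=0.9, lam=0.5,
)
```

Only one weight vector and one step size appear, in contrast to the two-timescale gradient methods; the cost is that `F` accumulates products of importance ratios and can have high variance.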

2.4 Chaining, Penalized, and Alternative Convergent Methods

Recently, methods including chaining value functions (Schmitt et al., 2022), explicit penalty/ridge terms (Diddigi et al., 2019), and backstepping control-theoretic correction (Lim et al., 2023) have been proposed. These approaches either use hierarchical bootstrapping from stable on-policy solutions, add explicit regularization to ensure positive definiteness, or introduce feedback-stabilizing corrections, yielding provable convergence without dual vectors.

3. Convergence Guarantees and Structural Conditions

The convergence of off-policy TD algorithms hinges on strict conditions:

Classical off-policy TD(0) is guaranteed to converge only under restrictive structural constraints.
Reversibility: Overmars et al. (29 Oct 2025) establish that standard off-policy TD(0) converges under reversible Markov chain dynamics for both target and behavior policies, provided the discount factor $\gamma$ is bounded above by an explicit function of the transition-ratio perturbation constant (which arises from the ratio of transition probabilities under $\pi$ and $\mu$). The explicit bound is stated in the cited paper.

This is the first result giving almost-sure convergence of unmodified TD(0) for reversible chains, with zero projected Bellman error in the $\mu$-weighted norm.

Gradient-TD and Emphatic TD converge more generally, under ergodic Markov chains, bounded features, and suitable step-size schedules. GTD frameworks typically require two-timescale stochastic approximation and positive-definite expectation matrices, while emphatic TD relies on the follow-on trace restoring positive definiteness.
Blockwise and Constant-Step-Size Schedules: Two-time-scale TDC algorithms achieve non-asymptotic convergence as fast as $O(\log t / t^{2/3})$ under diminishing step sizes, and exponential convergence up to a fixed bias under constant step sizes (Xu et al., 2019).
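Penalization: Adding a regularization parameter $\eta > 0$, as in Diddigi et al. (2019), ensures positive definiteness of the update operator, at the cost of introducing a bias:

$\theta_{t+1} = \theta_t + \alpha_t \left[ \rho_t \left( r_t + \gamma \phi(s_{t+1})^\top \theta_t - \phi(s_t)^\top \theta_t \right) \phi(s_t) - \eta\, \theta_t \right]$

Convergence is guaranteed when $\eta$ exceeds a computable sufficient lower bound.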
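To illustrate how a penalty restores positive definiteness, the sketch below computes the smallest shift that makes the symmetric part of a matrix positive definite and shows the resulting change in the fixed point. The matrix `A`, the vector `b`, the name `eta`, and the helper `min_penalty` are assumptions for the example; the bound used here is a generic linear-algebra condition, not necessarily the one derived in the cited paper.

```python
import numpy as np

def min_penalty(A):
    """Smallest eta >= 0 such that x^T (A + eta I) x >= 0 for all x."""
    sym = 0.5 * (A + A.T)                    # only the symmetric part matters
    lam_min = np.linalg.eigvalsh(sym).min()
    return max(0.0, -lam_min)

# Hypothetical mean update matrix that is not positive definite
A = np.array([[1.0, -3.0],
              [0.0,  1.0]])
b = np.array([1.0, 1.0])

eta_min = min_penalty(A)
eta = eta_min + 0.1                          # any value above the threshold
A_pen = A + eta * np.eye(2)

theta_td = np.linalg.solve(A, b)             # unpenalized fixed point
theta_pen = np.linalg.solve(A_pen, b)        # penalized (biased) fixed point
print(eta_min, theta_td, theta_pen)
```

The penalized matrix is positive definite, so the expected update is stable, but its fixed point differs from the unpenalized one. That gap is the bias the penalty introduces.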

Table 1: Sufficient conditions for convergence of prominent off-policy TD algorithms

Algorithm	Convergence condition	Key parameter(s)
TD(0) (naïve)	Reversible chains, $\gamma$ below the perturbation-dependent bound	Perturbation constant
GTD/TDC	Positive-definite expectation matrices	Step sizes, coverage
Emphatic TD(λ)	Ergodic behavior chain, absolute continuity of $\pi$ w.r.t. $\mu$	$\beta$ (variance control)
Penalized TD(0)	Penalty large enough for $A + \eta I$ to be positive definite	Penalty parameter $\eta$

4. Variance Control and Importance Sampling Placement

Variance explosion due to products of importance sampling ratios is a major challenge in off-policy TD. Optimal IS placement—specifically, applying the ratio to the entire TD error rather than just the target—induces a control variate with known expectation, significantly reducing variance (Graves et al., 2022). Empirically, “full-error” IS correction consistently yields faster learning and broader step-size robustness than per-decision IS, and is algorithm-agnostic across the Gradient-TD, Emphatic-TD, Tree-Backup, and V-trace families.
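The effect of ratio placement can be checked by exact enumeration on a single state with two actions. All numbers below are hand-picked assumptions. Because $\mathbb{E}_\mu[\rho] = 1$, scaling the whole TD error and scaling only the target give the same expected update, but their variances can differ substantially.

```python
import numpy as np

gamma = 0.9
pi = np.array([0.9, 0.1])           # target policy at the current state
mu = np.array([0.5, 0.5])           # behavior policy at the current state
rho = pi / mu                       # importance ratios per action

v_s = 10.0                          # current estimate v(s)
rewards = np.array([1.0, 0.0])      # reward for each action
v_next = np.array([10.0, 9.0])      # estimate at the successor state
targets = rewards + gamma * v_next

full_error = rho * (targets - v_s)      # ratio applied to the whole TD error
target_only = rho * targets - v_s       # ratio applied to the target only

def mean_var(x, p):
    """Exact mean and variance of x under action probabilities p."""
    m = np.sum(p * x)
    return m, np.sum(p * (x - m) ** 2)

mean_full, var_full = mean_var(full_error, mu)
mean_tgt, var_tgt = mean_var(target_only, mu)
print(mean_full, var_full)
print(mean_tgt, var_tgt)
```

In this example the two placements agree in expectation while the full-error form has far lower variance, because the term $(1-\rho)\,v(s)$ separating them has zero mean under $\mu$. The size of the gap depends on the chosen numbers and is not a general guarantee.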

Adaptive capping (as in V-trace), history-dependent $\lambda$, and clipped emphatic traces further modulate variance, introducing bias–variance tradeoffs tailored to specific domains or stability requirements (Jiang et al., 2021, Yu et al., 2017).
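A minimal sketch of ratio capping, in the spirit of V-trace-style truncation, is given below. The helper name `clipped_ratio` and the cap value are assumptions. The example shows how the cap bounds the ratio and why it introduces bias: the expected clipped ratio under the behavior policy falls below 1.

```python
import numpy as np

def clipped_ratio(pi_prob, mu_prob, cap=1.0):
    """Importance ratio truncated from above at `cap`."""
    return np.minimum(pi_prob / mu_prob, cap)

pi = np.array([0.9, 0.1])
mu = np.array([0.5, 0.5])

rho = pi / mu                              # unclipped ratios
rho_bar = clipped_ratio(pi, mu, cap=1.0)   # capped ratios

mean_unclipped = np.sum(mu * rho)          # equals 1 for exact ratios
mean_clipped = np.sum(mu * rho_bar)        # below 1 once any ratio is capped
print(rho, rho_bar, mean_unclipped, mean_clipped)
```

Lower caps reduce variance further but move the effective target away from $\pi$ and toward the behavior policy, which is the bias–variance tradeoff described above.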

5. Empirical Performance and Practical Implications

Comprehensive empirical comparisons, e.g., on the Collision task (Ghiassian et al., 2021), yield the following consensus:

Emphatic TD(λ) and its variants (e.g., ETD(λ,β)) consistently form the top performance tier in learning speed, asymptotic error, and robustness to step-size or $\lambda$ parameters. These methods nearly eliminate the sensitivity to the bootstrapping parameter and outperform all alternatives under varying settings.
Gradient-TD (GTD, TDC, GTD2, HTD) and Off-Policy TD(λ): These constitute a middle tier, offering stability but showing increased sensitivity and higher bias as the bootstrapping parameter $\lambda$ varies. Blockwise diminishing step-sizes and regularization (e.g., TDRC) mitigate sensitivity and allow for faster convergence.
Capped-IS algorithms (V-trace, Tree-Backup, ABTD): Effective in bounding variance, at the cost of higher steady-state bias and a loss of accuracy in part of the parameter range. Preference for these methods should be restricted to extreme off-policy or high-variance regimes where classical estimators fail.

6. Algorithmic Innovations and Open Directions

Areas of ongoing research include:

Chaining methods: Constructing chains of value functions (each trained on-policy about the previous link) guarantees convergence arbitrarily close to the off-policy TD solution even where TD(0) diverges (Schmitt et al., 2022).
Sparse and regularized estimation: Convex-concave saddle-point and $\ell_1$-regularized TD methods (e.g., RO-TD) support explicit feature selection with linear per-step complexity, expanding the applicability of off-policy evaluation to high-dimensional problems (Liu et al., 2020).
Distributional and multi-agent extensions: Recent works yield finite-sample guarantees and near-optimal sample-complexity for decentralized off-policy TD with privacy and communication constraints in multi-agent systems (Chen et al., 2021).
Backstepping control-theoretic algorithms: Recursive Lyapunov-based synthesis of stabilizing controllers offers an algorithmic framework that unifies existing correction schemes and enables rigorous stability proofs and faster convergence (Lim et al., 2023).
Consistent projection and distribution correction: Algorithms such as COP-TD(λ,β) and Log-COP-TD(λ,β) address the projection-bias incurred by learning under the wrong state-distribution, achieving consistency with on-policy TD solutions (Hallak et al., 2017).

7. Summary of Key Theorems and Structural Results

Overmars et al. (2025): If the target and behavior chains are reversible with matching sparsity, and the discount factor $\gamma$ satisfies the bound determined by the perturbation constant, then off-policy TD(0) converges almost surely to the unique solution with vanishing projected Bellman error (Overmars et al., 29 Oct 2025).
Benveniste–Métivier–Priouret/Tsitsiklis–Van Roy Theorem: Under positive definiteness of $A$ (equivalently, negative definiteness of the mean update matrix $-A$), geometric mixing, and standard Robbins–Monro conditions, two-timescale stochastic approximation converges almost surely to the unique fixed point (Yu, 2017, Ghiassian et al., 2018, Overmars et al., 29 Oct 2025).
Emphatic-TD convergence: Under mild assumptions, the key matrix with emphatic weighting is always positive-definite; as a result, the method is robust to policy mismatch and to the choice of trace parameters such as $\lambda$ (Sutton et al., 2015, Jiang et al., 2021).
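The reversibility condition in the first result above can be tested numerically through detailed balance, $d_i P_{ij} = d_j P_{ji}$, where $d$ is the stationary distribution. The sketch below is a generic check under the assumption of an ergodic finite chain; the helper names and the two example matrices are hypothetical.

```python
import numpy as np

def stationary_distribution(P):
    """Stationary distribution of an ergodic transition matrix P."""
    evals, evecs = np.linalg.eig(P.T)
    idx = np.argmin(np.abs(evals - 1.0))   # eigenvalue closest to 1
    d = np.real(evecs[:, idx])
    return d / d.sum()                     # normalize (also fixes the sign)

def is_reversible(P, tol=1e-10):
    """Check detailed balance: d_i P_ij == d_j P_ji for all i, j."""
    d = stationary_distribution(P)
    flow = d[:, None] * P                  # probability flow i -> j
    return np.allclose(flow, flow.T, atol=tol)

# Birth-death chain: reversible
birth_death = np.array([[0.50, 0.50, 0.00],
                        [0.25, 0.50, 0.25],
                        [0.00, 0.50, 0.50]])

# Chain with a preferred direction of rotation: not reversible
cyclic = np.array([[0.0, 0.9, 0.1],
                   [0.1, 0.0, 0.9],
                   [0.9, 0.1, 0.0]])

print(is_reversible(birth_death), is_reversible(cyclic))
```

In an off-policy setting the check would be applied to the state transition matrices induced by both the target and the behavior policy.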

References

(Overmars et al., 29 Oct 2025) "Convergence of off-policy TD(0) with linear function approximation for reversible Markov chains"
(Lim et al., 2023) "Backstepping Temporal Difference Learning"
(Xu et al., 2019) "Two Time-scale Off-Policy TD Learning: Non-asymptotic Analysis over Markovian Samples"
(Graves et al., 2022) "Importance Sampling Placement in Off-Policy Temporal-Difference Methods"
(Sutton et al., 2015) "An Emphatic Approach to the Problem of Off-policy Temporal-Difference Learning"
(Jiang et al., 2021) "Emphatic Algorithms for Deep Reinforcement Learning"
(Ghiassian et al., 2018) "Online Off-policy Prediction"
(Ghiassian et al., 2021) "An Empirical Comparison of Off-policy Prediction Learning Algorithms on the Collision Task"
(Yu, 2017) "On Convergence of some Gradient-based Temporal-Differences Algorithms for Off-Policy Learning"
(Diddigi et al., 2019) "A Convergent Off-Policy Temporal Difference Algorithm"
(Hallak et al., 2017) "Consistent On-Line Off-Policy Evaluation"
(Schmitt et al., 2022) "Chaining Value Functions for Off-Policy Learning"
(Liu et al., 2020) "Regularized Off-Policy TD-Learning"
(Yu et al., 2017) "On Generalized Bellman Equations and Temporal-Difference Learning"

Off-policy TD methods remain central in reinforcement learning, with continued development driven by fundamental challenges of distributional shift, bootstrapping, and function approximation, and pragmatic needs for robust, sample-efficient, and scalable policy evaluation algorithms.