Finite-Time Bias Bounds in TD Learning
- The paper’s main contribution is establishing non-asymptotic bias bounds for TD learning with linear function approximators, linking bias evolution to sample size and mixing time.
- It details how different noise models, such as i.i.d. versus Markovian dependencies, affect convergence rates and demonstrates that projection steps can mitigate bias from correlated samples.
- The analysis extends to TD(λ) and Q-learning, highlighting the tradeoffs between approximation accuracy and computational efficiency through the use of eligibility traces and effective contraction parameters.
Finite-time bias guarantees for temporal-difference (TD) learning are a central subject in the theory of reinforcement learning: they quantify how quickly the TD parameter estimates, or the value-function approximations they induce, converge towards the algorithm's fixed point, and, crucially, how their bias evolves as a function of sample size, Markovian dependence, step-size, and algorithmic design. A mathematically explicit finite-time bias bound provides non-asymptotic control over the estimation error, capturing the tradeoffs between statistical efficiency and computational convergence in linear TD, its extensions (TD(λ), Q-learning), and, increasingly, nonlinear and distributed settings.
1. Finite-Time Error Bound for Linear TD Learning
The fundamental setup involves estimating the value function for a fixed policy in a Markov reward process. Instead of a tabular representation, which is impractical for large state spaces, a linear function approximator is used:

$$V_\theta(s) = \theta^\top \phi(s),$$

where $\phi(s) \in \mathbb{R}^d$ denotes the feature vector of state $s$ and $\theta \in \mathbb{R}^d$ the unknown parameter vector.
The online TD(0) update adopts a pseudo-gradient form:

$$\theta_{t+1} = \theta_t + \alpha_t\, g_t(\theta_t), \qquad g_t(\theta) = \big(r_t + \gamma\, \theta^\top \phi(s_{t+1}) - \theta^\top \phi(s_t)\big)\, \phi(s_t).$$

Though not a true stochastic gradient, the expectation (under stationarity) exhibits a crucial “gradient-like” property:

$$(\theta^* - \theta)^\top\, \bar g(\theta) \;\ge\; (1-\gamma)\, \big\|V_{\theta^*} - V_\theta\big\|_D^2,$$

where $\bar g(\theta) = \mathbb{E}[g_t(\theta)]$, $\|v\|_D^2 = v^\top D v$, and $D$ is the diagonal matrix of the stationary distribution.
Through this property, the expected error recursion for the iterates becomes (after taking expectations and using standard decomposition techniques):

$$\mathbb{E}\,\|\theta_{t+1} - \theta^*\|_2^2 \;\le\; \mathbb{E}\,\|\theta_t - \theta^*\|_2^2 \;-\; 2\alpha_t(1-\gamma)\, \mathbb{E}\,\big\|V_{\theta_t} - V_{\theta^*}\big\|_D^2 \;+\; \alpha_t^2\, \mathbb{E}\,\|g_t(\theta_t)\|_2^2,$$

which enables explicit finite-time bounds on the parameter and value-function errors.
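For concreteness, here is a minimal sketch of the TD(0) pseudo-gradient step with linear features; the function and variable names are illustrative and not taken from the source analysis.

```python
import numpy as np

def td0_update(theta, phi_s, phi_s_next, reward, gamma, alpha):
    """One TD(0) pseudo-gradient step with linear features V_theta(s) = theta @ phi(s)."""
    # TD error: r_t + gamma * V_theta(s_{t+1}) - V_theta(s_t)
    td_error = reward + gamma * theta @ phi_s_next - theta @ phi_s
    # Pseudo-gradient direction g_t(theta) = td_error * phi(s_t)
    return theta + alpha * td_error * phi_s
```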
Under an i.i.d. sampling model and using the averaged iterate $\bar\theta_T = \tfrac{1}{T}\sum_{t=1}^{T}\theta_t$ with constant step-size $\alpha = 1/\sqrt{T}$:

$$\mathbb{E}\,\big\|V_{\bar\theta_T} - V_{\theta^*}\big\|_D^2 \;\le\; \frac{\|\theta_0 - \theta^*\|_2^2 + \sigma^2}{2(1-\gamma)\sqrt{T}},$$

where $\sigma^2 = \mathbb{E}\,\|g_t(\theta^*)\|_2^2$ is the TD-update variance at the fixed point. Decaying step-sizes (e.g., $\alpha_t \propto 1/t$) can yield an $O(1/T)$ convergence rate.
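As an illustration of this step-size and averaging choice, the following sketch runs TD(0) with constant step-size $1/\sqrt{T}$ and returns the averaged iterate; the `sample_transition` oracle is a hypothetical stand-in for i.i.d. draws from the stationary distribution.

```python
import numpy as np

def averaged_td0(sample_transition, d, T, gamma):
    """TD(0) with constant step-size 1/sqrt(T) and iterate averaging (i.i.d. model)."""
    alpha = 1.0 / np.sqrt(T)          # constant step-size matched to the horizon
    theta = np.zeros(d)
    theta_bar = np.zeros(d)           # running average of the iterates
    for t in range(1, T + 1):
        phi_s, reward, phi_s_next = sample_transition()
        td_error = reward + gamma * theta @ phi_s_next - theta @ phi_s
        theta = theta + alpha * td_error * phi_s
        theta_bar += (theta - theta_bar) / t   # incremental averaging
    return theta_bar
```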
2. Bias Under Markovian Noise and the Mixing Time
In the more realistic Markov chain observation model, the data are correlated and an additional bias arises from dependencies between samples. To control the corresponding bias, an explicit projection step onto a ball of fixed radius $R$ is introduced:

$$\theta_{t+1} = \Pi_R\big(\theta_t + \alpha_t\, g_t(\theta_t)\big), \qquad \Pi_R(\theta) = \arg\min_{\|\theta'\|_2 \le R} \|\theta' - \theta\|_2.$$

This guarantees the iterates remain bounded. Under a geometric mixing condition, i.e., if the total variation distance between the chain's law at time $t$ and the stationary law is bounded by $m\rho^{t}$ for some $\rho < 1$, the extra bias term is regulated and the resulting finite-time bound is:

$$\mathbb{E}\,\big\|V_{\bar\theta_T} - V_{\theta^*}\big\|_D^2 \;\le\; \frac{\|\theta_0 - \theta^*\|_2^2 + O\!\big(G^2\, \tau^{\mathrm{mix}}(\alpha)\big)}{2(1-\gamma)\sqrt{T}},$$

where $G$ bounds the update norm ($\|g_t(\theta)\|_2 \le G$ for all $\|\theta\|_2 \le R$), and $\tau^{\mathrm{mix}}(\epsilon) = \min\{t : m\rho^{t} \le \epsilon\}$ is the mixing time at scale $\epsilon$ (here evaluated at the step-size, $\epsilon = \alpha$).
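A sketch of the projected update follows, assuming a Euclidean ball of radius $R$ as in the bound above; the function names are illustrative.

```python
import numpy as np

def project_onto_ball(theta, radius):
    """Euclidean projection onto {theta : ||theta||_2 <= radius}."""
    norm = np.linalg.norm(theta)
    return theta if norm <= radius else theta * (radius / norm)

def projected_td0_step(theta, phi_s, phi_s_next, reward, gamma, alpha, radius):
    """One projected TD(0) step; the projection keeps the iterate (and hence the
    update norm) bounded under Markovian sampling."""
    td_error = reward + gamma * theta @ phi_s_next - theta @ phi_s
    return project_onto_ball(theta + alpha * td_error * phi_s, radius)
```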
The leading implication is that, even under Markovian noise, the mean-squared bias relative to the fixed point is controlled, with the error scaling as $1/\sqrt{T}$ (or $1/T$ with decaying step-sizes) up to a multiplicative mixing-time term.
3. Extensions to TD(λ) and Eligibility Traces
The analysis generalizes to TD(λ) with eligibility traces, where the update direction at time $t$ involves an eligibility vector:

$$z_t = \sum_{k=0}^{t} (\gamma\lambda)^{t-k}\, \phi(s_k), \qquad g_t(\theta) = \big(r_t + \gamma\, \theta^\top \phi(s_{t+1}) - \theta^\top \phi(s_t)\big)\, z_t.$$

The expected update direction is governed by a projected λ-weighted Bellman operator $T^{(\lambda)}$, leading to an analogous “gradient-like” descent property with contraction factor:

$$\kappa = \frac{\gamma(1-\lambda)}{1-\gamma\lambda} \;<\; \gamma.$$

The overall finite-time bias bound thus resembles that for TD(0), with the discount factor $\gamma$ replaced by the effective contraction parameter $\kappa$, and with additional constants depending on $\lambda$.
Increasing $\lambda$ can reduce the asymptotic approximation error (since TD(λ) interpolates toward Monte Carlo evaluation) but may slow the numerical convergence rate, since the eligibility traces inflate the update variance. A minimal sketch of the trace-weighted update appears below.
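The following sketch implements one TD(λ) step with an accumulating eligibility trace, consistent with the update direction above (illustrative names, not the source's code).

```python
import numpy as np

def td_lambda_step(theta, z, phi_s, phi_s_next, reward, gamma, lam, alpha):
    """One TD(lambda) step: the eligibility vector z_t accumulates discounted
    features and multiplies the scalar TD error."""
    z = gamma * lam * z + phi_s                  # z_t = gamma*lam*z_{t-1} + phi(s_t)
    td_error = reward + gamma * theta @ phi_s_next - theta @ phi_s
    theta = theta + alpha * td_error * z         # trace-weighted update
    return theta, z
```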
4. High-Dimensional Q-Learning in Optimal Stopping
The methodology extends to Q-learning for optimal stopping, where the Bellman operator features a maximum between the stopping reward and the continuation value:

$$(FQ)(s) = \mathbb{E}\big[\max\{\, u(s_{t+1}),\; \gamma\, Q(s_{t+1})\,\} \,\big|\, s_t = s\big],$$

where $u$ denotes the payoff for stopping. A linear function approximation is again adopted:

$$Q_\theta(s) = \theta^\top \phi(s).$$
Analysis demonstrates that, under suitable richness of the function class, the projected Bellman operator retains a contraction property and the essential finite-time bias bounds established for TD(0) apply, guaranteeing that the policy derived from the approximate Q-function is near-optimal.
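Assuming the operator form above, one stochastic-approximation step for the stopping problem might look as follows (a sketch; the stopping payoff $u(s_{t+1})$ is assumed to be observed along the trajectory).

```python
import numpy as np

def stopping_q_step(theta, phi_s, phi_s_next, stop_reward_next, gamma, alpha):
    """One Q-learning step for optimal stopping with linear continuation value
    Q_theta(s) = theta @ phi(s); the target takes the max of stopping now and
    continuing with the discounted approximate continuation value."""
    q_next = theta @ phi_s_next                       # Q_theta(s_{t+1})
    target = max(stop_reward_next, gamma * q_next)    # max{u(s_{t+1}), gamma * Q(s_{t+1})}
    td_error = target - theta @ phi_s
    return theta + alpha * td_error * phi_s
```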
5. Comparative Insights and Methodological Import
This analysis demonstrates that, although TD learning does not perform true gradient descent, key expected-update properties enable finite-time error contractivity similar to SGD in convex (and, with regularization, strongly convex) settings.
Key comparative points:
| Noise Model | Step-Size / Averaging | Error Rate | Bias Control |
|---|---|---|---|
| i.i.d. | constant $\alpha = 1/\sqrt{T}$ with iterate averaging | $O(1/\sqrt{T})$, or $O(1/T)$ with decaying step-sizes | Direct variance control via $\sigma^2$ |
| Markov chain | constant or decaying, with projection | $O(\tau^{\mathrm{mix}}/\sqrt{T})$ / $O(\tau^{\mathrm{mix}}/T)$ | Bias scales with mixing time $\tau^{\mathrm{mix}}$ |
| TD(λ), Q-learning | analogously structured | For TD(λ), as above with $\gamma \to \kappa$ | Constants depend on $\lambda$ |
The finite-time bias in Markovian settings is fundamentally governed by the chain's mixing time; slower mixing translates directly into increased bias via a factor proportional to $\tau^{\mathrm{mix}}$. The projection step is a technical device primarily to ensure update boundedness and can, in practice, be circumvented if feature vectors are naturally bounded and/or step-sizes decay appropriately.
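For intuition on the mixing-time factor, the following brute-force sketch computes $\tau^{\mathrm{mix}}(\epsilon)$ for a finite chain whose transition matrix and stationary distribution are known explicitly (an illustrative diagnostic, not part of the source analysis).

```python
import numpy as np

def mixing_time(P, mu, eps, max_steps=10_000):
    """Smallest t with max_s TV(P^t(s, .), mu) <= eps for a finite Markov chain.

    P  : (n, n) row-stochastic transition matrix
    mu : (n,) stationary distribution
    """
    Pt = np.eye(len(mu))
    for t in range(1, max_steps + 1):
        Pt = Pt @ P
        # worst-case total variation distance to stationarity over starting states
        tv = 0.5 * np.abs(Pt - mu).sum(axis=1).max()
        if tv <= eps:
            return t
    raise RuntimeError("chain did not mix within max_steps")
```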
6. Practical Ramifications and Limitations
The explicit finite-time bias bounds for TD learning with linear function approximation supply practitioners and theorists with critical quantitative predictions of statistical and computational efficiency. Error control is directly analogous to that achieved in stochastic gradient descent for convex programs, up to explicit factors shaped by the discount factor $\gamma$, the conditioning of the feature matrix, and the mixing time $\tau^{\mathrm{mix}}$.
Key practical takeaways include:
- With appropriate step-size scheduling and, when necessary, iterate averaging, TD learning achieves near-optimal convergence rates in sample complexity.
- The Markovian bias, though controllable, dictates that statistical efficiency degrades as mixing slows, reinforcing the importance of policies or MDPs with good ergodicity.
- The methodology accommodates generalization to TD(λ) and Q-learning, offering a general-purpose toolkit for finite-time analysis in RL algorithms employing linear value or action-function approximation.
- The tradeoff between approximation bias (from the restricted function class or feature choice) and statistical/convergence bias (from sampling and step-size) is now made explicit, facilitating better algorithmic design and understanding.
These results provide a unified and technically explicit view of finite-time bias bounds across TD-related algorithms, establishing the mathematical basis for subsequent work on more complex stochastic approximation schemes, non-linear function approximation, and robust or distributed TD learning.