N-Step Returns in Reinforcement Learning
- N-step returns are defined as the sum of discounted rewards over n steps with a bootstrapped value, balancing bias and variance in reinforcement learning.
- They underpin both model-based evaluations and model-free TD algorithms, facilitating faster reward propagation and improved stability.
- Adaptive selection and mixture techniques of n-step returns enhance learning efficiency and control variance in practical deep RL implementations.
An $n$-step return is a fundamental target in temporal-difference (TD) and value-based reinforcement learning, consisting of the sum of discounted rewards over $n$ steps followed by a bootstrapped estimate of the value function at the $n$-th future state. This construction allows intermediate propagation of reward information, controlling the bias–variance trade-off between single-step TD learning and Monte Carlo evaluation. The use of $n$-step returns, their mixtures, and adaptive schemes for step-horizon selection underpin a vast array of modern RL algorithms, spanning off-policy learning, deep Q-networks, actor-critic architectures, and multi-goal learning.
1. Mathematical Definition and Construction
The $n$-step return from time $t$ along a trajectory is defined as
$$G_t^{(n)} = \sum_{k=0}^{n-1} \gamma^k R_{t+k+1} + \gamma^n V(S_{t+n}),$$
where $\gamma \in [0,1)$ is the discount factor, $R_{t+k+1}$ are rewards, and $V(S_{t+n})$ is a bootstrapped value or action-value estimate. For off-policy learning, one applies an importance-sampling correction, multiplying by the product of per-step ratios
$$\rho_{t:t+n-1} = \prod_{k=t}^{t+n-1} \frac{\pi(A_k \mid S_k)}{\mu(A_k \mid S_k)},$$
with $\pi$ the target policy and $\mu$ the behavior policy. This structure is central in both value and Q-learning formulations, the latter using
$$G_t^{(n)} = \sum_{k=0}^{n-1} \gamma^k R_{t+k+1} + \gamma^n \max_{a} Q(S_{t+n}, a)$$
(Lim et al., 13 Feb 2025, Chiang et al., 2020).
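As a concrete illustration of the definitions above, the following minimal NumPy sketch computes an $n$-step target from a stored trajectory; the array layout (`rewards`, `values`) and the episodic truncation rule are illustrative assumptions rather than a prescription from the cited papers.

```python
import numpy as np

def n_step_return(rewards, values, t, n, gamma=0.99):
    """Compute the n-step return G_t^(n) for a single stored trajectory.

    rewards[k] is the reward received after the k-th action,
    values[k] is a bootstrapped estimate V(S_k), e.g. from a critic.
    The horizon is truncated at the end of the trajectory, in which case
    the bootstrap term is dropped (episodic assumption).
    """
    T = len(rewards)
    horizon = min(n, T - t)                     # truncate at episode end
    g = sum(gamma**k * rewards[t + k] for k in range(horizon))
    if t + n < T:                               # no bootstrap past the terminal state
        g += gamma**n * values[t + n]
    return g

# Example: a 6-step episode, 3-step return from t = 0
rewards = np.array([0.0, 0.0, 1.0, 0.0, 0.0, 1.0])
values  = np.array([0.5, 0.6, 0.7, 0.4, 0.3, 0.2, 0.0])  # V(S_0), ..., V(S_6)
print(n_step_return(rewards, values, t=0, n=3, gamma=0.99))
```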
2. Algorithmic Frameworks: Model-Based and Model-Free
$n$-step returns enable a spectrum of policy evaluation and control algorithms:
- Model-Based Policy Evaluation: The $n$-step Bellman operator is
$$(T^n V)(s) = \mathbb{E}\!\left[\sum_{k=0}^{n-1} \gamma^k R_{t+k+1} + \gamma^n V(S_{t+n}) \,\middle|\, S_t = s\right],$$
with projection $\Pi$ onto the feature space and fixed points given by the $n$-step projected Bellman equation $\Phi\theta = \Pi T^n \Phi\theta$. For large enough $n$, the gain matrix becomes Schur, yielding geometric convergence in both deterministic solvers and Richardson-type gradient iterations. Sufficient bounds on $n$ exist for all contraction and Hurwitz criteria required for stability (Lim et al., 13 Feb 2025).
- Model-Free $n$-Step TD Learning: In off-policy settings, stochastic approximation algorithms update value parameters using samples of $n$-step trajectories, with proven w.p.1 convergence for sufficiently large $n$ and appropriate step-size schedules. Both i.i.d. and Markovian sampling converge to the fixed point of the model-based equation under mild conditions (Lim et al., 13 Feb 2025); a schematic update rule is sketched after this list.
- Deep RL Extensions: Bootstrapped DQN variants assign each head in an ensemble a distinct backup horizon, achieving diversity in target updates and improved exploration and sample efficiency. Mixture-based approaches further generalize via weighted targets or TD($\lambda$) (Chiang et al., 2020).
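To make the model-free update concrete, here is a schematic sketch of off-policy $n$-step TD with linear features and per-step importance ratios. It is a generic illustration under assumed interfaces (`phi`, `pi`, `mu`, a constant step size), not the exact algorithm analyzed in (Lim et al., 13 Feb 2025).

```python
import numpy as np

def off_policy_n_step_td(trajectory, phi, theta, n, gamma, alpha, pi, mu):
    """One sweep of off-policy n-step TD with linear values V(s) = phi(s) @ theta.

    trajectory: list of (s, a, r, s_next) tuples collected under behavior policy mu.
    pi(a, s) / mu(a, s): target / behavior action probabilities.
    Schematic sketch; episode-end truncation and step-size decay are omitted.
    """
    T = len(trajectory)
    for t in range(T - n):
        # Product of importance ratios over the n-step window
        rho = np.prod([pi(trajectory[t + k][1], trajectory[t + k][0]) /
                       mu(trajectory[t + k][1], trajectory[t + k][0])
                       for k in range(n)])
        # n-step bootstrapped target
        g = sum(gamma**k * trajectory[t + k][2] for k in range(n))
        g += gamma**n * phi(trajectory[t + n][0]) @ theta
        # Semi-gradient TD update weighted by the importance ratio
        s_t = trajectory[t][0]
        td_error = g - phi(s_t) @ theta
        theta = theta + alpha * rho * td_error * phi(s_t)
    return theta
```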
3. Bias–Variance Analysis and Return Mixtures
The $n$-step return forms the basis of the trade-off between bias (from bootstrapping) and variance (from reward stochasticity):
- Bias: For small $n$, bootstrapped value estimates dominate the target, increasing bias.
- Variance: For large $n$, real rewards propagate more deeply but variance grows; in the limit $n \to \infty$ the target is the unbiased but high-variance Monte Carlo return.
- Compound and $\lambda$-Returns: Weighting or averaging $n$-step returns (e.g., TD($\lambda$)) can strictly reduce variance at a matched contraction modulus, as formally proved in both linear and nonlinear settings. State-adaptive mixtures such as Confidence-based Autodidactic Returns outperform hand-tuned exponential mixtures by dynamically adjusting weights in response to learned confidence scores (Sharma et al., 2017, Daley et al., 2024); a mixture sketch appears at the end of this section.
Table: Comparison of $n$-step Targets and Mixtures

| Variant | Target Formula | Bias–Variance Effect |
|---|---|---|
| $n$-step | $G_t^{(n)} = \sum_{k=0}^{n-1} \gamma^k R_{t+k+1} + \gamma^n V(S_{t+n})$ | Bias increases, variance decreases as $n$ decreases |
| TD($\lambda$) | $G_t^{\lambda} = (1-\lambda) \sum_{n \ge 1} \lambda^{n-1} G_t^{(n)}$ | Smooth trade-off via $\lambda$ |
| Compound | Weighted average $\sum_i w_i G_t^{(n_i)}$ with $\sum_i w_i = 1$ | Lower variance for the same bias |
All bias–variance trade-offs are supported by the cited theoretical and empirical results (Daley et al., 2024, Sharma et al., 2017, Lim et al., 13 Feb 2025).
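To make the mixture view concrete, the following sketch forms a truncated $\lambda$-return and a two-component compound return from precomputed $n$-step returns; the truncation scheme and the weights are illustrative assumptions, not the exact PiLaR or CAR constructions.

```python
import numpy as np

def truncated_lambda_return(n_step_returns, lam):
    """Truncated lambda-return from precomputed n-step returns G^(1), ..., G^(N).

    Uses weights (1 - lam) * lam^(n-1) for n < N and absorbs the remaining
    probability mass lam^(N-1) into the longest return, so the weights sum to 1.
    """
    g = np.asarray(n_step_returns, dtype=float)
    N = len(g)
    weights = np.array([(1 - lam) * lam**(n - 1) for n in range(1, N)] + [lam**(N - 1)])
    return float(weights @ g)

def compound_return(g_short, g_long, w=0.5):
    """Two-component compound return: a convex combination of two n-step returns."""
    return w * g_short + (1 - w) * g_long
```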
4. Off-Policy Corrections and Practical Stability
In off-policy multi-step learning, the divergence risk inherent in the “deadly triad” (function approximation + off-policy sampling + bootstrapping) is mitigated by increasing $n$:
- Contraction and Hurwitz: For finite $n$ above explicit theoretical thresholds, the Bellman operators are contractions and convergence is geometric, as all relevant matrices become Schur or Hurwitz (Lim et al., 13 Feb 2025).
- Importance-Sampling Variants: Large $n$ increases instability due to possible variance explosion of the importance weights, motivating numerically stable schemes (e.g., quantile clipping in SAC (Łyskawa et al., 15 Dec 2025)); a clipping sketch follows this list.
- Bias in Multi-goal Relabeling: In hindsight experience replay, naive $n$-step relabeling introduces a bias that grows linearly with $n$ and with reward magnitude, requiring $\lambda$-mixtures or model-based blending to control bias in practical multi-goal learning (Yang et al., 2021).
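A generic version of the quantile-clipping idea can be sketched as follows: per-trajectory products of importance ratios are capped at an upper batch quantile, computed in log space for stability. This is an illustrative approximation, not the exact SACn estimator, and the quantile level `q` is an assumed hyperparameter.

```python
import numpy as np

def quantile_clipped_weights(log_pi, log_mu, q=0.95):
    """Clip per-trajectory products of importance ratios at the q-th batch quantile.

    log_pi, log_mu: arrays of shape (batch, n) with log-probabilities of the
    sampled actions under the target and behavior policies.
    Returns clipped weights of shape (batch,).
    """
    # Product of per-step ratios, computed in log space for numerical stability
    log_rho = np.sum(log_pi - log_mu, axis=1)
    rho = np.exp(log_rho)
    cap = np.quantile(rho, q)          # batch-level cap on the weights
    return np.minimum(rho, cap)

# Usage (schematic): weights = quantile_clipped_weights(log_pi_batch, log_mu_batch)
# off_policy_target = weights * n_step_targets
```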
5. Empirical Performance and Adaptive Selection of
Extensive experiments confirm:
- Accelerated Learning: Larger $n$-step horizons propagate rewards faster in deep RL, with mixture or compound approaches (e.g., MB-DQN, PiLaR) outperforming pure fixed-$n$ baselines (Chiang et al., 2020, Daley et al., 2024).
- Adaptive Step Length: The SDPSA algorithm adaptively finds an $n$ minimizing the average RMSE of $n$-step TD, with proven almost sure convergence to the optimal discrete $n$, outperforming bandit-style alternatives in controlled experiments (Mandal et al., 2023).
- Transformer-based Critics and Ensemble Methods: In continuous control, chunked $n$-step targets substantially improve stability and sample efficiency on sparse-reward and multi-phase tasks, with gradient-level averaging over multiple horizons reducing critic variance (Tian et al., 5 Mar 2025).
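The gradient-level averaging over multiple horizons mentioned above can be sketched as averaging the critic loss across several precomputed $n$-step targets for the same batch; the array shapes and the squared-error loss are assumptions, not the exact procedure of (Tian et al., 5 Mar 2025).

```python
import numpy as np

def multi_horizon_critic_loss(predictions, targets_by_horizon):
    """Average the squared TD error over several n-step targets for one batch.

    predictions:        array of shape (batch,) with critic outputs V(S_t).
    targets_by_horizon: array of shape (num_horizons, batch), each row holding
                        n-step targets computed with a different horizon n.
    Because the gradient of a mean of losses equals the mean of the per-loss
    gradients, averaging the losses acts as gradient-level averaging.
    """
    per_horizon = np.mean((targets_by_horizon - predictions[None, :]) ** 2, axis=1)
    return float(np.mean(per_horizon))
```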
6. Error Bounds and Theoretical Guarantees
Quantitative bounds depend critically on $n$:
- Approximation Error: Under a contraction regime, the fixed-point mismatch decays geometrically in $n$, on the order of $\gamma^n$.
- Finite-sample Guarantees: TD with compound returns achieves strictly better finite-sample complexity at a fixed contraction modulus, yielding tighter upper bounds on estimation error as a function of return variance and sample size (Daley et al., 2024, Lim et al., 13 Feb 2025).
When the projected $n$-step Bellman operator is $\gamma^n$-contractive, explicit error bounds of the form
$$\|\Phi\theta^{*} - V^{\pi}\| \le \frac{1}{1-\gamma^{n}}\,\|\Pi V^{\pi} - V^{\pi}\|$$
are available, with convergence to the least-squares projection $\Pi V^{\pi}$ in the limit $n \to \infty$ (Lim et al., 13 Feb 2025).
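As a numerical illustration (assuming the relevant contraction modulus scales as $\gamma^n$), the snippet below evaluates the amplification factor $1/(1-\gamma^n)$ for several horizons.

```python
gamma = 0.99
for n in (1, 5, 10, 50):
    modulus = gamma ** n   # assumed contraction modulus of the n-step operator
    print(f"n={n:3d}  gamma^n={modulus:.3f}  amplification 1/(1-gamma^n)={1/(1 - modulus):.2f}")
```

With $\gamma = 0.99$, the factor drops from 100 at $n = 1$ to roughly 2.5 at $n = 50$, illustrating why larger $n$ tightens the bound.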
7. Hyperparameter Choices, Practical Algorithms, and Recommendations
Algorithmic implementations require careful selection of the horizon $n$ and mixture weights, as well as numerically stable handling of importance ratios and reward/entropy estimation in actor-critic variants:
- Choice of $n$: The threshold on $n$ required for stability depends logarithmically on feature conditioning, the discount factor, and the stationary distribution, and small $n$ carries a task-specific risk of divergence (Lim et al., 13 Feb 2025, Mandal et al., 2023).
- Mixture Weights and Gradient Averaging: State-adaptive mixtures (e.g., CAR) and compound two-bootstrap mixtures (PiLaR) reduce variance cost-effectively; mixture strategies outperform static single-target baselines in DQN and PPO (Daley et al., 2024, Sharma et al., 2017). A state-adaptive weighting sketch follows this list.
- Stable Importance Sampling: Quantile-based clipping provides unbiased multi-step returns in off-policy continuous control, together with τ-sampled entropy estimators to balance variance (Łyskawa et al., 15 Dec 2025).
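As a schematic of the state-adaptive idea (not the exact CAR parameterization), per-state weights over a set of candidate $n$-step targets can be produced by a softmax over confidence scores; all array shapes and names below are assumptions.

```python
import numpy as np

def adaptive_mixture_target(n_step_targets, confidences):
    """Weight candidate n-step targets by a softmax over per-target confidence scores.

    n_step_targets: array of shape (batch, K) with K candidate n-step returns per state.
    confidences:    array of shape (batch, K) with learned (or heuristic) scores.
    Returns one mixed target per state, shape (batch,).
    """
    targets = np.asarray(n_step_targets, dtype=float)
    scores = np.asarray(confidences, dtype=float)
    scores = scores - scores.max(axis=1, keepdims=True)   # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return (weights * targets).sum(axis=1)
```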
Table: Notable Empirical Results Across Domains
| Algorithm/Class | Key Empirical Finding | Reference |
|---|---|---|
| MB-DQN | Faster learning, richer exploration | (Chiang et al., 2020) |
| CAR λ-Mixture | Up to 4.5× baseline score on Atari | (Sharma et al., 2017) |
| T-SAC Transformer Critic | 2× sample efficiency in sparse RL | (Tian et al., 5 Mar 2025) |
| PiLaR Compound Return | Lower variance, higher final score | (Daley et al., 2024) |
| SDPSA | Converges to the RMSE-optimal $n$ | (Mandal et al., 2023) |
References to Key Papers
- Analysis of Off-Policy n-Step TD-Learning with Linear Function Approximation (Lim et al., 13 Feb 2025)
- Mixture of Step Returns in Bootstrapped DQN (Chiang et al., 2020)
- Learning to Mix n-Step Returns: Generalizing lambda-Returns for Deep RL (Sharma et al., 2017)
- Averaging n-step Returns Reduces Variance in Reinforcement Learning (Daley et al., 2024)
- Chunking the Critic: Transformer-based SAC with N-Step Returns (Tian et al., 5 Mar 2025)
- SACn: Soft Actor-Critic with n-step Returns (Łyskawa et al., 15 Dec 2025)
- Bias-reduced Multi-step Hindsight Experience Replay (Yang et al., 2021)
- n-Step TD Learning with Optimal n (Mandal et al., 2023)
The $n$-step return and its generalizations remain central to the stability, efficiency, and adaptability of reinforcement learning algorithms. Their detailed mathematical properties, bias–variance interplay, and algorithmic implications across classical and deep RL domains establish them as an indispensable methodology for both theoretical analysis and practical design.