ReaPER: Reliability-Adjusted Prioritized Experience Replay
- The paper introduces ReaPER, a method that combines temporal difference error with a novel reliability score to enhance transition prioritization.
- It modifies the sampling distribution using weighted exponents to balance error magnitude with target reliability, reducing bias and variance.
- Empirical evaluations show ReaPER accelerates convergence and improves sample efficiency across both low- and high-dimensional benchmarks.
Reliability-Adjusted Prioritized Experience Replay (ReaPER) is a reinforcement learning transition selection algorithm that extends Prioritized Experience Replay (PER) by integrating a novel temporally-resolved measure of target reliability. By incorporating both the magnitude of the temporal difference (TD) error and an episodewise reliability score, ReaPER enables more efficient and robust sampling from experience replay buffers, provably accelerating convergence and reducing learning variance compared to conventional PER and uniform sampling strategies (Pleiss et al., 23 Jun 2025).
1. Temporal Difference Error and Reliability Characterization
In standard off-policy Q-learning with experience replay, each transition $(s_t, a_t, r_t, s_{t+1})$ produces a temporal difference error
$$\delta_t = r_t + \gamma \max_{a'} Q(s_{t+1}, a'; \theta^-) - Q(s_t, a_t; \theta),$$
where $\gamma \in [0,1)$ is the discount factor, $\theta$ denotes the online network parameters, and $\theta^-$ the target network parameters.
ReaPER identifies that for transitions occurring earlier within an episode, their bootstrapped target values inherit bias due to dependency on a chain of future, potentially unresolved, transitions. In contrast, terminal transitions have exact targets unaffected by bootstrapping. For a complete episode of $T$ steps, this yields the following reliability score for the target at time $t$:
$$\rho_t = \frac{\sum_{k=1}^{t} |\delta_k|}{\sum_{k=1}^{T} |\delta_k|}.$$
The value $\rho_t$ expresses the fraction of absolute TD-errors in the episode resolved up to time $t$. If many unresolved errors remain downstream, $\rho_t$ is small, reflecting low reliability of the current transition's target. As learning progresses and more downstream transitions are updated, $\rho_t$ increases accordingly.
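To make the reliability computation concrete, the following is a minimal sketch, not the authors' reference implementation, that computes episode-wise reliability scores from a finished episode's TD-errors according to the fraction-of-resolved-error definition above; the function name and array handling are illustrative assumptions.

```python
import numpy as np

def reliability_scores(td_errors):
    """Compute per-transition reliability scores for one finished episode.

    td_errors: sequence of TD-errors delta_1 ... delta_T for the episode.
    Returns rho_t for t = 1..T, the fraction of the episode's absolute
    TD-error mass resolved up to and including step t.
    """
    abs_errors = np.abs(np.asarray(td_errors, dtype=np.float64))
    total = abs_errors.sum()
    if total == 0.0:
        # No error anywhere in the episode: every target is treated as fully reliable.
        return np.ones_like(abs_errors)
    return np.cumsum(abs_errors) / total

# Example: large unresolved errors late in the episode make early targets unreliable.
rho = reliability_scores([0.1, 0.2, 0.3, 2.0])
print(rho)  # -> [0.038..., 0.115..., 0.230..., 1.0]; the terminal transition always has rho = 1
```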
2. Priority Formulation and Sampling Distribution
Standard proportional PER assigns transition priorities proportional to the absolute TD-error, $p_i \propto |\delta_i|^{\alpha}$, with $\alpha \geq 0$ controlling prioritization sharpness. In ReaPER, transition priority is adjusted to account for both TD-error and reliability:
$$p_i = |\delta_i|^{\alpha}\, \rho_i^{\nu},$$
with exponents $\alpha, \nu \geq 0$. The associated per-transition sampling probability is normalized as
$$P(i) = \frac{p_i}{\sum_{k=1}^{N} p_k},$$
where $N$ is the buffer size. This formulation penalizes transitions that exhibit large but unreliable errors, emphasizing transitions with significant, trustworthy TD-updates. Tuning the exponents $\alpha$ and $\nu$ can mitigate adverse effects from asynchronous updates and priority outliers.
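A minimal sketch of the priority and sampling-probability computation described above; the exponent defaults and the small $\epsilon$ (a common PER convention to keep zero-error transitions sampleable) are illustrative assumptions, not values from the paper.

```python
import numpy as np

def reaper_priorities(abs_td_errors, reliabilities, alpha=0.6, nu=1.0, eps=1e-6):
    """Combine TD-error magnitude and reliability into ReaPER-style priorities.

    p_i = (|delta_i| + eps)^alpha * rho_i^nu
    """
    abs_td_errors = np.asarray(abs_td_errors, dtype=np.float64)
    reliabilities = np.asarray(reliabilities, dtype=np.float64)
    return (abs_td_errors + eps) ** alpha * reliabilities ** nu

def sampling_probabilities(priorities):
    """Normalize priorities over the buffer into a sampling distribution P(i)."""
    priorities = np.asarray(priorities, dtype=np.float64)
    return priorities / priorities.sum()

# A large but unreliable error (rho = 0.1) is de-emphasized relative to a
# moderate, trustworthy one (rho = 1.0).
p = reaper_priorities([2.0, 0.5, 0.1], [0.1, 1.0, 1.0])
print(sampling_probabilities(p))
```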
3. Algorithmic Structure and Implementation
The ReaPER algorithm can be embedded in off-policy methods such as DQN or DDQN. Its core elements include per-episode reliability tracking, dynamic priority adjustment, and importance sampling for gradient stability. The main algorithmic flow is as follows (a minimal sampling sketch appears after the list):
- Maintain a replay buffer of transitions, with each transition indexed by both buffer position and episode ID.
- For each newly encountered transition, record its TD-error and assign the current maximum priority, so that it is sampled at least once.
- Upon episode termination, calculate exact reliability scores for all transitions in the finished episode as per the definition above.
- For transitions in episodes that have not yet terminated, use a conservative reliability estimate based on the maximal sum of absolute TD-errors among all completed episodes.
- Periodically sample minibatches according to the current $P(i)$, applying per-sample importance weights $w_i = \left(N \cdot P(i)\right)^{-\beta}$, with $\beta$ annealed over training to control correction strength; the weights are normalized by their maximum so that $\max_i w_i = 1$.
- Gradient updates for network parameters incorporate the importance weights and corrected TD-errors.
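The sketch below illustrates one sampling-and-weighting step consistent with the flow above. It is a simplified stand-in (plain arrays instead of a sum-tree, hypothetical function names), not the paper's implementation.

```python
import numpy as np

def sample_minibatch(priorities, batch_size, beta, rng=None):
    """Sample transition indices by priority and return importance-sampling weights.

    priorities: per-transition priorities p_i = |delta_i|^alpha * rho_i^nu.
    beta: importance-sampling exponent, annealed toward 1 over training.
    """
    rng = np.random.default_rng() if rng is None else rng
    priorities = np.asarray(priorities, dtype=np.float64)
    probs = priorities / priorities.sum()          # P(i)
    n = len(priorities)
    idx = rng.choice(n, size=batch_size, p=probs)  # prioritized sampling
    weights = (n * probs[idx]) ** (-beta)          # w_i = (N * P(i))^(-beta)
    weights /= weights.max()                       # normalize so the maximum weight is 1
    return idx, weights

# Usage: the weights multiply each sample's TD loss before the gradient step.
idx, w = sample_minibatch(priorities=np.array([0.15, 0.66, 0.25]), batch_size=2, beta=0.4)
print(idx, w)
```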
A full recomputation of priorities and reliability scores scales with the buffer size per update; in practice, reliability scores are recalculated only for transitions in the affected episodes, so the per-update cost scales with the length of those episodes instead. Table 1 summarizes the priority computation difference between PER and ReaPER:
| Method | Priority Formula | Key Criterion |
|---|---|---|
| PER | $p_i \propto \lvert\delta_i\rvert^{\alpha}$ | TD-error magnitude |
| ReaPER | $p_i \propto \lvert\delta_i\rvert^{\alpha}\,\rho_i^{\nu}$ | Error magnitude and target reliability |
Buffer management leverages episode membership tracking for fast, targeted computations of reliability.
4. Theoretical Guarantees: Convergence and Variance Reduction
Convergence Speed Hierarchy
The theoretical analysis distinguishes the impact of target bias on TD-updates. For a TD-error $\delta_i$ and target bias $b_i$, the expected reduction in squared value error from updating on transition $i$ scales (up to constant factors) as
$$\Delta_i \propto \delta_i(\delta_i + b_i),$$
indicating that misaligned targets (large $|b_i|$ of opposite sign to $\delta_i$) can reverse or hinder correct value updates.
Under a sampling distribution $q$ and learning rate $\eta$, the expected squared error decrement per update is (to first order in $\eta$)
$$\mathbb{E}_q[\Delta L] \approx -\eta \sum_i q_i\,\delta_i^2 \;-\; \eta \sum_i q_i\,\delta_i b_i.$$
The last term is a bias–error interaction, which ReaPER aims to control via the reliability adjustment.
Assuming the target bias can be bounded in terms of the reliability as $|b_i| \leq (1 - \rho_i)\,|\delta_i|$, it follows that
$$\delta_i(\delta_i + b_i) \;\geq\; \rho_i\,\delta_i^2 \;\geq\; 0.$$
Thus, adequately weighting transitions by $\rho_i$ restricts detrimental bias contributions.
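As a concrete illustration of this bound with hypothetical numbers (not drawn from the paper): if $\delta_i = 2$ and $\rho_i = 0.75$, the bias is bounded by $|b_i| \leq 0.25 \cdot 2 = 0.5$, so even in the worst case $b_i = -0.5$ the update term satisfies
$$\delta_i(\delta_i + b_i) = 2 \cdot 1.5 = 3 = \rho_i\,\delta_i^2,$$
i.e., the update still reduces the squared error at a rate of at least $\rho_i\,\delta_i^2$.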
The expected error after $k$ updates then satisfies
$$\mathbb{E}\!\left[L_k^{\text{ReaPER}}\right] \;\leq\; \mathbb{E}\!\left[L_k^{\text{PER}}\right] \;\leq\; \mathbb{E}\!\left[L_k^{\text{uniform}}\right],$$
reflecting the hierarchy: uniform sampling is suboptimal, PER is better, and ReaPER achieves improved convergence through bias-variance trade-off management.
Variance Reduction
For minimizing update variance under the constraint $\sum_i q_i = 1$, the optimal distribution is
$$q_i^{*} = \frac{|g_i|}{\sum_k |g_k|},$$
where $g_i$ denotes the expected per-transition update. Under the target bias assumption, $|g_i| \propto \rho_i\,|\delta_i|$, so ReaPER's sampling distribution matches the variance-optimal weights, directly reducing update variance compared to conventional PER.
5. Empirical Results: Low- and High-Dimensional Benchmarks
The empirical evaluation covers both low-dimensional classical control tasks (CartPole-v1, Acrobot-v1, LunarLander-v2) and the high-dimensional Atari-10 benchmark (ten games representing 80% of the full Atari-57 score variance).
Notable findings include:
- Acrobot-v1: ReaPER reaches convergence in 21.35% fewer environment steps than PER.
- CartPole-v1: ReaPER reaches the performance threshold in 23.24% fewer steps than PER.
- LunarLander-v2: ReaPER attains a score of 200 in 95% of runs (PER: 80%), requiring on average 29.49% fewer steps to do so.
- Atari-10: ReaPER outperforms PER in 8/10 games and matches it in the remaining 2, with a higher average peak score. Normalized cumulative reward curves demonstrate both earlier and higher performance for ReaPER.
Improvements consistently favor ReaPER across all tasks, with minimal overlap in standard deviations and robust gains in sample efficiency and peak performance (Pleiss et al., 23 Jun 2025).
6. Practical Implementation Considerations
Key implementation details involve:
- The episode-ID vector for efficient tracking of episode boundaries.
- A running statistic, the maximum per-episode sum of absolute TD-errors over all terminated episodes, to provide a conservative bound on reliability scores for unfinished episodes (see the sketch following this list).
- Online update strategies that avoid full-buffer recomputation; the practical per-update cost is controlled by recalculating reliability only within the affected episodes.
- Weighted importance sampling to correct for the non-uniform transition distribution, as in PER.
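Below is a minimal sketch of the conservative reliability estimate for transitions in a still-running episode, assuming the running maximum of per-episode absolute TD-error sums described above; variable and function names are illustrative, not from the paper.

```python
import numpy as np

def provisional_reliability(abs_td_errors_so_far, max_completed_episode_error_sum):
    """Conservative reliability estimate for an episode that has not yet terminated.

    The unknown total episode error is replaced by the largest absolute TD-error
    sum observed over any completed episode, giving a pessimistic estimate of rho_t.
    """
    abs_errors = np.abs(np.asarray(abs_td_errors_so_far, dtype=np.float64))
    # The denominator is at least the running episode's own error mass so far.
    denom = max(max_completed_episode_error_sum, abs_errors.sum())
    if denom == 0.0:
        return np.ones_like(abs_errors)
    return np.cumsum(abs_errors) / denom

# Usage: mid-episode, with the largest completed-episode error sum being 5.0.
print(provisional_reliability([0.4, 0.6, 1.0], max_completed_episode_error_sum=5.0))
# -> [0.08, 0.2, 0.4]; replaced by exact reliability scores once the episode terminates.
```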
Hyperparameters for ReaPER (the TD-error exponent $\alpha$, the reliability exponent $\nu$, and the importance-sampling exponent $\beta$) are tuned on a per-domain basis, with $\beta$ annealed from 0.4 to 1.0 over training.
7. Significance and Outlook
ReaPER introduces an explicit mechanism to penalize sampling of transitions whose bootstrapped targets are contaminated by unresolved downstream TD-errors. By sampling in proportion to $|\delta_i|^{\alpha}\,\rho_i^{\nu}$, ReaPER provides enhanced control over detrimental bias–error interactions and aligns the transition-selection distribution closely with the variance-optimal regime. Empirical and theoretical results jointly establish its superiority over vanilla PER in both convergence rate and variance reduction across a diversity of low- and high-dimensional benchmarks (Pleiss et al., 23 Jun 2025).
A plausible implication is that reliability-modulated sampling can generalize to other off-policy reinforcement learning algorithms and more complex domains, though further empirical validation is required. This framework establishes a foundation for future developments in experience prioritization and bias-aware transition selection in reinforcement learning.