
ReaPER: Reliability-Adjusted Prioritized Experience Replay

Updated 7 January 2026
  • The paper introduces ReaPER, a method that combines temporal difference error with a novel reliability score to enhance transition prioritization.
  • It modifies the sampling distribution using weighted exponents to balance error magnitude with target reliability, reducing bias and variance.
  • Empirical evaluations show ReaPER accelerates convergence and improves sample efficiency across both low- and high-dimensional benchmarks.

Reliability-Adjusted Prioritized Experience Replay (ReaPER) is a reinforcement learning transition selection algorithm that extends Prioritized Experience Replay (PER) by integrating a novel temporally-resolved measure of target reliability. By incorporating both the magnitude of the temporal difference (TD) error and an episodewise reliability score, ReaPER enables more efficient and robust sampling from experience replay buffers, provably accelerating convergence and reducing learning variance compared to conventional PER and uniform sampling strategies (Pleiss et al., 23 Jun 2025).

1. Temporal Difference Error and Reliability Characterization

In standard off-policy Q-learning with experience replay, each transition $C_t = (S_t, A_t, R_t, S_{t+1}, d_t)$ produces a temporal difference error

$$\delta_t = Q_{\text{target}}(S_t) - Q(S_t, A_t)$$

where $Q_{\text{target}}(S_t) = R_{t+1} + (1 - d_{t+1})\,\gamma \max_a Q(S_{t+1}, a)$, $\gamma$ is the discount factor, and $d_t \in \{0, 1\}$ indicates episode termination.
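
As a point of reference, this TD-error can be computed for a minibatch in a DQN-style setup. The snippet below is a minimal sketch, assuming PyTorch and a `batch` dictionary with the field names shown (the names and shapes are illustrative, not from the paper):

```python
import torch

def td_errors(q_net, target_net, batch, gamma=0.99):
    """delta_t = Q_target(S_t) - Q(S_t, A_t) for a batch of transitions.

    Assumed batch fields (illustrative): "s" [B, obs], "a" [B], "r_next" [B],
    "s_next" [B, obs], "done_next" [B] -- all tensors.
    """
    # Q(S_t, A_t) for the actions actually taken.
    q_sa = q_net(batch["s"]).gather(1, batch["a"].long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Bootstrapped target from the (frozen) target network.
        next_max = target_net(batch["s_next"]).max(dim=1).values
        q_target = batch["r_next"] + (1.0 - batch["done_next"].float()) * gamma * next_max
    return q_target - q_sa  # the absolute value |delta_t| is what PER/ReaPER prioritize
```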

ReaPER identifies that for transitions occurring earlier within an episode, their bootstrapped target values inherit bias due to dependency on a chain of future, potentially unresolved, transitions. In contrast, terminal transitions have exact targets unaffected by bootstrapping. For a complete episode of $n$ steps, this yields the following reliability score for the target at time $t$:

$$R_t = 1 - \frac{\sum_{i = t+1}^{n} \delta_i^+}{\sum_{i = 1}^{n} \delta_i^+}$$

where $\delta_i^+$ denotes the absolute TD-error $|\delta_i|$. The value $R_t \in [0, 1]$ expresses the fraction of absolute TD-errors in the episode resolved up to time $t$. If many unresolved errors remain downstream, $R_t$ is small, reflecting low reliability of the current transition's target. As learning progresses and more downstream transitions are updated, $R_t$ increases accordingly.
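
A minimal sketch of the per-episode reliability computation, assuming the absolute TD-errors of one completed episode are available in temporal order as a NumPy array (the function and variable names are illustrative):

```python
import numpy as np

def reliability_scores(abs_td_errors: np.ndarray) -> np.ndarray:
    """R_t = 1 - (sum_{i>t} delta_i^+) / (sum_i delta_i^+) for one finished episode."""
    total = float(abs_td_errors.sum())
    if total == 0.0:
        # No unresolved error anywhere in the episode: every target is fully reliable.
        return np.ones_like(abs_td_errors)
    downstream = total - np.cumsum(abs_td_errors)  # error mass strictly after step t
    return 1.0 - downstream / total

# Example: the terminal step always receives R_n = 1.
print(reliability_scores(np.array([0.5, 1.0, 0.25, 0.25])))  # [0.25, 0.75, 0.875, 1.0]
```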

2. Priority Formulation and Sampling Distribution

Standard proportional PER assigns transition priorities proportional to the absolute TD-error, $\Psi_t^{\text{PER}} = (\delta_t^+)^{\alpha}$, with $\alpha \in (0,1]$ controlling prioritization sharpness. In ReaPER, transition priority is adjusted to account for both TD-error and reliability:

$$\Psi_t = (R_t)^{\omega} \cdot (\delta_t^+)^{\alpha}$$

with exponents $\omega, \alpha \in (0,1]$. The associated per-transition sampling probability is normalized as

$$p_t = \frac{\Psi_t}{\sum_{i=1}^N \Psi_i}$$

where $N$ is the buffer size. This formulation penalizes transitions that exhibit large but unreliable errors, emphasizing transitions with significant, trustworthy TD-updates. Tuning the exponents $\alpha$ and $\omega$ can mitigate adverse effects from asynchronous updates and priority outliers.
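
A brief sketch of the resulting sampling distribution, contrasting PER and ReaPER priorities (the small `eps` added to the TD-error for numerical stability is an implementation convention, not part of the formula above):

```python
import numpy as np

def sampling_probabilities(abs_td_errors, reliabilities, alpha=0.4, omega=0.2, eps=1e-6):
    """Normalized ReaPER sampling probabilities; omega=0 recovers proportional PER."""
    abs_td_errors = np.asarray(abs_td_errors, dtype=np.float64)
    reliabilities = np.asarray(reliabilities, dtype=np.float64)
    priorities = (reliabilities ** omega) * ((abs_td_errors + eps) ** alpha)
    return priorities / priorities.sum()

# A large but unreliable error is down-weighted relative to plain PER.
p_per = sampling_probabilities([2.0, 0.5, 0.5], [1.0, 1.0, 1.0])     # reliability ignored
p_reaper = sampling_probabilities([2.0, 0.5, 0.5], [0.1, 0.9, 1.0])  # first target unreliable
print(p_per, p_reaper)
```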

3. Algorithmic Structure and Implementation

The ReaPER algorithm can be embedded in off-policy methods such as DQN or DDQN. Its core elements include per-episode reliability tracking, dynamic priority adjustment, and importance sampling for gradient stability. The main algorithmic flow is as follows (a code sketch appears after the list):

  • Maintain a replay buffer of transitions, with each transition indexed by both buffer position and episode ID.
  • For each encountered transition, record the TD-error and assign a maximum priority.
  • Upon episode termination, calculate exact reliability scores for all transitions in the finished episode as per the definition above.
  • For non-terminal episodes, utilize a conservative reliability estimate based on the maximal sum of absolute TD-errors among all completed episodes.
  • Periodically sample minibatches according to the current $p_t$, applying per-sample importance weights:

$$w_j = \left[ \frac{1}{N \cdot p_j} \right]^\beta$$

with $\beta$ annealed over training to control correction smoothness; the weights $w_j$ are normalized to a maximum of 1.

  • Gradient updates for network parameters incorporate the importance weights and corrected TD-errors.
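
The following is an illustrative sketch of this flow, not the authors' reference implementation: a flat-array buffer with per-episode bookkeeping (a production version would typically use a sum-tree for sampling). The class and method names, the max-priority insertion rule, and the exact form of the conservative estimate for unfinished episodes are assumptions consistent with the description above:

```python
import numpy as np
from collections import defaultdict

class ReaPERBuffer:
    """Sketch of a ReaPER-style replay buffer with episode-level reliability tracking."""

    def __init__(self, capacity, alpha=0.4, omega=0.2):
        self.capacity, self.alpha, self.omega = capacity, alpha, omega
        self.transitions = [None] * capacity                # (s, a, r, s_next, done)
        self.abs_td = np.zeros(capacity)                    # |delta_t| per slot
        self.reliability = np.ones(capacity)                # R_t per slot (1.0 until known)
        self.episode_id = np.full(capacity, -1, dtype=np.int64)
        self.episode_members = defaultdict(list)            # episode id -> buffer slots (in order)
        self.max_abs_td = 1.0                               # for max-priority insertion
        self.max_episode_error = 1.0                        # statistic F over finished episodes
        self.pos, self.size = 0, 0

    def add(self, transition, episode_id):
        i = self.pos
        stale = self.episode_id[i]
        if stale >= 0:                                      # slot overwritten: drop stale membership
            self.episode_members[stale].remove(i)
        self.transitions[i] = transition
        self.abs_td[i] = self.max_abs_td                    # new transitions get maximal priority
        self.reliability[i] = 1.0
        self.episode_id[i] = episode_id
        self.episode_members[episode_id].append(i)
        self.pos = (self.pos + 1) % self.capacity
        self.size = min(self.size + 1, self.capacity)

    def finish_episode(self, episode_id):
        """Exact reliability scores for a finished episode; also refresh the bound F."""
        idx = self.episode_members[episode_id]
        errs = self.abs_td[idx]
        total = errs.sum()
        self.max_episode_error = max(self.max_episode_error, float(total))
        if total > 0:
            self.reliability[idx] = 1.0 - (total - np.cumsum(errs)) / total

    def estimate_unfinished(self, episode_id):
        """One plausible conservative estimate for a still-running episode: treat all
        worst-case error mass (bounded by F) that is not yet resolved as downstream."""
        idx = self.episode_members[episode_id]
        resolved = np.cumsum(self.abs_td[idx])
        self.reliability[idx] = np.clip(resolved / self.max_episode_error, 0.0, 1.0)

    def sample(self, batch_size, beta=0.4):
        pri = (self.reliability[:self.size] ** self.omega) * (self.abs_td[:self.size] ** self.alpha)
        p = pri / pri.sum()
        idx = np.random.choice(self.size, size=batch_size, p=p)
        weights = (1.0 / (self.size * p[idx])) ** beta      # importance-sampling correction
        weights /= weights.max()                            # normalize to a maximum of 1
        return idx, [self.transitions[i] for i in idx], weights

    def update_td_errors(self, idx, new_abs_td):
        """Write back |delta| after a gradient step; reliability is refreshed per episode."""
        self.abs_td[idx] = new_abs_td
        self.max_abs_td = max(self.max_abs_td, float(np.max(new_abs_td)))
```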

Per-epoch complexity is $O(N)$ for full updates, but reduces to $O(n - t)$ when reliability scores are recalculated only for transitions in affected episodes. Table 1 summarizes the priority computation difference between PER and ReaPER:

| Method | Priority Formula | Key Criterion |
|--------|------------------|---------------|
| PER    | $(\delta_t^+)^{\alpha}$ | TD-error magnitude |
| ReaPER | $(R_t)^{\omega} (\delta_t^+)^{\alpha}$ | Error magnitude and reliability |

Buffer management leverages episode membership tracking for fast, targeted computations of reliability.

4. Theoretical Guarantees: Convergence and Variance Reduction

Convergence Speed Hierarchy

The theoretical analysis distinguishes the impact of target bias on TD-updates. For a value error $e_t = Q(S_t, A_t) - Q^*(S_t, A_t)$ and target bias $\epsilon_t = Q_{\text{target}}(S_t) - Q^*(S_t, A_t)$, the alignment between the actual TD-update direction $g_t$ and the ideal update direction $g_t^*$ (toward $Q^*$) satisfies, up to constant scaling,

$$\langle g_t, g_t^* \rangle = 2 e_t^2 - 2 e_t \epsilon_t$$

indicating that misaligned targets (large $\epsilon_t$) can reverse or hinder correct value updates.

Under a sampling distribution $\mu$ and learning rate $\eta$, the expected squared error decrement per update is

$$\mathbb{E}_{\mu}\left[\Delta\|Q - Q^*\|^2\right] = \eta^2 \sum_t \mu_t \mathbb{E}[\delta_t^2] - 2\eta \sum_t \mu_t \mathbb{E}[e_t^2] + 2\eta \sum_t \mu_t \mathbb{E}[e_t \epsilon_t].$$

The last term is a bias–error interaction, which ReaPER aims to control via the reliability adjustment.
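
For concreteness, both expressions follow from expanding a single tabular update $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \eta\,\delta_t$. The short derivation below uses the sign convention $\delta_t = \epsilon_t - e_t$ implied by the definitions above (a sketch, not reproduced from the paper):

```latex
\begin{aligned}
\delta_t &= Q_{\text{target}}(S_t) - Q(S_t, A_t)
          = \underbrace{\bigl(Q_{\text{target}}(S_t) - Q^*(S_t, A_t)\bigr)}_{\epsilon_t}
            - \underbrace{\bigl(Q(S_t, A_t) - Q^*(S_t, A_t)\bigr)}_{e_t}
          = \epsilon_t - e_t, \\[4pt]
\Delta (Q - Q^*)^2 &= \bigl(e_t + \eta\,\delta_t\bigr)^2 - e_t^2
          = \eta^2 \delta_t^2 + 2\eta\, e_t \delta_t
          = \eta^2 \delta_t^2 - 2\eta\, e_t^2 + 2\eta\, e_t \epsilon_t .
\end{aligned}
```

Taking expectations under $\mu$ and summing over transitions recovers the expression above; the middle term $2\eta\, e_t \delta_t$ equals $-\eta(2e_t^2 - 2e_t\epsilon_t) = -\eta\,\langle g_t, g_t^* \rangle$, so a well-aligned update decreases the squared error, while a large bias $\epsilon_t$ can cancel or reverse that decrease.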

Assuming the target bias can be bounded as $|\epsilon_t| \le \lambda \sum_{i>t} \delta_i^+$, it follows that

$$|\epsilon_t| \leq \lambda (1-R_t) \sum_i \delta_i^+$$

Thus, adequately weighting transitions by $R_t$ restricts detrimental bias contributions.

The expected error after $T$ updates satisfies

$$\mathbb{E}\left[\| Q_T^{\text{Uniform}} - Q^* \|^2\right] \geq \mathbb{E}\left[\| Q_T^{\text{PER}} - Q^* \|^2\right] \geq \mathbb{E}\left[\| Q_T^{\text{ReaPER}} - Q^* \|^2\right],$$

reflecting the hierarchy: uniform sampling is suboptimal, PER is better, and ReaPER achieves improved convergence through bias–variance trade-off management.

Variance Reduction

For minimizing update variance under the constraint $\sum_t \mu_t \delta_t^+ \geq \tau$, the optimal distribution is

$$\mu_t^* \propto \frac{\delta_t^+}{\sigma_t^2},$$

where $\sigma_t^2 = \operatorname{Var}(Q_{\text{target}}(S_t))$. Under the target bias assumption, $R_t \approx 1/\sigma_t^2$, so ReaPER's sampling distribution $\mu_t \propto R_t \delta_t^+$ matches the variance-optimal weights, directly reducing update variance compared to conventional PER.
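
A toy numerical check of this correspondence, assuming the reliability scores are exactly proportional to $1/\sigma_t^2$ (the paper states the relation only approximately; the numbers below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
abs_td = rng.uniform(0.1, 2.0, size=8)      # |delta_t| for eight transitions
sigma2 = rng.uniform(0.5, 3.0, size=8)      # Var(Q_target(S_t))

# Variance-optimal sampling weights from the constrained problem above.
mu_opt = abs_td / sigma2
mu_opt /= mu_opt.sum()

# ReaPER-style weights with R_t taken proportional to 1/sigma_t^2, scaled into (0, 1].
reliability = (1.0 / sigma2) / (1.0 / sigma2).max()
mu_reaper = reliability * abs_td
mu_reaper /= mu_reaper.sum()

print(np.allclose(mu_opt, mu_reaper))        # True: the two distributions coincide
```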

5. Empirical Results: Low- and High-Dimensional Benchmarks

The empirical evaluation covers both low-dimensional classical control tasks (CartPole-v1, Acrobot-v1, LunarLander-v2) and the high-dimensional Atari-10 benchmark (ten games representing 80% of the full Atari-57 score variance).

Notable findings include:

  • Acrobot-v1: ReaPER reaches convergence in 14,550 ± 3,528 steps compared to PER's 18,500 ± 2,356 (−21.35%).
  • CartPole-v1: ReaPER achieves the performance threshold in 15,850 ± 7,601 steps, outperforming PER's 20,650 ± 6,048 (−23.24%).
  • LunarLander-v2: ReaPER achieves a score of 200 in 95% of runs (PER: 80%), with average steps 38,500 ± 24,270 versus PER's 54,600 ± 30,776 (−29.49%).
  • Atari-10: ReaPER outperforms PER in 8/10 games and matches it in 2, with an average peak-score increase of +24.37% (SD = 23.76%). Normalized cumulative reward curves demonstrate both earlier and higher performance for ReaPER.

Consistent improvements favor ReaPER across all tasks, with minimal overlap in standard deviations and robust gains in sample efficiency and peak performance (Pleiss et al., 23 Jun 2025).

6. Practical Implementation Considerations

Key implementation details involve:

  • The episode-ID vector $\phi_i$ for efficient tracking of episode boundaries.
  • The statistic $F$ (the maximum episode sum of $\delta^+$ over all terminated episodes) to provide a conservative bound on reliability scores for unfinished episodes.
  • Online update strategies to avoid $O(N)$ full-buffer recomputation; the practical complexity per update is controlled by recalculating only within affected episodes.
  • Weighted importance sampling to correct for the non-uniform transition distribution, as in PER.

Hyperparameters for ReaPER ($\alpha$, $\omega$, $\beta$) are tuned on a per-domain basis, with typical values $\alpha = 0.4$, $\omega = 0.2$, and $\beta$ annealed from 0.4 to 1.0.
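
A minimal configuration sketch with a linear $\beta$ schedule; the annealing shape and the training horizon are assumptions, not values specified in the section above:

```python
from dataclasses import dataclass

@dataclass
class ReaPERConfig:
    alpha: float = 0.4            # TD-error exponent
    omega: float = 0.2            # reliability exponent
    beta_start: float = 0.4       # initial importance-sampling correction
    beta_end: float = 1.0         # full correction by the end of training
    total_steps: int = 1_000_000  # assumed training horizon (illustrative)

    def beta(self, step: int) -> float:
        """Linearly anneal beta from beta_start to beta_end over training."""
        frac = min(step / self.total_steps, 1.0)
        return self.beta_start + frac * (self.beta_end - self.beta_start)

cfg = ReaPERConfig()
print(cfg.beta(0), cfg.beta(500_000), cfg.beta(2_000_000))  # 0.4 0.7 1.0
```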

7. Significance and Outlook

ReaPER introduces an explicit mechanism to penalize sampling of transitions whose bootstrapped targets are contaminated by unresolved downstream TD-errors. By sampling in proportion to $R_t \delta_t^+$, ReaPER provides enhanced control over detrimental bias–error interactions and aligns the transition selection distribution closely with the variance-optimal regime. Empirical and theoretical results jointly establish its superiority over vanilla PER in both convergence rate and variance reduction across a diversity of low- and high-dimensional benchmarks (Pleiss et al., 23 Jun 2025).

A plausible implication is that reliability-modulated sampling can generalize to other off-policy reinforcement learning algorithms and more complex domains, though further empirical validation is required. This framework establishes a foundation for future developments in experience prioritization and bias-aware transition selection in reinforcement learning.

References

1. Pleiss et al., "ReaPER: Reliability-Adjusted Prioritized Experience Replay," 23 Jun 2025.
