ReaPER: Reliability-Adjusted Prioritized Experience Replay
- The paper introduces ReaPER, a method that combines temporal difference error with a novel reliability score to enhance transition prioritization.
- It modifies the sampling distribution using weighted exponents to balance error magnitude with target reliability, reducing bias and variance.
- Empirical evaluations show ReaPER accelerates convergence and improves sample efficiency across both low- and high-dimensional benchmarks.
Reliability-Adjusted Prioritized Experience Replay (ReaPER) is a reinforcement learning transition selection algorithm that extends Prioritized Experience Replay (PER) by integrating a novel temporally-resolved measure of target reliability. By incorporating both the magnitude of the temporal difference (TD) error and an episodewise reliability score, ReaPER enables more efficient and robust sampling from experience replay buffers, provably accelerating convergence and reducing learning variance compared to conventional PER and uniform sampling strategies (Pleiss et al., 23 Jun 2025).
1. Temporal Difference Error and Reliability Characterization
In standard off-policy Q-learning with experience replay, each transition $(s_t, a_t, r_t, s_{t+1})$ produces a temporal difference error
$$\delta_t = r_t + \gamma \max_{a'} Q(s_{t+1}, a'; \theta^-) - Q(s_t, a_t; \theta),$$
where $\gamma \in [0,1)$ is the discount factor, $\theta$ denotes the online network parameters, and $\theta^-$ the target network parameters.
ReaPER identifies that for transitions occurring earlier within an episode, their bootstrapped target values inherit bias due to dependency on a chain of future, potentially unresolved, transitions. In contrast, terminal transitions have exact targets unaffected by bootstrapping. For a complete episode of $T$ steps, this yields the following reliability score for the target at time $t$:
$$\rho_t = \frac{\sum_{k=1}^{t} |\delta_k|}{\sum_{k=1}^{T} |\delta_k|}.$$
The value $\rho_t$ expresses the fraction of absolute TD-errors in the episode resolved up to time $t$. If many unresolved errors remain downstream, $\rho_t$ is small, reflecting low reliability of the current transition's target. As learning progresses and more downstream transitions are updated, $\rho_t$ increases accordingly.
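To make the reliability computation concrete, the following is a minimal sketch, not the authors' reference implementation, that computes episode-wise reliability scores from a finished episode's TD-errors according to the fraction-of-resolved-error definition above; the function name and array handling are illustrative assumptions.

```python
import numpy as np

def reliability_scores(td_errors):
    """Compute per-transition reliability scores for one finished episode.

    td_errors: sequence of TD-errors delta_1 ... delta_T for the episode.
    Returns rho_t for t = 1..T, the fraction of the episode's absolute
    TD-error mass resolved up to and including step t.
    """
    abs_errors = np.abs(np.asarray(td_errors, dtype=np.float64))
    total = abs_errors.sum()
    if total == 0.0:
        # No error anywhere in the episode: every target is treated as fully reliable.
        return np.ones_like(abs_errors)
    return np.cumsum(abs_errors) / total

# Example: large unresolved errors late in the episode make early targets unreliable.
rho = reliability_scores([0.1, 0.2, 0.3, 2.0])
print(rho)  # -> [0.038..., 0.115..., 0.230..., 1.0]; the terminal transition always has rho = 1
```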
2. Priority Formulation and Sampling Distribution
Standard proportional PER assigns transition priorities proportional to the absolute TD-error, $p_i \propto |\delta_i|^{\alpha}$, with $\alpha \geq 0$ controlling prioritization sharpness. In ReaPER, transition priority is adjusted to account for both TD-error and reliability:
$$p_i = |\delta_i|^{\alpha}\, \rho_i^{\nu},$$
with exponents $\alpha, \nu \geq 0$. The associated per-transition sampling probability is normalized as
$$P(i) = \frac{p_i}{\sum_{k=1}^{N} p_k},$$
where $N$ is the buffer size. This formulation penalizes transitions that exhibit large but unreliable errors, emphasizing transitions with significant, trustworthy TD-updates. Tuning the exponents $\alpha$ and $\nu$ can mitigate adverse effects from asynchronous updates and priority outliers.
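A minimal sketch of the priority and sampling-probability computation described above; the exponent defaults and the small $\epsilon$ (a common PER convention to keep zero-error transitions sampleable) are illustrative assumptions, not values from the paper.

```python
import numpy as np

def reaper_priorities(abs_td_errors, reliabilities, alpha=0.6, nu=1.0, eps=1e-6):
    """Combine TD-error magnitude and reliability into ReaPER-style priorities.

    p_i = (|delta_i| + eps)^alpha * rho_i^nu
    """
    abs_td_errors = np.asarray(abs_td_errors, dtype=np.float64)
    reliabilities = np.asarray(reliabilities, dtype=np.float64)
    return (abs_td_errors + eps) ** alpha * reliabilities ** nu

def sampling_probabilities(priorities):
    """Normalize priorities over the buffer into a sampling distribution P(i)."""
    priorities = np.asarray(priorities, dtype=np.float64)
    return priorities / priorities.sum()

# A large but unreliable error (rho = 0.1) is de-emphasized relative to a
# moderate, trustworthy one (rho = 1.0).
p = reaper_priorities([2.0, 0.5, 0.1], [0.1, 1.0, 1.0])
print(sampling_probabilities(p))
```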
3. Algorithmic Structure and Implementation
The ReaPER algorithm can be embedded in off-policy methods such as DQN or DDQN. Its core elements include per-episode reliability tracking, dynamic priority adjustment, and importance sampling for gradient stability. The main algorithmic flow is as follows (a minimal sampling sketch appears after the list):
- Maintain a replay buffer of transitions, with each transition indexed by both buffer position and episode ID.
- For each newly encountered transition, record its TD-error and assign the current maximum priority, so that it is sampled at least once.
- Upon episode termination, calculate exact reliability scores for all transitions in the finished episode as per the definition above.
- For transitions in episodes that have not yet terminated, use a conservative reliability estimate based on the maximal sum of absolute TD-errors among all completed episodes.
- Periodically sample minibatches according to the current $P(i)$, applying per-sample importance weights $w_i = \left(N \cdot P(i)\right)^{-\beta}$, with $\beta$ annealed over training to control correction strength; the weights are normalized by their maximum so that $\max_i w_i = 1$.
- Gradient updates for network parameters incorporate the importance weights and corrected TD-errors.
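The sketch below illustrates one sampling-and-weighting step consistent with the flow above. It is a simplified stand-in (plain arrays instead of a sum-tree, hypothetical function names), not the paper's implementation.

```python
import numpy as np

def sample_minibatch(priorities, batch_size, beta, rng=None):
    """Sample transition indices by priority and return importance-sampling weights.

    priorities: per-transition priorities p_i = |delta_i|^alpha * rho_i^nu.
    beta: importance-sampling exponent, annealed toward 1 over training.
    """
    rng = np.random.default_rng() if rng is None else rng
    priorities = np.asarray(priorities, dtype=np.float64)
    probs = priorities / priorities.sum()          # P(i)
    n = len(priorities)
    idx = rng.choice(n, size=batch_size, p=probs)  # prioritized sampling
    weights = (n * probs[idx]) ** (-beta)          # w_i = (N * P(i))^(-beta)
    weights /= weights.max()                       # normalize so the maximum weight is 1
    return idx, weights

# Usage: the weights multiply each sample's TD loss before the gradient step.
idx, w = sample_minibatch(priorities=np.array([0.15, 0.66, 0.25]), batch_size=2, beta=0.4)
print(idx, w)
```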
A full recomputation of priorities and reliability scores scales with the buffer size per update; in practice, reliability scores are recalculated only for transitions in the affected episodes, so the per-update cost scales with the length of those episodes instead. Table 1 summarizes the priority computation difference between PER and ReaPER:
| Method | Priority Formula | Key Criterion |
|---|---|---|
| PER | $p_i \propto \lvert\delta_i\rvert^{\alpha}$ | TD-error magnitude |
| ReaPER | $p_i \propto \lvert\delta_i\rvert^{\alpha}\,\rho_i^{\nu}$ | Error magnitude and target reliability |
Buffer management leverages episode membership tracking for fast, targeted computations of reliability.
4. Theoretical Guarantees: Convergence and Variance Reduction
Convergence Speed Hierarchy
The theoretical analysis distinguishes the impact of target bias on TD-updates. For a TD-error $\delta_i$ and target bias $b_i$, the expected reduction in squared value error from updating on transition $i$ scales (up to constant factors) as
$$\Delta_i \propto \delta_i(\delta_i + b_i),$$
indicating that misaligned targets (large $|b_i|$ of opposite sign to $\delta_i$) can reverse or hinder correct value updates.
Under a sampling distribution $q$ and learning rate $\eta$, the expected squared error decrement per update is (to first order in $\eta$)
$$\mathbb{E}_q[\Delta L] \approx -\eta \sum_i q_i\,\delta_i^2 \;-\; \eta \sum_i q_i\,\delta_i b_i.$$
The last term is a bias–error interaction, which ReaPER aims to control via the reliability adjustment.
Assuming the target bias can be bounded in terms of the reliability as $|b_i| \leq (1 - \rho_i)\,|\delta_i|$, it follows that
$$\delta_i(\delta_i + b_i) \;\geq\; \rho_i\,\delta_i^2 \;\geq\; 0.$$
Thus, adequately weighting transitions by $\rho_i$ restricts detrimental bias contributions.
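As a concrete illustration of this bound with hypothetical numbers (not drawn from the paper): if $\delta_i = 2$ and $\rho_i = 0.75$, the bias is bounded by $|b_i| \leq 0.25 \cdot 2 = 0.5$, so even in the worst case $b_i = -0.5$ the update term satisfies
$$\delta_i(\delta_i + b_i) = 2 \cdot 1.5 = 3 = \rho_i\,\delta_i^2,$$
i.e., the update still reduces the squared error at a rate of at least $\rho_i\,\delta_i^2$.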
The expected error after $k$ updates then satisfies
$$\mathbb{E}\!\left[L_k^{\text{ReaPER}}\right] \;\leq\; \mathbb{E}\!\left[L_k^{\text{PER}}\right] \;\leq\; \mathbb{E}\!\left[L_k^{\text{uniform}}\right],$$
reflecting the hierarchy: uniform sampling is suboptimal, PER is better, and ReaPER achieves improved convergence through bias-variance trade-off management.
Variance Reduction
For minimizing update variance under the constraint $\sum_i q_i = 1$, the optimal distribution is
$$q_i^{*} = \frac{|g_i|}{\sum_k |g_k|},$$
where $g_i$ denotes the expected per-transition update. Under the target bias assumption, $|g_i| \propto \rho_i\,|\delta_i|$, so ReaPER's sampling distribution matches the variance-optimal weights, directly reducing update variance compared to conventional PER.
5. Empirical Results: Low- and High-Dimensional Benchmarks
The empirical evaluation covers both low-dimensional classical control tasks (CartPole-v1, Acrobot-v1, LunarLander-v2) and the high-dimensional Atari-10 benchmark (ten games representing 80% of the full Atari-57 score variance).
Notable findings include:
- Acrobot-v1: ReaPER reaches convergence in 21.35% fewer environment steps than PER.
- CartPole-v1: ReaPER reaches the performance threshold in 23.24% fewer steps than PER.
- LunarLander-v2: ReaPER attains a score of 200 in 95% of runs (PER: 80%), requiring on average 29.49% fewer steps to do so.
- Atari-10: ReaPER outperforms PER in 8/10 games and matches it in the remaining 2, with a higher average peak score. Normalized cumulative reward curves demonstrate both earlier and higher performance for ReaPER.
Improvements consistently favor ReaPER across all tasks, with minimal overlap in standard deviations and robust gains in sample efficiency and peak performance (Pleiss et al., 23 Jun 2025).
6. Practical Implementation Considerations
Key implementation details involve:
- The episode-ID vector for efficient tracking of episode boundaries.
- A running statistic, the maximum per-episode sum of absolute TD-errors over all terminated episodes, to provide a conservative bound on reliability scores for unfinished episodes (see the sketch following this list).
- Online update strategies that avoid full-buffer recomputation; the practical per-update cost is controlled by recalculating reliability only within the affected episodes.
- Weighted importance sampling to correct for the non-uniform transition distribution, as in PER.
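Below is a minimal sketch of the conservative reliability estimate for transitions in a still-running episode, assuming the running maximum of per-episode absolute TD-error sums described above; variable and function names are illustrative, not from the paper.

```python
import numpy as np

def provisional_reliability(abs_td_errors_so_far, max_completed_episode_error_sum):
    """Conservative reliability estimate for an episode that has not yet terminated.

    The unknown total episode error is replaced by the largest absolute TD-error
    sum observed over any completed episode, giving a pessimistic estimate of rho_t.
    """
    abs_errors = np.abs(np.asarray(abs_td_errors_so_far, dtype=np.float64))
    # The denominator is at least the running episode's own error mass so far.
    denom = max(max_completed_episode_error_sum, abs_errors.sum())
    if denom == 0.0:
        return np.ones_like(abs_errors)
    return np.cumsum(abs_errors) / denom

# Usage: mid-episode, with the largest completed-episode error sum being 5.0.
print(provisional_reliability([0.4, 0.6, 1.0], max_completed_episode_error_sum=5.0))
# -> [0.08, 0.2, 0.4]; replaced by exact reliability scores once the episode terminates.
```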
Hyperparameters for ReaPER (the TD-error exponent $\alpha$, the reliability exponent $\nu$, and the importance-sampling exponent $\beta$) are tuned on a per-domain basis, with $\beta$ annealed from 0.4 to 1.0 over training.
7. Significance and Outlook
ReaPER introduces an explicit mechanism to penalize sampling of transitions whose bootstrapped targets are contaminated by unresolved downstream TD-errors. By sampling in proportion to $|\delta_i|^{\alpha}\,\rho_i^{\nu}$, ReaPER provides enhanced control over detrimental bias–error interactions and aligns the transition-selection distribution closely with the variance-optimal regime. Empirical and theoretical results jointly establish its superiority over vanilla PER in both convergence rate and variance reduction across a diversity of low- and high-dimensional benchmarks (Pleiss et al., 23 Jun 2025).
A plausible implication is that reliability-modulated sampling can generalize to other off-policy reinforcement learning algorithms and more complex domains, though further empirical validation is required. This framework establishes a foundation for future developments in experience prioritization and bias-aware transition selection in reinforcement learning.