Reliability-adjusted PER (ReaPER)

Updated 3 December 2025

ReaPER is a reinforcement learning method that integrates a temporal reliability measure into Prioritized Experience Replay to better weight TD-error magnitudes.
It combines episodic reliability scores with TD-error magnitudes using adjustable exponents, leading to more informative gradient updates and reduced bias.
Empirical evaluations on control tasks and Atari benchmarks demonstrate that ReaPER accelerates convergence and improves performance with modest computational overhead.

Reliability-adjusted Prioritized Experience Replay (ReaPER) is a transition sampling scheme for online reinforcement learning that extends Prioritized Experience Replay (PER) by integrating a novel temporally informed reliability measure for temporal-difference (TD) errors. ReaPER leverages theoretical and empirical observations that TD targets near an episode’s termination have increased reliability, and it combines standard TD-error magnitude with an episodic reliability score to sample transitions more efficiently. Empirical studies with Double DQN agents on both classic control environments and the Atari-10 benchmark demonstrate that ReaPER yields accelerated learning and improved score metrics compared to PER, with modest computational overhead relative to baseline approaches (Pleiss et al., 23 Jun 2025).

1. Temporal-Difference Error Reliability

ReaPER builds upon the recognition that not all TD-errors equally reflect the true quality of sampled transitions. For an episode of length $n$ , each transition $C_t = (S_t, A_t, R_t, S_{t+1}, d_t)$ receives a TD-error defined as

$\delta_t = Q_{\rm target}(S_t) - Q(S_t, A_t), \qquad Q_{\rm target}(S_t) = R_{t+1} + (1-d_{t+1})\,\gamma\max_a Q(S_{t+1},a)$

where $\delta_t^+ = |\delta_t|$ is the TD-error magnitude. The reliability measure $\mathcal{R}_t \in [0,1]$ quantifies the fraction of unresolved future error for a transition. For terminated episodes: $\mathcal{R}_t = 1 - \frac{\sum_{i=t+1}^n \delta_i^+}{\sum_{i=1}^n \delta_i^+} \tag{4}$ For ongoing episodes, $\mathcal{R}_t$ is conservatively estimated using $F = \max_{e \in \text{buffer}} \sum_{i \in e} \delta_i^+$ : $\mathcal{R}_t = 1 - \frac{F - \sum_{i=1}^t \delta_i^+}{F} \tag{14}$ This formulation encodes the temporal dependency inherent in TD-error propagation and systematically privileges transitions whose downstream TD-errors have resolved.

2. Reliability-Adjusted Priority Formulation

Standard PER samples transitions in proportion to their TD-error magnitude raised to an exponent, $(\delta_t^+)^\alpha$ . ReaPER adjusts this by integrating reliability: $\Psi_t = \mathcal{R}_t^\omega \, (\delta_t^+)^\alpha, \qquad p_t = \frac{\Psi_t}{\sum_i \Psi_i} \tag{18}$ where $\alpha, \omega \in (0,1]$ stabilize the priority weights in practice. This approach selectively down-weights transitions with less reliable target estimates, thereby reducing misleading gradient updates from unresolved future TD errors.

3. Transition Sampling and Algorithmic Procedure

ReaPER’s sampling scheme augments PER by maintaining, updating, and utilizing both TD-error magnitude and episodic reliability scores:

Each transition in the buffer retains $(\delta_t^+, \mathcal{R}_t, p_t)$ and an episode-ID vector $\phi$ .
New transitions are inserted with default settings, and ongoing episodes receive conservative reliability updates.
For each training step, priorities are recomputed using the latest reliability estimation; minibatches are sampled according to $p_t$ .
Importance-sampling weights are adjusted as $w_j = (N\,p_j)^{-\beta} / \max_i (N\,p_i)^{-\beta}$ , with $\beta$ annealed.
On episode completion, precise reliability scores are retroactively assigned using the terminated formula.

Fundamental modifications from PER include the maintenance of episodic sums for $\mathcal{R}_t$ , the usage of reliability-adjusted priority $\Psi_t$ , regularization exponents, and handling for ongoing episodes based on buffer maxima.

4. Theoretical Properties and Guarantees

ReaPER’s design is supported by a sequence of theoretical results:

Assumption 3.1 relates target bias $\varepsilon_t$ to cumulative downstream TD errors: $|\varepsilon_t| \leq \lambda \sum_{i=t+1}^n \delta_i^+$ .
Lemma 3.2 bounds target bias by reliability: $|\varepsilon_t| \leq \lambda (1-\mathcal{R}_t) \sum_{i=1}^n \delta_i^+$ .
Lemma 3.3 decomposes the expected update in squared error: contributions include variance, error reduction, and bias interaction.
Proposition 3.4 (Convergence Hierarchy) asserts $\mathbb{E}\|Q_T^{\rm Uniform}-Q^\star\|^2 \geq \mathbb{E}\|Q_T^{\rm PER}-Q^\star\|^2 \geq \mathbb{E}\|Q_T^{\rm ReaPER}-Q^\star\|^2$ , formalizing that ReaPER accelerates convergence over PER by suppressing bias-error interaction.
Proposition 3.5 (Variance Reduction) identifies the optimal sampling distribution $\mu_t^\star \propto \delta_t^+ / \sigma_t^2$ , with ReaPER’s weighting $\mu_t \propto \mathcal{R}_t\,\delta_t^+$ approximating this criterion under $\sigma_t^2 \propto 1/\mathcal{R}_t$ .

This framework provides the formal basis for ReaPER’s preference for more reliable transitions as identified by episodic TD-error dynamics.

5. Empirical Performance and Benchmarks

Comprehensive evaluation was conducted on both low-complexity (CartPole, Acrobot, LunarLander) and high-complexity (Atari-10) environments using Double DQN agents. Experiment settings included large replay buffers ( $10^6$ ), batch sizes (32 for Atari; 64–128 for small environments), and identical learning rates and architectures for PER and ReaPER. The following key results were recorded:

Environment	PER Steps (Mean, SD)	ReaPER Steps (Mean, SD)	Speedup (%)
Acrobot	18,500 (2,356)	14,550 (3,528)	21.35
CartPole	20,650 (6,048)	15,850 (7,601)	23.24
LunarLander	54,600 (30,776)	38,500 (24,270)	29.49

In LunarLander, ReaPER achieved threshold attainment in 95% of runs versus 80% for PER. On Atari-10, peak score increases averaged 24.37% improvement and median normalized learning curves showed ReaPER outperforming PER consistently throughout training.

A plausible implication is that reliability-adjusted sampling more effectively focuses updates on informative experiences, reducing the deleterious impact of unresolved future errors and resulting in faster and more stable learning across diverse tasks.

6. Computational Complexity Considerations

Naïve per-transition update of reliability scores $\{\mathcal{R}_t\}$ incurs $O(N)$ time complexity whenever a TD-error changes, where $N$ is buffer size. However, efficient update is achieved by restricting prefix/suffix sum recomputation to affected episodes, reducing the cost to $O(n-t)$ for episode length $n$ . In practical settings where episode lengths are small relative to buffer capacity, this adds only modest overhead compared to PER’s segment tree updates.

7. Relevance, Limitations, and Outlook

ReaPER retains simplicity and low computational overhead comparable to PER while providing both theoretical and empirical advantages in bias reduction, convergence speed, and variance minimization. Its episodic-reliability perspective introduces finer control over transition prioritization, especially in environments characterized by temporally compounding prediction errors. Both bias-error decomposition and convergence hierarchy support the broad applicability of ReaPER, and benchmarks validate efficiency gains. Limitations include the incremental cost of reliability computation in long episodes and reliance on episode segmentation for reliability estimation. Potential areas for further investigation include adaptation for continual learning settings and integration with alternative buffer management schemes.

For full technical development and all experimental methodology, see "Reliability-Adjusted Prioritized Experience Replay" by Pleiss et al. (Pleiss et al., 23 Jun 2025).

PDF Markdown Chat (Pro)

References (1)

Reliability-Adjusted Prioritized Experience Replay (2025)

Whiteboard

Generate a whiteboard explanation of this topic.

Follow Topic

Get notified by email when new papers are published related to Reliability-adjusted PER (ReaPER).