Prioritized Experience Replay (PER)

Updated 3 December 2025
  • Prioritized Experience Replay selectively samples replay transitions according to saliency signals such as TD error or Bayesian uncertainty to improve the sample efficiency of deep reinforcement learning.
  • PER variants leverage transition-level statistics to focus replay on experiences with high epistemic value, so that each update yields a large expected parameter improvement.
  • In continual learning, UPER couples uncertainty-based buffer sampling with regularization strategies to mitigate catastrophic forgetting and maintain class balance.

Prioritized Experience Replay (PER) is an algorithmic framework for improving sample efficiency in value-based deep reinforcement learning (RL) via selective sampling of experience transitions, and has more recently been integrated into continual learning for long-tailed data streams. PER and its variants systematically leverage transition-level statistics—such as temporal difference (TD) error or Bayesian uncertainty—to focus replay on transitions that are expected to yield maximal parameter improvement.

1. Formal Principles of Prioritized Experience Replay

The canonical instantiation of PER prioritizes transitions in the replay buffer based on an externally computed saliency signal. In RL, this is typically the absolute TD error $|\delta|$ of each transition $(s, a, r, s')$ relative to its Bellman target $\Theta(s', r) = r + \gamma \max_{a'} Q(s', a')$. PER assigns each buffer entry $i$ a priority $p_i = |\delta_i| + \epsilon$, where $\epsilon > 0$ ensures all transitions are sampled with nonzero probability. Sampling is performed with probability $P(i) = p_i^\alpha / \sum_k p_k^\alpha$ for a user-set exponent $\alpha \in [0, 1]$, and the resulting bias is corrected with importance-sampling weights $w_i \propto (N P(i))^{-\beta}$. This mechanism enables focused parameter updates while still maintaining coverage of the full state space.
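
These mechanics fit in a short sketch. The following minimal NumPy implementation of proportional prioritization is illustrative only: it uses a flat priority array rather than the sum-tree of the original PER algorithm, and the class and method names (PrioritizedReplayBuffer, add, sample, update_priorities) are assumptions, not taken from any particular library.

```python
import numpy as np

class PrioritizedReplayBuffer:
    """Minimal proportional-PER sketch (flat priority array, not a sum-tree)."""

    def __init__(self, capacity, alpha=0.6, eps=1e-6):
        self.capacity, self.alpha, self.eps = capacity, alpha, eps
        self.storage = [None] * capacity              # transitions (s, a, r, s', m)
        self.priorities = np.zeros(capacity)
        self.pos, self.size = 0, 0

    def add(self, transition):
        # New transitions receive the current maximum priority so that each
        # is replayed at least once before its priority is refined.
        max_p = self.priorities[:self.size].max() if self.size > 0 else 1.0
        self.storage[self.pos] = transition
        self.priorities[self.pos] = max_p
        self.pos = (self.pos + 1) % self.capacity
        self.size = min(self.size + 1, self.capacity)

    def sample(self, batch_size, beta=0.4):
        scaled = self.priorities[:self.size] ** self.alpha
        probs = scaled / scaled.sum()                 # P(i) = p_i^alpha / sum_k p_k^alpha
        idx = np.random.choice(self.size, batch_size, p=probs)
        weights = (self.size * probs[idx]) ** (-beta) # w_i proportional to (N P(i))^(-beta)
        weights /= weights.max()                      # normalize for update stability
        return [self.storage[i] for i in idx], idx, weights

    def update_priorities(self, idx, td_errors):
        self.priorities[idx] = np.abs(td_errors) + self.eps  # p_i = |delta_i| + eps
```

In practice $\beta$ is annealed toward $1$ over the course of training, and large buffers replace the flat array with a sum-tree so that sampling costs $O(\log N)$ rather than $O(N)$.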

Recent advances, notably Uncertainty-Prioritized Experience Replay (UPER) (Carrasco-Davis et al., 10 Jun 2025), have identified conceptual and statistical limitations of TD-error-based prioritization. TD error confounds reducible learning (epistemic) uncertainty with irreducible environment noise (aleatoric uncertainty), which can induce over-sampling of transitions generated by unpredictable random processes, a phenomenon that traps agents in "noisy TV" states and impedes effective exploration.

2. Bayesian Uncertainty-Based Prioritization

A principled alternative to TD-error prioritization is the use of epistemic uncertainty, which quantifies the portion of predictive risk that is reducible by further training and thereby distinguishes transitions that are inherently stochastic from those about which the agent simply lacks knowledge. Formally, under an ensemble-distributional RL architecture (a QR-DQN ensemble), the total predictive uncertainty of an experience $(s, a)$ is $\hat{U}_\delta(s, a) = \mathbb{E}_{\psi, \tau}[(\Theta(s', r) - \theta_\tau(s, a; \psi))^2]$. Aleatoric uncertainty is the irreducible variance $\hat{A}(s, a)$, and the epistemic component is $\hat{E}_\delta(s, a) = \hat{U}_\delta(s, a) - \hat{A}(s, a)$. UPER prioritizes samples with the information-theoretic gain $p_i = \frac{1}{2} \log\left(1 + \hat{E}_\delta(s_i, a_i) / \hat{A}(s_i, a_i)\right)$, which quantifies the expected reduction in posterior entropy, analogously to Bayesian experimental design. Samples with high epistemic and low aleatoric uncertainty are the most likely to improve predictive estimates.
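
The sketch below illustrates this decomposition for a single state-action pair: it computes total, aleatoric, and epistemic uncertainty from the quantile outputs of an ensemble and converts them into the information-gain priority. The array shapes and the variance-based aleatoric estimate are simplifying assumptions for clarity, not the exact estimator of Carrasco-Davis et al.

```python
import numpy as np

def uper_priority(quantiles, bellman_target, eps=1e-8):
    """Information-gain priority from an ensemble of quantile heads.

    quantiles:      array of shape (n_heads, n_quantiles), the predicted return
                    quantiles theta_tau(s, a; psi) for one (s, a) pair.
    bellman_target: scalar Bellman target Theta(s', r).
    """
    # Total uncertainty U: mean squared deviation of every quantile of every
    # head from the Bellman target, E_{psi, tau}[(Theta - theta_tau)^2].
    total = np.mean((bellman_target - quantiles) ** 2)

    # Aleatoric uncertainty A: spread of the return distribution itself,
    # approximated here by the per-head quantile variance, averaged over heads.
    aleatoric = np.mean(np.var(quantiles, axis=1))

    # Epistemic uncertainty E: the reducible remainder, clipped at zero.
    epistemic = max(total - aleatoric, 0.0)

    # Priority p_i = 1/2 * log(1 + E / A).
    return 0.5 * np.log1p(epistemic / (aleatoric + eps))

# Example: 5 heads with 32 quantiles each, and an arbitrary target.
rng = np.random.default_rng(0)
quantiles = rng.normal(loc=1.0, scale=0.3, size=(5, 32))
print(uper_priority(quantiles, bellman_target=1.5))
```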

3. Algorithmic Workflows and Implementation

The practical implementation of both TD-error PER and UPER involves a shared replay buffer of fixed size $C$, transitions stored as tuples $(s, a, r, s', m)$, and a prioritization/sampling loop as follows (a minimal usage sketch appears after the list):

  • For every environment step, the new transition is added to the buffer with an initial priority (typically the current maximum, so that it is replayed at least once).
  • During training updates, batches are sampled from the buffer based on priorities.
  • For each sampled transition $i$, importance weights $w_i$ are computed and applied to the update loss.
  • In UPER, uncertainty estimates $(U_\delta, A, E_\delta)$ are computed using ensembles and quantile outputs per head; priorities are updated after each update using the information gain formula.
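
The toy loop below strings these steps together using the hypothetical PrioritizedReplayBuffer sketched in Section 1, with random arrays standing in for environment interaction and placeholder TD errors in place of a real learner.

```python
import numpy as np

rng = np.random.default_rng(1)
buffer = PrioritizedReplayBuffer(capacity=1000, alpha=0.6)

for step in range(500):
    transition = (rng.normal(size=4),   # state s
                  rng.integers(4),      # action a
                  rng.normal(),         # reward r
                  rng.normal(size=4),   # next state s'
                  1.0)                  # mask m
    buffer.add(transition)              # stored with maximal initial priority

    if step % 4 == 0 and buffer.size >= 32:
        batch, idx, weights = buffer.sample(batch_size=32, beta=0.4)
        # Placeholder TD errors: a real agent would compute delta_i from its
        # Q-network and Bellman targets and minimize the importance-weighted
        # loss (weights * delta_i**2).mean() before updating priorities.
        td_errors = rng.normal(size=32)
        buffer.update_priorities(idx, td_errors)
        # Under UPER, priorities would instead be set from the information
        # gain 0.5 * log(1 + E_delta / A) computed from the ensemble heads.
```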

In continual learning and long-tailed incremental settings, UPER for CL augments this with Bayesian MI-based sampling over tasks (Liu et al., 27 Aug 2024). For each new data batch, Monte Carlo dropout (as a variational Bayesian estimator) is used to estimate the mutual information $I[Y, \theta \mid x, D]$ of each candidate sample. Reservoir sampling is tuned to favor samples with the highest MI, so that both minority-class and decision-boundary-supporting examples are retained in the buffer without requiring prior knowledge of the class distribution.
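
A minimal sketch of the MC-dropout MI estimate is shown below, assuming a PyTorch classifier whose dropout layers remain active during the stochastic forward passes. The predictive-entropy-minus-expected-entropy decomposition is the standard BALD-style estimator; the function name and interfaces are illustrative, not taken from the cited paper's code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def mc_dropout_mutual_information(model, x, n_samples=10, eps=1e-12):
    """Estimate I[Y, theta | x, D] per sample via Monte Carlo dropout.

    model: classifier containing dropout layers (kept active in train() mode).
    x:     input batch of shape (B, ...).
    Returns a tensor of shape (B,) with one MI score per candidate sample.
    """
    model.train()  # keep dropout stochastic for the MC forward passes
    probs = torch.stack(
        [F.softmax(model(x), dim=-1) for _ in range(n_samples)], dim=0
    )  # shape (n_samples, B, n_classes)

    mean_probs = probs.mean(dim=0)  # approximate predictive distribution
    predictive_entropy = -(mean_probs * (mean_probs + eps).log()).sum(dim=-1)
    expected_entropy = -(probs * (probs + eps).log()).sum(dim=-1).mean(dim=0)

    # MI = predictive entropy minus expected per-draw entropy: large values
    # mark samples on which the sampled models disagree (epistemic uncertainty).
    return predictive_entropy - expected_entropy
```

Samples with the largest MI scores are then favored by the tuned reservoir-sampling rule when deciding which incoming examples enter the fixed-size buffer.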

4. Regularization Strategies: Boundary and Prototype Constraints

In the context of long-tailed continual learning, uncertainty-prioritized replay is complemented by prior-free regularization mechanisms designed to mitigate catastrophic forgetting. The boundary constraint consists of a knowledge-distillation term

$$\mathcal{L}_{\text{KL}} = \sum_{i=1}^{C} q^*_i(x) \log \frac{q^*_i(x)}{q_i(x)},$$

applied to replayed buffer samples, where $q^*_i$ and $q_i$ are the softmax outputs of the previous and current model heads, respectively. The prototype constraint operates by enforcing consistency among cosine-normalized classifier weights, $\mathcal{L}_{\text{proto}} = \sum_{i=1}^{c-1} \|\hat{w}_i - \hat{w}_i^*\|_2^2$, where $\hat{w}_i$ and $\hat{w}_i^*$ denote the normalized current and previous class centroids (prototypes). These two constraints ensure preservation of decision boundaries and balanced class prototypes during sequential learning.
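
Both constraints reduce to a few lines of PyTorch, as in the sketch below; the function names are illustrative, the distillation term follows the temperature-free formula above, and the prototype term assumes the classifier weight rows for previously seen classes are available from the old and new models.

```python
import torch
import torch.nn.functional as F

def boundary_constraint(old_logits, new_logits, eps=1e-12):
    """Knowledge distillation on replayed samples:
    L_KL = sum_i q*_i log(q*_i / q_i), averaged over the batch."""
    q_star = F.softmax(old_logits, dim=-1)        # previous model head (teacher)
    log_q = F.log_softmax(new_logits, dim=-1)     # current model head (student)
    return (q_star * ((q_star + eps).log() - log_q)).sum(dim=-1).mean()

def prototype_constraint(old_weights, new_weights):
    """L_proto = sum_i ||w_hat_i - w_hat*_i||_2^2 over previously seen classes,
    with classifier weight rows normalized to unit length (cosine normalization)."""
    w_star = F.normalize(old_weights, dim=-1)     # previous prototypes w_hat*_i
    w = F.normalize(new_weights, dim=-1)          # current prototypes w_hat_i
    return ((w - w_star) ** 2).sum()

# Toy usage: 4 previously seen classes, feature dimension 8, batch of 16.
old_logits, new_logits = torch.randn(16, 4), torch.randn(16, 4)
old_w, new_w = torch.randn(4, 8), torch.randn(4, 8)
reg_loss = boundary_constraint(old_logits, new_logits) + prototype_constraint(old_w, new_w)
```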

5. Empirical Evaluation and Benchmarks

Comprehensive empirical assessments have been performed for both RL and continual learning scenarios. In RL, UPER demonstrates superior sample efficiency over standard TD-error PER and uniform experience replay in diverse settings, including:

  • Multi-armed bandit tasks: UPER matches oracle bias-based prioritization and avoids over-sampling noisy arms.
  • Noisy tabular gridworlds: UPER concentrates sampling on informative paths, resulting in accelerated learning.
  • Atari-57 suite: Median human-normalized scores exhibit clear UPER gains over QR-DQN baselines and ensemble+PER. Ablations on prioritization variables confirm the necessity of weighing epistemic against aleatoric uncertainty.

In long-tailed continual learning, UPER achieves state-of-the-art performance across benchmarks such as Seq-CIFAR-10-LT, Seq-CIFAR-100-LT, and Seq-TinyImageNet-LT under class-incremental (Class-IL) and task-incremental (Task-IL) protocols. For Class-IL with imbalance ratio $0.01$ and buffer size $200$, UPER improves test accuracy by $4.6$–$13.3$ percentage points over DER++ (Liu et al., 27 Aug 2024). Backward transfer (BWT) metrics consistently show reduced forgetting, especially among minority classes.

6. Conceptual Context: Information Gain and Sample Efficiency

The theoretical motivation for uncertainty-prioritized replay lies in maximizing the reduction of epistemic uncertainty—i.e., using replay samples to efficiently drive the model towards the Bayes-optimal predictor. TD-error prioritization conflates reducible and irreducible uncertainties; pure epistemic selection, without considering data fidelity, may over-prioritize samples that are fundamentally unpredictable. The information gain–based criterion in UPER balances these effects, targeting transitions and data points with highest learning potential and informativeness. This approach is mathematically equivalent to maximizing expected entropy reduction in a Gaussian posterior, and consistently yields improved learning dynamics.
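
For concreteness, the equivalence under a Gaussian model is a one-line identity (a standard fact, stated here as an illustration rather than quoted from the cited papers): if the total predictive variance is $\hat{A} + \hat{E}$ and the irreducible variance is $\hat{A}$, then the achievable reduction in differential entropy is

$$\Delta H = \tfrac{1}{2}\log\bigl(2\pi e\,(\hat{A} + \hat{E})\bigr) - \tfrac{1}{2}\log\bigl(2\pi e\,\hat{A}\bigr) = \tfrac{1}{2}\log\left(1 + \frac{\hat{E}}{\hat{A}}\right),$$

which is exactly the UPER priority $p_i$ defined in Section 2.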

A plausible implication is that the combination of uncertainty estimation, buffer sampling tuned to information gain, and continual regularization will be increasingly essential as data streams become both larger and more imbalanced. The minimal computational overhead of uncertainty estimation in ensemble or dropout architectures positions UPER and related methods as foundational components in scalable RL and continual learning pipelines.

PER and UPER share conceptual linkages with other replay strategies such as count-based prioritization, novelty-based exploration, and Bayesian active learning in RL. However, direct reliance on TD-error or count-based variables can be problematic under stochastic or adversarially noisy conditions. The principal limitation of uncertainty-prioritized methods is the added computational requirement for ensemble prediction and uncertainty decomposition, though empirical results indicate that the cost is marginal relative to sample efficiency gains. In continual learning, the effectiveness of MI-driven buffer sampling depends on the fidelity of underlying uncertainty approximations; future work on more expressive posterior approximations could further improve the selection of boundary and minority samples.

In summary, prioritized experience replay has evolved from TD-error-based selection to sophisticated Bayesian uncertainty-based criteria that maximize expected learning progress, maintain balanced representation in continual learning, and minimize catastrophic forgetting and bias. The integrated framework of UPER represents the current state-of-the-art for sample-efficient replay across both RL and long-tailed continual learning domains (Carrasco-Davis et al., 10 Jun 2025, Liu et al., 27 Aug 2024).
