Prioritized Experience Replay (PER)
- Prioritized Experience Replay selectively samples transitions, using signals such as TD error or Bayesian uncertainty, to improve the sample efficiency of deep reinforcement learning.
- It leverages transition-level statistics to focus replay on experiences with high epistemic value, so that updates yield the largest expected parameter improvement.
- In continual learning, UPER integrates uncertainty-based sampling with regularization strategies to mitigate catastrophic forgetting and maintain class balance.
Prioritized Experience Replay (PER) is an algorithmic framework for improving sample efficiency in value-based deep reinforcement learning (RL) via selective sampling of experience transitions, and has more recently been integrated into continual learning for long-tailed data streams. PER and its variants systematically leverage transition-level statistics—such as temporal difference (TD) error or Bayesian uncertainty—to focus replay on transitions that are expected to yield maximal parameter improvement.
1. Formal Principles of Prioritized Experience Replay
The canonical instantiation of PER prioritizes transitions in the replay buffer based on an externally computed saliency signal. In RL, this is typically the absolute TD error $|\delta_i|$ of each transition relative to its Bellman target, $\delta_i = r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t)$. PER assigns each buffer entry a priority $p_i = |\delta_i| + \epsilon$, where $\epsilon > 0$ ensures all transitions are sampled with nonzero probability. Sampling is performed with probability $P(i) = p_i^{\alpha} / \sum_k p_k^{\alpha}$ for a user-set exponent $\alpha \geq 0$, and the induced bias is corrected with importance-sampling weights $w_i = (N \cdot P(i))^{-\beta}$, normalized by $\max_i w_i$, where $N$ is the buffer size. This mechanism enables focused parameter updates while still maintaining coverage of the full state space.
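A minimal sketch of this proportional scheme is given below, assuming a flat priority list rather than the sum-tree used in efficient implementations; the class and method names are illustrative, not a reference implementation.

```python
import numpy as np

class ProportionalReplay:
    """Illustrative proportional PER buffer (sketch, not the published code)."""

    def __init__(self, capacity, alpha=0.6, beta=0.4, eps=1e-6):
        self.capacity, self.alpha, self.beta, self.eps = capacity, alpha, beta, eps
        self.storage, self.priorities = [], []

    def add(self, transition):
        # New transitions receive the current maximum priority so they are
        # replayed at least once before their TD error is known.
        p_max = max(self.priorities, default=1.0)
        if len(self.storage) >= self.capacity:           # FIFO eviction
            self.storage.pop(0)
            self.priorities.pop(0)
        self.storage.append(transition)
        self.priorities.append(p_max)

    def sample(self, batch_size):
        scaled = np.asarray(self.priorities) ** self.alpha
        probs = scaled / scaled.sum()                    # P(i) = p_i^alpha / sum_k p_k^alpha
        idx = np.random.choice(len(self.storage), batch_size, p=probs)
        weights = (len(self.storage) * probs[idx]) ** (-self.beta)
        weights /= weights.max()                         # normalize IS weights for stability
        return idx, [self.storage[i] for i in idx], weights

    def update_priorities(self, idx, td_errors):
        for i, delta in zip(idx, td_errors):
            self.priorities[i] = abs(delta) + self.eps   # p_i = |delta_i| + eps
```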
Recent advances, notably Uncertainty-Prioritized Experience Replay (UPER) (Carrasco-Davis et al., 10 Jun 2025), have identified conceptual and statistical limitations of TD-error-based prioritization. TD error confounds reducible learning uncertainty with irreducible environment noise (aleatoric uncertainty), which can induce over-sampling of transitions generated by unpredictable random processes—a phenomenon which traps agents in "noisy TV" states and impedes effective exploration.
2. Bayesian Uncertainty-Based Prioritization
A principled alternative to TD-error prioritization is the use of epistemic uncertainty. Epistemic uncertainty quantifies the portion of predictive risk that is reducible by further training, distinguishing transitions that are merely stochastic from those about which the model lacks knowledge. Formally, under an ensemble-distributional RL architecture (a QR-DQN ensemble with $K$ heads, each predicting return quantiles $\theta_{k,j}$), the total predictive uncertainty for an experience is the variance over all quantiles of all heads, $\sigma^2_{\mathrm{tot}} = \mathrm{Var}_{k,j}(\theta_{k,j})$. Aleatoric uncertainty is the irreducible within-head variance $\sigma^2_{\mathrm{ale}} = \mathbb{E}_k[\mathrm{Var}_j(\theta_{k,j})]$, and epistemic uncertainty is the remainder $\sigma^2_{\mathrm{epi}} = \sigma^2_{\mathrm{tot}} - \sigma^2_{\mathrm{ale}} = \mathrm{Var}_k(\mathbb{E}_j[\theta_{k,j}])$, i.e., the disagreement between head means. UPER prioritizes samples with an information gain of the form $\tfrac{1}{2}\log\!\left(1 + \sigma^2_{\mathrm{epi}} / \sigma^2_{\mathrm{ale}}\right)$, which quantifies the expected reduction in posterior mean entropy, analogously to Bayesian experimental design. Samples with high epistemic and low aleatoric uncertainty are most likely to improve predictive estimates.
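This decomposition and the resulting priority can be computed directly from the ensemble's quantile outputs, as in the sketch below; the array layout and function name are assumptions, not the published implementation.

```python
import numpy as np

def information_gain_priority(quantiles, eps=1e-8):
    """Priority from an ensemble of quantile heads (illustrative sketch).

    quantiles: array of shape (K, M) -- K ensemble heads, each predicting
    M return quantiles for a single (state, action) pair.
    """
    total_var = quantiles.var()                    # total predictive variance
    aleatoric_var = quantiles.var(axis=1).mean()   # mean within-head spread
    epistemic_var = quantiles.mean(axis=1).var()   # disagreement of head means
    # Law of total variance: total = aleatoric + epistemic (up to float error).
    assert np.isclose(total_var, aleatoric_var + epistemic_var)
    # Expected entropy reduction of a Gaussian posterior with this split.
    return 0.5 * np.log1p(epistemic_var / (aleatoric_var + eps))
```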
3. Algorithmic Workflows and Implementation
The practical implementation of both TD-error PER and UPER involves a shared replay buffer of fixed size $N$, transitions stored as tuples $(s_t, a_t, r_t, s_{t+1})$, and a prioritization/sampling loop as follows (a schematic loop is sketched after this list):
- For every environment step, transitions are added to the buffer with maximal initial priority, so each new transition is replayed at least once.
- During training updates, batches are sampled from the buffer based on priorities.
- For each sampled transition $i$, the importance weight $w_i$ is computed and applied to the update loss.
- In UPER, uncertainty estimates are computed from the ensemble's per-head quantile outputs; priorities are refreshed after each gradient step using the information-gain score above.
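A schematic training loop covering these steps might look as follows; `env`, `agent`, and the buffer interface are placeholders (the buffer matches the sketch in Section 1), and the Gym-style `env.step` signature and the value returned by `agent.update` are assumptions.

```python
def train(env, agent, buffer, num_steps, batch_size=32, update_every=4):
    """Schematic outer loop; all objects are hypothetical placeholders."""
    state = env.reset()
    for step in range(num_steps):
        action = agent.act(state)
        next_state, reward, done, _ = env.step(action)
        buffer.add((state, action, reward, next_state, done))  # max initial priority
        state = env.reset() if done else next_state

        if step % update_every == 0 and len(buffer.storage) >= batch_size:
            idx, batch, weights = buffer.sample(batch_size)
            # Importance weights rescale each sample's loss before the gradient
            # step; agent.update is assumed to return the new per-sample
            # priorities (absolute TD errors, or information-gain scores for UPER).
            new_priorities = agent.update(batch, weights)
            buffer.update_priorities(idx, new_priorities)
    return agent
```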
In continual learning and long-tailed incremental settings, UPER for CL augments this with Bayesian MI-based sample selection across tasks (Liu et al., 27 Aug 2024). For each new data batch, Monte Carlo dropout (acting as a variational Bayesian estimator) is used to estimate the mutual information between the model parameters and the prediction for each candidate sample. Reservoir sampling is tuned to favor samples with the highest MI, ensuring that both minority-class and decision-boundary-supporting examples are retained in the buffer without requiring prior knowledge of the class distribution.
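A sketch of this MI-driven selection is shown below, pairing a BALD-style MC-dropout estimator with a weighted reservoir update; the estimator details, function names, and the Efraimidis-Spirakis weighting are assumptions rather than the exact procedure of Liu et al.

```python
import random
import torch
import torch.nn.functional as F

def mc_dropout_mutual_information(model, x, n_passes=10):
    """BALD-style MI estimate via MC dropout (sketch; the cited estimator may differ)."""
    model.train()                        # keep dropout layers stochastic at inference
    with torch.no_grad():
        probs = torch.stack([F.softmax(model(x), dim=-1) for _ in range(n_passes)])
    mean_p = probs.mean(0)
    h_mean = -(mean_p * mean_p.clamp_min(1e-12).log()).sum(-1)        # H[E p]
    h_each = -(probs * probs.clamp_min(1e-12).log()).sum(-1).mean(0)  # E H[p]
    return h_mean - h_each               # per-sample mutual information

def weighted_reservoir_update(reservoir, item, mi_score, capacity):
    """Weighted reservoir step: higher-MI samples get larger keys and are more
    likely to be retained, without any class-distribution prior.

    mi_score: scalar MI estimate for this candidate sample.
    """
    key = random.random() ** (1.0 / max(mi_score, 1e-12))
    if len(reservoir) < capacity:
        reservoir.append((key, item))
    else:
        j = min(range(capacity), key=lambda k: reservoir[k][0])
        if key > reservoir[j][0]:
            reservoir[j] = (key, item)
```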
4. Regularization Strategies: Boundary and Prototype Constraints
In the context of long-tailed continual learning, uncertainty-prioritized replay is complemented by prior-free regularization mechanisms designed to mitigate catastrophic forgetting. The boundary constraint is a knowledge-distillation term of the form $\mathcal{L}_{\mathrm{bc}} = \mathrm{KL}\!\left(p^{t-1}(x)\,\|\,p^{t}(x)\right)$, applied to replayed buffer samples $x$, where $p^{t-1}$ and $p^{t}$ are the softmax outputs of the previous and current model heads, respectively. The prototype constraint enforces consistency among cosine-normalized classifier weights, e.g. $\mathcal{L}_{\mathrm{pc}} = \sum_{c}\left(1 - \hat{w}^{t}_{c}\cdot\hat{w}^{t-1}_{c}\right)$, where $\hat{w}^{t}_{c}$ and $\hat{w}^{t-1}_{c}$ denote normalized current and previous class centroids (prototypes). Together, these two constraints preserve decision boundaries and balanced class prototypes during sequential learning.
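Under the hedged loss forms above, both constraints reduce to a few standard tensor operations, as in the sketch below; the function names, temperature, and KL-based distillation form are assumptions.

```python
import torch
import torch.nn.functional as F

def boundary_constraint(logits_prev, logits_curr, temperature=2.0):
    """Distillation between previous and current softmax outputs on replayed
    samples (sketch; the exact distillation form and temperature are assumptions)."""
    p_prev = F.softmax(logits_prev.detach() / temperature, dim=-1)
    log_p_curr = F.log_softmax(logits_curr / temperature, dim=-1)
    return F.kl_div(log_p_curr, p_prev, reduction="batchmean") * temperature ** 2

def prototype_constraint(classifier_w_curr, classifier_w_prev):
    """Consistency between cosine-normalized class prototypes of the current
    and previous model, as mean cosine distance over classes."""
    w_curr = F.normalize(classifier_w_curr, dim=1)
    w_prev = F.normalize(classifier_w_prev.detach(), dim=1)
    return (1.0 - (w_curr * w_prev).sum(dim=1)).mean()
```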
5. Empirical Evaluation and Benchmarks
Comprehensive empirical assessments have been performed for both RL and continual learning scenarios. In RL, UPER demonstrates superior sample efficiency over standard TD-error PER and uniform ER on diverse settings including:
- Multi-armed bandit tasks: UPER matches oracle bias prioritization and avoids over-sampling noisy arms.
- Noisy tabular gridworlds: UPER concentrates sampling on informative paths, resulting in accelerated learning.
- Atari-57 suite: Median human-normalized scores exhibit clear UPER gains over QR-DQN baselines and ensemble+PER. Ablations on prioritization variables confirm the necessity of weighing epistemic against aleatoric uncertainty.
In long-tailed continual learning, UPER achieves state-of-the-art performance across benchmarks such as Seq-CIFAR-10-LT, Seq-CIFAR-100-LT, and Seq-TinyImageNet-LT under class-incremental (Class-IL) and task-incremental (Task-IL) protocols. For Class-IL with imbalance ratio $0.01$ and buffer size $200$, UPER improves test accuracy by $4.6$–$13.3$ percentage points against DER++ (Liu et al., 27 Aug 2024). Backward transfer (BWT) metrics consistently show reduced forgetting, especially among minority classes.
6. Conceptual Context: Information Gain and Sample Efficiency
The theoretical motivation for uncertainty-prioritized replay lies in maximizing the reduction of epistemic uncertainty: using replay samples to efficiently drive the model towards the Bayes-optimal predictor. TD-error prioritization conflates reducible and irreducible uncertainty, while uncertainty-based selection that ignores the aleatoric component may still over-prioritize samples that are fundamentally unpredictable. The information gain–based criterion in UPER balances these effects, targeting transitions and data points with the highest learning potential and informativeness. This approach is mathematically equivalent to maximizing expected entropy reduction in a Gaussian posterior, and consistently yields improved learning dynamics.
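For a single scalar return estimate this equivalence can be made explicit; with illustrative symbols $\sigma_e^2$ (epistemic, prior variance) and $\sigma_a^2$ (aleatoric, observation noise), a Gaussian update gives posterior variance $\sigma_{\mathrm{post}}^2 = \sigma_e^2\sigma_a^2/(\sigma_e^2 + \sigma_a^2)$, so the expected entropy reduction is

$$\Delta H = \tfrac{1}{2}\log\frac{\sigma_e^2}{\sigma_{\mathrm{post}}^2} = \tfrac{1}{2}\log\!\left(1 + \frac{\sigma_e^2}{\sigma_a^2}\right),$$

which is largest for transitions with high epistemic and low aleatoric uncertainty, matching the priority form given in Section 2.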
A plausible implication is that the combination of uncertainty estimation, buffer sampling tuned to information gain, and continual regularization will be increasingly essential as data streams become both larger and more imbalanced. The minimal computational overhead of uncertainty estimation in ensemble or dropout architectures positions UPER and related methods as foundational components in scalable RL and continual learning pipelines.
7. Related Methodologies and Limitations
PER and UPER share conceptual linkages with other replay strategies such as count-based prioritization, novelty-based exploration, and Bayesian active learning in RL. However, direct reliance on TD-error or count-based variables can be problematic under stochastic or adversarially noisy conditions. The principal limitation of uncertainty-prioritized methods is the added computational requirement for ensemble prediction and uncertainty decomposition, though empirical results indicate that the cost is marginal relative to sample efficiency gains. In continual learning, the effectiveness of MI-driven buffer sampling depends on the fidelity of underlying uncertainty approximations; future work on more expressive posterior approximations could further improve the selection of boundary and minority samples.
In summary, prioritized experience replay has evolved from TD-error-based selection to sophisticated Bayesian uncertainty-based criteria that maximize expected learning progress, maintain balanced representation in continual learning, and minimize catastrophic forgetting and bias. The integrated framework of UPER represents the current state-of-the-art for sample-efficient replay across both RL and long-tailed continual learning domains (Carrasco-Davis et al., 10 Jun 2025, Liu et al., 27 Aug 2024).