Uncertainty Prioritized Replay (UPER)
- Uncertainty Prioritized Replay (UPER) is a method that decomposes uncertainty into epistemic and aleatoric components to guide memory replay in machine learning.
- The approach integrates uncertainty with reward signals in both value-based and model-based RL, achieving higher scores and faster learning on benchmarks.
- In continual learning, UPER mitigates bias toward noisy or over-represented samples by focusing on high-uncertainty experiences, reducing forgetting and enhancing class balance.
Uncertainty Prioritized Replay (UPER) is a class of memory management and sampling strategies designed to guide the selection of experiences for replay in machine learning—in particular, deep reinforcement learning (RL) and continual learning—by prioritizing samples according to principled estimates of uncertainty, task-relevance, or both. Unlike classic Prioritized Experience Replay (PER), which typically selects transitions based on their temporal-difference (TD) error magnitude, UPER variants integrate epistemic and aleatoric uncertainty estimates, and/or combine them with measures of extrinsic reward or model error, to improve sample efficiency, learning stability, and robustness in environments characterized by noise, partial observability, or nonstationary data streams (Bierling et al., 24 Oct 2025, Carrasco-Davis et al., 10 Jun 2025, Liu et al., 27 Aug 2024).
1. Motivation and Theoretical Underpinnings
Traditional PER methods assign sampling priorities to buffer elements—transitions or trajectories—based on a proxy for learning progress, most commonly the TD-error's absolute value. However, this approach is susceptible to oversampling transitions with high aleatoric noise, which may not offer actionable gradients and can promote instability (e.g., the "noisy TV" problem). The central advance of UPER is to derive sampling priorities from a principled uncertainty decomposition, typically focusing on epistemic uncertainty (the portion improvable by further learning) and, in some settings, normalizing it by aleatoric noise (the irreducible intrinsic variance of the environment or model). For continual learning, the concept generalizes to uncertainty estimates over class boundaries or model prototypes, with UPER enabling bias-free rehearsal by focusing on hard-to-learn or minority-class samples without explicit prior knowledge.
2. Core Methodological Variants and Mathematical Formulations
Distinct UPER methodologies appear across RL and supervised/continual learning domains. The canonical example in value-based RL uses an ensemble of distributional Q-networks (e.g., QR-DQN) to decompose uncertainty:
- Epistemic uncertainty $\sigma^2_{\text{epi}}(s,a)$: ensemble disagreement, averaged across quantiles, for state-action pair $(s,a)$.
- Aleatoric uncertainty $\sigma^2_{\text{ale}}(s,a)$: variance across the predicted quantiles, averaged over the ensemble.
- Information-gain-based priority: $p_i = \tfrac{1}{2}\log\!\big(1 + \hat{\sigma}^2_{\text{epi}}(s,a)/\sigma^2_{\text{ale}}(s,a)\big)$, where $\hat{\sigma}^2_{\text{epi}}$ includes both target value prediction error and epistemic uncertainty.
Sampling probability for each transition is then given by $P(i) = \frac{p_i^{\alpha}}{\sum_k p_k^{\alpha}}$, with the exponent $\alpha$ controlling the degree of prioritization, and bias-correcting importance weights applied to each gradient update (Carrasco-Davis et al., 10 Jun 2025).
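As a concrete illustration, the following NumPy sketch computes the two uncertainty terms and the information-gain priority for a single transition, assuming an ensemble output of shape $(K, N)$ ($K$ heads, $N$ quantiles). The function name, the stabilizing epsilon, and the example values are illustrative assumptions, not the paper's reference implementation.

```python
import numpy as np

def uncertainty_priority(quantiles, td_error):
    """Information-gain-style priority from ensemble quantile predictions.

    `quantiles` has shape (K, N): K ensemble heads, each predicting
    N return quantiles for one (s, a) pair. Illustrative sketch only.
    """
    # Epistemic uncertainty: disagreement across ensemble heads,
    # averaged over the N quantile locations.
    sigma2_epi = quantiles.var(axis=0).mean()
    # Aleatoric uncertainty: spread across the predicted quantiles,
    # averaged over the K ensemble heads.
    sigma2_ale = quantiles.var(axis=1).mean()
    # Fold the target prediction error into the learnable term, as the
    # text describes for the numerator of the information gain.
    sigma2_learnable = sigma2_epi + td_error ** 2
    # Gaussian information-gain approximation: large when learnable
    # uncertainty dominates irreducible noise.
    return 0.5 * np.log1p(sigma2_learnable / (sigma2_ale + 1e-8))

# Example: 5 heads, 51 quantiles (QR-DQN-style), one transition.
rng = np.random.default_rng(0)
quantiles = rng.normal(loc=1.0, scale=0.3, size=(5, 51))
print(uncertainty_priority(quantiles, td_error=0.5))
```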
In model-based RL (e.g., DreamerV3-XP), UPER operates at the trajectory level. The trajectory priority is computed as a weighted combination $p_\tau = \lambda_R R_\tau + \lambda_V E_\tau + \lambda_L L_\tau$, where $R_\tau$ is the total discounted return, $E_\tau$ is the cumulative value error, $L_\tau$ is the total VAE reconstruction loss, and $\lambda_R, \lambda_V, \lambda_L$ are hyperparameters balancing each component. Sampling and updates follow the same exponentiated-priority framework as above (Bierling et al., 24 Oct 2025).
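A minimal sketch of this trajectory-level scoring, under the weighted-sum reading above; the weight names `lam_r`, `lam_v`, `lam_l` and their defaults are placeholders rather than the paper's tuned values:

```python
import numpy as np

def trajectory_priority(rewards, value_errors, recon_losses,
                        gamma=0.99, lam_r=1.0, lam_v=1.0, lam_l=1.0):
    """Trajectory priority as a weighted sum of discounted return,
    cumulative value error, and total reconstruction loss (sketch)."""
    rewards = np.asarray(rewards, dtype=float)
    discounts = gamma ** np.arange(len(rewards))
    R = np.sum(discounts * rewards)       # total discounted return
    E = np.sum(np.abs(value_errors))      # cumulative value error
    L = np.sum(recon_losses)              # total VAE reconstruction loss
    return lam_r * R + lam_v * E + lam_l * L
```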
In continual learning under long-tailed or nonstationary distributions, UPER can be instantiated via uncertainty-guided reservoir sampling. Here, input sample uncertainty is quantified via predictive entropy or Bayesian mutual-information (e.g., using Monte Carlo dropout), and the buffer prioritizes high-uncertainty samples, typically corresponding to minority or decision-boundary-adjacent data. Sample-in probabilities are derived from dynamic class frequency estimates, and two regularizers—boundary constraint (distillation loss on uncertain samples) and prototype constraint (consistency of normalized class prototypes)—stabilize learning over time (Liu et al., 27 Aug 2024).
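A sketch of the uncertainty scores such a buffer could rank on, assuming `probs` collects $T$ Monte Carlo dropout softmax passes over $C$ classes for one input; samples with high mutual information (the epistemic component, typically boundary-adjacent or minority-class data) are preferentially retained:

```python
import numpy as np

def mc_dropout_uncertainty(probs, eps=1e-12):
    """Predictive entropy and mutual information (BALD-style) from
    MC-dropout samples; `probs` has shape (T, C). Sketch only."""
    mean_p = probs.mean(axis=0)
    # Total uncertainty: entropy of the averaged predictive distribution.
    pred_entropy = -np.sum(mean_p * np.log(mean_p + eps))
    # Aleatoric component: expected entropy of the individual passes.
    expected_entropy = -np.mean(np.sum(probs * np.log(probs + eps), axis=1))
    # Epistemic component (mutual information) used to prioritize samples.
    return pred_entropy, pred_entropy - expected_entropy
```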
3. Algorithmic Flows, Update Rules, and Hyperparameters
The following table summarizes major UPER algorithmic components and representative hyperparameters:
| Variant/domain | Uncertainty estimator | Priority formula | Sampling/update rule |
|---|---|---|---|
| Value-based RL (QR-DQN) | Ensemble quantile disagreement | $p_i = \tfrac{1}{2}\log\!\big(1 + \hat{\sigma}^2_{\text{epi}}/\sigma^2_{\text{ale}}\big)$ | $P(i) \propto p_i^{\alpha}$, with IS weights |
| Model-based RL (DreamerV3-XP) | VAE recon. + critic value error | $p_\tau = \lambda_R R_\tau + \lambda_V E_\tau + \lambda_L L_\tau$ | $P(\tau) \propto p_\tau^{\alpha}$, with IS weights |
| LTCL/rehearsal | MC dropout, predictive MI | Predictive entropy / mutual information | Reservoir sampling, prioritized on uncertainty |
- Hyperparameter choices (examples): prioritization exponent $\alpha$ up to $0.7$, importance-sampling exponent $\beta$ annealed from $0.4$ to $1.0$, and boundary/prototype regularization weights up to $1.0$; ensemble size, quantile count, and number of MC-dropout passes are set per variant (Bierling et al., 24 Oct 2025, Carrasco-Davis et al., 10 Jun 2025, Liu et al., 27 Aug 2024).
- Priority updates are run after each iteration or after a fixed number of gradient steps, refreshing the uncertainty terms for recently sampled experiences.
- Importance sampling corrects for bias, with weights $w_i = \big(N \cdot P(i)\big)^{-\beta}$, normalized by their maximum.
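Putting these pieces together, a sketch of the exponentiated-priority sampler with importance-sampling correction; the defaults `alpha=0.6` and `beta=0.4` follow common PER practice rather than any specific UPER configuration:

```python
import numpy as np

def sample_with_is_weights(priorities, batch_size, alpha=0.6, beta=0.4, rng=None):
    """Sample indices with P(i) = p_i^alpha / sum_k p_k^alpha and return
    bias-correcting weights w_i = (N * P(i))^(-beta), max-normalized."""
    rng = rng or np.random.default_rng()
    p = np.asarray(priorities, dtype=float) ** alpha
    P = p / p.sum()
    idx = rng.choice(len(P), size=batch_size, p=P)
    w = (len(P) * P[idx]) ** (-beta)
    return idx, w / w.max()   # normalize weights for update stability
```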
4. Empirical Benefits, Scope, and Benchmark Results
UPER has demonstrated significant advantages in multiple domains:
- Reinforcement Learning:
- On Atari benchmarks, median human-normalized scores for UPER robustly exceed those of PER, QR-DQN-ensemble PER, and uniform QR-DQN over 200M frames (Carrasco-Davis et al., 10 Jun 2025).
- In tabular and toy domains, UPER matches or exceeds oracle- and count-based schemes in sample efficiency and value estimation MSE.
- For DreamerV3-XP, UPER achieves 20–40% lower model reconstruction and reward prediction loss, 15% reduced value-error variance, and up to 30% faster rise in episode returns during early training, particularly in sparse-reward and model-misaligned regimes (Bierling et al., 24 Oct 2025).
- Continual Learning:
- On long-tailed Seq-CIFAR-10/100-LT and TinyImageNet-LT, UPER-based Prior-free Balanced Replay attains 51.00% class-incremental and 89.99% task-incremental accuracy with buffer size 200, outperforming DER++ by 10–15 points; corresponding backward transfer and forgetting metrics also show consistent improvement (Liu et al., 27 Aug 2024).
These results consistently demonstrate that UPER mechanisms mitigate bias toward inherently noisy or over-represented samples while preserving (or enhancing) attention to rare, high-uncertainty, and high-reward experiences. A plausible implication is increased robustness in environments with high stochasticity or unbalanced class distributions.
5. Distinction from Standard PER and Related Approaches
Canonical PER (Schaul et al., 2015) assigns priorities solely based on the magnitude of TD error, thereby failing to separate learnable information (epistemic) from irreducible randomness (aleatoric). UPER methods distinguish themselves along several dimensions:
- Uncertainty decomposition: UPER leverages explicit modeling of epistemic vs. aleatoric uncertainty, typically using ensemble-based, Bayesian, or dropout-based estimators.
- Trajectory-level prioritization: In model-based RL, priorities are assigned to entire trajectories, which aligns with planning and imagination techniques (e.g., RSSM in Dreamer).
- Synergy of return and uncertainty: Combined return-plus-uncertainty priorities balance exploitation (learning from high-return trajectories) and exploration (targeting model weaknesses or knowledge deficits), a synergy absent from pure TD- or curiosity-driven criteria (Bierling et al., 24 Oct 2025).
- Continual Learning bias avoidance: UPER enforces rehearsal diversity and stability even without access to class frequency priors, providing a prior-free solution to catastrophic forgetting in unbalanced streams (Liu et al., 27 Aug 2024).
6. Computational Complexity and Practical Considerations
UPER introduces additional computational and storage costs, primarily from:
- Maintaining multiple ensemble heads (for epistemic/aleatoric estimation) and running Monte Carlo dropout passes.
- Computing mutual-information or information-gain scores, often requiring $\mathcal{O}(T\,K\,d)$ operations per batch ($T$ MC passes, $K$ ensemble heads, $d$ feature size).
- Buffer storage for uncertainty statistics or prototypes in continual learning (typically $\mathcal{O}(C\,d)$, with $C$ classes and $d$ the feature size) (Liu et al., 27 Aug 2024).
Batch parallelism on modern hardware generally mitigates these costs. For RL, GPU-based QR-DQN ensemble UPER runs at marginally higher wall-clock per-iteration time than standard DQN (Carrasco-Davis et al., 10 Jun 2025). Buffer sizes in continual learning (e.g., 200 samples, as in the benchmarks above) are sufficient for CIFAR-scale datasets.
7. Extensions, Limitations, and Future Directions
A primary limitation of UPER is the need for relatively sophisticated uncertainty estimation, such as ensemble disagreement or MC dropout, and the Gaussian approximation underpinning information-gain derivations. In settings where uncertainty is miscalibrated, prioritization may become biased or suboptimal. Potential extensions include:
- Substituting alternate epistemic estimators (e.g., pseudo-counts, Bayesian neural networks).
- Adapting UPER strategies beyond RL/continual learning to supervised and active learning contexts.
- Exploring alternative functional forms of priority to further offset model bias or temperature miscalibration (Carrasco-Davis et al., 10 Jun 2025).
- Evaluating UPER under resource-constrained or on-policy replay frameworks.
Collectively, UPER methods constitute a family of replay buffer management techniques that systematically distinguish learnable uncertainty from noise, prioritize experiences in a balanced and theoretically grounded manner, and demonstrate quantifiable advantages in representative benchmarks across learning paradigms.