Uncertainty Prioritized Experience Replay
- UPER is a reinforcement learning approach that distinguishes reducible epistemic uncertainty from irreducible aleatoric noise to guide transition sampling.
- It uses an ensemble of bootstrapped QR-DQN networks to estimate total, epistemic, and aleatoric uncertainty, and combines these estimates into an information-gain priority for sampling.
- Empirical evaluations in bandits, gridworld, and Atari games demonstrate UPER’s enhanced sample efficiency and improved performance compared to standard PER.
Uncertainty Prioritized Experience Replay (UPER) is a variant of prioritized experience replay that addresses the over-sampling of transitions driven by irreducible noise in reinforcement learning (RL). Standard prioritized experience replay (PER) samples transitions from a replay buffer in proportion to their absolute temporal difference (TD) error, yet this approach fails to distinguish between epistemic uncertainty (uncertainty due to lack of knowledge) and aleatoric uncertainty (uncertainty due to inherent randomness). UPER instead prioritizes transitions by their estimated epistemic uncertainty, modulated by aleatoric uncertainty using an information-gain criterion. This mechanism aims to maximize sample efficiency by focusing updates on transitions expected to most reduce model uncertainty (Carrasco-Davis et al., 10 Jun 2025).
1. Motivation and Problem Statement
In canonical PER (Schaul et al., 2016), transitions are sampled with probability $P(i) \propto |\delta_i|^\alpha$, where $\delta_i$ is the TD error of transition $i$, conflating statistical sources of error. RL agents in stochastic environments frequently encounter the "noisy-TV" problem, in which transitions with large TD error are prioritized even though the error is due to unlearnable aleatoric noise. This results in wasted updates on transitions that cannot improve policy performance. UPER explicitly distinguishes between reducible (epistemic) and irreducible (aleatoric) uncertainty, enabling focused replay sampling. The method substitutes $|\delta_i|$ with an information-theoretic priority that reflects the value of updating on a given transition for reducing model uncertainty.
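The effect is easy to reproduce in a small numerical illustration (not from the paper): once the estimate for a purely noisy transition has converged to its true mean, its absolute TD error remains on the order of the noise standard deviation, so PER keeps re-sampling it, whereas a learnable transition's TD error shrinks with training.

```python
# Minimal illustration of the "noisy-TV" issue with |TD|-based priorities.
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.99

# Transition A: deterministic reward 1.0, estimate still wrong by 0.5 (learnable).
td_learnable = (1.0 + gamma * 0.0) - 0.5          # = 0.5, shrinks as training proceeds

# Transition B: zero-mean reward with std 2.0, estimate already at the true mean 0.
noisy_rewards = rng.normal(0.0, 2.0, size=10_000)
td_noisy = np.abs(noisy_rewards + gamma * 0.0 - 0.0)

print(f"|TD| of learnable transition: {abs(td_learnable):.2f}")
print(f"E|TD| of converged noisy transition: {td_noisy.mean():.2f}")  # ~1.6, stays large
```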
2. Formal Uncertainty Decomposition
UPER adopts an uncertainty decomposition grounded in Direct Epistemic Uncertainty Prediction (DEUP, Lahlou et al., 2022):
- Let $Z = r + \gamma \max_{a'} Q(s', a')$ denote the one-step Bellman target.
- Total uncertainty at each $(s, a)$ is $\sigma^2_{\mathrm{tot}}(s,a) = \mathbb{E}\big[(Z - \hat{Q}(s,a))^2\big]$, the expected squared error of the current predictor $\hat{Q}$.
- Aleatoric uncertainty, $\sigma^2_{\mathrm{alea}}(s,a)$, is defined as $\mathrm{Var}(Z \mid s, a)$, i.e., the irreducible variance under the Bayes-optimal predictor.
- Epistemic uncertainty, $\sigma^2_{\mathrm{epi}}(s,a)$, is $\sigma^2_{\mathrm{tot}}(s,a) - \sigma^2_{\mathrm{alea}}(s,a)$.
In practice, UPER estimates these quantities using an ensemble of bootstrapped QR-DQN networks, each with multiple quantile heads. Given ensemble members $k = 1, \dots, K$ and quantile indices $j = 1, \dots, N$, each head outputs quantile values $\theta^k_j(s, a)$. The ensemble statistics provide empirical estimates for total, epistemic, and aleatoric uncertainty via the law of total variance (a minimal numerical sketch follows the table):
| Quantity | Symbol | Empirical Estimate |
|---|---|---|
| Total uncertainty | $\sigma^2_{\mathrm{tot}}$ | $\frac{1}{KN}\sum_{k,j}\big(\theta^k_j - \bar{\theta}\big)^2$, with $\bar{\theta} = \frac{1}{KN}\sum_{k,j}\theta^k_j$ |
| Aleatoric uncertainty | $\sigma^2_{\mathrm{alea}}$ | $\frac{1}{K}\sum_k \frac{1}{N}\sum_j \big(\theta^k_j - \bar{\theta}^k\big)^2$, with $\bar{\theta}^k = \frac{1}{N}\sum_j \theta^k_j$ |
| Epistemic uncertainty | $\sigma^2_{\mathrm{epi}}$ | $\frac{1}{K}\sum_k \big(\bar{\theta}^k - \bar{\theta}\big)^2$ |
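A minimal numerical sketch of this decomposition, assuming a NumPy array `theta` of shape `(K, N)` holding the ensemble's quantile outputs for a single state-action pair (an illustration, not the authors' code):

```python
import numpy as np

def uncertainty_decomposition(theta: np.ndarray):
    """Return (total, aleatoric, epistemic) variances for quantile outputs of shape (K, N)."""
    member_means = theta.mean(axis=1)      # shape (K,): mean return predicted by each head
    total = theta.var()                    # variance over all K*N quantile values
    aleatoric = theta.var(axis=1).mean()   # mean within-head quantile variance
    epistemic = member_means.var()         # variance of head means (ensemble disagreement)
    return total, aleatoric, epistemic     # total = aleatoric + epistemic (law of total variance)

# Example: 10 heads, 32 quantiles each, with a small amount of disagreement between heads.
rng = np.random.default_rng(0)
theta = rng.normal(loc=rng.normal(0, 0.3, size=(10, 1)), scale=1.0, size=(10, 32))
tot, alea, epi = uncertainty_decomposition(theta)
print(f"total={tot:.3f}  aleatoric={alea:.3f}  epistemic={epi:.3f}")
```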
UPER's refinement applies the same decomposition to the one-step Bellman target, capturing the target-to-prediction gap:
- Target-total uncertainty: the variance of the target quantiles $r + \gamma\,\theta^k_j(s', a^{*})$ over all heads and quantiles.
- Target epistemic uncertainty: the variance across ensemble members of the per-member target means.
3. Information-Gain Prioritization Scheme
The UPER prioritization replaces the PER TD-error criterion with an information-gain measure under a Gaussian surrogate:
- Reducible variance: $\sigma^2_{\mathrm{red}} = \sigma^2_{\mathrm{epi}} + d^2$, combining the ensemble disagreement with the squared distance $d^2$ between the ensemble-mean prediction and the ensemble-mean target (the bias component of the epistemic error).
- Irreducible variance: $\sigma^2_{\mathrm{irr}} = \sigma^2_{\mathrm{alea}}$, the aleatoric (within-head quantile) variance of the target.
- Priority: $p_i = \tfrac{1}{2}\log\!\big(1 + \sigma^2_{\mathrm{red}} / \sigma^2_{\mathrm{irr}}\big)$, the information gained from observing the target under the Gaussian surrogate (see the sketch after this list).
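A minimal sketch of the Gaussian information-gain priority, assuming the reducible and irreducible variances have been computed as described above (the `eps` floor is an added numerical-stability assumption, not from the paper):

```python
import numpy as np

def information_gain_priority(sigma2_reducible: float,
                              sigma2_irreducible: float,
                              eps: float = 1e-8) -> float:
    """0.5 * log(1 + reducible / irreducible): mutual information between a Gaussian
    signal with the reducible variance and an observation corrupted by noise with
    the irreducible variance."""
    return 0.5 * np.log1p(sigma2_reducible / (sigma2_irreducible + eps))

print(information_gain_priority(sigma2_reducible=0.5, sigma2_irreducible=0.05))  # learnable: high priority
print(information_gain_priority(sigma2_reducible=0.5, sigma2_irreducible=5.0))   # noise-dominated: low priority
```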
Transitions are sampled with probability $P(i) = p_i^\alpha / \sum_k p_k^\alpha$ (with $\alpha$ controlling prioritization strength) and importance-sampling weights $w_i = (N_{\mathrm{buffer}} \cdot P(i))^{-\beta}$ to correct for bias ($\beta$ anneals from 0.4 toward 1).
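The sampling and bias-correction step is standard PER machinery; a short sketch, with illustrative parameter values rather than the paper's settings:

```python
import numpy as np

def sample_minibatch(priorities: np.ndarray, batch_size: int,
                     alpha: float, beta: float, rng: np.random.Generator):
    """Sample indices proportional to priority**alpha and return importance weights."""
    probs = priorities ** alpha
    probs /= probs.sum()
    idx = rng.choice(len(priorities), size=batch_size, p=probs)
    weights = (len(priorities) * probs[idx]) ** (-beta)
    weights /= weights.max()               # normalize by the max weight, as in PER
    return idx, weights

rng = np.random.default_rng(0)
idx, w = sample_minibatch(np.array([0.9, 0.1, 0.4, 0.05]), batch_size=2,
                          alpha=0.6, beta=0.4, rng=rng)
print(idx, w)
```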
4. Algorithm and Implementation Details
The UPER algorithm employs an ensemble of $K$ QR-DQN networks with $N$ quantile heads each. At each time step, actions are selected $\epsilon$-greedily using the ensemble-mean value $\bar{Q}(s,a) = \frac{1}{KN}\sum_{k,j}\theta^k_j(s,a)$. Transitions are stored in the replay buffer with computed priorities. During updates:
- Minibatches are sampled according to $P(i) \propto p_i^\alpha$.
- For each sampled transition $(s, a, r, s')$:
  - Distributional QR targets are computed for each quantile and head.
  - The quantile-regression losses, the uncertainties $\sigma^2_{\mathrm{epi}}$ and $\sigma^2_{\mathrm{alea}}$, and the prediction-to-target distance $d$ are calculated.
  - The priority $p_i$ is updated as above.
- All sampled transitions contribute importance-weighted gradient steps to the network parameters (a toy end-to-end sketch of one update cycle follows this list).
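Putting the pieces together, the following toy, tabular sketch illustrates one UPER-style update cycle under the assumptions above. It reuses the illustrative helpers `uncertainty_decomposition`, `information_gain_priority`, and `sample_minibatch` from the earlier sketches, replaces the quantile-regression loss with a crude mean-tracking update, and omits bootstrap masks; it is not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
K, N, S, A = 5, 8, 4, 2                        # heads, quantiles, states, actions (toy sizes)
gamma, alpha, beta, lr = 0.99, 0.6, 0.4, 0.1   # illustrative values only
theta = np.zeros((K, N, S, A))                 # online quantile tables
theta_target = theta.copy()                    # target copy

# Tiny replay buffer of (s, a, r, s_next, done) transitions with initial priorities.
buffer = [(0, 1, 1.0, 1, False), (1, 0, 0.0, 2, False), (2, 1, 5.0, 3, True)]
priorities = np.ones(len(buffer))

for sweep in range(100):
    idx, weights = sample_minibatch(priorities, batch_size=2, alpha=alpha, beta=beta, rng=rng)
    for i, w in zip(idx, weights):
        s, a, r, s_next, done = buffer[i]
        # Greedy next action under the ensemble-mean target value.
        a_star = theta_target[:, :, s_next].mean(axis=(0, 1)).argmax()
        target = r + (1.0 - done) * gamma * theta_target[:, :, s_next, a_star]   # (K, N)
        pred = theta[:, :, s, a]                                                  # (K, N)
        # Priority: target disagreement plus squared prediction gap, over aleatoric variance.
        _, alea, epi = uncertainty_decomposition(target)
        gap2 = (pred.mean() - target.mean()) ** 2
        priorities[i] = information_gain_priority(epi + gap2, alea)
        # Crude mean-tracking update standing in for the quantile-regression loss.
        theta[:, :, s, a] += lr * w * (target.mean() - pred)
    theta_target = theta.copy()                # refresh the target each sweep (simplification)

print("final priorities:", np.round(priorities, 3))
```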
Architectural and training specifics include:
- Base network: three convolutional layers and one fully-connected layer, as in QR-DQN (Dabney et al., 2017).
- Ensemble: $K$ bootstrapped networks, $N$ quantiles per head.
- Bootstrapping: Random binary mask per head at each update (Osband et al., 2016).
- Atari settings: prioritization exponent $\alpha$; importance-sampling exponent $\beta$ annealed from 0.4 to 1; Adam optimizer; discount factor $\gamma$; $\epsilon$-greedy exploration; 1M-transition replay buffer; batch size 32; target network update every 8000 frames.
5. Empirical Evaluation
UPER is evaluated on both tabular and high-dimensional benchmarks:
- Multi-armed bandit: In a five-armed bandit with equal means and increasing per-arm reward variance, PER over-samples the noisy arms. UPER closely matches an oracle that samples according to the true distance of each estimate from its arm's mean, yielding the fastest convergence in MSE of the value estimates.
- Noisy gridworld: In a gridworld with a stochastic-reward "corridor" and a deterministic goal, PER focuses replay on the noisy segments while UPER allocates sampling to goal-relevant states. UPER achieves higher test return than both PER and uniform experience replay.
- Atari-57: Against baselines (QR-DQN, PER, QR-Ens PER), UPER achieves the highest median human-normalized scores over training, with substantial gains in specific games (e.g., Asterix, Chopper Command) and small-magnitude regressions on a few titles. Ablations confirm the critical role of the information-gain criterion; prioritizing directly by raw uncertainty or sampling uniformly underperforms.
6. Analysis: Advantages, Limitations, and Robustness
UPER provides several advantages:
- Mitigates the “noisy-TV” phenomenon by downweighting transitions driven by irreducible noise.
- Focuses updates where epistemic uncertainty is high and aleatoric noise is low, maximizing the reduction of model uncertainty per update.
- Demonstrated empirical gains over PER in diverse domains.
Limitations include:
- Increased computational and memory cost arising from the ensemble of distributional heads. This is mitigated by a shared lower-level representation and GPU batch parallelism; for example, the approach introduces a modest per-iteration slowdown of roughly 2 s in Pong.
- The approach depends on the fidelity of uncertainty estimates; under model mis-specification, the priority distribution may act as if "tempered," affecting performance.
Ablations exploring alternative functional forms of the priority reveal trade-offs between robustness to bias and sensitivity. Prioritizing directly by the reducible variance instead of the information gain (log-ratio) leads to slower convergence.
7. Related Directions and Future Work
UPER establishes an information-theoretic, uncertainty-focused prioritization scheme for RL experience replay buffers, outperforming classic PER by aligning the prioritization signal with the actual capacity for knowledge gain. Future research may address:
- Alternative epistemic uncertainty estimators (e.g., dropout, pseudo-counts).
- Extensions to other RL paradigms: policy gradients, model-based reinforcement learning.
- Broader adoption in supervised and active learning settings.
- Theoretical analyses of bias correction and robustness under model mis-specification (Carrasco-Davis et al., 10 Jun 2025).