Uncertainty Prioritized Experience Replay

Updated 7 January 2026
  • UPER is a reinforcement learning approach that distinguishes reducible epistemic uncertainty from irreducible aleatoric noise to guide transition sampling.
  • It uses an ensemble of bootstrapped QR-DQN networks to estimate total, epistemic, and aleatoric uncertainty through an information-gain criterion.
  • Empirical evaluations in bandits, gridworld, and Atari games demonstrate UPER’s enhanced sample efficiency and improved performance compared to standard PER.

Uncertainty Prioritized Experience Replay (UPER) is a variant of prioritized experience replay that addresses the over-sampling of transitions driven by irreducible noise in reinforcement learning (RL). Standard prioritized experience replay (PER) samples transitions from a replay buffer in proportion to their absolute temporal difference (TD) error, yet this approach fails to distinguish between epistemic uncertainty (uncertainty due to lack of knowledge) and aleatoric uncertainty (uncertainty due to inherent randomness). UPER instead prioritizes transitions by their estimated epistemic uncertainty, modulated by aleatoric uncertainty using an information-gain criterion. This mechanism aims to maximize sample efficiency by focusing updates on transitions expected to most reduce model uncertainty (Carrasco-Davis et al., 10 Jun 2025).

1. Motivation and Problem Statement

In canonical PER (Schaul et al., 2016), transitions are sampled according to $|\delta|$, conflating statistical sources of error. RL agents in stochastic environments frequently encounter the "noisy-TV" problem, in which transitions with large TD error are prioritized even though the error is due to unlearnable aleatoric noise. This results in wasted updates on transitions that cannot improve policy performance. UPER explicitly distinguishes between reducible (epistemic) and irreducible (aleatoric) uncertainty, enabling focused replay sampling. The method substitutes $|\delta|$ with an information-theoretic priority that reflects the value of updating on a given transition for reducing model uncertainty.

2. Formal Uncertainty Decomposition

UPER adopts an uncertainty decomposition grounded in Direct Epistemic Uncertainty Prediction (DEUP; Lahlou et al., 2022):

  • Let $\Theta(s', r) = r + \gamma \max_{a'} \overline{Q}(s', a')$ denote the one-step Bellman target.
  • Total uncertainty at each $(s, a)$ is $U(Q_p, s, a) = \mathbb{E}_{s', r}\big[(\Theta(s', r) - Q_p(s, a))^2\big]$.
    • Aleatoric uncertainty, $A(s, a)$, is defined as $U(Q^*, s, a)$, i.e., the irreducible variance under the Bayes-optimal predictor.
    • Epistemic uncertainty, $E(Q_p, s, a)$, is $U(Q_p, s, a) - A(s, a)$.

In practice, UPER estimates these values using an ensemble of bootstrapped QR-DQN networks, each with multiple quantile heads. Given ensemble members ($\psi$) and quantile indices ($\tau$), each head outputs quantile values $\theta_\tau(s, a; \psi)$. The ensemble statistics provide empirical estimates for total, epistemic, and aleatoric uncertainty:

Quantity | Symbol | Empirical Estimate
Total uncertainty | $\widehat{U}$ | $\mathrm{Var}_{\tau, \psi}\big[\theta_\tau(s,a;\psi)\big]$
Aleatoric uncertainty | $\widehat{A}$ | $\mathrm{Var}_\tau\big[\mathbb{E}_\psi[\theta_\tau(s,a;\psi)]\big]$
Epistemic uncertainty | $\widehat{E}$ | $\mathbb{E}_\tau\big[\mathrm{Var}_\psi[\theta_\tau(s,a;\psi)]\big]$
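As a concrete illustration, the following sketch (not from the paper; names and shapes are assumptions) computes these empirical estimates for a single $(s, a)$ pair from an array of quantile values $\theta_\tau(s, a; \psi)$, with one row per ensemble member and one column per quantile:

```python
import numpy as np

def uncertainty_estimates(theta: np.ndarray):
    """Empirical uncertainty estimates from quantile values of shape (N_e, N_q),
    where axis 0 indexes ensemble members (psi) and axis 1 quantile levels (tau)."""
    u_hat = theta.var()                # total: Var over both tau and psi
    a_hat = theta.mean(axis=0).var()   # aleatoric: Var over tau of the ensemble mean
    e_hat = theta.var(axis=0).mean()   # epistemic: mean over tau of Var over psi
    return u_hat, e_hat, a_hat

# Example with N_e = 10 ensemble members and N_q = 51 quantiles.
theta = np.random.default_rng(0).normal(size=(10, 51))
u_hat, e_hat, a_hat = uncertainty_estimates(theta)
```

With these population-variance (ddof = 0) estimators, the law of total variance gives $\widehat{U} = \widehat{E} + \widehat{A}$ exactly.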

UPER's refinement involves the target-to-prediction gap:

  • $\delta_\Theta(s, a) = \Theta(s', r) - \mathbb{E}_{\tau, \psi}\big[\theta_\tau(s, a; \psi)\big]$
  • Target-total uncertainty: $\widehat{U}_\delta(s, a) = \delta_\Theta^2(s, a) + \widehat{E}(s, a) + \widehat{A}(s, a)$
  • Target epistemic uncertainty: $\widehat{E}_\delta(s, a) = \delta_\Theta^2(s, a) + \widehat{E}(s, a)$
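Continuing the sketch above (hypothetical names, reusing `uncertainty_estimates`), the target-based quantities for a single transition could be computed as:

```python
def target_uncertainties(theta: np.ndarray, bellman_target: float):
    """Target-to-prediction gap and target-based uncertainties for one transition.
    `bellman_target` is the scalar one-step target Theta(s', r)."""
    u_hat, e_hat, a_hat = uncertainty_estimates(theta)
    delta_theta = bellman_target - theta.mean()   # delta_Theta(s, a)
    u_delta = delta_theta**2 + e_hat + a_hat      # target-total uncertainty
    e_delta = delta_theta**2 + e_hat              # target epistemic uncertainty
    return u_delta, e_delta, a_hat
```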

3. Information-Gain Prioritization Scheme

The UPER prioritization replaces the PER TD-error criterion with an information-gain measure under a Gaussian surrogate:

  • Reducible variance: $\sigma_{\text{ep}}^2 = \widehat{E}_\delta(s_i, a_i)$
  • Irreducible variance: $\sigma_{\text{al}}^2 = \widehat{A}(s_i, a_i)$
  • Priority: $p_i = \Delta\mathcal{H}_\delta = \frac{1}{2}\log\big(1 + \sigma_{\text{ep}}^2 / \sigma_{\text{al}}^2\big)$
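The priority can be read as the entropy reduction of a Gaussian whose variance shrinks from $\sigma_{\text{ep}}^2 + \sigma_{\text{al}}^2$ to $\sigma_{\text{al}}^2$. A one-line sketch (the `eps` guard is an added assumption, not from the paper):

```python
import numpy as np

def information_gain_priority(e_delta: float, a_hat: float, eps: float = 1e-8) -> float:
    """Information-gain priority p_i = 0.5 * log(1 + sigma_ep^2 / sigma_al^2)."""
    return 0.5 * np.log(1.0 + e_delta / (a_hat + eps))
```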

Transitions are sampled with probability $P(i) = p_i^\alpha / \sum_k p_k^\alpha$ (with $\alpha$ controlling prioritization strength) and importance-sampling weights $w_i = (N P(i))^{-\beta} / \max_j\big[(N P(j))^{-\beta}\big]$ to correct for bias ($\beta$ anneals from 0.4 toward 1).
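A minimal sketch of the resulting sampling and importance-sampling correction, assuming `priorities` is a 1-D array of $p_i$ values over a buffer of size $N$ (proportional sampling shown for clarity; practical implementations typically use a sum-tree):

```python
import numpy as np

def sample_minibatch(priorities: np.ndarray, batch_size: int,
                     alpha: float = 0.6, beta: float = 0.4,
                     rng: np.random.Generator | None = None):
    """Draw indices with P(i) proportional to p_i^alpha and return normalized IS weights."""
    if rng is None:
        rng = np.random.default_rng()
    n = priorities.shape[0]
    probs = priorities ** alpha
    probs = probs / probs.sum()                   # P(i)
    idx = rng.choice(n, size=batch_size, p=probs)
    weights = (n * probs[idx]) ** (-beta)
    weights /= (n * probs.min()) ** (-beta)       # divide by max_j (N P(j))^{-beta}
    return idx, weights
```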

4. Algorithm and Implementation Details

The UPER algorithm employs an ensemble of $N_e$ QR-DQN networks with $N_q$ quantile heads each. At each time step, actions are selected $\epsilon$-greedily using the ensemble-mean $Q$-values. Transitions are stored in the replay buffer with computed priorities. During updates (a vectorized sketch of the priority refresh follows the list below):

  • Minibatches are sampled by $P(i)$.
  • For each sampled transition $i$:
    • Distributional QR targets $\Theta_i$ are computed for each quantile and head.
    • The quantile-regression loss $L_{QR}$, the uncertainties $\widehat{E}$ and $\widehat{A}$, and the squared gap $\delta_\Theta^2$ are calculated.
    • Priority $p_i$ is updated as above.
  • All prioritized transitions update network parameters via weighted gradient steps.
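A vectorized sketch (assumed shapes and names, not the authors' code) of the per-minibatch priority refresh, with `theta` holding quantile values of shape `(batch, N_e, N_q)` and `bellman_targets` the scalar targets $\Theta_i$:

```python
import numpy as np

def refresh_priorities(theta: np.ndarray, bellman_targets: np.ndarray,
                       eps: float = 1e-8) -> np.ndarray:
    """New priorities p_i for a minibatch; theta has shape (batch, N_e, N_q)."""
    a_hat = theta.mean(axis=1).var(axis=1)             # Var_tau[E_psi[theta]]
    e_hat = theta.var(axis=1).mean(axis=1)             # E_tau[Var_psi[theta]]
    delta = bellman_targets - theta.mean(axis=(1, 2))  # delta_Theta per transition
    e_delta = delta ** 2 + e_hat                       # target epistemic uncertainty
    return 0.5 * np.log(1.0 + e_delta / (a_hat + eps)) # information-gain priorities
```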

Architectural and training specifics include:

  • Base network: three convolutional layers followed by one fully-connected layer, as in QR-DQN (Dabney et al., 2017).
  • Ensemble: $N_e = 10$ networks with $N_q = 51$ quantile heads each.
  • Bootstrapping: random binary mask $m \sim \mathrm{Bernoulli}(0.5)$ per head at each update (Osband et al., 2016).
  • Atari settings: $\alpha = 0.6$, $\beta$ annealed from 0.4 to 1, learning rate $5 \times 10^{-5}$, Adam $\epsilon = 10^{-8}$, discount factor $\gamma = 0.99$, $\epsilon$-greedy exploration with $\epsilon = 0.01$, replay buffer of 1M transitions, batch size 32, target-network update every 8000 frames.
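For reference, the settings listed above can be gathered into a single configuration object (key names are assumptions, not the authors' code):

```python
UPER_ATARI_CONFIG = {
    "n_ensemble": 10,              # N_e QR-DQN networks
    "n_quantiles": 51,             # N_q quantile heads per network
    "alpha": 0.6,                  # prioritization exponent
    "beta_start": 0.4,             # IS exponent, annealed toward 1.0
    "beta_final": 1.0,
    "learning_rate": 5e-5,
    "adam_eps": 1e-8,
    "gamma": 0.99,                 # discount factor
    "epsilon_greedy": 0.01,
    "buffer_size": 1_000_000,
    "batch_size": 32,
    "target_update_frames": 8_000,
    "bootstrap_mask_prob": 0.5,    # Bernoulli(0.5) mask per head per update
}
```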

5. Empirical Evaluation

UPER is evaluated on both tabular and high-dimensional benchmarks:

  • Multi-armed bandit: In a bandit task with 5 arms of equal mean and increasing variance, PER over-samples noisy arms. UPER matches oracle sampling based on true mean distance, yielding the fastest convergence in MSE of the value estimates.
  • Noisy Gridworld: In a gridworld with a stochastic reward “corridor” and deterministic goal, PER focuses replay on noisy segments while UPER allocates sampling to goal-relevant states. Test return: UPER $>$ PER $>$ uniform experience replay.
  • Atari-57: Against baselines (QR-DQN, PER, QR-Ens PER), UPER achieves the highest median human-normalized scores over training, with substantial gains in specific games (e.g., Asterix, Chopper Command) and small-magnitude regressions on a few titles. Ablations confirm the critical role of the information-gain criterion; both direct $\widehat{E}_\delta$ prioritization and uniform sampling underperform.

6. Analysis: Advantages, Limitations, and Robustness

UPER provides several advantages:

  • Mitigates the “noisy-TV” phenomenon by downweighting transitions driven by irreducible noise.
  • Focuses updates where epistemic uncertainty is high and aleatoric noise is low, maximizing the reduction of model uncertainty per update.
  • Demonstrated empirical gains over PER in diverse domains.

Limitations include:

  • Increased computational and memory cost arising from the ensemble of distributional heads. Mitigation is achieved via a shared lower-level representation and GPU batch parallelism; for example, the approach introduces a modest per-iteration slowdown of approximately 2 s in Pong.
  • The approach depends on the fidelity of uncertainty estimates; under model mis-specification, the priority distribution may act as if "tempered," affecting performance.

Ablations exploring alternative functional forms of the priority, $p_i \sim \widehat{E}^m/(\widehat{E}+\widehat{A})$, reveal trade-offs between robustness to bias and sensitivity. Using direct $\widehat{E}_\delta$ prioritization instead of the information gain (log-ratio) leads to slower convergence.

UPER establishes an information-theoretic, uncertainty-focused prioritization scheme for RL experience replay buffers, outperforming classic PER by aligning the prioritization signal with the actual capacity for knowledge gain.
