
Neural Experience Replay Samplers

Updated 25 February 2026
  • Neural Experience Replay Samplers (NERS) are neural network-based methods that dynamically weight past experiences to enhance sample efficiency and learning stability.
  • NERS use permutation-equivariant architectures and attention mechanisms to assign adaptive priorities, resulting in faster convergence and improved value estimation.
  • NERS integrate bias correction and importance sampling techniques to mitigate non-uniform sampling, achieving robust performance across RL, continual, and graph-based tasks.

Neural Experience Replay Samplers (NERS) refer to a class of data-driven, neural network–based sampling mechanisms that select and weight past experience samples in experience replay buffers for reinforcement and continual learning. Distinct from heuristic rules or static priorities, these samplers utilize trainable neural architectures—often leveraging attention or permutation-equivariant mechanisms—to compute sampling probabilities or buffer updates by integrating local sample statistics with global batch context. NERS frameworks have demonstrated empirical gains in sample efficiency, stability, robustness to noise, and continual learning performance, with implementations validated in deep Q-learning, actor-critic RL, multi-agent settings, and graph neural network continual learning (Chen et al., 2023, Sarfraz et al., 2023, Zhou et al., 2020, Oh et al., 2020).

1. Neural Architectures and Permutation-Equivariant Scoring

NERS typically employ neural network modules as samplers to compute sampling probabilities or buffer admission weights dynamically.

  • In off-policy RL, NERS uses permutation-equivariant architectures to process batches of transitions, ensuring that cross-sample relationships inform scoring (Oh et al., 2020). For each batch index $i$, local features (e.g., $(s_i, a_i, r_i, s_{i+1}, \delta_i)$) are embedded by a shared MLP $\phi_\text{loc}$. Batch-level "global" features are aggregated as $\mathbf{c} = \frac{1}{|I|} \sum_j \phi_\text{glob}(x_j)$. The final priority score for transition $i$ is computed as $\sigma_i = \phi_\text{score}([\mathbf{h}_i^{\text{local}} ; \mathbf{c}])$, where $[\cdot\,;\cdot]$ denotes concatenation.
  • The Attention Loss Adjusted Prioritized (ALAP) framework augments standard DQN/DDPG-like architectures with a "neural sampler" side-branch. It processes mini-batches $X = [(s_1, a_1), \ldots, (s_m, a_m)]$ by self-attention: projecting to queries $Q = X W_Q$, permuting to form keys $K$, and computing a normalized "sum-of-projections" similarity $a$, which is mapped via a fully connected head to an adaptive importance-sampling exponent $\beta \in [0, 1]$: $\beta = \operatorname{Clip}_{[0,1]}(W_2\,\phi(W_1 a + b_1) + b_2)$ (Chen et al., 2023).

This architectural principle enables the sampler to respond adaptively to buffer diversity, learning phase, and sample redundancy.
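The local/global scoring scheme above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's exact design: the layer sizes, the single-layer MLPs, and the exponential link that keeps priorities positive are all assumptions made here for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, W, b):
    # Shared single-layer MLP with ReLU, applied independently to each transition.
    return np.maximum(W @ x + b, 0.0)

def score_batch(X, params):
    """Score a batch of transition features in a permutation-equivariant way.

    X: (m, d) array, one row of local features per transition
    (e.g. concatenated s, a, r, s', TD-error).
    """
    W_loc, b_loc, W_glob, b_glob, w_score, b_score = params
    H_local = np.stack([mlp(x, W_loc, b_loc) for x in X])     # per-sample embeddings
    c = np.mean([mlp(x, W_glob, b_glob) for x in X], axis=0)  # global batch context
    # Concatenate local and global features; exp keeps each priority positive.
    return np.array([np.exp(w_score @ np.concatenate([h, c]) + b_score)
                     for h in H_local])

d, k, m = 5, 8, 4
params = (rng.normal(size=(k, d)), np.zeros(k),
          rng.normal(size=(k, d)), np.zeros(k),
          rng.normal(size=2 * k), 0.0)
X = rng.normal(size=(m, d))
s = score_batch(X, params)

# Permutation equivariance: shuffling the batch shuffles the scores identically,
# because the global context c is a permutation-invariant mean.
perm = rng.permutation(m)
assert np.allclose(score_batch(X[perm], params), s[perm])
```

Because the global context is a mean over the batch, reordering transitions permutes the output scores without changing their values, which is the equivariance property the architecture relies on.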

2. Mechanisms for Sample Selection and Buffer Update

NERS mechanisms impact both the probability with which existing buffer entries are sampled for replay and the rules by which new samples are added or prioritized in the buffer.

  • In off-policy RL, the sampling distribution over buffer indices is non-uniform, determined by neural priorities as $p_i = \frac{\sigma_i^\alpha}{\sum_j \sigma_j^\alpha}$, with $\alpha$ controlling sharpness. The corresponding importance-sampling weights $w_i = \left(\frac{1}{|\mathcal{B}|\, p_i}\right)^\beta$ (where $\beta$ may also be learned) are applied to each sample (Oh et al., 2020, Chen et al., 2023).
  • In continual learning, Error-Sensitive Reservoir Sampling (ESRS) integrates model-based loss statistics for candidate filtering: a sample $(x_i, y_i)$ is eligible for buffer insertion iff its stable-model loss $\ell_s^i \leq \beta \mu_\ell$, where $\mu_\ell$ is the running mean loss under the slow or semantic copy. Standard reservoir sampling is then applied to the filtered candidate stream, maintaining uniformity post-filtering (Sarfraz et al., 2023).
  • For graph continual learning, candidate selection for buffer updates leverages statistics such as proximity to class means (Mean-of-Feature), inter-class neighborhood sparseness (Coverage-Maximization), or influence on model loss as estimated by Hessian-vector products (Influence-Maximization) (Zhou et al., 2020).

A unified property across these methods is that the sample/batch relationships and model state inform either the selection probability or the admissibility of a sample to the buffer.
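The priority-to-probability rule and its importance-sampling weights can be sketched directly. This is a minimal NumPy version; normalizing the weights by their maximum is a common convention borrowed from prioritized experience replay and is an assumption here, not something the cited papers mandate.

```python
import numpy as np

def prioritized_sample(scores, batch_size, alpha=0.6, beta=0.4, rng=None):
    """Sample buffer indices from neural priorities with IS correction.

    p_i = sigma_i^alpha / sum_j sigma_j^alpha
    w_i = (1 / (N * p_i))^beta, normalized by the max for stability.
    """
    rng = rng or np.random.default_rng()
    p = scores ** alpha
    p = p / p.sum()                      # sampling distribution over the buffer
    idx = rng.choice(len(scores), size=batch_size, p=p)
    w = (1.0 / (len(scores) * p[idx])) ** beta
    return idx, w / w.max()

# Example: four buffer entries with neural priorities sigma_i.
scores = np.array([0.1, 0.5, 2.0, 0.2])
idx, w = prioritized_sample(scores, batch_size=3)
```

With `alpha = 0` the distribution collapses to uniform sampling; larger `alpha` sharpens it toward the highest-priority transitions, which is exactly the trade-off the learned scorer modulates.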

3. Bias Correction, Importance Sampling, and Theoretical Properties

Non-uniform sampling introduces bias in the estimation of value gradients or loss surfaces. NERS frameworks integrate explicit debiasing mechanisms:

  • The ALAP method adjusts the importance-sampling exponent $\beta$ via the neural sampler, ensuring that

$$w(i) = \left(\frac{1}{N P(i)}\right)^\beta$$

approaches $w(i) \propto 1/(N P(i))$ as $\beta \to 1$, which eliminates sampling-induced bias, guaranteeing that

$$\mathbb{E}_{i \sim P}\left[w(i)\, \nabla_\theta L_i\right] = \nabla_\theta \left( \frac{1}{N} \sum_{i=1}^N L_i \right).$$

The neural sampler adaptively increases $\beta$ as the Q-networks converge, dynamically correcting bias throughout training (Chen et al., 2023).

  • In neural samplers for RL, per-step importance weights $w_i$ are computed and normalized before being applied to the prioritized samples in the loss, explicitly handling the distribution shift induced by the sampling policy (Oh et al., 2020).
  • ESRS, while primarily a buffer-update mechanism, filters out high-loss (potentially noisy or outlier) transitions. It performs no direct bias correction, but the filtering implicitly protects against catastrophic forgetting and yields measurable improvements in the empirical quality of the buffer distribution (Sarfraz et al., 2023).
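The unbiasedness identity above can be verified numerically: at $\beta = 1$ the weight $w(i) = 1/(N P(i))$ exactly cancels the sampling distribution, so the weighted expectation under $P$ equals the uniform average of the per-sample gradients. The values below are arbitrary stand-ins chosen for illustration.

```python
import numpy as np

N = 5
grads = np.array([0.3, -1.2, 0.7, 2.0, -0.4])  # stand-ins for per-sample gradients
P = np.array([0.4, 0.1, 0.2, 0.25, 0.05])      # arbitrary non-uniform sampling dist.

beta = 1.0
w = (1.0 / (N * P)) ** beta

# E_{i~P}[w(i) * grad_i]: each term P(i) * (1/(N P(i))) * grad_i reduces to grad_i/N,
# so the sum equals the uniform mean gradient.
weighted_expectation = np.sum(P * w * grads)
uniform_mean = grads.mean()
assert np.isclose(weighted_expectation, uniform_mean)
```

For $\beta < 1$ the correction is only partial, which is why ALAP's learned schedule pushes $\beta$ toward 1 as training converges.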

4. Empirical Impact: Sample Efficiency, Stability, and Robustness

NERS implementations demonstrate improvements across key metrics:

| Metric | ALAP (Chen et al., 2023) | Perm-Equiv NERS (Oh et al., 2020) | ESRS (Sarfraz et al., 2023) | ER-GNN Samplers (Zhou et al., 2020) |
|---|---|---|---|---|
| Convergence speed | 2× faster (DQN/CartPole); >30% speedup | 10–50% faster (TD3, SAC, Rainbow) | +5–7 pp on continual Class-IL tasks | IM sampler reduces catastrophic forgetting (FM ↓) |
| Final return/accuracy | 10–20% higher average return | Higher asymptotic return on all tasks | Doubled accuracy under label noise | PM up to 95.66% (Cora) |
| Stability/variance | 50–80% reduction in variance | Higher diversity/std in sampled batches | Lower buffer corruption, less drift | Consistency across GNNs |
| Noise/label robustness | | | >2× accuracy under 50% label noise | |
| Generality | Same code for DQN, DDPG, MADDPG | Continuous and discrete action spaces | Consistent across task/buffer sizes | Specializes to GNN continual learning |

Significant findings include sample selection that maintains higher diversity (the standard deviations of TD errors and Q-values in NERS batches exceed those of random or greedy selection, supporting the mechanism's efficacy), improved resistance to label noise (ESRS), and reduced catastrophic forgetting in continual learning.

5. Algorithmic and Training Considerations

NERS modules are typically trained online, as meta-learners or via supervised or policy-gradient objectives.

  • The permutation-equivariant NERS (Oh et al., 2020) employs REINFORCE-based updates to maximize a replay-improvement reward, defined as the change in cumulative return for agents fine-tuned under the current sampler parameters. The NERS policy is thus directly trained to maximize agent performance.
  • ALAP maintains a mirror buffer with i.i.d. sampling (Double Sampling) to de-correlate the self-attention–driven β\beta adjustment from the prioritized batch used for network updates. Updates to the “neural sampler” branch use alternate uniform samples rather than replay-biased ones, preventing positive feedback loops (Chen et al., 2023).
  • ESRS requires an additional forward pass through the stable model for each incoming sample to assess candidate admissibility; however, the update cost is $O(1)$ per sample. Reservoir sampling remains uniform within the pre-filtered stream (Sarfraz et al., 2023).

NERS introduces moderate computational overhead (additional forward/backward passes for the sampler networks), but empirical studies indicate this is amortized by faster convergence and improved final performance.
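The ESRS-style admission rule described above can be sketched as a filtered reservoir. The class name, the running-mean estimate of the stream loss, and treating the stable-model loss as a given scalar are all illustrative assumptions; the real method computes that loss with an extra forward pass through the slow model.

```python
import numpy as np

class FilteredReservoir:
    """Reservoir buffer that admits a sample only if its loss under a slow
    ('stable') model is at most beta times the running mean loss."""

    def __init__(self, capacity, beta=1.0, rng=None):
        self.capacity = capacity
        self.beta = beta
        self.buffer = []
        self.n_admitted = 0     # samples that passed the filter so far
        self.mean_loss = 0.0    # running mean loss over the incoming stream
        self.n_seen = 0
        self.rng = rng or np.random.default_rng()

    def add(self, sample, stable_loss):
        # Update the running mean loss over the full stream.
        self.n_seen += 1
        self.mean_loss += (stable_loss - self.mean_loss) / self.n_seen
        # Candidate filter: reject likely-noisy, high-loss samples.
        if stable_loss > self.beta * self.mean_loss:
            return
        # Standard reservoir sampling over the filtered stream: each admitted
        # sample ends up in the buffer with probability capacity / n_admitted.
        self.n_admitted += 1
        if len(self.buffer) < self.capacity:
            self.buffer.append(sample)
        else:
            j = self.rng.integers(self.n_admitted)
            if j < self.capacity:
                self.buffer[j] = sample

buf = FilteredReservoir(capacity=3, rng=np.random.default_rng(0))
for i in range(10):
    buf.add(("clean", i), stable_loss=1.0)
buf.add(("noisy", 0), stable_loss=100.0)  # far above the running mean: rejected
```

After the loop the running mean is 1.0, so the outlier with loss 100 fails the `stable_loss <= beta * mean_loss` test and never displaces a buffered sample, while admitted samples remain uniformly represented.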

6. Extensions, Limitations, and Future Directions

NERS design is generalizable across RL and continual learning domains, with instantiations for off-policy actor-critic, multi-agent, DQN/DDPG, and GNN-based learning. Reported limitations include:

  • Increased computational cost for sampler forward/backward passes, particularly in influence-based samplers for GNNs, where Hessian-vector solves are required (Zhou et al., 2020).
  • Replay-reward estimation in RL NERS requires multiple full evaluation rollouts, which can be expensive or impractical in real-world settings (Oh et al., 2020).
  • For ALAP, the adaptive $\beta$ mechanism relies on measuring "batch concentration" via simple self-attention; it may underperform if the buffer's state is not well reflected in such a statistic (Chen et al., 2023).

Future avenues proposed include incorporating uncertainty/model-based signals into neural sampler features, multi-agent or hierarchical replay, and meta-learned annealing of sample-selection exponents (Oh et al., 2020). A plausible implication is that NERS will become standard in regimes where data efficiency and robustness to nonstationarity are central, especially as online RL and continual learning advance.
