Neural Experience Replay Samplers
- Neural Experience Replay Samplers (NERS) are neural network-based methods that dynamically weight past experiences to enhance sample efficiency and learning stability.
- NERS use permutation-equivariant architectures and attention mechanisms to assign adaptive priorities, resulting in faster convergence and improved value estimation.
- NERS integrate bias correction and importance sampling techniques to mitigate non-uniform sampling, achieving robust performance across RL, continual learning, and graph-based tasks.
Neural Experience Replay Samplers (NERS) refer to a class of data-driven, neural network–based sampling mechanisms that select and weight past experience samples in experience replay buffers for reinforcement and continual learning. Distinct from heuristic rules or static priorities, these samplers utilize trainable neural architectures—often leveraging attention or permutation-equivariant mechanisms—to compute sampling probabilities or buffer updates by integrating local sample statistics with global batch context. NERS frameworks have demonstrated empirical gains in sample efficiency, stability, robustness to noise, and continual learning performance, with implementations validated in deep Q-learning, actor-critic RL, multi-agent settings, and graph neural network continual learning (Chen et al., 2023, Sarfraz et al., 2023, Zhou et al., 2020, Oh et al., 2020).
1. Neural Architectures and Permutation-Equivariant Scoring
NERS typically employ neural network modules as samplers to compute sampling probabilities or buffer admission weights dynamically.
- In off-policy RL, NERS uses permutation-equivariant architectures to process batches of transitions, ensuring that cross-sample relationships inform scoring (Oh et al., 2020). For each batch index $i$, local features $x_i$ (e.g., TD error, Q-value, reward, timestep) are embedded by a shared MLP $f_{\text{local}}$. Batch-level "global" features are aggregated as $g = \frac{1}{|B|}\sum_{j \in B} f_{\text{global}}(x_j)$. The final priority score for transition $i$ is computed as $\sigma_i = f_{\text{score}}([f_{\text{local}}(x_i); g])$, where $[\cdot\,;\cdot]$ denotes concatenation.
- The Attention Loss Adjusted Prioritized (ALAP) framework augments standard DQN/DDPG-like architectures with a "neural sampler" side branch. It processes mini-batches by self-attention: projecting transition features to queries $Q$, permuting them to form keys $K$, and computing a normalized "sum-of-projections" similarity, which an FC head maps to an adaptive importance-sampling exponent $\beta$ (Chen et al., 2023).
This architectural principle enables the sampler to respond adaptively to buffer diversity, learning phase, and sample redundancy.
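The local/global scoring scheme above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the feature dimensions, network sizes, and initialization are assumptions, and the tiny `mlp` helper stands in for the shared trainable networks.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, w1, b1, w2, b2):
    """Tiny two-layer MLP applied row-wise (ReLU hidden layer)."""
    h = np.maximum(x @ w1 + b1, 0.0)
    return h @ w2 + b2

# Illustrative dimensions: 4 local features per transition
# (e.g. TD error, Q-value, reward, timestep), embedded to 8 dims.
d_in, d_hid, d_emb = 4, 16, 8

def init(d_first, d_out):
    return (rng.normal(scale=0.1, size=(d_first, d_hid)), np.zeros(d_hid),
            rng.normal(scale=0.1, size=(d_hid, d_out)), np.zeros(d_out))

local_net, global_net = init(d_in, d_emb), init(d_in, d_emb)
score_net = init(2 * d_emb, 1)

def priority_scores(batch):
    """Permutation-equivariant scoring: per-sample local embedding,
    mean-pooled global context, concatenation, shared score head."""
    local = mlp(batch, *local_net)                        # (B, d_emb)
    glob = mlp(batch, *global_net).mean(axis=0)           # (d_emb,), order-invariant
    joint = np.concatenate(
        [local, np.tile(glob, (len(batch), 1))], axis=1)  # (B, 2*d_emb)
    return mlp(joint, *score_net).squeeze(-1)             # one score per transition

batch = rng.normal(size=(32, d_in))
scores = priority_scores(batch)

# Equivariance check: permuting the batch permutes the scores identically.
perm = rng.permutation(32)
assert np.allclose(priority_scores(batch[perm]), scores[perm])
```

Because the global context is a mean over the batch, reordering the transitions leaves it unchanged, so scores follow the permutation of their inputs, the defining property of a permutation-equivariant sampler.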
2. Mechanisms for Sample Selection and Buffer Update
NERS mechanisms impact both the probability with which existing buffer entries are sampled for replay and the rules by which new samples are added or prioritized in the buffer.
- In off-policy RL, the sampling distribution over buffer indices is non-uniform, determined by neural priorities $\sigma_i$ as $P(i) = \sigma_i^{\alpha} / \sum_k \sigma_k^{\alpha}$, with $\alpha$ controlling sharpness. The corresponding importance-sampling weights $w_i = (N \cdot P(i))^{-\beta}$ (where $\beta$ may also be learned) are applied to each sample (Oh et al., 2020, Chen et al., 2023).
- In continual learning, Error-Sensitive Reservoir Sampling (ESRS) integrates model-based loss statistics for candidate filtering: a sample $x$ is eligible for buffer insertion iff its stable-model loss satisfies $\ell_s(x) \le \mu_\ell$, where $\mu_\ell$ is the running mean loss under the slow (semantic) copy of the model. Standard reservoir sampling is then applied to the filtered candidate stream, maintaining uniformity post-filtering (Sarfraz et al., 2023).
- For graph continual learning, candidate selection for buffer updates leverages statistics such as proximity to class means (Mean-of-Feature), inter-class neighborhood sparseness (Coverage-Maximization), or influence on model loss as estimated by Hessian-vector products (Influence-Maximization) (Zhou et al., 2020).
A unified property across these methods is that the sample/batch relationships and model state inform either the selection probability or the admissibility of a sample to the buffer.
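The loss-filtered reservoir update can be sketched as follows. This is an illustrative sketch, not the published ESRS code: the exponential-moving-average threshold, its decay rate, and the class interface are assumptions chosen to show the admit-then-reservoir pattern.

```python
import random

class ErrorSensitiveReservoir:
    """Reservoir buffer that only admits candidates whose stable-model loss
    is at most a running mean loss (sketch; decay rate is an assumption)."""

    def __init__(self, capacity, decay=0.99):
        self.capacity = capacity
        self.buffer = []
        self.admitted = 0      # count of admitted candidates: keeps reservoir
        self.mean_loss = None  # sampling uniform over the *filtered* stream
        self.decay = decay

    def observe(self, sample, stable_loss):
        # Track an exponential running mean of the stable model's loss.
        if self.mean_loss is None:
            self.mean_loss = stable_loss
        else:
            self.mean_loss = self.decay * self.mean_loss \
                             + (1 - self.decay) * stable_loss
        # Filter: high-loss (potentially noisy/outlier) samples are ineligible.
        if stable_loss > self.mean_loss:
            return
        # Standard reservoir sampling over the filtered candidate stream.
        self.admitted += 1
        if len(self.buffer) < self.capacity:
            self.buffer.append(sample)
        elif random.randrange(self.admitted) < self.capacity:
            self.buffer[random.randrange(self.capacity)] = sample
```

The key design point is that the reservoir counter advances only for admitted candidates, so the buffer remains a uniform sample of the post-filter stream.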
3. Bias Correction, Importance Sampling, and Theoretical Properties
Non-uniform sampling introduces bias in the estimation of value gradients or loss surfaces. NERS frameworks integrate explicit debiasing mechanisms:
- The ALAP method adjusts the importance-sampling exponent $\beta$ via the neural sampler, ensuring that $\beta \to 1$ as training converges. At $\beta = 1$, the weights $w_i = (N \cdot P(i))^{-1}$ fully compensate the non-uniform sampling distribution, eliminating sampling-induced bias: the weighted gradient estimate matches its expectation under uniform sampling. The neural sampler adaptively increases $\beta$ as the Q-networks converge, dynamically correcting bias throughout training (Chen et al., 2023).
- In neural samplers for RL, per-step importance weights are computed and normalized before being applied to the prioritized samples' loss terms, explicitly handling the distribution shift induced by the sampling policy (Oh et al., 2020).
- ESRS, while primarily a buffer-update mechanism, filters out high-loss (potentially noisy or outlier) transitions, implicitly protecting against catastrophic forgetting without direct bias correction but with measurable improvements in empirical distributional quality (Sarfraz et al., 2023).
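The debiasing role of the exponent can be checked numerically: drawing under priorities $P(i)$ and weighting by $w_i = (N \cdot P(i))^{-\beta}$ with $\beta = 1$ recovers the uniform-sampling mean, while $\beta = 0$ leaves the estimate skewed toward high-priority samples. A minimal sketch with synthetic values and priorities (all quantities illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 10_000
values = rng.normal(size=N)        # stand-in per-sample loss values
priorities = np.exp(values)        # priorities deliberately correlated with values

alpha = 0.6
P = priorities**alpha / np.sum(priorities**alpha)  # non-uniform sampling dist.

def weighted_mean(beta, n_draws=200_000):
    idx = rng.choice(N, size=n_draws, p=P)
    w = (N * P[idx]) ** (-beta)    # importance-sampling weights
    return float(np.mean(w * values[idx]))

uniform_mean = float(values.mean())
biased = weighted_mean(beta=0.0)     # no correction: skewed toward high priorities
corrected = weighted_mean(beta=1.0)  # full correction: matches uniform mean
```

Here `corrected` lands within Monte Carlo noise of `uniform_mean`, while `biased` is pulled upward by the priority/value correlation, which is exactly the bias the annealed $\beta$ removes.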
4. Empirical Impact: Sample Efficiency, Stability, and Robustness
NERS implementations demonstrate improvements across key metrics:
| Metric | ALAP (NERS) (Chen et al., 2023) | Perm-Equiv NERS (Oh et al., 2020) | ESRS (Sarfraz et al., 2023) | ER-GNN Samplers (Zhou et al., 2020) |
|---|---|---|---|---|
| Convergence speed | 2× faster (DQN/CartPole); >30% speedup | 10–50% faster (TD3, SAC, Rainbow) | +5–7pp in continual Class-IL tasks | IM sampler reduces Catastrophic Forgetting (FM ↓) |
| Final return/accuracy | 10–20% higher average return | Higher asymptotic return all tasks | Doubled accuracy under label noise | PM up to 95.66% (Cora) |
| Stability/variance | 50–80% reduction in variance | Higher diversity/Std in sampled batches | Lower buffer corruption, less drift | Consistency across GNNs |
| Noise/label robustness | — | — | >2× accuracy under 50% label noise | — |
| Generality | Same code for DQN, DDPG, MADDPG | Continuous/discrete/both | Consistent under task/buffer sizes | Specializes to GNN continual learning |
Significant findings include sample selection that maintains higher diversity (standard deviations of TD errors and Q-values within NERS batches exceed those of random or greedy selection, supporting the mechanism's efficacy), improved resistance to label noise (ESRS), and reduced catastrophic forgetting in continual learning.
5. Algorithmic and Training Considerations
NERS modules are typically trained on-line, as meta-learners or via supervised/policy-gradient objectives.
- The permutation-equivariant NERS (Oh et al., 2020) employs REINFORCE-based updates to maximize a replay-improvement reward: the change in cumulative evaluation return after the agent is updated on batches drawn under the current sampler parameters. The sampler policy is thus trained directly to maximize agent performance.
- ALAP maintains a mirror buffer with i.i.d. sampling (Double Sampling) to de-correlate the self-attention–driven adjustment from the prioritized batch used for network updates. Updates to the “neural sampler” branch use alternate uniform samples rather than replay-biased ones, preventing positive feedback loops (Chen et al., 2023).
- ESRS requires an additional forward pass through the stable model for each incoming sample to assess candidate admissibility; the added update cost is thus constant per sample. Reservoir sampling remains uniform within the pre-filtered stream (Sarfraz et al., 2023).
NERS introduces moderate computational overhead (additional forward/backward passes for the sampler networks), but empirical studies indicate this is amortized by faster convergence and improved final performance.
6. Extensions, Limitations, and Future Directions
NERS design is generalizable across RL and continual learning domains, with instantiations for off-policy actor-critic, multi-agent, DQN/DDPG, and GNN-based learning. Reported limitations include:
- Increased computational cost for sampler forward/backward passes, particularly in influence-based samplers for GNNs, where Hessian-vector solves are required (Zhou et al., 2020).
- Replay-reward estimation in RL NERS requires multiple full evaluation rollouts, which can be expensive or impractical in real-world settings (Oh et al., 2020).
- For ALAP, the adaptive mechanism relies on measuring “batch concentration” via simple self-attention; it may underperform if the buffer’s state is not well reflected in such a statistic (Chen et al., 2023).
Future avenues proposed include incorporating uncertainty/model-based signals into neural sampler features, multi-agent or hierarchical replay, and meta-learned annealing of sample-selection exponents (Oh et al., 2020). A plausible implication is that NERS will become standard in regimes where data efficiency and robustness to nonstationarity are central, especially as online RL and continual learning advance.