Relational Experience Replay (RER)
- RER is a continual learning framework that uses a Relation Replay Net (RRN) to adaptively tune weights between new and memory samples based on loss and logit norms.
- It addresses the stability–plasticity dilemma by employing a bi-level optimization scheme, ensuring efficient knowledge acquisition while mitigating catastrophic forgetting.
- Empirical results on benchmarks like CIFAR-10, CIFAR-100, and TinyImageNet demonstrate that RER improves accuracy and backward transfer compared to fixed-weight rehearsal methods.
Relational Experience Replay (RER) is a bi-level continual learning framework that adaptively tunes both the task-wise relationship and the sample importance for rehearsal-based approaches. RER addresses the pervasive stability–plasticity dilemma, enabling a continual learner to acquire new knowledge (plasticity) without overwriting prior information (stability) by leveraging a learned, per-example weighting network that operates over pairs of new-task and old-task (memory) samples. This method represents a significant methodological advance beyond traditional rehearsal strategies, which often fix task weights and largely ignore inter-task or intra-task relational structure (Wang et al., 2021).
1. Motivation and Foundational Principles
Continual learning requires a model to learn sequential tasks while mitigating catastrophic forgetting of earlier tasks. Rehearsal methods, which replay stored exemplars from previous tasks, have proven effective, yet most treat the loss incurred on new versus replayed examples with fixed relative weighting. They do not differentiate between past tasks by similarity to the novel task nor reweight based on sample informativeness.
RER introduces a parameterized Relation Replay Net (RRN) that, for each training iteration, inspects pairs of new-task and memory-buffer samples, generating adaptive weights that indicate the relative importance of each instance. These weights exploit both the semantic/statistical relation between tasks (e.g., measured by logit norms) and the inherent difficulty of each sample (e.g., measured by loss magnitude). RER thus replaces hand-tuned task balancing parameters with learned, task-adaptive weighting to better control the stability–plasticity trade-off (Wang et al., 2021).
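The statistics the RRN conditions on can be sketched as follows, assuming numpy and a standard numerically stable softmax cross-entropy; the function name and exact feature ordering are illustrative, not taken from the paper:

```python
import numpy as np

def rrn_inputs(logits_new, labels_new, logits_mem, labels_mem):
    """Build the 4-dim per-pair statistics the RRN conditions on:
    cross-entropy loss and logit norm for the new and memory sample."""
    def ce(logits, labels):
        z = logits - logits.max(axis=1, keepdims=True)       # stable softmax
        logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -logp[np.arange(len(labels)), labels]
    return np.stack([
        ce(logits_new, labels_new),                          # L_D (sample difficulty, new task)
        ce(logits_mem, labels_mem),                          # L_M (sample difficulty, memory)
        np.linalg.norm(logits_new, axis=1),                  # N_D (logit norm, new task)
        np.linalg.norm(logits_mem, axis=1),                  # N_M (logit norm, memory)
    ], axis=1)                                               # shape (B, 4)
```

The per-pair feature vector is then consumed by the weighting network described in Section 2.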
2. Bi-Level Optimization Formulation
The core of RER is a nested bi-level optimization framework. The main classifier network (parameters θ) is updated subject to sample weights produced by the RRN (parameters φ).
- Inner Loop (Plasticity): For each mini-batch, new-task examples and memory samples are paired. For each pair (x_i^D, x_i^M), the RRN generates per-sample weights [λ_i^D, λ_i^M] as a function of the pair's cross-entropy losses and logit norms, [λ_i^D, λ_i^M] = h([L_D, L_M, N_D, N_M]; φ). The classifier's parameters θ are then updated to minimize the weighted loss over the batch.
- Outer Loop (Stability): After an inner-loop step, the impact of the weighting parameters φ is evaluated on a new batch drawn from memory. The RRN parameters φ are updated to minimize this held-out buffer loss, ensuring that the sample weighting produced by the RRN yields strong retention on old data even after plastic updates to θ.
This bi-level structure is essential; an end-to-end variant with joint θ and φ updates under a unified loss proved less effective, confirming the necessity of the staged optimization (Wang et al., 2021).
3. Algorithmic Workflow and Implementation Details
The RER algorithm maintains a memory buffer updated via reservoir or herding sampling after each task. At each training step, batches from the current task and memory are paired. The RRN is a two-layer MLP (hidden size 16), receiving as input a concatenation of loss and logit-norm values for the pair.
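As described, the RRN itself is small. A minimal numpy sketch follows; the ReLU activation, sigmoid output squashing, and the Gaussian initialization are assumptions for illustration, not details confirmed by the text:

```python
import numpy as np

class RelationReplayNet:
    """Sketch of the RRN: a two-layer MLP (hidden size 16) mapping the 4-dim
    pair features [L_D, L_M, N_D, N_M] to two weights [lambda_D, lambda_M]."""
    def __init__(self, in_dim=4, hidden=16, out_dim=2, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.1, (in_dim, hidden))
        self.b1 = np.zeros(hidden)
        self.W2 = rng.normal(0.0, 0.1, (hidden, out_dim))
        self.b2 = np.zeros(out_dim)

    def __call__(self, feats):
        h = np.maximum(feats @ self.W1 + self.b1, 0.0)        # ReLU hidden layer
        # sigmoid keeps each weight in (0, 1)
        return 1.0 / (1.0 + np.exp(-(h @ self.W2 + self.b2)))
```

Keeping the network this small is what makes the per-iteration weighting affordable relative to the main classifier's forward/backward pass.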
Key implementation parameters include:
- Memory buffer sizes: several buffer sizes were tested (the experiments in Section 5 use M = 200, 100, and 5120).
- Warm-up period: During the first half of the epochs of each task, θ is updated but the RRN is frozen (φ fixed), so fixed weights are used.
- Outer update interval: φ is typically updated every S steps, with S on the order of the number of epochs per task.
- Optimizers: θ is updated by SGD (lr = 0.03, momentum = 0.9); φ by Adam (lr = 1e-3, weight decay = 1e-4).
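For reference, the stated hyperparameters can be collected into a single configuration sketch (key names are illustrative; values are those reported above):

```python
# Hyperparameters as stated in the text; dictionary layout is illustrative.
RER_HPARAMS = {
    "theta_optimizer": {"algo": "SGD", "lr": 0.03, "momentum": 0.9},
    "phi_optimizer": {"algo": "Adam", "lr": 1e-3, "weight_decay": 1e-4},
    "warmup_fraction": 0.5,     # phi frozen for the first half of each task's epochs
    "rrn_hidden_size": 16,      # two-layer MLP
}
```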
A high-level pseudocode is as follows:
```
Initialize θ (Main Net), φ (RRN), memory M ← ∅
for t = 1…T:
    Add exemplars from task t to M
    for k = 1…K:
        Sample B_D from D_t, B_M from M
        Pair each (x_i^D, x_i^M); compute losses L_D, L_M and logit norms N_D, N_M
        [λ_i^D, λ_i^M] = h([L_D, L_M, N_D, N_M]; φ)
        L_tr = (1/B) Σ_i [λ_i^D L_D + λ_i^M L_M]
        θ ← θ − η_θ ∇_θ L_tr
        if k mod S == 0:
            Sample B' old samples B_bf from M
            L_bf = (1/B') Σ CE(f(x; θ), y)
            φ ← φ − η_φ ∇_φ L_bf
```
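The loop above can be condensed into a minimal runnable numpy sketch of one inner step followed by a schematic buffer evaluation. The linear softmax classifier, the toy data, and the fixed scalar weights lam_new/lam_mem (stand-ins for the per-sample RRN outputs) are all simplifying assumptions; in the real method the buffer loss is additionally backpropagated through the inner update into φ:

```python
import numpy as np

rng = np.random.default_rng(0)
C, D, B = 3, 5, 8                        # classes, feature dim, batch size
theta = rng.normal(0.0, 0.1, (D, C))     # main-net parameters (linear head)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def ce_and_grad(theta, x, y):
    """Per-sample cross-entropy and the gradient of its mean w.r.t. theta."""
    p = softmax(x @ theta)
    loss = -np.log(p[np.arange(len(y)), y] + 1e-12)
    grad = x.T @ (p - np.eye(C)[y]) / len(y)
    return loss, grad

x_new, y_new = rng.normal(size=(B, D)), rng.integers(0, C, B)   # current task
x_mem, y_mem = rng.normal(size=(B, D)), rng.integers(0, C, B)   # memory buffer

# Inner step: weighted loss over paired new/memory samples. Scalar weights
# stand in for the per-sample RRN outputs [lambda_D, lambda_M].
loss_new, g_new = ce_and_grad(theta, x_new, y_new)
loss_mem, g_mem = ce_and_grad(theta, x_mem, y_mem)
lam_new, lam_mem = 0.6, 0.4
l_tr = lam_new * loss_new.mean() + lam_mem * loss_mem.mean()
theta = theta - 0.03 * (lam_new * g_new + lam_mem * g_mem)      # SGD, lr = 0.03

# Outer step (schematic): evaluate the held-out buffer loss after the update.
# RER would backpropagate this loss through the inner update to adjust phi.
loss_bf, _ = ce_and_grad(theta, x_mem, y_mem)
```

Because the weights are scalars here, weighting the mean gradients is equivalent to differentiating the weighted loss L_tr directly.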
4. Adaptive Stability–Plasticity Trade-off
The RRN adaptively shifts attention between consolidation and accretion. When new and old tasks are semantically similar (quantified via logit norms and observed empirically with structured task similarity studies), RRN increases weights on old samples—prioritizing stability. For dissimilar tasks, weights shift toward current data, increasing plasticity. This per-pair, data-dependent weighting obviates the need for global, hand-tuned hyperparameters.
Empirical ablation demonstrated that RER reduces accuracy drops in challenging task similarity setups compared to baseline ER (e.g., only a 5.1% ACC drop versus 6.9% for ER on semantically relevant paired tasks) (Wang et al., 2021).
5. Experimental Evaluation and Performance
RER and its variants (RER-ACE and RDER) were evaluated on class-incremental and task-incremental learning settings with CIFAR-10, CIFAR-100, and TinyImageNet.
| Setting | Baseline | Accuracy (ACC) | RER/RDER ACC | BWT (Backward Transfer) |
|---|---|---|---|---|
| CIFAR-10 Class-IL, M=200 | DER++ | ≈62.3% | ≈65.4% | Less negative with RER |
| CIFAR-100 Class-IL, M=100 | DER++ | ≈14.98% | ≈20.8% | +5.8% ACC with RER/RDER |
| TinyImageNet Class-IL, M=5120 | DER++ | ≈37.9% | ≈39.7% | Improved with RER/RDER |
Plugging RER into strong rehearsal baselines (ER-ACE, DER++) consistently improved both ACC and BWT, surpassing previous state-of-the-art methods. These improvements were robust across buffer sizes and datasets (Wang et al., 2021).
6. Comparative Analyses and Ablative Study
Ablation experiments confirmed that the full bi-level scheme, including outer-loop updates on φ, was essential for optimal performance. Removing adaptive weighting or using vanilla end-to-end updates degraded accuracy compared to the baseline DER++.
Studies on task similarity manipulation established that RER's sample weighting is sensitive to inter-task relationships, automatically increasing old-sample emphasis when current and prior tasks are similar, which is not seen in fixed-weight rehearsal schemes. Additionally, varying the warm-up duration and outer update interval showed that a warm-up of roughly half the epochs per task and an update interval near the number of epochs per task provided a strong trade-off between computational efficiency and final accuracy (Wang et al., 2021).
7. Limitations and Research Directions
RER introduces computational overhead due to the RRN and extra hyperparameters (warm-up length, update interval for φ). Optimizing the scheduling of φ-updates can reduce this complexity. The method continues to rely on an explicit memory buffer; integrating RER with generative replay remains an open avenue. A further extension is to incorporate higher-order interactions (e.g., triplets of tasks or samples) and to use feature-space similarity directly within the RRN. These enhancements could further improve the adaptation of continual learners in more challenging, non-stationary environments.
8. Relationship to Relational Replay in Reinforcement Learning
The concept of leveraging relational context for experience replay also arises in reinforcement learning, as in the Neural Experience Replay Sampler (NERS) framework (Oh et al., 2020). NERS adopts a permutation-equivariant neural architecture that assesses local (per-transition) and global (batch-wise) contexts to enable adaptive, diverse sampling from the experience buffer. Both RER and NERS highlight the advantage of moving beyond independent, fixed-priority replay toward architectures capable of exploiting relational structure among experience samples, although NERS is developed for off-policy RL whereas RER targets continual supervised learning.
In summary, Relational Experience Replay provides an effective, bi-level, and relationally-aware approach to continual learning, substantiated by empirical superiority on standard benchmarks and a robust theoretical grounding (Wang et al., 2021).