
Relational Experience Replay (RER)

Updated 6 March 2026
  • RER is a continual learning framework that uses a Relation Replay Net (RRN) to adaptively tune weights between new and memory samples based on loss and logit norms.
  • It addresses the stability–plasticity dilemma by employing a bi-level optimization scheme, ensuring efficient knowledge acquisition while mitigating catastrophic forgetting.
  • Empirical results on benchmarks like CIFAR-10, CIFAR-100, and TinyImageNet demonstrate that RER improves accuracy and backward transfer compared to fixed-weight rehearsal methods.

Relational Experience Replay (RER) is a bi-level continual learning framework that adaptively tunes both the task-wise relationship and the sample importance for rehearsal-based approaches. RER addresses the pervasive stability–plasticity dilemma, enabling a continual learner to acquire new knowledge (plasticity) without overwriting prior information (stability) by leveraging a learned, per-example weighting network that operates over pairs of new-task and old-task (memory) samples. This method represents a significant methodological advance beyond traditional rehearsal strategies, which often fix task weights and largely ignore inter-task or intra-task relational structure (Wang et al., 2021).

1. Motivation and Foundational Principles

Continual learning requires a model to learn sequential tasks \{\mathcal{D}_1, \mathcal{D}_2, \dots\} while mitigating catastrophic forgetting of earlier tasks. Rehearsal methods, which replay stored exemplars from previous tasks, have proven effective, yet most treat the loss incurred on new versus replayed examples with fixed relative weighting. They do not differentiate between past tasks by similarity to the novel task, nor do they reweight based on sample informativeness.

RER introduces a parameterized Relation Replay Net (RRN) that, for each training iteration, inspects pairs of new-task and memory-buffer samples, generating adaptive weights that indicate the relative importance of each instance. These weights exploit both the semantic/statistical relation between tasks (e.g., measured by logit norms) and the inherent difficulty of each sample (e.g., measured by loss magnitude). RER thus replaces hand-tuned task balancing parameters with learned, task-adaptive weighting to better control the stability–plasticity trade-off (Wang et al., 2021).

2. Bi-Level Optimization Formulation

The core of RER is a nested bi-level optimization framework. The main classifier network f(\cdot; \theta) is updated subject to sample weights produced by the RRN h(\cdot; \phi).

  • Inner Loop (Plasticity): For each mini-batch, new-task examples and memory samples are paired. For each pair (x^D_i, x^M_i), the RRN generates per-sample weights [\lambda^D_i, \lambda^M_i], computed as a function of cross-entropy losses and logit norms:

[\lambda^D_i(\phi), \lambda^M_i(\phi)] = h\left(L^{tr}(x^D_i; \theta),\, L^{tr}(x^M_i; \theta),\, \|z(x^D_i; \theta)\|_2,\, \|z(x^M_i; \theta)\|_2;\, \phi\right)

The classifier’s parameters \theta are updated to minimize the weighted loss over the batch.

  • Outer Loop (Stability): After an inner-loop step, the impact of \phi is evaluated on a new batch from memory. The parameters \phi are updated to minimize the held-out buffer loss, ensuring that the sample weighting produced by the RRN yields strong retention on old data even after plastic updates to \theta.

This bi-level structure is essential; an end-to-end variant with joint \theta and \phi updates under a unified loss proved less effective, confirming the necessity of the staged optimization (Wang et al., 2021).
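The weight-generation step above can be sketched as a small MLP over the four per-pair statistics. The following is an illustrative numpy sketch only: the paper's RRN is a two-layer MLP with hidden size 16, but the remaining details here (function name, softplus output, initialization) are assumptions, not the authors' implementation.

```python
import numpy as np

def relation_replay_net(losses_new, losses_mem, norms_new, norms_mem,
                        W1, b1, W2, b2):
    """Illustrative RRN sketch: maps per-pair (loss, logit-norm) statistics
    to non-negative sample weights [lambda_D, lambda_M] per pair."""
    # Stack the four scalar features per pair -> shape (batch, 4)
    feats = np.stack([losses_new, losses_mem, norms_new, norms_mem], axis=1)
    h = np.maximum(feats @ W1 + b1, 0.0)   # ReLU hidden layer (size 16)
    raw = h @ W2 + b2                      # shape (batch, 2)
    return np.log1p(np.exp(raw))           # softplus keeps weights positive

rng = np.random.default_rng(0)
B = 8
W1 = rng.normal(0, 0.1, (4, 16)); b1 = np.zeros(16)
W2 = rng.normal(0, 0.1, (16, 2)); b2 = np.zeros(2)
lam = relation_replay_net(rng.random(B), rng.random(B),
                          rng.random(B), rng.random(B),
                          W1, b1, W2, b2)
print(lam.shape)   # (8, 2): one [lambda_D, lambda_M] pair per sample pair
```

In the actual method, the parameters of this network (\phi) are what the outer loop optimizes against the held-out buffer loss.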

3. Algorithmic Workflow and Implementation Details

The RER algorithm maintains a memory buffer \mathcal{M}, updated via reservoir or herding sampling after each task. At each training step, batches from the current task and memory are paired. The RRN is a two-layer MLP (hidden size 16) that receives as input a concatenation of loss and logit-norm values for each pair.
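Reservoir sampling, one of the two buffer-update rules mentioned above, can be sketched as follows. This is a generic textbook implementation, not the paper's code:

```python
import random

def reservoir_update(buffer, capacity, item, n_seen):
    """Standard reservoir sampling: after n_seen items have streamed past,
    each item remains in the buffer with probability capacity / n_seen."""
    if len(buffer) < capacity:
        buffer.append(item)
    else:
        j = random.randrange(n_seen)   # uniform index in [0, n_seen)
        if j < capacity:
            buffer[j] = item           # evict a random resident sample

buffer, capacity = [], 200
for n, sample in enumerate(range(1000), start=1):
    reservoir_update(buffer, capacity, sample, n)
print(len(buffer))   # 200
```

The appeal of this rule for continual learning is that it maintains a uniform sample over the whole stream without knowing task boundaries or the stream length in advance.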

Key implementation parameters include:

  • Memory buffer sizes: M \in \{50, 200, 500, 5120\} were tested.
  • Warm-up period: During the first half of the epochs for each task, \phi is updated while \theta is trained with fixed weights.
  • Outer update interval: \phi is typically updated every S \approx (\text{epochs per task})/10 steps.
  • Optimizers: \theta is updated by SGD (lr = 0.03, momentum = 0.9); \phi by Adam (lr = 1e-3, weight decay = 1e-4).
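For illustration, these settings can be collected into a configuration sketch. The key names and the exact interpretation of S are assumptions; the numeric values mirror the text above:

```python
# Hypothetical configuration sketch; key names are illustrative,
# not taken from the paper's code.
rer_config = {
    "buffer_sizes": [50, 200, 500, 5120],               # memory sizes tested
    "warmup_fraction": 0.5,                             # first half of epochs per task
    "theta_optimizer": {"name": "SGD", "lr": 0.03, "momentum": 0.9},
    "phi_optimizer": {"name": "Adam", "lr": 1e-3, "weight_decay": 1e-4},
}

def outer_update_interval(epochs_per_task):
    """S ~ epochs_per_task / 10 (interpretation assumed from the text)."""
    return max(1, epochs_per_task // 10)

print(outer_update_interval(50))   # 5
```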

A high-level pseudocode is as follows:

Initialize θ (Main Net), φ (RRN), memory M
for t = 1..T:
    Add exemplars from task t to M
    for k = 1..K:
        Sample batch B^D from D_t and B^M from M
        Pair each (x^D_i, x^M_i); compute losses L^D_i, L^M_i and logit norms N^D_i, N^M_i
        [λ^D_i, λ^M_i] = h([L^D_i, L^M_i, N^D_i, N^M_i]; φ)
        L^tr = (1/B) Σ_i [λ^D_i · L^D_i + λ^M_i · L^M_i]
        θ ← θ − η_θ ∇_θ L^tr                      # inner (plasticity) update
        if k mod S == 0:
            Sample B' old samples B^bf from M
            L^bf = (1/B') Σ CE(f(x; θ(φ)), y)     # θ(φ): θ after the inner step, so L^bf depends on φ
            φ ← φ − η_φ ∇_φ L^bf                  # outer (stability) update
(Wang et al., 2021)
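The inner-loop update can be made concrete with a toy numpy sketch: a linear softmax classifier takes one weighted SGD step on paired new/memory batches. Fixed illustrative weights stand in for the RRN output h(·; φ); all names, dimensions, and data here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def ce_per_sample(logits, y):
    p = softmax(logits)
    return -np.log(p[np.arange(len(y)), y] + 1e-12)

# Toy setup: linear classifier theta (d x C), paired new/memory batches.
d, C, B = 5, 3, 8
theta = rng.normal(0, 0.1, (d, C))
xD, yD = rng.normal(size=(B, d)), rng.integers(0, C, B)   # new-task batch
xM, yM = rng.normal(size=(B, d)), rng.integers(0, C, B)   # memory batch

# Fixed illustrative lambdas stand in for the RRN output h(.; phi).
lamD, lamM = np.full(B, 0.6), np.full(B, 0.4)

def weighted_loss(theta):
    return (lamD * ce_per_sample(xD @ theta, yD)
            + lamM * ce_per_sample(xM @ theta, yM)).mean()

def grad(theta):
    # d/dtheta of the weighted softmax cross-entropy for a linear model
    g = np.zeros_like(theta)
    for x, y, lam in ((xD, yD, lamD), (xM, yM, lamM)):
        p = softmax(x @ theta)
        p[np.arange(len(y)), y] -= 1.0
        g += x.T @ (p * lam[:, None]) / len(y)
    return g

before = weighted_loss(theta)
theta -= 0.03 * grad(theta)   # SGD step; lr matches the text
after = weighted_loss(theta)
print(before > after)         # the weighted loss decreases
```

In the full method this step is differentiated through (or approximated) so that the outer loop can propagate the buffer loss back to φ.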

4. Adaptive Stability–Plasticity Trade-off

The RRN adaptively shifts attention between consolidation and accretion. When new and old tasks are semantically similar (quantified via logit norms and observed empirically with structured task similarity studies), RRN increases weights on old samples—prioritizing stability. For dissimilar tasks, weights shift toward current data, increasing plasticity. This per-pair, data-dependent weighting obviates the need for global, hand-tuned \lambda hyperparameters.

Empirical ablation demonstrated that RER reduces accuracy drops in challenging task similarity setups compared to baseline ER (e.g., only a 5.1% ACC drop versus 6.9% for ER on semantically relevant paired tasks) (Wang et al., 2021).

5. Experimental Evaluation and Performance

RER and its variants (RER-ACE and RDER) were evaluated on class-incremental and task-incremental learning settings with CIFAR-10, CIFAR-100, and TinyImageNet.

| Setting | Baseline | Baseline ACC | RER/RDER ACC | BWT (Backward Transfer) |
|---|---|---|---|---|
| CIFAR-10 Class-IL, M=200 | DER++ | ≈62.3% | ≈65.4% | Less negative with RER |
| CIFAR-100 Class-IL, M=100 | DER++ | ≈14.98% | ≈20.8% | +5.8% ACC with RER/RDER |
| TinyImageNet Class-IL, M=5120 | DER++ | ≈37.9% | ≈39.7% | Improved with RER/RDER |

Plugging RER into strong rehearsal baselines (ER-ACE, DER++) consistently improved both ACC and BWT, surpassing previous state-of-the-art methods. These improvements were robust across buffer sizes and datasets (Wang et al., 2021).

6. Comparative Analyses and Ablative Study

Ablation experiments confirmed that the full bi-level scheme, including outer-loop updates on \phi, was essential for optimal performance. Removing adaptive weighting or using vanilla end-to-end updates degraded accuracy compared to the baseline DER++.

Studies on task similarity manipulation established that RER’s sample weighting is sensitive to inter-task relationships, automatically increasing old-sample emphasis when current and prior tasks are similar, which is not seen in fixed-weight rehearsal schemes. Additionally, variations on warm-up duration and outer update intervals showed that 50\% warm-up epochs and an interval near (epochs per task)/10 provided a strong trade-off between computational efficiency and final accuracy (Wang et al., 2021).

7. Limitations and Research Directions

RER introduces computational overhead due to the RRN and extra hyperparameters (warm-up length, update interval for \phi). Optimizing the scheduling of \phi-updates can reduce this complexity. The method continues to rely on an explicit memory buffer; integrating RER with generative replay remains an open avenue. A further extension is to incorporate higher-order interactions (e.g., triplets of tasks or samples) and to use feature-space similarity directly within the RRN. These enhancements could further improve the adaptation of continual learners in more challenging, non-stationary environments.

8. Relationship to Relational Replay in Reinforcement Learning

The concept of leveraging relational context for experience replay also arises in reinforcement learning, as in the Neural Experience Replay Sampler (NERS) framework (Oh et al., 2020). NERS adopts a permutation-equivariant neural architecture that assesses local (per-transition) and global (batch-wise) contexts to enable adaptive, diverse sampling from the experience buffer. Both RER and NERS highlight the advantage of moving beyond independent, fixed-priority replay toward architectures capable of exploiting relational structure among experience samples, although NERS is developed for off-policy RL whereas RER targets continual supervised learning.

In summary, Relational Experience Replay provides an effective, bi-level, and relationally-aware approach to continual learning, substantiated by empirical superiority on standard benchmarks and a robust theoretical grounding (Wang et al., 2021).
