
Retrospective Augmented Rehearsal (RAR)

Updated 5 February 2026
  • RAR is an online continual learning method that integrates repeated data exposure with stochastic augmentations to counter both underfitting and memory overfitting.
  • It leverages a biased empirical risk formulation and RL-based hyperparameter tuning to optimize performance across benchmarks such as CIFAR100 and MiniImageNet.
  • RAR aligns memory and test loss landscapes, resulting in robust generalization and significant accuracy gains over traditional experience replay methods.

Retrospective Augmented Rehearsal (RAR) is a method for online continual learning (OCL) that addresses the dual challenge of underfitting new data and overfitting limited episodic memory. RAR generalizes rehearsal-based learning by combining repeated exposures to new data with stochastic data augmentations applied to both current and memory samples. This approach yields substantially improved risk minimization and empirical accuracy on a wide range of OCL benchmarks, outperforming both vanilla experience replay (ER) and several state-of-the-art variants. RAR provides not only practical gains but also novel theoretical insight into the memory bias and loss landscape approximation properties of rehearsal in OCL (Zhang et al., 2022).

1. Online Continual Learning Problem Formulation

In OCL, a learner observes a non-stationary stream of mini-batches $\{\mathcal{B}_t\}_{t=1}^T$, where each $\mathcal{B}_t = \{(x_i, y_i)\}_{i=1}^B$ is sampled from a current-task distribution $\mathbb{P}(\mathcal{D}_\mathcal{T})$. A fixed-size episodic memory $\mathcal{M}$, containing up to $M$ past examples selected online (typically via reservoir sampling), provides limited storage for rehearsal. At each time $t$, a memory batch $\mathcal{B}_t^{\mathcal{M}} \subset \mathcal{M}$ is sampled, and the model is updated via gradient descent on both current and memory samples.
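As a concrete sketch of the memory mechanism, a minimal reservoir-sampling buffer can be written as follows (the class and method names are illustrative, not from the paper):

```python
import random

class ReservoirMemory:
    """Fixed-size episodic memory updated with reservoir sampling:
    after n observed samples, each sample is retained with probability
    capacity / n, independent of arrival order."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.buffer = []    # stored (x, y) pairs
        self.n_seen = 0     # total stream samples observed so far

    def update(self, batch):
        for x, y in batch:
            self.n_seen += 1
            if len(self.buffer) < self.capacity:
                self.buffer.append((x, y))
            else:
                j = random.randrange(self.n_seen)  # uniform in [0, n_seen)
                if j < self.capacity:
                    self.buffer[j] = (x, y)

    def sample(self, batch_size):
        """Draw a memory batch B_t^M uniformly without replacement."""
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))
```

Each incoming mini-batch is pushed through `update`, and rehearsal draws batches via `sample`.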

The continual learning risk is formalized as

$$\mathcal{R}(\theta) = \frac{1}{\sum_{t=1}^T |\mathcal{B}_t|} \sum_{t=1}^T \sum_{(x,y)\in\mathcal{B}_t} \mathcal{L}(f_\theta(x), y)$$

where $f_\theta$ is the model and $\mathcal{L}$ is the per-sample loss (e.g., cross-entropy). The vanilla ER update at each step is

$$\theta_{t+1} = \theta_t - \frac{\eta}{|\mathcal{B}_t|} \sum_{(x,y)\in\mathcal{B}_t} \nabla \mathcal{L}(f_\theta(x), y) - \frac{\eta}{|\mathcal{B}_t^\mathcal{M}|} \sum_{(x,y)\in\mathcal{B}_t^\mathcal{M}} \nabla \mathcal{L}(f_\theta(x), y)$$

Vanilla ER achieves strong empirical results but remains susceptible to biased risk and catastrophic forgetting, owing to limited memory representativeness and overfitting.
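The two-term update can be sketched generically; `er_update` and `sq_grad` are illustrative helpers, with the per-sample loss gradient left pluggable:

```python
import numpy as np

def er_update(theta, current_batch, memory_batch, grad_fn, lr=0.1):
    """One vanilla ER step: average per-sample loss gradients over the
    incoming batch and over the memory batch, then subtract both scaled
    by the learning rate (the two gradient terms of the ER update)."""
    g_cur = np.mean([grad_fn(theta, x, y) for x, y in current_batch], axis=0)
    g_mem = np.mean([grad_fn(theta, x, y) for x, y in memory_batch], axis=0)
    return theta - lr * g_cur - lr * g_mem

def sq_grad(theta, x, y):
    """Gradient of a squared-loss linear model, L = 0.5 * (theta @ x - y)^2."""
    return (theta @ x - y) * x
```

Any differentiable per-sample loss can replace `sq_grad`, e.g. the cross-entropy gradient of a classifier.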

2. Biased Empirical Risk Minimization and Memory Overfitting

If memory $\mathcal{M}$ is updated via reservoir sampling, vanilla ER optimizes a biased empirical risk

$$\mathcal{R}_t(\theta) = \sum_{(x,y)\in\mathcal{D}_\mathcal{T}} \mathcal{L}(f_\theta(x), y) + \beta_t \lambda \sum_{(x,y)\in\mathcal{D}_\mathcal{M}^0} \mathcal{L}(f_\theta(x), y)$$

where $\lambda = |\mathcal{D}_\mathcal{T}| / |\mathcal{D}_\mathcal{M}^0|$ is the task-to-memory size ratio and $\beta_t = (1 + 2 N_{\text{cur}}^t / N_{\text{past}})^{-1}$, with $N_{\text{cur}}^t$ and $N_{\text{past}}$ denoting counts of current and past samples. Key phenomena include:

  • Bias & overfitting: the factor $\beta_t \lambda$ upweights memory samples, increasing overfitting risk when $\lambda$ is large.
  • Underfitting: a single pass over the stream gives insufficient exposure to new tasks.
  • Temporal dynamics: $\beta_t \to 1$ as the history grows, so the bias persists.

While repeated rehearsal (multiple inner updates $k = 1, \dots, K$) can mitigate underfitting, it accelerates memory overfitting, as the contribution from the incoming batch rapidly diminishes over successive iterations.
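A short numeric sketch makes the bias factor $\beta_t \lambda$ concrete (the sample counts below are illustrative, not from the paper):

```python
def memory_bias(n_cur, n_past, task_size, mem_size):
    """Effective weight beta_t * lambda on the memory term of the
    biased empirical risk."""
    beta_t = 1.0 / (1.0 + 2.0 * n_cur / n_past)   # beta_t = (1 + 2*N_cur/N_past)^-1
    lam = task_size / mem_size                    # lambda = |D_T| / |D_M^0|
    return beta_t * lam

# Early in a task, few current samples have been seen, so beta_t ~ 1 and
# the memory term is upweighted by roughly lambda (here, about 10x).
print(memory_bias(n_cur=100, n_past=10_000, task_size=10_000, mem_size=1_000))
```

With a large task-to-memory ratio the memory term dominates the objective, which is exactly the overfitting regime described above.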

3. Retrospective Augmented Rehearsal: Algorithm and Objective

RAR integrates repeated rehearsal with stochastic augmentation. Let $\mathcal{G}$ denote a compact group of transformations (e.g., RandAugment), with $g \sim \mathbb{Q}$ a sampled augmentation. At each step $t$ and inner iteration $k = 1, \dots, K$:

  1. Sample $\mathcal{B}_{t,k}^\mathcal{M}$ from memory.
  2. Draw $g_{t,k} \sim \mathbb{Q}$.
  3. Form an augmented batch:

$$\mathcal{B}_{t,k}^{\text{aug}} = \left\{ (g_{t,k}(x), y) \mid (x,y) \in \mathcal{B}_t \cup \mathcal{B}_{t,k}^\mathcal{M} \right\}$$

  4. Apply the SGD update:

$$\theta_{t,k+1} = \theta_{t,k} - \frac{\eta}{|\mathcal{B}_{t,k}^{\text{aug}}|} \sum_{(x,y)\in\mathcal{B}_{t,k}^{\text{aug}}} \nabla \mathcal{L}(f_\theta(x), y)$$

The total RAR loss per sample is

$$\ell_{\mathrm{RAR}}(\theta; x, y) = \sum_{k=1}^K \mathcal{L}(f_\theta(g_{t,k} x), y), \quad g_{t,k} \sim \mathbb{Q}$$

RAR thus implements SGD on an augmented biased risk objective with integrated augmentation and memory weighting:

$$\bar{\mathcal{R}}_t(\theta) = \sum_{(x,y)\in\mathcal{D}_\mathcal{T}} \int_{g\in\mathcal{G}} \mathcal{L}(f_\theta(gx), y)\, d\mathbb{Q}(g) + \beta_t \lambda \sum_{(x,y)\in\mathcal{D}_\mathcal{M}^0} \int_{g\in\mathcal{G}} \mathcal{L}(f_\theta(gx), y)\, d\mathbb{Q}(g)$$
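The four steps above can be sketched as a single inner loop; all function and parameter names here are illustrative, and the augmentation sampler, memory buffer, and SGD routine are left pluggable:

```python
import random

def rar_step(theta, current_batch, memory_buffer, K, sample_aug, sgd_update, mem_bs):
    """One RAR outer step: K repeated rehearsal iterations over the same
    incoming batch, each with a freshly sampled memory batch and a freshly
    drawn stochastic augmentation applied to every sample."""
    for _ in range(K):
        # step 1: sample a memory batch B_{t,k}^M
        mem_batch = random.sample(memory_buffer, min(mem_bs, len(memory_buffer)))
        # step 2: draw an augmentation g_{t,k} ~ Q
        g = sample_aug()
        # step 3: form the augmented joint batch B_{t,k}^aug
        aug_batch = [(g(x), y) for x, y in current_batch + mem_batch]
        # step 4: SGD update on the augmented batch
        theta = sgd_update(theta, aug_batch)
    return theta
```

Setting `K=1` and `sample_aug` to the identity recovers vanilla ER, which makes the two added ingredients easy to ablate.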

Empirical ablations confirm that both repetition (K>1K>1) and augmentation are needed for consistent gains: using only one of these is often detrimental.

4. Loss Landscape and Ridge Aversion

It is established that ER solutions can align with high-loss ridges in the past-task loss landscape, resulting in memory overfitting even as memory-loss appears low. RAR aligns the memory-based and test-based loss contours, such that the optimization endpoint lies in a low-loss valley for both criteria. A quantitative metric for this effect is

$$\Delta_{\text{ridge}} = \left| \mathcal{L}_{\text{test}}(\theta^*) - \mathcal{L}_{\text{mem}}(\theta^*) \right|$$

RAR achieves significantly lower Δridge\Delta_{\text{ridge}}, demonstrating superior ridge aversion and improved correspondence between memory and true risk. This property is not observed for pure repetition or pure augmentation in isolation.
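The metric can be computed directly from memory and held-out samples; this is a generic sketch with `loss_fn` left pluggable:

```python
import numpy as np

def ridge_gap(theta, loss_fn, mem_set, test_set):
    """Absolute gap between mean memory loss and mean test loss at a
    solution theta. A large gap indicates the endpoint sits on a
    high-loss ridge of the true (test) landscape despite low memory
    loss, i.e. memory overfitting."""
    mem_loss = np.mean([loss_fn(theta, x, y) for x, y in mem_set])
    test_loss = np.mean([loss_fn(theta, x, y) for x, y in test_set])
    return abs(test_loss - mem_loss)
```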

5. RL-Based Hyperparameter Auto-Tuning

RAR introduces critical hyperparameters:

  • $K$: the number of rehearsal repeats,
  • $(P, Q)$: augmentation strength (RandAugment applies $P$ operators at magnitude $Q$).

Given the difficulty of hyperparameter selection in the absence of an external validation set, a multi-armed bandit with bootstrapped policy gradient (BPG) is employed. The key features of this scheme are:

  • State: none; the bandit is fully observable.
  • Action: $a = (K, P, Q)$, sampled from a policy $\pi_w(a)$.
  • Reward:

$$r_t = -\left| A_{\mathcal{M}}(a) - A^*_{\mathcal{M}} \right|$$

where $A_{\mathcal{M}}$ is the memory-batch training accuracy and $A^*_{\mathcal{M}}$ is a target value.

  • Policy update: the weights $w$ are updated via BPG based on sets of "better"/"worse" actions, using

$$g^{\mathrm{BPG}} = \mathbb{E}_{a\sim\pi_w}\left[\, |r(a)| \bigl( \nabla \log \widehat{\pi}^+_w(a) - \nabla \log \widehat{\pi}^-_w(a) \bigr) \right]$$

The bandit converges within a few tasks, adaptively balancing stability-plasticity and improving robustness.
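A simplified sketch of such a tuner uses a softmax policy over a discrete grid of $(K, P, Q)$ arms and a median split into "better"/"worse" actions; the split rule, learning rate, and class names are illustrative simplifications, not the paper's exact BPG algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)

class BanditTuner:
    """Stateless softmax bandit over discrete (K, P, Q) arms, updated with
    a bootstrapped policy-gradient-style rule: raise the log-probability
    of better-than-median arms, lower that of worse ones, each scaled
    by |r(a)|."""

    def __init__(self, arms, lr=0.5):
        self.arms = arms                      # list of (K, P, Q) tuples
        self.logits = np.zeros(len(arms))
        self.lr = lr

    def policy(self):
        e = np.exp(self.logits - self.logits.max())
        return e / e.sum()

    def sample(self):
        """Pick an arm index according to the current policy."""
        return rng.choice(len(self.arms), p=self.policy())

    def update(self, idx_rewards):
        """idx_rewards: (arm_index, reward) pairs, where reward is
        r = -|A_mem - A_target| observed after using that arm."""
        rewards = np.array([r for _, r in idx_rewards])
        median = np.median(rewards)
        for i, r in idx_rewards:
            grad = np.zeros_like(self.logits)
            grad[i] = 1.0
            grad -= self.policy()             # grad of log pi(i) for softmax
            sign = 1.0 if r >= median else -1.0
            self.logits += self.lr * abs(r) * sign * grad
```

Repeated updates shift probability mass toward arms whose memory accuracy stays near the target, mirroring the stability-plasticity balancing described above.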

6. Empirical Performance on OCL Benchmarks

RAR achieves substantial performance gains on four OCL benchmarks: Seq-CIFAR100, Seq-MiniImageNet, CORE50-NC, and CLRS25-NC, with memory budgets $M \in \{2\text{k}, 5\text{k}\}$. The following table summarizes the improvement in average accuracy:

Benchmark              ER Baseline   ER-RAR   Gain
CIFAR100, M=2k         19.0          27.8     +8.8
MiniImageNet, M=2k     20.0          30.0     +10.0
CORE50, M=2k           24.0          39.3     +15.3
CLRS25, M=2k           18.7          28.6     +9.9

RAR also yields +5–18 point improvements when used to augment the state-of-the-art variants MIR, ASER, and SCR. Notably, ablation studies show that (i) repetition alone or augmentation alone often reduces performance, and only their combination secures robust gains, and (ii) increasing $K$ beyond about 10 yields diminishing, but not negative, returns.

7. Hyperparameter Configuration and Practical Guidelines

Best practice suggestions include:

  • Default: $K = 10$, RandAugment with $(P=1, Q=14)$.
  • For a high task-to-memory ratio $\lambda$ (overfitting risk): stronger augmentation, fewer repeats.
  • For a low $\lambda$ (underfitting risk): more repeats, lighter augmentation.
  • The RL-based adaptation typically doubles runtime, but provides robustness and requires only a single additional pass.
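These guidelines can be collected into a small default configuration with a ratio-based adjustment rule (the field names, threshold, and increments below are illustrative, not from the paper):

```python
# Default RAR settings from the guidelines above.
RAR_DEFAULTS = {
    "repeats_K": 10,    # rehearsal repeats per incoming batch
    "randaug_P": 1,     # number of RandAugment operators
    "randaug_Q": 14,    # RandAugment magnitude
    "auto_tune": True,  # enable bandit-based (K, P, Q) adaptation
}

def adjust_for_ratio(cfg, lam, threshold=10):
    """Heuristic: a high task-to-memory ratio (overfitting risk) calls for
    stronger augmentation and fewer repeats; a low ratio (underfitting
    risk) calls for the opposite. Threshold and step sizes are arbitrary
    illustrative choices."""
    cfg = dict(cfg)
    if lam > threshold:                 # overfitting regime
        cfg["randaug_Q"] += 4
        cfg["repeats_K"] = max(1, cfg["repeats_K"] - 5)
    else:                               # underfitting regime
        cfg["repeats_K"] += 5
        cfg["randaug_Q"] = max(1, cfg["randaug_Q"] - 4)
    return cfg
```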

RAR represents a minimal, generalizable extension to rehearsal-based OCL that simultaneously counters underfitting (by repeated rehearsal) and memory overfitting (by augmentation), yielding accurate empirical risk and alignment with the true loss surface (Zhang et al., 2022).
