
Retrospective Augmented Rehearsal (RAR)

Updated 5 February 2026
  • RAR is an online continual learning method that integrates repeated data exposure with stochastic augmentations to counter both underfitting and memory overfitting.
  • It leverages a biased empirical risk formulation and RL-based hyperparameter tuning to optimize performance across benchmarks such as CIFAR100 and MiniImageNet.
  • RAR aligns memory and test loss landscapes, resulting in robust generalization and significant accuracy gains over traditional experience replay methods.

Retrospective Augmented Rehearsal (RAR) is a method for online continual learning (OCL) that addresses the dual challenge of underfitting new data and overfitting limited episodic memory. RAR generalizes rehearsal-based learning by combining repeated exposures to new data with stochastic data augmentations applied to both current and memory samples. This approach yields substantially improved risk minimization and empirical accuracy on a wide range of OCL benchmarks, outperforming both vanilla experience replay (ER) and several state-of-the-art variants. RAR provides not only practical gains but also novel theoretical insight into the memory bias and loss landscape approximation properties of rehearsal in OCL (Zhang et al., 2022).

1. Online Continual Learning Problem Formulation

In OCL, a learner observes a non-stationary stream of mini-batches $\{\mathcal{B}_t\}_{t=1}^T$, where each $\mathcal{B}_t = \{(x_i, y_i)\}_{i=1}^B$ is sampled from a current-task distribution $\mathbb{P}(\mathcal{D}_\mathcal{T})$. A fixed-size episodic memory $\mathcal{M}$, containing up to $M$ past examples selected online (typically via reservoir sampling), provides limited storage for rehearsal. At each time $t$, a memory batch $\mathcal{B}_t^{\mathcal{M}} \subset \mathcal{M}$ is sampled, and the model is updated via gradient descent on both current and memory samples.
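As a concrete sketch of the memory mechanism, a minimal reservoir-sampling buffer can be written as follows (the class and method names are illustrative, not from the paper):

```python
import random

class ReservoirMemory:
    """Fixed-size episodic memory updated with reservoir sampling:
    after n observed samples, each sample is retained with probability
    capacity / n, independent of arrival order."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.buffer = []    # stored (x, y) pairs
        self.n_seen = 0     # total stream samples observed so far

    def update(self, batch):
        for x, y in batch:
            self.n_seen += 1
            if len(self.buffer) < self.capacity:
                self.buffer.append((x, y))
            else:
                j = random.randrange(self.n_seen)  # uniform in [0, n_seen)
                if j < self.capacity:
                    self.buffer[j] = (x, y)

    def sample(self, batch_size):
        """Draw a memory batch B_t^M uniformly without replacement."""
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))
```

Each incoming mini-batch is pushed through `update`, and rehearsal draws batches via `sample`.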

The continual learning risk is formalized as

$$\mathcal{R}(\theta) = \frac{1}{\sum_{t=1}^T |\mathcal{B}_t|} \sum_{t=1}^T \sum_{(x,y)\in\mathcal{B}_t} \mathcal{L}(f_\theta(x), y)$$

where $f_\theta$ is the model and $\mathcal{L}$ is the per-sample loss (e.g., cross-entropy). The vanilla ER update at each step is

$$\theta_{t+1} = \theta_t - \frac{\eta}{|\mathcal{B}_t|} \sum_{(x,y)\in\mathcal{B}_t} \nabla \mathcal{L}(f_\theta(x), y) - \frac{\eta}{|\mathcal{B}_t^\mathcal{M}|} \sum_{(x,y)\in\mathcal{B}_t^\mathcal{M}} \nabla \mathcal{L}(f_\theta(x), y)$$

Vanilla ER achieves strong empirical results but remains susceptible to biased risk and catastrophic forgetting, owing to limited memory representativeness and overfitting.
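The two-term update can be sketched generically; `er_update` and `sq_grad` are illustrative helpers, with the per-sample loss gradient left pluggable:

```python
import numpy as np

def er_update(theta, current_batch, memory_batch, grad_fn, lr=0.1):
    """One vanilla ER step: average per-sample loss gradients over the
    incoming batch and over the memory batch, then subtract both scaled
    by the learning rate (the two gradient terms of the ER update)."""
    g_cur = np.mean([grad_fn(theta, x, y) for x, y in current_batch], axis=0)
    g_mem = np.mean([grad_fn(theta, x, y) for x, y in memory_batch], axis=0)
    return theta - lr * g_cur - lr * g_mem

def sq_grad(theta, x, y):
    """Gradient of a squared-loss linear model, L = 0.5 * (theta @ x - y)^2."""
    return (theta @ x - y) * x
```

Any differentiable per-sample loss can replace `sq_grad`, e.g. the cross-entropy gradient of a classifier.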

2. Biased Empirical Risk Minimization and Memory Overfitting

If memory $\mathcal{M}$ is updated via reservoir sampling, vanilla ER optimizes a biased empirical risk

$$\mathcal{R}_t(\theta) = \sum_{(x,y)\in\mathcal{D}_\mathcal{T}} \mathcal{L}(f_\theta(x), y) + \beta_t \lambda \sum_{(x,y)\in\mathcal{D}_\mathcal{M}^0} \mathcal{L}(f_\theta(x), y)$$

where $\lambda = |\mathcal{D}_\mathcal{T}| / |\mathcal{D}_\mathcal{M}^0|$ is the task-to-memory size ratio and $\beta_t = (1 + 2 N_{\text{cur}}^t / N_{\text{past}})^{-1}$, with $N_{\text{cur}}^t$ and $N_{\text{past}}$ denoting counts of current and past samples. Key phenomena include:

  • Bias & overfitting: the factor $\beta_t \lambda$ upweights memory samples, increasing overfitting risk when $\lambda$ is large.
  • Underfitting: a single pass over the stream gives insufficient exposure to new tasks.
  • Temporal dynamics: $\beta_t \to 1$ as the history grows, so the bias persists.

While repeated rehearsal (multiple inner updates $k = 1, \dots, K$) can mitigate underfitting, it accelerates memory overfitting, as the contribution from the incoming batch rapidly diminishes over successive iterations.
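A short numeric sketch makes the bias factor $\beta_t \lambda$ concrete (the sample counts below are illustrative, not from the paper):

```python
def memory_bias(n_cur, n_past, task_size, mem_size):
    """Effective weight beta_t * lambda on the memory term of the
    biased empirical risk."""
    beta_t = 1.0 / (1.0 + 2.0 * n_cur / n_past)   # beta_t = (1 + 2*N_cur/N_past)^-1
    lam = task_size / mem_size                    # lambda = |D_T| / |D_M^0|
    return beta_t * lam

# Early in a task, few current samples have been seen, so beta_t ~ 1 and
# the memory term is upweighted by roughly lambda (here, about 10x).
print(memory_bias(n_cur=100, n_past=10_000, task_size=10_000, mem_size=1_000))
```

With a large task-to-memory ratio the memory term dominates the objective, which is exactly the overfitting regime described above.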

3. Retrospective Augmented Rehearsal: Algorithm and Objective

RAR integrates repeated rehearsal with stochastic augmentation. Let $\mathcal{G}$ denote a compact group of transformations (e.g., RandAugment), with $g \sim \mathbb{Q}$ a sampled augmentation. At each step $t$ and inner iteration $k = 1, \dots, K$:

  1. Sample $\mathcal{B}_{t,k}^\mathcal{M}$ from memory.
  2. Draw $g_{t,k} \sim \mathbb{Q}$.
  3. Form an augmented batch:

$$\mathcal{B}_{t,k}^{\text{aug}} = \left\{ (g_{t,k}(x), y) \mid (x,y) \in \mathcal{B}_t \cup \mathcal{B}_{t,k}^\mathcal{M} \right\}$$

  4. Apply the SGD update:

$$\theta_{t,k+1} = \theta_{t,k} - \frac{\eta}{|\mathcal{B}_{t,k}^{\text{aug}}|} \sum_{(x,y)\in\mathcal{B}_{t,k}^{\text{aug}}} \nabla \mathcal{L}(f_\theta(x), y)$$

The total RAR loss per sample is

$$\ell_{\mathrm{RAR}}(\theta; x, y) = \sum_{k=1}^K \mathcal{L}(f_\theta(g_{t,k} x), y), \quad g_{t,k} \sim \mathbb{Q}$$

RAR thus implements SGD on an augmented biased risk objective with integrated augmentation and memory weighting:

$$\bar{\mathcal{R}}_t(\theta) = \sum_{(x,y)\in\mathcal{D}_\mathcal{T}} \int_{g\in\mathcal{G}} \mathcal{L}(f_\theta(gx), y)\, d\mathbb{Q}(g) + \beta_t \lambda \sum_{(x,y)\in\mathcal{D}_\mathcal{M}^0} \int_{g\in\mathcal{G}} \mathcal{L}(f_\theta(gx), y)\, d\mathbb{Q}(g)$$
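The four steps above can be sketched as a single inner loop; all function and parameter names here are illustrative, and the augmentation sampler, memory buffer, and SGD routine are left pluggable:

```python
import random

def rar_step(theta, current_batch, memory_buffer, K, sample_aug, sgd_update, mem_bs):
    """One RAR outer step: K repeated rehearsal iterations over the same
    incoming batch, each with a freshly sampled memory batch and a freshly
    drawn stochastic augmentation applied to every sample."""
    for _ in range(K):
        # step 1: sample a memory batch B_{t,k}^M
        mem_batch = random.sample(memory_buffer, min(mem_bs, len(memory_buffer)))
        # step 2: draw an augmentation g_{t,k} ~ Q
        g = sample_aug()
        # step 3: form the augmented joint batch B_{t,k}^aug
        aug_batch = [(g(x), y) for x, y in current_batch + mem_batch]
        # step 4: SGD update on the augmented batch
        theta = sgd_update(theta, aug_batch)
    return theta
```

Setting `K=1` and `sample_aug` to the identity recovers vanilla ER, which makes the two added ingredients easy to ablate.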

Empirical ablations confirm that both repetition (K>1K>1) and augmentation are needed for consistent gains: using only one of these is often detrimental.

4. Loss Landscape and Ridge Aversion

It is established that ER solutions can align with high-loss ridges in the past-task loss landscape, resulting in memory overfitting even as memory-loss appears low. RAR aligns the memory-based and test-based loss contours, such that the optimization endpoint lies in a low-loss valley for both criteria. A quantitative metric for this effect is

$$\Delta_{\text{ridge}} = \left| \mathcal{L}_{\text{test}}(\theta^*) - \mathcal{L}_{\text{mem}}(\theta^*) \right|$$

RAR achieves significantly lower Δridge\Delta_{\text{ridge}}, demonstrating superior ridge aversion and improved correspondence between memory and true risk. This property is not observed for pure repetition or pure augmentation in isolation.
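The metric can be computed directly from memory and held-out samples; this is a generic sketch with `loss_fn` left pluggable:

```python
import numpy as np

def ridge_gap(theta, loss_fn, mem_set, test_set):
    """Absolute gap between mean memory loss and mean test loss at a
    solution theta. A large gap indicates the endpoint sits on a
    high-loss ridge of the true (test) landscape despite low memory
    loss, i.e. memory overfitting."""
    mem_loss = np.mean([loss_fn(theta, x, y) for x, y in mem_set])
    test_loss = np.mean([loss_fn(theta, x, y) for x, y in test_set])
    return abs(test_loss - mem_loss)
```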

5. RL-Based Hyperparameter Auto-Tuning

RAR introduces critical hyperparameters:

  • $K$: the number of rehearsal repeats,
  • $(P, Q)$: augmentation strength (RandAugment applies $P$ operators at magnitude $Q$).

Given the difficulty of hyperparameter selection in the absence of an external validation set, a multi-armed bandit with bootstrapped policy gradient (BPG) is employed. The key features of this scheme are:

  • State: none; the bandit is fully observable.
  • Action: $a = (K, P, Q)$, sampled from a policy $\pi_w(a)$.
  • Reward:

$$r_t = -\left| A_{\mathcal{M}}(a) - A^*_{\mathcal{M}} \right|$$

where $A_{\mathcal{M}}$ is the memory-batch training accuracy and $A^*_{\mathcal{M}}$ is a target value.

  • Policy update: the weights $w$ are updated via BPG based on sets of "better"/"worse" actions, using

$$g^{\mathrm{BPG}} = \mathbb{E}_{a\sim\pi_w}\left[\, |r(a)| \bigl( \nabla \log \widehat{\pi}^+_w(a) - \nabla \log \widehat{\pi}^-_w(a) \bigr) \right]$$

The bandit converges within a few tasks, adaptively balancing stability-plasticity and improving robustness.
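A simplified sketch of such a tuner uses a softmax policy over a discrete grid of $(K, P, Q)$ arms and a median split into "better"/"worse" actions; the split rule, learning rate, and class names are illustrative simplifications, not the paper's exact BPG algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)

class BanditTuner:
    """Stateless softmax bandit over discrete (K, P, Q) arms, updated with
    a bootstrapped policy-gradient-style rule: raise the log-probability
    of better-than-median arms, lower that of worse ones, each scaled
    by |r(a)|."""

    def __init__(self, arms, lr=0.5):
        self.arms = arms                      # list of (K, P, Q) tuples
        self.logits = np.zeros(len(arms))
        self.lr = lr

    def policy(self):
        e = np.exp(self.logits - self.logits.max())
        return e / e.sum()

    def sample(self):
        """Pick an arm index according to the current policy."""
        return rng.choice(len(self.arms), p=self.policy())

    def update(self, idx_rewards):
        """idx_rewards: (arm_index, reward) pairs, where reward is
        r = -|A_mem - A_target| observed after using that arm."""
        rewards = np.array([r for _, r in idx_rewards])
        median = np.median(rewards)
        for i, r in idx_rewards:
            grad = np.zeros_like(self.logits)
            grad[i] = 1.0
            grad -= self.policy()             # grad of log pi(i) for softmax
            sign = 1.0 if r >= median else -1.0
            self.logits += self.lr * abs(r) * sign * grad
```

Repeated updates shift probability mass toward arms whose memory accuracy stays near the target, mirroring the stability-plasticity balancing described above.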

6. Empirical Performance on OCL Benchmarks

RAR achieves substantial performance gains on four OCL benchmarks: Seq-CIFAR100, Seq-MiniImageNet, CORE50-NC, and CLRS25-NC, with memory budgets $M \in \{2\text{k}, 5\text{k}\}$. The following table summarizes the improvement in average accuracy:

Benchmark              ER Baseline   ER-RAR   Gain
CIFAR100, M=2k         19.0          27.8     +8.8
MiniImageNet, M=2k     20.0          30.0     +10.0
CORE50, M=2k           24.0          39.3     +15.3
CLRS25, M=2k           18.7          28.6     +9.9

RAR also yields +5–18 point improvements when used to augment the state-of-the-art variants MIR, ASER, and SCR. Notably, ablation studies show that (i) repetition alone or augmentation alone often reduces performance, and only their combination secures robust gains, and (ii) increasing $K$ beyond about 10 yields diminishing, but not negative, returns.

7. Hyperparameter Configuration and Practical Guidelines

Best practice suggestions include:

  • Default: $K = 10$, RandAugment with $(P=1, Q=14)$.
  • For a high task-to-memory ratio $\lambda$ (overfitting risk): stronger augmentation, fewer repeats.
  • For a low $\lambda$ (underfitting risk): more repeats, lighter augmentation.
  • The RL-based adaptation typically doubles runtime, but provides robustness and requires only a single additional pass.
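These guidelines can be collected into a small default configuration with a ratio-based adjustment rule (the field names, threshold, and increments below are illustrative, not from the paper):

```python
# Default RAR settings from the guidelines above.
RAR_DEFAULTS = {
    "repeats_K": 10,    # rehearsal repeats per incoming batch
    "randaug_P": 1,     # number of RandAugment operators
    "randaug_Q": 14,    # RandAugment magnitude
    "auto_tune": True,  # enable bandit-based (K, P, Q) adaptation
}

def adjust_for_ratio(cfg, lam, threshold=10):
    """Heuristic: a high task-to-memory ratio (overfitting risk) calls for
    stronger augmentation and fewer repeats; a low ratio (underfitting
    risk) calls for the opposite. Threshold and step sizes are arbitrary
    illustrative choices."""
    cfg = dict(cfg)
    if lam > threshold:                 # overfitting regime
        cfg["randaug_Q"] += 4
        cfg["repeats_K"] = max(1, cfg["repeats_K"] - 5)
    else:                               # underfitting regime
        cfg["repeats_K"] += 5
        cfg["randaug_Q"] = max(1, cfg["randaug_Q"] - 4)
    return cfg
```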

RAR represents a minimal, generalizable extension to rehearsal-based OCL that simultaneously counters underfitting (by repeated rehearsal) and memory overfitting (by augmentation), yielding accurate empirical risk and alignment with the true loss surface (Zhang et al., 2022).
