Slow Surprise Replay: SuRe & EMA Dual-Learner

Updated 23 June 2026

The paper introduces a novel dual-learner method integrating surprise-prioritised replay with EMA consolidation, achieving state-of-the-art continual learning performance.
It mitigates selection error by prioritising high-surprise sequences measured by negative log-likelihood, ensuring critical past data is effectively revisited.
EMA-based consolidation of LoRA adapters minimizes integration error and stabilizes SGD updates, resulting in improved knowledge retention across tasks.

Slow Surprise Replay (SuRe plus EMA dual-learner) is a method for continual finetuning of LLMs designed to mitigate catastrophic forgetting in sequential task learning. It integrates surprise-prioritised replay—storing sequences with high negative log-likelihood ("surprise")—with a dual-learner architecture employing exponential moving average (EMA) consolidation of LoRA adapters. This approach addresses both the selection and integration failure modes in continual learning, achieving state-of-the-art performance in benchmarks with a large number of tasks (Hazard et al., 27 Nov 2025).

1. Continual Learning Failure Modes: Selection and Integration

Continual learning seeks to train models sequentially on tasks $D_1, \dots, D_T$ without catastrophic forgetting. The expected forgetting $\mathcal{F}$ of the "slow" model decomposes into two non-vanishing error terms:

Selection Error: This arises from a mismatch between the true past-task distribution $P_{1:t-1} = \tfrac{1}{t-1} \sum_{k<t} P_k$ and the replay buffer distribution $q$. It is bounded by an integral probability metric $D_{\mathcal F_{\mathrm{loc}}}(P_{1:t-1}, q)$, which quantifies how far the replay buffer is from sampling representative experiences of previous tasks.
Integration Error: This represents variance in stochastic gradient descent (SGD) updates that can overwrite old knowledge. Lemma 2 of Hazard et al. (27 Nov 2025) shows that consolidating past updates via EMA reduces this variance by a factor $\approx (1-\beta)$. The slow parameter update is defined as:

$\theta_t^{\mathrm{slow}} = \beta \theta_{t-1}^{\mathrm{slow}} + (1-\beta) \theta_t^{\mathrm{fast}}, \quad \beta \in (0,1),$

yielding a bound for suboptimality that decays with higher $\beta$ and larger replay windows.

Selection and integration errors must both be addressed for robust continual learning, as neither vanishes in isolation given finite buffer and steps.
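The variance reduction from EMA can be checked numerically. The sketch below is not taken from the paper: it assumes i.i.d. zero-mean noise on the fast iterates, in which case the variance of the EMA output relative to the per-step noise is $(1-\beta)/(1+\beta)$, which is of the same order as the $\approx (1-\beta)$ factor quoted above. The function name is hypothetical.

```python
def ema_noise_variance_factor(beta: float, steps: int = 10_000) -> float:
    """Variance of the EMA output relative to the per-step noise variance.

    Assumption of this sketch: noise on the fast iterates is i.i.d. and zero-mean.
    """
    # Unrolling the recursion puts weight (1 - beta) * beta**k on the k-th most
    # recent fast iterate; for independent noise, variances add with squared weights.
    return sum(((1.0 - beta) * beta ** k) ** 2 for k in range(steps))


for beta in (0.9, 0.99, 0.995):
    factor = ema_noise_variance_factor(beta)
    closed_form = (1.0 - beta) / (1.0 + beta)
    print(f"beta={beta}: factor={factor:.6f}, (1-beta)/(1+beta)={closed_form:.6f}")
```

Under this simplified noise model, a larger $\beta$ gives a smaller factor, matching the qualitative statement that the bound improves with higher $\beta$.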

2. Surprise-Prioritised Replay (SuRe)

SuRe addresses the selection error by storing the most "surprising" sequences, as measured by the sequence-level average negative log-likelihood under the current model. For a sequence $z_i = (z_{i,1}, \dots, z_{i,T_i})$ of length $T_i$ with conditioning input $x_i$:

$S_\theta(z_i) = -\frac{1}{T_i} \sum_{t=1}^{T_i} \log p_\theta(z_{i,t} \mid z_{i,<t}, x_i).$
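As a minimal sketch of this score, the function below averages the negative per-token log-probabilities of one sequence. It assumes the per-token values $\log p_\theta(z_{i,t} \mid z_{i,<t}, x_i)$ have already been obtained from a forward pass of the current model; the function name and the example probabilities are illustrative only.

```python
import math
from typing import Sequence


def surprise(token_log_probs: Sequence[float]) -> float:
    """Average negative log-likelihood of one sequence.

    token_log_probs[t] is assumed to hold log p(z_t | z_<t, x) under the current model.
    """
    if not token_log_probs:
        raise ValueError("sequence must contain at least one token")
    return -sum(token_log_probs) / len(token_log_probs)


# A sequence the model predicts confidently vs. one it predicts poorly.
confident = [math.log(0.9), math.log(0.8), math.log(0.95)]
uncertain = [math.log(0.2), math.log(0.1), math.log(0.3)]
print(surprise(confident) < surprise(uncertain))  # True
```

Dividing by the sequence length keeps long and short sequences comparable, so the ranking is not dominated by length.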

High-surprise examples have higher per-example gradient norms and contribute more significantly to gradient geometry; thus, prioritising them in the replay buffer reduces the selection error. Buffer management enforces per-task quotas. Let $B$ denote the total buffer capacity (2% of total data); after $t$ tasks, each task receives $m = B/t$ slots. On arrival of a new dataset $D_t$, $S_\theta(z)$ is computed for each $z \in D_t$, the top $m$ samples are retained as candidates $C_t$, and the buffer $\mathcal{B}$ is updated, pruning so that each task keeps $m$ examples.

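Buffer Management Pseudocode

Given buffer $\mathcal{B}$, total capacity $B$, and the dataset $D_t$ of the $t$-th task:
- Set the per-task quota $m = B/t$.
- Score every $z \in D_t$ with $S_\theta(z)$ and keep the top $m$ as candidates $C_t$.
- Prune each earlier task's entries in $\mathcal{B}$ to $m$ examples.
- Add $C_t$ to $\mathcal{B}$.

This selection process ensures continual representation of high-impact past data across tasks.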
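The quota-based buffer update can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: the record layout, the function names `top_k_samples` and `update_buffer`, integer division for the quota, and the choice to prune older tasks by their stored surprise scores are all hypothetical.

```python
from typing import Dict, List, Tuple

Scored = Tuple[float, str]  # (surprise score, sequence); hypothetical record layout


def top_k_samples(scored: List[Scored], k: int) -> List[Scored]:
    """Keep the k highest-surprise records."""
    return sorted(scored, key=lambda record: record[0], reverse=True)[:k]


def update_buffer(buffer: Dict[int, List[Scored]], task_id: int,
                  scored_task: List[Scored], capacity: int) -> Dict[int, List[Scored]]:
    """Add the new task's candidates and prune every task to an equal quota."""
    # Assumes task_id is new, so the number of tasks after the update is len(buffer) + 1.
    quota = capacity // (len(buffer) + 1)
    # Assumption: earlier tasks are pruned by their stored surprise scores.
    new_buffer = {tid: top_k_samples(items, quota) for tid, items in buffer.items()}
    new_buffer[task_id] = top_k_samples(scored_task, quota)
    return new_buffer


buffer: Dict[int, List[Scored]] = {}
buffer = update_buffer(buffer, 1, [(0.2, "a"), (1.5, "b"), (0.9, "c"), (2.1, "d")], capacity=4)
buffer = update_buffer(buffer, 2, [(3.0, "e"), (0.1, "f"), (1.1, "g")], capacity=4)
print({tid: [seq for _, seq in items] for tid, items in buffer.items()})
# {1: ['d', 'b'], 2: ['e', 'g']}
```

With a capacity of 4, the first task initially fills the whole buffer and is then cut back to two entries when the second task arrives.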

3. Dual-Learner LoRA with Exponential Moving Average

The architecture attaches dual LoRA adapters to each adapted attention-layer weight matrix:

Fast LoRA ($\theta^{\mathrm{fast}}$): Updated by SGD on the cross-entropy loss over current and replay batches.
Slow LoRA ($\theta^{\mathrm{slow}}$): Not directly optimised; receives updates via EMA using the fast LoRA weights.

During training, the fast head minimises the expected cross-entropy over the mixed batch:

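$\mathcal{L}(\theta^{\mathrm{fast}}) = \mathbb{E}_{(x, z) \sim \text{current batch} \,\cup\, \text{replay batch}} \left[ -\log p_{\theta^{\mathrm{fast}}}(z \mid x) \right].$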

The slow head is consolidated:

$\theta_t^{\mathrm{slow}} = \beta \theta_{t-1}^{\mathrm{slow}} + (1-\beta) \theta_t^{\mathrm{fast}}.$

At inference, only the base model and the slow adapter $\theta^{\mathrm{slow}}$ are used, incurring minimal overhead.
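The fast/slow split can be illustrated with a toy stand-in in which each adapter is a flat list of floats rather than real LoRA matrices. The class `DualAdapter`, its method names, and the initialisation of the slow weights as a copy of the fast ones are assumptions made for this sketch; a real setup would apply the same two updates to adapter tensors.

```python
from typing import List


class DualAdapter:
    """Toy fast/slow adapter pair; parameters are flat lists of floats."""

    def __init__(self, init: List[float], beta: float = 0.995) -> None:
        self.beta = beta
        self.fast = list(init)
        self.slow = list(init)  # assumption: slow weights start as a copy of the fast ones

    def sgd_step(self, grads: List[float], lr: float) -> None:
        """Only the fast weights receive gradient updates."""
        self.fast = [w - lr * g for w, g in zip(self.fast, grads)]

    def consolidate(self) -> None:
        """EMA merge: slow <- beta * slow + (1 - beta) * fast."""
        self.slow = [self.beta * s + (1.0 - self.beta) * f
                     for s, f in zip(self.slow, self.fast)]

    def inference_weights(self) -> List[float]:
        """Only the slow weights are used at inference."""
        return self.slow


adapter = DualAdapter([0.0, 0.0], beta=0.5)
adapter.sgd_step([1.0, -2.0], lr=0.1)   # fast becomes [-0.1, 0.2]
adapter.consolidate()                   # slow becomes [-0.05, 0.1]
print(adapter.inference_weights())
```

The example uses $\beta = 0.5$ only to make the arithmetic easy to follow; the slow weights move half as far as the fast ones after a single step.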

4. EMA Consolidation and Hyperparameters

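Generally, the EMA consolidation is:

$\theta_t^{\mathrm{slow}} = \beta \theta_{t-1}^{\mathrm{slow}} + (1-\beta) \theta_t^{\mathrm{fast}} = \beta^{t} \theta_0^{\mathrm{slow}} + (1-\beta) \sum_{k=1}^{t} \beta^{t-k} \theta_k^{\mathrm{fast}},$

where $\beta$ near 1 (e.g., $\beta = 0.995$) produces an effective averaging window of $\approx 1/(1-\beta) = 200$ steps, substantially reducing variance.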
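The effective window can be worked out directly. The sketch below uses the standard EMA rule of thumb that the window is about $1/(1-\beta)$ steps, and the fact that the total weight an EMA places on the most recent $N$ fast iterates is $1 - \beta^N$. The function names are illustrative, and $\beta = 0.995$ is the EMA rate listed in the hyperparameter table.

```python
def effective_window(beta: float) -> float:
    """Rule-of-thumb averaging window of an EMA with decay beta."""
    return 1.0 / (1.0 - beta)


def weight_mass_in_window(beta: float, steps: int) -> float:
    """Total EMA weight on the most recent `steps` fast iterates: 1 - beta**steps."""
    return 1.0 - beta ** steps


beta = 0.995
window = round(effective_window(beta))
print(window)                                          # 200
print(round(weight_mass_in_window(beta, window), 3))   # about 0.633
```

So with $\beta = 0.995$, roughly 63% of the slow weights' mass comes from the last 200 fast iterates, and the remainder from older ones.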

Key hyperparameters:

Parameter	Value	Definition/Role
Buffer size ($B$)	2% of total dataset	Replay memory
Per-task quota ($m$)	$B/t$ after $t$ tasks	Buffer entries per task
Replay frequency	2	Steps per replay batch
Batch sizes	[values missing]	Current and replay
Learning rate	[value missing]	T5-Large fast adapter
LoRA rank/$\alpha$	[values missing]	Adapter structure
EMA rate ($\beta$)	0.995	EMA decay
Dropout	0.1	Regularisation

Surprise is computed once per sequence either before or after each task, and the buffer is updated post-training as a stability measure.

5. Algorithmic Summary

The complete training process for task $D_t$, with replay buffer $\mathcal{B}$ and per-task quota $m$, is as follows:

Compute candidates $C_t \leftarrow$ TopKSamples($D_t$, score=$S_\theta$, K=$m$).
Update buffer $\mathcal{B} \leftarrow$ UpdateBuffer($\mathcal{B}$, $C_t$, $m$).
For each SGD step $s$ on $D_t$:
- Sample a current batch $b_s$ from $D_t$.
- Every $k$-th step, where $k$ is the replay frequency, supplement with a replay batch $r_s$ from $\mathcal{B}$; otherwise, use $b_s$ only.
- Update $\theta^{\mathrm{fast}}$ via SGD on cross-entropy over $b_s \cup r_s$.
- Merge into $\theta^{\mathrm{slow}}$ via EMA.
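The inner loop of the procedure above can be sketched on a toy problem. This is a simplified illustration, not the paper's code: the model is a single scalar weight, squared error stands in for cross-entropy, the buffer is passed in already built (the TopKSamples and UpdateBuffer steps are omitted), and all names and default values other than the replay frequency of 2 and $\beta = 0.995$ are hypothetical.

```python
import random
from typing import List, Tuple


def train_task(task: List[float], buffer: List[float], fast: float, slow: float,
               steps: int = 200, batch_size: int = 4, replay_batch_size: int = 4,
               replay_every: int = 2, lr: float = 0.05, beta: float = 0.995,
               seed: int = 0) -> Tuple[float, float]:
    """Run SGD on the fast weight with periodic replay, merging into the slow weight by EMA."""
    rng = random.Random(seed)
    for step in range(1, steps + 1):
        batch = rng.sample(task, min(batch_size, len(task)))
        # Every `replay_every`-th step, extend the batch with samples from the buffer.
        if buffer and step % replay_every == 0:
            batch = batch + rng.sample(buffer, min(replay_batch_size, len(buffer)))
        # Gradient of the mean squared error (fast - y)^2, a stand-in for cross-entropy.
        grad = sum(2.0 * (fast - y) for y in batch) / len(batch)
        fast -= lr * grad
        # EMA consolidation after every fast update.
        slow = beta * slow + (1.0 - beta) * fast
    return fast, slow


# Current task pulls the weight towards 1.0; replayed data pulls it towards 0.0.
fast, slow = train_task(task=[1.0] * 8, buffer=[0.0] * 4, fast=0.0, slow=0.0)
print(f"fast={fast:.3f}, slow={slow:.3f}")
```

In this toy run the fast weight settles between the two targets and oscillates with the replay schedule, while the slow weight follows it smoothly and with a lag.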

This workflow enforces dense replay of impactful past sequences and consolidates knowledge in a manner robust to SGD noise and task drift.

6. Empirical Performance

On two benchmarks (averaged over three random task orders):

Standard CL (4 tasks):
- Reservoir replay: 76.9% average accuracy
- Surprise Replay: 77.2%
- Slow Surprise Replay: 78.1%
Large Number of Tasks (15 tasks):
- Reservoir: 69.1%
- Surprise Replay: 72.1%
- Slow Surprise Replay: 75.1%

This represents an improvement of up to +5 percentage points on the Large Number of Tasks (LNT) benchmark over prior state-of-the-art. Ablation studies show resilience under small buffer sizes (e.g., with 300 examples, Slow Surprise Replay achieves ≈75% vs. ≈70% for random replay) and under infrequent replay. Negative forgetting is observed, indicating retention of, or even gains on, prior tasks (Hazard et al., 27 Nov 2025).
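For reference, the gaps between the three rows listed above can be computed directly. The snippet only uses the accuracies reported in this section; it says nothing about the prior state-of-the-art method behind the +5 point figure, whose accuracy is not listed here. The helper name `gains_over` is illustrative.

```python
from typing import Dict

results: Dict[str, Dict[str, float]] = {
    "Standard CL": {"Reservoir": 76.9, "Surprise Replay": 77.2, "Slow Surprise Replay": 78.1},
    "LNT": {"Reservoir": 69.1, "Surprise Replay": 72.1, "Slow Surprise Replay": 75.1},
}


def gains_over(baseline: str, table: Dict[str, Dict[str, float]]) -> Dict[str, Dict[str, float]]:
    """Accuracy gain (percentage points) of each method over the baseline row."""
    return {
        bench: {method: round(acc - rows[baseline], 1)
                for method, acc in rows.items() if method != baseline}
        for bench, rows in table.items()
    }


print(gains_over("Reservoir", results))
# {'Standard CL': {'Surprise Replay': 0.3, 'Slow Surprise Replay': 1.2},
#  'LNT': {'Surprise Replay': 3.0, 'Slow Surprise Replay': 6.0}}
```

The gains over reservoir replay are much larger on the 15-task benchmark than on the 4-task one, and on LNT surprise-based selection and EMA consolidation each account for about 3 points.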

7. Advantages, Limitations, and Deployment Considerations

Advantages:

SuRe reduces selection error via surprise-based prioritisation, storing high-impact samples.
EMA dual-learner minimises integration error, yielding low-variance slow-weights for deployment.
The two components are complementary, each targeting a distinct term in the additive forgetting bound.
Minimal inference overhead, as only the slow adapter is retained at runtime.

Limitations:

Requires known task boundaries and explicit per-task quotas.
Involves an extra forward pass per task to compute surprise, adding a marginal computational cost.
Currently operates offline; extension to task-free or fully online continual learning remains open.

Deployment: The method is lightweight. The buffer is fixed at 2% of the data (so it grows linearly with dataset size, with a small constant), the LoRA adapters add a moderate parameter overhead, and the approach is architecture-agnostic (applicable to vision, speech, etc.).

Slow Surprise Replay establishes surprise-prioritised buffer policies and EMA dual-learner architectures as effective, scalable mechanisms for continual LLM finetuning, substantiating the value of statistically motivated replay and low-variance integration on challenging benchmarks (Hazard et al., 27 Nov 2025).

References (1)

Hazard et al. (27 Nov 2025). SuRe: Surprise-Driven Prioritised Replay for Continual LLM Learning.
