Slow Surprise Replay: SuRe & EMA Dual-Learner
- The paper introduces a novel dual-learner method integrating surprise-prioritised replay with EMA consolidation, achieving state-of-the-art continual learning performance.
- It mitigates selection error by prioritising high-surprise sequences measured by negative log-likelihood, ensuring critical past data is effectively revisited.
- EMA-based consolidation of LoRA adapters minimizes integration error and stabilizes SGD updates, resulting in improved knowledge retention across tasks.
Slow Surprise Replay (SuRe plus EMA dual-learner) is a method for continual finetuning of LLMs designed to mitigate catastrophic forgetting in sequential task learning. It integrates surprise-prioritised replay—storing sequences with high negative log-likelihood ("surprise")—with a dual-learner architecture employing exponential moving average (EMA) consolidation of LoRA adapters. This approach addresses both the selection and integration failure modes in continual learning, achieving state-of-the-art performance in benchmarks with a large number of tasks (Hazard et al., 27 Nov 2025).
1. Continual Learning Failure Modes: Selection and Integration
Continual learning seeks to train models sequentially on tasks without catastrophic forgetting. The expected forgetting of the "slow" model decomposes into two non-vanishing error terms:
- Selection Error: This arises from a mismatch between the true past-task distribution and the replay buffer distribution. It is bounded by an integral probability metric between the two distributions, which quantifies how far the replay buffer is from sampling representative experiences from previous tasks.
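- Integration Error: This represents variance in stochastic gradient descent (SGD) updates that can overwrite old knowledge. Lemma 2 in (Hazard et al., 27 Nov 2025) shows that consolidating past updates via EMA reduces this variance by a factor that depends on the EMA rate $\beta$. With fast parameters $\theta_{\text{fast}}$ and slow parameters $\theta_{\text{slow}}$, the slow parameter update is defined as:
$$\theta_{\text{slow}} \leftarrow \beta\,\theta_{\text{slow}} + (1-\beta)\,\theta_{\text{fast}},$$
yielding a bound for suboptimality that decays with higher $\beta$ and larger replay windows.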
Selection and integration errors must both be addressed for robust continual learning, as neither vanishes in isolation given finite buffer and steps.
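The variance-reduction effect of EMA can be illustrated numerically. The sketch below is not the paper's Lemma 2; it assumes a deliberately simple setting in which the fast iterates are a fixed optimum plus i.i.d. unit-variance noise, for which the stationary variance of the EMA is known to shrink by $(1-\beta)/(1+\beta)$. The function name `ema_track` and all constants are illustrative.

```python
import random
import statistics


def ema_track(values, beta):
    """Return the trajectory slow_k = beta * slow_{k-1} + (1 - beta) * values[k]."""
    slow = values[0]
    out = []
    for v in values:
        slow = beta * slow + (1.0 - beta) * v
        out.append(slow)
    return out


rng = random.Random(0)
beta = 0.9  # illustrative value, chosen so the simulation converges quickly

# Assumption: "fast" iterates are the optimum (0.0) plus i.i.d. Gaussian noise.
fast = [rng.gauss(0.0, 1.0) for _ in range(20000)]
slow = ema_track(fast, beta)

burn_in = 1000  # discard the transient before measuring variance
var_fast = statistics.pvariance(fast[burn_in:])
var_slow = statistics.pvariance(slow[burn_in:])

measured_ratio = var_slow / var_fast
predicted_ratio = (1.0 - beta) / (1.0 + beta)
print(f"measured {measured_ratio:.4f}, predicted {predicted_ratio:.4f}")
```

Under this toy noise model the slow iterate is far less noisy than the fast one, which is the intuition behind using it as the deployed model.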
2. Surprise-Prioritised Replay (SuRe)
SuRe addresses the selection error by storing the most "surprising" sequences, as measured by the sequence-level average negative log-likelihood under the current model. For a sequence $x = (x_1, \dots, x_T)$:
$$s(x) = -\frac{1}{T} \sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})$$
High-surprise examples have higher per-example gradient norms and contribute more to the gradient geometry; prioritising them in the replay buffer therefore reduces the selection error. Buffer management enforces per-task quotas. Let $M$ denote the total buffer capacity (2% of the total data); after $t$ tasks, each task receives $M/t$ slots. When a new dataset $D_t$ arrives, the surprise $s(x)$ is computed for each $x \in D_t$, the top $M/t$ samples are retained as candidates $C_t$, and the buffer $B$ is updated, pruning earlier tasks' entries so that it holds at most $M$ examples.
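As a minimal sketch of the scoring step, the snippet below computes the average negative log-likelihood from per-token log-probabilities and keeps the highest-scoring examples. It assumes the token log-probabilities have already been obtained from a forward pass of the current model; the helper names `surprise` and `top_k_by_surprise` and the toy data are hypothetical.

```python
import math


def surprise(token_log_probs):
    """Average negative log-likelihood of one sequence.

    token_log_probs[t] is assumed to hold log p(x_t | x_<t) under the current model.
    """
    if not token_log_probs:
        raise ValueError("cannot score an empty sequence")
    return -sum(token_log_probs) / len(token_log_probs)


def top_k_by_surprise(examples, k):
    """examples: list of (example_id, token_log_probs). Return the k most surprising ids."""
    ranked = sorted(examples, key=lambda e: surprise(e[1]), reverse=True)
    return [example_id for example_id, _ in ranked[:k]]


# Toy data: the model is confident on "easy" and poorly calibrated on "hard".
examples = [
    ("easy", [math.log(0.9), math.log(0.8)]),
    ("hard", [math.log(0.1), math.log(0.2)]),
    ("mid", [math.log(0.5), math.log(0.5)]),
]
selected = top_k_by_surprise(examples, 2)
print(selected)
```

Averaging over tokens rather than summing keeps long sequences from being selected merely because they are long.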
Buffer Management Pseudocode
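A minimal Python sketch of the quota policy described above, under stated assumptions: the buffer is a dictionary from task id to a list of `(surprise, example)` pairs, quotas are an equal integer split of the capacity, and earlier tasks are pruned by their stored surprise scores. The function name `update_buffer` and the pruning rule for old tasks are illustrative choices, not taken from the paper.

```python
def update_buffer(buffer, task_id, scored_examples, capacity):
    """Return a new buffer with equal per-task quotas.

    buffer: dict mapping task id -> list of (surprise, example) pairs.
    scored_examples: list of (surprise, example) pairs for the incoming task.
    capacity: total number of examples the buffer may hold.
    """
    num_tasks = len(buffer) + (0 if task_id in buffer else 1)
    quota = capacity // num_tasks  # equal split across tasks seen so far

    new_buffer = {}
    # Shrink every earlier task to the new quota, keeping its most surprising items.
    for tid, items in buffer.items():
        if tid != task_id:
            new_buffer[tid] = sorted(items, key=lambda p: p[0], reverse=True)[:quota]
    # Admit the most surprising candidates from the incoming task.
    new_buffer[task_id] = sorted(scored_examples, key=lambda p: p[0], reverse=True)[:quota]
    return new_buffer


buf = {}
buf = update_buffer(buf, "task1", [(0.9, "a"), (0.1, "b"), (0.5, "c"), (0.7, "d")], capacity=4)
buf = update_buffer(buf, "task2", [(0.3, "e"), (0.8, "f"), (0.6, "g")], capacity=4)
print(buf)
```

With a capacity of 4, the first task initially fills the whole buffer and is cut back to its two most surprising items once the second task arrives.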
This selection process ensures continual representation of high-impact past data across tasks.
3. Dual-Learner LoRA with Exponential Moving Average
The architecture attaches dual LoRA adapters to each adapted attention-layer weight matrix:
- Fast LoRA ($\theta_{\text{fast}}$): Updated by SGD on the cross-entropy loss over current and replay batches.
- Slow LoRA ($\theta_{\text{slow}}$): Not directly optimised; receives updates via EMA using the fast LoRA weights.
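During training, the fast head minimises the expected cross-entropy over the mixed batch, where $b_{\text{cur}}$ is the current-task batch and $b_{\text{rep}}$ the replay batch:
$$\min_{\theta_{\text{fast}}} \; \mathbb{E}_{(x, y) \sim b_{\text{cur}} \cup b_{\text{rep}}} \big[ \ell_{\text{CE}}(x, y; \theta_{\text{fast}}) \big]$$
The slow head is consolidated with EMA rate $\beta$:
$$\theta_{\text{slow}} \leftarrow \beta\,\theta_{\text{slow}} + (1-\beta)\,\theta_{\text{fast}}$$
At inference, only the base model and the slow adapter $\theta_{\text{slow}}$ are used, incurring minimal overhead.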
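The division of labour between the two adapters can be sketched as follows. This is a simplified illustration in which plain Python lists stand in for LoRA weight tensors; the class name `DualAdapter` and its methods are hypothetical and do not correspond to any library API.

```python
class DualAdapter:
    """Fast/slow parameter pair; lists of floats stand in for LoRA weights."""

    def __init__(self, init_params, beta=0.995):
        self.fast = list(init_params)
        self.slow = list(init_params)
        self.beta = beta

    def sgd_step(self, grads, lr):
        # Only the fast copy receives gradient updates.
        self.fast = [p - lr * g for p, g in zip(self.fast, grads)]

    def consolidate(self):
        # The slow copy follows the fast copy via EMA and never sees gradients.
        b = self.beta
        self.slow = [b * s + (1.0 - b) * f for s, f in zip(self.slow, self.fast)]

    def inference_params(self):
        # Only the slow copy is used at inference time.
        return list(self.slow)


adapter = DualAdapter([0.0, 0.0], beta=0.5)  # beta=0.5 only to make the arithmetic visible
adapter.sgd_step(grads=[1.0, -2.0], lr=0.1)
adapter.consolidate()
print(adapter.fast, adapter.inference_params())
```

Because the slow copy is only ever a running average of the fast copy, it adds no backward-pass cost; the extra cost during training is one elementwise update per step.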
4. EMA Consolidation and Hyperparameters
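Generally, the EMA consolidation at step $k$ is:
$$\theta_{\text{slow}}^{(k)} = \beta\,\theta_{\text{slow}}^{(k-1)} + (1-\beta)\,\theta_{\text{fast}}^{(k)}$$
where $\beta$ near 1 (e.g., $\beta = 0.995$) produces an effective averaging window of roughly $1/(1-\beta) = 200$ steps, substantially reducing variance.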
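The averaging-window figure follows from unrolling the recursion. As a short worked derivation (standard EMA algebra, written here with $\theta_{\text{fast}}^{(k)}$ and $\theta_{\text{slow}}^{(k)}$ for the fast and slow parameters at step $k$ and $\beta$ for the EMA rate):

$$\theta_{\text{slow}}^{(k)} = \beta^{k}\,\theta_{\text{slow}}^{(0)} + (1-\beta)\sum_{j=0}^{k-1} \beta^{j}\,\theta_{\text{fast}}^{(k-j)}, \qquad \beta^{k} + (1-\beta)\sum_{j=0}^{k-1}\beta^{j} = 1.$$

A fast iterate that is $j$ steps old therefore carries weight $(1-\beta)\beta^{j}$, which decays with time constant $1/(1-\beta)$. For $\beta = 0.995$ this gives $1/(1-0.995) = 200$ steps, and an iterate 200 steps old is down-weighted by $0.995^{200} \approx 0.37$ relative to the newest one.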
Key hyperparameters:
| Parameter | Value | Definition/Role |
|---|---|---|
| Buffer size ($M$) | 2% of total dataset | Replay memory |
| Per-task quota | $M/t$ after $t$ tasks | Buffer entries per task |
| Replay frequency | 2 | Steps per replay batch |
| Batch sizes | see (Hazard et al., 27 Nov 2025) | Current and replay |
| Learning rate | see (Hazard et al., 27 Nov 2025) | T5-Large fast adapter |
| LoRA rank/$\alpha$ | see (Hazard et al., 27 Nov 2025) | Adapter structure |
| EMA rate ($\beta$) | 0.995 | EMA decay |
| Dropout | 0.1 | Regularisation |
Surprise is computed once per sequence either before or after each task, and the buffer is updated post-training as a stability measure.
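For illustration, the settings that are stated numerically in this section can be gathered in a small configuration object. This is a sketch only: the class name `SlowSuReConfig` and its field names are hypothetical, and fields whose values are not given here are left as `None` rather than guessed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SlowSuReConfig:
    buffer_fraction: float = 0.02   # buffer size: 2% of the total dataset
    replay_every: int = 2           # one replay batch every 2 steps
    ema_rate: float = 0.995         # EMA decay for the slow adapter
    dropout: float = 0.1
    learning_rate: Optional[float] = None      # not specified in this section
    lora_rank: Optional[int] = None            # not specified in this section
    lora_alpha: Optional[float] = None         # not specified in this section
    current_batch_size: Optional[int] = None   # not specified in this section
    replay_batch_size: Optional[int] = None    # not specified in this section

    def buffer_capacity(self, dataset_size: int) -> int:
        """Total number of buffer slots for a dataset of the given size."""
        return int(round(self.buffer_fraction * dataset_size))

    def quota(self, dataset_size: int, tasks_seen: int) -> int:
        """Equal per-task share of the buffer after `tasks_seen` tasks."""
        return self.buffer_capacity(dataset_size) // tasks_seen


cfg = SlowSuReConfig()
print(cfg.buffer_capacity(10_000), cfg.quota(10_000, tasks_seen=4))
```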
5. Algorithmic Summary
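The complete training process for task $t$, with dataset $D_t$, buffer $B$ of capacity $M$, and surprise score $s$, is as follows:
- Compute candidates $C_t \leftarrow$ TopKSamples($D_t$, score=$s$, K=$M/t$).
- Update buffer $B \leftarrow$ UpdateBuffer($B$, $C_t$, $M/t$).
- For each SGD step on $D_t$:
  - Sample a current batch $b_{\text{cur}}$ from $D_t$.
  - Every second step, supplement it with a replay batch $b_{\text{rep}}$ from $B$; otherwise, use $b_{\text{cur}}$ only.
  - Update $\theta_{\text{fast}}$ via SGD on cross-entropy over $b_{\text{cur}} \cup b_{\text{rep}}$.
  - Merge into $\theta_{\text{slow}}$ via EMA.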
This workflow enforces dense replay of impactful past sequences and consolidates knowledge in a manner robust to SGD noise and task drift.
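The inner loop of this workflow can be sketched on a toy problem. The example below assumes a one-parameter logistic model in place of an LLM with LoRA adapters, and it assumes candidate selection and the buffer update have already happened; `train_task`, `ce_grad`, the learning rate, and the batch size are all illustrative choices rather than the paper's settings.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def ce_grad(w, batch):
    """Gradient of the mean binary cross-entropy for the model p(y=1|x) = sigmoid(w * x)."""
    return sum((sigmoid(w * x) - y) * x for x, y in batch) / len(batch)


def train_task(data, buffer, w_fast, w_slow, *, steps, lr=0.5, beta=0.995,
               replay_every=2, batch_size=4, rng=None):
    """Run SGD on the fast weight with periodic replay and EMA-merge into the slow weight."""
    rng = rng or random.Random(0)
    for step in range(1, steps + 1):
        batch = rng.sample(data, batch_size)
        # Every `replay_every`-th step, mix in a replay batch from the buffer.
        if buffer and step % replay_every == 0:
            batch = batch + rng.sample(buffer, min(batch_size, len(buffer)))
        w_fast -= lr * ce_grad(w_fast, batch)               # fast learner: plain SGD
        w_slow = beta * w_slow + (1.0 - beta) * w_fast      # slow learner: EMA only
    return w_fast, w_slow


# Toy separable task: label is 1 when x > 0.
data = [(x, 1 if x > 0 else 0) for x in (-2.0, -1.0, -0.5, 0.5, 1.0, 2.0)]
buffer = [(-1.5, 0), (1.5, 1)]  # stand-in for stored high-surprise examples
w_fast, w_slow = train_task(data, buffer, 0.0, 0.0, steps=400)
print(w_fast, w_slow)
```

In this toy run the slow weight trails the fast weight, which is the expected behaviour of an EMA with a window of about 200 steps.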
6. Empirical Performance
On two benchmarks (averaged over three random task orders):
- Standard CL (4 tasks):
  - Reservoir replay: 76.9% average accuracy
  - Surprise Replay: 77.2%
  - Slow Surprise Replay: 78.1%
- Large Number of Tasks (15 tasks):
  - Reservoir replay: 69.1%
  - Surprise Replay: 72.1%
  - Slow Surprise Replay: 75.1%
This represents an improvement of up to +5 percentage points on the Large Number of Tasks (LNT) benchmark over the prior state of the art. Ablation studies show resilience under small buffer sizes (e.g., with 300 examples, Slow Surprise Replay achieves ≈75% vs. ≈70% for random replay) and under infrequent replay. Negative forgetting is also observed, meaning that performance on prior tasks is retained or even improves as later tasks are learned (Hazard et al., 27 Nov 2025).
7. Advantages, Limitations, and Deployment Considerations
Advantages:
- SuRe reduces selection error via surprise-based prioritisation, storing high-impact samples.
- EMA dual-learner minimises integration error, yielding low-variance slow-weights for deployment.
- The two components are complementary, each targeting a distinct term in the additive forgetting bound.
- Minimal inference overhead, as only the slow adapter is retained at runtime.
Limitations:
- Requires known task boundaries and explicit per-task quotas.
- Involves an extra forward pass per task to compute surprise, adding a marginal computational cost.
- Currently operates offline; extension to task-free or fully online continual learning remains open.
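Deployment: The method is lightweight. The buffer is a small fixed fraction (2%) of the data, the LoRA adapters add a moderate parameter overhead, and only the slow adapter is needed at inference. Nothing in the replay or EMA mechanism is specific to language models, so the approach is in principle applicable to other modalities such as vision and speech.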
Slow Surprise Replay establishes surprise-prioritised buffer policies and EMA dual-learner architectures as effective, scalable mechanisms for continual LLM finetuning, substantiating the value of statistically motivated replay and low-variance integration on challenging benchmarks (Hazard et al., 27 Nov 2025).