Surprise-Prioritized Replay (SuRe)
- The paper introduces a formal definition of 'surprise' using per-token NLL in LLMs and RPE in RL to robustly guide data replay selection.
- It details a dual-learner architecture with fast (plastic) and slow (EMA-consolidated) adapters that enhance adaptation while preventing catastrophic forgetting.
- Empirical evaluations show notable performance gains, including up to 5 percentage point improvements on benchmarks, underscoring SuRe's effectiveness.
Surprise-Prioritized Replay (SuRe) denotes a family of continual learning and reinforcement learning techniques in which the prioritization of data stored in experience or rehearsal buffers is determined by a formal measure of "surprise." This surprise is quantified using a domain-appropriate metric, such as negative log-likelihood (NLL) in supervised sequence modeling or reward prediction error (RPE) in reinforcement learning. SuRe aims to mitigate catastrophic forgetting by maximizing the informativeness of replayed data, specifically biasing buffer composition toward samples expected to exert greater influence on parameter updates.
1. Formal Definition of Surprise
In the context of LLMs and continual learning, SuRe operationalizes "surprise" as the per-token negative log-likelihood under the current model weights. Precisely, given model parameters $\theta$ and an input token sequence $x = (x_1, \dots, x_T)$, the per-sequence surprise is
$$S(x) = -\frac{1}{T} \sum_{t=1}^{T} \log p_{\theta}(x_t \mid x_{<t}).$$
Alternatively, the score for a sequence can be taken as the summed negative log-likelihood $\tilde{S}(x) = -\sum_{t=1}^{T} \log p_{\theta}(x_t \mid x_{<t})$, from which the per-token average is recovered by dividing by the sequence length $T$. This definition targets high-loss samples, which lie in steepest-gradient regions of parameter space and are therefore the most salient for replay (Hazard et al., 27 Nov 2025).
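A minimal sketch of this scoring under stated assumptions: `model` is any callable returning next-token logits of shape (batch, seq_len, vocab), and the `pad_id` masking is an illustrative detail rather than the paper's specification.

```python
import torch
import torch.nn.functional as F

# Sketch of the NLL-based surprise score; names and shapes are assumptions.
@torch.no_grad()
def surprise(model, input_ids, pad_id=0):
    logits = model(input_ids)                                  # (B, T, V)
    targets = input_ids[:, 1:]                                 # position t predicts token t+1
    token_nll = F.cross_entropy(
        logits[:, :-1].transpose(1, 2), targets, reduction="none"
    )                                                          # (B, T-1) per-token NLL
    mask = (targets != pad_id).float()
    summed = (token_nll * mask).sum(dim=1)                     # summed surprise
    return summed / mask.sum(dim=1).clamp(min=1)               # per-token average S(x)
```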
In reinforcement learning, as instantiated by RPE-PER, the surprise is the reward prediction error (RPE):
$$\delta_t^{\text{RPE}} = \hat{r}_t - r_t,$$
where $\hat{r}_t$ denotes the predicted reward and $r_t$ is the ground-truth reward at time step $t$ (Yamani et al., 30 Jan 2025). In both cases, the principle is to prioritize samples on which model predictions are maximally inconsistent with observed outcomes.
2. Algorithmic Workflow
The SuRe approach for continual LLM learning is based on task streams, buffer management, dual-adapter training, and surprise ranking:
- Initialization: Freeze the base model and attach two sets of LoRA adapters, a fast set $\theta_{\text{fast}}$ and a slow set $\theta_{\text{slow}}$, to each attention layer.
- Surprise Scoring & Buffer Update: For all new samples, compute their surprise scores $S(x)$. Select the top-$k$ most surprising samples from the current data as buffer candidates. Maintain the replay buffer at a fixed capacity, always keeping the highest-scoring examples.
- Batch Construction & Gradient Update: Alternate between batches drawn from current task data and batches mixed with replayed items. The fast adapter $\theta_{\text{fast}}$ is updated with standard SGD.
- Exponential Moving Average (EMA) Consolidation: After each fast adapter update, the slow adapters are updated via $\theta_{\text{slow}} \leftarrow \beta\,\theta_{\text{slow}} + (1-\beta)\,\theta_{\text{fast}}$, where $\beta$ is the EMA decay rate.
- Inference: Only the slow adapter is employed at test time.
Empirical ablations show this procedure is robust to buffer downsizing (to as little as 2% of the seen data) and to reduced replay frequency (Hazard et al., 27 Nov 2025). A minimal sketch of the training loop appears below.
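In this sketch, `fast` and `slow` are modules holding the two adapter sets, `opt` optimizes only the fast parameters, and `loss_fn`, `score_fn`, `replay_every`, `beta`, and `capacity` are illustrative placeholders rather than the paper's exact interfaces or hyperparameters.

```python
import random
import torch

def ema_consolidate(slow, fast, beta=0.99):
    """theta_slow <- beta * theta_slow + (1 - beta) * theta_fast, parameter-wise."""
    with torch.no_grad():
        for p_slow, p_fast in zip(slow.parameters(), fast.parameters()):
            p_slow.mul_(beta).add_(p_fast, alpha=1.0 - beta)

def update_buffer(buffer, candidates, score_fn, capacity):
    """Re-rank old buffer entries plus new candidates; keep the most surprising."""
    scored = buffer + [(score_fn(x), x) for x in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:capacity]

def train_task(fast, slow, task_batches, buffer, score_fn, loss_fn, opt,
               replay_every=2, beta=0.99, capacity=150):
    for step, batch in enumerate(task_batches):
        if buffer and step % replay_every == 0:
            # Mix replayed high-surprise samples into the current batch.
            k = min(len(buffer), len(batch))
            batch = batch + [x for _, x in random.sample(buffer, k)]
        opt.zero_grad()
        loss_fn(fast, batch).backward()    # fast learner takes the SGD step
        opt.step()
        ema_consolidate(slow, fast, beta)  # slow learner trails via EMA
    # Post-task buffer update keeps rankings stable under parameter changes.
    return update_buffer(buffer, [x for b in task_batches for x in b],
                         score_fn, capacity)
```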
In reinforcement learning, RPE-PER replaces traditional TD-error prioritization with RPE prioritization. For each transition, the EMCN critic's predicted reward and the actual reward are compared, and the absolute RPE is used as the prioritization metric for buffer sampling. Batch sampling for updates uses priorities raised to an exponent $\alpha$, as in standard PER, and importance sampling may be applied to correct the induced sampling bias (Yamani et al., 30 Jan 2025); a sketch of this sampler appears below.
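In the following sketch, `predicted_rewards` would come from the learned critic (e.g., EMCN), and the exponents `alpha` and `beta` follow the standard PER convention rather than values reported in the paper.

```python
import numpy as np

def sample_batch(rewards, predicted_rewards, batch_size,
                 alpha=0.6, beta=0.4, eps=1e-6, rng=None):
    """RPE-prioritized sampling: priority = |predicted reward - actual reward|."""
    rng = rng or np.random.default_rng()
    rpe = np.abs(np.asarray(predicted_rewards) - np.asarray(rewards)) + eps
    priorities = rpe ** alpha
    probs = priorities / priorities.sum()
    idx = rng.choice(len(probs), size=batch_size, p=probs, replace=False)
    # Importance-sampling weights correct the bias from non-uniform sampling.
    weights = (len(probs) * probs[idx]) ** (-beta)
    return idx, weights / weights.max()
```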
3. Buffer Architecture and Prioritization Methods
The SuRe mechanism is characterized by:
- A fixed-size buffer storing the top-$k$ most surprising samples according to the surprise score $S(x)$.
- Per-task quota allocation, ensuring equitable buffer representation across tasks by dividing the buffer capacity evenly among the tasks seen so far (see the sketch after this list).
- Buffer updates performed after each task, which stabilizes rankings under parameter changes and is identified as marginally beneficial for training stability.
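A hypothetical illustration of the quota rule, assuming an even split of buffer capacity across the tasks seen so far (the paper's exact allocation may differ):

```python
def allocate_quotas(capacity: int, num_tasks: int) -> list[int]:
    """Split `capacity` buffer slots as evenly as possible across tasks."""
    base, extra = divmod(capacity, num_tasks)
    return [base + 1 if i < extra else base for i in range(num_tasks)]

def rebuild_buffer(per_task_scored, capacity):
    """per_task_scored: one list of (surprise, sample) pairs per seen task."""
    quotas = allocate_quotas(capacity, len(per_task_scored))
    buffer = []
    for items, quota in zip(per_task_scored, quotas):
        items = sorted(items, key=lambda pair: pair[0], reverse=True)
        buffer.extend(items[:quota])   # keep each task's most surprising samples
    return buffer
```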
In RPE-PER, the prioritization mechanism mirrors PER but with RPE as the priority:
| Method | Priority Signal | Sampling Distribution |
|---|---|---|
| Uniform | None | Uniform |
| PER | TD error magnitude $\lvert\delta_t\rvert$ | Proportional to $\lvert\delta_t\rvert^{\alpha}$ |
| RPE-PER | RPE magnitude $\lvert\hat{r}_t - r_t\rvert$ | Proportional to $\lvert\hat{r}_t - r_t\rvert^{\alpha}$ |
The RPE approach often results in higher stability and informativeness in continuous control domains (Yamani et al., 30 Jan 2025).
4. Dual-Learner Architecture and Parameter Integration
SuRe advances buffer-based CL for LLMs by introducing a dual-learner architecture:
- Fast Learner ($\theta_{\text{fast}}$): High-plasticity LoRA adapters updated with standard SGD; enables fast adaptation to novel data.
- Slow Learner ($\theta_{\text{slow}}$): Low-variance, EMA-filtered LoRA adapters; absorbs knowledge only when it persists across updates, promoting resistance to catastrophic forgetting.
- Consolidation Mechanism: The EMA decay $\beta$ directly governs integration error; larger values smooth out fast-learner noise but slow knowledge integration, so the forgetting-memory tradeoff is tuned empirically through this rate.
- Parameter Efficiency: Only the adapter matrices (low-rank LoRA projections on Q/V) are optimized; the base model remains frozen, providing compute and memory advantages as well as stability (Hazard et al., 27 Nov 2025). A schematic LoRA wrapper is sketched below.
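The following schematic wraps a frozen Q or V projection with a trainable low-rank update; the rank, scaling, and initialization are common defaults assumed for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear projection plus a trainable low-rank (LoRA) update."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                      # base weights stay frozen
        self.A = nn.Parameter(torch.randn(base.in_features, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(rank, base.out_features))
        self.scale = alpha / rank

    def forward(self, x):
        # Frozen projection plus the scaled low-rank correction.
        return self.base(x) + (x @ self.A @ self.B) * self.scale
```

In the dual-learner setup, one such pair of low-rank matrices per wrapped projection plays the role of the fast learner, and an EMA-tracked copy of the same matrices serves as the slow learner used at inference.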
5. Empirical Evaluation
Performance of SuRe and related prioritization methods is documented across continual learning and reinforcement learning settings:
- On LLM benchmarks, SuRe surpasses uniform reservoir-sampling replay by 1–3 percentage points (pp) on standard task sequences and by roughly 3 pp on LNT (large number of tasks) benchmarks. The combined Slow Surprise Replay configuration (SuRe plus the EMA dual-learner) outperforms prior SOTA by up to +5 pp, exceeding Progressive Prompts and MoRA.
- Ablation studies indicate:
- Full-sequence NLL surprise outperforms label-only variants by >10 pp.
- Robustness to buffer-size reductions (down to 150 samples) and to reduced replay frequency (ratios of 1:8 and 1:16).
- An excessively high EMA decay $\beta$ introduces integration lag, consistent with the theoretical analysis.
- In continuous-control RL (MuJoCo), RPE-PER yields higher final evaluation rewards and faster convergence than both plain PER and random buffer sampling in five of six tasks. Gains are especially prominent under TD3 on HalfCheetah (≈75% improvement over PER). Increased weight on reward-prediction loss further improves efficacy (Yamani et al., 30 Jan 2025).
6. Theoretical Framework: Selection and Integration Errors
SuRe provides a formal decomposition of forgetting into additive components, directly tied to replay selection and integration mechanisms:
- Selection Error (an integral probability metric, IPM): Measures the mismatch between the true past-task distribution $P_{\text{past}}$ and the buffer-induced replay distribution $Q_{\text{buffer}}$. Prioritizing high-NLL samples with SuRe aligns replay with underfit or salient prior regions, contracting this error.
- Integration Error: Quantifies the parameter variance propagated to the slow weights $\theta_{\text{slow}}$. Increasing the EMA decay $\beta$ reduces this term, but at the cost of greater integration lag. EMA-based adapter consolidation is thus critical for efficiently averaging over stochastic fast-learner updates. A schematic form of the decomposition appears below.
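A schematic rendering of the additive decomposition, with notation assumed here for illustration rather than taken verbatim from the paper ($P_{\text{past}}$, $Q_{\text{buffer}}$, and $\mathcal{E}_{\text{int}}$ as defined informally above):

$$\text{Forgetting} \;\lesssim\; \underbrace{\mathrm{IPM}\big(P_{\text{past}},\, Q_{\text{buffer}}\big)}_{\text{selection error}} \;+\; \underbrace{\mathcal{E}_{\text{int}}(\beta)}_{\text{integration error}}$$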
Empirically, strong complementarity is observed: SuRe without EMA outperforms uniform replay with EMA in resource-constrained (small-buffer) scenarios, while the full SuRe+EMA design outperforms both, even achieving negative forgetting in some settings (Hazard et al., 27 Nov 2025).
7. Relationship to Broader Replay Prioritization Literature
SuRe and RPE-PER extend the replay prioritization framework established by PER (TD-error prioritization) and model-based critics (e.g., MaPER), replacing the purely scalar loss or TD-error with structured surprise metrics:
- SuRe (LLMs): Negative log-likelihood over full sequence input; architecture-agnostic and requires only a forward pass per sample before task training.
- RPE-PER (RL): RPE predicted by a model-based critic (EMCN); avoids TD noise in continuous-control domains and draws theoretical inspiration from RPE-driven biological replay (Yamani et al., 30 Jan 2025).
- Distinction: While PER tracks "unexpectedness" in action values, SuRe and RPE-PER explicitly utilize the reward- or prediction-level error, demonstrating improved replay efficiency and downstream generalization, especially as task complexity or buffer pressure increases.
Taken together, surprise-prioritized replay advances the state of the art in both continual LLM fine-tuning and deep reinforcement learning by formalizing and operationalizing the replay buffer as a mechanism for high-gradient-region rehearsal and by providing robust, theoretically motivated integration controls. The approach closes much of the gap to joint multitask training under realistic buffer and compute constraints (Hazard et al., 27 Nov 2025, Yamani et al., 30 Jan 2025).