EMA Consolidation in Continual Learning
- EMA consolidation is a weight integration technique that uses an exponential moving average to merge fast and slow learner updates, ensuring stability and retention in sequential task learning.
- It operates within dual-learner frameworks by leveraging LoRA adapters and surprise-based replay to balance rapid adaptation with long-term memory.
- Empirical studies show improved final performance and reduced forgetting, highlighting its significance in scalable, sample-efficient continual learning.
Exponential Moving Average (EMA) consolidation is a weight integration technique central to mitigating catastrophic forgetting in continual learning scenarios for LLMs. EMA consolidation, within the dual-learner paradigm, stabilizes model parameters across sequential tasks by merging fast-adapting weights with a slowly-updated set using an exponential moving average. This method directly addresses difficulties in integrating newly acquired knowledge without compromising long-term retention, making it a critical component for scalable, sample-efficient continual learning frameworks such as SuRe (Hazard et al., 27 Nov 2025).
1. Theoretical Framework for EMA Consolidation in Continual Learning
EMA consolidation operates by maintaining two concurrent sets of low-rank adapters attached to attention layers: a “fast” learner updated via stochastic gradient descent (SGD), and a “slow” learner tracking the fast weights through exponential moving average updates. Formally, after each SGD update at step $t$, the slow weights are consolidated according to:

$$\theta_{\text{slow}}^{(t)} = \beta\,\theta_{\text{slow}}^{(t-1)} + (1-\beta)\,\theta_{\text{fast}}^{(t)}$$

where $\beta \in [0, 1)$ is a decay rate hyperparameter. This selective smoothing dynamically filters out high-variance updates, ensuring the slow adapter encodes more stable, temporally averaged representations across tasks. A higher $\beta$ (e.g., 0.995) slows adaptation and prioritizes retention; smaller values induce greater plasticity.
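Unrolling the update over steps helps show why a larger decay favours retention. The derivation below is a sketch that assumes the standard EMA form $\theta_{\text{slow}}^{(t)} = \beta\,\theta_{\text{slow}}^{(t-1)} + (1-\beta)\,\theta_{\text{fast}}^{(t)}$ with a constant decay $\beta$; the symbols are notation chosen here for illustration and may differ from those used in the paper.

```latex
\theta_{\text{slow}}^{(t)}
  = \beta^{t}\,\theta_{\text{slow}}^{(0)}
  + (1-\beta)\sum_{k=1}^{t} \beta^{\,t-k}\,\theta_{\text{fast}}^{(k)},
\qquad
\beta^{t} + (1-\beta)\sum_{k=1}^{t}\beta^{\,t-k} = 1 .
```

The slow weights are therefore a convex combination of their initial value and all past fast weights, with geometrically decaying coefficients. The effective averaging window is about $1/(1-\beta)$ steps: roughly 200 steps for $\beta = 0.995$ and 1000 steps for $\beta = 0.999$.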
2. Implementation within Dual-Learner Architectures
In the SuRe pipeline, EMA consolidation is realized via LoRA adapters integrated into each attention block’s query ($W_q$) and value ($W_v$) matrices. Two low-rank adapter sets per weight matrix are maintained:
- Fast adapters: $(A_{\text{fast}}, B_{\text{fast}})$, trainable via SGD.
- Slow adapters: $(A_{\text{slow}}, B_{\text{slow}})$, updated by EMA from the fast weights.
Each adapter pair typically has a small rank $r$. After each gradient update on the fast adapters, the slow adapters are merged toward them using the EMA rule.
During inference, only the base LLM and the consolidated slow adapters are employed.
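The fast/slow update cycle described above can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: the adapter names, the flat lists standing in for LoRA matrices, the quadratic toy loss, and the learning rate are all assumptions made so the example runs with the standard library alone.

```python
import random

def ema_update(slow, fast, beta):
    """In-place EMA: slow <- beta * slow + (1 - beta) * fast, per parameter."""
    for name, fast_vals in fast.items():
        slow_vals = slow[name]
        for i, f in enumerate(fast_vals):
            slow_vals[i] = beta * slow_vals[i] + (1.0 - beta) * f

def sgd_step(fast, grads, lr):
    """Plain SGD on the fast adapters only; slow adapters get no gradients."""
    for name, g in grads.items():
        vals = fast[name]
        for i, gi in enumerate(g):
            vals[i] -= lr * gi

def toy_grads(fast, target):
    """Gradient of 0.5 * ||w - target||^2, a stand-in for the task loss."""
    return {n: [w - t for w, t in zip(fast[n], target[n])] for n in fast}

random.seed(0)
# Hypothetical adapter names; flat lists stand in for the LoRA A/B matrices.
names = ["layer0.q.A", "layer0.q.B", "layer0.v.A", "layer0.v.B"]
fast = {n: [random.gauss(0.0, 0.02) for _ in range(4)] for n in names}
slow = {n: list(v) for n, v in fast.items()}  # slow starts as a copy of fast
target = {n: [1.0] * 4 for n in names}

BETA, LR = 0.995, 0.1  # decay from the text; learning rate is an assumption
for step in range(300):
    sgd_step(fast, toy_grads(fast, target), LR)
    ema_update(slow, fast, BETA)  # consolidate after every gradient step

fast_gap = abs(fast["layer0.q.A"][0] - 1.0)
slow_gap = abs(slow["layer0.q.A"][0] - 1.0)
print(f"fast gap: {fast_gap:.4f}, slow gap: {slow_gap:.4f}")
```

The fast adapters reach the toy optimum quickly while the slow adapters lag behind it, which is the smoothing behaviour the text describes. In a real setup only the slow copy would be kept for inference.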
3. Algorithmic Integration with Surprise-Prioritised Replay
EMA consolidation forms one half of a continual learning framework that combines surprise-driven buffer selection (SuRe) with dual-learner integration. The full loop is:
- Surprise-based buffer update: New examples from the current task are prioritized based on their average negative log-likelihood (surprise), maintaining a fixed per-task quota.
- Task training: Fast adapters are updated via SGD on mixed batches that combine current and replayed data.
- EMA consolidation: After each gradient step, the slow adapters are moved toward the fast adapters using an exponential moving average.
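The buffer update in the first step can be sketched as a per-task top-k selection by surprise. This is an illustrative assumption about the mechanics: the class name `SurpriseBuffer`, the heap-based eviction, and the hard-coded token probabilities are hypothetical, and the source does not specify when in training the surprise scores are computed.

```python
import heapq
import math

def sequence_surprise(token_probs):
    """Average negative log-likelihood of a sequence from per-token probabilities."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

class SurpriseBuffer:
    """Keeps the `quota` most surprising examples for each task (hypothetical helper)."""

    def __init__(self, quota):
        self.quota = quota
        self.heaps = {}    # task_id -> min-heap of (surprise, counter, example)
        self._counter = 0  # tie-breaker so examples themselves are never compared

    def add(self, task_id, example, surprise):
        heap = self.heaps.setdefault(task_id, [])
        item = (surprise, self._counter, example)
        self._counter += 1
        if len(heap) < self.quota:
            heapq.heappush(heap, item)
        elif surprise > heap[0][0]:
            heapq.heapreplace(heap, item)  # evict the least surprising entry

    def examples(self, task_id):
        """Stored examples for a task, most surprising first."""
        return [ex for _, _, ex in sorted(self.heaps.get(task_id, []), reverse=True)]

buf = SurpriseBuffer(quota=2)
# Hypothetical per-token probabilities the model assigns to each example.
stream = {
    "easy": [0.9, 0.8, 0.95],
    "medium": [0.5, 0.6, 0.4],
    "hard": [0.1, 0.2, 0.05],
}
for text, probs in stream.items():
    buf.add("task_0", text, sequence_surprise(probs))

kept = buf.examples("task_0")
print(kept)
```

With a quota of two, the low-surprise example is evicted once a more surprising one arrives, so the buffer holds the samples the model currently handles worst.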
The following table summarizes the relationship between the main steps and core mechanisms:
| Step | Role in CL Pipeline | Core Mechanism |
|---|---|---|
| Surprise-based Buffer Update | Selection of high priority samples | Negative log-likelihood |
| Fast Adapter Training | Rapid adaptation to current task | SGD |
| EMA Consolidation | Stable knowledge retention | Exponential Moving Avg. |
Hyperparameters, as optimized in experimental studies, include the EMA decay $\beta$, the LoRA rank $r$, and the replay interval (in gradient steps), with buffer size allocated as 2% of all training samples (Hazard et al., 27 Nov 2025).
4. Empirical Performance and Ablation Insights
EMA-based consolidation in dual-learner models demonstrates state-of-the-art continual learning outcomes across both standard four-task and large-scale 15-task benchmarks. Empirical findings include:
- Final Performance (FP) Improvements: On the challenging LNT (Large Number of Tasks) benchmark, Slow Surprise Replay reaches FP ≈ 75.1%, versus prior SOTA methods (MoRA, ProgPrompt) at 72.0–72.5%. Uniform replay baselines achieve 69.1%, compared to 72.1% with surprise-driven selection alone (Hazard et al., 27 Nov 2025).
- Forgetting Mitigation: EMA-equipped models (Slow-SuRe) deliver reduced forgetting and graceful performance degradation under reduced buffer and replay frequencies.
- Stability-Plasticity Tradeoff: EMA decay values ($\beta$) around 0.995 yield the best compromise; setting $\beta$ closer to 1 (e.g., 0.999) excessively retards learning of new tasks.
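One way to build intuition for the decay sensitivity is to compute how long the slow weights take to catch up with the fast ones. The sketch below assumes an idealized case where the fast weights stay fixed, so the remaining gap shrinks by a factor of the decay at every step; the function names are hypothetical.

```python
import math

def ema_half_life(beta):
    """Steps until the slow weights close half of the gap to a fixed fast target."""
    return math.log(0.5) / math.log(beta)

def gap_remaining(beta, steps):
    """Fraction of the initial slow-to-fast gap left after `steps` EMA updates."""
    return beta ** steps

for beta in (0.99, 0.995, 0.999):
    print(
        f"beta={beta}: half-life ~ {ema_half_life(beta):.0f} steps, "
        f"gap left after 500 steps = {gap_remaining(beta, 500):.3f}"
    )
```

Under this idealization the half-life is roughly 138 steps at a decay of 0.995 but roughly 693 steps at 0.999, which is consistent with the observation that values very close to 1 slow the uptake of new tasks.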
Ablation studies indicate consistent gains from slow-weight consolidation, with robustness to buffer size variations (e.g., Slow-SuRe peaks at ≈76.0% with buffer size 1500 on LNT), and superior sample efficiency at replay ratios up to 1:16.
5. Relationship to Selection-Integration Duality in Theoretical Bounds
Theoretical analysis frames continual learning as a balance between two failure modes: selection (data subset retained and replayed) and integration (mechanism by which new and old knowledge are merged). Surprise-driven prioritization (SuRe) constrains the selection error by filtering for maladapted (“surprising”) samples. EMA consolidation tightly regulates the integration term by ensuring only incremental, smoothed updates to the knowledge base—addressing stability without sacrificing plasticity. The empirical synergy of these two approaches underpins the performance edge for continual LLM adaptation (Hazard et al., 27 Nov 2025).
6. Limitations and Context Within Broader Continual Learning Research
While EMA consolidation in a dual-learner setting marks a significant improvement over regularization-based methods (such as EWC and O-LoRA) and reservoir replay, it remains bounded by the multi-task learning upper bound (e.g., ≈78.1% FP vs. 80.0% MTL on standard benchmarks). Key limitations include additional storage for dual adapters and sensitivity to EMA hyperparameters. Comparison to older strategies demonstrates that replay and consolidation are complementary: surprise-based selection filters for value, while EMA ensures resilience to catastrophic drift. These principles generalize beyond text to vision and reinforcement learning contexts, but specificity of buffer construction and merging strategies remains domain-dependent (Hazard et al., 27 Nov 2025).
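The storage cost of keeping two adapter sets can be estimated with simple arithmetic. The numbers below are purely hypothetical (32 layers, hidden size 4096, rank 8, square query and value projections) and are not taken from the paper; the sketch only shows that the dual-learner setup doubles adapter parameters.

```python
def lora_param_count(d_in, d_out, rank):
    """Parameters in one LoRA pair: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)

def adapter_storage(num_layers, d_model, rank, matrices_per_layer=2, num_sets=2):
    """Total adapter parameters; matrices_per_layer=2 covers query and value."""
    per_set = num_layers * matrices_per_layer * lora_param_count(d_model, d_model, rank)
    return per_set * num_sets

# Hypothetical model dimensions, chosen only for illustration.
single = adapter_storage(num_layers=32, d_model=4096, rank=8, num_sets=1)
dual = adapter_storage(num_layers=32, d_model=4096, rank=8, num_sets=2)
print(f"single set: {single:,} params, fast + slow: {dual:,} params")
```

Under these assumed dimensions the overhead is a few million extra parameters, which is small relative to a multi-billion-parameter base model.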