EMA Consolidation in Continual Learning

Updated 23 June 2026
  • EMA consolidation is a weight integration technique that uses an exponential moving average to merge fast-learner updates into a slow learner, promoting stability and retention in sequential task learning.
  • It operates within dual-learner frameworks by leveraging LoRA adapters and surprise-based replay to balance rapid adaptation with long-term memory.
  • Empirical studies show improved final performance and reduced forgetting, highlighting its significance in scalable, sample-efficient continual learning.

Exponential Moving Average (EMA) consolidation is a weight integration technique central to mitigating catastrophic forgetting in continual learning scenarios for LLMs. EMA consolidation, within the dual-learner paradigm, stabilizes model parameters across sequential tasks by merging fast-adapting weights with a slowly-updated set using an exponential moving average. This method directly addresses difficulties in integrating newly acquired knowledge without compromising long-term retention, making it a critical component for scalable, sample-efficient continual learning frameworks such as SuRe (Hazard et al., 27 Nov 2025).

1. Theoretical Framework for EMA Consolidation in Continual Learning

EMA consolidation operates by maintaining two concurrent sets of low-rank adapters attached to attention layers: a “fast” learner updated via stochastic gradient descent (SGD), and a “slow” learner tracking the fast weights through exponential moving average updates. Formally, after each SGD update at step $t$, the slow weights $\theta_{\mathrm{slow}}^{(t)}$ are consolidated according to:

$$\theta_{\mathrm{slow}}^{(t)} = \beta\,\theta_{\mathrm{slow}}^{(t-1)} + (1 - \beta)\,\theta_{\mathrm{fast}}^{(t)}$$

where $\beta \in (0,1)$ is a decay rate hyperparameter. This selective smoothing dynamically filters out high-variance updates, ensuring the slow adapter encodes more stable, temporally-averaged representations across tasks. A higher $\beta$ (e.g., 0.995) slows adaptation and prioritizes retention; smaller values induce greater plasticity.
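As a minimal sketch of the update rule above, the following plain-Python function applies one EMA consolidation step element-wise. The `Params` container and the name `ema_update` are illustrative assumptions, not identifiers from SuRe; a real implementation would operate on framework tensors rather than lists.

```python
from typing import Dict, List

# Hypothetical flat parameter container: name -> list of floats.
Params = Dict[str, List[float]]


def ema_update(slow: Params, fast: Params, beta: float) -> None:
    """In place: slow <- beta * slow + (1 - beta) * fast, element-wise."""
    if not 0.0 < beta < 1.0:
        raise ValueError("beta must lie in the open interval (0, 1)")
    for name, fast_values in fast.items():
        slow_values = slow[name]
        for i, f in enumerate(fast_values):
            slow_values[i] = beta * slow_values[i] + (1.0 - beta) * f


# One consolidation step with a deliberately small beta so the change is visible.
slow = {"w": [0.0, 0.0]}
fast = {"w": [1.0, -2.0]}
ema_update(slow, fast, beta=0.9)
# slow["w"] is now approximately [0.1, -0.2]
```

With a decay such as 0.995, the same call would move the slow weights only 0.5% of the way toward the fast weights per step, which is the retention-oriented regime described above.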

2. Implementation within Dual-Learner Architectures

In the SuRe critical-path, EMA consolidation is realized via LoRA adapters integrated into each attention block’s query ($W_Q$) and value ($W_V$) matrices. Two low-rank adapter sets per weight matrix are maintained:

  • Fast adapters: $(A^{\mathrm{fast}}, B^{\mathrm{fast}})$, trainable via SGD.
  • Slow adapters: $(A^{\mathrm{slow}}, B^{\mathrm{slow}})$, updated by EMA from the fast weights.

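Each adapter pair typically has rank $r = 8$. After each gradient update on the fast adapters, the slow adapters are merged using the EMA rule, applied to each adapter matrix:

$$A^{\mathrm{slow}} \leftarrow \beta\,A^{\mathrm{slow}} + (1 - \beta)\,A^{\mathrm{fast}}, \qquad B^{\mathrm{slow}} \leftarrow \beta\,B^{\mathrm{slow}} + (1 - \beta)\,B^{\mathrm{fast}}$$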

During inference, only the base LLM and the consolidated slow adapters are employed.
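The dual-adapter arrangement can be illustrated with a small, dependency-free sketch. It assumes the standard LoRA parameterization, in which the effective weight is $W + BA$, and it omits the usual LoRA scaling factor. The helper names (`matmul`, `ema_merge`, `effective_weight`), the zero initialization of the slow adapters, and the toy dimensions are all hypothetical choices for illustration, not SuRe's implementation.

```python
from typing import List

Matrix = List[List[float]]


def matmul(x: Matrix, y: Matrix) -> Matrix:
    """Plain nested-list matrix product x @ y."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def ema_merge(slow: Matrix, fast: Matrix, beta: float) -> Matrix:
    """Return beta * slow + (1 - beta) * fast, element-wise."""
    return [[beta * s + (1.0 - beta) * f for s, f in zip(s_row, f_row)]
            for s_row, f_row in zip(slow, fast)]


def effective_weight(w_base: Matrix, a: Matrix, b: Matrix) -> Matrix:
    """Frozen base weight plus the low-rank update: W + B A."""
    delta = matmul(b, a)
    return [[w + d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(w_base, delta)]


# Toy 2x2 base weight with rank-1 adapters (the source uses rank 8).
w_q = [[1.0, 0.0], [0.0, 1.0]]
a_fast, b_fast = [[1.0, 2.0]], [[0.5], [-0.5]]   # pretend these came from SGD
a_slow, b_slow = [[0.0, 0.0]], [[0.0], [0.0]]    # assumed zero start

# One consolidation step; beta = 0.5 is used only to keep the numbers readable.
a_slow = ema_merge(a_slow, a_fast, beta=0.5)
b_slow = ema_merge(b_slow, b_fast, beta=0.5)

# Inference uses the base weight with the slow adapters only.
w_q_inference = effective_weight(w_q, a_slow, b_slow)
```

Note that averaging $A$ and $B$ separately is not the same as averaging the product $BA$; the sketch follows the per-adapter update described in this section.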

3. Algorithmic Integration with Surprise-Prioritised Replay

EMA consolidation forms one half of a continual learning framework that combines surprise-driven buffer selection (SuRe) with dual-learner integration. The full loop is:

  1. Surprise-based buffer update: New examples from the current task are prioritized based on their average negative log-likelihood (surprise), maintaining a fixed per-task quota.
  2. Task training: Fast adapters are updated via SGD on mixed batches that combine current and replayed data.
  3. EMA consolidation: After each gradient step, the slow adapters are moved toward the fast adapters using an exponential moving average.
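The first step of the loop above can be sketched as follows, assuming that "surprise" is the mean negative log-likelihood over an example's tokens and that the highest-surprise examples fill the per-task quota. The function names, the `(text, token_log_probs)` input format, and the dictionary buffer keyed by task are hypothetical; in practice the log-probabilities would come from the model being trained.

```python
import heapq
from typing import Dict, List, Sequence, Tuple


def surprise(token_log_probs: Sequence[float]) -> float:
    """Average negative log-likelihood over the tokens of one example."""
    if not token_log_probs:
        raise ValueError("an example needs at least one token")
    return -sum(token_log_probs) / len(token_log_probs)


def select_for_buffer(
    examples: List[Tuple[str, Sequence[float]]], quota: int
) -> List[str]:
    """Keep the `quota` examples with the highest surprise, most surprising first."""
    scored = [(surprise(log_probs), text) for text, log_probs in examples]
    top = heapq.nlargest(quota, scored, key=lambda pair: pair[0])
    return [text for _, text in top]


# Toy per-token log-probabilities for three examples of one task.
task_examples = [
    ("easy", [-0.1, -0.2]),
    ("hard", [-2.0, -3.0]),
    ("medium", [-1.0, -0.5]),
]

replay_buffer: Dict[str, List[str]] = {}
replay_buffer["task_1"] = select_for_buffer(task_examples, quota=2)
# replay_buffer["task_1"] == ["hard", "medium"]
```

Keeping one list per task is one simple way to enforce a fixed per-task quota; other buffer layouts would work equally well.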

The following table summarizes the relationship between the main steps and core mechanisms:

| Step | Role in CL Pipeline | Core Mechanism |
|---|---|---|
| Surprise-based Buffer Update | Selection of high-priority samples | Negative log-likelihood |
| Fast Adapter Training | Rapid adaptation to current task | SGD |
| EMA Consolidation | Stable knowledge retention | Exponential moving average |

Hyperparameters, as optimized in experimental studies, include the EMA decay $\beta$ (values around 0.995 performed best in the ablations), LoRA rank $r = 8$, and a replay interval measured in gradient steps, with buffer size allocated as 2% of all training samples (Hazard et al., 27 Nov 2025).

4. Empirical Performance and Ablation Insights

EMA-based consolidation in dual-learner models demonstrates state-of-the-art continual learning outcomes across both standard four-task and large-scale 15-task benchmarks. Empirical findings include:

  • Final Performance (FP) Improvements: On the challenging LNT (Large Number of Tasks) benchmark, Slow Surprise Replay reaches FP ≈ 75.1%, versus prior SOTA methods (MoRA, ProgPrompt) at 72.0–72.5%. Uniform replay baselines achieve 69.1%, compared to 72.1% with surprise-driven selection alone (Hazard et al., 27 Nov 2025).
  • Forgetting Mitigation: EMA-equipped models (Slow-SuRe) deliver reduced forgetting and graceful performance degradation under reduced buffer and replay frequencies.
  • Stability-Plasticity Tradeoff: $\beta$ values around 0.995 yield the best compromise; setting $\beta$ closer to 1 (e.g., 0.999) excessively retards learning of new tasks.
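One way to build intuition for the gap between 0.995 and 0.999 is to unroll the EMA recursion: the fast weights from $k$ steps ago contribute to the current slow weights with coefficient $(1 - \beta)\beta^{k}$. The sketch below computes that coefficient and the resulting half-life in gradient steps. It is an illustrative calculation based only on the EMA rule, not an analysis taken from the paper, and the function names are hypothetical.

```python
import math


def weight_on_past_update(beta: float, k: int) -> float:
    """Coefficient of the fast weights from k steps ago in the current slow weights."""
    return (1.0 - beta) * beta ** k


def ema_half_life(beta: float) -> float:
    """Number of steps k at which beta**k has decayed to one half."""
    return math.log(0.5) / math.log(beta)


half_life_0995 = ema_half_life(0.995)  # roughly 138 gradient steps
half_life_0999 = ema_half_life(0.999)  # roughly 693 gradient steps
```

Under this reading, moving from 0.995 to 0.999 makes the slow learner about five times slower to absorb new fast-weight updates, which is consistent with the slowed learning of new tasks reported above.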

Ablation studies indicate consistent gains from slow-weight consolidation, with robustness to buffer size variations (e.g., Slow-SuRe peaks at ≈76.0% with buffer size 1500 on LNT), and superior sample efficiency at replay ratios up to 1:16.

5. Relationship to Selection-Integration Duality in Theoretical Bounds

Theoretical analysis frames continual learning as a balance between two failure modes: selection (data subset retained and replayed) and integration (mechanism by which new and old knowledge are merged). Surprise-driven prioritization (SuRe) constrains the selection error by filtering for maladapted (“surprising”) samples. EMA consolidation tightly regulates the integration term by ensuring only incremental, smoothed updates to the knowledge base—addressing stability without sacrificing plasticity. The empirical synergy of these two approaches underpins the performance edge for continual LLM adaptation (Hazard et al., 27 Nov 2025).

6. Limitations and Context Within Broader Continual Learning Research

While EMA consolidation in a dual-learner setting marks a significant improvement over regularization-based methods (such as EWC and O-LoRA) and reservoir replay, it remains bounded by the multi-task learning upper bound (e.g., ≈78.1% FP vs. 80.0% MTL on standard benchmarks). Key limitations include additional storage for dual adapters and sensitivity to EMA hyperparameters. Comparison to older strategies demonstrates that replay and consolidation are complementary: surprise-based selection filters for value, while EMA ensures resilience to catastrophic drift. These principles generalize beyond text to vision and reinforcement learning contexts, but specificity of buffer construction and merging strategies remains domain-dependent (Hazard et al., 27 Nov 2025).

References (1)
