Catastrophic Forgetting in Continual Learning
- Catastrophic Forgetting is the loss of previously learned knowledge when models update on new tasks without revisiting old data.
- It is quantified by metrics such as F_k and average forgetting, with larger models and disjoint tasks experiencing more severe drops.
- Mitigation strategies include replay-based, regularization-based, and architectural methods to balance stability and plasticity in continual learning.
Catastrophic Forgetting (CF) is a fundamental challenge in continual and incremental learning regimes, referring to the dramatic loss of previously acquired knowledge when a model is updated on new data or tasks without explicit access to the earlier data. CF persists as a central obstacle for neural networks, LLMs, and related systems across a wide range of supervised, generative, and reinforcement learning applications.
1. Formal Definitions, Metrics, and Phenomenology
Catastrophic forgetting manifests as a precipitous decline in accuracy or predictive performance on old tasks after sequentially training on new, disjoint data. In the classical continual learning setup, let $T_1, \dots, T_N$ be a sequence of tasks, $\theta_0$ the initial model, and $\theta_t$ the model after training on the $t$-th task. If $A_{k,t}$ denotes accuracy on task $k$ after learning up to task $t$, forgetting on task $k$ is formally measured as
$$F_k = \max_{t \in \{k, \dots, N-1\}} A_{k,t} - A_{k,N},$$
with the average forgetting over $N$ tasks given by $\bar{F} = \frac{1}{N-1} \sum_{k=1}^{N-1} F_k$ (Aleixo et al., 2023). Similar metrics such as backward transfer (BWT) and average plasticity (AP) are widely adopted. Application-specific adaptations, such as percentage-drop metrics on held-out language understanding benchmarks for LLMs, are routine (Luo et al., 2023).
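As a minimal sketch of how these quantities can be computed in practice, the Python snippet below assumes an accuracy matrix `acc` in which `acc[t][k]` is the accuracy on task `k` after training through task `t` (0-indexed). The function names and the toy numbers are illustrative assumptions, not values from the cited papers; BWT follows the common definition as the mean difference between final accuracy and accuracy measured right after each task was learned.

```python
def forgetting_per_task(acc):
    """Per-task forgetting from an accuracy matrix.

    acc[t][k] is the accuracy on task k after training through task t.
    For every task except the last, return the gap between the best accuracy
    reached on it before the final stage and its final accuracy.
    """
    n = len(acc)
    final = acc[n - 1]
    return [
        max(acc[t][k] for t in range(k, n - 1)) - final[k]
        for k in range(n - 1)
    ]


def average_forgetting(acc):
    """Mean of the per-task forgetting values."""
    per_task = forgetting_per_task(acc)
    return sum(per_task) / len(per_task)


def backward_transfer(acc):
    """Mean of (final accuracy - accuracy right after learning the task)."""
    n = len(acc)
    return sum(acc[n - 1][k] - acc[k][k] for k in range(n - 1)) / (n - 1)


# Toy 3-task run with made-up numbers: rows are training stages,
# columns are the task being evaluated.
acc = [
    [0.95, 0.10, 0.10],
    [0.70, 0.93, 0.10],
    [0.55, 0.80, 0.94],
]

per_task = forgetting_per_task(acc)
avg_f = average_forgetting(acc)
bwt = backward_transfer(acc)
```

In this toy run the first task loses 0.40 accuracy and the second 0.13, so average forgetting is positive while BWT is negative by the same amount. The two metrics diverge only when a task's best accuracy is reached at some stage after it was first learned.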
Precise assessment requires careful protocol design. A prominent concern is to avoid “prescient” evaluation, i.e., hyperparameter tuning or early stopping that improperly uses old-task data that would be unavailable in deployment (Pfülb et al., 2019). Under these realistic constraints, most methods are empirically shown to suffer significant CF, especially in challenging class-incremental settings (Pfülb et al., 2019).
2. Theoretical Foundations and Loss Landscape Connections
The root cause of CF is the shared parameterization of neural models: SGD or other optimizers minimize the empirical loss on the latest task, drifting away from optima found for earlier tasks. Bayesian analysis frames this as a sequential posterior update, $p(\theta \mid D_{1:k}) \propto p(D_k \mid \theta)\, p(\theta \mid D_{1:k-1})$, where $D_k$ is the data for task $k$. Without regularization, the new data rapidly “overwrites” parameter regions critical for old-task function (Loke et al., 14 Jul 2025, Doan et al., 2020).
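A short worked derivation may help connect this Bayesian view to the quadratic-penalty regularizers discussed in Section 4.2. The sketch below writes $D_k$ for the data of task $k$ and assumes a Laplace (Gaussian) approximation of the old-task posterior with a diagonal Fisher information matrix, which is the approximation underlying EWC. The symbols $\theta^{*}_{k-1}$ (parameters after task $k-1$), $\mathcal{F}_i$ (diagonal Fisher entry, unrelated to the forgetting measure), $\mathcal{L}_k$ (loss on task $k$), and $\lambda$ (penalty strength) are notation introduced here for illustration.

```latex
\begin{aligned}
\log p(\theta \mid D_{1:k})
  &= \log p(D_k \mid \theta) + \log p(\theta \mid D_{1:k-1}) + \text{const}, \\
\log p(\theta \mid D_{1:k-1})
  &\approx -\tfrac{1}{2} \sum_i \mathcal{F}_i \,\bigl(\theta_i - \theta^{*}_{k-1,i}\bigr)^2 + \text{const}, \\
\mathcal{L}(\theta)
  &= \mathcal{L}_k(\theta) + \tfrac{\lambda}{2} \sum_i \mathcal{F}_i \,\bigl(\theta_i - \theta^{*}_{k-1,i}\bigr)^2 .
\end{aligned}
```

Maximizing the approximate log posterior is then equivalent to minimizing the new-task loss plus a penalty that is stiff along directions the old tasks were sensitive to and loose elsewhere. Here $\lambda$ absorbs the scaling of the Fisher term.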
Empirical and theoretical studies show the geometry of the loss landscape is intimately linked to CF (Li et al., 2024). Flatter solutions, as quantified by flatness measures such as surface curvature (SC), average gradient (AG), and mean absolute gradient (MAG), produce significantly greater retention across tasks, while sharp minima amplify forgetting. This landscape perspective holds for both deep neural networks and LLMs; successive fine-tuning on divergent instruction sets (e.g., Alpaca → Open-Platypus) not only sharpens the loss landscape but also causes stepwise drops of 6–17 percentage points on held-out general tasks such as MMLU (Li et al., 2024).
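One way to bias training toward the flatter solutions described above is sharpness-aware minimization (SAM), listed among the regularization-based methods in Section 4.2. The following is a rough sketch of the two-step SAM update on a toy quadratic loss, not the setup used in the cited LLM experiments. The loss, the `curvatures` vector, and the values of `lr` and `rho` are all assumptions chosen for illustration.

```python
import math


def loss(w, curvatures):
    """Toy loss 0.5 * sum(c_i * w_i^2); a larger c_i means a sharper direction."""
    return 0.5 * sum(c * x * x for c, x in zip(curvatures, w))


def grad(w, curvatures):
    """Gradient of the toy loss."""
    return [c * x for c, x in zip(curvatures, w)]


def sam_step(w, curvatures, lr=0.05, rho=0.05):
    """One SAM update.

    1. Move to a nearby 'worst-case' point: a step of length rho along the
       normalized gradient.
    2. Compute the gradient at that perturbed point.
    3. Apply that gradient to the original weights.
    """
    g = grad(w, curvatures)
    norm = math.sqrt(sum(x * x for x in g)) + 1e-12  # guard against zero gradient
    w_adv = [x + rho * gx / norm for x, gx in zip(w, g)]
    g_adv = grad(w_adv, curvatures)
    return [x - lr * gx for x, gx in zip(w, g_adv)]


curvatures = [10.0, 1.0]
w = [1.0, 1.0]
initial_loss = loss(w, curvatures)
for _ in range(200):
    w = sam_step(w, curvatures)
final_loss = loss(w, curvatures)
```

On a single quadratic bowl SAM simply converges to a small neighborhood of the minimum. Its preference for flat regions only matters when the landscape offers several minima of differing sharpness, which this toy loss does not model.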
The neural tangent kernel (NTK) framework gives a formal handle: forgetting is governed by the principal angles between feature subspaces induced by different tasks. High similarity (large overlap eigenvalues) predicts more interference and CF. Orthogonally projecting new-task gradients (OGD), or storing only dominant principal directions (PCA-OGD), provably limits drift in the old-task function (Doan et al., 2020).
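The orthogonal-projection idea can be sketched in a few lines. The example below assumes a small set of stored old-task directions (in OGD these are gradients of the model's predictions on old-task samples) and removes their span from a new-task gradient before it is applied. The helper names and the three-dimensional toy vectors are hypothetical; a PCA-OGD-style variant would store only the leading principal directions instead of every gradient.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def orthonormalize(vectors, tol=1e-10):
    """Gram-Schmidt: orthonormal basis for the span of the stored directions."""
    basis = []
    for v in vectors:
        r = list(v)
        for b in basis:
            c = dot(r, b)
            r = [x - c * y for x, y in zip(r, b)]
        norm = dot(r, r) ** 0.5
        if norm > tol:  # skip directions already covered by the basis
            basis.append([x / norm for x in r])
    return basis


def project_out(gradient, basis):
    """Remove from the gradient its components along the old-task directions."""
    g = list(gradient)
    for b in basis:
        c = dot(g, b)
        g = [x - c * y for x, y in zip(g, b)]
    return g


# Two stored old-task directions spanning the first two coordinates.
old_task_grads = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
basis = orthonormalize(old_task_grads)

new_grad = [0.5, -2.0, 3.0]
projected = project_out(new_grad, basis)
```

After projection only the third coordinate of the update survives, so to first order the step does not change the model's behavior along the stored old-task directions. The cost is that memory grows with the number of stored directions, which is the motivation for keeping only dominant ones.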
3. Empirical Manifestations Across Model Classes and Modalities
CF has been observed across LLMs, vision models, audio, time series, and reinforcement learning agents (Luo et al., 2023, Hallak et al., 24 Oct 2025, Park et al., 2024, Early, 2019). It arises in both discriminative (image classification, language understanding) and generative (GAN, VAE) settings. Large-scale meta-analyses show:
- Model size effect: Larger models experience more severe CF, not less (Luo et al., 2023). E.g., on domain knowledge benchmarks, BLOOMZ-1.1B's FG ≈ 9.5% vs. BLOOMZ-7.1B's FG ≈ 18.4%.
- Architecture effect: Decoder-only LLMs can be more robust to CF than encoder-decoder transformers at equivalent scales (Luo et al., 2023).
- Generative models: In GANs, CF in the discriminator destroys wide local maxima at real data, causing mode collapse and non-convergence unless explicitly penalized (Thanh-Tung et al., 2018).
- Time series and federated learning: Non-i.i.d. and temporally-drifting domains (e.g., federated forecasting) further exacerbate CF, even for regularized RNNs and LSTMs (Hallak et al., 24 Oct 2025).
- Recommender systems: Collaborative filtering with standard MLP autoencoders exhibits severe CF, while edge-level parameterization (KANs) can localize updates and substantially mitigate the effect (Park et al., 2024).
4. Algorithmic Approaches to Mitigating Catastrophic Forgetting
4.1. Replay-Based (Rehearsal) Methods
These methods retain a buffer of data (raw or synthetic) from old tasks and interleave them during new-task training. Exemplar-based rehearsal (e.g., iCaRL) and generative replay (e.g., DGR, PRER) are prominent representatives:
- Mini-rehearsal: Maintains a memory coreset, optimizing joint or projected updates (Aleixo et al., 2023, Pomponi et al., 2022).
- Pseudo-rehearsal: Uses trained GANs/VAEs or invertible flows to sample from the embedding distribution of old tasks (Pomponi et al., 2022). PRER, for example, achieves BWT ≈ -0.1% on MNIST/SVHN (near-perfect retention with constant-size memory).
- Replay is effective but presents memory and privacy tradeoffs in federated or on-device contexts (Hallak et al., 24 Oct 2025).
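A minimal sketch of exemplar rehearsal with a fixed-size memory is given below. It uses reservoir sampling as the buffer policy, which is an assumption made here for simplicity: methods such as iCaRL select exemplars differently, and generative replay would draw samples from a model instead of stored data. The class and function names are hypothetical.

```python
import random


class ReservoirReplayBuffer:
    """Fixed-size memory holding a uniform random sample of all examples seen."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.memory = []
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, example):
        self.seen += 1
        if len(self.memory) < self.capacity:
            self.memory.append(example)
        else:
            # Keep the new example with probability capacity / seen.
            slot = self.rng.randrange(self.seen)
            if slot < self.capacity:
                self.memory[slot] = example

    def sample(self, k):
        """Draw up to k stored examples without replacement."""
        return self.rng.sample(self.memory, min(k, len(self.memory)))


def mixed_batch(new_batch, buffer, replay_size):
    """Interleave the current task's batch with examples replayed from memory."""
    return list(new_batch) + buffer.sample(replay_size)


buffer = ReservoirReplayBuffer(capacity=5)
for x in range(100):  # stream of examples from an earlier task
    buffer.add(("task_A", x))

# While training on task B, each batch is topped up with replayed task-A data.
batch = mixed_batch([("task_B", i) for i in range(4)], buffer, replay_size=2)
```

The buffer size stays constant no matter how long the stream is, which makes the memory tradeoff explicit: a smaller `capacity` lowers storage and privacy exposure but gives a coarser picture of the old task.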
4.2. Regularization-Based (Parameter-Centric) Methods
Regularizers penalize updates to parameters deemed critical for previous tasks.
- Elastic Weight Consolidation (EWC): Adds Fisher-weighted quadratic penalties per parameter to anchor old-task optima (Loke et al., 14 Jul 2025, Fliss et al., 2024). EWC reduces forgetting to roughly 15–20% on PermutedMNIST, but is much weaker in “realistic” class-incremental regimes (Pfülb et al., 2019).
- Online / Adaptive variants: Accumulate importances with decay (Online-EWC), track time-varying relevance (SI, MAS) (Hallak et al., 24 Oct 2025, Aleixo et al., 2023).
- Sharpness-aware minimization (SAM): Flattens the local loss landscape, yielding reductions in SC, AG, and MAG metrics, and up to 7–10 pp retention boosts on LLMs (Li et al., 2024).
- Forgetting-aware pruning: Post-hoc pruning based on the relative perturbation to pre-trained weights; FAPM limits CF to 0.33% on major LLMs (Huang et al., 10 Sep 2025).
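To make the quadratic-penalty idea concrete, the sketch below computes an EWC-style penalty and its gradient for a plain list of parameters. It assumes a diagonal empirical Fisher estimate built from per-example log-likelihood gradients. The function names, the penalty strength `lam`, and the toy numbers are illustrative assumptions and are not taken from any particular implementation.

```python
def diagonal_fisher(per_example_grads):
    """Diagonal empirical Fisher: mean of squared per-example gradients."""
    n = len(per_example_grads)
    dim = len(per_example_grads[0])
    return [sum(g[i] ** 2 for g in per_example_grads) / n for i in range(dim)]


def ewc_penalty(params, anchor, fisher, lam):
    """(lam / 2) * sum_i fisher_i * (params_i - anchor_i)^2."""
    return 0.5 * lam * sum(
        f * (p - a) ** 2 for p, a, f in zip(params, anchor, fisher)
    )


def ewc_penalty_grad(params, anchor, fisher, lam):
    """Gradient of the penalty, to be added to the new-task gradient."""
    return [lam * f * (p - a) for p, a, f in zip(params, anchor, fisher)]


# Parameters found after the old task, and toy gradients measured there.
anchor = [1.0, -2.0]
per_example_grads = [[2.0, 0.1], [-2.0, -0.1]]
fisher = diagonal_fisher(per_example_grads)  # first parameter matters far more

# Candidate parameters during new-task training.
params = [1.5, 0.0]
penalty = ewc_penalty(params, anchor, fisher, lam=10.0)
penalty_grad = ewc_penalty_grad(params, anchor, fisher, lam=10.0)
```

In this toy case the second parameter has moved four times farther from its anchor than the first, yet it contributes far less to the penalty because its Fisher value is small. This per-parameter weighting is what distinguishes EWC from uniform L2 anchoring.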
4.3. Architectural, Masking, and Sequence Optimization
- Parameter isolation/masking: Binary or continuous task-specific masks (e.g., HAT, Piggyback) prevent overwriting old-task paths (Kumar et al., 2024, Aleixo et al., 2023).
- Expansion: Progressive Neural Networks (PNN), Dynamically Expandable Networks (DEN), and OWM/EOWM grow or selectively retrain sub-spaces while maintaining strong task isolation (Li et al., 2021).
- Optimizing task order: Intelligent sequencing of tasks via zero-shot NAS proxies (e.g., NWOT, AID-augmented diversity) actively reduces CF spikes, especially in non-i.i.d. settings (Moussa et al., 18 Dec 2025).
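The parameter-isolation idea from the list above can be illustrated with a deliberately simplified hard-masking sketch: weights claimed by an earlier task are frozen, and later tasks may only update the remaining free weights. This is a generic illustration under that assumption, not an implementation of HAT or Piggyback, which learn their masks. All names and values are hypothetical.

```python
def masked_update(weights, grads, frozen, lr=0.1):
    """Gradient step that leaves weights owned by earlier tasks untouched."""
    return [
        w if is_frozen else w - lr * g
        for w, g, is_frozen in zip(weights, grads, frozen)
    ]


def claim_weights(frozen, task_mask):
    """After finishing a task, mark the weights it used as frozen."""
    return [f or m for f, m in zip(frozen, task_mask)]


weights = [0.5, -1.0, 2.0, 0.0]
frozen = [False, False, False, False]

# Task 1 uses, and then claims, the first two weights.
task1_mask = [True, True, False, False]
frozen = claim_weights(frozen, task1_mask)

# Task 2 training: gradients arrive for every weight, but only free ones move.
task2_grads = [1.0, 1.0, 1.0, 1.0]
updated = masked_update(weights, task2_grads, frozen)
```

Because the frozen weights never change, behavior on the first task is preserved exactly. The price is that free capacity shrinks with every task, which is what expansion-based methods address by adding new parameters.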
4.4. Representation-Level and Embedding-Space Regularization
Methods that directly regularize or stabilize the embedding space (e.g., centroids matching (Pomponi et al., 2022), function vector regularization (Jiang et al., 16 Feb 2025)) show robust gains in class- and task-incremental protocols. For LLMs, interventions targeting the “function vector” subspace preserve zero/few-shot performance despite extensive continual tuning (Jiang et al., 16 Feb 2025).
| Method Class | Example Algorithms | Core Principle |
|---|---|---|
| Replay-based | iCaRL, PRER, DGR | Rehearse/replay old data (raw or generated) |
| Regularization-based | EWC, SAM, FAPM | Penalize change to key parameters / flatten loss |
| Architectural | HAT, PNN, EOWM | Isolate, expand, or mask subspaces per task |
| Task ordering | NWOT, sequencing | Optimize task order to minimize interference |
| Representation | CentroidsMatching, FV | Preserve structure in embedding/activation spaces |
5. Empirical Best Practices, Limitations, and Tradeoffs
- Replay (real or synthetic) is the most reliable defense—but can incur privacy, compute, or storage costs; experience replay buffers may be infeasible for on-device or federated scenarios (Hallak et al., 24 Oct 2025, Aleixo et al., 2023).
- Quadratic-penalty regularization (e.g., EWC) is simple, scalable, and broadly effective for tasks with limited interference, but quickly degrades as old and new tasks become more semantically aligned or as class-incrementality increases (Pfülb et al., 2019, Kumar et al., 2024).
- Optimization-level methods such as SAM and FAPM, and embedding-level methods such as centroids matching and function vector stabilization, offer strong gains at low overhead in both giant transformers and standard DNNs (Li et al., 2024, Jiang et al., 16 Feb 2025, Huang et al., 10 Sep 2025, Pomponi et al., 2022).
- True task-incremental, class-incremental, and domain-incremental setups can yield divergent CF dynamics. Model selection and early stopping must not rely on unavailable old data to avoid overstating gains (Pfülb et al., 2019).
- No single approach fully solves CF: hybrid strategies that combine replay, adaptive regularizers, architectural isolation, and task-sequence optimization yield the best results under realistic constraints (Kumar et al., 2024, Aleixo et al., 2023).
6. Open Challenges and Future Directions
Despite extensive progress, the field faces unresolved problems:
- Scaling to many tasks: Most methods are validated on 5–20 tasks; scaling beyond this (especially for resource-constrained or privacy-critical settings) is open (Aleixo et al., 2023).
- Task-agnostic inference: Most architectural and replay methods assume known task identity at test time, limiting their scope in unsupervised or streaming regimes (Aleixo et al., 2023).
- Evaluation protocols: The lack of standardized, application-realistic protocols undermines fair comparison; benchmarks that enforce strict no-replay, no-prescience, and constant update cost are necessary (Pfülb et al., 2019, Celiberto et al., 2024).
- Theory: There remains no widely accepted theoretical guarantee for bounding retained performance under arbitrary drift (Doan et al., 2020, Pfülb et al., 2019).
- Interpretability: Mechanistic analyses of CF (e.g., function vector tracking in LLMs, wide local maxima for GAN discriminators) are emerging but not yet unified (Jiang et al., 16 Feb 2025, Thanh-Tung et al., 2018).
- Federated, time series, recommendation: Non-i.i.d., distributed, and temporally evolving domains present particularly insidious challenges, often unaddressed by canonical methods (Hallak et al., 24 Oct 2025, Park et al., 2024).
7. Summary Table: Empirical Performance of Key Mitigation Strategies
| Dataset/Model | Naive | EWC | Replay-Based | Architectural | Embedding/FV | Best Reported BWT/AA |
|---|---|---|---|---|---|---|
| PermutedMNIST (10 tasks) | <5% | 60–70% | PRER: 99% | PNN: 99% | Centroids: 92–95% | PRER, PNN ≈0.99 |
| SplitMNIST (class incr.) | ≈20% | <22% | iCaRL: 94% | PNN: 98–99% | CM: 75% | iCaRL: –5.9% BWT |
| LLM (Alpaca, MMLU) | –6.7pp | n.a. | Rehearsal: +3.8pp | Wise-FT: +5.8pp | SAM: +7.0pp | SAM: +7.0pp |
| LLM (MetaMathQA, 13B) | –9.3pp | n.a. | Wise-FT: +10.8pp | FAPM: 99.67% retained | FAPM: 99.67% retained | FAPM: 99.67% retained |
| GAN Discriminator | Collapse | – | Replay+penalty | – | – | GP/Penalty: recovers |
AA: average accuracy; BWT: backward transfer; pp: percentage points relative to baseline.
References
- "Revisiting Catastrophic Forgetting in LLM Tuning" (Li et al., 2024)
- "An Empirical Study of Catastrophic Forgetting in LLMs During Continual Fine-tuning" (Luo et al., 2023)
- "Overcoming catastrophic forgetting in neural networks" (Loke et al., 14 Jul 2025)
- "On Catastrophic Forgetting and Mode Collapse in Generative Adversarial Networks" (Thanh-Tung et al., 2018)
- "Catastrophic forgetting: still a problem for DNNs" (Pfülb et al., 2019)
- "Catastrophic Forgetting in Deep Learning: A Comprehensive Taxonomy" (Aleixo et al., 2023)
- "A Methodology-Oriented Study of Catastrophic Forgetting in Incremental Deep Neural Networks" (Kumar et al., 2024)
- "Centroids Matching: an efficient Continual Learning approach operating in the embedding space" (Pomponi et al., 2022)
- "Unlocking the Power of Function Vectors for Characterizing and Mitigating Catastrophic Forgetting in Continual Instruction Tuning" (Jiang et al., 16 Feb 2025)
- "Mitigating Catastrophic Forgetting in LLMs with Forgetting-aware Pruning" (Huang et al., 10 Sep 2025)
- "Sequencing to Mitigate Catastrophic Forgetting in Continual Learning" (Moussa et al., 18 Dec 2025)
- "CF-KAN: Kolmogorov-Arnold Network-based Collaborative Filtering to Mitigate Catastrophic Forgetting in Recommender Systems" (Park et al., 2024)
- "A Conformal Predictive Measure for Assessing Catastrophic Forgetting" (Pitsiorlas et al., 15 May 2025)
- "Defeating Catastrophic Forgetting via Enhanced Orthogonal Weights Modification" (Li et al., 2021)
- "Continual Learning with Invertible Generative Models" (Pomponi et al., 2022)