Catastrophic Forgetting in Continual Learning

Updated 23 June 2026
  • Catastrophic Forgetting is the loss of previously learned knowledge when models update on new tasks without revisiting old data.
  • It is quantified by metrics such as F_k and average forgetting, with larger models and disjoint tasks experiencing more severe drops.
  • Mitigation strategies include replay-based, regularization-based, and architectural methods to balance stability and plasticity in continual learning.

Catastrophic Forgetting (CF) is a fundamental challenge in continual and incremental learning regimes, referring to the dramatic loss of previously acquired knowledge when a model is updated on new data or tasks without explicit access to the earlier data. CF persists as a central obstacle for neural networks, LLMs, and related systems across a wide range of supervised, generative, and reinforcement learning applications.

1. Formal Definitions, Metrics, and Phenomenology

Catastrophic forgetting manifests as a precipitous decline in accuracy or predictive performance on old tasks after sequentially training on new, disjoint data. In the classical continual learning setup, let $T$ be a sequence of $N$ tasks, $\mathcal{M}_0$ the initial model, and $\mathcal{M}_n$ the model after training on the $n$-th task. If $a_{i,j}$ denotes accuracy on task $i$ after learning up to task $j \geq i$, forgetting on task $k$ is formally measured as

$$F_k = \max_{k \le l < N} a_{k,l} - a_{k,N}$$

with the average forgetting over the $N$ tasks given by the mean of the per-task values $F_k$ (Aleixo et al., 2023). Similar metrics such as backward transfer (BWT) and average plasticity (AP) are widely adopted. Application-specific adaptations (for example, percentage-drop metrics on held-out language understanding benchmarks for LLMs) are routine (Luo et al., 2023).
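
The metrics above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: it assumes accuracies are stored in a nested list `acc` (a hypothetical name) where `acc[i][j]` is the accuracy on task i after training through task j, that the maximum is taken over training stages from the task itself up to the one before the final stage, and that both averages run over all tasks except the last. Normalisation conventions differ between papers, and the numbers are invented.

```python
def forgetting_per_task(acc):
    """acc[i][j]: accuracy on task i after training through task j (0-indexed, j >= i).

    Entries with j < i are never read. Returns F_k for every task except the
    last, which has no later training stage under which it could be forgotten.
    """
    n = len(acc)
    final = n - 1
    scores = []
    for k in range(n - 1):
        best_before_final = max(acc[k][l] for l in range(k, final))
        scores.append(best_before_final - acc[k][final])
    return scores


def average_forgetting(acc):
    scores = forgetting_per_task(acc)
    return sum(scores) / len(scores) if scores else 0.0


def backward_transfer(acc):
    """Mean of acc[i][final] - acc[i][i] over all tasks but the last."""
    n = len(acc)
    diffs = [acc[i][n - 1] - acc[i][i] for i in range(n - 1)]
    return sum(diffs) / len(diffs) if diffs else 0.0


# Hypothetical accuracies for three tasks (None marks undefined entries, j < i).
acc = [
    [0.95, 0.80, 0.60],
    [None, 0.90, 0.70],
    [None, None, 0.92],
]
print(forgetting_per_task(acc))   # per-task forgetting for tasks 1 and 2
print(average_forgetting(acc))
print(backward_transfer(acc))     # negative values indicate forgetting
```

In this toy matrix the first task loses far more accuracy than the second, and the backward transfer is negative, which is the signature of forgetting rather than positive transfer.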

Precise assessment requires careful protocol design. A prominent concern is to avoid “prescient” evaluation, i.e., hyperparameter tuning or early stopping that improperly uses old-task data unavailable in deployment (Pfülb et al., 2019). Under these realistic constraints, most methods are empirically shown to suffer significant CF, especially in challenging class-incremental settings (Pfülb et al., 2019).

2. Theoretical Foundations and Loss Landscape Connections

The root cause of CF is the shared parameterization of neural models: SGD or other optimizers minimize the empirical loss on the latest task, drifting away from optima found for earlier tasks. Bayesian analysis frames this as a sequential posterior update: $p(\theta \mid \mathcal{D}_{1:n}) \propto p(\mathcal{D}_n \mid \theta)\, p(\theta \mid \mathcal{D}_{1:n-1})$, where $\theta$ denotes the model parameters and $\mathcal{D}_n$ the data of the $n$-th task. Without regularization, the new data rapidly “overwrites” parameter regions critical for old-task function (Loke et al., 14 Jul 2025, Doan et al., 2020).

Empirical and theoretical studies show the geometry of the loss landscape is intimately linked to CF (Li et al., 2024). Flatter solutions, quantified by spectral flatness (surface curvature, average gradient, mean absolute gradient), produce significantly greater retention across tasks, while sharp minima amplify forgetting. This landscape perspective holds for both deep neural networks and LLMs; successive fine-tuning on divergent instruction sets (e.g., Alpaca → Open-Platypus) not only sharpens the loss landscape but also causes stepwise drops of 6–17 percentage points on held-out general tasks such as MMLU (Li et al., 2024).

The neural tangent kernel (NTK) framework gives a formal handle: forgetting is governed by the principal angles between feature subspaces induced by different tasks. High similarity (large overlap eigenvalues) predicts more interference and CF. Orthogonally projecting new-task gradients (OGD), or storing only dominant principal directions (PCA-OGD), provably limits drift in the old-task function (Doan et al., 2020).
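
The projection idea behind OGD can be illustrated with plain vectors. The sketch below is an assumption-laden toy, not the algorithm as published: it treats a handful of stored old-task gradient directions as given (`old_task_grads` is a hypothetical name with invented values), orthonormalises them with Gram-Schmidt, and removes those components from a new-task gradient before it would be applied.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def orthonormalize(vectors, eps=1e-12):
    """Gram-Schmidt: build an orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > eps:  # skip directions already covered by the basis
            basis.append([wi / norm for wi in w])
    return basis


def project_out(grad, basis):
    """Remove from `grad` its components along the stored old-task directions."""
    g = list(grad)
    for b in basis:
        c = dot(g, b)
        g = [gi - c * bi for gi, bi in zip(g, b)]
    return g


# Hypothetical stored gradient directions from an earlier task (3 parameters).
old_task_grads = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
basis = orthonormalize(old_task_grads)

new_grad = [0.5, -2.0, 3.0]
safe_grad = project_out(new_grad, basis)
print(safe_grad)  # only the component orthogonal to the old-task span remains
```

Storing every old-task gradient grows memory with the number of examples, which is why keeping only a few dominant directions (as PCA-OGD does, per the text above) is attractive; the projection step itself is unchanged.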

3. Empirical Manifestations Across Model Classes and Modalities

CF is universal in LLMs, vision models, audio, time series, and reinforcement learning agents (Luo et al., 2023; Hallak et al., 24 Oct 2025; Park et al., 2024; Early, 2019). It arises in both discriminative (image classification, language understanding) and generative (GAN, VAE) settings. Large-scale meta-analyses show:

  • Model size effect: Larger models experience more severe CF, not less (Luo et al., 2023). E.g., on domain knowledge benchmarks, BLOOMZ-1.1B shows an FG of 9.5%, versus 18.4% for BLOOMZ-7.1B.
  • Architecture effect: Decoder-only LLMs can be more robust to CF than encoder-decoder transformers at equivalent scales (Luo et al., 2023).
  • Generative models: In GANs, CF in the discriminator destroys wide local maxima at real data, causing mode collapse and non-convergence unless explicitly penalized (Thanh-Tung et al., 2018).
  • Time series and federated learning: Non-i.i.d. and temporally-drifting domains (e.g., federated forecasting) further exacerbate CF, even for regularized RNNs and LSTMs (Hallak et al., 24 Oct 2025).
  • Recommender systems: Collaborative filtering with standard MLP autoencoders exhibits severe CF, while edge-level parameterization (KANs) can localize updates and substantially mitigate the effect (Park et al., 2024).

4. Algorithmic Approaches to Mitigating Catastrophic Forgetting

4.1. Replay-Based (Rehearsal) Methods

These methods retain a buffer of data (raw or synthetic) from old tasks and interleave them during new-task training. Exemplar-based rehearsal (e.g., iCaRL) and generative replay (e.g., DGR, PRER) are prominent representatives:

  • Mini-rehearsal: Maintains a memory coreset, optimizing joint or projected updates (Aleixo et al., 2023; Pomponi et al., 2022).
  • Pseudo-rehearsal: Uses trained GANs/VAEs or invertible flows to sample from the embedding distribution of old tasks (Pomponi et al., 2022). PRER, for example, achieves a BWT of around -0.1% on MNIST/SVHN (near-perfect retention with constant-size memory).
  • Replay is effective but presents memory and privacy tradeoffs in federated or on-device contexts (Hallak et al., 24 Oct 2025).
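
The interleaving step common to rehearsal methods can be sketched as follows. This is a generic illustration rather than iCaRL, DGR, or PRER: the buffer is filled with reservoir sampling, which is a technique chosen here for simplicity and not one named in the sources, and the class and function names (`ReservoirReplayBuffer`, `mixed_batch`) are hypothetical.

```python
import random


class ReservoirReplayBuffer:
    """Fixed-size memory holding a uniform sample of everything seen so far."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, example):
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(example)
        else:
            # Keep the new example with probability capacity / seen.
            slot = self.rng.randrange(self.seen)
            if slot < self.capacity:
                self.items[slot] = example

    def sample(self, k):
        return self.rng.sample(self.items, min(k, len(self.items)))


def mixed_batch(new_examples, buffer, replay_size):
    """Interleave stored old-task examples with the current new-task batch."""
    return list(new_examples) + buffer.sample(replay_size)


buffer = ReservoirReplayBuffer(capacity=4)
for i in range(10):
    buffer.add(("task_A", i))  # stream of old-task examples

batch = mixed_batch([("task_B", j) for j in range(3)], buffer, replay_size=2)
print(batch)  # three new-task examples followed by two replayed ones
```

The memory cost is fixed by `capacity` regardless of how many examples stream past, which is the property that makes such buffers usable at all in constrained settings; the privacy concern remains, since raw examples are retained.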

4.2. Regularization-Based (Parameter-Centric) Methods

Regularizers penalize updates to parameters deemed critical for previous tasks.
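
A quadratic penalty of the kind used by EWC can be sketched as below. The sketch assumes a diagonal importance estimate (`fisher`) has already been computed elsewhere, uses plain lists for parameters, and takes invented values throughout; it shows only the penalty and its effect on one gradient step, not a full training loop.

```python
def ewc_penalty(theta, theta_old, fisher, lam):
    """(lam / 2) * sum_i fisher[i] * (theta[i] - theta_old[i]) ** 2."""
    return 0.5 * lam * sum(
        f * (t - t0) ** 2 for f, t, t0 in zip(fisher, theta, theta_old)
    )


def ewc_penalty_grad(theta, theta_old, fisher, lam):
    """Gradient of the penalty with respect to theta."""
    return [lam * f * (t - t0) for f, t, t0 in zip(fisher, theta, theta_old)]


def regularized_step(theta, new_task_grad, theta_old, fisher, lam, lr):
    """One gradient step on the new-task loss plus the quadratic penalty."""
    pen_grad = ewc_penalty_grad(theta, theta_old, fisher, lam)
    return [t - lr * (g + p) for t, g, p in zip(theta, new_task_grad, pen_grad)]


# Hypothetical values: parameter 0 mattered for the old task, parameter 1 did not.
theta_old = [1.0, 1.0]       # parameters after the old task
fisher = [10.0, 0.0]         # assumed per-parameter importance
theta = [1.5, 1.5]           # current parameters during new-task training
new_task_grad = [-1.0, -1.0] # new task pushes both parameters upward

penalty = ewc_penalty(theta, theta_old, fisher, lam=1.0)
stepped = regularized_step(theta, new_task_grad, theta_old, fisher, lam=1.0, lr=0.1)
print(penalty)
print(stepped)
```

With these numbers the important parameter is pulled back toward its old value even though the new-task gradient pushes it away, while the unimportant parameter moves freely.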

4.3. Architectural, Masking, and Sequence Optimization

  • Parameter isolation/masking: Binary or continuous task-specific masks (e.g., HAT, Piggyback) prevent overwriting old-task paths (Kumar et al., 2024; Aleixo et al., 2023).
  • Expansion: Progressive Neural Networks (PNN), Dynamically Expandable Networks (DEN), and OWM/EOWM grow or selectively retrain sub-spaces while maintaining strong task isolation (Li et al., 2021).
  • Optimizing task order: Intelligent sequencing of tasks via zero-shot NAS proxies (e.g., NWOT, AID-augmented diversity) actively reduces CF spikes, especially in non-i.i.d. settings (Moussa et al., 18 Dec 2025).
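
The masking idea in the first bullet can be illustrated with a deliberately simplified hard-mask scheme. It is not HAT or Piggyback, both of which learn their masks; here the per-task masks are assumed to be given, and all names and values are hypothetical.

```python
def masked_update(theta, grad, task_mask, frozen, lr):
    """Step only parameters the current task uses and no earlier task owns."""
    return [
        t - lr * g if (uses and not locked) else t
        for t, g, uses, locked in zip(theta, grad, task_mask, frozen)
    ]


def claim_parameters(frozen, task_mask):
    """After a task finishes, freeze every parameter its mask selected."""
    return [locked or uses for locked, uses in zip(frozen, task_mask)]


theta = [0.2, -0.4, 0.7, 0.1]
frozen = [False, False, False, False]

# Task 1 trains parameters 0 and 1, which are then frozen.
task1_mask = [True, True, False, False]
theta = masked_update(theta, [1.0, 1.0, 1.0, 1.0], task1_mask, frozen, lr=0.1)
frozen = claim_parameters(frozen, task1_mask)
after_task1 = list(theta)

# Task 2 wants parameters 1, 2 and 3, but parameter 1 is already owned by task 1.
task2_mask = [False, True, True, True]
theta = masked_update(theta, [1.0, 1.0, 1.0, 1.0], task2_mask, frozen, lr=0.1)
print(after_task1, theta)
```

Because the parameters owned by the first task never change during the second, the first task's function on those paths is preserved exactly; the price is that free capacity shrinks with every task, which is what expansion-based methods address.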

4.4. Representation-Level and Embedding-Space Regularization

Methods that directly regularize or stabilize the embedding space (e.g., centroids matching (Pomponi et al., 2022), function vector regularization (Jiang et al., 16 Feb 2025)) show robust gains in class- and task-incremental protocols. For LLMs, interventions targeting the “function vector” subspace preserve zero/few-shot performance despite extensive continual tuning (Jiang et al., 16 Feb 2025).

| Method Class | Example Algorithms | Core Principle |
| --- | --- | --- |
| Replay-based | iCaRL, PRER, DGR | Rehearse/replay old data (raw or generated) |
| Regularization-based | EWC, SAM, FAPM | Penalize change to key parameters / flatten loss |
| Architectural | HAT, PNN, EOWM | Isolate, expand, or mask subspaces per task |
| Task ordering | NWOT, sequencing | Optimize task order to minimize interference |
| Representation | CentroidsMatching, FV | Preserve structure in embedding/activation spaces |

5. Empirical Best Practices, Limitations, and Tradeoffs

  • Replay (real or synthetic) is the most reliable defense—but can incur privacy, compute, or storage costs; experience replay buffers may be infeasible for on-device or federated scenarios (Hallak et al., 24 Oct 2025, Aleixo et al., 2023).
  • Quadratic-penalty regularization (e.g., EWC) is simple, scalable, and broadly effective for tasks with limited interference, but quickly degrades as old and new tasks become more semantically aligned or as class-incrementality increases (Pfülb et al., 2019, Kumar et al., 2024).
  • Optimization-level methods such as SAM and FAPM, and embedding-level methods such as centroids matching and function vector stabilization, offer strong gains at low overhead in both giant transformers and standard DNNs (Li et al., 2024, Jiang et al., 16 Feb 2025, Huang et al., 10 Sep 2025, Pomponi et al., 2022).
  • True task-incremental, class-incremental, and domain-incremental setups can yield divergent CF dynamics. Model selection and early stopping must not rely on unavailable old data to avoid overstating gains (Pfülb et al., 2019).
  • No single approach fully solves CF: hybrid strategies that combine replay, adaptive regularizers, architectural isolation, and task-sequence optimization yield the best results under realistic constraints (Kumar et al., 2024; Aleixo et al., 2023).

6. Open Challenges and Future Directions

Despite extensive progress, the field faces unresolved problems:

  • Scaling to many tasks: Most methods are validated on 5–20 tasks; scaling beyond this (especially for resource-constrained or privacy-critical settings) is open (Aleixo et al., 2023).
  • Task-agnostic inference: Most architectural and replay methods assume known task identity at test time, limiting their scope in unsupervised or streaming regimes (Aleixo et al., 2023).
  • Evaluation protocols: The lack of standardized, application-realistic protocols undermines fair comparison; benchmarks that enforce strict no-replay, no-prescience, and constant update cost are necessary (Pfülb et al., 2019; Celiberto et al., 2024).
  • Theory: There remains no widely accepted theoretical guarantee for bounding retained performance under arbitrary drift (Doan et al., 2020; Pfülb et al., 2019).
  • Interpretability: Mechanistic analyses of CF (e.g., function vector tracking in LLMs, wide local maxima for GAN discriminators) are emerging but not yet unified (Jiang et al., 16 Feb 2025; Thanh-Tung et al., 2018).
  • Federated, time series, recommendation: Non-i.i.d., distributed, and temporally evolving domains present particularly insidious challenges, often unaddressed by canonical methods (Hallak et al., 24 Oct 2025; Park et al., 2024).

7. Summary Table: Empirical Performance of Key Mitigation Strategies

| Dataset/Model | Naive | EWC | Replay-Based | Architectural | Embedding/FV | Best Reported BWT/AA |
| --- | --- | --- | --- | --- | --- | --- |
| PermutedMNIST (10 tasks) | <5% | 60–70% | PRER: 99% | PNN: 99% | Centroids: 92–95% | PRER, PNN ≈0.99 |
| SplitMNIST (class incr.) | ≈20% | <22% | iCaRL: 94% | PNN: 98–99% | CM: 75% | iCaRL: –5.9% BWT |
| LLM (ALPACA, MMLU) | –6.7pp | n.a. | Rehearsal: +3.8pp | Wise-FT: +5.8pp | SAM: +7.0pp | SAM: +7.0pp |
| LLM (MetaMathQA, 13B) | –9.3pp | n.a. | Wise-FT: +10.8pp | FAPM: +99.67% | FAPM: +99.67% | FAPM: +99.67% |
| GAN Discriminator | Collapse | | Replay+penalty | | | GP/Penalty: recovers |

AA: average accuracy; BWT: backward transfer; pp: percentage points relative to baseline.
