
Stability–Plasticity Dilemma in Continual Learning

Updated 15 December 2025
  • The stability–plasticity dilemma is the trade-off between preserving learned knowledge (stability) and acquiring new skills (plasticity) in continual learning.
  • Architectural innovations like Dual-Arch designs and adapter-based modularity separate stability from plasticity, improving performance and parameter efficiency.
  • Algorithmic methods such as null-space projection and multi-objective training control gradient updates to mitigate catastrophic forgetting while still enabling rapid learning.

The stability–plasticity dilemma is a central challenge in continual and lifelong learning, particularly in artificial neural networks. It denotes the inherent trade-off between preserving previously learned knowledge (stability) and acquiring new knowledge (plasticity) as tasks arrive sequentially. Effective continual learning methodologies seek to optimize both objectives, yet the requirements for retaining old information and flexibly learning novel tasks often conflict, leading to phenomena such as catastrophic forgetting and reduced learning efficiency. Recent research, including architectural, algorithmic, and neurobiological perspectives, has systematized metrics, introduced theoretical bounds, and proposed architectures to navigate and even resolve this dilemma.

1. Formalization and Quantification of Stability–Plasticity

The stability–plasticity dilemma is typically formalized via two objectives:

  • Stability: The ability of a model to retain performance on previously learned tasks. Quantitatively, this is measured using metrics such as average forgetting (AF), backward transfer (BWT), or explicit retention of task accuracies after sequential training steps:
    • $AF_k = \frac{1}{k-1} \sum_{b=1}^{k-1} (a_b^* - a_b)$, where $a_b^*$ is the peak accuracy on task $b$ and $a_b$ its post-update accuracy (Lu et al., 4 Jun 2025).
    • $BWT = \frac{1}{K-1} \sum_{i=1}^{K-1} (A_{K,i} - A_{i,i})$, where $A_{k,i}$ denotes accuracy on task $i$ after training through task $k$ (Lin et al., 2021).
  • Plasticity: The model's ability to rapidly learn new tasks. Metrics include average accuracy on new tasks (AAN), learning accuracy (LA), or forward transfer (FWT):
    • $AAN = \frac{1}{K} \sum_{k=1}^{K} A_k^{new}$ (Lu et al., 4 Jun 2025).
    • $LA = \frac{1}{T} \sum_{j=1}^{T} a_{j,j}$ (Jung et al., 2023).

Many frameworks cast task learning as minimizing a combination of stability and plasticity objectives: $\min_\theta \; (f_s(\theta), f_p(\theta)) \;\Rightarrow\; \min_\theta \left(\lambda_1 f_s(\theta) + \lambda_2 f_p(\theta)\right)$, where $f_s$ scores retention, $f_p$ scores adaptation, and $\lambda_1 + \lambda_2 = 1$ tunes the trade-off (Lai et al., 30 Mar 2025).
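As a concrete illustration, the sketch below computes these retention and adaptation metrics from an accuracy matrix. The matrix layout and the three-task example values are assumptions for demonstration, not results from any cited paper; LA here coincides with AAN under the assumption that "new-task accuracy" is measured right after each task is learned.

```python
import numpy as np

def continual_metrics(A):
    """Compute stability/plasticity metrics from an accuracy matrix.

    A[k, i] is accuracy on task i after sequentially training on task k
    (tasks are 0-indexed; only the lower triangle i <= k is meaningful).
    """
    K = A.shape[0]
    diag = np.diag(A)                               # a_{i,i}: accuracy right after learning task i
    # Stability: backward transfer (negative values indicate forgetting)
    bwt = np.mean(A[K - 1, :K - 1] - diag[:K - 1])
    # Stability: average forgetting, using the peak accuracy a_b^* seen so far
    peak = np.max(A[:, :K - 1], axis=0)
    af = np.mean(peak - A[K - 1, :K - 1])
    # Plasticity: learning accuracy (equals AAN under the assumption above)
    la = np.mean(diag)
    return {"BWT": bwt, "AF": af, "LA": la}

# Hypothetical 3-task run with mild forgetting of earlier tasks
A = np.array([[0.90, 0.00, 0.00],
              [0.85, 0.88, 0.00],
              [0.80, 0.84, 0.91]])
print(continual_metrics(A))   # BWT ~ -0.07, AF ~ 0.07, LA ~ 0.90
```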

2. Architectural Drivers: Depth–Width and Modular Design

Historically, the stability–plasticity trade-off was managed at the parameter or algorithmic level. Recent work demonstrates that network architecture alone—specifically depth and width under a fixed parameter budget—governs the trade-off:

  • Deeper/narrower models exhibit greater plasticity (better at learning new data); wider/shallower models retain more stability (less forgetting), as empirically demonstrated using split-budgets for networks like ResNet-18, MLPs, and Vision Transformers (Lu et al., 4 Jun 2025).
  • Recent frameworks such as Dual-Arch assign the plasticity objective to a deep/thin learner (Pla-Net) and the stability objective to a wide/shallow learner (Sta-Net), merging their expertise via knowledge distillation; a schematic distillation step is sketched after the table below. This division not only improves both metrics but also achieves substantial parameter efficiency (up to 87% reduction) without performance loss (Lu et al., 4 Jun 2025).
  • Adapter-based modularity, as in AdaLL, co-trains a backbone for stability (task-invariant features) and adapters for plasticity (task-specific features) under regularization constraints, outperforming both frozen and single-block fine-tuning (Wang et al., 8 Mar 2025).
| Architectural Approach | Stability Role | Plasticity Role |
|---|---|---|
| Deep/narrow | Poor | Strong |
| Wide/shallow | Strong | Poor |
| Dual-Arch (2-network) | Sta-Net: stability | Pla-Net: plasticity |
| Adapters (AdaLL) | Backbone: stability | Adapter: plasticity |

These architectural insights add a new dimension, beyond regularization or replay, to stability–plasticity optimization.
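The following PyTorch sketch illustrates the kind of distillation step such a dual-learner design relies on. The names `pla_net` and `sta_net`, the temperature `T`, and the mixing weight `alpha` are illustrative assumptions; the loss follows a standard soft-target knowledge-distillation recipe rather than the exact Dual-Arch formulation in the cited paper.

```python
import torch
import torch.nn.functional as F

def distill_step(pla_net, sta_net, x, y, optimizer, T=2.0, alpha=0.5):
    """One training step: the deep/thin Pla-Net has fit the new task quickly;
    the wide/shallow Sta-Net absorbs its knowledge via distillation while the
    label loss keeps its own, more stable predictions anchored."""
    with torch.no_grad():                           # plastic learner acts as teacher
        teacher_logits = pla_net(x)
    student_logits = sta_net(x)
    ce = F.cross_entropy(student_logits, y)         # ground-truth supervision
    kd = F.kl_div(                                  # soft-target distillation from Pla-Net
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * T * T
    loss = alpha * ce + (1 - alpha) * kd
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```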

3. Algorithmic Innovations and Gradient Control

Algorithmic methods for managing the trade-off include:

  • Null-Space Projection: Advanced Null Space (AdNS) projects new-task gradients onto the null space of feature representations from prior tasks, securing stability. The dimension of the null space directly governs plasticity: a larger null space allows faster adaptation but increases forgetting, as formalized by dual theoretical bounds (Kong et al., 2022, Lin et al., 2021); see the sketch after this list.
  • Multi-Objective Training: ParetoCL learns preference-conditioned solutions representing the full Pareto front between stability and plasticity by sampling trade-off weights $\alpha \in [0,1]$ during training and dynamically selecting the optimal mixture at inference via entropy minimization. This method achieves superior adaptation and robustness over prior single-weight replay approaches (Lai et al., 30 Mar 2025).
  • Selective Memory Replay and Inequality Constraints: Synergetic Memory Rehearsal (SyReM) augments memory stability via hard inequality constraints (never increasing average buffer loss), while amplifying plasticity through gradient-aligned rehearsal—replaying only those samples whose gradients are most similar to the current learning direction (Lin et al., 27 Aug 2025).
  • Neuron-Level Targeting: NBSP identifies "RL skill neurons" relevant to previous tasks and suppresses their gradients while enabling full updates elsewhere, improving both metrics with minimal architectural modification (Lan et al., 9 Apr 2025).
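A minimal sketch of the null-space idea referenced above, assuming old-task activations are collected into a feature matrix and the protected subspace is chosen by a spectral-energy threshold; AdNS's exact construction and its theoretical bounds are in the cited papers, not reproduced here.

```python
import torch

def null_space_projector(feat_prev, keep_ratio=0.95):
    """Build a projector onto the (approximate) null space of previous-task
    features.  feat_prev: (n_samples, d) activations from old tasks.
    Directions carrying `keep_ratio` of the spectral energy are protected;
    lowering the ratio enlarges the null space (more plasticity, more
    risk of forgetting)."""
    cov = feat_prev.T @ feat_prev / feat_prev.shape[0]   # uncentered feature covariance
    U, S, _ = torch.linalg.svd(cov)
    energy = torch.cumsum(S, dim=0) / S.sum()
    r = int((energy < keep_ratio).sum()) + 1             # rank of the protected subspace
    U_keep = U[:, :r]                                     # directions spanned by old-task features
    return torch.eye(feat_prev.shape[1]) - U_keep @ U_keep.T

def project_gradient(linear_layer, P):
    """Project the weight gradient so the update (approximately) leaves
    responses to previous-task features unchanged: grad <- grad @ P."""
    if linear_layer.weight.grad is not None:
        linear_layer.weight.grad.data = linear_layer.weight.grad.data @ P
```

The key property is that for any old-task input x lying in the protected subspace, the projected update satisfies (G P) x ≈ 0, so the layer's outputs on old data are preserved while learning proceeds in the remaining directions.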

4. Empirical and Theoretical Trade-off Analysis

Extensive empirical studies validate the existence and importance of the trade-off:

  • Across diverse benchmarks (CIFAR-100, ImageNet, CORe50, Meta-World, TT100K, railway scheduling), methods that favor only stability (e.g., rigid knowledge distillation, strict null-space projection) suffer poor adaptation, while methods focusing solely on plasticity risk catastrophic forgetting.
  • Branch-Tuning and MuFAN analyses highlight the layerwise divergence: batch normalization layers primarily underpin stability, while convolutional layers control plasticity. Decoupling their updates and using complementary normalization mechanisms (e.g., SPN) recover both objectives (Liu et al., 27 Mar 2024, Jung et al., 2023).
  • Theoretical bounds underline that the precise dimension of a projection (null-space) or architectural expansion (Fly Model) shifts the Pareto frontier, dictating the achievable stability–plasticity mix (Zou et al., 3 Feb 2025, Kong et al., 2022).
  • In reinforcement learning, AltNet demonstrates that periodic full parameter resets restore plasticity but historically cost stability—its twin-network mechanism avoids performance dips, anchoring stability while enabling arbitrarily frequent restoration of plasticity (Maheshwari et al., 30 Nov 2025).
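As a generic illustration of plasticity restoration (not AltNet's specific twin-network mechanism), the sketch below periodically re-initializes a network's layers during training; `reset_interval` and `agent.online_net` are assumed names.

```python
import torch.nn as nn

def reset_layers(model: nn.Module, layer_types=(nn.Linear, nn.Conv2d)):
    """Re-initialize selected layers to restore plasticity.  Used naively this
    sacrifices stability; twin-network schemes keep a second, un-reset network
    to anchor performance while the reset one relearns."""
    for m in model.modules():
        if isinstance(m, layer_types):
            m.reset_parameters()

# e.g. inside an RL training loop (sketch):
# if step % reset_interval == 0:
#     reset_layers(agent.online_net)
```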
| Methodology | Key Empirical Benefit | Trade-off Control Mechanism |
|---|---|---|
| Dual-Arch | +2–10 pt accuracy, –2–12 pt forgetting, –87% parameters | Architectural decoupling |
| SyReM | –0.01% BWT, –27% error vs. vanilla | Selective replay + hard constraint |
| Branch-Tuning | +4–5 pt accuracy vs. fine-tuning | Layer-wise branch expansion/compression |
| AdNS | +2–5 pt accuracy, optimal BWT | Null-space rank tuning |

5. Domain-Generalization and Biological Inspiration

Recent advances move beyond supervised class-incremental settings:

  • Continual Named Entity Recognition (CNER) strategies balance recall and acquisition via two-term loss functions that encompass pooled attention distillation and weighted parameter merging (Zhang et al., 5 Aug 2025).
  • In object detection, decoupling classification (plasticity via LoRA adapters) from localization (stability via freezing) in DETR-style detectors achieves robust cross-domain adaptation and anti-forgetting (Li et al., 14 Apr 2025); see the sketch after this list.
  • Biological models such as the Fly olfactory circuit introduce large expansion (random projection), sparse connectivity, and winner-take-all inhibition, empirically and theoretically boosting both stability and plasticity via increased orthogonality and reduced weight magnitudes; these principles inspire plug-and-play modules for machine learning backbones (Zou et al., 3 Feb 2025).
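A hedged sketch of the freeze-plus-adapter decoupling described above. `LoRALinear`, `bbox_head`, and `cls_head` are hypothetical names, and the rank and scaling defaults are assumptions rather than the cited method's settings.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal low-rank adapter around a frozen linear layer: y = Wx + s*BAx."""
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():          # stability: original weights are frozen
            p.requires_grad = False
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero-init: adapter starts as identity
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

def decouple_detector(detector):
    """Hypothetical helper: freeze the localization branch (stability) and give
    the classification branch a low-rank adapter (plasticity).  Attribute names
    are placeholders for whatever the detector actually uses."""
    for p in detector.bbox_head.parameters():
        p.requires_grad = False
    detector.cls_head = LoRALinear(detector.cls_head)
    return detector
```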

6. Feature-Based and Representation Analysis

  • CKA and t-SNE feature-space evaluations reveal the hidden bias of many continual learning algorithms toward stability, sometimes at the expense of meaningful plasticity. Algorithms maintaining nearly static feature extractors can still post strong task metrics. Feature representation analysis (ΔM′, layerwise CKA) is recommended alongside accuracy curves to more precisely monitor the trade-off (Kim et al., 2023, Liu et al., 27 Mar 2024).
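For reference, linear CKA between two feature matrices can be computed as below (a standard formulation, assumed here rather than taken from the cited papers); comparing a layer's features before and after learning a new task quantifies how much the extractor actually moved.

```python
import torch

def linear_cka(X, Y):
    """Linear Centered Kernel Alignment between feature matrices of shape
    (n_samples, d1) and (n_samples, d2).  Values near 1 indicate the two
    representations are (linearly) similar, i.e. the extractor barely changed."""
    X = X - X.mean(dim=0, keepdim=True)           # center features
    Y = Y - Y.mean(dim=0, keepdim=True)
    hsic = (Y.T @ X).norm(p="fro") ** 2           # ||Y^T X||_F^2
    norm_x = (X.T @ X).norm(p="fro")
    norm_y = (Y.T @ Y).norm(p="fro")
    return (hsic / (norm_x * norm_y)).item()
```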

7. Perspectives and Dynamic Adaptation

Research consensus emphasizes that paradigms locking in a fixed trade-off (via static hyperparameters or architecture) tend to be suboptimal. Dynamic control—whether through preference-conditioned inference (ParetoCL), modular gating (Adapters, Dual-Arch), or curriculum-driven expansion—yields substantially improved adaptation, generalization, and resource efficiency (Lai et al., 30 Mar 2025, Lan et al., 9 Apr 2025, Jaziri et al., 19 Aug 2024).

Future directions include automated hyperparameter adaptation (e.g., null-space rank), integration of biological modules, further layerwise isolation, and standardization of representation-based trade-off metrics in continual learning evaluations. The field continues to expand into more complex domains, including self-supervision, reinforcement learning, and open-world detection, leveraging diverse algorithmic and architectural levers for simultaneous stability and plasticity optimization.
