Stability–Plasticity Dilemma

Updated 6 February 2026

The stability–plasticity dilemma is the trade-off between acquiring new information (plasticity) and retaining established knowledge (stability); too much plasticity leads to catastrophic forgetting.
It is commonly formalized as a multi-objective optimization problem, with forgetting and retention quantified through metrics such as backward and forward transfer.
Architectural and algorithmic strategies, including null-space methods and branch-tuning, are used to balance these competing demands in continual learning systems.

The stability–plasticity dilemma describes the fundamental challenge in continual, lifelong, and incremental learning: balancing a system’s ability to rapidly acquire new information (plasticity) against its ability to retain established knowledge (stability). It arises in artificial neural networks, reinforcement learning agents, dense prediction pipelines, and biological circuits. The dilemma manifests as a trade-off: unconstrained learning enables swift adaptation but induces catastrophic forgetting, while overly conservative update rules preserve past knowledge at the expense of ongoing learning capacity. It is a central constraint in the design and analysis of algorithms for continual, online, and class-incremental learning.

1. Formalization and Quantification of the Stability–Plasticity Dilemma

Mathematically, the stability–plasticity dilemma is expressed through the evolution of a model’s loss or accuracy on sequences of tasks $\mathcal{T}_1, \ldots, \mathcal{T}_T$. As each new task arrives, plasticity is the ability to drive loss $\mathcal{L}_t$ low on task $t$, while stability is the ability to keep prior losses $\mathcal{L}_{1:t-1}$ approximately unchanged.

The dilemma is often captured through multi-objective or Pareto optimization:

$\min_{\theta}\;\big\{\,\mathcal{L}_{\text{new}}(\theta)\,,\ \max_{j<t} F_{j\gets t}\,\big\}$

where $F_{j\gets t} = \mathcal{L}_{j}(\theta_{t}) - \mathcal{L}_{j}(\theta_{j})$ quantifies forgetting on old task $j$ after learning task $t$ (Spotorno et al., 29 Jan 2026, Liu et al., 2024). Scalarizations, such as $\mathcal{L}_{\text{total}} = \lambda\mathcal{L}_{\text{new}} + (1-\lambda)\mathcal{L}_{\text{old}}$, operationalize the trade-off for algorithm design (Lai et al., 30 Mar 2025). In self-supervised learning, representation-similarity metrics such as Centered Kernel Alignment (CKA) quantify how much extracted features are preserved (stability) or shift (plasticity) across incremental steps (Liu et al., 2024, Kim et al., 2023).
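As a minimal sketch of the two quantities above, the snippet below computes per-task forgetting $F_{j\gets t}$ and the scalarized loss $\mathcal{L}_{\text{total}}$ from plain numbers. The function names and the loss values are hypothetical and chosen only for illustration.

```python
def forgetting(loss_after, loss_at_learning):
    """F_{j<-t} = L_j(theta_t) - L_j(theta_j) for each old task j."""
    return [after - at for after, at in zip(loss_after, loss_at_learning)]


def scalarized_loss(loss_new, loss_old, lam):
    """L_total = lam * L_new + (1 - lam) * L_old, with lam in [0, 1]."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * loss_new + (1.0 - lam) * loss_old


# Hypothetical losses on two old tasks: measured right after each task was
# learned, and measured again after training on a third task.
loss_at_learning = [0.25, 0.5]
loss_after_task3 = [0.75, 0.5]

per_task_forgetting = forgetting(loss_after_task3, loss_at_learning)
worst_case = max(per_task_forgetting)  # the max_j term of the Pareto objective
total = scalarized_loss(loss_new=0.5, loss_old=1.0, lam=0.25)
```

A small `lam` weights the old-task loss more heavily (favoring stability), while `lam` close to 1 favors the new task (plasticity).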

Key metrics include:

Property	Quantification (Typical)
Stability	Backward transfer (BWT): $(1/(T-1))\sum_{i=1}^{T-1} (A_{T,i} - A_{i,i})$
Plasticity	Forward transfer (FWT): $(1/(T-1))\sum_{i=2}^{T} (A_{i,i} - \tilde{A}_{i})$
Joint	Average incremental/test accuracy; multi-objective accuracy and forgetting trade-off

Here $A_{T,i}$ is accuracy on task $i$ after learning $T$ tasks, $A_{i,i}$ is accuracy on task $i$ just after it is learned, and $\tilde{A}_{i}$ is accuracy on task $i$ when trained from scratch.
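As a rough sketch, the backward-transfer formula from the table can be computed from a matrix of accuracies. The forward-transfer function assumes one common convention (the mean gap between accuracy just after learning a task and a from-scratch baseline on that task). The names `acc` and `scratch`, both function names, and all numbers are illustrative assumptions.

```python
def backward_transfer(acc):
    """BWT = (1/(T-1)) * sum over old tasks i of (A[T][i] - A[i][i]).

    acc[t][i] is the accuracy on task i after learning tasks 0..t
    (0-indexed), so acc[T-1] holds the final accuracies.
    """
    T = len(acc)
    if T < 2:
        raise ValueError("need at least two tasks")
    return sum(acc[T - 1][i] - acc[i][i] for i in range(T - 1)) / (T - 1)


def forward_transfer(acc, scratch):
    """Mean of (A[i][i] - scratch[i]) over tasks 2..T (assumed convention)."""
    T = len(acc)
    if T < 2:
        raise ValueError("need at least two tasks")
    return sum(acc[i][i] - scratch[i] for i in range(1, T)) / (T - 1)


# Made-up accuracies for three tasks; entries above the diagonal are unused.
acc = [
    [0.75, 0.0, 0.0],
    [0.5, 0.75, 0.0],
    [0.25, 0.5, 1.0],
]
scratch = [0.75, 0.5, 0.75]  # accuracy of a model trained on each task alone

bwt = backward_transfer(acc)          # negative values indicate forgetting
fwt = forward_transfer(acc, scratch)  # positive values indicate helpful transfer
```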

In reinforcement learning, plasticity loss is quantified by the difference in return between a plastic (freshly initialized) agent and one that has undergone prolonged training: $\Delta J = J(\pi_{\text{fresh}}) - J(\pi_{\text{trained}})$, where $J(\cdot)$ denotes expected return (Maheshwari et al., 30 Nov 2025).

2. Architectural and Algorithmic Manifestations

The capacity and dynamics of the underlying architecture determine the attainable stability–plasticity equilibrium.

Depth vs. Width: Deeper networks are empirically more plastic, while wider networks yield greater stability under parameter-equalized constraints (Lu et al., 4 Jun 2025).
Branching and Adapters: Splitting the model into a shared backbone and task-specific adapters enables the backbone to accumulate invariant knowledge (stability) while adapters absorb task-specific shifts (plasticity) (Wang et al., 8 Mar 2025).
Null-space and Subspace Methods: Advanced Null Space approaches project gradients into low-rank subspaces orthogonal to previous task data, explicitly controlling allowable deviation for plasticity (Kong et al., 2022, Lin et al., 2021).

Several algorithmic paradigms exhibit distinctive stability–plasticity regimes:

Method Paradigm	Plasticity Impact	Stability Impact	Notable Examples
Experience Replay	High (if buffer is sufficient)	Moderate–High	ParetoCL (Lai et al., 30 Mar 2025), SyReM (Lin et al., 27 Aug 2025)
Regularization (EWC, SI, etc.)	Moderate	High (may suppress plasticity)	EWC (Zou et al., 3 Feb 2025), AdNS (Kong et al., 2022)
Architectural Expansion/Branching	Very High (new params)	Very High (isolation per task)	Branch-Tuning (Liu et al., 2024), DER, AdaLL (Wang et al., 8 Mar 2025)
Modular/Library Sovereignty	Very High (plasticity via switching)	Absolute (frozen specialists)	HYDRA (Spotorno et al., 29 Jan 2026)

3. Theoretical Analyses and Capacity Dynamics

Recent work formalizes stability–plasticity via effective model capacity, showing that in any non-stationary continual learning regime, a neural network’s ability to represent both past and new tasks is itself non-stationary (Chakraborty et al., 11 Aug 2025). The Continual Learning Effective Model Capacity (CLEMC) introduced in that work drifts upward (capacity deteriorates) under continual distributional shift, regardless of architecture or optimizer. Weighted loss regularization and replay can slow but not halt this drift; model expansion or adaptation is required for long-term balance (Chakraborty et al., 11 Aug 2025). This formalizes why replay, regularization, and fixed-parameter strategies never fully resolve the dilemma.

In null-space approaches, the rank (dimension) of the projected subspace directly modulates the trade-off:

Larger null space: more plasticity, less stability.
Smaller null space: less plasticity, more stability.

Similar trade-offs exist for the number of branching parameters or width of adapters (Kong et al., 2022, Wang et al., 8 Mar 2025).
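The rank trade-off in null-space methods can be illustrated with a minimal sketch: a gradient is projected onto the orthogonal complement of a protected subspace spanned by old-task feature directions. This is a generic Gram-Schmidt projection written for illustration, not the procedure of any cited method; the function names and the toy vectors are assumptions.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def orthonormal_basis(vectors, tol=1e-10):
    """Gram-Schmidt: orthonormal basis for the span of the given vectors."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > tol:  # skip directions already covered by the basis
            basis.append([wi / norm for wi in w])
    return basis


def project_to_null_space(grad, protected_basis):
    """Remove from grad every component lying in the protected subspace."""
    g = list(grad)
    for b in protected_basis:
        c = dot(g, b)
        g = [gi - c * bi for gi, bi in zip(g, b)]
    return g


# Toy setup: two feature directions from old tasks in a 3-D parameter space.
old_features = [[1, 0, 0], [0, 1, 0]]
grad = [3, 4, 5]

basis = orthonormal_basis(old_features)
# Protect both directions: small null space, more stability.
strict_update = project_to_null_space(grad, basis)
# Protect only the first direction: larger null space, more plasticity.
relaxed_update = project_to_null_space(grad, basis[:1])
```

The strict update cannot change the model's response along either old-task direction, while the relaxed update keeps more of the original gradient at the cost of interfering with the second direction.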

4. Algorithmic Strategies: Modulating and Decoupling the Trade-off

Gradient Projection and Null Space Approaches

Advanced Null Space (AdNS) projects increments into shared low-rank subspaces while tightening constraints as tasks accumulate, implementing a non-uniform interference bound to interpolate between full stability and unrestricted plasticity (Kong et al., 2022, Lin et al., 2021). Linear connectors blend stability-oriented and plasticity-oriented optima via explicit interpolation in parameter space, controlling the trade-off via a convex-combination coefficient $\alpha \in [0, 1]$ (Lin et al., 2021).
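The linear-connector idea reduces to a convex combination of two parameter vectors. The sketch below assumes flat parameter lists and writes the coefficient as `alpha`, with `alpha = 0` returning the stability-oriented solution and `alpha = 1` the plasticity-oriented one; that orientation, the function name, and the numbers are illustrative assumptions.

```python
def linear_connector(theta_stable, theta_plastic, alpha):
    """Return (1 - alpha) * theta_stable + alpha * theta_plastic."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if len(theta_stable) != len(theta_plastic):
        raise ValueError("parameter vectors must have the same length")
    return [
        (1.0 - alpha) * s + alpha * p
        for s, p in zip(theta_stable, theta_plastic)
    ]


# Hypothetical two-parameter optima found by a stability-oriented and a
# plasticity-oriented training run.
theta_stable = [0.0, 2.0]
theta_plastic = [4.0, 0.0]

theta_mixed = linear_connector(theta_stable, theta_plastic, alpha=0.25)
```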

Multi-Objective and Preference-Conditioned Optimization

Pareto Continual Learning (ParetoCL) recasts stability and plasticity as multi-objective criteria, learning a continuum of models parameterized by trade-off preferences (e.g., a $(\lambda, 1-\lambda)$ weighting for new vs. prior data). At inference, the most confident prediction under the learned Pareto front is selected per sample (Lai et al., 30 Mar 2025).
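The per-sample selection step can be sketched as follows, assuming the model has already produced one class-probability vector per trade-off preference. Confidence is measured here as the maximum class probability, which is an assumption made for illustration rather than a statement of the criterion used in the cited work; the function name and the numbers are likewise hypothetical.

```python
def most_confident_prediction(probs_by_preference):
    """Pick the preference whose prediction has the highest top probability.

    probs_by_preference maps a preference weight to the class-probability
    vector produced under that preference. Returns (preference, class index).
    """
    if not probs_by_preference:
        raise ValueError("need at least one candidate prediction")
    best_pref, best_probs = max(
        probs_by_preference.items(), key=lambda item: max(item[1])
    )
    return best_pref, best_probs.index(max(best_probs))


# Made-up predictions for one sample under three preferences.
candidates = {
    0.25: [0.5, 0.3, 0.2],
    0.5: [0.1, 0.8, 0.1],
    0.75: [0.4, 0.35, 0.25],
}

chosen_preference, predicted_class = most_confident_prediction(candidates)
```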

Module Specialization and Modular Sovereignty

The HYDRA paradigm solves the dilemma by assembling frozen libraries of regime-specific specialist networks, blended online via uncertainty-aware gating, eliminating catastrophic forgetting by design (Spotorno et al., 29 Jan 2026). Similarly, Dual-Arch uses independent specialist networks for stability and plasticity, trained sequentially with knowledge distillation (Lu et al., 4 Jun 2025).
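One simple way to picture uncertainty-aware blending of frozen specialists is inverse-uncertainty weighting, sketched below. This is a generic illustration and not the gating rule of HYDRA; the function name, the `eps` constant, and the numbers are assumptions.

```python
def blend_specialists(predictions, uncertainties, eps=1e-8):
    """Blend scalar outputs of frozen specialists by inverse uncertainty.

    Specialists with lower uncertainty receive larger weights; the
    specialists themselves are never updated, so nothing is overwritten.
    Returns (blended prediction, normalized weights).
    """
    if not predictions or len(predictions) != len(uncertainties):
        raise ValueError("need one uncertainty per prediction")
    inverse = [1.0 / (u + eps) for u in uncertainties]
    total = sum(inverse)
    weights = [w / total for w in inverse]
    blended = sum(w * p for w, p in zip(weights, predictions))
    return blended, weights


# Two hypothetical specialists; the first is more certain about this input.
blended, weights = blend_specialists(
    predictions=[2.0, 6.0], uncertainties=[1.0, 4.0]
)
```

Adaptation in this picture comes only from how the weights shift between specialists, which is why the footprint grows with the number of regimes covered.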

Replay and Sample Selection

Selective rehearsal (e.g., SyReM) elevates plasticity while maintaining buffer-enforced stability by replaying only those memory samples whose gradients are maximally aligned with the current data, enforced by explicit gradient projection constraints (Lin et al., 27 Aug 2025).
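The selection step can be sketched as ranking memory samples by the cosine similarity between their gradients and the gradient on the current data, then replaying the top `k`. This covers only the selection idea; the projection constraints are omitted, and the function names and toy gradients are assumptions rather than the SyReM implementation.

```python
def cosine(u, v):
    """Cosine similarity; defined as 0.0 if either vector is all zeros."""
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def select_aligned(memory_grads, current_grad, k):
    """Indices of the k memory samples most aligned with current_grad."""
    ranked = sorted(
        range(len(memory_grads)),
        key=lambda i: cosine(memory_grads[i], current_grad),
        reverse=True,
    )
    return ranked[:k]


# Toy per-sample gradients for four buffered samples.
memory_grads = [[-1.0, 0.0], [1.0, 1.0], [2.0, 0.0], [0.0, 1.0]]
current_grad = [1.0, 0.0]

replay_indices = select_aligned(memory_grads, current_grad, k=2)
```

Sample 0 points directly against the current gradient, so replaying it would slow learning on the new data; it is ranked last.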

Neuron-level Control and Fine-Grained Modulation

Neuron-level strategies, such as gradient masking over skill neurons in RL, further refine the balance by targeting stability only for neurons empirically found to be critical for previously acquired skills, maintaining global plasticity elsewhere (Lan et al., 9 Apr 2025).
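A minimal sketch of neuron-level gradient masking, assuming gradients are stored as one row per neuron and that the set of skill neurons has already been identified by some separate procedure. The function name and the numbers are hypothetical.

```python
def mask_skill_neuron_grads(grads, skill_neurons):
    """Zero the gradient rows of protected neurons; leave the rest intact.

    grads[i] is the gradient of the incoming weights of neuron i.
    skill_neurons is a set of neuron indices to freeze.
    """
    return [
        [0.0] * len(row) if idx in skill_neurons else list(row)
        for idx, row in enumerate(grads)
    ]


# Three neurons with two incoming weights each; neuron 1 is protected.
grads = [[0.5, -1.0], [2.0, 0.25], [-0.75, 1.5]]
masked = mask_skill_neuron_grads(grads, skill_neurons={1})
```

Only the protected neuron stops learning; every other neuron receives its full gradient, which is how this family of methods keeps plasticity global while making stability local.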

5. Empirical and Domain-Specific Manifestations

Self-supervised vision: Freezing BatchNorm layers implements stability; tuning only convolutional layers offers plasticity. Branch-tuning isolates new information to trainable “branch” kernels before merging, achieving near-optimal stability/plasticity (Liu et al., 2024).
Class-incremental Learning: Most methods overemphasize stability, often leaving feature extractors functionally unchanged across tasks, causing a lack of genuine plastic feature acquisition. Representation analysis via linear-probe retraining and CKA exposes this phenomenon (Kim et al., 2023).
Reinforcement Learning: Alternating twin network resets (AltNet) restore plasticity without catastrophic performance drops, whereas single-network resets degrade stability (Maheshwari et al., 30 Nov 2025). Neural architectures inspired by the fly olfactory circuit employ sparse expansion, high-dimensional mixing, and winner-take-all coding to enhance both stability and plasticity (Zou et al., 3 Feb 2025).
Adaptive Control and CPS: In certifiable cyber-physical systems, modular sovereignty offers robust guarantees against catastrophic instability, as frozen modules ensure regime-specific retention and online blending provides necessary adaptation (Spotorno et al., 29 Jan 2026).
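The CKA-based representation analysis mentioned for class-incremental learning can be sketched with the standard linear CKA formula, $\lVert Y^\top X\rVert_F^2 / (\lVert X^\top X\rVert_F \,\lVert Y^\top Y\rVert_F)$ on column-centered feature matrices. The snippet is a plain-Python illustration with made-up features; the function names are assumptions.

```python
def center_columns(X):
    """Subtract each column's mean (rows are samples, columns are features)."""
    n = len(X)
    means = [sum(col) / n for col in zip(*X)]
    return [[x - m for x, m in zip(row, means)] for row in X]


def cross_frobenius_sq(X, Y):
    """Squared Frobenius norm of X^T Y."""
    total = 0.0
    for x_col in zip(*X):
        for y_col in zip(*Y):
            total += sum(a * b for a, b in zip(x_col, y_col)) ** 2
    return total


def linear_cka(X, Y):
    """Linear CKA between two feature matrices over the same samples."""
    Xc, Yc = center_columns(X), center_columns(Y)
    denom = (cross_frobenius_sq(Xc, Xc) * cross_frobenius_sq(Yc, Yc)) ** 0.5
    if denom == 0.0:
        raise ValueError("features have zero variance")
    return cross_frobenius_sq(Xc, Yc) / denom


# Features of four samples before an incremental step, and two hypothetical
# outcomes after it: a pure rescaling and an unrelated representation.
before = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
rescaled = [[2.0 * v for v in row] for row in before]
unrelated_a = [[1.0], [-1.0], [1.0], [-1.0]]
unrelated_b = [[1.0], [1.0], [-1.0], [-1.0]]

cka_preserved = linear_cka(before, rescaled)
cka_changed = linear_cka(unrelated_a, unrelated_b)
```

A value near 1 across incremental steps indicates that the feature extractor has barely changed, which is the over-stabilization pattern described for class-incremental methods; lower values indicate that the representation has shifted.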

6. Limitations, Open Problems, and Future Directions

Despite algorithmic advances, the stability–plasticity dilemma remains only partially resolved. Empirical and theoretical results indicate that:

Model capacity must be treated as an evolving, not fixed, resource; remodeling or expansion may be needed in highly non-stationary regimes (Chakraborty et al., 11 Aug 2025).
Explicit, adaptive trade-off mechanisms—branch size in branching, null-space rank, preference conditioning—require further study and dynamic tuning (Kong et al., 2022, Lai et al., 30 Mar 2025).
Many current benchmarks and evaluation metrics can be gamed by pathological solutions that freeze large portions of the model, achieving apparent stability at the expense of meaningful learning (Kim et al., 2023).
Domain transfer (class-incremental segmentation, named-entity recognition, motion forecasting) raises new challenges for expressing and quantifying stability and plasticity due to output granularity, dynamic label semantics, or real-time requirements (Li et al., 2024, Zhang et al., 5 Aug 2025, Lin et al., 27 Aug 2025).
Modular and library-based methods trade off parameter footprint and latency against provable absence of forgetting, requiring coordinated research on uncertainty decomposition, gating, and model pruning (Spotorno et al., 29 Jan 2026).

Emerging research aims to unify capacity-aware control loops, architectural modularization, and adaptive regularization into a coherent methodology for continually reconciling stability and plasticity under application- and safety-driven constraints.

7. Summary Table: Key Aspects Across Domains

Aspect	Vision (SSL, CIL)	RL/Control	Dense Prediction/NLP	Theory & Generalization
Main Metrics	ACC, BWT, FWT, CKA	Episodic return, FM, FWT, skill retention	mIoU, F1(old/new), FM	CLEMC, differential equations
Core Mechanisms	Branch tuning, null space, adapters	Twin network resets, neuron masking, CDE	Loss constraints, module fusion	Dynamic capacity, Pareto fronts
Notable Limits	Over-stabilization, parameter bloat	Reset instability, capacity drift	Label drift, buffer limits	Capacity divergence, design trade-offs
Exemplars	(Liu et al., 2024, Kong et al., 2022, Wang et al., 8 Mar 2025)	(Maheshwari et al., 30 Nov 2025, Lan et al., 9 Apr 2025, Jaziri et al., 2024)	(Li et al., 2024, Zhang et al., 5 Aug 2025)	(Chakraborty et al., 11 Aug 2025, Lai et al., 30 Mar 2025, Spotorno et al., 29 Jan 2026)

The stability–plasticity dilemma remains the defining constraint for scalable, general continual learning, motivating ongoing innovations in algorithmic modulation, modular architectural design, and system-level certification.