Loss of Plasticity in Neural Networks

Updated 3 July 2026

Loss of Plasticity is the gradual reduction in a network's ability to adapt to new data or tasks during continual learning, distinct from catastrophic forgetting.
It is quantified using measures like gradient norm, activation footprint, and effective Hessian rank, revealing reduced adaptability across diverse learning paradigms.
Mitigation strategies include normalization, selective weight reinitialization, and curvature regularization to sustain neural network plasticity over time.

Loss of plasticity is the degradation of a neural network's ability to adapt its parameters to new data, tasks, or environments as training progresses, especially in non-stationary or continual learning scenarios. Unlike catastrophic forgetting (loss of previous knowledge), loss of plasticity refers to an impaired capacity to acquire new skills or knowledge, even though the network may maintain performance on old tasks. This phenomenon manifests in supervised, reinforcement, and multi-agent learning, and affects both densely and sparsely parameterized models, including large language models (LLMs) and multi-expert architectures.

1. Formal Definitions, Measurement, and Distinction from Forgetting

Plasticity is most precisely defined as a model’s ability to reduce its objective on newly arriving data or tasks as effectively as a freshly initialized model under a fixed training budget. Loss of plasticity occurs when, after a sequence of tasks or prolonged training, the network’s adaptation to new targets is systematically poorer than that of a freshly initialized counterpart (Lyle et al., 2024, He, 22 Mar 2026, Joudaki et al., 30 Sep 2025).

Formally, given a parameter vector $\theta$ at training step $t$ and a new task $k$ with data from $P_k$, plasticity can be quantified by the gap $\Delta_k = L_k(\theta_k) - \min_\theta L_k(\theta)$, where $L_k$ is the loss on task $k$ and $\theta_k$ is the parameter vector after training on prior tasks. A rising or non-shrinking $\Delta_k$ with $k$ indicates plasticity loss (Wang et al., 25 Mar 2025).
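
As a minimal sketch of how such a gap could be measured, the snippet below trains a linear regression model for a fixed budget from two starting points and reports the remaining excess loss over the least-squares optimum. The model, the step budget, the learning rate, and all names (mse, gap_after_budget, theta_prior, theta_fresh) are illustrative assumptions and do not come from the cited works.

```python
import numpy as np

def mse(theta, X, y):
    """Mean squared error of the linear model X @ theta."""
    r = X @ theta - y
    return float(np.mean(r ** 2))

def gap_after_budget(theta_start, X, y, steps=50, lr=0.05):
    """Run a fixed budget of gradient steps, then return L(theta) - min L."""
    theta = theta_start.copy()
    n = len(y)
    for _ in range(steps):
        grad = 2.0 / n * X.T @ (X @ theta - y)
        theta -= lr * grad
    # The least-squares solution gives the minimum achievable loss on this task.
    theta_star, *_ = np.linalg.lstsq(X, y, rcond=None)
    return mse(theta, X, y) - mse(theta_star, X, y)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=200)

theta_prior = 10.0 * rng.normal(size=5)  # hypothetical parameters left by earlier tasks
theta_fresh = np.zeros(5)                # hypothetical fresh initialization

gap_prior = gap_after_budget(theta_prior, X, y)
gap_fresh = gap_after_budget(theta_fresh, X, y)
print(f"gap from prior parameters: {gap_prior:.4g}, from fresh init: {gap_fresh:.4g}")
```

A linear model is convex and does not itself lose plasticity; it only makes the bookkeeping concrete. For a deep network the same comparison would be repeated for each task $k$ and the gap tracked over $k$.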

Alternative proxies include:

Online or per-task accuracy immediately after presentation of new tasks (Dohare et al., 2023, Wang et al., 2024).
Gradient-norm and activation-footprint measures (e.g., fraction of ReLU units with nonzero activation) (Abbas et al., 2023, Klein et al., 2024).
Effective rank of the Hessian (curvature) or Neural Tangent Kernel (NTK) (He et al., 26 Sep 2025, Lewandowski et al., 2023, Luo et al., 6 May 2026).
Dormant/stagnant neuron ratios or mean update activity (Liu et al., 24 Jun 2026, Liu et al., 2024).
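
Several of these proxies are cheap to compute from a batch of pre-activations and gradients. The sketch below is one possible formulation: the normalized-activity dormancy criterion and the threshold tau are assumptions chosen for illustration, not the exact definitions used in the cited papers.

```python
import numpy as np

def active_fraction(pre_acts):
    """Fraction of ReLU units that fire on at least one example in the batch."""
    return float(np.mean((pre_acts > 0).any(axis=0)))

def dormant_ratio(pre_acts, tau=0.1):
    """Share of units whose mean activation is at most tau times the layer average."""
    acts = np.maximum(pre_acts, 0.0)
    per_unit = acts.mean(axis=0)
    layer_mean = per_unit.mean()
    if layer_mean == 0.0:
        return 1.0  # every unit is silent on this batch
    return float(np.mean(per_unit / layer_mean <= tau))

def grad_norm(grads):
    """Global L2 norm over a list of gradient arrays."""
    return float(np.sqrt(sum(np.sum(g ** 2) for g in grads)))

# Rows are examples, columns are hidden units (pre-activation values).
pre = np.array([[1.0, -2.0, 0.5, -0.1],
                [2.0, -1.0, 0.0, -0.3],
                [0.5, -3.0, 1.5, -0.2]])
print(active_fraction(pre), dormant_ratio(pre))
```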

Plasticity loss is distinct from forgetting: a model may retain old skills yet be unable to fit new tasks (plasticity loss), or conversely, forget without losing the ability to re-learn (pure catastrophic forgetting) (Klein et al., 2024).

2. Mechanistic Explanations and Diagnostic Theory

Multiple interrelated mechanisms contribute to loss of plasticity, with convergent evidence across theoretical and empirical works:

2.1 Collapse of Directions of Curvature

Loss of plasticity is often linked to the collapse of the Hessian or NTK spectrum, such that the directions along which gradient-based learning can make progress vanish. Formally, an effective-rank measure of the Hessian or NTK shrinks over tasks, so gradient steps become ineffective and loss reduction stalls (Lewandowski et al., 2023, He et al., 26 Sep 2025, Luo et al., 6 May 2026).
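
One way to track such a collapse is an entropy-based effective rank of the matrix's singular values. The sketch below uses that definition as an assumption; the cited works may use a different effective-rank measure (for example a thresholded count of singular values), and the feature matrices here are synthetic.

```python
import numpy as np

def effective_rank(M, eps=1e-12):
    """exp(entropy) of the normalized singular values of M."""
    s = np.linalg.svd(M, compute_uv=False)
    total = s.sum()
    if total <= eps:
        return 0.0
    p = s / total
    p = p[p > eps]  # drop numerically zero directions before taking logs
    return float(np.exp(-np.sum(p * np.log(p))))

rng = np.random.default_rng(0)
healthy = rng.normal(size=(50, 8))  # features spread over 8 directions
# Nearly rank-1 features: every column is a noisy copy of the first one.
collapsed = healthy[:, :1] @ np.ones((1, 8)) + 1e-3 * rng.normal(size=(50, 8))

er_healthy = effective_rank(healthy)
er_collapsed = effective_rank(collapsed)
print(f"healthy: {er_healthy:.2f}, collapsed: {er_collapsed:.2f}")
```

The same function can be applied to a Hessian estimate or to an NTK Gram matrix evaluated on a fixed probe batch, logged once per task.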

2.2 Saturation and Dormancy of Units

“Dead” (ReLU) or “saturated” (tanh) units arise when preactivation statistics drift, producing vanishing derivatives and zero gradient flows. This leads to units that neither contribute forward nor backward signals (dormant neurons), resulting in shrinking representational subspaces and expressivity (Lyle et al., 2024, Klein et al., 2024, Liu et al., 2024).

2.3 Optimization Landscapes and Trapping

Optimization-centric perspectives (OCP) theoretically equate plasticity loss to being trapped in local optima that are poor for new objectives but were good for previous ones, and formalize dormancy as equivalent to persistent zero-gradient states. Parameter entrenchment reduces the effective dimensionality available for learning on future tasks (He, 22 Mar 2026). In multi-agent RL, neurons may become “stagnant” with negligible gradient updates relative to their weight norm, hindering adaptation (Liu et al., 24 Jun 2026).
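
The notion of a stagnant neuron can be illustrated with a relative update-activity check: compare the size of each neuron's weight change with the size of its weights. The function name and the threshold below are hypothetical and are not taken from the cited work.

```python
import numpy as np

def stagnant_mask(W_before, W_after, threshold=1e-3, eps=1e-12):
    """Rows are neurons; flag those whose update is tiny relative to their weight norm."""
    update = np.linalg.norm(W_after - W_before, axis=1)
    scale = np.linalg.norm(W_before, axis=1) + eps
    return update / scale < threshold

W_before = np.array([[1.0, 0.0],
                     [0.0, 2.0]])
W_after = np.array([[1.0, 0.0],   # neuron 0 did not move
                    [0.5, 2.0]])  # neuron 1 moved by 25% of its norm
mask = stagnant_mask(W_before, W_after)
print(mask)
```

In practice the ratio would likely be averaged over many updates rather than read from a single step.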

2.4 Rank Collapse and Feature Representation

Plasticity loss frequently coincides with collapse of the effective rank of feature representations, especially in recurrent or mixture-of-experts policies (spectral collapse), further constraining adaptation (Luo et al., 6 May 2026, Klein et al., 2024).

2.5 Weight Growth and Sharpness

Unchecked parameter norm growth increases loss-surface sharpness and makes fixed-size updates small relative to the weights, further impeding adaptation. This manifests in both regression (when target magnitudes are large) and reinforcement learning (bootstrapped TD targets with large scale) (Lyle et al., 2024, Klein et al., 2024, Lyle et al., 2023).

2.6 Non-stationarity, Primacy Bias, and Churn

Input or target non-stationarity causes early experiences (primacy bias) to dominate the parameter space, leading to capacity loss and drift toward local minima specific to earlier tasks (Abbas et al., 2023, Klein et al., 2024, Tang et al., 31 May 2025). Churn—the output drift for out-of-batch data between updates—grows as NTK rank collapses, exacerbating plasticity loss (Tang et al., 31 May 2025).
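
Churn as described here can be estimated by comparing predictions on reference inputs that were not part of the update batch, before and after a single update. The sketch below assumes a linear model and a squared-error step; the function names and data are illustrative.

```python
import numpy as np

def predict(theta, X):
    """Linear model used only for illustration."""
    return X @ theta

def churn(predict_fn, theta_before, theta_after, X_ref):
    """Mean absolute change in predictions on out-of-batch reference inputs."""
    delta = predict_fn(theta_after, X_ref) - predict_fn(theta_before, X_ref)
    return float(np.mean(np.abs(delta)))

theta = np.array([1.0, -1.0])
X_batch, y_batch = np.array([[1.0, 0.0]]), np.array([3.0])
X_ref = np.array([[1.0, 1.0], [0.0, 2.0]])  # points not in the update batch

# One gradient step on the squared error of the batch.
grad = 2.0 * X_batch.T @ (X_batch @ theta - y_batch) / len(y_batch)
theta_new = theta - 0.1 * grad

c = churn(predict, theta, theta_new, X_ref)
print(c)
```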

3. Empirical Demonstrations and Scaling Behavior

Plasticity loss is robustly observed across settings:

Supervised Continual Learning: Sequentially scheduled binary classification on ImageNet (2,000 tasks) results in a 12% accuracy drop with vanilla training; on permuted-MNIST, accuracy decays to chance (Dohare et al., 2023).
Deep Reinforcement Learning: Cycling through Atari games for 1B+ frames, value networks exhibit a >90% drop in gradient norm and collapse in activation footprint; learning stalls despite uncompromised replay buffer size (Abbas et al., 2023).
LLMs: GPT-style transformers, even up to 314M parameters, display loss of plasticity in multilingual continual pretraining, as measured by increased adaptation time (AUC) on a held-out probing language. The onset follows a scaling law that is sublinear in parameter count (Hernandez-Garcia et al., 23 Jun 2026).
Gradual vs Abrupt Environment Shifts: Abrupt task changes induce pronounced loss of plasticity, while gradual interpolation or task mixing preserves both trainability and generalizability, suggesting that much of the reported effect may stem from unrealistic experimental protocols (Liu et al., 9 Feb 2026).
Multi-Agent RL: Value factorization methods lose plasticity due to growing populations of stagnant neurons in the mixing network, directly traced via relative update activity metrics (Liu et al., 24 Jun 2026).

4. Mitigation Strategies

No single mitigation is universally sufficient—effective solutions typically target specific causal mechanisms and are often complementary.

4.1 Layer Normalization and Weight Decay

Layer normalization enforces stable preactivation statistics, forestalling saturation and maintaining gradient flow. L2 regularization prevents explosive growth in parameter norms (sharpening), keeping representations in a regime conducive to gradient-based adaptation. The joint application (“Swiss-cheese” model) effectively preserves trainability in nonstationary settings (Lyle et al., 2024, Lyle et al., 2023, He et al., 26 Sep 2025).
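
The two ingredients can be sketched in a few lines. The layer normalization below omits the learnable gain and bias, and the learning rate and decay coefficient are placeholder values, so this is an illustration of the mechanism rather than a recommended configuration.

```python
import numpy as np

def layer_norm(h, eps=1e-5):
    """Normalize each row of pre-activations to zero mean and unit variance."""
    mu = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    return (h - mu) / np.sqrt(var + eps)

def sgd_step_with_l2(W, grad, lr=0.01, weight_decay=1e-2):
    """SGD step on loss + (weight_decay / 2) * ||W||^2, which pulls W toward zero."""
    return W - lr * (grad + weight_decay * W)

h = np.array([[1.0, 2.0, 3.0, 4.0]])
out = layer_norm(h)
out_scaled = layer_norm(100.0 * h)  # same output even if pre-activations drift in scale
print(np.round(out, 3), np.round(out_scaled, 3))
```

The scale invariance shown in the last two lines is what keeps pre-activation statistics stable under drift, while the decay term counteracts norm growth.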

4.2 Structural and Reset-based Interventions

Neuron and Weight Reinitialization: Periodic stochastic reinitialization of dormant units (“Continual Backpropagation”, ReDo), or of low-utility weights (Selective Weight Reinitialization), can indefinitely sustain high plasticity, especially when granularity is tuned to match network architecture (per-weight preferred in small or layer-normed networks) (Dohare et al., 2023, Hernandez-Garcia et al., 31 Jul 2025, Klein et al., 2024).
Concatenated ReLUs and Activation Redesign: CReLU and related designs prevent activation footprint collapse, guaranteeing persistent gradient flow and activity (Abbas et al., 2023, Klein et al., 2024).
Neuroplastic Expansion: Dynamically growing and pruning the network topology based on gradient magnitude and activation (with experience review) effectively maintains the active neuron ratio and adaptability, outperforming reset and normalization baselines in RL (Liu et al., 2024).
Knowledge-retentive Neuron Surgery (KNIFE): In multi-agent RL, composite neuron replacements—which preserve knowledge while restoring adaptation capacity—address persistent “stagnant” neurons (Liu et al., 24 Jun 2026).
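
Two of the interventions above lend themselves to a compact sketch: resetting selected hidden units and the concatenated ReLU. The reset below redraws incoming weights and zeroes outgoing weights so that the network's output is unchanged at the moment of the reset; the He-style initialization scale and the way units are selected are assumptions, and the cited methods add utility tracking and maturity thresholds that are not shown.

```python
import numpy as np

def reset_units(W_in, b, W_out, unit_idx, rng):
    """Reinitialize the given hidden units.

    W_in: (n_units, n_inputs), b: (n_units,), W_out: (n_outputs, n_units).
    Returns modified copies; the inputs are left untouched.
    """
    W_in, b, W_out = W_in.copy(), b.copy(), W_out.copy()
    fan_in = W_in.shape[1]
    scale = np.sqrt(2.0 / fan_in)  # assumed He-style initialization scale
    W_in[unit_idx] = rng.normal(0.0, scale, size=(len(unit_idx), fan_in))
    b[unit_idx] = 0.0
    W_out[:, unit_idx] = 0.0  # reset units contribute nothing until retrained
    return W_in, b, W_out

def crelu(x):
    """Concatenated ReLU: [relu(x), relu(-x)], so every input activates one half."""
    return np.concatenate([np.maximum(x, 0.0), np.maximum(-x, 0.0)], axis=-1)

rng = np.random.default_rng(0)
W_in, b, W_out = np.ones((3, 2)), np.ones(3), np.ones((1, 3))
W_in2, b2, W_out2 = reset_units(W_in, b, W_out, [1], rng)
print(W_out2, crelu(np.array([-1.0, 2.0])))
```

Note that CReLU doubles the width seen by the next layer, so the following weight matrix must be sized accordingly.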

4.3 Curvature and Rank Regularization

Feature Rank and Spectral Penalties: Directly penalizing representation collapse (effective rank regularization, Parseval/Feature Gram metrics) preserves the diversity of update directions needed for task adaptation. This is especially critical for MoE-type actor-critic architectures (He et al., 26 Sep 2025, Luo et al., 6 May 2026).
2-Wasserstein Regularization: Penalizing the discrepancy between current and initial weight distributions at the layer level maintains curvature, outperforming naive L2 or regularization toward the initial parameter vector (Lewandowski et al., 2023).
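
The difference between the Wasserstein penalty and a plain L2 pull toward the initial weights can be seen in a small example. For two empirical distributions with the same number of samples in one dimension, the squared 2-Wasserstein distance reduces to the mean squared difference of the sorted values; the sketch below applies this to the entries of a single layer. Function names are illustrative, and how the penalty is weighted and summed across layers is left out.

```python
import numpy as np

def l2_to_init(W, W0):
    """Mean squared distance between each weight and its own initial value."""
    return float(np.mean((W - W0) ** 2))

def w2_sq_to_init(W, W0):
    """Squared 2-Wasserstein distance between the empirical weight distributions."""
    a = np.sort(W.ravel())
    b = np.sort(W0.ravel())
    return float(np.mean((a - b) ** 2))

W0 = np.array([[1.0, 2.0], [3.0, 4.0]])
W_perm = np.array([[4.0, 3.0], [2.0, 1.0]])  # same values, different positions

print(l2_to_init(W_perm, W0), w2_sq_to_init(W_perm, W0))
```

Because the penalty compares distributions rather than individual weights, parameters are free to move as long as the layer's overall weight distribution stays close to the one at initialization.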

4.4 Experience Replay and In-context Learning

A small replay buffer, when processed by a transformer architecture (in-context learning), suffices to maintain plasticity—even in the absence of architectural or optimizer modifications (Wang et al., 25 Mar 2025).

4.5 Regenerative Methods

Continuously regularizing toward the original initialization (“regenerative regularization”, “shrink and perturb”) curbs drift in parameter space and mitigates the accumulation of dead units and curvature collapse. These methods are robust across both on-policy and off-policy RL, often outperforming algorithm-aware resets (Juliani et al., 2024).
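
Both ideas can be written as one-line parameter updates. The sketch below assumes the common formulation of shrink and perturb (scale the weights down, then add small noise) and an L2 penalty toward the stored initial parameters; the coefficients are placeholders and the function names are hypothetical.

```python
import numpy as np

def shrink_and_perturb(theta, rng, shrink=0.8, noise_scale=0.01):
    """Scale parameters toward zero and add small Gaussian noise."""
    return shrink * theta + noise_scale * rng.normal(size=theta.shape)

def regen_penalty_grad(theta, theta0, strength=1e-2):
    """Gradient of strength * ||theta - theta0||^2, added to the task gradient."""
    return 2.0 * strength * (theta - theta0)

rng = np.random.default_rng(0)
theta0 = np.zeros(2)             # stored initial parameters (assumed zero here)
theta = np.array([1.0, -2.0])    # parameters after some training

perturbed = shrink_and_perturb(theta, rng)
pulled = theta - 0.5 * regen_penalty_grad(theta, theta0, strength=0.5)
print(perturbed, pulled)
```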

4.6 Churn-Reduction

Minimizing output variability on out-of-batch data (“churn”) by explicit regularization of the NTK structure improves plasticity and maintains learning in continual RL benchmarks (Tang et al., 31 May 2025).

4.7 Adaptive Restarts

Adaptive shrink-restore (ASR) uses label-flip statistics as an early warning of plasticity decline, triggering partial or full reinitialization based on empirical thresholds, effectively arresting long-term plasticity decay in domain adaptation (Wang et al., 2024).

5. Architectural Scaling, Environmental Design, and Open Challenges

Plasticity loss in LLMs scales sublinearly in parameter count; increasing model size delays but does not eliminate the phenomenon (Hernandez-Garcia et al., 23 Jun 2026). Similarly, simply increasing width or depth in RL or supervised models results in low-rank, “cloned” states that can trap dynamics (LoP manifolds), unless explicit symmetry-breaking is incorporated (Joudaki et al., 30 Sep 2025).

Gradual task changes, as opposed to abrupt switches, fundamentally mitigate plasticity loss, aligning empirical behavior with real-world, smoothly evolving data streams (Liu et al., 9 Feb 2026).

Empirical best practices for sustaining plasticity include architectural normalization, weight growth control, periodic unit/weight resets, experience-based replay, and explicit regularization of feature or Hessian rank. However, the interplay of mechanisms is subtle, and interventions effective in one regime (e.g., LayerNorm, Dropout) may exacerbate loss in others (e.g., small networks, layer-normed MLPs, stationary domains) (Lyle et al., 2024, Hernandez-Garcia et al., 31 Jul 2025).

Unified benchmarks and standardized metrics for plasticity loss and trainability across domains remain an open need (Klein et al., 2024). Additionally, a principled causal theory synthesizing the roles of curvature, representation collapse, non-stationarity, and parameter entrenchment is still in development.

6. Connections to Broader Problems and Future Directions

Loss of plasticity is tightly linked to core challenges in non-stationary and continual learning, including:

Training instabilities and scaling failures in deep RL and large transformer models.
Overestimation bias and shallowness of exploration in policy learning.
The stability–plasticity dilemma: balancing the preservation of prior knowledge (stability) with the ability to continuously acquire new information (plasticity).

Emerging research areas include:

Plasticity-preserving algorithms for ultra-large language models.
Task-adaptive modular architectures and implicit regularizers.
Experience schedule design and environmental curriculum engineering to minimize abrupt landscape changes.
Principles for detection and proactive restoration of plasticity in live, deployed systems.

A complete theoretical understanding and unified practical framework remain under active investigation, with current methodologies best viewed as a set of complementary, mechanistically targeted tools for sustaining lifelong adaptability in deep neural systems.