- The paper introduces Sample Weight Decay to mitigate gradient signal decay and restore neural plasticity in deep RL.
- It rigorously connects plasticity loss to NTK rank collapse and gradient attenuation caused by non-stationarity.
- Empirical results show 13.7%-30.1% performance gains on benchmarks, validating SWD's effectiveness across varied environments.
Theoretical and Algorithmic Advances in Mitigating Plasticity Loss in Deep Reinforcement Learning
Introduction
Plasticity loss—the progressive decline in a neural network's adaptation capability during reinforcement learning (RL)—is a central obstacle to efficient long-horizon RL. This paper presents a rigorous theoretical framework connecting plasticity loss to optimization dynamics and non-stationarity, and introduces Sample Weight Decay (SWD), a lightweight, practical method to mitigate gradient attenuation and restore network plasticity during RL. The method is validated across TD3, Double DQN, and SAC on MuJoCo, Arcade Learning Environment (ALE), and DeepMind Control Suite (DMC), demonstrating robust and state-of-the-art improvements (2604.01913).
Theoretical Characterization of Plasticity Loss
The analysis formalizes two mechanisms underpinning plasticity loss in deep RL:
- Rank Collapse of Neural Tangent Kernel (NTK): Non-stationary training induces rank deficiency in the NTK Gram matrix, impeding the network's capacity to fit new data. While prior empirical works hypothesized episodic NTK collapse, this paper establishes a rigorous link between sequential loss function restarts and the decay of NTK rank.
- Gradient Attenuation: The gradient of the RL loss function decays at the rate O(1/k) with the number of training iterations k, reducing the optimizer's ability to escape saddle points and diminishing the effectiveness of parameter updates. This effect is fundamentally tied to the non-stationary replay buffer and target shift from bootstrapping, persisting regardless of network capacity or initialization.
Both effects are shown to be jointly necessary and sufficient for the observed decline in an RL agent's adaptability and learning over extended training.
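The gradient attenuation result can be restated compactly; the symbols below (loss \(\mathcal{L}_k\), parameters \(\theta_k\), constant \(C\)) are illustrative notation, not necessarily the paper's own:

```latex
% Hedged restatement of the O(1/k) gradient decay; C and \theta_k are illustrative.
\|\nabla_\theta \mathcal{L}_k(\theta_k)\| = O\!\left(\frac{1}{k}\right),
\quad\text{i.e.}\quad
\exists\, C > 0 \;\text{such that}\;
\|\nabla_\theta \mathcal{L}_k(\theta_k)\| \le \frac{C}{k}
\;\text{for all sufficiently large } k .
```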
Sample Weight Decay (SWD): Methodology
SWD targets gradient signal decay, an issue orthogonal to NTK-aware strategies that emphasize architecture- or reset-based interventions. The algorithm applies a temporally decaying weight to each sample in the experience buffer: recent samples receive higher probability during batch sampling, proportional to w_i = max(w_min, 1 - age_i / T), where age_i is the time elapsed since sample i was collected, T is the decay horizon, and w_min is a floor that prevents old samples from being discarded entirely. This strategy compensates for the gradient attenuation induced by repeated updates on stale data, restoring the effective batch gradient magnitude and sustaining plasticity over time.
Notably, SWD is algorithm-agnostic, incurs minimal computational overhead, and is compatible with established methods such as NTK-reset and Plasticity Injection.
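A minimal sketch of how the SWD weighting and sampling step could be implemented, assuming a flat replay buffer indexed by insertion step; the function names (`swd_weights`, `sample_batch`) and the normalization of weights into a sampling distribution are illustrative, not the paper's reference implementation:

```python
import numpy as np

def swd_weights(ages, T, w_min):
    """Linearly decaying sample weights: w_i = max(w_min, 1 - age_i / T)."""
    return np.maximum(w_min, 1.0 - np.asarray(ages, dtype=np.float64) / T)

def sample_batch(buffer_size, current_step, insert_steps, batch_size,
                 T=100_000, w_min=0.01, rng=None):
    """Draw a batch with probability proportional to each sample's SWD weight."""
    rng = np.random.default_rng() if rng is None else rng
    ages = current_step - np.asarray(insert_steps[:buffer_size])
    w = swd_weights(ages, T, w_min)
    p = w / w.sum()  # normalize weights into a sampling distribution
    return rng.choice(buffer_size, size=batch_size, replace=True, p=p)

# Toy usage: 5 transitions inserted at steps 0..4, sampled at step 4.
idx = sample_batch(buffer_size=5, current_step=4,
                   insert_steps=[0, 1, 2, 3, 4], batch_size=3, T=4, w_min=0.1)
```

Because the weight is a deterministic function of age, no per-sample bookkeeping is needed beyond the insertion step already stored by most replay buffers.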
Empirical Evaluation
Experiments systematically assess SWD's efficacy across continuous and discrete control benchmarks:
- Performance Gains: SWD yields 13.7%-30.1% improvement in Interquartile Mean (IQM) scores and faster convergence relative to uniform sampling and Prioritized Experience Replay (PER). In MuJoCo's Ant and Humanoid, as well as DMC's Humanoid tasks, SWD achieves SOTA scores with consistently higher sample efficiency.
- Plasticity Metrics: Using the GraMa metric, SWD maintains non-sparse, high-magnitude gradients throughout training, directly correlating with enhanced network adaptation. In contrast, control methods using older data exhibit increased gradient sparsity, validating the core theoretical predictions.
- Ablative Analysis: A reverse ablation—Sample Weight Augmentation (SWA), which preferentially samples older data—results in severe gradient decay, confirming temporal weighting as a decisive factor. Linear decay outperforms exponential or polynomial alternatives, supporting the theory-driven design choice.
- Hyperparameter Robustness: SWD is insensitive to decay horizon and minimum weight within practical ranges, and a bucketed implementation achieves near-zero computational cost without loss of performance.
- Synergistic Compatibility: SWD remains orthogonal and complementary to NTK-based methods such as S&P, ReGraMa, and Plasticity Injection, enabling compositional use and combined SOTA performance.
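One way the near-zero-overhead bucketed implementation mentioned above could work (a sketch under assumptions; the two-stage draw and bucket granularity are illustrative, not the paper's exact scheme): group transitions into coarse age buckets, assign each bucket a single decayed weight, then sample a bucket and a transition uniformly within it, avoiding per-sample weight recomputation:

```python
import numpy as np

def bucketed_swd_sample(bucket_members, bucket_ages, batch_size,
                        T=100_000, w_min=0.01, rng=None):
    """Two-stage SWD sampling: weight age buckets, then sample uniformly within.

    bucket_members: list of index arrays, one per age bucket.
    bucket_ages:    representative age of each bucket (e.g. its midpoint).
    """
    rng = np.random.default_rng() if rng is None else rng
    w = np.maximum(w_min, 1.0 - np.asarray(bucket_ages, dtype=float) / T)
    # Scale each bucket's weight by its size so every member keeps the
    # intended marginal sampling probability.
    mass = w * np.array([len(m) for m in bucket_members])
    p = mass / mass.sum()
    picks = rng.choice(len(bucket_members), size=batch_size, p=p)
    return np.array([rng.choice(bucket_members[b]) for b in picks])
```

With a fixed number of buckets, the per-batch cost is O(#buckets + batch size) rather than O(buffer size), which is consistent with the near-zero overhead reported.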
Implications and Future Directions
This work situates plasticity loss as a gradient-centric phenomenon arising from distributional non-stationarity and experience replay design, clarifying long-standing empirical observations in deep RL. SWD's simplicity and efficacy allow principled mitigation without disruptive architectural interventions.
Theoretically, the framework unifies the field's understanding of continual RL learning capacity, connects it to fundamental kernel dynamics, and exposes the limitations of approaches focused solely on early-stage network resets.
Practically, SWD extends the operational lifetime and performance of deep RL agents, especially for long-horizon, high-update-to-data ratio regimes. This is critical for real-world applications in robotics, autonomous navigation, and continual learning systems.
Future research should extend SWD to distributed RL, lifelong/continual-learning benchmarks, and more complex state-action spaces, and should explore adaptive or environment-aware temporal decay functions. A deeper analysis linking SWD to representational drift and network sparsity in large-scale RL also remains open.
Conclusion
This paper establishes a rigorous theoretical foundation linking plasticity loss to gradient and NTK dynamics in deep RL and proposes Sample Weight Decay—an efficient, general-purpose sampling strategy. SWD restores gradient signal strength, preserves neural plasticity, and yields consistent SOTA empirical gains. Its orthogonality to prior interventions and computational efficiency position SWD as a practical mechanism for enhancing the adaptability and performance of RL agents across a broad spectrum of environments and configurations (2604.01913).