
Disentangling the Causes of Plasticity Loss in Neural Networks (2402.18762v1)

Published 29 Feb 2024 in cs.LG

Abstract: Underpinning the past decades of work on the design, initialization, and optimization of neural networks is a seemingly innocuous assumption: that the network is trained on a stationary data distribution. In settings where this assumption is violated, e.g. deep reinforcement learning, learning algorithms become unstable and brittle with respect to hyperparameters and even random seeds. One factor driving this instability is the loss of plasticity, meaning that updating the network's predictions in response to new information becomes more difficult as training progresses. While many recent works provide analyses and partial solutions to this phenomenon, a fundamental question remains unanswered: to what extent do known mechanisms of plasticity loss overlap, and how can mitigation strategies be combined to best maintain the trainability of a network? This paper addresses these questions, showing that loss of plasticity can be decomposed into multiple independent mechanisms and that, while intervening on any single mechanism is insufficient to avoid the loss of plasticity in all cases, intervening on multiple mechanisms in conjunction results in highly robust learning algorithms. We show that a combination of layer normalization and weight decay is highly effective at maintaining plasticity in a variety of synthetic nonstationary learning tasks, and further demonstrate its effectiveness on naturally arising nonstationarities, including reinforcement learning in the Arcade Learning Environment.

Authors (7)
  1. Clare Lyle (36 papers)
  2. Zeyu Zheng (60 papers)
  3. Khimya Khetarpal (25 papers)
  4. Hado van Hasselt (57 papers)
  5. Razvan Pascanu (138 papers)
  6. James Martens (20 papers)
  7. Will Dabney (53 papers)
Citations (24)

Summary

Disentangling the Causes of Plasticity Loss in Neural Networks: Insights and Mitigations

Understanding Plasticity Loss in Neural Networks

Plasticity loss, the phenomenon whereby a neural network's ability to update its predictions in response to new information diminishes over training, poses a significant challenge to keeping models trainable and adaptable, especially under nonstationary conditions. This paper provides a comprehensive analysis of the causes of plasticity loss and shows that a combination of layer normalization and weight decay mitigates it effectively.

Exploring the Causes

The investigation begins by identifying distinct mechanisms contributing to plasticity loss. These include:

  • Preactivation Distribution Shift: Changes in the distribution of inputs to the activation functions can produce dead units (which never activate) and zombie units (which always activate, behaving essentially linearly and losing their nonlinearity), reducing the network's effective capacity.
  • Parameter Norm Growth: Unchecked growth in the magnitude of the weights alters the sensitivity of the network's outputs and can make optimization harder.
  • Regression Target Magnitude: In settings such as reinforcement learning, large-magnitude regression targets can create optimization difficulties that impair the network's ability to learn.
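The preactivation-shift mechanism can be made concrete with a small diagnostic. The sketch below (function name and thresholds are illustrative choices, not the paper's exact criteria) classifies ReLU units as dead or zombie from a batch of preactivations:

```python
import numpy as np

def unit_activity_stats(preacts, eps=1e-6):
    """Classify ReLU units from a batch of preactivations.

    preacts: array of shape (batch, units) holding one layer's preactivations.
    A unit counts as 'dead' if it is positive on (almost) no inputs, so the
    ReLU output is constantly zero, and as 'zombie' if it is positive on
    (almost) all inputs, so the ReLU acts as the identity and contributes
    no nonlinearity. Returns the fraction of units in each class.
    """
    frac_positive = (preacts > 0).mean(axis=0)   # per-unit firing rate
    dead = frac_positive <= eps
    zombie = frac_positive >= 1.0 - eps
    return dead.mean(), zombie.mean()

rng = np.random.default_rng(0)
pre = rng.standard_normal((512, 64))
pre[:, :8] = -np.abs(pre[:, :8])           # force 8 units to never fire
pre[:, 8:12] = np.abs(pre[:, 8:12]) + 0.1  # force 4 units to always fire
dead_frac, zombie_frac = unit_activity_stats(pre)
# dead_frac == 8/64, zombie_frac == 4/64
```

Tracking these fractions over a nonstationary training run is one simple way to observe the effective-capacity loss the paper describes.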

Mechanisms and Mitigations

The paper's analysis reveals how each identified mechanism independently contributes to plasticity loss and discusses targeted mitigation strategies. For instance:

  • Implementing layer normalization can counteract the adverse effects of preactivation distribution shifts by ensuring activations remain within a functional range.
  • Applying weight decay (L2 regularization) controls parameter norm growth, preventing extreme weight magnitudes that could otherwise hamper learning.
  • Reformulating large-magnitude regression targets, e.g. via the 'two-hot' trick or distributional losses used in reinforcement learning setups, mitigates the optimization difficulties that large target values create.
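The two-hot idea can be sketched concretely: a scalar target is encoded as a probability vector that splits its mass between the two neighbouring bin centres, so the network can be trained with a cross-entropy loss instead of squared error on a possibly large-magnitude scalar. The bin layout and clipping below are illustrative assumptions, not the paper's exact setup:

```python
import numpy as np

def two_hot(target, bins):
    """Encode a scalar target as a 'two-hot' distribution over bin centres.

    The target's probability mass is split linearly between the two
    adjacent bins, so the expected bin value recovers the target exactly
    (for targets inside the bin range).
    """
    target = float(np.clip(target, bins[0], bins[-1]))
    upper = int(np.searchsorted(bins, target, side="left"))
    probs = np.zeros(len(bins))
    if upper == 0:                 # target sits exactly on the lowest bin
        probs[0] = 1.0
        return probs
    lower = upper - 1
    w_upper = (target - bins[lower]) / (bins[upper] - bins[lower])
    probs[lower] = 1.0 - w_upper
    probs[upper] = w_upper
    return probs

bins = np.linspace(-1.0, 1.0, 5)   # bin centres: -1, -0.5, 0, 0.5, 1
p = two_hot(0.25, bins)            # mass split evenly between 0.0 and 0.5
# p sums to 1 and p @ bins == 0.25
```

Training against such encodings keeps the loss scale roughly independent of the raw target magnitude, which is the point of the intervention.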

Combining Interventions for Additive Benefits

A pivotal contribution of this work is the development of a 'Swiss cheese model' of mitigation strategies. By targeting the independent mechanisms of plasticity loss concurrently, the paper demonstrates how a combined intervention approach can significantly enhance the robustness and adaptability of learning algorithms. Empirical results across various nonstationary learning tasks—including synthetic benchmarks, reinforcement learning environments, and natural distribution shifts—underscore the effectiveness of layer normalization coupled with L2 regularization in preserving network plasticity.
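The two main ingredients of the combined intervention can be written down in a few lines. The NumPy sketch below (function names and hyperparameters are illustrative, not the paper's implementation) shows layer normalization together with a decoupled weight-decay update:

```python
import numpy as np

def layer_norm(x, gain, bias, eps=1e-5):
    """Normalize each example's features to zero mean and unit variance,
    then apply a learned affine transform. This pins preactivation
    statistics in place even as the incoming weights drift, countering
    the preactivation-distribution-shift mechanism."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gain * (x - mu) / np.sqrt(var + eps) + bias

def sgd_step(w, grad, lr=1e-2, weight_decay=1e-2):
    """One SGD step with decoupled weight decay: the decay term shrinks
    the weights toward zero every step, bounding parameter norm growth
    over long nonstationary training runs."""
    return w - lr * grad - lr * weight_decay * w

x = np.random.default_rng(1).standard_normal((4, 16))
h = layer_norm(x, gain=np.ones(16), bias=np.zeros(16))
# each row of h has mean ~0 and variance ~1, regardless of the scale of x
w_new = sgd_step(np.ones(3), np.zeros(3))
# with zero gradient, weights still decay: 1 - lr * weight_decay
```

Because each component targets a different failure mechanism, their benefits compose rather than overlap, which is exactly the additive structure the Swiss cheese model predicts.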

Implications and Future Directions

The findings have important implications for the design and optimization of neural networks, especially in settings where learning under nonstationary conditions is essential. The proposed mitigation framework lays a foundation for future work on strengthening model resilience to plasticity loss. Promising directions include refining norm-control strategies to balance the trade-off between maintaining plasticity and preserving convergence speed, and investigating additional independent mechanisms that may contribute to plasticity loss.

Acknowledgments and Collaborative Efforts

This research benefitted from discussions and feedback from colleagues at Google DeepMind, showcasing the collaborative spirit within the AI research community.