Disentangling the Causes of Plasticity Loss in Neural Networks (2402.18762v1)
Abstract: Underpinning the past decades of work on the design, initialization, and optimization of neural networks is a seemingly innocuous assumption: that the network is trained on a stationary data distribution. In settings where this assumption is violated, e.g. deep reinforcement learning, learning algorithms become unstable and brittle with respect to hyperparameters and even random seeds. One factor driving this instability is the loss of plasticity, meaning that updating the network's predictions in response to new information becomes more difficult as training progresses. While many recent works provide analyses and partial solutions to this phenomenon, a fundamental question remains unanswered: to what extent do known mechanisms of plasticity loss overlap, and how can mitigation strategies be combined to best maintain the trainability of a network? This paper addresses these questions, showing that loss of plasticity can be decomposed into multiple independent mechanisms and that, while intervening on any single mechanism is insufficient to avoid the loss of plasticity in all cases, intervening on multiple mechanisms in conjunction results in highly robust learning algorithms. We show that a combination of layer normalization and weight decay is highly effective at maintaining plasticity in a variety of synthetic nonstationary learning tasks, and further demonstrate its effectiveness on naturally arising nonstationarities, including reinforcement learning in the Arcade Learning Environment.
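The mitigation the abstract highlights, layer normalization combined with weight decay, can be illustrated with a minimal toy sketch. This is not the paper's code: all sizes, learning rates, and the synthetic nonstationarity (resampling random regression targets each "task") are illustrative assumptions. A one-hidden-layer MLP with layer norm is trained with decoupled weight decay across a sequence of tasks; if plasticity is maintained, the loss on each new task still drops after training on it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: fixed inputs, targets resampled per task.
N, D, H = 256, 8, 32
W1 = rng.normal(0.0, 1.0 / np.sqrt(D), (D, H))
w2 = rng.normal(0.0, 1.0 / np.sqrt(H), H)
X = rng.normal(size=(N, D))

def layer_norm(z, eps=1e-5):
    # Normalize each row to zero mean and unit variance (no affine params).
    return (z - z.mean(-1, keepdims=True)) / (z.std(-1, keepdims=True) + eps)

def ln_backward(z, grad, eps=1e-5):
    # Gradient of layer_norm w.r.t. its input z (eps treated as negligible).
    zhat = layer_norm(z, eps)
    sigma = z.std(-1, keepdims=True) + eps
    return (grad - grad.mean(-1, keepdims=True)
            - zhat * (grad * zhat).mean(-1, keepdims=True)) / sigma

def forward(X):
    z = X @ W1
    h = np.maximum(layer_norm(z), 0.0)  # LN before the ReLU nonlinearity
    return z, h, h @ w2

lr, wd = 1e-2, 1e-4  # wd is the second mitigation: decoupled weight decay
losses_before, losses_after = [], []
for task in range(5):
    y = rng.normal(size=N)  # fresh random targets: the task changes
    losses_before.append(float(np.mean((forward(X)[2] - y) ** 2)))
    for _ in range(500):
        z, h, pred = forward(X)
        err = pred - y
        grad_a = (err[:, None] * w2) * (h > 0)  # back through ReLU
        grad_z = ln_backward(z, grad_a)         # back through layer norm
        # Weight decay pulls parameters toward zero on every step,
        # preventing the unbounded norm growth that can freeze training.
        w2 -= lr * (h.T @ err / N + wd * w2)
        W1 -= lr * (X.T @ grad_z / N + wd * W1)
    losses_after.append(float(np.mean((forward(X)[2] - y) ** 2)))
```

With both interventions in place, the network keeps reducing its loss even on the final task of the sequence; removing layer norm or the decay term in this sketch is a simple way to probe each mechanism in isolation.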
Authors: Clare Lyle, Zeyu Zheng, Khimya Khetarpal, Hado van Hasselt, Razvan Pascanu, James Martens, Will Dabney