Stabilizing RNN Gradients through Pre-training
Abstract: Numerous theories of learning propose preventing the gradient from growing exponentially with depth or time, in order to stabilize and improve training. Typically, these analyses are conducted on feed-forward fully-connected neural networks or simple single-layer recurrent neural networks, given their mathematical tractability. In contrast, this study demonstrates that pre-training the network to local stability can be effective whenever the architecture is too complex for an analytical initialization. Furthermore, we extend known stability theories to encompass a broader family of deep recurrent networks, requiring minimal assumptions on data and parameter distributions, a theory we call the Local Stability Condition (LSC). Our investigation reveals that the classical Glorot, He, and Orthogonal initialization schemes satisfy the LSC when applied to feed-forward fully-connected neural networks. However, analyzing deep recurrent networks, we identify a new additive source of exponential explosion that emerges from counting gradient paths in a rectangular grid in depth and time. We propose a new approach to mitigate this issue, which consists of giving a weight of one half to the time and depth contributions to the gradient, instead of the classical weight of one. Our empirical results confirm that pre-training both feed-forward and recurrent networks, including differentiable, neuromorphic, and state-space models, to fulfill the LSC often results in improved final performance. This study contributes to the field by providing a means to stabilize networks of any complexity. Our approach can be implemented as an additional step before pre-training on large augmented datasets, and as an alternative to finding stable initializations analytically.
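The pre-training idea in the abstract lends itself to a simple recipe: measure how much a single depth step and a single time step of the network amplify perturbations, then adjust the weights until both amplification factors reach the LSC target, one half for deep recurrent networks (plausibly because the gradient splits across roughly binomial(L+T, L) ≤ 2^(L+T) paths through an L-by-T grid of layers and time steps, so halving each contribution offsets that additive growth). Below is a minimal PyTorch sketch of such a pre-training loop, not the authors' implementation: `SimpleCell`, `lsc_pretrain`, `avg_jacobian_gain`, and the Frobenius-norm proxy for the local Jacobian gain are all illustrative assumptions.

```python
# Hedged sketch (assumed names and loss, not the paper's exact procedure):
# pre-train a recurrent cell so its local transition Jacobians have a target gain rho
# (1.0 for purely feed-forward stacks, 0.5 for the depth and time contributions of
# deep recurrent networks, per the abstract) before regular training begins.
import torch
import torch.nn as nn

class SimpleCell(nn.Module):
    """Toy recurrent cell: h_t = tanh(W_in x_t + W_rec h_{t-1})."""
    def __init__(self, n_in, n_hid):
        super().__init__()
        self.w_in = nn.Linear(n_in, n_hid)
        self.w_rec = nn.Linear(n_hid, n_hid, bias=False)

    def forward(self, x_t, h_prev):
        return torch.tanh(self.w_in(x_t) + self.w_rec(h_prev))

def avg_jacobian_gain(output, inputs, n_probes=4):
    """Crude Monte-Carlo proxy for the typical singular value of d(output)/d(inputs):
    estimates the per-sample Frobenius norm of the Jacobian via vector-Jacobian
    products and rescales it by the output width."""
    batch, width = output.shape
    sq = 0.0
    for _ in range(n_probes):
        v = torch.randn_like(output)  # E||J^T v||^2 = ||J||_F^2 for v ~ N(0, I)
        (g,) = torch.autograd.grad(output, inputs, grad_outputs=v,
                                   retain_graph=True, create_graph=True)
        sq = sq + (g ** 2).sum() / n_probes
    return (sq / (batch * width)).sqrt()

def lsc_pretrain(cell, sample_batch, rho=0.5, steps=200, lr=1e-3):
    """Nudge the time (h_{t-1} -> h_t) and depth (x_t -> h_t) Jacobian gains toward rho."""
    opt = torch.optim.Adam(cell.parameters(), lr=lr)
    for _ in range(steps):
        x_t, h_prev = sample_batch()
        x_t.requires_grad_(True)
        h_prev.requires_grad_(True)
        h_t = cell(x_t, h_prev)
        loss = ((avg_jacobian_gain(h_t, h_prev) - rho) ** 2 +
                (avg_jacobian_gain(h_t, x_t) - rho) ** 2)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return cell

if __name__ == "__main__":
    torch.manual_seed(0)
    cell = SimpleCell(n_in=32, n_hid=64)
    batch = lambda: (torch.randn(16, 32), torch.randn(16, 64))
    lsc_pretrain(cell, batch, rho=0.5)  # afterwards, train normally on the real task
```

In practice such a loop would run on the target data (possibly augmented) before the main training loop, matching the abstract's framing of the method as an extra step ahead of ordinary pre-training, with `rho=1.0` when the network has no recurrence.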
- Sepp Hochreiter. Untersuchungen zu dynamischen neuronalen netzen. Diploma, Technische Universität München, 91(1), 1991.
- Learning long-term dependencies with gradient descent is difficult. IEEE transactions on neural networks, 5(2):157–166, 1994.
- Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
- Gradient flow in recurrent nets: the difficulty of learning long-term dependencies, 2001.
- Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 249–256. JMLR Workshop and Conference Proceedings, 2010.
- Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In Yoshua Bengio and Yann LeCun, editors, 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, 2014.
- Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE international conference on computer vision, pages 1026–1034, 2015.
- The Principles of Deep Learning Theory: An Effective Theory Approach to Understanding Neural Networks. Cambridge University Press, 2022.
- A scaling calculus for the design and initialization of relu networks. Neural Computing and Applications, pages 1–15, 2022.
- On the difficulty of training recurrent neural networks. In International conference on machine learning, pages 1310–1318. PMLR, 2013.
- Unitary evolution recurrent neural networks. In International conference on machine learning, pages 1120–1128. PMLR, 2016.
- Non-normal recurrent neural network (NNRNN): learning long time dependencies while improving expressivity with transient dynamics. Advances in neural information processing systems, 32, 2019.
- Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017.
- Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020.
- Transformer-XL: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860, 2019.
- Transformers are RNNs: Fast autoregressive transformers with linear attention. In International Conference on Machine Learning, pages 5156–5165. PMLR, 2020.
- Hyena hierarchy: Towards larger convolutional language models. arXiv preprint arXiv:2302.10866, 2023.
- RWKV: Reinventing RNNs for the transformer era. arXiv preprint arXiv:2305.13048, 2023.
- Analysis and control of nonlinear process systems, volume 13. Springer, 2004.
- Legendre memory units: Continuous-time representation in recurrent neural networks. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
- Efficiently modeling long sequences with structured state spaces. In The International Conference on Learning Representations (ICLR), 2022.
- Simplified state space layers for sequence modeling. In The Eleventh International Conference on Learning Representations, 2023.
- Resurrecting recurrent neural networks for long sequences. In Proceedings of the 40th International Conference on Machine Learning, ICML’23. JMLR.org, 2023.
- Towards the systematic reporting of the energy and carbon footprints of machine learning. Journal of Machine Learning Research, 21(248):1–43, 2020.
- Benchmarking keyword spotting efficiency on neuromorphic hardware. In Proceedings of the 7th Annual Neuro-inspired Computational Elements Workshop, pages 1–8, 2019.
- Advancing neuromorphic computing with loihi: A survey of results and outlook. Proceedings of the IEEE, 109(5):911–934, 2021.
- Louis Lapique. Recherches quantitatives sur l’excitation electrique des nerfs traitee comme une polarization. Journal of Physiology and Pathology, 9:620–635, 1907.
- Eugene M Izhikevich. Simple model of spiking neurons. IEEE Transactions on neural networks, 14(6):1569–1572, 2003.
- Gated feedback recurrent neural networks. In International conference on machine learning, pages 2067–2075. PMLR, 2015.
- Long short-term memory and learning-to-learn in networks of spiking neurons. In Advances in Neural Information Processing Systems, 2018.
- Convolutional networks for fast, energy-efficient neuromorphic computing. Proceedings of the national academy of sciences, 113(41):11441–11446, 2016.
- Superspike: Supervised learning in multilayer spiking neural networks. Neural computation, 30(6):1514–1541, 2018.
- Genrich Belitskii et al. Matrix norms and their applications, volume 36. Birkhäuser, 2013.
- Roger A. Horn and Charles R. Johnson. Matrix Analysis. Cambridge University Press, 1990.
- Edwin T Jaynes. Prior probabilities. IEEE Transactions on systems science and cybernetics, 4(3):227–241, 1968.
- Random matrices: Universality of ESDs and the circular law. The Annals of Probability, 38(5):2023–2065, 2010.
- Orthogonal deep neural networks. IEEE transactions on pattern analysis and machine intelligence, 43(4):1352–1368, 2019.
- Enhancing the trainability and expressivity of deep MLPs with globally orthogonal initialization. In The Symbiosis of Deep Learning and Differential Equations. NeurIPS Workshop, 2021.
- Implicit neural representations with periodic activation functions. Advances in neural information processing systems, 33:7462–7473, 2020.
- Accurate and efficient time-domain classification with adaptive spiking recurrent neural networks. Nature Machine Intelligence, 3(10):905–913, 2021.
- Neuronal dynamics: From single neurons to networks and models of cognition. Cambridge University Press, 2014.
- Spike frequency adaptation supports network computations on temporally dispersed information. eLife, 10:e65459, jul 2021.
- A simple way to initialize recurrent networks of rectified linear units. ArXiv, abs/1504.00941, 2015.
- Independently recurrent neural network (IndRNN): Building a longer and deeper RNN. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5457–5466, 2018.
- Recurrent orthogonal networks and long-memory tasks. In International Conference on Machine Learning, pages 2034–2042. PMLR, 2016.
- Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
- Learning multiple layers of features from tiny images. Technical Report 0, University of Toronto, Toronto, Ontario, 2009.
- The remarkable robustness of surrogate gradient learning for instilling complex function in spiking neural networks. Neural Computation, 33(4):899–925, 2021.
- The Heidelberg spiking data sets for the systematic evaluation of spiking neural networks. IEEE Transactions on Neural Networks and Learning Systems, 2020.
- Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330, 1993.
- Adabelief optimizer: Adapting stepsizes by the belief in observed gradients. Advances in Neural Information Processing Systems, 33, 2020.
- Improving language understanding by generative pre-training. OpenAI blog, 2018.
- Lookahead optimizer: k steps forward, 1 step back. Advances in neural information processing systems, 32, 2019.
- Perplexity—a measure of the difficulty of speech recognition tasks. The Journal of the Acoustical Society of America, 62(S1):S63–S63, 1977.
- A solution to the learning dilemma for recurrent networks of spiking neurons. Nature communications, 11(1):1–15, 2020.
- Gradients are not all you need. arXiv preprint arXiv:2111.05803, 2021.
- Sergei Natanovich Bernshtein. Sur la loi des grands nombres. Communications de la Société mathématique de Kharkow, 16(1):82–87, 1918.
- Valerii V Kozlov. Weighted averages, uniform distribution, and strict ergodicity. Russian Mathematical Surveys, 60(6):1121, 2005.
- Theophilos Cacoullos. Exercises in probability. Springer Science & Business Media, 2012.
- Matrix computations. JHU press, 2013.
- Augustin Louis Cauchy. Cours d’analyse de l’Ecole royale polytechnique; par m. Augustin-Louis Cauchy… 1. re partie. Analyse algébrique. de l’Imprimerie royale, 1821.
- Four unit mathematics, 1993.
- Johan Ludwig William Valdemar Jensen. Sur les fonctions convexes et les inégalités entre les valeurs moyennes. Acta mathematica, 30(1):175–193, 1906.
- Simon Foucart. Drexel University, Math 504 - Linear Algebra and Matrix Analysis. Lecture 6: Matrix Norms and Spectral Radii, Fall 2012.
- Eric Kostlan. On the spectra of Gaussian matrices. Linear algebra and its applications, 162:385–388, 1992.
- Guillaume Dubach. Powers of Ginibre eigenvalues. Electronic Journal of Probability, 23:1–31, 2018.
- Michael Z Spivey. The art of proving binomial identities. CRC Press, 2019.
- Irena Penev. Charles University (IUUK) Combinatorics and Graph Theory 1. Lecture 1: Estimates of factorials and binomial coefficients, Fall 2022.
- Heinrich Dörrie. 100 great problems of elementary mathematics. Courier Corporation, 2013.
- Norm and anti-norm inequalities for positive semi-definite matrices. International Journal of Mathematics, 22(08):1121–1138, 2011.
- Interpolating log-determinant and trace of the powers of matrix A+ tB. Statistics and Computing, 32(6):1–18, 2022.
- Anti-norms on finite von Neumann algebras. Publications of the Research Institute for Mathematical Sciences, 51(2):207–235, 2015.
- An antinorm theory for sets of matrices: bounds and approximations to the lower spectral radius. Linear Algebra and its Applications, 607:89–117, 2020.
- Michel Loeve. Elementary probability theory. In Probability theory i, pages 1–52. Springer, 1977.
- Deep learning incorporating biologically inspired neural dynamics and in-memory computing. Nature Machine Intelligence, 2(6):325–336, 2020.
- Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415, 2016.
- Searching for activation functions. arXiv preprint arXiv:1710.05941, 2017.
- Adaptive exponential integrate-and-fire model as an effective description of neuronal activity. Journal of neurophysiology, 94(5):3637–3642, 2005.
- Neuronal spike-rate adaptation supports working memory in language processing. Proceedings of the National Academy of Sciences, 117(34):20881–20889, 2020.
- Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
- Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research, 15(1):1929–1958, 2014.
- Language modeling with gated convolutional networks. In International conference on machine learning, pages 933–941. PMLR, 2017.
- Averaging weights leads to wider optima and better generalization. In Conference on Uncertainty in Artificial Intelligence, 2018.
- BERT: pre-training of deep bidirectional transformers for language understanding. In Jill Burstein, Christy Doran, and Thamar Solorio, editors, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 4171–4186. Association for Computational Linguistics, 2019.
- Array programming with NumPy. Nature, 585(7825):357–362, September 2020.