Fading memory as inductive bias in residual recurrent networks (2307.14823v2)
Abstract: Residual connections have been proposed as an architecture-based inductive bias that mitigates the problem of exploding and vanishing gradients and increases task performance in both feed-forward and recurrent networks (RNNs) trained with the backpropagation algorithm. Yet, little is known about how residual connections in RNNs influence their dynamics and fading memory properties. Here, we introduce weakly coupled residual recurrent networks (WCRNNs), in which residual connections result in well-defined Lyapunov exponents and allow the properties of fading memory to be studied. We investigate how the residual connections of WCRNNs influence their performance, network dynamics, and memory properties on a set of benchmark tasks. We show that several distinct forms of residual connections yield effective inductive biases that result in increased network expressivity. In particular, these are residual connections that (i) place network dynamics in the proximity of the edge of chaos, (ii) allow networks to capitalize on characteristic spectral properties of the data, and (iii) result in heterogeneous memory properties. In addition, we demonstrate how our results can be extended to non-linear residuals and introduce a weakly coupled residual initialization scheme that can be used for Elman RNNs.
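To make the architectural idea concrete, below is a minimal PyTorch sketch of one plausible weakly coupled residual RNN cell, assuming an update of the form h_t = R h_{t-1} + eps * tanh(W h_{t-1} + U x_t + b), where R is a fixed linear residual map and eps a small coupling constant. The class and parameter names (WeaklyCoupledResidualRNNCell, residual_gain, eps) and the choice of a scaled identity for R are illustrative assumptions, not the paper's exact parameterization.

```python
import torch
import torch.nn as nn


class WeaklyCoupledResidualRNNCell(nn.Module):
    """Illustrative weakly coupled residual RNN cell (assumed form, see lead-in).

    Hidden-state update:
        h_t = R @ h_{t-1} + eps * tanh(W @ h_{t-1} + U @ x_t + b)
    where R is a fixed residual map and eps keeps the nonlinear
    recurrent term weakly coupled relative to the residual term.
    """

    def __init__(self, input_size: int, hidden_size: int,
                 residual_gain: float = 1.0, eps: float = 0.01):
        super().__init__()
        self.hidden_size = hidden_size
        self.eps = eps
        # Fixed (non-trainable) residual connection: a scaled identity is the
        # simplest choice; other linear maps could be substituted here.
        self.register_buffer("R", residual_gain * torch.eye(hidden_size))
        # Weakly coupled recurrent and input weights (trainable).
        self.W = nn.Linear(hidden_size, hidden_size, bias=True)
        self.U = nn.Linear(input_size, hidden_size, bias=False)

    def forward(self, x_t: torch.Tensor, h_prev: torch.Tensor) -> torch.Tensor:
        residual = h_prev @ self.R.T
        update = torch.tanh(self.W(h_prev) + self.U(x_t))
        return residual + self.eps * update


# Usage: unroll the cell over a toy input sequence.
if __name__ == "__main__":
    cell = WeaklyCoupledResidualRNNCell(input_size=3, hidden_size=16)
    x = torch.randn(8, 50, 3)             # (batch, time, features)
    h = torch.zeros(8, cell.hidden_size)  # initial hidden state
    for t in range(x.shape[1]):
        h = cell(x[:, t], h)
    print(h.shape)                        # torch.Size([8, 16])
```

With residual_gain near 1 and a small eps, the Jacobian of the update is dominated by the fixed residual map, which is what makes quantities such as Lyapunov exponents well behaved in this regime; varying residual_gain is one simple way to probe the transition toward the edge of chaos described in the abstract.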