Robustifying State-space Models for Long Sequences via Approximate Diagonalization (2310.01698v1)
Abstract: State-space models (SSMs) have recently emerged as a framework for learning long-range sequence tasks. An example is the structured state-space sequence (S4) layer, which uses the diagonal-plus-low-rank structure of the HiPPO initialization framework. However, the complicated structure of the S4 layer poses challenges; to address these, models such as S4D and S5 have considered a purely diagonal structure. This choice simplifies the implementation, improves computational efficiency, and allows channel communication. However, diagonalizing the HiPPO framework is itself an ill-posed problem. In this paper, we propose a general solution for this and related ill-posed diagonalization problems in machine learning. We introduce a generic, backward-stable "perturb-then-diagonalize" (PTD) methodology, which is based on the pseudospectral theory of non-normal operators and which may be interpreted as the approximate diagonalization of the non-normal matrices defining SSMs. Based on this, we introduce the S4-PTD and S5-PTD models. Through theoretical analysis of the transfer functions of different initialization schemes, we demonstrate that the S4-PTD/S5-PTD initialization strongly converges to the HiPPO framework, while the S4D/S5 initialization only achieves weak convergence. As a result, our new models show resilience to Fourier-mode noise-perturbed inputs, a crucial property not achieved by the S4D/S5 models. In addition to improved robustness, our S5-PTD model averages 87.6% accuracy on the Long-Range Arena benchmark, demonstrating that the PTD methodology helps to improve the accuracy of deep learning models.
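To make the "perturb-then-diagonalize" idea concrete, the sketch below perturbs the non-normal HiPPO-LegS state matrix by a small random matrix and then computes an ordinary eigendecomposition of the perturbed matrix. This is only an illustration of the general recipe under stated assumptions, not the authors' released implementation: the function names (`hippo_legs`, `perturb_then_diagonalize`), the Gaussian perturbation, and the perturbation size `eps` are illustrative choices.

```python
# Minimal NumPy sketch of a perturb-then-diagonalize (PTD) style approximate
# diagonalization. Assumption: a small random perturbation of spectral norm eps
# is added to the non-normal matrix before calling a standard eigensolver.
import numpy as np


def hippo_legs(n: int) -> np.ndarray:
    """HiPPO-LegS state matrix (negated so the continuous-time system is stable)."""
    A = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i > j:
                A[i, j] = np.sqrt((2 * i + 1) * (2 * j + 1))
            elif i == j:
                A[i, j] = i + 1
    return -A


def perturb_then_diagonalize(A: np.ndarray, eps: float = 1e-4, seed: int = 0):
    """Approximately diagonalize a non-normal matrix A.

    Adds a complex Gaussian perturbation scaled to spectral norm eps, then
    eigendecomposes the perturbed matrix, so that V @ diag(w) @ inv(V) is a
    backward-stable approximation of A.
    """
    rng = np.random.default_rng(seed)
    E = rng.standard_normal(A.shape) + 1j * rng.standard_normal(A.shape)
    E *= eps / np.linalg.norm(E, 2)      # scale the perturbation to norm eps
    w, V = np.linalg.eig(A + E)          # diagonalize the perturbed matrix
    return w, V


if __name__ == "__main__":
    A = hippo_legs(64)
    w, V = perturb_then_diagonalize(A, eps=1e-4)
    A_hat = (V * w) @ np.linalg.inv(V)   # reconstruct A from the eigenpairs
    print("relative backward error:",
          np.linalg.norm(A_hat - A, 2) / np.linalg.norm(A, 2))
    print("eigenvector condition number:", np.linalg.cond(V))
```

The design rationale, in the spirit of the pseudospectral theory cited in the abstract, is that a tiny random perturbation of a non-normal matrix yields, with high probability, an eigenvector basis with a much smaller condition number, so the resulting diagonalization is backward stable even though diagonalizing the original matrix directly is ill-posed.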
- Annan Yu
- Arnur Nigmetov
- Dmitriy Morozov
- Michael W. Mahoney
- N. Benjamin Erichson