
Robustifying State-space Models for Long Sequences via Approximate Diagonalization (2310.01698v1)

Published 2 Oct 2023 in cs.LG and stat.ML

Abstract: State-space models (SSMs) have recently emerged as a framework for learning long-range sequence tasks. An example is the structured state-space sequence (S4) layer, which uses the diagonal-plus-low-rank structure of the HiPPO initialization framework. However, the complicated structure of the S4 layer poses challenges; to address these challenges, models such as S4D and S5 have considered a purely diagonal structure. This choice simplifies the implementation, improves computational efficiency, and allows channel communication. However, diagonalizing the HiPPO framework is itself an ill-posed problem. In this paper, we propose a general solution for this and related ill-posed diagonalization problems in machine learning. We introduce a generic, backward-stable "perturb-then-diagonalize" (PTD) methodology, which is based on the pseudospectral theory of non-normal operators, and which may be interpreted as the approximate diagonalization of the non-normal matrices defining SSMs. Based on this, we introduce the S4-PTD and S5-PTD models. Through theoretical analysis of the transfer functions of different initialization schemes, we demonstrate that the S4-PTD/S5-PTD initialization strongly converges to the HiPPO framework, while the S4D/S5 initialization achieves only weak convergence. As a result, our new models show resilience to Fourier-mode noise-perturbed inputs, a crucial property not achieved by the S4D/S5 models. In addition to improved robustness, our S5-PTD model averages 87.6% accuracy on the Long-Range Arena benchmark, demonstrating that the PTD methodology helps to improve the accuracy of deep learning models.
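
The "perturb-then-diagonalize" recipe described in the abstract can be sketched concretely: build the (highly non-normal) HiPPO-LegS state matrix, add a small random perturbation, and diagonalize the perturbed matrix, which is well-conditioned even though diagonalizing the unperturbed matrix is ill-posed. The sketch below is a minimal NumPy illustration under assumptions: the Gaussian perturbation direction, the scale `eps`, and the state size `n` are illustrative choices, not the exact construction used in the S4-PTD/S5-PTD models.

```python
import numpy as np

def hippo_legs(n):
    """HiPPO-LegS state matrix (negated, as used in SSM layers); strongly non-normal."""
    A = np.zeros((n, n))
    for i in range(n):
        for k in range(n):
            if i > k:
                A[i, k] = -np.sqrt((2 * i + 1) * (2 * k + 1))
            elif i == k:
                A[i, k] = -(i + 1)
    return A

def perturb_then_diagonalize(A, eps=1e-4, seed=0):
    """Approximately diagonalize a non-normal matrix by perturbing it first.

    A small random perturbation regularizes the eigenvector basis, so the
    eigendecomposition of A + eps*E is a backward-stable surrogate for the
    ill-posed diagonalization of A itself. (Illustrative sketch, not the
    paper's exact perturbation.)
    """
    rng = np.random.default_rng(seed)
    E = rng.standard_normal(A.shape)
    E /= np.linalg.norm(E, 2)            # unit spectral-norm perturbation direction
    eigvals, V = np.linalg.eig(A + eps * E)
    return eigvals, V

n = 64
A = hippo_legs(n)
eigvals, V = perturb_then_diagonalize(A, eps=1e-4)

# Diagnostics: conditioning of the eigenvector basis and how far the
# approximate diagonalization is from the original matrix.
print("cond(V) =", np.linalg.cond(V))
print("||V diag(w) V^-1 - A||_2 =",
      np.linalg.norm(V @ np.diag(eigvals) @ np.linalg.inv(V) - A, 2))
```

By construction the reconstruction error is of order `eps` (we diagonalized A + eps*E, not A), while the eigenvector matrix is typically far better conditioned than for the unperturbed HiPPO matrix; this trade-off is the backward-stability property the PTD methodology exploits.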

Authors (5)
  1. Annan Yu (12 papers)
  2. Arnur Nigmetov (9 papers)
  3. Dmitriy Morozov (29 papers)
  4. Michael W. Mahoney (233 papers)
  5. N. Benjamin Erichson (45 papers)
Citations (6)
