StableSSM: Alleviating the Curse of Memory in State-space Models through Stable Reparameterization
Abstract: In this paper, we investigate the long-term memory learning capabilities of state-space models (SSMs) from the perspective of parameterization. We prove that state-space models without any reparameterization exhibit a memory limitation similar to that of traditional RNNs: the target relationships that can be stably approximated by state-space models must have exponentially decaying memory. Our analysis identifies this "curse of memory" as a result of the recurrent weights converging to a stability boundary, suggesting that a reparameterization technique can be effective. To this end, we introduce a class of reparameterization techniques for SSMs that effectively lift their memory limitations. Besides improving approximation capabilities, we further illustrate that a principled choice of reparameterization scheme can also enhance optimization stability. We validate our findings using synthetic datasets, language models, and image classification.
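To make the idea concrete, below is a minimal sketch of a stable reparameterization for a diagonal linear SSM layer. The class name, the exponential map f(w) = -exp(w), and the Euler-style discretization are illustrative assumptions, not the paper's exact construction; the paper studies a family of such reparameterizations.

```python
import torch
import torch.nn as nn


class StableDiagonalSSM(nn.Module):
    """Diagonal linear state-space layer with a stable reparameterization (sketch).

    Instead of learning the recurrent eigenvalues directly (which can drift
    toward the stability boundary during training), we learn an unconstrained
    weight w and map it through f(w) = -exp(w), so the continuous-time
    eigenvalue is always strictly negative and the discretized recurrent
    weight stays inside the unit circle.
    """

    def __init__(self, d_state: int, d_model: int, dt: float = 0.01):
        super().__init__()
        self.w = nn.Parameter(torch.randn(d_state))                      # unconstrained parameter
        self.B = nn.Parameter(torch.randn(d_state, d_model) / d_model ** 0.5)  # input map
        self.C = nn.Parameter(torch.randn(d_model, d_state) / d_state ** 0.5)  # readout map
        self.dt = dt

    def forward(self, u: torch.Tensor) -> torch.Tensor:
        # u: (batch, length, d_model)
        lam = -torch.exp(self.w)              # reparameterized eigenvalue, always < 0
        a_bar = torch.exp(self.dt * lam)      # discrete recurrent weight, always in (0, 1)
        batch, length, _ = u.shape
        h = u.new_zeros(batch, lam.shape[0])
        outputs = []
        for t in range(length):
            h = a_bar * h + self.dt * (u[:, t] @ self.B.T)   # stable linear recurrence
            outputs.append(h @ self.C.T)
        return torch.stack(outputs, dim=1)
```

Because the trainable parameter enters the recurrence only through f(w), gradient updates cannot push the recurrent weight past the stability boundary, which is the mechanism the abstract attributes to stable reparameterization; the specific choice of f affects both approximation capacity and optimization stability in the paper's analysis.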