StableSSM: Alleviating the Curse of Memory in State-space Models through Stable Reparameterization

Published 24 Nov 2023 in cs.LG, cs.AI, cs.CL, and math.DS | arXiv:2311.14495v4

Abstract: In this paper, we investigate the long-term memory learning capabilities of state-space models (SSMs) from the perspective of parameterization. We prove that state-space models without any reparameterization exhibit a memory limitation similar to that of traditional RNNs: the target relationships that can be stably approximated by state-space models must have exponentially decaying memory. Our analysis identifies this "curse of memory" as a result of the recurrent weights converging to a stability boundary, suggesting that a reparameterization technique can be effective. To this end, we introduce a class of reparameterization techniques for SSMs that effectively lift their memory limitations. Besides improving approximation capabilities, we further illustrate that a principled choice of reparameterization scheme can also enhance optimization stability. We validate our findings on synthetic datasets, language models, and image classification tasks.
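To make the reparameterization idea concrete, here is a minimal PyTorch sketch of a diagonal state-space layer whose recurrent eigenvalues are produced through a stable map, here lambda = -1/w^2, so that every real value of the trainable parameter w yields a stable (decaying) recurrence. The layer shapes, the specific map, and the sequential scan are illustrative assumptions for exposition, not the authors' implementation.

```python
import torch
import torch.nn as nn


class StableDiagonalSSM(nn.Module):
    """Sketch of a diagonal SSM layer with a stable reparameterization.

    The recurrent eigenvalues are computed as lambda = -1/(w^2 + eps), so they
    have negative real part for any real w and the recurrence stays stable no
    matter how the optimizer updates w. Hyperparameters and shapes are
    illustrative assumptions, not a reference implementation.
    """

    def __init__(self, d_model: int, d_state: int, dt: float = 1.0):
        super().__init__()
        self.w = nn.Parameter(torch.randn(d_state))          # unconstrained parameter
        self.B = nn.Parameter(torch.randn(d_state, d_model) / d_model ** 0.5)
        self.C = nn.Parameter(torch.randn(d_model, d_state) / d_state ** 0.5)
        self.dt = dt

    def forward(self, u: torch.Tensor) -> torch.Tensor:
        # u: (batch, length, d_model)
        lam = -1.0 / (self.w ** 2 + 1e-6)     # stable map: Re(lambda) < 0 for all w
        a = torch.exp(self.dt * lam)          # discrete-time factor in (0, 1)
        batch, length, _ = u.shape
        h = torch.zeros(batch, a.shape[0], device=u.device, dtype=u.dtype)
        ys = []
        for t in range(length):
            # h_{t+1} = a * h_t + B u_t  (elementwise, since A is diagonal)
            h = a * h + u[:, t] @ self.B.T
            ys.append(h @ self.C.T)           # y_t = C h_t
        return torch.stack(ys, dim=1)         # (batch, length, d_model)
```

Swapping in another stable map (for example -exp(w) or a softplus-based variant) only changes the `lam` line; the point of the reparameterization is that gradient updates on w cannot push the eigenvalues across the stability boundary.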
