The Expressive Capacity of State Space Models: A Formal Language Perspective (2405.17394v2)

Published 27 May 2024 in cs.CL, cs.FL, and cs.LG

Abstract: Recently, recurrent models based on linear state space models (SSMs) have shown promising performance in language modeling (LM), competitive with transformers. However, there is little understanding of the in-principle abilities of such models, which could provide useful guidance to the search for better LM architectures. We present a comprehensive theoretical study of the capacity of such SSMs as it compares to that of transformers and traditional RNNs. We find that SSMs and transformers have overlapping but distinct strengths. In star-free state tracking, SSMs implement straightforward and exact solutions to problems that transformers struggle to represent exactly. They can also model bounded hierarchical structure with optimal memory even without simulating a stack. On the other hand, we identify a design choice in current SSMs that limits their expressive power. We discuss implications for SSM and LM research, and verify results empirically on a recent SSM, Mamba.

Overview of the Expressive Capacity of State Space Models: A Formal Language Perspective

This paper undertakes a detailed examination of the expressive capacity of State Space Models (SSMs) in the context of language modeling, comparing them to transformers and traditional recurrent neural networks (RNNs). While transformers have risen to prominence thanks to parallelized training and strong empirical performance, SSMs have emerged as a competitive alternative, potentially offering capabilities that transformers inherently lack. The paper applies a formal-language-theoretic lens to uncover the in-principle abilities of SSMs, providing insight into their strengths and limitations relative to other architectures.

Key Contributions

  1. Expressive Capacity in Formal Languages:
    • The paper explores the ability of SSMs to model different classes of formal languages, effectively framing the discussion in terms of language classes traditionally used to understand computational problems.
    • A core finding is that SSMs and transformers cover overlapping yet distinct fragments of the TC⁰ circuit complexity class, with specific problems on which each architecture excels or struggles.
  2. Differences Between SSMs and Transformers:
    • It is demonstrated that SSMs can handle certain regular languages and bounded hierarchical structures with optimal memory efficiency. A prominent example is flip-flop state tracking, for which SSMs admit simple and exact solutions, in contrast to the empirical difficulties transformers face in generalizing on this task (see the sketch after this list).
    • Conversely, SSMs face limitations with modular counting in regular languages, most notably PARITY: tracking a count modulo 2 requires sign-flipping state dynamics that the nonnegative gating used in current SSM designs does not support (also illustrated in the sketch after this list).
  3. Theoretical Characterization of Expressive Power:
    • The paper identifies that the expressive power of non-time-invariant SSMs with nonnegative gates corresponds to the class of star-free regular languages, i.e., the regular languages definable without the Kleene star. Via the Krohn-Rhodes theorem, these are exactly the languages recognizable by cascade products of set-reset (flip-flop) automata, which SSM layers can model directly (see the note after this list).
    • This characterization yields a clean criterion for which finite-state problems SSMs can solve, making their behavior easier to understand and predict than that of transformers, for which length generalization remains problematic.
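
To make the second contribution concrete, below is a minimal Python sketch (an illustration in the spirit of the constructions discussed in the paper, not its formal proofs; the token encoding and function names are hypothetical). It shows how a single diagonal, input-dependent linear recurrence h_t = a(x_t)·h_{t-1} + b(x_t) solves flip-flop state tracking exactly with gates in [0, 1], and why the natural linear-recurrence solution to PARITY needs a negative gate value, which nonnegative gating rules out.

```python
# Minimal sketch: a selective (input-dependent) diagonal linear recurrence.
# Flip-flop state tracking: at every read, output the most recently written bit.
# Hypothetical token encoding: ("w", 0) / ("w", 1) write a bit, ("r", None) reads,
# ("i", None) is a no-op.
def flipflop_recurrence(tokens):
    h = 0.0
    outputs = []
    for op, bit in tokens:
        if op == "w":                 # write: erase the old state, store the new bit
            a, b = 0.0, float(bit)
        else:                         # read / ignore: carry the state unchanged
            a, b = 1.0, 0.0
        h = a * h + b                 # gates stay in [0, 1]: exact, no decay needed
        if op == "r":
            outputs.append(int(h))
    return outputs

print(flipflop_recurrence([("w", 1), ("i", None), ("r", None), ("w", 0), ("r", None)]))
# -> [1, 0]

# PARITY (is the number of 1s odd?): an *unconstrained* linear recurrence
# solves it by flipping the sign of the state on every 1.
def parity_signed(bits):
    h = 1.0
    for x in bits:
        a = -1.0 if x == 1 else 1.0   # a gate of -1 tracks the count mod 2
        h = a * h
    return 0 if h > 0 else 1

print(parity_signed([1, 0, 1, 1]))    # -> 1 (odd number of 1s)

# If the gate a(x_t) is restricted to nonnegative values, h stays nonnegative
# whenever it starts nonnegative, so this sign-flipping solution is unavailable --
# the kind of design choice the paper identifies as limiting expressivity.
```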
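
For the third contribution, the star-free class can be pinned down with a standard textbook fact (stated here in generic notation, not the paper's): star-free languages are those definable from finite languages using union, concatenation, and complement, but no Kleene star. Over Σ = {a, b}, the language (ab)* is star-free despite its name, whereas PARITY (strings with an even number of b's) is regular but not star-free, which lines up with the modular-counting limitation noted in item 2:

```latex
\[
  (ab)^{*} \;=\; \overline{\, b\Sigma^{*} \;\cup\; \Sigma^{*}a \;\cup\; \Sigma^{*}aa\Sigma^{*} \;\cup\; \Sigma^{*}bb\Sigma^{*} \,},
  \qquad \Sigma^{*} = \overline{\emptyset} .
\]
```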

Implications and Future Directions

  • The results presented imply that SSMs could provide distinct advantages in handling certain language tasks, suggesting avenues for hybrid architectures that integrate the strengths of SSMs and transformers.
  • The theoretical implications underscore the potential need to revisit the parametrization of SSMs, especially regarding the handling of nonlinearities and precision, to overcome expressivity bottlenecks.
  • Practically, these insights can guide the design of more efficient and capable language models by highlighting scenarios where SSMs outperform or can complement existing transformer models.

Conclusions

The paper delivers a rigorous account of the expressive capacities of SSMs, yielding clear implications for both theoretical computer science and practical AI model development. By employing formal language frameworks, the research elucidates strengths and weaknesses in SSM design choices and suggests future directions for leveraging the unique characteristics of SSM-based architectures. The findings advocate for exploration of hybrid models combining SSMs with other architectures, potentially unlocking new capabilities in language modeling tasks.

Authors (3)
  1. Yash Sarrof
  2. Yana Veitsman
  3. Michael Hahn