Theoretical Foundations of Deep Selective State-Space Models (2402.19047v4)
Abstract: Structured state-space models (SSMs) such as S4, stemming from the seminal work of Gu et al., are gaining popularity as effective approaches for modeling sequential data. Deep SSMs demonstrate outstanding performance across a diverse set of domains, at a reduced training and inference cost compared to attention-based transformers. Recent developments show that if the linear recurrence powering SSMs allows for multiplicative interactions between inputs and hidden states (e.g. GateLoop, Mamba, GLA), then the resulting architecture can surpass attention-powered foundation models trained on text in both accuracy and efficiency, at billion-parameter scale. In this paper, we give theoretical grounding to this recent finding using tools from Rough Path Theory: we show that when random linear recurrences are equipped with simple input-controlled transitions (selectivity mechanism), the hidden state is provably a low-dimensional projection of a powerful mathematical object called the signature of the input -- capturing non-linear interactions between tokens at distinct timescales. Our theory not only motivates the success of modern selective state-space models such as Mamba but also provides a solid framework to understand the expressive power of future SSM variants.
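For a concrete picture of the "input-controlled transitions" the abstract refers to, the snippet below sketches a toy diagonal linear recurrence whose transition depends on the current input. It is a minimal illustration only: the function and parameter names (selective_scan, W_delta) and the softplus/exponential gating are assumptions made here, not the construction analysed in the paper.

```python
import numpy as np

def selective_scan(us, A, B, W_delta):
    """Toy input-controlled (selective) linear recurrence.

    State update: x_t = a_t * x_{t-1} + B @ u_t, where the diagonal
    transition a_t = exp(-softplus(W_delta @ u_t) * A) depends on the
    current input u_t. Names and gating choices are illustrative.
    """
    x = np.zeros(A.shape[0])
    states = []
    for u in us:
        delta = np.log1p(np.exp(W_delta @ u))  # softplus: input-dependent step size
        a = np.exp(-delta * A)                 # input-controlled diagonal transition
        x = a * x + B @ u                      # state mixes past state and input multiplicatively
        states.append(x.copy())
    return np.stack(states)

# Toy usage on a random input sequence.
rng = np.random.default_rng(0)
T, d_in, n = 16, 4, 8
us = rng.normal(size=(T, d_in))
A = np.abs(rng.normal(size=n))      # positive decay rates
B = rng.normal(size=(n, d_in))      # input projection
W_delta = rng.normal(size=d_in)     # step-size gate
print(selective_scan(us, A, B, W_delta).shape)  # (16, 8)
```

Because the factor multiplying the previous state is itself input-dependent, unrolling the recurrence produces products of functions of inputs at different time steps, which is the kind of signature-like term the paper's analysis makes precise.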
- Sig-SDEs model for quantitative finance. In Proceedings of the First ACM International Conference on AI in Finance, pp. 1–8, 2020.
- Barron, A. R. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39(3):930–945, 1993.
- Reproducing Kernel Hilbert Spaces in Probability and Statistics. Springer US, 2011. ISBN 9781441990969. URL https://books.google.co.uk/books?id=bX3TBwAAQBAJ.
- Fading memory and the problem of approximating nonlinear operators with Volterra series. IEEE Transactions on Circuits and Systems, 32(11):1150–1161, 1985.
- Chen, K.-T. Integration of paths–a faithful representation of paths by noncommutative formal power series. Transactions of the American Mathematical Society, 89(2):395–407, 1958. ISSN 00029947. URL http://www.jstor.org/stable/1993193.
- A primer on the signature method in machine learning. arXiv preprint arXiv:1603.03788, 2016.
- Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
- Neural signature kernels as infinite-width-depth-limits of controlled ResNets, 2023.
- SK-Tree: a systematic malware detection algorithm on streaming trees via the signature kernel. In 2021 IEEE International Conference on Cyber Security and Resilience (CSR), pp. 35–40. IEEE, 2021.
- Expressive power of randomized signature. In The Symbiosis of Deep Learning and Differential Equations, 2021a.
- Discrete-time signatures and randomness in reservoir computing. IEEE Transactions on Neural Networks and Learning Systems, 2021b. doi: 10.1109/TNNLS.2021.3076777.
- An elementary proof of a theorem of Johnson and Lindenstrauss. Random Structures & Algorithms, 22, 2003. URL https://api.semanticscholar.org/CorpusID:10327785.
- An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
- On words of non-Hermitian random matrices. The Annals of Probability, 49(4):1886–1916, 2021. doi: 10.1214/20-AOP1496. URL https://doi.org/10.1214/20-AOP1496.
- Fermanian, A. Embedding and learning with signatures, 2020.
- New directions in the applications of rough path theory. IEEE BITS the Information Theory Magazine, 2023.
- Multidimensional Stochastic Processes as Rough Paths: Theory and Applications. Cambridge Studies in Advanced Mathematics. Cambridge University Press, 2010. doi: 10.1017/CBO9780511845079.
- Hungry hungry hippos: Towards language modeling with state space models. In The Eleventh International Conference on Learning Representations, 2022.
- It’s raw! Audio generation with state-space models. International Conference on Machine Learning, 2022.
- Mamba: Linear-time sequence modeling with selective state spaces, 2023.
- Hippo: Recurrent memory with optimal polynomial projections. Advances in neural information processing systems, 33:1474–1487, 2020.
- Efficiently modeling long sequences with structured state spaces. In International Conference on Learning Representations, 2021.
- On the parameterization and initialization of diagonal state space models. arXiv preprint arXiv:2206.11893, 2022.
- Uniqueness for the signature of a path of bounded variation and the reduced path group. Annals of Mathematics, pp. 109–167, 2010.
- Universal simulation of stable dynamical systems by recurrent neural nets. In Learning for Dynamics and Control, pp. 384–392. PMLR, 2020.
- Long short-term memory. Neural Computation, 1997.
- A neural RDE approach for continuous-time non-Markovian stochastic control problems. arXiv preprint arXiv:2306.14258, 2023.
- Non-adversarial training of neural SDEs with signature kernel scores. Advances in Neural Information Processing Systems, 2023.
- Katsch, T. Gateloop: Fully data-controlled linear recurrence for sequence modeling, 2023.
- Kidger, P. On neural differential equations, 2022.
- Deep signature transforms. Advances in Neural Information Processing Systems, 32, 2019.
- Neural controlled differential equations for irregular time series. Advances in Neural Information Processing Systems, 33:6696–6707, 2020.
- On the computational power of RNNs. arXiv preprint arXiv:1906.06349, 2019.
- Efficient BackProp, pp. 9–48. Springer Berlin Heidelberg, Berlin, Heidelberg, 2012. ISBN 978-3-642-35289-8. doi: 10.1007/978-3-642-35289-8_3. URL https://doi.org/10.1007/978-3-642-35289-8_3.
- Distribution regression for sequential data, 2021.
- What makes convolutional models great on long sequence modeling? arXiv preprint arXiv:2210.09298, 2022a.
- Approximation and optimization theory for linear continuous-time recurrent neural networks. Journal of Machine Learning Research, 23:42–1, 2022b.
- Structured state space models for in-context reinforcement learning. Advances in Neural Information Processing Systems, 2023.
- Reservoir computing approaches to recurrent neural network training. Computer Science Review, 3:127–149, 2009. URL https://api.semanticscholar.org/CorpusID:554006.
- Signature methods in machine learning, 2024.
- Differential equations driven by rough paths. Springer, 2007.
- Parallelizing linear recurrent neural nets over sequence length. arXiv preprint arXiv:1709.04057, 2017.
- Neural rough differential equations for long time series. In International Conference on Machine Learning, pp. 7829–7838. PMLR, 2021.
- S4ND: Modeling images and videos as multidimensional signals using state spaces. Advances in Neural Information Processing Systems, 2022.
- On the universality of linear recurrences followed by nonlinear projections. arXiv preprint arXiv:2307.11888, 2023a.
- Resurrecting recurrent neural networks for long sequences. arXiv preprint arXiv:2303.06349, 2023b.
- RWKV: Reinventing RNNs for the transformer era. arXiv preprint arXiv:2305.13048, 2023.
- The signature kernel is the solution of a Goursat PDE. SIAM Journal on Mathematics of Data Science, 3(3):873–899, 2021a. doi: 10.1137/20M1366794. URL https://doi.org/10.1137/20M1366794.
- SigGPDE: Scaling sparse Gaussian processes on sequential data. In International Conference on Machine Learning, pp. 6233–6242. PMLR, 2021b.
- Higher order kernel mean embeddings to capture filtrations of stochastic processes. Advances in Neural Information Processing Systems, 34:16635–16647, 2021c.
- Neural stochastic PDEs: Resolution-invariant learning of continuous spatiotemporal dynamics. Advances in Neural Information Processing Systems, 35:1333–1344, 2022.
- On the computational power of neural nets. In Proceedings of the fifth annual workshop on Computational learning theory, pp. 440–449, 1992.
- Simplified state space layers for sequence modeling, 2023.
- Retentive network: A successor to transformer for large language models, 2023.
- Can recurrent neural networks warp time?, 2018.
- Long range arena: A benchmark for efficient transformers. In International Conference on Learning Representations, 2020.
- Attention is all you need. Advances in Neural Information Processing Systems, 2017.
- Log neural controlled differential equations: The Lie brackets make a difference, 2024.
- Pretraining without attention. arXiv preprint arXiv:2212.10544, 2022.
- State-space models with layer-wise nonlinearity are universal approximators with exponential decaying memory. arXiv preprint arXiv:2309.13414, 2023.
- Gated linear attention transformers with hardware-efficient training. arXiv preprint arXiv:2312.06635, 2023.
- Online learning of long-range dependencies, 2023.