State Space Models are Comparable to Transformers in Estimating Functions with Dynamic Smoothness (2405.19036v1)
Abstract: Deep neural networks based on state space models (SSMs) are attracting much attention in sequence modeling because their computational cost is significantly smaller than that of Transformers. While the capabilities of SSMs have been investigated primarily through experimental comparisons, theoretical understanding of SSMs is still limited. In particular, there is a lack of statistical, quantitative evaluation of whether SSMs can replace Transformers. In this paper, we theoretically explore in which tasks SSMs can serve as alternatives to Transformers, from the perspective of estimating sequence-to-sequence functions. We consider the setting where the target function has direction-dependent smoothness and prove that SSMs can estimate such functions with the same convergence rate as Transformers. Additionally, we prove that SSMs can estimate the target function as well as Transformers can, even when its smoothness changes depending on the input sequence. Our results suggest that SSMs can replace Transformers when estimating functions in certain classes that appear in practice.
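To make the architecture under discussion concrete, the sketch below implements a generic discrete-time state space layer with the standard linear recurrence x_{k+1} = A x_k + B u_k, y_k = C x_k + D u_k, processed as a sequential scan whose cost is linear in the sequence length (versus quadratic for self-attention). The shapes, parameter names, and toy dynamics are illustrative assumptions, not the paper's exact parameterization.

```python
# Minimal sketch of one discrete-time SSM layer (illustrative, not the
# paper's exact model): x_{k+1} = A x_k + B u_k, y_k = C x_k + D u_k.
import numpy as np

def ssm_layer(u, A, B, C, D):
    """Apply one SSM layer to an input sequence u of shape (L, d_in).

    A: (d_state, d_state), B: (d_state, d_in),
    C: (d_out, d_state),   D: (d_out, d_in).
    The scan costs O(L) in the sequence length L.
    """
    L, _ = u.shape
    x = np.zeros(A.shape[0])          # hidden state
    ys = np.empty((L, C.shape[0]))
    for k in range(L):
        x = A @ x + B @ u[k]          # state update
        ys[k] = C @ x + D @ u[k]      # readout
    return ys

# Usage: a length-16 sequence of 4-dimensional tokens (hypothetical sizes).
rng = np.random.default_rng(0)
d_state, d_in, d_out, L = 8, 4, 4, 16
A = 0.9 * np.eye(d_state)             # stable toy dynamics
B = rng.normal(size=(d_state, d_in))
C = rng.normal(size=(d_out, d_state))
D = rng.normal(size=(d_out, d_in))
y = ssm_layer(rng.normal(size=(L, d_in)), A, B, C, D)
print(y.shape)                        # (16, 4)
```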
Authors: Naoki Nishikawa, Taiji Suzuki