State-Free Inference of State-Space Models: The Transfer Function Approach (2405.06147v2)
Abstract: We approach designing a state-space model for deep learning applications through its dual representation, the transfer function, and uncover a highly efficient sequence-parallel inference algorithm that is state-free: unlike other proposed algorithms, state-free inference does not incur any significant memory or computational cost as the state size grows. We achieve this using properties of the proposed frequency-domain transfer function parametrization, which enables direct computation of the corresponding convolutional kernel's spectrum via a single Fast Fourier Transform. Our experimental results across multiple sequence lengths and state sizes illustrate, on average, a 35% training speed improvement over S4 layers -- parametrized in the time domain -- on the Long Range Arena benchmark, while delivering state-of-the-art downstream performance over other attention-free approaches. Moreover, we report improved perplexity in language modeling over a long convolutional Hyena baseline, simply by introducing our transfer function parametrization. Our code is available at https://github.com/ruke1ire/RTF.
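To make the core idea concrete, below is a minimal numerical sketch (NumPy; the function names are illustrative and not the authors' RTF API). For a rational transfer function H(z) = b(z)/a(z), the kernel's spectrum can be obtained by evaluating the zero-padded numerator and denominator coefficient vectors on the unit circle with one FFT each and dividing pointwise; the sketch omits the truncation/aliasing handling discussed in the paper and assumes a stable single-input single-output filter.

```python
# Illustrative sketch only (not the authors' implementation):
# build a long convolution kernel from a rational transfer function
# H(z) = b(z) / a(z) via FFTs, then filter an input with FFT convolution.
import numpy as np

def rtf_kernel(b, a, L):
    """Approximate length-L impulse response of H(z) = b(z)/a(z).
    b, a: coefficients in ascending powers of z^{-1}, with a[0] == 1.
    Padding to 2L reduces aliasing of the truncated response."""
    B = np.fft.rfft(b, n=2 * L)           # numerator evaluated on the unit circle
    A = np.fft.rfft(a, n=2 * L)           # denominator evaluated on the unit circle
    return np.fft.irfft(B / A, n=2 * L)[:L]

def rtf_filter(u, b, a):
    """Causal convolution of input u with the kernel of H(z), via FFT."""
    L = len(u)
    h = rtf_kernel(b, a, L)
    n = 2 * L                              # zero-pad so circular conv == linear conv on [0, L)
    y = np.fft.irfft(np.fft.rfft(u, n) * np.fft.rfft(h, n), n)[:L]
    return y

# Example: a small filter (denominator degree 3, i.e. "state size" 3)
# applied to a length-1024 sequence.
rng = np.random.default_rng(0)
b = rng.standard_normal(4)                     # numerator coefficients
a = np.r_[1.0, 0.1 * rng.standard_normal(3)]   # monic, well-inside-unit-circle poles
u = rng.standard_normal(1024)
y = rtf_filter(u, b, a)
```

In this sketch the FFT lengths depend only on the sequence length L; the state size only sets the lengths of the coefficient vectors b and a, which is the sense in which the computation does not materialize a state.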
- In-context language learning: Architectures and algorithms, 2024.
- Bounding the zeros of polynomials using the frobenius companion matrix partitioned by the cartesian decomposition. Algorithms, 15(6), 2022. ISSN 1999-4893. doi: 10.3390/a15060184. URL https://www.mdpi.com/1999-4893/15/6/184.
- Never train from scratch: Fair comparison of long-sequence models requires data-driven priors. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=PdaPky8MUn.
- Unitary evolution recurrent neural networks. In Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48, ICML’16, pp. 1120–1128. JMLR.org, 2016.
- Adaptive input representations for neural language modeling. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=ByxZX20qFQ.
- Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166, 1994. doi: 10.1109/72.279181.
- Understanding in-context learning in transformers and LLMs by learning to learn discrete functions. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=ekeyCgeRfC.
- Blelloch, G. E. Prefix sums and their applications. In Synthesis of Parallel Algorithms, pp. 35–60. Morgan Kaufmann Publishers Inc., 1990. URL http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.47.6430.
- JAX: composable transformations of Python+NumPy programs, 2018. URL http://github.com/google/jax.
- Convolution algorithms. Citeseer: New York, NY, USA, 6:15, 1985.
- Chen, C.-T. Linear System Theory and Design. Oxford University Press, Inc., USA, 3rd edition, 1998. ISBN 0195117778.
- Decision transformer: Reinforcement learning via sequence modeling. In Beygelzimer, A., Dauphin, Y., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems, 2021. URL https://openreview.net/forum?id=a7APmM4B9d.
- Rethinking attention with performers. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=Ua6zuk0WRH.
- Empirical evaluation of gated recurrent neural networks on sequence modeling. In NIPS 2014 Workshop on Deep Learning, December 2014.
- Language modeling with gated convolutional networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML’17, pp. 933–941. JMLR.org, 2017.
- Hungry Hungry Hippos: Towards language modeling with state space models. In International Conference on Learning Representations, 2023.
- FlashFFTConv: Efficient convolutions for long sequences with tensor cores. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=gPKTTAfYBp.
- The Pile: An 800GB dataset of diverse text for language modeling. CoRR, abs/2101.00027, 2021. URL https://arxiv.org/abs/2101.00027.
- A framework for few-shot language model evaluation, December 2023. URL https://zenodo.org/records/10256836.
- Understanding the difficulty of training deep feedforward neural networks. In Teh, Y. W. and Titterington, M. (eds.), Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, volume 9 of Proceedings of Machine Learning Research, pp. 249–256, Chia Laguna Resort, Sardinia, Italy, 13–15 May 2010. PMLR. URL https://proceedings.mlr.press/v9/glorot10a.html.
- Digital image processing. Prentice Hall, Upper Saddle River, N.J., 2008. ISBN 9780131687288.
- Mamba: Linear-time sequence modeling with selective state spaces, 2023.
- On the parameterization and initialization of diagonal state space models. In Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., and Oh, A. (eds.), Advances in Neural Information Processing Systems, volume 35, pp. 35971–35983. Curran Associates, Inc., 2022a.
- Efficiently modeling long sequences with structured state spaces. In International Conference on Learning Representations, 2022b. URL https://openreview.net/forum?id=uYLFoz1vlAC.
- How to train your HIPPO: State space models with generalized orthogonal basis projections. In International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=klK17OQ3KB.
- Diagonal state spaces are as effective as structured state spaces. In Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., and Oh, A. (eds.), Advances in Neural Information Processing Systems, volume 35, 2022.
- Liquid structural state-space models. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=g4OTKRKfS7R.
- Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In 2015 IEEE International Conference on Computer Vision (ICCV), pp. 1026–1034, 2015. doi: 10.1109/ICCV.2015.123.
- Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778, 2016. doi: 10.1109/CVPR.2016.90.
- Gaussian error linear units (GELUs), 2023.
- Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
- Matrix Analysis. Cambridge University Press, 1985.
- Katsch, T. Gateloop: Fully data-controlled linear recurrence for sequence modeling, 2023.
- Neural controlled differential equations for irregular time series. Advances in Neural Information Processing Systems, 33:6696–6707, 2020.
- Krizhevsky, A. Learning multiple layers of features from tiny images. 2009. URL https://api.semanticscholar.org/CorpusID:18268744.
- Learning long-range spatial dependencies with horizontal gated recurrent units. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. URL https://proceedings.neurips.cc/paper_files/paper/2018/file/ec8956637a99787bd197eacd77acce5e-Paper.pdf.
- Decoupled weight decay regularization. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=Bkg6RiCqY7.
- Learning word vectors for sentiment analysis. In Lin, D., Matsumoto, Y., and Mihalcea, R. (eds.), Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp. 142–150, Portland, Oregon, USA, June 2011. Association for Computational Linguistics. URL https://aclanthology.org/P11-1015.
- Parallelizing linear recurrent neural nets over sequence length. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=HyUNwulC-.
- Differentiable multiple shooting layers. Advances in Neural Information Processing Systems, 34:16532–16544, 2021.
- Laughing hyena distillery: Extracting compact recurrences from convolutions. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=OWELckerm6.
- ListOps: A diagnostic dataset for latent tree learning. In Cordeiro, S. R., Oraby, S., Pavalanathan, U., and Rim, K. (eds.), Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop, pp. 92–99, New Orleans, Louisiana, USA, June 2018. Association for Computational Linguistics. doi: 10.18653/v1/N18-4013. URL https://aclanthology.org/N18-4013.
- Discrete-Time Signal Processing. Prentice-Hall, Englewood Cliffs, second edition, 1999.
- Resurrecting recurrent neural networks for long sequences. In Proceedings of the 40th International Conference on Machine Learning, ICML’23. JMLR.org, 2023.
- Pan, V. Y. Structured Matrices and Polynomials: Unified Superfast Algorithms. Springer-Verlag, Berlin, Heidelberg, 2001. ISBN 0817642404.
- On the difficulty of training recurrent neural networks. In Dasgupta, S. and McAllester, D. (eds.), Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pp. 1310–1318, Atlanta, Georgia, USA, 17–19 Jun 2013. PMLR. URL https://proceedings.mlr.press/v28/pascanu13.html.
- Hyena hierarchy: towards larger convolutional language models. In Proceedings of the 40th International Conference on Machine Learning, ICML’23. JMLR.org, 2023a.
- StripedHyena: Moving beyond transformers with hybrid signal processing models, 2023b.
- The ACL Anthology network corpus. In Kan, M.-Y. and Teufel, S. (eds.), Proceedings of the 2009 Workshop on Text and Citation Analysis for Scholarly Digital Libraries (NLPIR4DL), pp. 54–61, Suntec City, Singapore, August 2009. Association for Computational Linguistics. URL https://aclanthology.org/W09-3607.
- Improving language understanding by generative pre-training. 2018.
- Sparse modular activation for efficient sequence modeling, 2023.
- CKConv: Continuous kernel convolution for sequential data. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022. URL https://openreview.net/forum?id=8FhxBtXSl0.
- Sandberg, I. W. On the theory of linear multi-loop feedback systems. Bell System Technical Journal, 42(2):355–382, 1963.
- Fast convolution and filtering. In Digital Signal Processing Fundamentals, pp. 185–208. CRC Press, 2017.
- Implicit neural representations with periodic activation functions. In Proc. NeurIPS, 2020.
- Simplified state space layers for sequence modeling. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=Ai8Hw3AXqks.
- Long Range Arena: A benchmark for efficient transformers. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=qVyeW-grC2k.
- Attention is all you need. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.
- Effectively modeling time series with simple discrete state spaces. In International Conference on Learning Representations, 2023.
Authors: Rom N. Parnichkun, Stefano Massaroli, Alessandro Moro, Jimmy T. H. Smith, Ramin Hasani, Mathias Lechner, Qi An, Christopher Ré, Hajime Asama, Stefano Ermon, Taiji Suzuki, Atsushi Yamashita, Michael Poli