Understanding the differences in Foundation Models: Attention, State Space Models, and Recurrent Neural Networks (2405.15731v3)

Published 24 May 2024 in cs.LG, cs.AI, cs.SY, and eess.SY

Abstract: Softmax attention is the principal backbone of foundation models for various artificial intelligence applications, yet its quadratic complexity in sequence length can limit its inference throughput in long-context settings. To address this challenge, architectures such as linear attention, State Space Models (SSMs), and Recurrent Neural Networks (RNNs) have been proposed as more efficient alternatives. While connections between these approaches exist, such models are commonly developed in isolation, and there is a lack of theoretical understanding of the shared principles underpinning them and of the subtle differences that greatly influence performance and scalability. In this paper, we introduce the Dynamical Systems Framework (DSF), which allows a principled investigation of all these architectures in a common representation. Our framework facilitates rigorous comparisons, providing new insights into the distinctive characteristics of each model class. For instance, we compare linear attention and selective SSMs, detailing their differences and the conditions under which the two are equivalent. We also provide principled comparisons between softmax attention and other model classes, discussing the theoretical conditions under which softmax attention can be approximated. Additionally, we substantiate these new insights with empirical validations and mathematical arguments. This shows the DSF's potential to guide the systematic development of future foundation models that are more efficient and scalable.
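
Since this page carries no formulas, the following is a minimal NumPy sketch (not the paper's DSF) contrasting the two computations the abstract refers to: causal softmax attention, whose cost grows quadratically with sequence length, and a linear-attention-style recurrence over a fixed-size state, which is the form shared by linear attention, SSMs, and RNNs. The feature map `phi` and all dimensions are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def softmax_attention(Q, K, V):
    # Causal softmax attention: materializes an (L, L) score matrix,
    # so time and memory grow quadratically with sequence length L.
    L, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)
    causal = np.tril(np.ones((L, L), dtype=bool))
    scores = np.where(causal, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ V

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0.0) + 1e-6):
    # Linear attention written as a recurrence over a matrix-valued state:
    #   S_t = S_{t-1} + phi(k_t) v_t^T,  z_t = z_{t-1} + phi(k_t)
    #   y_t = S_t^T phi(q_t) / (z_t^T phi(q_t))
    # Per-step cost is constant in t, so a whole sequence costs O(L).
    d_k, d_v = K.shape[1], V.shape[1]
    S = np.zeros((d_k, d_v))   # matrix-valued hidden state
    z = np.zeros(d_k)          # running normalizer
    Y = np.empty_like(V)
    for t in range(Q.shape[0]):
        S += np.outer(phi(K[t]), V[t])
        z += phi(K[t])
        q = phi(Q[t])
        Y[t] = (q @ S) / (q @ z)
    return Y

rng = np.random.default_rng(0)
L, d = 16, 8
Q, K, V = rng.standard_normal((3, L, d))
print(softmax_attention(Q, K, V)[-1, :4])  # quadratic-cost baseline
print(linear_attention(Q, K, V)[-1, :4])   # linear-time recurrence (different feature map, so outputs differ)
```

The recurrent form keeps only a fixed d_k x d_v state between steps, which is why this family of models can sustain high inference throughput on long contexts; the paper's comparisons concern when and how such recurrences match or approximate softmax attention.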

