Taipan: Efficient and Expressive State Space Language Models with Selective Attention (2410.18572v1)

Published 24 Oct 2024 in cs.CL, cs.AI, and cs.LG

Abstract: Efficient long-context language modeling remains a significant challenge in NLP. While Transformers dominate language tasks, they struggle with long sequences due to quadratic computational complexity in training and linearly scaling memory costs during inference. Recent State Space Models (SSMs) such as Mamba offer alternatives with constant memory usage, but they underperform in tasks requiring extensive in-context retrieval. We introduce Taipan, a novel hybrid architecture that combines Mamba-2 with Selective Attention Layers (SALs). These SALs identify tokens requiring long-range interactions, remove less important features, and then augment their representations using the attention module. This approach balances Mamba's efficiency with Transformer-like performance in memory-intensive tasks. By constraining the attention budget, Taipan extends accurate predictions to context lengths of up to 1 million tokens while preserving computational efficiency. Our experiments demonstrate Taipan's superior performance across various scales and tasks, offering a promising solution for efficient long-context language modeling.

Summary

  • The paper introduces Taipan’s hybrid architecture, merging state space models with selective attention to efficiently manage long-context tasks.
  • It demonstrates notable improvements in zero-shot language modeling and in-context retrieval over Transformer++ and Mamba baselines.
  • Experimental results reveal that Taipan scales to 1M tokens with lower perplexity and latency, highlighting its practical applicability.

Taipan: Efficient and Expressive State Space Language Models with Selective Attention

The paper introduces Taipan, a novel language model architecture that addresses the challenges of efficiently handling long-context language tasks. The primary innovation lies in combining the strengths of State Space Models (SSMs) such as Mamba with Selective Attention Layers (SALs) to balance computational efficiency with the ability to model long-range dependencies.
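
At a high level, Taipan can be pictured as a Mamba-2 backbone with a selective-attention block inserted every few layers. The sketch below is a minimal structural illustration, not the authors' implementation: the placeholder block classes, the insertion period, and the dimensions are assumptions chosen only to show the layout.

```python
# Illustrative sketch of a hybrid SSM + selective-attention stack (not the paper's code).
import torch
import torch.nn as nn

class MambaBlockStub(nn.Module):
    """Stand-in for a Mamba-2 block: any constant-state sequence mixer would sit here."""
    def __init__(self, d_model: int):
        super().__init__()
        self.mixer = nn.GRU(d_model, d_model, batch_first=True)  # placeholder recurrence
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        y, _ = self.mixer(self.norm(x))
        return x + y  # residual connection

class SelectiveAttentionStub(nn.Module):
    """Stand-in for a Selective Attention Layer (see the more detailed sketch below)."""
    def __init__(self, d_model: int, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        h = self.norm(x)
        y, _ = self.attn(h, h, h, need_weights=False)
        return x + y

class HybridStack(nn.Module):
    """Mostly Mamba-style blocks, with an attention block every `sal_every` layers."""
    def __init__(self, d_model=256, n_layers=8, sal_every=4):
        super().__init__()
        self.layers = nn.ModuleList(
            SelectiveAttentionStub(d_model) if (i + 1) % sal_every == 0 else MambaBlockStub(d_model)
            for i in range(n_layers)
        )

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

x = torch.randn(2, 128, 256)   # (batch, seq_len, d_model)
print(HybridStack()(x).shape)  # torch.Size([2, 128, 256])
```

The placeholder recurrent mixer stands in for a Mamba-2 block; the point is only that most layers keep constant per-token state while a small number of attention layers handle long-range interactions.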

Background and Motivation

Transformers have achieved significant success across NLP tasks thanks to their self-attention mechanism. However, they incur computational complexity that is quadratic in sequence length during training, and memory costs that grow linearly during inference, which makes long sequences expensive to process. SSMs such as Mamba-2 offer constant memory usage but have traditionally struggled with in-context retrieval and other tasks that depend on precise long-range interactions.
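
To make the memory contrast concrete, here is a back-of-the-envelope comparison of inference-time memory; the configuration numbers are illustrative assumptions, not figures reported in the paper. A Transformer's key-value cache grows linearly with context length, while an SSM's recurrent state is fixed regardless of sequence length.

```python
# Back-of-the-envelope inference-memory comparison (illustrative numbers, not from the paper).

def kv_cache_bytes(seq_len, n_layers, n_heads, head_dim, bytes_per_elem=2):
    """Transformer KV cache: a K and a V tensor per layer, each seq_len x n_heads x head_dim."""
    return 2 * n_layers * seq_len * n_heads * head_dim * bytes_per_elem

def ssm_state_bytes(n_layers, d_model, state_dim, bytes_per_elem=2):
    """SSM recurrent state: a fixed d_model x state_dim state per layer, independent of seq_len."""
    return n_layers * d_model * state_dim * bytes_per_elem

# Hypothetical ~1B-parameter configuration in fp16 (assumed values).
for seq_len in (4_096, 131_072, 1_000_000):
    kv = kv_cache_bytes(seq_len, n_layers=24, n_heads=16, head_dim=128)
    ssm = ssm_state_bytes(n_layers=24, d_model=2048, state_dim=128)
    print(f"{seq_len:>9} tokens | KV cache {kv / 2**30:8.2f} GiB | SSM state {ssm / 2**20:5.1f} MiB")
```

Under these assumed sizes the KV cache reaches roughly 180 GiB at one million tokens, while the recurrent state stays around 12 MiB; this is the gap Taipan aims to bridge without giving up retrieval ability.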

Taipan's Architecture

Taipan is a hybrid model that integrates Mamba-2 with Selective Attention Layers. The SALs identify tokens that require long-range interactions, filter out their less important features, and augment the remaining representations with softmax attention. This design preserves Mamba's efficiency while achieving Transformer-like performance on memory-intensive tasks. By constraining the attention budget, Taipan scales to sequences of up to 1 million tokens without compromising computational efficiency.
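
The paper describes the mechanism at this level of detail without accompanying code, so the following is a hedged sketch of how a budget-constrained selective attention layer could be implemented: a lightweight gate scores every token, only the top-scoring fraction (the attention budget) is routed through softmax attention, and the gated result is added back at the selected positions while all other tokens pass through untouched. The gating network, the hard top-k selection, and the budget value are illustrative assumptions, not the authors' exact design.

```python
# Minimal sketch of a budget-constrained selective attention layer (illustrative only).
import torch
import torch.nn as nn

class SelectiveAttentionLayer(nn.Module):
    def __init__(self, d_model: int, n_heads: int = 8, budget: float = 0.15):
        super().__init__()
        self.gate = nn.Linear(d_model, 1)    # scores how much each token needs attention
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        self.budget = budget                 # fraction of tokens allowed into the attention module

    def forward(self, x):                    # x: (batch, seq_len, d_model)
        B, L, D = x.shape
        scores = self.gate(x).squeeze(-1)    # (B, L) importance score per token
        k = max(1, int(self.budget * L))     # attention budget in tokens
        top_idx = scores.topk(k, dim=-1).indices  # (B, k) selected positions

        # Run softmax attention only among the selected tokens.
        idx = top_idx.unsqueeze(-1).expand(B, k, D)
        selected = self.norm(torch.gather(x, 1, idx))
        attended, _ = self.attn(selected, selected, selected, need_weights=False)

        # Gate the attention update and add it back at the selected positions;
        # unselected tokens are left to the surrounding Mamba layers.
        gate_weight = torch.sigmoid(scores.gather(1, top_idx)).unsqueeze(-1)  # (B, k, 1)
        out = x.clone()
        out.scatter_add_(1, idx, gate_weight * attended)
        return out

layer = SelectiveAttentionLayer(d_model=256)
x = torch.randn(2, 512, 256)
print(layer(x).shape)  # torch.Size([2, 512, 256])
```

In practice a causal mask, positional information, and a differentiable relaxation of the hard selection (e.g. a straight-through or Gumbel-softmax estimator) would also be needed during training; the sketch only shows the budgeted routing idea.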

Experimental Validation

The paper provides comprehensive experimental results illustrating Taipan's superior performance across a wide array of tasks and model sizes, from 190M to 1.3B parameters. Key findings include:

  • Zero-shot language modeling: Taipan outperforms both Transformer++ and Mamba baselines on standard language modeling tasks, demonstrating strong language understanding capabilities.
  • In-Context Retrieval Tasks: In structured information extraction and question-answering tasks, Taipan outperforms baseline models by effectively leveraging its selective attention mechanism to focus on critical tokens, thereby excelling in retrieving contextually relevant information.
  • Long-Context Extrapolation: Taipan maintains lower perplexity and latency compared to traditional models like Transformers and Mamba when handling sequences much longer than its training context, underscoring its scalability and efficiency.

Implications and Future Directions

Taipan represents a significant stride in reconciling efficiency with deep contextual understanding in language modeling. Its design aligns well with increasing demands for models that can handle complex, memory-intensive processes over vast sequences. The inclusion of Selective Attention allows for highly efficient processing, making it suitable for various practical applications, including real-time language services and large-scale data analysis.

Future research can explore optimizing the gating mechanisms, enhancing generalization further, and expanding this architecture's application to different data modalities. Additionally, hybrid architectures like Taipan might set a precedent for combining other efficient computational architectures with selective enhancement strategies to address specific challenges in AI and ML tasks.

In conclusion, Taipan presents a robust framework for developing efficient, expressive language models capable of managing extensive contextual information without excessive computational overhead. Its innovations provide a promising path forward in the quest for efficient, scalable NLP solutions.