
MambaByte: Token-free Selective State Space Model (2401.13660v3)

Published 24 Jan 2024 in cs.CL and cs.LG

Abstract: Token-free language models learn directly from raw bytes and remove the inductive bias of subword tokenization. Operating on bytes, however, results in significantly longer sequences. In this setting, standard autoregressive Transformers scale poorly as the effective memory required grows with sequence length. The recent development of the Mamba state space model (SSM) offers an appealing alternative approach with a fixed-sized memory state and efficient decoding. We propose MambaByte, a token-free adaptation of the Mamba SSM trained autoregressively on byte sequences. In terms of modeling, we show MambaByte to be competitive with, and even to outperform, state-of-the-art subword Transformers on language modeling tasks while maintaining the benefits of token-free language models, such as robustness to noise. In terms of efficiency, we develop an adaptation of speculative decoding with tokenized drafting and byte-level verification. This results in a $2.6\times$ inference speedup to the standard MambaByte implementation, showing similar decoding efficiency as the subword Mamba. These findings establish the viability of SSMs in enabling token-free language modeling.
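
The efficiency result above comes from speculative decoding with a subword drafter and byte-level verification. The sketch below illustrates only the general idea under loose assumptions: `draft_subwords`, `to_bytes`, and `byte_model_logprobs` are hypothetical stand-ins, and the greedy acceptance rule is a simplification rather than the paper's exact verification scheme.

```python
# Conceptual sketch of tokenized drafting with byte-level verification.
# All model calls are hypothetical placeholders, not the released implementation.
import numpy as np

def speculative_step(prefix: bytes, k_draft: int) -> bytes:
    draft_tokens = draft_subwords(prefix, k_draft)       # fast subword drafter proposes k tokens
    draft_bytes = to_bytes(draft_tokens)                 # detokenize the draft into raw bytes
    logprobs = byte_model_logprobs(prefix, draft_bytes)  # byte-level model scores all drafted bytes in one pass
    accepted = bytearray()
    for i, b in enumerate(draft_bytes):
        verified = int(np.argmax(logprobs[i]))
        if verified != b:                                # stop at the first disagreement...
            accepted.append(verified)                    # ...and keep the verifier's byte instead
            break
        accepted.append(b)
    return prefix + bytes(accepted)                      # long accepted drafts amortize the verifier's per-step cost
```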

Introduction to MambaByte

In language modeling, there is a shift away from subword tokenization towards token-free models that learn directly from raw bytes. This shift, however, poses a fundamental challenge: byte sequences are far longer than their subword counterparts, which strains architectures such as the Transformer, whose attention mechanism scales quadratically with sequence length. Researchers are therefore exploring alternative architectures that can manage the computational load of such long sequences while matching or surpassing the performance of traditional subword-based models.
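
To make the scaling argument concrete, the rough comparison below uses made-up sequence lengths and model sizes (assuming roughly four bytes per subword for English text) to contrast the quadratic cost of self-attention with the linear, fixed-state cost of an SSM.

```python
# Back-of-the-envelope cost comparison; all sizes are illustrative assumptions.

def attention_flops(seq_len: int, d_model: int) -> int:
    # QK^T scores plus the weighted sum over values: two L x L x d products.
    return 2 * seq_len * seq_len * d_model

def ssm_flops(seq_len: int, d_state: int, d_model: int) -> int:
    # One constant-size state update per position.
    return seq_len * d_state * d_model

subword_len, byte_len = 2048, 8192      # assume ~4 bytes per subword
d_model, d_state = 1024, 16             # hypothetical widths

print(attention_flops(byte_len, d_model) / attention_flops(subword_len, d_model))        # ~16x more attention work on bytes
print(ssm_flops(byte_len, d_state, d_model) / ssm_flops(subword_len, d_state, d_model))  # only ~4x more SSM work
```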

MambaByte: An Efficient Token-Free Model

To address these challenges, MambaByte has been proposed: a token-free adaptation of the Mamba state space model that operates as an autoregressive language model directly on byte sequences. Because the Mamba architecture scales linearly with sequence length and carries a fixed-size memory state, MambaByte sidesteps the computational issues that hamstring Transformers at byte scale. Its design is also simple: the existing Mamba architecture is used without modification, indicating that its efficiency carries over directly to byte-level language modeling.
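
As a rough picture of what a byte-level Mamba layer computes, the NumPy sketch below runs a single selective-SSM recurrence over a byte string with simplified discretization. The shapes, projections, and initialization are illustrative assumptions, not the released architecture, which stacks many such layers with gating and convolutions.

```python
# Minimal sketch of a selective state space recurrence over raw bytes.
# Sizes and parameterization are illustrative assumptions only.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_state = 64, 16                           # hypothetical widths

A = -np.exp(rng.normal(size=(d_model, d_state)))    # fixed negative "decay" parameters
W_delta = rng.normal(size=(d_model,)) * 0.1         # input-dependent step size
W_B = rng.normal(size=(d_model, d_state)) * 0.1     # input-dependent B projection
W_C = rng.normal(size=(d_model, d_state)) * 0.1     # input-dependent C projection
embed = rng.normal(size=(256, d_model)) * 0.02      # byte embedding table (one row per byte value)

def ssm_step(h, x):
    """One recurrent step: h has shape (d_model, d_state), x has shape (d_model,)."""
    delta = np.logaddexp(0.0, W_delta * x)           # softplus keeps step sizes positive
    B = x @ W_B                                      # (d_state,), computed from the input (the "selective" part)
    C = x @ W_C                                      # (d_state,)
    A_bar = np.exp(delta[:, None] * A)               # discretized state transition, values in (0, 1)
    B_bar = delta[:, None] * B[None, :]              # simplified discretized input matrix
    h = A_bar * h + B_bar * x[:, None]               # fixed-size state update
    y = h @ C                                        # (d_model,) readout for this byte
    return h, y

h = np.zeros((d_model, d_state))
for byte in b"MambaByte":                            # bytes in, one fixed-size state carried through
    h, y = ssm_step(h, embed[byte])
print(y.shape)                                       # (d_model,): one output vector per byte
```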

Empirical Evaluation

In a comparative evaluation, MambaByte was tested against a suite of leading architectures, including byte-level Transformer and PerceiverAR models, under a fixed parameter and compute budget across several text datasets. MambaByte reached better performance with less compute and exceeded the efficiency of byte-level Transformers, thanks to its linear scaling in sequence length. It was also competitive with, and in some cases superior to, state-of-the-art subword Transformers. These results, measured in bits per byte (BPB), indicate that the token-free approach does not compromise model performance.
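
Bits per byte normalizes likelihood by the raw byte count, which lets byte-level and subword models be compared on equal footing regardless of tokenizer compression. The snippet below shows the standard conversion; the losses and counts are made-up placeholders, not numbers from the paper.

```python
# Bits-per-byte (BPB) conversion; all numbers below are illustrative placeholders.
import math

def bits_per_byte(total_nll_nats: float, num_bytes: int) -> float:
    """Total negative log-likelihood in nats over a corpus, normalized per byte in bits."""
    return total_nll_nats / (num_bytes * math.log(2))

# Byte-level model: the NLL is already accumulated over bytes.
byte_model_bpb = bits_per_byte(total_nll_nats=0.80 * 1_000_000, num_bytes=1_000_000)

# Subword model: accumulate NLL over subword tokens, but still normalize by raw bytes,
# so tokenizers with different compression rates remain comparable.
subword_loss_per_token, num_tokens, num_bytes = 2.90, 250_000, 1_000_000
subword_bpb = bits_per_byte(subword_loss_per_token * num_tokens, num_bytes)

print(f"byte-level BPB ~ {byte_model_bpb:.3f}, subword BPB ~ {subword_bpb:.3f}")
```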

Generative Capabilities and Potential Impact

Perhaps the most notable facet of MambaByte is its capability for fast text generation. Unlike Transformer models, which must cache key/value representations of the full context for autoregressive inference, MambaByte takes a constant-time generation step by evolving a single fixed-size hidden state per layer through time. This enables faster text generation and makes the model practical to deploy.
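
The decoding loop below sketches why this matters: generation carries only a fixed-size state forward rather than a cache that grows with every generated byte. `step_fn` and the greedy sampler are placeholders for an actual trained model.

```python
# Sketch of recurrent byte-level decoding with constant per-step memory.
# `step_fn` is a hypothetical (state, byte) -> (state, logits) model call.
import numpy as np

def generate_recurrent(step_fn, state, prompt: bytes, n_new: int) -> bytes:
    """Autoregressive byte generation; only a fixed-size state is carried across steps."""
    out = bytearray(prompt)
    for b in prompt[:-1]:                          # absorb all but the last prompt byte into the state
        state, _ = step_fn(state, b)
    for _ in range(n_new):
        state, logits = step_fn(state, out[-1])    # consume the most recent byte
        out.append(int(np.argmax(logits)))         # greedy decoding for simplicity
    return bytes(out)
```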

These experiments establish token-free models such as MambaByte as viable alternatives to traditional tokenizer-dependent models. They also chart a path toward end-to-end learning from raw byte sequences for future language models, promising efficiency gains and better generalization across diverse textual formats.

Authors (4)
  1. Junxiong Wang
  2. Tushaar Gangavarapu
  3. Jing Nathan Yan
  4. Alexander M. Rush