
MambaByte: Token-free Selective State Space Model (2401.13660v3)

Published 24 Jan 2024 in cs.CL and cs.LG

Abstract: Token-free language models learn directly from raw bytes and remove the inductive bias of subword tokenization. Operating on bytes, however, results in significantly longer sequences. In this setting, standard autoregressive Transformers scale poorly as the effective memory required grows with sequence length. The recent development of the Mamba state space model (SSM) offers an appealing alternative approach with a fixed-sized memory state and efficient decoding. We propose MambaByte, a token-free adaptation of the Mamba SSM trained autoregressively on byte sequences. In terms of modeling, we show MambaByte to be competitive with, and even to outperform, state-of-the-art subword Transformers on language modeling tasks while maintaining the benefits of token-free language models, such as robustness to noise. In terms of efficiency, we develop an adaptation of speculative decoding with tokenized drafting and byte-level verification. This results in a $2.6\times$ inference speedup to the standard MambaByte implementation, showing similar decoding efficiency as the subword Mamba. These findings establish the viability of SSMs in enabling token-free language modeling.
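
The efficiency result above comes from speculative decoding with a subword drafter and byte-level verification. The sketch below illustrates only the general idea under loose assumptions: `draft_subwords`, `to_bytes`, and `byte_model_logprobs` are hypothetical stand-ins, and the greedy acceptance rule is a simplification rather than the paper's exact verification scheme.

```python
# Conceptual sketch of tokenized drafting with byte-level verification.
# All model calls are hypothetical placeholders, not the released implementation.
import numpy as np

def speculative_step(prefix: bytes, k_draft: int) -> bytes:
    draft_tokens = draft_subwords(prefix, k_draft)       # fast subword drafter proposes k tokens
    draft_bytes = to_bytes(draft_tokens)                 # detokenize the draft into raw bytes
    logprobs = byte_model_logprobs(prefix, draft_bytes)  # byte-level model scores all drafted bytes in one pass
    accepted = bytearray()
    for i, b in enumerate(draft_bytes):
        verified = int(np.argmax(logprobs[i]))
        if verified != b:                                # stop at the first disagreement...
            accepted.append(verified)                    # ...and keep the verifier's byte instead
            break
        accepted.append(b)
    return prefix + bytes(accepted)                      # long accepted drafts amortize the verifier's per-step cost
```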

Introduction to MambaByte

In language modeling, there is a shift away from subword tokenization towards token-free models that learn directly from raw bytes. This shift, however, poses a fundamental challenge: byte sequences are far longer than their subword counterparts, which strains architectures such as the Transformer, whose attention mechanism scales quadratically with sequence length. Researchers are therefore exploring alternative architectures that can manage the computational load of such long sequences while matching or surpassing the performance of traditional subword-based models.
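
To make the scaling argument concrete, the rough comparison below uses made-up sequence lengths and model sizes (assuming roughly four bytes per subword for English text) to contrast the quadratic cost of self-attention with the linear, fixed-state cost of an SSM.

```python
# Back-of-the-envelope cost comparison; all sizes are illustrative assumptions.

def attention_flops(seq_len: int, d_model: int) -> int:
    # QK^T scores plus the weighted sum over values: two L x L x d products.
    return 2 * seq_len * seq_len * d_model

def ssm_flops(seq_len: int, d_state: int, d_model: int) -> int:
    # One constant-size state update per position.
    return seq_len * d_state * d_model

subword_len, byte_len = 2048, 8192      # assume ~4 bytes per subword
d_model, d_state = 1024, 16             # hypothetical widths

print(attention_flops(byte_len, d_model) / attention_flops(subword_len, d_model))        # ~16x more attention work on bytes
print(ssm_flops(byte_len, d_state, d_model) / ssm_flops(subword_len, d_state, d_model))  # only ~4x more SSM work
```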

MambaByte: An Efficient Token-Free Model

To address these challenges, MambaByte has been proposed: a token-free adaptation of the Mamba state space model that operates as an autoregressive language model directly on byte sequences. Because the Mamba architecture scales linearly with sequence length and carries a fixed-size memory state, MambaByte sidesteps the computational issues that hamstring Transformers at byte scale. Its design is also simple: the existing Mamba architecture is used without modification, indicating that its efficiency carries over directly to byte-level language modeling.
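
As a rough picture of what a byte-level Mamba layer computes, the NumPy sketch below runs a single selective-SSM recurrence over a byte string with simplified discretization. The shapes, projections, and initialization are illustrative assumptions, not the released architecture, which stacks many such layers with gating and convolutions.

```python
# Minimal sketch of a selective state space recurrence over raw bytes.
# Sizes and parameterization are illustrative assumptions only.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_state = 64, 16                           # hypothetical widths

A = -np.exp(rng.normal(size=(d_model, d_state)))    # fixed negative "decay" parameters
W_delta = rng.normal(size=(d_model,)) * 0.1         # input-dependent step size
W_B = rng.normal(size=(d_model, d_state)) * 0.1     # input-dependent B projection
W_C = rng.normal(size=(d_model, d_state)) * 0.1     # input-dependent C projection
embed = rng.normal(size=(256, d_model)) * 0.02      # byte embedding table (one row per byte value)

def ssm_step(h, x):
    """One recurrent step: h has shape (d_model, d_state), x has shape (d_model,)."""
    delta = np.logaddexp(0.0, W_delta * x)           # softplus keeps step sizes positive
    B = x @ W_B                                      # (d_state,), computed from the input (the "selective" part)
    C = x @ W_C                                      # (d_state,)
    A_bar = np.exp(delta[:, None] * A)               # discretized state transition, values in (0, 1)
    B_bar = delta[:, None] * B[None, :]              # simplified discretized input matrix
    h = A_bar * h + B_bar * x[:, None]               # fixed-size state update
    y = h @ C                                        # (d_model,) readout for this byte
    return h, y

h = np.zeros((d_model, d_state))
for byte in b"MambaByte":                            # bytes in, one fixed-size state carried through
    h, y = ssm_step(h, embed[byte])
print(y.shape)                                       # (d_model,): one output vector per byte
```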

Empirical Evaluation

In a comparative evaluation, MambaByte was tested against a suite of leading architectures, including byte-level Transformer and PerceiverAR models, under a fixed parameter and compute budget across several text datasets. MambaByte reached better performance with less compute and exceeded the efficiency of byte-level Transformers, thanks to its linear scaling in sequence length. It was also competitive with, and in some cases superior to, state-of-the-art subword Transformers. These results, measured in bits per byte (BPB), indicate that the token-free approach does not compromise model performance.
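
Bits per byte normalizes likelihood by the raw byte count, which lets byte-level and subword models be compared on equal footing regardless of tokenizer compression. The snippet below shows the standard conversion; the losses and counts are made-up placeholders, not numbers from the paper.

```python
# Bits-per-byte (BPB) conversion; all numbers below are illustrative placeholders.
import math

def bits_per_byte(total_nll_nats: float, num_bytes: int) -> float:
    """Total negative log-likelihood in nats over a corpus, normalized per byte in bits."""
    return total_nll_nats / (num_bytes * math.log(2))

# Byte-level model: the NLL is already accumulated over bytes.
byte_model_bpb = bits_per_byte(total_nll_nats=0.80 * 1_000_000, num_bytes=1_000_000)

# Subword model: accumulate NLL over subword tokens, but still normalize by raw bytes,
# so tokenizers with different compression rates remain comparable.
subword_loss_per_token, num_tokens, num_bytes = 2.90, 250_000, 1_000_000
subword_bpb = bits_per_byte(subword_loss_per_token * num_tokens, num_bytes)

print(f"byte-level BPB ~ {byte_model_bpb:.3f}, subword BPB ~ {subword_bpb:.3f}")
```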

Generative Capabilities and Potential Impact

Perhaps the most notable facet of MambaByte is its capability for fast text generation. Unlike Transformer models, which must cache key/value representations of the full context for autoregressive inference, MambaByte takes a constant-time generation step by evolving a single fixed-size hidden state per layer through time. This enables faster text generation and makes the model practical to deploy.
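
The decoding loop below sketches why this matters: generation carries only a fixed-size state forward rather than a cache that grows with every generated byte. `step_fn` and the greedy sampler are placeholders for an actual trained model.

```python
# Sketch of recurrent byte-level decoding with constant per-step memory.
# `step_fn` is a hypothetical (state, byte) -> (state, logits) model call.
import numpy as np

def generate_recurrent(step_fn, state, prompt: bytes, n_new: int) -> bytes:
    """Autoregressive byte generation; only a fixed-size state is carried across steps."""
    out = bytearray(prompt)
    for b in prompt[:-1]:                          # absorb all but the last prompt byte into the state
        state, _ = step_fn(state, b)
    for _ in range(n_new):
        state, logits = step_fn(state, out[-1])    # consume the most recent byte
        out.append(int(np.argmax(logits)))         # greedy decoding for simplicity
    return bytes(out)
```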

These experiments establish token-free models such as MambaByte as viable alternatives to traditional tokenizer-dependent models. They also chart a path toward end-to-end learning from raw byte sequences for future language models, promising efficiency gains and better generalization across diverse textual formats.

Authors (4)
  1. Junxiong Wang
  2. Tushaar Gangavarapu
  3. Jing Nathan Yan
  4. Alexander M. Rush