
MambaByte Model: Token-Free Byte-Level SSM

Updated 6 January 2026
  • MambaByte Model is a token-free byte-level language model that leverages input-dependent selective SSM layers for efficient next-byte prediction and robust inference.
  • It removes subword tokenization by operating directly on UTF-8 bytes, enabling fixed memory usage and mitigating quadratic scaling challenges of transformers.
  • Empirical results show that MambaByte matches or outperforms subword Transformers in modeling quality and inference speed, with superior noise robustness and long-context stability.

MambaByte is a token-free LLM based on the Selective State Space Model (SSM) architecture, designed to learn directly from raw byte sequences without the inductive bias imposed by subword tokenization. MambaByte employs input-dependent SSM layers with fixed-size memory, enabling modeling of byte-level sequences efficiently while mitigating the quadratic scaling limitations found in standard autoregressive Transformers. It is trained autoregressively for next-byte prediction and incorporates a novel adaptation of speculative decoding that accelerates byte-level inference by leveraging a subword drafter and byte-level verifier. Empirical results show that MambaByte is competitive with, and in certain regimes surpasses, leading subword Transformer and SSM models in both modeling quality and efficiency, with marked advantages in robustness to noise and efficiency on long contexts (Wang et al., 2024).

1. Selective State Space Architecture

MambaByte is built on the Mamba selective SSM as originally introduced by Gu & Dao (2023). Each SSM layer is governed by the continuous-time equations

$$\frac{d}{dt}h(t) = A\,h(t) + B(t)\,x(t), \qquad y(t) = C(t)\,h(t),$$

where $h(t)\in\mathbb{R}^n$ is the hidden state, $x(t)\in\mathbb{R}$ the input signal, and $A\in\mathbb{R}^{n\times n}$ a diagonal dynamics matrix. Uniquely, Mamba employs input-dependent selectivity for the input matrix $B(t)$, the output matrix $C(t)$, and the discretization step $\Delta$.

Discretization at times $t_k=\sum_{j=1}^k \Delta[j]$, using a zero-order hold, yields

$$h[k] = \bar{A}[k]\,h[k-1] + \bar{B}[k]\,x[k], \qquad y[k] = \bar{C}[k]\,h[k], \tag{1}$$

with $\bar{A}[k] = \exp(A\,\Delta[k])$, $\bar{B}[k] = A^{-1}(\bar{A}[k] - I)\,B[k]$, and $\bar{C}[k] = C[k]$, all reparameterized as input-dependent functions.

The selectivity is effected via compact "routing" networks:

$$\Delta[k]=\mathrm{softplus}(R\,x[k]), \qquad B[k]=W_B\,x[k], \qquad C[k]=W_C\,x[k],$$

where $R\in\mathbb{R}^{r\times d}$, $W_B\in\mathbb{R}^{n\times d}$, and $W_C\in\mathbb{R}^{n\times d}$. A stack of $L$ such SSM layers, each with residual gated connections and intermediate feed-forward networks, composes the MambaByte backbone. Notably, each layer's state dimension $n$ is independent of the sequence length, maintaining fixed memory regardless of input size.
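The discretized recurrence and routing networks above can be sketched in a few lines of NumPy. This is an illustrative single-channel toy under stated assumptions, not Mamba's fused kernel: `R`, `W_B`, `W_C`, and the diagonal `A_diag` are stand-in weights, and $\Delta$ is reduced to a scalar per step.

```python
import numpy as np

def routing(x, R, W_B, W_C):
    """Input-dependent SSM parameters Delta[k], B[k], C[k] (stand-in weights)."""
    delta = np.logaddexp(0.0, R @ x)   # softplus(R x); here r = 1, so a single scalar
    B = W_B @ x                        # (n,) input projection
    C = W_C @ x                        # (n,) output projection
    return float(delta[0]), B, C

def ssm_step(h, x_scalar, A_diag, delta, B, C):
    """One zero-order-hold step of Eq. (1) with diagonal A."""
    A_bar = np.exp(A_diag * delta)              # exp(A Delta), elementwise on the diagonal
    B_bar = (A_bar - 1.0) / A_diag * B          # A^{-1}(A_bar - I) B for diagonal A
    h = A_bar * h + B_bar * x_scalar            # h[k] = A_bar h[k-1] + B_bar x[k]
    y = C @ h                                   # y[k] = C[k] h[k]
    return h, y
```

In the real model each of the $d$ embedding channels carries its own scalar SSM; the sketch keeps one channel (matching $x(t)\in\mathbb{R}$ in the continuous equation) while the full embedding drives the routing.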

2. Byte-Level Token-Free Modeling

MambaByte removes the subword tokenization step and operates directly on UTF-8 byte sequences. The input step replaces the usual token embeddings with a byte embedding matrix $E\in\mathbb{R}^{256\times d}$, mapping each input byte $b_k\in\{0,\ldots,255\}$ to an embedding

$$x[k] = E[b_k] \in \mathbb{R}^d.$$

These embeddings are propagated through the SSM layers. Despite potentially very long sequences, the fixed hidden state dimensions per layer ensure memory usage and computational cost do not scale with sequence length.

The final layer produces a latent state $s[k]$, projected to logits over the byte vocabulary:

$$\ell[k] = W\,s[k] + b, \qquad p(b_k\mid b_{<k}) = \mathrm{softmax}\bigl(\ell[k]\bigr),$$

enabling exact, autoregressive next-byte prediction.
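The token-free input and output heads are simple enough to sketch directly. A minimal illustration, assuming a 256-row embedding table `E` and output weights `W`, `b` (all hypothetical stand-ins):

```python
import numpy as np

def bytes_to_embeddings(text, E):
    """Map a UTF-8 string to byte embeddings via a (256, d) table -- no tokenizer."""
    byte_ids = np.frombuffer(text.encode("utf-8"), dtype=np.uint8)
    return E[byte_ids]                      # (T, d): one row per raw byte

def next_byte_distribution(s_k, W, b):
    """Project a final-layer state s[k] to p(b_k | b_<k) over all 256 byte values."""
    logits = W @ s_k + b
    z = np.exp(logits - logits.max())       # numerically stable softmax
    return z / z.sum()
```

Note that multi-byte UTF-8 characters simply become several consecutive inputs, which is exactly the inductive-bias-free behavior the token-free design intends.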

3. Autoregressive Training Regime

MambaByte training maximizes the likelihood of the correct next byte via the cross-entropy loss

$$\mathcal{L} = -\sum_{k=1}^T \log p(b_k\mid b_{<k}).$$

Large-scale training uses mixed precision (BF16), the AdamW optimizer with $\beta=(0.9, 0.95)$, gradient-norm clipping at 0.1, a linear warmup of 500 steps, and cosine annealing of the learning rate. No dropout is used. Models are trained on random windows of 8,192 bytes.
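The training objective, and its conversion to the bits-per-byte (bpb) metric reported in the evaluations below, can be sketched as follows (an illustrative implementation, not the paper's training code):

```python
import numpy as np

def next_byte_loss(logits, targets):
    """Cross-entropy L = -sum_k log p(b_k | b_<k), computed from raw logits.

    logits: (T, 256) per-position model outputs; targets: (T,) true byte ids.
    """
    shifted = logits - logits.max(axis=-1, keepdims=True)                     # stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].sum()

def bits_per_byte(total_loss, num_bytes):
    """Convert a summed nat-valued loss into the bpb metric."""
    return total_loss / (num_bytes * np.log(2.0))
```

As a sanity check, a model that is uniform over the 256 byte values scores exactly 8 bits per byte.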

The SSM's "parallel scan" algorithm enables forward and backward propagation over a sequence of $T$ bytes in $O(T)$ total work with only $O(n\log T)$ additional overhead, facilitating the processing of long contexts in practice (§2).
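The parallel scan works because the linear recurrence in Eq. (1) is associative: two consecutive steps $(a_1, b_1)$ then $(a_2, b_2)$ compose into $(a_1 a_2,\; a_2 b_1 + b_2)$, so prefixes can be combined tree-wise in $O(\log T)$ depth. A minimal sketch for one scalar channel, assuming a power-of-two length (real implementations use a fused GPU kernel):

```python
import numpy as np

def sequential_scan(a, b):
    """Reference recurrence: h[k] = a[k] * h[k-1] + b[k], with h[-1] = 0."""
    h, out = 0.0, []
    for ak, bk in zip(a, b):
        h = ak * h + bk
        out.append(h)
    return np.array(out)

def parallel_scan(a, b):
    """Log-depth scan over associative pairs (a, b); len(a) must be a power of two."""
    T = len(a)
    if T == 1:
        return a.copy(), b.copy()
    # Combine neighbouring steps, then solve the half-size problem recursively.
    a2 = a[0::2] * a[1::2]
    b2 = b[0::2] * a[1::2] + b[1::2]
    A, B = parallel_scan(a2, b2)            # prefixes at odd positions 1, 3, 5, ...
    out_a, out_b = np.empty(T), np.empty(T)
    out_a[1::2], out_b[1::2] = A, B
    out_a[0], out_b[0] = a[0], b[0]
    out_a[2::2] = A[:-1] * a[2::2]          # prefix through 2i-1, composed with step 2i
    out_b[2::2] = B[:-1] * a[2::2] + b[2::2]
    return out_a, out_b                     # out_b[k] equals h[k]
```

The recursive calls here model the log-depth combine tree; on hardware, each level runs as one batched elementwise operation.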

4. Efficient Speculative Decoding

Decoding strictly one byte at a time is inherently slower than subword-level generation. To address this, MambaByte integrates an adaptation of speculative decoding through a dual-model scheme:

  • A smaller subword-level Mamba model ("drafter") generates $m$ subwords rapidly in tokenized space.
  • The drafted subwords are detokenized to bytes, after which the full MambaByte (“verifier”) evaluates the byte sequence probabilities in parallel.

For each batch:

  • A drafted subword sequence $\tilde{s}_{1:m}$ is sampled.
  • The sequence is byte-encoded as $\tilde{b}_{1:n}$.
  • Byte-level probabilities $p_i(\cdot)$ are computed for all $\tilde{b}_i$ in one parallel SSM scan.
  • The maximal prefix length $c$ is found for which each $\tilde{b}_i$ is among the top-$\beta$ candidates of $p_i$.
  • The accepted prefix $\tilde{b}_{1:c}$ is committed; from position $c+1$ onward, bytes are regenerated autoregressively until a token boundary.
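The verification step above can be sketched as a prefix-acceptance rule. The helper below is hypothetical (the actual implementation works on the verifier's fused scan outputs), but it captures the top-$\beta$ acceptance logic:

```python
import numpy as np

def accept_prefix(drafted_bytes, verifier_probs, beta):
    """Length of the longest drafted prefix the byte-level verifier accepts.

    drafted_bytes: (n,) byte ids proposed by the subword drafter.
    verifier_probs: (n, 256) byte distributions from one parallel verifier scan.
    beta: acceptance width -- a drafted byte is kept only if it ranks among the
    top-beta bytes of the verifier's distribution at that position.
    """
    c = 0
    for i, byte_id in enumerate(drafted_bytes):
        top_beta = np.argsort(verifier_probs[i])[-beta:]   # top-beta candidate bytes
        if byte_id not in top_beta:
            break                              # first rejection ends the accepted prefix
        c = i + 1
    return c                                   # bytes beyond c are resampled autoregressively
```

A larger $\beta$ accepts more drafted bytes per round (fewer slow autoregressive steps) at the cost of drifting further from the verifier's own greedy distribution.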

Empirically, this yields an inference speedup of approximately $2.6\times$ for speculative decoding versus standard bytewise decoding (Table 7), closely matching the efficiency of pure subword-level generation.

5. Empirical Performance and Robustness

Comprehensive evaluation demonstrates that MambaByte is highly competitive on several axes:

  • On the PG19 task with 8,192-byte context, a 353M parameter MambaByte achieves 0.93 bits/byte (bpb), surpassing MegaByte-758M (1.00 bpb) at equivalent computational budgets (Table 1).
  • On large-scale PG19 (150B bytes), a 1.03B parameter MambaByte achieves 39.6 → 33.0 word-level perplexity using a 16,384-byte sliding window, matching or exceeding leading subword models (Table 2).
  • Length extrapolation: MambaByte sustains its bits-per-byte performance well beyond its 8,192-byte training window, holding stable through at least 32k–64k bytes, unlike Transformers, which degrade markedly (Fig 3).
  • Noise robustness: On various text corruptions (byte-drop, repeat, uppercase, random-case, "antspeak"), MambaByte exhibits significantly lower degradation in perplexity than comparable subword SSMs (e.g., $+8.5$ vs $+16.9$ at 5% byte drop; Table 4).
  • Generation speed: Native bytewise MambaByte generation is approximately three times faster than MegaByte (29 seconds vs 93 seconds for an 8,192-byte sample; Table 6); with speculative decoding, its throughput approaches that of subword-level models (Table 7).

6. Implications and Limitations

Operating at the byte level eliminates biases from subword tokenization, substantially enhancing robustness to out-of-vocabulary tokens and noisy text. The architectural strategy of maintaining a fixed-size hidden memory avoids the quadratic attention cost of transformers, instead scaling with $\sum_{l=1}^L n_l d_l$, independently of sequence length.

However, certain limitations persist: decoding remains inherently serial at the byte level (speculative decoding alleviates some of this inefficiency), and finite-dimensional recurrent states may eventually saturate on extremely long sequences. Future research directions include augmenting the SSM with larger state dimensions, richer gating, and hierarchical structure, as well as integration with retrieval mechanisms and assessment of token-free few-shot capabilities.

In summary, MambaByte establishes that a token-free, byte-level SSM can achieve or surpass state-of-the-art modeling and generation efficiency of subword-based transformers and SSMs, without reliance on token-patching or compression heuristics (Wang et al., 2024).

References

  • Wang, J., Gangavarapu, T., Yan, J. N., & Rush, A. M. (2024). MambaByte: Token-free Selective State Space Model.
