
MambaByte: Token-Free Byte-Level Language Model

Updated 9 February 2026
  • MambaByte is a token-free, byte-level language model that processes raw byte sequences using a Selective State Space Model for enhanced efficiency and robustness.
  • It incorporates a speculative decoding mechanism that speeds up inference by drafting subword tokens then verifying them at the byte level.
  • Benchmark evaluations show MambaByte achieves competitive or superior performance and improved noise robustness compared to subword Transformer models.

MambaByte is a token-free, byte-level language model based on the Selective State Space Model (SSM) introduced by Gu & Dao (2023), designed to learn directly from raw byte sequences without subword tokenization. It adapts the Mamba SSM architecture to address the scalability and robustness limitations of autoregressive Transformers operating on long byte sequences. MambaByte demonstrates competitive to superior performance compared to state-of-the-art subword Transformer models on language modeling benchmarks, provides improved robustness to input perturbations, and achieves O(L) computational efficiency by processing sequences with fixed-size recurrent memory. The architecture further incorporates an efficient, byte-level speculative decoding mechanism to alleviate sequential bottlenecks during inference, establishing SSMs as viable candidates for token-free language modeling (Wang et al., 2024).

1. Model Architecture: Selective State Space Model Adaptation

MambaByte builds on the Mamba Selective SSM, formulating language modeling as an autoregressive sequence task over raw bytes (vocabulary size 256). At its core, the continuous-time SSM employed is:

\frac{d}{dt}h(t) = A h(t) + B(t) x(t), \qquad y(t) = C(t) h(t),

where h(t) \in \mathbb{R}^n is the hidden state, x(t) \in \mathbb{R} the scalar input, A \in \mathbb{R}^{n \times n} a fixed diagonal matrix, and B(t), C(t) input- and time-dependent parameters.

After discretization, the recurrence becomes:

h[k] = \bar{A}[k] h[k-1] + \bar{B}[k] x[k], \qquad y[k] = \bar{C}[k] h[k],

with the discretized parameters \bar{A}[k], \bar{B}[k], \bar{C}[k] computed via selective, input-dependent gating:

\Delta[k] = \mathrm{softplus}(W_\Delta(W_R x[k])), \qquad \bar{A}[k] = \exp(\Delta[k] A), \qquad \bar{B}[k] = \Delta[k]\, W_B x[k], \qquad \bar{C}[k] = W_C x[k],

where W_\Delta \in \mathbb{R}^{d \times r}, W_R \in \mathbb{R}^{r \times d} with r \ll d, and W_B, W_C \in \mathbb{R}^{n \times d}. The softplus nonlinearity enforces \Delta[k] > 0.
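The selective recurrence can be sketched in a few lines of NumPy. This is a minimal single-sequence reference implementation, assuming the zero-order-hold discretization \bar{A} = exp(ΔA) with a simplified Euler step for \bar{B} (as in the Mamba paper); all weights below are random placeholders, not trained parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, r, L = 16, 64, 4, 32            # state size, model dim, low rank, sequence length

# Input-dependent projections (random placeholders for illustration)
W_R = rng.normal(0, 0.1, (r, d))      # low-rank down-projection
W_Delta = rng.normal(0, 0.1, (d, r))  # up-projection to per-channel step sizes
W_B = rng.normal(0, 0.1, (n, d))
W_C = rng.normal(0, 0.1, (n, d))
A = -np.exp(rng.normal(0, 1, n))      # fixed diagonal, negative entries for stability

def softplus(z):
    return np.log1p(np.exp(z))

x = rng.normal(0, 1, (L, d))          # stand-in byte embeddings for one sequence
h = np.zeros((d, n))                  # per-channel hidden states (fixed-size memory)
ys = []
for k in range(L):
    delta = softplus(W_Delta @ (W_R @ x[k]))     # Delta[k] > 0, shape (d,)
    B = W_B @ x[k]                               # input-dependent B[k], shape (n,)
    C = W_C @ x[k]                               # input-dependent C[k], shape (n,)
    A_bar = np.exp(delta[:, None] * A[None, :])  # zero-order hold: exp(Delta * A)
    B_bar = delta[:, None] * B[None, :]          # simplified Euler step for B
    h = A_bar * h + B_bar * x[k][:, None]        # h[k] = A_bar h[k-1] + B_bar x[k]
    ys.append(h @ C)                             # y[k] = C h[k], shape (d,)
y = np.stack(ys)                                 # (L, d)
```

Note that the memory footprint of `h` never grows with `L`, which is the source of the O(L) training and constant-step decoding claims below.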

Each Mamba layer internally maintains per-channel SSMs of state size n_\mathrm{state} for a residual stream of width d. Model configurations examined include:

| Scale  | Layers | Dim (d) | Expansion (e) | n_state | Low-Rank (r) | Params |
|--------|--------|---------|---------------|---------|--------------|--------|
| Medium | 53     | 1024    | 2             | 16      | 64           | 353 M  |
| Large  | 48     | 2304    | 2             | 16      | 144          | 1.6 B  |

Each layer applies a small (k = 4) 1D convolution before the SSM and an elementwise gate (Swish/GELU) after.

2. Autoregressive Byte-Level Training

MambaByte is trained with standard cross-entropy minimization over a 256-way byte softmax, optimizing bits per byte (BPB). The primary pretraining datasets are PG19 (11.7 B), Stories (34 B), Books (108 B), ArXiv (60 B), and Code (677 B), with all sizes given in bytes. During training:

  • Context length: 8,192 consecutive bytes, sampled randomly per document.
  • Batch size: 48; computation uses BF16 mixed precision on NVIDIA A100 GPUs.
  • Optimizer: AdamW with β1=0.9\beta_1=0.9, β2=0.95\beta_2=0.95, linear learning-rate warm-up (500 steps), cosine decay, and no dropout.
  • Gradient norm clipped at 0.1.
  • Total bytes seen: ~30 B (medium), ~150 B (large).

This approach dispenses entirely with subword preprocessing, eliminating vocabulary selection, quantization, and patching.
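The byte-level objective is easy to state concretely. Below is a minimal sketch, assuming a model that emits one 256-way logit vector per position; here a dummy uniform model stands in, whose loss must come out to exactly log2(256) = 8 bits per byte:

```python
import numpy as np

def bits_per_byte(logits, target_bytes):
    """Cross-entropy over the 256-way byte softmax, reported in bits."""
    logits = logits - logits.max(axis=-1, keepdims=True)            # stable softmax
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    nll_nats = -log_probs[np.arange(len(target_bytes)), target_bytes]
    return nll_nats.mean() / np.log(2)                              # nats -> bits

text = "MambaByte is token-free."
targets = np.frombuffer(text.encode("utf-8"), dtype=np.uint8)  # raw bytes, values 0-255
uniform_logits = np.zeros((len(targets), 256))                 # dummy uniform model
print(bits_per_byte(uniform_logits, targets))                  # 8.0 (= log2 256)
```

There is no tokenizer in this pipeline: `encode("utf-8")` plus `np.frombuffer` is the entire input preparation.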

3. Empirical Evaluation and Benchmarks

MambaByte was comprehensively benchmarked against subword and byte-level baselines under matched compute budgets, evaluating both model quality (BPB, PPL) and robustness.

Performance

| Model            | Context | Bytes Trained | PG19 (BPB ↓) | Stories | Books | ArXiv | Code  |
|------------------|---------|---------------|--------------|---------|-------|-------|-------|
| Transformer-320M | 1024    | 80 B          | 1.057        | 1.064   | 1.097 | 0.816 | 0.575 |
| PerceiverAR-248M | 8192    | 80 B          | 1.104        | 1.070   | 1.104 | 0.791 | 0.546 |
| MegaByte-1B      | 8192    | 80 B          | 1.000        | 0.978   | 1.007 | 0.678 | 0.411 |
| MambaByte-353M   | 8192    | 30 B          | 0.930        | 0.908   | 0.966 | 0.663 | 0.396 |

Note: MambaByte used 0.63× the training compute/data of MegaByte.
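The BPB figures above and the word-level perplexities in the next table are related by the standard conversion: word-level PPL = 2^(BPB × bytes-per-word). A tiny sketch, where the bytes-per-word ratio is an illustrative assumption rather than a number from the paper:

```python
def ppl_from_bpb(bpb, bytes_per_word):
    # Total bits for a text = bpb * n_bytes, so the implied word-level
    # perplexity is 2 ** (bpb * average bytes per word).
    return 2.0 ** (bpb * bytes_per_word)

# Sanity check: 1 bit/byte over 8-byte "words" implies PPL 256.
print(ppl_from_bpb(1.0, 8.0))

# Illustrative: 0.930 BPB at an assumed ~5.4 bytes/word maps to PPL in the low 30s.
print(round(ppl_from_bpb(0.930, 5.4), 1))
```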

On PG19 (large scale):

| Model                        | Params | Bytes Trained | Test PPL ↓ |
|------------------------------|--------|---------------|------------|
| Compressive Transf. (36-layer) | 490 M  | 400 B         | 33.6       |
| Block-Recurrent              | 1.3 B  | —             | 26.5       |
| MegaByte-1.65B               | 1.65 B | 400 B         | 36.4       |
| MambaByte-1.6B               | 1.6 B  | 150 B         | 33.0       |

Despite shorter training, MambaByte achieves superior performance relative to prior byte-level LMs and matches subword baselines.

Robustness to Noise

On PG19, under synthetic perturbations:

| Noise (prob.) | ΔPPL (subword Mamba) | ΔPPL (MambaByte) |
|---------------|----------------------|------------------|
| Drop 0.3      | +213.2               | +31.7            |
| Swap 0.3      | +630.6               | +28.7            |
| Antspeak      | +58300.0             | +28.3            |

Byte-level MambaByte displays strong robustness compared to subword SSMs, maintaining much lower PPL increases under significant noise corruption.
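The drop and swap corruptions above can be reproduced with a few lines; the exact noise definitions in the paper may differ, so this is a plausible reconstruction, and "antspeak" is rendered here as space-separated uppercase characters (an assumption about that perturbation):

```python
import random

def drop_noise(text, p, rng):
    # Delete each character independently with probability p.
    return "".join(c for c in text if rng.random() >= p)

def swap_noise(text, p, rng):
    # Swap each adjacent pair with probability p, left to right.
    chars = list(text)
    for i in range(len(chars) - 1):
        if rng.random() < p:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def antspeak(text):
    # Assumed rendering: spaced-out capitals, e.g. "hi" -> "H I".
    return " ".join(text.upper())

rng = random.Random(0)
print(drop_noise("language modeling", 0.3, rng))
print(swap_noise("language modeling", 0.3, rng))
print(antspeak("hi"))
```

A byte-level model sees these corruptions as modest edits to the byte stream, whereas a subword tokenizer fragments the corrupted words into rare token sequences, which is consistent with the PPL gap in the table.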

Length Extrapolation

MambaByte, trained on sequences of up to 8,192 bytes, maintains flat performance for contexts up to 32 kB and beyond, whereas comparable Transformer architectures begin to degrade past 8 kB, indicating improved context-length generalization.

4. Speculative Decoding: Efficient Large-Scale Byte Generation

Inference in byte-level autoregressive SSMs is inherently sequential. To alleviate this bottleneck, MambaByte integrates a two-stage speculative decoding procedure:

  1. Drafting: A compact, subword-tokenized Mamba (M_\mathrm{subword}) drafts m subword tokens.
  2. Verification: Drafted subwords are mapped to bytes and evaluated in parallel by the byte-level MambaByte (M_\mathrm{byte}).
  3. Bifurcation Detection: Identify the first “bifurcation byte” c at which the drafted and verified bytes diverge in probability ranking (a drafted byte falls outside the top-\beta under M_\mathrm{byte}).
  4. Correction: From c, bytes are generated autoregressively with M_\mathrm{byte} until a word boundary.
  5. State Caching: Hidden states of both models are cached across iterations for efficiency.

Pseudocode specification:

Algorithm 1: Speculative decoding iteration
    Inputs: (M_subword, M_byte, s_{1:t}, h^sub_t, h^byte_t)
    1. Draft subwords \tilde s_{1:m} with M_subword, caching hidden states \bar h_j
    2. Map to drafted bytes \tilde b_{1:n}; run a parallel scan with M_byte over positions 1..n
    3. Find c = min { i : rank_{p_i}(\tilde b_i) > β }
    4. Autoregressively correct from \tilde b_c with M_byte until a word boundary
    5. Return verified bytes and updated hidden states
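The bifurcation test in step 3 reduces to a per-position rank comparison. A toy sketch, where the verifier's byte distributions are hand-built stand-ins rather than real model outputs:

```python
import numpy as np

def first_bifurcation(drafted_bytes, verifier_probs, beta):
    """Index of the first drafted byte falling outside the verifier's top-beta."""
    for i, b in enumerate(drafted_bytes):
        # Rank of byte b under the verifier at position i (0 = most likely).
        rank = int((verifier_probs[i] > verifier_probs[i, b]).sum())
        if rank >= beta:
            return i                    # bifurcation byte: correct from here on
    return len(drafted_bytes)           # whole draft accepted

rng = np.random.default_rng(0)
probs = np.full((4, 256), 1e-4)                  # toy verifier distributions
probs[np.arange(4), [104, 105, 33, 10]] += 0.9   # verifier strongly prefers these bytes
probs /= probs.sum(-1, keepdims=True)

# With beta = 1, a drafted byte is accepted only if it is the verifier's argmax.
print(first_bifurcation([104, 105, 33, 10], probs, beta=1))  # 4: all accepted
print(first_bifurcation([104, 200, 33, 10], probs, beta=1))  # 1: diverges at index 1
```

Because verification is a parallel scan over the whole drafted byte span, each iteration replaces n sequential byte steps with one scan plus a short autoregressive correction.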

Empirical results for 8,192-byte generation on A100:

  • MambaByte (baseline): 93 s
  • With subword speculation: 2.6× faster
  • Log-likelihood fidelity vs. greedy decoding: 0.89 (0.10 for subword Mamba alone)

This approach achieves comparable decoding efficiency to subword Mamba with greatly improved generation fidelity.

5. Computational Benefits and Limitations

Benefits

  • Memory and Computation: Fixed-size recurrent memory (n_\mathrm{state} \times d \times \#\text{layers}), independent of context length, supports true O(L) training and constant-cost per-step decoding.
  • Token-Free Robustness: Complete removal of subword tokenization avoids subword-induced bias and yields strong robustness to typographical errors and adversarial noise.
  • Parallelization: Layerwise parallel scans enable training acceleration, and the model architecture closely resembles optimized RNN recurrences.
  • Performance: BPB and PPL per FLOP and per parameter are competitive with or superior to contemporary Transformer architectures.
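The fixed-memory claim is easy to check numerically from the configuration table. A quick sketch, assuming the SSM states live in the expanded inner width e·d (an assumption; the formula above omits the expansion factor):

```python
def ssm_state_floats(n_state, d, n_layers, expansion=2):
    # Recurrent state per sequence: n_state values per channel, per layer,
    # counted in the expanded inner width e*d (assumed placement of the states).
    return n_state * expansion * d * n_layers

medium = ssm_state_floats(16, 1024, 53)   # MambaByte-353M
large = ssm_state_floats(16, 2304, 48)    # MambaByte-1.6B
print(medium, large)
print(f"large-model state ~ {large * 2 / 2**20:.2f} MiB in BF16, independent of context length")
```

A few MiB of state per sequence, versus a KV cache that grows linearly in context length for a Transformer, is the practical content of the O(L) claim.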

Limitations and Future Directions

  • Inference Serialism: Pure byte-level decoding remains fundamentally sequential—speculative decoding reduces but does not eliminate this.
  • Long-Range Dependencies: Reliance entirely on recurrent dynamics for global context; augmentations like sparse attention or hierarchical patching are prospective research directions.
  • Extensibility: Natural extensions include multimodal (audio/image/video), code, multilingual tasks, and exploration of alternative state-space parameterizations such as HiPPO and S4 variants.

6. Significance within Language Modeling

MambaByte establishes that selective SSMs provide a practical, token-free alternative to subword Transformers, retaining or surpassing their performance while substantially improving model robustness and scaling properties. The architecture confirms that byte-level language modeling is not only feasible but also competitive in terms of efficiency and quality, with speculative decoding further bridging the remaining inference efficiency gap (Wang et al., 2024).
