MambaByte: Token-Free Byte-Level Language Model
- MambaByte is a token-free, byte-level language model that processes raw byte sequences using a Selective State Space Model for enhanced efficiency and robustness.
- It incorporates a speculative decoding mechanism that speeds up inference by drafting subword tokens then verifying them at the byte level.
- Benchmark evaluations show MambaByte achieves competitive or superior performance and improved noise robustness compared to subword Transformer models.
MambaByte is a token-free, byte-level language model based on the Selective State Space Model (SSM) introduced by Gu & Dao (2023), designed to learn directly from raw byte sequences without subword tokenization. It specifically adapts the Mamba SSM architecture to address the scalability and robustness limitations inherent in autoregressive Transformers operating on long byte sequences. MambaByte demonstrates competitive or superior performance compared to state-of-the-art subword Transformer models on language modeling benchmarks, provides improved robustness to input perturbations, and achieves computational efficiency by processing sequences with fixed-size recurrent memory. The architecture further incorporates an efficient, byte-level speculative decoding mechanism to ameliorate sequential bottlenecks during inference, establishing SSMs as viable candidates for token-free language modeling (Wang et al., 2024).
1. Model Architecture: Selective State Space Model Adaptation
MambaByte builds on the Mamba Selective SSM, formulating language modeling as an autoregressive sequence task over raw bytes (vocabulary size 256). At its core, the continuous-time SSM employed is:

$$h'(t) = \mathrm{A}\,h(t) + \mathrm{B}\,x(t), \qquad y(t) = \mathrm{C}\,h(t),$$

where $h(t) \in \mathbb{R}^n$ (hidden state), $x(t) \in \mathbb{R}$ (scalar input), $\mathrm{A} \in \mathbb{R}^{n \times n}$ (fixed diagonal), and $\mathrm{B}, \mathrm{C}$ are input- and time-dependent parameters.

After discretization with step size $\Delta_k$, the recurrence becomes:

$$h_k = \bar{\mathrm{A}}\,h_{k-1} + \bar{\mathrm{B}}\,x_k, \qquad y_k = \mathrm{C}\,h_k, \qquad \bar{\mathrm{A}} = \exp(\Delta_k \mathrm{A}), \quad \bar{\mathrm{B}} = (\Delta_k \mathrm{A})^{-1}\bigl(\exp(\Delta_k \mathrm{A}) - \mathrm{I}\bigr)\,\Delta_k \mathrm{B},$$

with $\Delta_k$ computed via selective gating governed by:

$$\Delta_k = \mathrm{softplus}\bigl(W_\Delta (W_R\, x_k) + b_\Delta\bigr),$$

where $W_R \in \mathbb{R}^{r \times d}$ (low-rank, $r \ll d$), and $W_\Delta \in \mathbb{R}^{d \times r}$. The softplus nonlinearity enforces $\Delta_k > 0$.
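The discretized recurrence and the gating can be sketched in a few lines of NumPy. This is a single-channel illustration with hypothetical shapes and names, not the paper's fused-kernel implementation; in Mamba, these per-channel scalar recurrences run in parallel across all channels.

```python
import numpy as np

def softplus(z):
    """Numerically safe softplus; guarantees the returned step size is positive."""
    return np.logaddexp(0.0, z)

def gated_delta(x_vec, W_R, W_delta, b_delta):
    """Selective gating for the step size: delta = softplus(W_delta (W_R x) + b)."""
    return softplus(W_delta @ (W_R @ x_vec) + b_delta)

def selective_ssm_step(h, x, A_diag, B, C, delta):
    """One discretized step of a single-channel selective SSM.

    h: (n,) state; x: scalar input; A_diag: (n,) fixed diagonal of A (negative
    entries for stability); B, C: (n,) input-dependent projections;
    delta: positive step size from the softplus gate.
    """
    A_bar = np.exp(delta * A_diag)            # zero-order hold: exp(delta * A)
    B_bar = (A_bar - 1.0) / A_diag * B        # (delta A)^-1 (exp(delta A) - I) delta B, diagonal case
    h_new = A_bar * h + B_bar * x
    y = float(C @ h_new)
    return h_new, y
```

Because $\mathrm{A}$ is diagonal with negative entries and $\Delta_k > 0$, every entry of $\bar{\mathrm{A}}$ lies in $(0, 1)$, so the recurrence is stable for arbitrarily long sequences.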
Each Mamba layer internally maintains per-channel SSMs of state size $n$ for a residual stream of width $d$. Model configurations examined include:
| Scale | Layers | Dim ($d$) | Expansion | State ($n$) | Low‑Rank ($r$) | Params |
|---|---|---|---|---|---|---|
| Medium | 53 | 1024 | 2 | 16 | 64 | 353 M |
| Large | 48 | 2304 | 2 | 16 | 144 | 1.6 B |
Each layer applies a short (width-4) causal 1D convolution before the SSM and an elementwise SiLU (Swish) gate after.
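The layer structure just described (short causal convolution, per-channel SSM scan, multiplicative gate) can be sketched as follows. Input/output projections and the residual connection of the real block are omitted for brevity, and the SSM scan is passed in as a callable; all names are illustrative.

```python
import numpy as np

def silu(z):
    """SiLU / Swish activation used for the elementwise gate."""
    return z / (1.0 + np.exp(-z))

def mamba_block(x, conv_w, gate_w, ssm_scan):
    """Schematic Mamba layer: causal depthwise 1D conv -> SSM -> SiLU gate.

    x: (L, d) sequence; conv_w: (k, d) depthwise kernel (k = 4 here);
    gate_w: (d, d) projection for the gating branch;
    ssm_scan: callable mapping (L, d) -> (L, d), the per-channel SSM scan.
    """
    L, d = x.shape
    k = conv_w.shape[0]
    padded = np.vstack([np.zeros((k - 1, d)), x])   # left-pad => strictly causal conv
    conv = np.stack([(padded[i:i + k] * conv_w).sum(axis=0) for i in range(L)])
    return ssm_scan(conv) * silu(x @ gate_w)        # gated SSM output
```

Causality is preserved end to end: position $i$ of the output depends only on inputs up to $i$, which is what makes the autoregressive byte factorization valid.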
2. Autoregressive Byte-Level Training
MambaByte employs standard cross-entropy minimization across a 256-way byte softmax to optimize for bits per byte (BPB). The primary pretraining datasets are PG19 (11.7 B), Stories (34 B), Books (108 B), ArXiv (60 B), and Code (677 B), where all figures are byte counts. During training:
- Context length: 8,192 consecutive bytes, sampled randomly per document.
- Batch size: 48; computation uses BF16 mixed precision on NVIDIA A100 GPUs.
- Optimizer: AdamW with linear learning-rate warm-up (500 steps), cosine decay, and no dropout.
- Gradient norms are clipped at 0.1.
- Total bytes seen: ~30 B (medium), ~150 B (large).
This approach fully dispenses with subword preprocessing, eliminating vocabulary selection, quantization, or patching.
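Since each softmax target is exactly one byte, bits per byte is simply the mean cross-entropy converted from nats to bits; a subword model would additionally need a tokens-per-byte factor to be comparable. A minimal sketch of the conversion:

```python
import math

def bits_per_byte(total_nll_nats, num_bytes):
    """BPB for a byte-level LM: mean negative log-likelihood in nats over the
    byte sequence, divided by ln 2 to convert nats to bits."""
    return total_nll_nats / num_bytes / math.log(2)

# Sanity check: a uniform model over 256 bytes assigns -ln(1/256) nats per byte,
# which is exactly 8 bits per byte.
uniform_bpb = bits_per_byte(math.log(256) * 1000, 1000)   # -> 8.0
```

This is why byte-level BPB values near 1.0 in the tables below correspond to roughly 8× compression over the raw byte stream.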
3. Empirical Evaluation and Benchmarks
MambaByte was comprehensively benchmarked against subword and byte-level baselines under matched compute budgets, evaluating both model quality (BPB, PPL) and robustness.
Performance
| Model | Context | Bytes Trained | PG19 BPB | Stories | Books | ArXiv | Code |
|---|---|---|---|---|---|---|---|
| Transformer‑320M | 1024 | 80 B | 1.057 | 1.064 | 1.097 | 0.816 | 0.575 |
| PerceiverAR‑248M | 8192 | 80 B | 1.104 | 1.070 | 1.104 | 0.791 | 0.546 |
| MegaByte-1B | 8192 | 80 B | 1.000 | 0.978 | 1.007 | 0.678 | 0.411 |
| MambaByte-353M | 8192 | 30 B | 0.930 | 0.908 | 0.966 | 0.663 | 0.396 |
Note: MambaByte-353M was trained on well under half the bytes of MegaByte-1B (30 B vs. 80 B).
On PG19 (large scale):
| Model | Params | Bytes Trained | Test PPL ↓ |
|---|---|---|---|
| Compressive Transf. (36-lay) | 490 M | 400 B | 33.6 |
| Block-Recurrent | 1.3 B | – | 26.5 |
| MegaByte-1.65B | 1.65 B | 400 B | 36.4 |
| MambaByte-1.6B | 1.6 B | 150 B | 33.0 |
Despite shorter training, MambaByte achieves superior performance relative to prior byte-level LMs and matches subword baselines.
Robustness to Noise
On PG19, under synthetic perturbations:
| Noise (prob.) | Subword Mamba ΔPPL | MambaByte ΔPPL |
|---|---|---|
| Drop 0.3 | +213.2 | +31.7 |
| Swap 0.3 | +630.6 | +28.7 |
| Antspeak | +58300.0 | +28.3 |
Byte-level MambaByte displays strong robustness compared to subword SSMs, maintaining much lower PPL increases under significant noise corruption.
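Perturbations of this kind are straightforward to reproduce. Below is a hypothetical sketch of the drop and swap corruptions operating directly on bytes (the paper's exact noising script is not shown here, and the "Antspeak" transformation is omitted):

```python
import random

def perturb_bytes(data: bytes, p: float, mode: str, seed: int = 0) -> bytes:
    """Synthetic byte-level corruption in the spirit of the robustness benchmark.

    mode="drop": delete each byte independently with probability p.
    mode="swap": exchange adjacent byte pairs with probability p.
    Illustrative only; names and details are assumptions.
    """
    rng = random.Random(seed)
    if mode == "drop":
        return bytes(b for b in data if rng.random() >= p)
    out = bytearray(data)
    i = 0
    while i < len(out) - 1:
        if rng.random() < p:
            out[i], out[i + 1] = out[i + 1], out[i]   # swap the adjacent pair
            i += 2                                    # skip past the swapped pair
        else:
            i += 1
    return bytes(out)
```

Because MambaByte conditions on individual bytes, such corruptions shift its inputs only locally, whereas a subword tokenizer can map a single swapped character to an entirely different token sequence.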
Length Extrapolation
MambaByte, trained on sequences of up to 8,192 bytes, maintains flat performance for contexts of 32,768 bytes and beyond, whereas contemporary Transformer architectures begin to degrade past 8,192 bytes, indicating improved context generalization.
4. Speculative Decoding: Efficient Large-Scale Byte Generation
Inference in byte-level autoregressive SSMs is inherently sequential. To alleviate this bottleneck, MambaByte integrates a two-stage speculative decoding procedure:
- Drafting: A compact, subword-tokenized Mamba ($M_{\text{subword}}$) drafts subword tokens.
- Verification: Drafted subwords are mapped to bytes and evaluated in parallel by the byte-level MambaByte ($M_{\text{byte}}$).
- Bifurcation Detection: Identify the first "bifurcation byte" $\tilde b_c$ at which the drafted and verified bytes diverge in probability ranking (when a drafted byte falls outside the top-$\beta$ candidates under $M_{\text{byte}}$).
- Correction: From $\tilde b_c$, bytes are autoregressively generated with $M_{\text{byte}}$ until a word boundary.
- State Caching: Hidden states of both models are cached for iteration efficiency.
Pseudocode specification:
```
Algorithm 1: Speculative decoding iteration
Inputs: (M_subword, M_byte, s_{1:t}, h^sub_t, h^byte_t)
1. Draft subwords s~_{1:m} with hidden states h~_j
2. Map to drafted bytes b~_{1:n}; parallel scan with M_byte up to n
3. Find c = min { i : rank_{p_i}(b~_i) > beta }
4. Autoregressively correct from b~_c to the next word boundary
5. Return verified bytes and updated states
```
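Under assumed interfaces for the drafter and the verifier (all names here are hypothetical), the accept/bifurcate/correct logic of one iteration might look like:

```python
def speculative_step(drafted_bytes, verify_rank, beta, correct_fn):
    """One speculative-decoding iteration (schematic).

    drafted_bytes: bytes proposed by the subword drafter, mapped to byte level.
    verify_rank(i, b): rank of drafted byte b at position i under the byte
        model's parallel verification scan (1 = most probable).
    beta: acceptance threshold on the rank.
    correct_fn(prefix): byte-level autoregressive generator invoked from the
        bifurcation point until the next word boundary.
    """
    c = len(drafted_bytes)
    for i, byte in enumerate(drafted_bytes):
        if verify_rank(i, byte) > beta:        # bifurcation byte: drafter diverged
            c = i
            break
    accepted = drafted_bytes[:c]
    if c < len(drafted_bytes):                 # correct from the bifurcation onward
        accepted += correct_fn(accepted)
    return accepted
```

Each iteration thus replaces many sequential byte steps with one parallel verification scan, falling back to byte-by-byte generation only past the bifurcation point.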
Empirical results for 8,192-byte generation on A100:
- MambaByte (baseline): 93 s
- With subword speculation: faster
- Log-likelihood fidelity vs. greedy decode: 0.89 (0.10 for subword Mamba alone)
This approach achieves comparable decoding efficiency to subword Mamba with greatly improved generation fidelity.
5. Computational Benefits and Limitations
Benefits
- Memory and Computation: Fixed-size recurrent memory, independent of context length, supports linear-time training and constant memory per decoding step.
- Token-Free Robustness: Complete removal of subword tokenization avoids subword-induced bias and yields strong robustness to typographical errors and adversarial noise.
- Parallelization: Layerwise parallel scans enable training acceleration, and the model architecture closely resembles optimized RNN recurrences.
- Performance: BPB and PPL per training FLOP and per parameter are competitive with or superior to contemporary Transformer architectures.
Limitations and Future Directions
- Inference Serialism: Pure byte-level decoding remains fundamentally sequential—speculative decoding reduces but does not eliminate this.
- Long-Range Dependencies: The model relies entirely on recurrent dynamics for global context; augmentations such as sparse attention or hierarchical patching are prospective research directions.
- Extensibility: Natural extensions include multimodal (audio/image/video), code, multilingual tasks, and exploration of alternative state-space parameterizations such as HiPPO and S4 variants.
6. Significance within Language Modeling
MambaByte establishes that selective SSMs provide a practical, token-free alternative to subword Transformers, retaining or surpassing their performance while substantially improving model robustness and scaling properties. The architecture confirms that byte-level language modeling is not only feasible but also competitive in terms of efficiency and quality, with speculative decoding further bridging the remaining inference efficiency gap (Wang et al., 2024).