MambaByte: Token-Free Byte-Level Language Model
- MambaByte is a token-free, byte-level language model that processes raw byte sequences using a Selective State Space Model for enhanced efficiency and robustness.
- It incorporates a speculative decoding mechanism that speeds up inference by drafting subword tokens then verifying them at the byte level.
- Benchmark evaluations show MambaByte achieves competitive or superior performance and improved noise robustness compared to subword Transformer models.
MambaByte is a token-free, byte-level language model based on the Selective State Space Model (SSM) introduced by Gu & Dao (2023), designed to learn directly from raw byte sequences without subword tokenization. It specifically adapts the Mamba SSM architecture to address the scalability and robustness limitations inherent in autoregressive Transformers operating on long byte sequences. MambaByte demonstrates competitive or superior performance compared to state-of-the-art subword Transformer models on language modeling benchmarks, provides improved robustness to input perturbations, and achieves computational efficiency by processing sequences with fixed-size recurrent memory. The architecture further incorporates an efficient, byte-level speculative decoding mechanism to ameliorate sequential bottlenecks during inference, establishing SSMs as viable candidates for token-free language modeling (Wang et al., 2024).
1. Model Architecture: Selective State Space Model Adaptation
MambaByte builds on the Mamba Selective SSM, formulating language modeling as an autoregressive sequence task over raw bytes (vocabulary size 256). At its core, the continuous-time SSM employed is:

$$h'(t) = \mathrm{A}\,h(t) + \mathrm{B}\,x(t), \qquad y(t) = \mathrm{C}\,h(t),$$

where $h(t) \in \mathbb{R}^n$ (hidden state), $x(t) \in \mathbb{R}$ (scalar input), $\mathrm{A} \in \mathbb{R}^{n \times n}$ (fixed diagonal), and $\mathrm{B}, \mathrm{C}$ are input- and time-dependent parameters.

After discretization with step size $\Delta_k$, the recurrence becomes:

$$h_k = \bar{\mathrm{A}}\,h_{k-1} + \bar{\mathrm{B}}\,x_k, \qquad y_k = \mathrm{C}\,h_k, \qquad \bar{\mathrm{A}} = \exp(\Delta_k \mathrm{A}), \quad \bar{\mathrm{B}} = (\Delta_k \mathrm{A})^{-1}\bigl(\exp(\Delta_k \mathrm{A}) - \mathrm{I}\bigr)\,\Delta_k \mathrm{B},$$

with $\Delta_k$ computed via selective gating governed by:

$$\Delta_k = \mathrm{softplus}\bigl(W_\Delta (W_R\, x_k) + b_\Delta\bigr),$$

where $W_R \in \mathbb{R}^{r \times d}$ (low-rank, $r \ll d$), and $W_\Delta \in \mathbb{R}^{d \times r}$. The softplus nonlinearity enforces $\Delta_k > 0$.
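The discretized recurrence and the gating can be sketched in a few lines of NumPy. This is a single-channel illustration with hypothetical shapes and names, not the paper's fused-kernel implementation; in Mamba, these per-channel scalar recurrences run in parallel across all channels.

```python
import numpy as np

def softplus(z):
    """Numerically safe softplus; guarantees the returned step size is positive."""
    return np.logaddexp(0.0, z)

def gated_delta(x_vec, W_R, W_delta, b_delta):
    """Selective gating for the step size: delta = softplus(W_delta (W_R x) + b)."""
    return softplus(W_delta @ (W_R @ x_vec) + b_delta)

def selective_ssm_step(h, x, A_diag, B, C, delta):
    """One discretized step of a single-channel selective SSM.

    h: (n,) state; x: scalar input; A_diag: (n,) fixed diagonal of A (negative
    entries for stability); B, C: (n,) input-dependent projections;
    delta: positive step size from the softplus gate.
    """
    A_bar = np.exp(delta * A_diag)            # zero-order hold: exp(delta * A)
    B_bar = (A_bar - 1.0) / A_diag * B        # (delta A)^-1 (exp(delta A) - I) delta B, diagonal case
    h_new = A_bar * h + B_bar * x
    y = float(C @ h_new)
    return h_new, y
```

Because $\mathrm{A}$ is diagonal with negative entries and $\Delta_k > 0$, every entry of $\bar{\mathrm{A}}$ lies in $(0, 1)$, so the recurrence is stable for arbitrarily long sequences.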
Each Mamba layer internally maintains per-channel SSMs of state size $n$ for a residual stream of width $d$. Model configurations examined include:
| Scale | Layers | Dim ($d$) | Expansion | State ($n$) | Low‑Rank ($r$) | Params |
|---|---|---|---|---|---|---|
| Medium | 53 | 1024 | 2 | 16 | 64 | 353 M |
| Large | 48 | 2304 | 2 | 16 | 144 | 1.6 B |
Each layer applies a short (width-4) causal 1D convolution before the SSM and an elementwise SiLU (Swish) gate after.
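The layer structure just described (short causal convolution, per-channel SSM scan, multiplicative gate) can be sketched as follows. Input/output projections and the residual connection of the real block are omitted for brevity, and the SSM scan is passed in as a callable; all names are illustrative.

```python
import numpy as np

def silu(z):
    """SiLU / Swish activation used for the elementwise gate."""
    return z / (1.0 + np.exp(-z))

def mamba_block(x, conv_w, gate_w, ssm_scan):
    """Schematic Mamba layer: causal depthwise 1D conv -> SSM -> SiLU gate.

    x: (L, d) sequence; conv_w: (k, d) depthwise kernel (k = 4 here);
    gate_w: (d, d) projection for the gating branch;
    ssm_scan: callable mapping (L, d) -> (L, d), the per-channel SSM scan.
    """
    L, d = x.shape
    k = conv_w.shape[0]
    padded = np.vstack([np.zeros((k - 1, d)), x])   # left-pad => strictly causal conv
    conv = np.stack([(padded[i:i + k] * conv_w).sum(axis=0) for i in range(L)])
    return ssm_scan(conv) * silu(x @ gate_w)        # gated SSM output
```

Causality is preserved end to end: position $i$ of the output depends only on inputs up to $i$, which is what makes the autoregressive byte factorization valid.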
2. Autoregressive Byte-Level Training
MambaByte employs standard cross-entropy minimization across a 256-way byte softmax to optimize for bits per byte (BPB). The primary pretraining datasets are PG19 (11.7 B), Stories (34 B), Books (108 B), ArXiv (60 B), and Code (677 B), where all figures are byte counts. During training:
- Context length: 8,192 consecutive bytes, sampled randomly per document.
- Batch size: 48; computation uses BF16 mixed precision on NVIDIA A100 GPUs.
- Optimizer: AdamW with linear learning-rate warm-up (500 steps), cosine decay, and no dropout.
- Gradient norms are clipped at 0.1.
- Total bytes seen: ~30 B (medium), ~150 B (large).
This approach fully dispenses with subword preprocessing, eliminating vocabulary selection, quantization, or patching.
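Since each softmax target is exactly one byte, bits per byte is simply the mean cross-entropy converted from nats to bits; a subword model would additionally need a tokens-per-byte factor to be comparable. A minimal sketch of the conversion:

```python
import math

def bits_per_byte(total_nll_nats, num_bytes):
    """BPB for a byte-level LM: mean negative log-likelihood in nats over the
    byte sequence, divided by ln 2 to convert nats to bits."""
    return total_nll_nats / num_bytes / math.log(2)

# Sanity check: a uniform model over 256 bytes assigns -ln(1/256) nats per byte,
# which is exactly 8 bits per byte.
uniform_bpb = bits_per_byte(math.log(256) * 1000, 1000)   # -> 8.0
```

This is why byte-level BPB values near 1.0 in the tables below correspond to roughly 8× compression over the raw byte stream.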
3. Empirical Evaluation and Benchmarks
MambaByte was comprehensively benchmarked against subword and byte-level baselines under matched compute budgets, evaluating both model quality (BPB, PPL) and robustness.
Performance
| Model | Context | Bytes Trained | PG19 BPB | Stories | Books | ArXiv | Code |
|---|---|---|---|---|---|---|---|
| Transformer‑320M | 1024 | 80 B | 1.057 | 1.064 | 1.097 | 0.816 | 0.575 |
| PerceiverAR‑248M | 8192 | 80 B | 1.104 | 1.070 | 1.104 | 0.791 | 0.546 |
| MegaByte-1B | 8192 | 80 B | 1.000 | 0.978 | 1.007 | 0.678 | 0.411 |
| MambaByte-353M | 8192 | 30 B | 0.930 | 0.908 | 0.966 | 0.663 | 0.396 |
Note: MambaByte-353M was trained on well under half the bytes of MegaByte-1B (30 B vs. 80 B).
On PG19 (large scale):
| Model | Params | Bytes Trained | Test PPL ↓ |
|---|---|---|---|
| Compressive Transf. (36-lay) | 490 M | 400 B | 33.6 |
| Block-Recurrent | 1.3 B | – | 26.5 |
| MegaByte-1.65B | 1.65 B | 400 B | 36.4 |
| MambaByte-1.6B | 1.6 B | 150 B | 33.0 |
Despite shorter training, MambaByte achieves superior performance relative to prior byte-level LMs and matches subword baselines.
Robustness to Noise
On PG19, under synthetic perturbations:
| Noise (prob.) | Subword Mamba ΔPPL | MambaByte ΔPPL |
|---|---|---|
| Drop 0.3 | +213.2 | +31.7 |
| Swap 0.3 | +630.6 | +28.7 |
| Antspeak | +58300.0 | +28.3 |
Byte-level MambaByte displays strong robustness compared to subword SSMs, maintaining much lower PPL increases under significant noise corruption.
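Perturbations of this kind are straightforward to reproduce. Below is a hypothetical sketch of the drop and swap corruptions operating directly on bytes (the paper's exact noising script is not shown here, and the "Antspeak" transformation is omitted):

```python
import random

def perturb_bytes(data: bytes, p: float, mode: str, seed: int = 0) -> bytes:
    """Synthetic byte-level corruption in the spirit of the robustness benchmark.

    mode="drop": delete each byte independently with probability p.
    mode="swap": exchange adjacent byte pairs with probability p.
    Illustrative only; names and details are assumptions.
    """
    rng = random.Random(seed)
    if mode == "drop":
        return bytes(b for b in data if rng.random() >= p)
    out = bytearray(data)
    i = 0
    while i < len(out) - 1:
        if rng.random() < p:
            out[i], out[i + 1] = out[i + 1], out[i]   # swap the adjacent pair
            i += 2                                    # skip past the swapped pair
        else:
            i += 1
    return bytes(out)
```

Because MambaByte conditions on individual bytes, such corruptions shift its inputs only locally, whereas a subword tokenizer can map a single swapped character to an entirely different token sequence.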
Length Extrapolation
MambaByte, trained on sequences of up to 8,192 bytes, maintains flat performance for contexts of 32,768 bytes and beyond, whereas contemporary Transformer architectures begin to degrade past 8,192 bytes, indicating improved context generalization.
4. Speculative Decoding: Efficient Large-Scale Byte Generation
Inference in byte-level autoregressive SSMs is inherently sequential. To alleviate this bottleneck, MambaByte integrates a two-stage speculative decoding procedure:
- Drafting: A compact, subword-tokenized Mamba ($M_{\text{subword}}$) drafts subword tokens.
- Verification: Drafted subwords are mapped to bytes and evaluated in parallel by the byte-level MambaByte ($M_{\text{byte}}$).
- Bifurcation Detection: Identify the first "bifurcation byte" $\tilde b_c$ at which the drafted and verified bytes diverge in probability ranking (when a drafted byte falls outside the top-$\beta$ candidates under $M_{\text{byte}}$).
- Correction: From $\tilde b_c$, bytes are autoregressively generated with $M_{\text{byte}}$ until a word boundary.
- State Caching: Hidden states of both models are cached for iteration efficiency.
Pseudocode specification:
```
Algorithm 1: Speculative decoding iteration
Inputs: (M_subword, M_byte, s_{1:t}, h^sub_t, h^byte_t)
1. Draft subwords s~_{1:m} with hidden states h~_j
2. Map to drafted bytes b~_{1:n}; parallel scan with M_byte up to n
3. Find c = min { i : rank_{p_i}(b~_i) > beta }
4. Autoregressively correct from b~_c to the next word boundary
5. Return verified bytes and updated states
```
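Under assumed interfaces for the drafter and the verifier (all names here are hypothetical), the accept/bifurcate/correct logic of one iteration might look like:

```python
def speculative_step(drafted_bytes, verify_rank, beta, correct_fn):
    """One speculative-decoding iteration (schematic).

    drafted_bytes: bytes proposed by the subword drafter, mapped to byte level.
    verify_rank(i, b): rank of drafted byte b at position i under the byte
        model's parallel verification scan (1 = most probable).
    beta: acceptance threshold on the rank.
    correct_fn(prefix): byte-level autoregressive generator invoked from the
        bifurcation point until the next word boundary.
    """
    c = len(drafted_bytes)
    for i, byte in enumerate(drafted_bytes):
        if verify_rank(i, byte) > beta:        # bifurcation byte: drafter diverged
            c = i
            break
    accepted = drafted_bytes[:c]
    if c < len(drafted_bytes):                 # correct from the bifurcation onward
        accepted += correct_fn(accepted)
    return accepted
```

Each iteration thus replaces many sequential byte steps with one parallel verification scan, falling back to byte-by-byte generation only past the bifurcation point.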
Empirical results for 8,192-byte generation on A100:
- MambaByte (baseline): 93 s
- With subword speculation: faster
- Log-likelihood fidelity vs. greedy decode: 0.89 (0.10 for subword Mamba alone)
This approach achieves comparable decoding efficiency to subword Mamba with greatly improved generation fidelity.
5. Computational Benefits and Limitations
Benefits
- Memory and Computation: Fixed-size recurrent memory, independent of context length, supports linear-time training and constant memory per decoding step.
- Token-Free Robustness: Complete removal of subword tokenization avoids subword-induced bias and yields strong robustness to typographical errors and adversarial noise.
- Parallelization: Layerwise parallel scans enable training acceleration, and the model architecture closely resembles optimized RNN recurrences.
- Performance: BPB and PPL per training FLOP and per parameter are competitive with or superior to contemporary Transformer architectures.
Limitations and Future Directions
- Inference Serialism: Pure byte-level decoding remains fundamentally sequential—speculative decoding reduces but does not eliminate this.
- Long-Range Dependencies: The model relies entirely on recurrent dynamics for global context; augmentations such as sparse attention or hierarchical patching are prospective research directions.
- Extensibility: Natural extensions include multimodal (audio/image/video), code, multilingual tasks, and exploration of alternative state-space parameterizations such as HiPPO and S4 variants.
6. Significance within Language Modeling
MambaByte establishes that selective SSMs provide a practical, token-free alternative to subword Transformers, retaining or surpassing their performance while substantially improving model robustness and scaling properties. The architecture confirms that byte-level language modeling is not only feasible but also competitive in terms of efficiency and quality, with speculative decoding further bridging the remaining inference efficiency gap (Wang et al., 2024).