Distilled Byte LMs
- Distilled Byte LMs are neural architectures that operate directly on byte sequences by transferring knowledge from token-level models.
- They use a two-stage distillation process (intermediate alignment and byte-level fine-tuning) to recover over 90% of teacher performance with far fewer data bytes.
- Specialized designs like H-Net and Bolmo leverage dynamic chunking and boundary prediction to achieve robust and tokenizer-free language modeling across diverse scripts.
Distilled byte-level language models (Byte LMs, or BLMs) are neural architectures that operate directly on byte sequences rather than subword or word tokens, and are obtained by distilling conventional token-level LLMs. By avoiding fixed subword tokenization, byte-level LMs sidestep out-of-vocabulary issues and handle diverse scripts, nonstandard text, and adversarial perturbations more gracefully. However, training a competitive BLM from scratch demands far more data (on the order of trillions of bytes) and compute, because input sequences are longer and the inductive biases are weaker. Distilled BLMs reduce these requirements by reusing the capabilities of existing token-level LLMs: tailored distillation routines transfer the teacher's knowledge, which greatly improves data efficiency and lowers the barrier to adopting tokenizer-free language modeling (Bao et al., 1 Feb 2026, Minixhofer et al., 17 Dec 2025).
1. Motivation for Byte-Level LLMs
Standard autoregressive LMs rely on subword tokenization schemes such as BPE or WordPiece. Such schemes introduce vulnerabilities, including out-of-vocabulary tokens, inconsistent or language-biased segmentation, and brittleness to nonstandard or corrupted input. Byte-level modeling abolishes these artifacts, providing:
- Uniform granularity: All character sets and scripts are treated equivalently at the byte level.
- Robustness: Reduced sensitivity to typos, unusual code points, adversarial corruptions, and exotic script composition.
- Universality: No “unknown” tokens; all content expressible as a byte sequence is modelable.
Despite these advantages, training performant models directly on raw bytes is cost-prohibitive. Sequence lengths are 2–4× longer than their subword-tokenized equivalents, which makes self-attention computationally expensive, and optimization is harder in the absence of a subword-level inductive bias (Bao et al., 1 Feb 2026, Minixhofer et al., 17 Dec 2025).
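As a rough illustration of the universality and sequence-length points above, the following sketch encodes a few strings to UTF-8 and compares character and byte counts. The sample strings and the helper names `byte_stats` and `attention_cost_ratio` are illustrative assumptions, not taken from the cited papers, and the quadratic cost estimate assumes plain full self-attention.

```python
# Every string maps to a sequence of values in 0..255, so no "unknown" symbol is needed.
samples = {
    "english": "tokenizer",
    "german": "Straße",
    "japanese": "東京",
    "emoji": "🙂",
}

def byte_stats(text):
    """Return character count, UTF-8 byte count, and the byte IDs of a string."""
    data = text.encode("utf-8")
    return {"chars": len(text), "bytes": len(data), "ids": list(data)}

def attention_cost_ratio(length_ratio):
    """Relative cost of full self-attention when the sequence grows by length_ratio."""
    return length_ratio ** 2

stats = {name: byte_stats(text) for name, text in samples.items()}

for name, s in stats.items():
    print(f"{name}: {s['chars']} chars -> {s['bytes']} bytes")

# A 2-4x longer sequence implies roughly 4-16x more pairwise attention interactions.
print(attention_cost_ratio(2), attention_cost_ratio(4))
```

Non-Latin scripts expand more than ASCII text under UTF-8, which is one reason the byte-to-token length ratio varies across languages.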
2. Distillation Approaches to Byte LMs
To avoid the trillion-byte data requirement, recent work introduces systematic knowledge distillation routines for “byteifying” existing LLMs:
- Two-Stage Distillation: Both “Distilling Token-Trained Models into Byte-Level Models” (Bao et al.) and Bolmo employ two-phase protocols.
  - Intermediate alignment: The byte-model learns to reproduce intermediate representations and output distributions of the teacher LLM in the teacher’s token space, via progressive alignment objectives.
  - Byte-space fine-tuning: After sufficient alignment, the byte LM learns byte-level next-token prediction, with optionally unfrozen parameters for full end-to-end adaptation (Bao et al., 1 Feb 2026, Minixhofer et al., 17 Dec 2025).
- Model architectures: A hierarchical encoder–decoder structure (“H-Net”) in (Bao et al., 1 Feb 2026) and the Latent-Tokenizer LM (LTLM) paradigm in Bolmo (Minixhofer et al., 17 Dec 2025) allow condensed processing of long byte sequences and dynamic chunking to maintain performance and efficiency.
This approach recovers over 90% of the token-level teacher's performance on common downstream tasks while requiring only a few hundred billion bytes of data instead of trillions.
3. Technical Distillation Protocols
3.1 Progressive Knowledge Distillation (Bao et al.)
The distillation process in (Bao et al., 1 Feb 2026) is sequential, with losses introduced in progression:
- Embedding Alignment: Minimize a distance between teacher token embeddings and student byte-derived states at token end positions.
- Joint Distillation: Minimize the KL divergence between teacher and student next-token distributions at teacher token boundaries.
- Boundary Learning: The router predicts token boundaries on bytes via binary cross-entropy. This is implemented using a dynamic chunker (“one-byte lookahead routing”).
A curriculum schedules these phases in sequence so that the student specializes step by step: embedding alignment (~5–10B bytes), joint distillation (~10–20B bytes), boundary learning (~5B bytes).
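The three objectives and their scheduling can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the embedding distance is taken to be mean squared error, the KL direction is teacher-to-student, the phase budgets use the upper ends of the quoted ranges, and all function names are hypothetical.

```python
import math

def mse(a, b):
    """Mean squared error between two equal-length vectors (assumed distance)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions given as lists (teacher = p, assumed)."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

def bce(pred, target, eps=1e-12):
    """Binary cross-entropy with clamping for numerical safety."""
    pred = min(max(pred, eps), 1.0 - eps)
    return -(target * math.log(pred) + (1.0 - target) * math.log(1.0 - pred))

def embedding_alignment_loss(teacher_embs, student_states, token_ends):
    """Compare each teacher token embedding with the student state at that token's last byte."""
    losses = [mse(t, student_states[pos]) for t, pos in zip(teacher_embs, token_ends)]
    return sum(losses) / len(losses)

def joint_distillation_loss(teacher_dists, student_dists):
    """Average KL over positions that coincide with teacher token boundaries."""
    losses = [kl_divergence(t, s) for t, s in zip(teacher_dists, student_dists)]
    return sum(losses) / len(losses)

def boundary_loss(router_probs, token_ends):
    """BCE between per-byte router probabilities and 0/1 teacher boundary labels."""
    ends = set(token_ends)
    losses = [bce(p, 1.0 if i in ends else 0.0) for i, p in enumerate(router_probs)]
    return sum(losses) / len(losses)

# Hypothetical byte budgets per phase (upper ends of the ranges quoted in the text).
PHASES = [("embedding_alignment", 10e9), ("joint_distillation", 20e9), ("boundary_learning", 5e9)]

def active_phase(bytes_seen):
    """Return the phase being introduced after bytes_seen training bytes."""
    consumed = 0.0
    for name, budget in PHASES:
        consumed += budget
        if bytes_seen < consumed:
            return name
    return "byte_space_finetuning"
```

For example, `active_phase(15e9)` returns `"joint_distillation"` under these assumed budgets. Whether earlier losses stay active once later ones are introduced is not modeled here.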
3.2 Bolmo’s Byteification (Minixhofer et al., 17 Dec 2025)
Bolmo’s routine involves:
- Boundary Prediction Loss: Supervises patch (token) boundary placement to reflect the teacher’s subword tokenization.
- Local Encoder Distillation: Pooled byte encoder states are aligned to teacher subword embeddings up to a certain Transformer depth.
- Patch-Level Likelihood Matching: A temperature-scaled binary cross-entropy loss encourages the likelihoods that the student assigns to byte-segmented subword “patches” to match those of the teacher model.
- Transition to Byte-Only Generation: After alignment, all weights are unfrozen and standard byte-level next-token prediction objectives guide further training.
The objectives are constructed so that their minimum corresponds to perfect imitation of the teacher, which drives the student toward near-exact reproduction of the teacher's token-level distributions.
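The idea behind patch-level likelihood matching can be sketched as below. The likelihood of a patch under the byte model follows from the chain rule over its bytes, and a binary cross-entropy with the teacher's probability as a soft target is minimized when the two probabilities agree. The way temperature is applied here (dividing log-probabilities) and the function names are assumptions for illustration; the exact form used in Bolmo may differ.

```python
import math

def patch_log_likelihood(byte_log_probs):
    """Chain rule: log p(patch) is the sum of the per-byte conditional log-probabilities."""
    return sum(byte_log_probs)

def soft_bce(student_prob, teacher_prob, eps=1e-12):
    """Binary cross-entropy with the teacher probability as a soft target."""
    s = min(max(student_prob, eps), 1.0 - eps)
    return -(teacher_prob * math.log(s) + (1.0 - teacher_prob) * math.log(1.0 - s))

def patch_matching_loss(byte_log_probs, teacher_token_log_prob, temperature=1.0):
    """Match the student's patch likelihood to the teacher's token likelihood.

    Assumption: temperature scales both log-probabilities before exponentiation.
    """
    student_p = math.exp(patch_log_likelihood(byte_log_probs) / temperature)
    teacher_p = math.exp(teacher_token_log_prob / temperature)
    return soft_bce(student_p, teacher_p)

# Teacher assigns probability 0.25 to a two-byte token.
teacher_lp = math.log(0.25)
matched = patch_matching_loss([math.log(0.5), math.log(0.5)], teacher_lp)     # student p = 0.25
mismatched = patch_matching_loss([math.log(0.5), math.log(0.8)], teacher_lp)  # student p = 0.40
print(matched < mismatched)
```

Because the soft-target cross-entropy attains its minimum only when the student probability equals the teacher probability, driving this loss down pushes the byte model toward the teacher's token-level likelihoods.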
4. Model Architectures and Implementation
Distilled Byte LMs employ specialized architectures to efficiently process long byte sequences:
| Component | H-Net (Bao et al., 1 Feb 2026) | Bolmo (Minixhofer et al., 17 Dec 2025) |
|---|---|---|
| Encoder | 4 Mamba2 layers + FFN; chunk_size=256 | mLSTM + SwiGLU for local; patch boundary predictor |
| Chunk/patch routing | One-byte lookahead similarity | 1-byte lookahead (boundary predictor via attention) |
| Core transformer | Copied from teacher LLM | Frozen (Stage 1), then unfrozen, from teacher LM |
| Decoder | Mamba2 + FFN + JBP/MBP head | 4× mLSTM + SwiGLU local decoder |
| Byte head | Linear head to 256–512 byte values | Linear projection to 256 (512 with boundary-fusion) |
Key architectural features include:
- Hierarchical encoding for scalable attention over long byte sequences, mitigating the quadratic cost of self-attention (Bao et al., 1 Feb 2026).
- Dynamic chunking/pooling using one-byte lookahead to approximate subword boundaries, enabling fast compression and efficient inference (Minixhofer et al., 17 Dec 2025).
- Embedding projection from byte to token space for intermediate alignment (Bao et al., 1 Feb 2026).
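A minimal sketch of similarity-based dynamic chunking followed by pooling is given below. It assumes that a chunk boundary is placed wherever the next byte's state differs enough from the current one (measured by cosine similarity) and that each chunk is mean-pooled into one vector. The real routers in H-Net and Bolmo are learned modules, so the boundary rule, the `threshold` parameter, and the pooling choice here are illustrative assumptions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors; returns 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

def boundary_probs(states):
    """Probability that each byte starts a new chunk, from adjacent-state dissimilarity."""
    probs = [1.0]  # the first byte always starts a chunk
    for prev, cur in zip(states, states[1:]):
        probs.append(0.5 * (1.0 - cosine(prev, cur)))
    return probs

def chunk_and_pool(states, probs, threshold=0.5):
    """Group byte states into chunks at predicted boundaries and mean-pool each chunk."""
    chunks = []
    for state, p in zip(states, probs):
        if p >= threshold or not chunks:
            chunks.append([state])
        else:
            chunks[-1].append(state)
    return [[sum(col) / len(chunk) for col in zip(*chunk)] for chunk in chunks]

# Four byte states that form two visibly different groups.
states = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
probs = boundary_probs(states)
pooled = chunk_and_pool(states, probs)
print(probs, pooled)
```

The core transformer then runs on the shorter pooled sequence (two vectors instead of four here), which is where the compute savings come from.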
5. Experimental Results and Data Efficiency
Distilled Byte LMs achieve strong empirical results, with significant data and compute savings:
- Data efficiency: The protocol in (Bao et al., 1 Feb 2026) recovers more than 90% of teacher performance with only 125B training bytes, whereas previous SFT-only byte models required 200–500B bytes and lost more accuracy.
- MMLU Scores: On Llama-3.2 3B teacher, MMLU drops from 56.0 (teacher) to 51.8 (student after Stage 2; 92.5% retention). For Qwen-3 4B: 73.0 → 68.5 (93.8% retention) (Bao et al., 1 Feb 2026).
- Ablation studies: Excluding embedding alignment results in catastrophic degradation, e.g., LMB scores dropping from 70 to 31 (Bao et al., 1 Feb 2026). Bolmo similarly demonstrates the necessity of structured two-stage distillation and specialized encoders (Minixhofer et al., 17 Dec 2025).
- Character and code tasks: Bolmo 7B attains 75.1% on character understanding (78.6% on CUTE), compared to 56.0% for its subword teacher. On HumanEval, pass@1 / pass@16 is 40.6 / 74.7 versus 49.0 / 71.1 for the teacher (Minixhofer et al., 17 Dec 2025).
- Inference speed: Dynamic boundary merging allows Bolmo to reach throughput competitive with subword LMs, with decoding rates exceeding subword baselines at higher compression ratios (Minixhofer et al., 17 Dec 2025).
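The retention percentages quoted for MMLU follow directly from the ratio of student to teacher scores. The short check below reproduces that arithmetic from the figures in the list; the `retention` helper is just a convenience name.

```python
def retention(student_score, teacher_score):
    """Percentage of the teacher's score that the distilled student retains."""
    return 100.0 * student_score / teacher_score

# (teacher MMLU, student MMLU after Stage 2), as reported in the text.
reported = {
    "Llama-3.2 3B": (56.0, 51.8),
    "Qwen-3 4B": (73.0, 68.5),
}

results = {name: retention(student, teacher) for name, (teacher, student) in reported.items()}

for name, pct in results.items():
    print(f"{name}: {pct:.1f}% retained")
```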
6. Cross-Tokenizer Distillation and Likelihood Scoring
Distillation with mismatched tokenizers (e.g., when the student is byte-level and the teacher is BPE) requires careful alignment of likelihoods. Phan et al. (16 Dec 2025) introduce:
- Recursive likelihood scoring that leverages BPE merge trees, enabling exact KL divergence computation for subset student vocabularies with a single forward pass per token, i.e., O(1) model evaluations.
- General-case scoring using lossless recursive cover enumeration or beam-search approximations for arbitrary vocabulary mappings.
These algorithms permit distillation between subword and byte LMs without auxiliary losses or ad-hoc heuristics, preserving the statistical fidelity of the process and enabling edge-device deployment via vocabulary trimming.
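To make the role of the merge tree concrete, the sketch below shows only the decomposition step: a teacher token that is missing from a trimmed student vocabulary is expanded recursively through its BPE merge rule until every piece is in the student vocabulary, and the pieces are then scored with the chain rule. This is a simplified illustration, not the exact scoring algorithm of Phan et al.; the toy merge table, the vocabulary, and the uniform `toy_log_prob` student are all hypothetical.

```python
import math

# Toy BPE merge table: each merged token maps to the (left, right) pair that formed it.
merges = {
    "lo": ("l", "o"),
    "low": ("lo", "w"),
    "er": ("e", "r"),
    "lower": ("low", "er"),
}

# Trimmed student vocabulary: a subset of the teacher's tokens.
student_vocab = {"l", "o", "w", "e", "r", "lo", "er"}

def decompose(token, vocab, merges):
    """Recursively split a token along its merge tree until all pieces are in vocab."""
    if token in vocab:
        return [token]
    if token not in merges:
        raise KeyError(f"cannot decompose {token!r} into the given vocabulary")
    left, right = merges[token]
    return decompose(left, vocab, merges) + decompose(right, vocab, merges)

def toy_log_prob(prefix, piece):
    """Hypothetical student model: uniform over its vocabulary, ignoring the prefix."""
    return math.log(1.0 / len(student_vocab))

def sequence_log_prob(pieces, log_prob):
    """Chain rule over the decomposed pieces."""
    total, prefix = 0.0, []
    for piece in pieces:
        total += log_prob(tuple(prefix), piece)
        prefix.append(piece)
    return total

pieces = decompose("lower", student_vocab, merges)
score = sequence_log_prob(pieces, toy_log_prob)
print(pieces, score)
```

With a byte-level student, the same recursion bottoms out at single bytes, so any teacher token can be scored as a product of byte probabilities.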
7. Practical Implications, Limitations, and Future Directions
Distilled Byte LMs bridge the gap between theoretical byte-model advantages and the practical performance of token LMs:
- Accessibility: ~125B bytes suffice to train models competitive with trillion-byte baselines. Available pre-trained checkpoints and codebases accelerate adoption (Bao et al., 1 Feb 2026).
- Deployment: Byte LMs enable seamless handling of multilingual, corrupted, or adversarial text, and dynamic vocabulary trimming for memory-constrained or edge-device settings (Phan et al., 16 Dec 2025).
- Robustness: Current distilled BLMs lose some robustness after Stage 2 (byte-space fine-tuning) and are sensitive to boundary prediction; improved regularization and boundary modeling are active research areas (Bao et al., 1 Feb 2026, Minixhofer et al., 17 Dec 2025).
- Architecture adaptability: Dynamic chunk pooling, boundary-fusion, and patch prediction are subjects of ongoing optimization. Optimal atomic units beyond UTF-8 bytes (e.g., MYTE, SCRIPT) are under exploration (Minixhofer et al., 17 Dec 2025).
- Post-training compatibility: Parameter-compatible architectures (e.g., Bolmo) allow zero-cost transfer of post-training (instruction tuning, RL) improvements from the source subword LM via weight-difference arithmetic (“Task Arithmetic”), contingent upon embedding compatibility (Minixhofer et al., 17 Dec 2025).
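The weight-difference transfer mentioned in the last item can be sketched as follows: the difference between a post-trained subword model and its base model is added to the matching parameters of the byte model. Parameters are represented as flat lists keyed by name; the parameter names, the `scale` argument, and the rule of skipping parameters without a same-shaped counterpart are assumptions for illustration.

```python
def apply_task_vector(byte_model, base, post_trained, scale=1.0):
    """Add (post_trained - base) to every byte-model parameter with a same-shaped counterpart."""
    merged = dict(byte_model)
    for name, base_w in base.items():
        if name not in byte_model or name not in post_trained:
            continue  # e.g. subword embeddings that the byte model does not have
        if len(byte_model[name]) != len(base_w):
            continue  # shapes differ, so the difference cannot be applied directly
        merged[name] = [
            w + scale * (p - b)
            for w, p, b in zip(byte_model[name], post_trained[name], base_w)
        ]
    return merged

# Hypothetical parameters: a shared core layer plus model-specific components.
byte_model = {"core.w": [1.0, 2.0], "local_encoder.w": [0.5]}
base = {"core.w": [1.0, 1.0], "embed.w": [0.0]}
post_trained = {"core.w": [1.5, 0.5], "embed.w": [1.0]}

merged = apply_task_vector(byte_model, base, post_trained)
print(merged)
```

Only the shared core receives the update in this example; byte-specific modules such as the local encoder are left unchanged, which is why the transfer depends on how compatible the remaining components are.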
This body of work establishes distilled Byte LMs as a practical, competitive alternative to both token-level LLMs and byte-level LLMs trained from scratch, enabling efficient, robust, and universal modeling of raw data streams.