Bolmo Architecture: Byte-Level Transformer
- Bolmo is a byte-level autoregressive Transformer family that eliminates subword tokenization; it is obtained by converting a competitive subword LM through a structured two-stage distillation and fine-tuning process.
- The architecture employs a hybrid embedding scheme with dynamic pooling and a non-causal boundary predictor to achieve efficient segmentation and high inference speeds.
- Empirical results show Bolmo 7B outperforms comparable open byte-level models across STEM, character-understanding, and general benchmarks, demonstrating practical efficacy and strong inference throughput.
Bolmo is a fully open family of byte-level autoregressive Transformer LMs at the 1B and 7B parameter scales, designed to match or surpass the performance of modern subword-level LMs while eliminating subword tokenization constraints. Unlike previous byte-level approaches reliant on training from scratch, Bolmo converts ("byteifies") a competitive subword LM via a structured two-stage distillation and fine-tuning pipeline using less than 1% of the original pretraining token budget. The architecture specifically resolves prior expressivity mismatches between subword and byte-level models, enabling efficient and highly accurate byte-level operation with inference speeds that approach or surpass those of subword LMs (Minixhofer et al., 17 Dec 2025).
1. Architectural Overview
Bolmo is structured around an autoregressive Transformer backbone, operating at two model sizes: approximately 1 billion (Bolmo 1B) and 7 billion (Bolmo 7B) parameters. Input is processed as a sequence of UTF-8 bytes, with no subword tokenizer applied at inference, directly addressing deficiencies in character-level understanding and tokenization bias intrinsic to subword-based models.
The architecture enables dynamic pooling of bytes into variable-length "patches," which both improves efficiency and closes the gap between tokenization-based and byte-level modeling. Bolmo processes each sequence through a staged pipeline: local encoding, boundary prediction and pooling, global Transformer modeling, depooling, and local decoding, culminating in a byte-centric LM head.
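As a concrete illustration of dynamic pooling, the toy Python snippet below groups the UTF-8 bytes of a short string into variable-length patches using a hand-written boundary mask; in Bolmo the boundaries come from the learned boundary predictor described in Section 3, so the mask here is purely illustrative.

```python
# Toy illustration of byte-to-patch pooling. The boundary mask is hand-picked
# for this example; in Bolmo it is produced by the learned boundary predictor.
text = "The cat"
byte_seq = list(text.encode("utf-8"))   # [84, 104, 101, 32, 99, 97, 116]
boundary = [0, 0, 1, 0, 0, 0, 1]        # 1 marks the last byte of a patch

patches, current = [], []
for b, is_last in zip(byte_seq, boundary):
    current.append(b)
    if is_last:
        patches.append(bytes(current).decode("utf-8"))
        current = []

print(patches)  # ['The', ' cat'] -> two variable-length patches
```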
2. Input Representation and Embedding Scheme
To maintain both efficiency and adaptability, Bolmo utilizes a hybrid embedding strategy. Every byte $b_i$ is mapped via a byte embedding table $E_{\text{byte}}$ and combined with a residual subword suffix embedding:
$$e_i = E_{\text{byte}}[b_i] + E_{\text{suffix}}[s_i],$$
where $E_{\text{suffix}}$ is a subword suffix embedding matrix (a sparse hash table) inherited from the teacher subword model vocabulary, and $s_i$ indexes the subword suffix ending at byte $i$. This residual addition introduces high-capacity, teacher-informed subword context for each byte without compromising the integrity of byte-level processing. The combination is empirically shown to accelerate convergence and improve sample efficiency by letting byte-level models leverage the contextual knowledge present in subword embeddings.
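A minimal PyTorch sketch of this hybrid embedding follows. The module name, the hash-bucket count, and the way suffix indices are produced are illustrative assumptions; only the additive byte-plus-suffix structure follows the description above.

```python
import torch
import torch.nn as nn

class HybridByteEmbedding(nn.Module):
    """Byte embedding plus a sparse, hashed subword-suffix embedding (sketch)."""

    def __init__(self, d_model: int, num_hash_buckets: int = 1 << 18):
        super().__init__()
        self.byte_emb = nn.Embedding(256, d_model)                 # E_byte: one row per byte value
        self.suffix_emb = nn.Embedding(num_hash_buckets, d_model)  # E_suffix: hash table over teacher suffixes

    def forward(self, byte_ids: torch.Tensor, suffix_ids: torch.Tensor) -> torch.Tensor:
        # byte_ids, suffix_ids: (batch, seq_len) integer tensors. suffix_ids are
        # assumed to be precomputed hash buckets of the teacher-vocabulary suffix
        # ending at each byte position.
        return self.byte_emb(byte_ids) + self.suffix_emb(suffix_ids)

emb = HybridByteEmbedding(d_model=64)
byte_ids = torch.tensor([[84, 104, 101]])    # bytes of "The"
suffix_ids = torch.tensor([[11, 42, 7]])     # made-up hash buckets for illustration
x = emb(byte_ids, suffix_ids)                # (1, 3, 64) hybrid embeddings
```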
3. Pooling, Transformer Block, and Boundary Prediction
Bolmo employs a variant of the Latent-Tokenizer architecture, drawing on prior frameworks such as Hourglass, DTP, BLT, and H-Net, but with significant adaptations for byteification:
- Local Encoder: A shallow stack of mLSTM + FFN blocks contextualizes the hybrid embeddings, with the pre-normalized residual update rule
  $$h_i \leftarrow h_i + \mathrm{mLSTM}(\mathrm{Norm}(h))_i, \qquad h_i \leftarrow h_i + \mathrm{FFN}(\mathrm{Norm}(h_i)).$$
  The FFN employs the SwiGLU activation and pre-activation normalization.
- Non-causal Boundary Predictor: Unlike causal predictors, which cannot match subword tokenizers' use of future context, Bolmo's prefill boundary predictor uses a one-byte lookahead:
  $$p_i = \sigma\big(f_{\text{bnd}}(h_i, h_{i+1})\big),$$
  where $f_{\text{bnd}}$ is a learned scoring function over the current byte's state and its one-byte lookahead. A threshold on $p_i$ determines patch boundaries (see the sketch after this list).
- Pooling: At each predicted patch boundary, the representation of the patch's last byte is retained as the patch token input to the global Transformer.
- Global Model (Transformer): The patch sequence is processed by a masked self-attention Transformer of the same structure as the source OLMo subword model (e.g., 32 layers, 4096 hidden size, 32-head attention).
- Depooling: For each byte position $i$, the local representation $h_i$ is augmented with the most recent patch hidden state from the global model:
  $$d_i = h_i + W_d \, g_{p(i)},$$
  where $W_d$ is a linear projection and $p(i)$ is the patch containing byte $i$.
- Local Decoder and LM Head: A short mLSTM+FFN stack (e.g., 4 layers for 7B) further contextualizes the depooled representations $d_i$. Output logits are projected to a 512-way softmax, with two entries per byte value: "byte alone" and "byte plus boundary."
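The sketch below ties the boundary predictor, pooling, depooling, and 512-way head together. The MLP scorer, the 0.5 threshold, and the `nn.Identity()` stand-in for the global Transformer are assumptions made for a self-contained example; only the overall flow (one-byte-lookahead scoring, last-byte pooling, linear-projection depooling, byte/boundary head) follows the description above, and the local decoder stack is omitted for brevity.

```python
import torch
import torch.nn as nn

class LatentTokenizerSketch(nn.Module):
    """Boundary prediction -> pooling -> global model -> depooling -> LM head (sketch)."""

    def __init__(self, d_model: int, threshold: float = 0.5):
        super().__init__()
        self.threshold = threshold
        # Non-causal boundary scorer over (h_i, h_{i+1}); an MLP is assumed here.
        self.boundary_mlp = nn.Sequential(
            nn.Linear(2 * d_model, d_model), nn.GELU(), nn.Linear(d_model, 1)
        )
        self.depool_proj = nn.Linear(d_model, d_model)  # W_d in the depooling step
        self.lm_head = nn.Linear(d_model, 512)          # 256 bytes x {alone, +boundary}

    def boundary_probs(self, h: torch.Tensor) -> torch.Tensor:
        # h: (seq_len, d_model). Pair each byte with its one-byte lookahead;
        # the final byte is paired with itself since no lookahead exists.
        h_next = torch.cat([h[1:], h[-1:]], dim=0)
        scores = self.boundary_mlp(torch.cat([h, h_next], dim=-1))
        return torch.sigmoid(scores).squeeze(-1)

    def forward(self, h: torch.Tensor, global_model: nn.Module) -> torch.Tensor:
        is_boundary = self.boundary_probs(h) > self.threshold
        is_boundary[-1] = True                      # force a final patch boundary
        patch_tokens = h[is_boundary]               # pooling: last byte of each patch
        g = global_model(patch_tokens)              # (num_patches, d_model)
        # Depooling: byte i receives the global state of the patch containing it.
        patch_index = torch.cumsum(is_boundary.long(), dim=0) - is_boundary.long()
        d = h + self.depool_proj(g[patch_index])
        return self.lm_head(d)                      # (seq_len, 512); the real model runs the local decoder first

sketch = LatentTokenizerSketch(d_model=64)
h = torch.randn(7, 64)                              # local-encoder outputs for 7 bytes
logits = sketch(h, global_model=nn.Identity())      # nn.Identity() stands in for the global Transformer
```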
4. Distillation and Training Procedure
Bolmo is trained by exact distillation from a high-quality subword LM (OLMo 2 1B or OLMo 3 7B) in a two-stage process:
- Stage 1 (Local Pretraining): With the global Transformer frozen, the local encoder, decoder, boundary predictor, and LM head are trained to reproduce the subword model's behavior. The composite objective (sketched in code after this list) is
  $$\mathcal{L}_{\text{Stage 1}} = \mathcal{L}_{\text{bnd}} + \mathcal{L}_{\text{stitch}} + \mathcal{L}_{\text{patch}} + \mathcal{L}_{\text{byte}},$$
  where:
  - $\mathcal{L}_{\text{bnd}}$: boundary-matching loss
  - $\mathcal{L}_{\text{stitch}}$: local-encoder stitching loss
  - $\mathcal{L}_{\text{patch}}$: patch-level distillation loss (temperature-controlled cross-entropy against the teacher)
  - $\mathcal{L}_{\text{byte}}$: byte-level cross-entropy
- Stage 2 (End-to-End Fine-Tuning): All parameters, including the global Transformer, are jointly fine-tuned, with a lower learning rate for the source (global) model than for the local modules.
- Training Details: Stage 1 uses 9.8B tokens and batch size 32; Stage 2 uses 39.3B tokens, batch size 64, a global learning rate lower than the local one, and 150K steps.
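A minimal sketch of the Stage 1 composite objective is shown below. The equal weighting of the four terms, the MSE form of the stitching loss, and the default temperature are assumptions for illustration; the term names mirror the losses listed above.

```python
import torch.nn.functional as F

def stage1_loss(boundary_logits, teacher_boundaries,         # (N,), (N,) in {0, 1}
                pooled_student_repr, teacher_subword_repr,    # (P, d) patch representations
                student_patch_logits, teacher_patch_logits,   # (P, V) next-patch predictions
                byte_logits, byte_targets,                    # (N, 512), (N,)
                temperature: float = 2.0):
    # Boundary matching: reproduce the teacher tokenizer's segmentation.
    l_bnd = F.binary_cross_entropy_with_logits(boundary_logits, teacher_boundaries.float())
    # Local-encoder stitching: align pooled patch representations with the
    # teacher's subword-level inputs so the frozen global Transformer stays
    # usable (MSE is an assumed choice of distance here).
    l_stitch = F.mse_loss(pooled_student_repr, teacher_subword_repr)
    # Patch-level distillation: temperature-controlled soft cross-entropy
    # against the teacher's next-token distribution.
    t = temperature
    teacher_probs = F.softmax(teacher_patch_logits / t, dim=-1)
    student_logp = F.log_softmax(student_patch_logits / t, dim=-1)
    l_patch = -(teacher_probs * student_logp).sum(dim=-1).mean() * (t * t)
    # Byte-level cross-entropy over the 512-way (byte, boundary) targets.
    l_byte = F.cross_entropy(byte_logits, byte_targets)
    return l_bnd + l_stitch + l_patch + l_byte
```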
5. Expressivity, Capacity Matching, and Compression
Bolmo addresses the historical inability of byte-level LMs to match the expressivity of subword models, particularly regarding token (patch) boundaries and deep contextual representations. The non-causal, lookahead-based boundary predictor reproduces the teacher's segmentation with 99% accuracy, and with these boundaries Bolmo empirically reproduces the teacher's internal representations at every patch.
The system can be trained for variable token compression ratios by merging teacher patches, with higher ratios increasing inference throughput. Since the model's softmax head operates over a fixed 512-symbol vocabulary (each byte with or without a boundary), it avoids the large-vocabulary inefficiencies encountered by subword models and continues to accelerate as compression increases.
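The following back-of-the-envelope sketch illustrates why throughput keeps improving with compression: the large global Transformer is amortized over more bytes per patch, while the cheap local modules and the fixed 512-way head run once per byte. The per-step costs are arbitrary placeholder values, not measured Bolmo figures.

```python
# Relative cost to emit one byte: full local cost plus an amortized share of
# one global-model step. Cost units are arbitrary placeholders.
def relative_decode_cost(bytes_per_patch: float,
                         global_cost_per_patch: float = 32.0,  # e.g. deep global stack
                         local_cost_per_byte: float = 6.0):    # shallow encoder + decoder
    return local_cost_per_byte + global_cost_per_patch / bytes_per_patch

for ratio in (2.0, 4.0, 8.0):
    print(f"{ratio:.0f} bytes/patch -> relative per-byte cost {relative_decode_cost(ratio):.1f}")
# Larger patches amortize the global model over more bytes, so decoding
# throughput rises, while the 512-way output head stays cheap throughout.
```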
6. Performance, Efficiency, and Empirical Results
Bolmo 7B demonstrates significant performance improvements over all prior open-weight byte-level LMs of comparable size, such as EvaByte 6.5B, BLT 7B, and TFree-Hat 7B. Notable results include:
- On STEM multiple-choice tasks, Bolmo 7B exceeds BLT 7B by +16.5 points.
- On character understanding tasks (CUTE, EXECUTE), Bolmo 7B achieves 78.6% accuracy, surpassing the subword teacher's 56.9%.
- On general QA, code, mathematics, and language modeling (e.g., MMLU), Bolmo performance is within 1–2 points of the subword teacher.
- At the base bytes-per-patch ratio, inference speed, measured in bytes per second, is on par with the subword LM's token throughput; at higher compression ratios, Bolmo surpasses the subword baseline in decoding speed.
Empirical ablations confirm that the non-causal boundary predictor is necessary for faithful reproduction of teacher segmentations and representations. Omitting Stage 1 of the training pipeline results in slower convergence and slightly inferior bits-per-byte metrics. Architectural decisions, such as mLSTM usage, are motivated by wall-clock inference efficiency rather than FLOPs per se.
7. Significance and Ecosystem Integration
Bolmo establishes byte-level LMs as practical, highly competitive alternatives to subword LMs in a range of large-scale applications. Its design accommodates efficient post-training and benefits from compatibility with the extensive infrastructure of subword models. The architecture demonstrates that competitive byteification—achieving parity with subword teacher LMs while improving character sensitivity and inference throughput—can be performed at less than 1% of the original pretraining compute budget (Minixhofer et al., 17 Dec 2025).