Bolmo Architecture: Byte-Level Transformer

Updated 18 December 2025
  • Bolmo is a byte-level autoregressive Transformer family that eliminates subword tokenization through a structured two-stage distillation and fine-tuning process.
  • The architecture employs a hybrid embedding scheme with dynamic pooling and a non-causal boundary predictor to achieve efficient segmentation and high inference speeds.
  • Empirical results show Bolmo 7B outperforming prior byte-level models of comparable size across multiple tasks, demonstrating its practical efficacy and enhanced throughput.

Bolmo is a fully open family of byte-level autoregressive Transformer LMs at the 1B and 7B parameter scales, designed to match or surpass the performance of modern subword-level LMs while eliminating subword tokenization constraints. Unlike previous byte-level approaches reliant on training from scratch, Bolmo converts ("byteifies") a competitive subword LM via a structured two-stage distillation and fine-tuning pipeline using less than 1% of the original pretraining token budget. The architecture specifically resolves prior expressivity mismatches between subword and byte-level models, enabling efficient and highly accurate byte-level operation with inference speeds that approach or surpass those of subword LMs (Minixhofer et al., 17 Dec 2025).

1. Architectural Overview

Bolmo is structured around an autoregressive Transformer backbone, operating at two model sizes: approximately 1 billion (Bolmo 1B) and 7 billion (Bolmo 7B) parameters. Input is processed as a sequence of UTF-8 bytes $x \in \{0, \ldots, 255\}^n$, with no subword tokenizer applied at inference, directly addressing deficiencies in character-level understanding and tokenization bias intrinsic to subword-based models.
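
For concreteness, the snippet below (a minimal sketch, not taken from the Bolmo release) shows how a string maps to the raw byte sequence such a model consumes; no tokenizer is involved.

```python
# Illustrative only: a string becomes the UTF-8 byte sequence a byte-level
# model consumes; every byte is an integer in {0, ..., 255}.
text = "Héllo, Bolmo!"
byte_ids = list(text.encode("utf-8"))
print(byte_ids)   # the accented character expands to two bytes: 195, 169
assert all(0 <= b <= 255 for b in byte_ids)
```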

The architecture enables dynamic pooling of bytes into variable-length "patches," which both improves efficiency and closes the gap between tokenization-based and byte-level modeling. Bolmo processes bytes through a multi-stage pipeline: local encoding, boundary prediction and pooling, global Transformer modeling, depooling, and local decoding, culminating in a byte-centric LM head.

2. Input Representation and Embedding Scheme

To maintain both efficiency and adaptability, Bolmo utilizes a hybrid embedding strategy. Every byte $x_i$ is mapped via a byte embedding table $E_b \in \mathbb{R}^{256 \times d}$:

$$e_i^{(0)} = E_b[x_i] + E_{\mathrm{suf}}[\mathrm{subwordSuffix}(x_{1:i})]$$

Here, $E_{\mathrm{suf}}$ is a subword suffix embedding matrix (sparse hash table) inherited from the teacher subword model vocabulary $V_{\text{sub}}$. This residual addition introduces high-capacity, teacher-informed subword context for each byte without compromising the integrity of byte-level processing. This combination is empirically shown to accelerate convergence and improve sample efficiency by enabling byte-level models to leverage contextual knowledge present in subword embeddings.
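
The sketch below illustrates this residual combination in PyTorch, assuming the suffix lookup is precomputed into hashed integer IDs; the dense embedding table standing in for the sparse hash table, the bucket count, and all names are assumptions for illustration, not details of the Bolmo release.

```python
import torch
import torch.nn as nn

class HybridByteEmbedding(nn.Module):
    """Sketch of the hybrid embedding: E_b[x_i] + E_suf[subwordSuffix(x_{1:i})].

    The suffix lookup itself (finding the longest teacher-vocabulary subword
    ending at byte i and hashing it to a bucket) is assumed to be precomputed
    externally; the bucket count below is an illustrative choice.
    """

    def __init__(self, d_model: int, num_suffix_buckets: int = 1 << 18):
        super().__init__()
        self.byte_emb = nn.Embedding(256, d_model)                    # E_b
        self.suffix_emb = nn.Embedding(num_suffix_buckets, d_model)   # E_suf (hashed)

    def forward(self, byte_ids: torch.Tensor, suffix_ids: torch.Tensor) -> torch.Tensor:
        # byte_ids, suffix_ids: (batch, seq_len) integer tensors.
        return self.byte_emb(byte_ids) + self.suffix_emb(suffix_ids)

# Toy usage:
emb = HybridByteEmbedding(d_model=16)
bytes_ = torch.randint(0, 256, (1, 12))
suffixes = torch.randint(0, 1 << 18, (1, 12))
print(emb(bytes_, suffixes).shape)   # torch.Size([1, 12, 16])
```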

3. Pooling, Transformer Block, and Boundary Prediction

Bolmo employs a variant of the Latent-Tokenizer architecture, drawing from prior work such as the Hourglass, DTP, BLT, and H-Net frameworks, but with significant adaptation for byteification:

  • Local Encoder: A shallow mLSTM + FFN stack contextualizes the hybrid embeddings; the mLSTM update rule is:

$$
\begin{aligned}
g_t &= \sigma(W_g [e_t; h_{t-1}] + b_g) \\
c_t &= g_t \odot c_{t-1} + (1 - g_t) \odot \tanh(W_c [e_t; h_{t-1}] + b_c) \\
h_t &= \mathrm{mGate}(c_t) \odot \tanh(c_t)
\end{aligned}
$$

The FFN employs the SwiGLU activation and pre-activation normalization.

  • Non-causal Boundary Predictor: Unlike causal predictors, which cannot match subword tokenizers' use of future context, Bolmo's prefill boundary predictor uses a one-byte lookahead:

$$p_t = \mathcal{B}(\hat e)_t = \frac{1}{2}\left[1 - \frac{(W_q \hat e_{t+1})^\top (W_k \hat e_t)}{\|W_q \hat e_{t+1}\| \, \|W_k \hat e_t\|}\right] \in [0,1]$$

A threshold on $p_t$ determines patch boundaries (see the sketch after this list).

  • Pooling: At each patch boundary, the representation of the patch's final byte is retained as the patch token input to the global Transformer.
  • Global Model (Transformer): The patch sequence is processed by a masked self-attention Transformer of the same structure as the source OLMo subword model (e.g., 32 layers, 4096 hidden size, 32-head attention).
  • Depooling: For each byte position $t$, $\hat e_t$ is augmented by the most recent patch hidden state from the global model:

$$z_t = \hat e_t + P(\hat h_{j(t)})$$

where $P$ is a linear projection and $j(t)$ is the patch containing $t$.

  • Local Decoder and LM Head: A short mLSTM + FFN stack (e.g., 4 layers for 7B) further contextualizes $z_t$. Output logits are projected to a 512-way softmax covering 256 byte values with two options each: "byte alone" and "byte plus boundary."
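
The following self-contained sketch ties together the boundary scoring, pooling, and depooling steps referenced above. The global Transformer is replaced by an identity stand-in, the 0.5 threshold and tensor shapes are illustrative assumptions, and none of the function or variable names come from the Bolmo release.

```python
import torch
import torch.nn.functional as F

def boundary_probs(e_hat: torch.Tensor, W_q: torch.Tensor, W_k: torch.Tensor) -> torch.Tensor:
    """Boundary scores p_t from the one-byte-lookahead cosine rule above.

    e_hat: (n, d) locally encoded byte states; returns (n-1,) scores in [0, 1].
    """
    q = e_hat[1:] @ W_q.T          # W_q e_{t+1}  (lookahead)
    k = e_hat[:-1] @ W_k.T         # W_k e_t
    return 0.5 * (1.0 - F.cosine_similarity(q, k, dim=-1))

def pool_and_depool(e_hat: torch.Tensor, p: torch.Tensor, P: torch.Tensor,
                    threshold: float = 0.5) -> torch.Tensor:
    """Pool the last byte of each patch, run a stand-in global model (identity,
    purely for this sketch), and depool via z_t = e_hat_t + P(h_{j(t)})."""
    # A score above the threshold after byte t closes a patch; the final byte
    # always closes the last patch.
    closes_patch = torch.cat([p > threshold, torch.tensor([True])])
    patch_tokens = e_hat[closes_patch]                     # (num_patches, d)
    h = patch_tokens                                       # global Transformer stand-in
    # j(t): index of the patch containing byte t (patches closed strictly before t).
    j = torch.cumsum(closes_patch.long(), dim=0) - closes_patch.long()
    return e_hat + h[j] @ P.T                              # depooled byte states z_t

# Toy usage with random states (n = 6 bytes, d = 8):
e_hat = torch.randn(6, 8)
W_q, W_k, P = torch.randn(8, 8), torch.randn(8, 8), torch.randn(8, 8)
z = pool_and_depool(e_hat, boundary_probs(e_hat, W_q, W_k), P)
print(z.shape)   # torch.Size([6, 8])
```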

4. Distillation and Training Procedure

Bolmo is trained by exact distillation from a high-quality subword LM (OLMo 2 1B or OLMo 3 7B) in a two-stage process:

  • Stage 1 (Local Pretraining): With the global Transformer frozen, the local encoder, decoder, boundary predictor, and LM head are trained to reproduce the subword model's behavior. The composite objective (sketched in code after this list) is:

$$L_{\text{stage1}} = \lambda_B L_B + \lambda_E L_E + \lambda_{D,\text{distil}} L_{D,\text{distil}} + \lambda_{D,\text{CE}} L_{D,\text{CE}}$$

where:
  - $L_B$: boundary-matching loss
  - $L_E$: local-encoder stitching loss
  - $L_{D,\text{distil}}$: patch-level distillation loss (temperature-controlled cross-entropy with $\tau = 5$)
  - $L_{D,\text{CE}}$: byte-level cross-entropy loss

  • Stage 2 (End-to-End Fine-Tuning): All parameters, including the global Transformer, are jointly fine-tuned, with the source-model (global) parameters using a lower learning rate than the local modules.
  • Training Details: Stage 1 uses 9.8B tokens, batch size 32, and a peak learning rate of $5 \times 10^{-4}$. Stage 2 uses 39.3B tokens, batch size 64, a global LR of $1.8 \times 10^{-5}$, a local LR of $3.7 \times 10^{-5}$, and 150K steps.
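
A hedged sketch of how the Stage 1 composite objective above could be assembled follows. The per-term implementations, the placeholder $\lambda$ defaults, and the $\tau^2$ rescaling convention are assumptions layered on the formula, not the authors' code; the stitching loss $L_E$ is taken as a precomputed scalar because its exact form is not given in this summary.

```python
import torch
import torch.nn.functional as F

def stage1_loss(boundary_logits: torch.Tensor, teacher_boundaries: torch.Tensor,
                student_patch_logits: torch.Tensor, teacher_patch_logits: torch.Tensor,
                byte_logits: torch.Tensor, byte_targets: torch.Tensor,
                stitching_loss: torch.Tensor,
                lambdas=(1.0, 1.0, 1.0, 1.0), tau: float = 5.0) -> torch.Tensor:
    """Sketch of L_stage1 = λ_B L_B + λ_E L_E + λ_{D,distil} L_{D,distil} + λ_{D,CE} L_{D,CE}."""
    lam_b, lam_e, lam_distil, lam_ce = lambdas
    # L_B: match the teacher tokenizer's boundary decisions (float 0/1 targets).
    l_b = F.binary_cross_entropy_with_logits(boundary_logits, teacher_boundaries)
    # L_{D,distil}: temperature-controlled cross-entropy against teacher patch
    # logits (tau = 5); the tau**2 rescaling is the usual distillation convention.
    log_q = F.log_softmax(student_patch_logits / tau, dim=-1)
    p = F.softmax(teacher_patch_logits / tau, dim=-1)
    l_distil = -(p * log_q).sum(dim=-1).mean() * tau ** 2
    # L_{D,CE}: ordinary next-byte cross-entropy over the 512-way output head.
    l_ce = F.cross_entropy(byte_logits, byte_targets)
    return lam_b * l_b + lam_e * stitching_loss + lam_distil * l_distil + lam_ce * l_ce
```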

5. Expressivity, Capacity Matching, and Compression

Bolmo addresses the historical inability of byte-level LMs to match the expressivity of subword models, particularly regarding token (patch) boundaries and deep contextual representations. The non-causal, lookahead-based boundary predictor yields $\gtrsim 99\%$ segmentation accuracy relative to the teacher model's tokenizer. With these boundaries, Bolmo empirically reproduces both the teacher's patch segmentation and its internal representations at every patch.

The system can be trained for variable token compression ratios ($c = \#\text{bytes}/\#\text{patches}$) by merging teacher patches, with high ratios increasing inference throughput. Since the model's softmax head operates over a fixed 512-symbol vocabulary (bytes $\times$ with/without boundary), it avoids the inefficiencies encountered by subword models with large vocabularies ($|\mathcal{V}| \approx 200\text{k}$–$400\text{k}$), and continues to accelerate as compression increases.
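
As a small worked check of these quantities (a sketch with made-up byte and patch counts, not measurements), the snippet below shows the fixed 512-way output factorization and the compression ratio computation.

```python
# The output head is fixed at 256 byte values x {no boundary, boundary} = 512
# symbols, independent of the compression ratio c.
NUM_BYTE_VALUES = 256
OUTPUT_VOCAB = NUM_BYTE_VALUES * 2
assert OUTPUT_VOCAB == 512

def compression_ratio(num_bytes: int, num_patches: int) -> float:
    """c = #bytes / #patches; larger c means fewer global-model steps per byte."""
    return num_bytes / num_patches

print(compression_ratio(1000, 250))   # c = 4.0
print(compression_ratio(1000, 150))   # c ≈ 6.7
```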

6. Performance, Efficiency, and Empirical Results

Bolmo 7B demonstrates significant performance improvements over all prior open-weight byte-level LMs of comparable size, such as EvaByte 6.5B, BLT 7B, and TFree-Hat 7B. Notable results include:

  • On STEM multiple-choice tasks, Bolmo 7B exceeds BLT 7B by 16.5 points.
  • On character understanding tasks (CUTE, EXECUTE), Bolmo 7B achieves 78.6% accuracy, surpassing the subword teacher's 56.9%.
  • On general QA, code, mathematics, and language modeling (e.g., MMLU), Bolmo performance is within 1–2 points of the subword teacher.
  • At $c \approx 4$ bytes/patch, inference speed is $\sim 0.1$ million bytes/sec, on par with the subword LM's token throughput. For $c \gtrsim 6.6$, Bolmo surpasses the subword baseline in decoding speed.

Empirical ablations confirm that the non-causal boundary predictor is necessary for faithful reproduction of teacher segmentations and representations. Omitting Stage 1 of the training pipeline results in slower convergence and slightly worse bits-per-byte. Architectural decisions, such as the use of mLSTM layers, are motivated by wall-clock inference efficiency rather than FLOPs per se.

7. Significance and Ecosystem Integration

Bolmo establishes byte-level LMs as practical, highly competitive alternatives to subword LMs in a range of large-scale applications. Its design accommodates efficient post-training and benefits from compatibility with the extensive infrastructure of subword models. The architecture demonstrates that competitive byteification—achieving parity with subword teacher LMs while improving character sensitivity and inference throughput—can be performed at less than 1% of the original pretraining compute budget (Minixhofer et al., 17 Dec 2025).
