ByteSampler: Exact Byte-Level Sampling for BPE LMs
- ByteSampler is an inference-time algorithm that generates exact byte-level samples from BPE-tokenized language models, effectively eliminating the prompt boundary problem.
- It uses a Valid Covering Tree to track all tokenizations that honor the byte-level prompt, enabling practical model composition via ensembling and proxy-tuning across models with disparate tokenizers.
- Empirical evaluations show that ByteSampler achieves O(1) per-byte overhead and matches token-level performance, demonstrating strong efficiency and downstream accuracy.
ByteSampler is an inference-time algorithm that enables exact, efficient sampling from byte-level or character-level distributions using LMs originally trained with byte-pair encoding (BPE) tokenization. ByteSampler provides a principled solution to tokenization-induced generation distortions, most notably the Prompt Boundary Problem (PBP), and enables practical model composition—including ensembling and proxy-tuning—across models using disparate tokenizers, all without modifying or retraining the underlying transformer model (Hayase et al., 17 Jun 2025).
1. Tokenization and the Prompt Boundary Problem
Modern autoregressive LMs almost universally operate over a fixed vocabulary of multi-byte or multi-character tokens rather than raw characters or bytes. BPE is the prevailing approach: input strings are pretokenized, and then adjacent symbols, starting from UTF-8 bytes or characters, are merged iteratively to form tokens. During inference, a prompt is encoded into tokens $t_1, \dots, t_k$, which are provided to the model to generate subsequent tokens $t_{k+1}, t_{k+2}, \dots$, before decoding back to the output string.
The Prompt Boundary Problem arises when a user's prompt does not align with tokenization boundaries, specifically when the prompt ends on a byte- or character-level prefix that would otherwise be part of a single token under BPE. In these cases, the model is forced to condition on a tokenization that rarely or never appeared in pretraining, resulting in unnatural generations. For instance, if "because" is a single token but the prompt cuts off as "becau", the resulting conditional distribution often assigns high probability to incorrect continuations. The PBP is especially problematic in languages such as Chinese, and in code generation, where prompts regularly end in the middle of a token.
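A minimal sketch of the mismatch, assuming a toy merge list chosen so that "because" becomes a single token. The `MERGES` list and the `bpe_encode` helper are hypothetical and simplified; they are not taken from any real tokenizer.

```python
# Hypothetical merge list, ordered by priority, built so that
# "because" collapses into one token.
MERGES = [("b", "e"), ("c", "a"), ("u", "s"),
          ("be", "ca"), ("us", "e"), ("beca", "use")]

def bpe_encode(text, merges=MERGES):
    """Simplified BPE: apply each merge, in priority order, across the sequence."""
    tokens = list(text)
    for left, right in merges:
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens

full = bpe_encode("because")  # the complete word
cut = bpe_encode("becau")     # the same word cut off mid-token
print(full, cut)
```

In this toy setup the truncated prompt encodes to a token sequence that is not a prefix of the encoding of the full word, so a model conditioned on it would have to continue from a context it has rarely seen.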
2. Mathematical Framework
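The goal is to produce generations that are consistent with the true byte-level prompt boundary. The standard LM distribution is
$$P(t_{k+1}, t_{k+2}, \dots \mid t_1, \dots, t_k),$$
but the user instead desires
$$P(b_{n+1}, b_{n+2}, \dots \mid b_1, \dots, b_n),$$
where $b_i$ are bytes and $b_{1:n} = b_1 \dots b_n$ is the observed prompt prefix. Equivalently, the requirement is to sample such that
$$P(\text{text} \mid \text{text begins with } b_{1:n})$$
is matched, as opposed to the naive approach of tokenizing $b_{1:n}$ and continuing generation conditioned on $t_{1:k} = \mathrm{encode}(b_{1:n})$. The naive approach ignores the fact that the end of $b_{1:n}$ could be a prefix of a larger token and forces unnatural continuations.
To bridge this gap, ByteSampler constructs the "Valid Covering Tree" $\mathcal{T}(b_{1:n})$, indexing all token sequences that decode to $b_{1:n}$ along with one additional full token, while tracking all valid tokenizations that honor the prefix constraint. The next-byte distribution is expressed as:
$$P(b_{n+1} = b \mid b_{1:n}) = \frac{\sum_{\mathbf{t} \in \mathcal{T}(b_{1:n}) :\ \mathrm{decode}(\mathbf{t})_{n+1} = b} P(\mathbf{t})}{\sum_{\mathbf{t} \in \mathcal{T}(b_{1:n})} P(\mathbf{t})},$$
where the sums range over the root-to-leaf token sequences $\mathbf{t}$ of the tree and $P(\mathbf{t})$ is the model's probability of generating $\mathbf{t}$, guaranteeing that the byte-level sampling is faithful to the true generative process.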
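The marginalization can be illustrated by brute force on a toy model. This is a sketch under strong assumptions, not the paper's algorithm: tokens are drawn independently from the hypothetical table `TOKEN_PROBS`, characters stand in for bytes, and every token sequence is treated as admissible (the BPE validity check is ignored). The function names are illustrative.

```python
from collections import defaultdict

# Hypothetical toy model: each token is drawn independently with this probability.
TOKEN_PROBS = {"a": 0.4, "b": 0.3, "ab": 0.2, "abb": 0.1}

def covering_sequences(prefix, probs=TOKEN_PROBS):
    """Return (next_char, probability) for every token sequence that matches
    `prefix` and whose last token reaches past the end of `prefix`."""
    results = []

    def walk(pos, prob):
        for tok, p in probs.items():
            end = pos + len(tok)
            overlap = prefix[pos:end]  # part of the prompt this token must match
            if not tok.startswith(overlap):
                continue
            if end <= len(prefix):
                walk(end, prob * p)    # token lies inside the prompt: keep going
            else:
                # token crosses the boundary: record its first char past the prompt
                results.append((tok[len(prefix) - pos], prob * p))

    walk(0, 1.0)
    return results

def next_byte_distribution(prefix):
    """Group covering sequences by next character and normalize."""
    weights = defaultdict(float)
    for char, prob in covering_sequences(prefix):
        weights[char] += prob
    total = sum(weights.values())
    return {char: w / total for char, w in weights.items()}

dist = next_byte_distribution("a")
print(dist)
```

For the prompt "a", this toy model gives "b" a next-character probability of about 0.6, because the tokens "ab" and "abb" also contribute. Encoding the prompt as the single token "a" and sampling the next token would give "b" only 0.3.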
3. Inference-Time Algorithm and Complexity
ByteSampler maintains a tree data structure whose branches represent all valid token prefix continuations adhering exactly to the prompt's byte boundary. Upon receiving a new byte $b$, it performs three core updates:
- Extends each leaf node with all tokens whose byte encoding starts with $b$ and is compatible with the current leaf; compatibility is checked against a precomputed pairwise validity table.
- Prunes any branches whose next byte is not $b$.
- Emits any uniquely-determined token as output, compressing the trunk of the tree.
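The three updates can be sketched as follows. This is a simplified illustration rather than the reference implementation: leaves are stored as flat token tuples instead of a tree, characters stand in for bytes, `VOCAB` is a hypothetical toy vocabulary, and `is_valid_pair` is a stub standing in for the pairwise validity table (here it accepts every pair).

```python
VOCAB = ["a", "b", "ab", "abb"]  # hypothetical toy vocabulary

def is_valid_pair(prev_token, token):
    """Stub for the pairwise validity table; this sketch accepts every pair."""
    return True

def update(leaves, prompt, byte, vocab=VOCAB):
    """Advance the leaf set by one prompt byte (extend, then prune)."""
    new_leaves = []
    for leaf in leaves:
        text = "".join(leaf)
        if len(text) == len(prompt):
            # Leaf ends exactly at the boundary: extend with tokens starting with `byte`.
            prev = leaf[-1] if leaf else None
            for tok in vocab:
                if tok.startswith(byte) and is_valid_pair(prev, tok):
                    new_leaves.append(leaf + (tok,))
        elif text[len(prompt)] == byte:
            # Leaf already reaches past the boundary and agrees with `byte`: keep it.
            new_leaves.append(leaf)
        # Otherwise the leaf's next byte disagrees with `byte` and it is pruned.
    return new_leaves

def common_trunk(leaves):
    """Tokens shared by every leaf; these are uniquely determined and can be emitted."""
    trunk = []
    for column in zip(*leaves):
        if len(set(column)) != 1:
            break
        trunk.append(column[0])
    return tuple(trunk)

def feed(prompt_bytes):
    """Feed a prompt one byte at a time, starting from the empty token sequence."""
    leaves, prompt = [()], ""
    for byte in prompt_bytes:
        leaves = update(leaves, prompt, byte)
        prompt += byte
    return leaves

print(feed("ab"))
```

With the validity stub accepting everything, the leaf set grows faster than it would in practice; the bounded tree width discussed in this section relies on the real validity check discarding most candidate extensions.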
To generate the conditional next-byte distribution, ByteSampler:
- Enumerates the active leaves of the Valid Covering Tree.
- Computes the token-prefix probability for each leaf.
- Groups leaves by the next byte they would output, accumulating weights.
- Samples from the categorical distribution over bytes.
- Advances the prompt and updates the tree.
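The grouping and sampling steps in the list above might look like the sketch below. It assumes each leaf has already been reduced to a pair of decoded text and token-prefix probability, with the text reaching at least one character past the prompt; the function name and the leaf representation are illustrative assumptions.

```python
import random
from collections import defaultdict

def sample_next_byte(leaves, prompt_len, rng=random):
    """Sample the next character from leaves given as (decoded_text, probability).

    Every decoded_text is assumed to extend past position `prompt_len`.
    """
    weights = defaultdict(float)
    for text, prob in leaves:
        weights[text[prompt_len]] += prob  # accumulate weight per next character
    chars, probs = zip(*sorted(weights.items()))
    # random.choices accepts unnormalized weights, so no explicit division is needed.
    return rng.choices(chars, weights=probs, k=1)[0]

# Both leaves continue the one-character prompt "a" with "b".
print(sample_next_byte([("ab", 0.2), ("abb", 0.1)], prompt_len=1))
```

Passing a seeded `random.Random` instance as `rng` makes the draw reproducible.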
All operations are $O(1)$ in time and memory per generated byte because the tree width is bounded (by constants reflecting tokenizer vocabulary branching and context limits), resulting in negligible computational overhead relative to standard token-level sampling (Hayase et al., 17 Jun 2025).
4. Faithfulness and Theoretical Guarantees
By construction, the Valid Covering Tree built by ByteSampler tracks all and only tokenizations that decode to the exact observed byte-prefix plus one token. At every step, the algorithm marginalizes over all valid continuations, ensuring that:
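$$\sum_{\mathbf{t} \in \mathcal{T}(b_{1:n})} P(\mathbf{t}) = P(\text{text begins with } b_{1:n}),$$
where $b_{1:n}$ denotes the prompt bytes, $\mathcal{T}(b_{1:n})$ the root-to-leaf token sequences of the Valid Covering Tree, and $P(\mathbf{t})$ the model's probability of generating the token sequence $\mathbf{t}$. This guarantees exact equivalence to the desired text-level distribution conditioned on the prompt bytes, with no approximations or heuristics. As a result, ByteSampler entirely eliminates the Prompt Boundary Problem for BPE-tokenized models, without altering the autoregressive model or retraining.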
5. Model Composition: Ensembling and Proxy-Tuning
5.1 Byte-Level Ensembling
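Model composition is often hindered by tokenizer misalignment: token vocabularies $V_1$ and $V_2$ from different models do not match, making direct token-level ensembling intractable. ByteSampler resolves this by exposing byte-level distributions $p_i(b \mid b_{1:n})$ for each model $i$, always over 256-dimensional byte support. This enables exact inference-time ensembling,
$$p_{\text{ens}}(b \mid b_{1:n}) = \lambda\, p_1(b \mid b_{1:n}) + (1 - \lambda)\, p_2(b \mid b_{1:n}),$$
or product-of-experts (PoE) in logit space,
$$\log p_{\text{PoE}}(b \mid b_{1:n}) = \lambda \log p_1(b \mid b_{1:n}) + (1 - \lambda) \log p_2(b \mid b_{1:n}) - \log Z,$$
for any interpolation weight $\lambda \in [0, 1]$, where $b_{1:n}$ is the byte prefix so far and $Z$ normalizes over the 256 byte values. Such ensembling exactly preserves each model's generated text-level marginal distributions.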
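Once both models expose a distribution over the same byte alphabet, the two combination rules reduce to a few lines. The sketch below is illustrative: the function names are assumptions, the inputs are plain lists of probabilities (length 256 in real use, shortened here), and the PoE variant assumes strictly positive probabilities.

```python
import math

def mixture(p1, p2, lam):
    """Linear interpolation of two next-byte distributions."""
    return [lam * a + (1 - lam) * b for a, b in zip(p1, p2)]

def product_of_experts(p1, p2, lam):
    """Weighted combination in log space, renormalized over the byte alphabet.

    Assumes every probability is strictly positive.
    """
    logits = [lam * math.log(a) + (1 - lam) * math.log(b) for a, b in zip(p1, p2)]
    top = max(logits)  # subtract the max for numerical stability
    unnorm = [math.exp(x - top) for x in logits]
    z = sum(unnorm)
    return [u / z for u in unnorm]

# Three-symbol stand-ins for 256-dimensional byte distributions.
p1 = [0.5, 0.3, 0.2]
p2 = [0.1, 0.1, 0.8]
print(mixture(p1, p2, 0.5))
print(product_of_experts(p1, p2, 0.5))
```

The mixture is already normalized whenever its inputs are, while the PoE combination needs explicit renormalization.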
Empirical results with three models at roughly the 1B-parameter scale (Qwen3-1.7B, OLMo2-1B, Llama3.2-1B) showed that byte-level ensembles consistently outperformed the average constituent and closely matched the best single model on benchmarks including DROP, LAMBADA, SQuAD, and TriviaQA (Hayase et al., 17 Jun 2025).
5.2 Proxy-Tuning via Logit Differences
Proxy-tuning, as described by Liu et al. (2024), uses smaller models (proxies) as experts/anti-experts to apply "virtual" training to large base models at inference without modifying model weights. ByteSampler is critical here because proxies and base models may use incompatible tokenizers. The proxy mixture is formed as
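$$\ell_{\text{proxy}} = \ell_{\text{base}} + \alpha\,(\ell_{\text{expert}} - \ell_{\text{anti}}),$$
with $\ell_{\text{base}}, \ell_{\text{expert}}, \ell_{\text{anti}}$ the byte-level logits from the base, expert, and anti-expert models, respectively; $\alpha$ controls the influence. The resulting distribution approximates the effect of training the base model on the expert's data minus the anti-expert's.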
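A minimal sketch of the proxy combination at the byte level, assuming each model supplies a list of next-byte logits over the same alphabet. The function name and the default `alpha=1.0` are illustrative assumptions.

```python
import math

def proxy_tuned_distribution(base, expert, anti, alpha=1.0):
    """Shift base logits by the expert/anti-expert difference, then softmax."""
    combined = [b + alpha * (e - a) for b, e, a in zip(base, expert, anti)]
    top = max(combined)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in combined]
    z = sum(exps)
    return [x / z for x in exps]

# Two-symbol stand-in for a 256-dimensional byte alphabet.
base = [0.0, math.log(3.0)]
expert = [2.0, 0.0]
anti = [0.0, 0.0]
print(proxy_tuned_distribution(base, expert, anti))
```

When the expert and anti-expert agree, their difference vanishes and the result is the base model's own distribution, which is a quick sanity check for any implementation.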
Application to Llama-3.1-8B base with OLMo2-1B proxies resulted in substantial performance improvements (GSM8K EM from ~55% to 76%, MMLU from ~28% to 60%, AlpacaEval2 win rate from 0.88 to 0.715) (Hayase et al., 17 Jun 2025).
6. Empirical Evaluation
ByteSampler achieves practical efficiency and strong downstream performance compared to both naive and heuristic baselines:
- Token inference overhead: ByteSampler requires 24.24 tokens per generation, adding only ~0.72 token evaluations per byte relative to plain BPE (23.5), and 65% less overhead than next-best exact methods (Byte-Pair Correction: 72.99; with prefix caching: 25.6).
- Character-level modeling:
- On English (OLMo2-1B), ByteSampler matches the token-level baseline: 0.81 bits/char vs. 0.80 bits/char (token) and 6.53 bits/char (naive).
- Next-char accuracy (greedy) is 81.6% with ByteSampler, compared to 29.5% (naive), 71.6% (Backtracking), and 76.3% (Token Alignment).
- On Chinese (Qwen3-1.7B): 3.23 bits/char (matching token baseline 3.29), with 52.7% next-char accuracy; naive achieves only 32.8%.
- Ensembling and proxy-tuning:
- Byte-level ensembles consistently outperformed individual model averages across multiple tasks.
- Proxy-tuning with ByteSampler provided substantial improvements in instruction-following and knowledge benchmarks.
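For reference, bits-per-character figures such as those in the character-level results above are conventionally the average negative base-2 log-probability that a model assigns to each observed character. A small sketch, assuming the per-character probabilities have already been obtained (the function name is illustrative):

```python
import math

def bits_per_char(char_probs):
    """Average negative log2-probability over the observed characters."""
    return -sum(math.log2(p) for p in char_probs) / len(char_probs)

# Two characters predicted with probabilities 1/2 and 1/4.
print(bits_per_char([0.5, 0.25]))
```

Lower values mean the model wastes fewer bits on the text, which is why a naive method that mishandles the prompt boundary shows up as a much larger number.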
7. Significance and Extensions
ByteSampler provides an exact, $O(1)$-overhead, inference-time reduction from any BPE-tokenized LM to a byte-level generator. It completely solves the Prompt Boundary Problem for BPE LMs and enables model-composition techniques that were previously impossible due to tokenizer misalignment, specifically exact ensembling and proxy-tuning, without requiring model retraining or weight access. The open-source implementation demonstrates negligible additional latency and strongly favorable empirical performance on text generation and downstream tasks (Hayase et al., 17 Jun 2025).