
ByteSampler: Exact Byte-Level Sampling for BPE LMs

Updated 6 May 2026
  • ByteSampler is an inference-time algorithm that generates exact byte-level samples from BPE-tokenized language models, effectively eliminating the prompt boundary problem.
  • It uses a Valid Covering Tree to track all tokenizations that honor the byte-level prompt, enabling practical model composition via ensembling and proxy-tuning across models with disparate tokenizers.
  • Empirical evaluations show that ByteSampler achieves O(1) per-byte overhead and matches token-level performance, demonstrating strong efficiency and downstream accuracy.

ByteSampler is an inference-time algorithm that enables exact, efficient sampling from byte-level or character-level distributions using LMs originally trained with byte-pair encoding (BPE) tokenization. ByteSampler provides a principled solution to tokenization-induced generation distortions, most notably the Prompt Boundary Problem (PBP), and enables practical model composition—including ensembling and proxy-tuning—across models using disparate tokenizers, all without modifying or retraining the underlying transformer model (Hayase et al., 17 Jun 2025).

1. Tokenization and the Prompt Boundary Problem

Modern autoregressive LMs almost universally operate over a fixed vocabulary $V$ of multi-byte or multi-character tokens, rather than raw characters or bytes. BPE is the prevailing approach: input strings $S \in \Sigma^*$ are pretokenized and then merged iteratively to form tokens, typically mapped from UTF-8 bytes or characters. During inference, a prompt $x \in \Sigma^*$ is encoded into tokens $t_1, \ldots, t_k = \text{encode}(x)$, which are provided to the model to generate subsequent tokens $t_{k+1}, t_{k+2}, \ldots$, before decoding back to the output string.

The Prompt Boundary Problem arises when a user's prompt $x$ does not align with tokenization boundaries: specifically, when $x$ ends on a byte- or character-level prefix that would otherwise be part of a single token under BPE. In these cases, the model is forced to condition on a tokenization that rarely or never appeared in pretraining, resulting in unnatural generations. For instance, if "because" is a single token but the prompt cuts off as "becau", the resulting conditional distribution often assigns high probability to incorrect continuations. The PBP is especially problematic in languages such as Chinese, and in code generation, where prompts regularly end partway through a token.
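The "because"/"becau" mismatch can be illustrated with a small sketch. The vocabulary below is hypothetical, and a greedy longest-match tokenizer is used as a simplified stand-in for real BPE merging; the point is only that the tokens of a truncated prompt need not be a prefix of the tokens of the full word.

```python
# Hypothetical toy vocabulary; greedy longest-match stands in for real BPE.
VOCAB = ["because", "be", "ca", "u", "b", "e", "c", "a", "s"]

def encode(text: str) -> list[str]:
    """Tokenize by repeatedly taking the longest vocabulary entry that matches."""
    tokens = []
    i = 0
    while i < len(text):
        # max() raises ValueError if no entry matches; fine for this toy example.
        match = max((t for t in VOCAB if text.startswith(t, i)), key=len)
        tokens.append(match)
        i += len(match)
    return tokens

full = encode("because")  # a single token
cut = encode("becau")     # several shorter tokens
print(full, cut)
```

A model that only ever saw the single token for "because" during training is now asked to continue a token sequence it has rarely or never observed.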

2. Mathematical Framework

The goal is to produce generations that are consistent with the true byte-level prompt boundary. The standard LM distribution is

$$p(t_1, \ldots, t_n) = \prod_{i=1}^n p(t_i \mid t_{<i}),$$

but the user instead desires

$$p(b_{m+1}, b_{m+2}, \ldots \mid b_1, \ldots, b_m = x),$$

where $b_j$ are bytes and $x$ is the observed prompt prefix. Equivalently, the requirement is to sample such that

$$p\big(S \mid S \text{ begins with } x\big), \quad S = \text{decode}(t_1, t_2, \ldots),$$

is matched, as opposed to the naive approach of tokenizing $x$ and continuing generation conditioned on $\text{encode}(x)$. The naive approach ignores the fact that the tail of $x$ could be a prefix of a larger token and forces unnatural continuations.

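To bridge this gap, ByteSampler constructs the "Valid Covering Tree" $T_x$, indexing all token sequences that decode to $x$ along with one additional full token, while tracking all valid tokenizations that honor the prefix constraint. The next-byte distribution is expressed as:

$$p(b_{m+1} = b \mid b_1, \ldots, b_m = x) = \frac{\sum_{\ell \in \text{leaves}(T_x),\ \text{next}(\ell) = b} p(\ell)}{\sum_{\ell \in \text{leaves}(T_x)} p(\ell)},$$

where $p(\ell)$ is the model probability of the token sequence ending at leaf $\ell$ and $\text{next}(\ell)$ is the byte that sequence places immediately after $x$. This guarantees that the byte-level sampling is faithful to the true generative process.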

3. Inference-Time Algorithm and Complexity

ByteSampler maintains a tree data structure whose branches represent all valid token prefix continuations adhering exactly to the prompt's byte boundary. Upon receiving a new byte $b$, it performs three core updates:

  1. Extends each leaf node with all tokens whose byte encoding starts with $b$ and that are compatible with the current leaf; this uses a pairwise validity table.
  2. Prunes any branches whose next byte is not $b$.
  3. Emits any uniquely-determined token as output, compressing the trunk of the tree.
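The extend and prune steps above can be sketched on a toy example. Everything here is an illustrative assumption rather than the paper's implementation: the vocabulary is made up, leaves are stored as flat tuples of tokens instead of tree nodes, `is_valid_pair` is a crude stand-in for the pairwise validity check, and the trunk-compression step is omitted.

```python
VOCAB = ["a", "b", "ab", "ba", "abb"]  # hypothetical toy vocabulary

def is_valid_pair(prev, tok):
    # Toy stand-in for pairwise validity: reject a pair whose concatenation
    # is itself a single vocabulary entry (it would have been merged).
    return prev is None or (prev + tok) not in VOCAB

def update_leaves(leaves, prefix, new_byte):
    """Return the leaves consistent with prefix + new_byte.

    Each leaf is a tuple of tokens whose concatenation starts with `prefix`.
    """
    updated = []
    for leaf in leaves:
        text = "".join(leaf)
        if len(text) > len(prefix):
            # Prune: this leaf already commits to a next byte.
            if text[len(prefix)] == new_byte:
                updated.append(leaf)
        else:
            # Extend: this leaf ends exactly at the prefix boundary.
            prev = leaf[-1] if leaf else None
            for tok in VOCAB:
                if tok.startswith(new_byte) and is_valid_pair(prev, tok):
                    updated.append(leaf + (tok,))
    return updated

leaves = [()]  # start from the empty token sequence
prefix = ""
for byte in "ab":
    leaves = update_leaves(leaves, prefix, byte)
    prefix += byte
print(leaves)
```

After consuming "ab", the pair ("a", "b") has been rejected by the toy validity rule, while sequences such as ("ab",) and ("abb",) survive because their bytes agree with the prompt so far.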

To generate the conditional next-byte distribution, ByteSampler:

  1. Enumerates the active leaves of the Valid Covering Tree.
  2. Computes the token-prefix probability for each leaf.
  3. Groups leaves by the next byte they would output, accumulating weights.
  4. Samples from the categorical distribution over bytes.
  5. Advances the prompt and updates the Valid Covering Tree.
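The grouping and sampling steps can be sketched as follows. The leaf probabilities are made-up numbers standing in for token-prefix probabilities that would come from the language model, and the function names are hypothetical.

```python
import random
from collections import defaultdict

def next_byte_distribution(leaves, prefix_len):
    """Group leaves by the byte they place right after the prompt.

    `leaves` is a list of (decoded_bytes, probability) pairs, where each
    decoded_bytes value is strictly longer than the prompt.
    """
    weights = defaultdict(float)
    for decoded, prob in leaves:
        weights[decoded[prefix_len]] += prob  # indexing bytes yields an int
    total = sum(weights.values())
    return {byte: w / total for byte, w in weights.items()}

def sample_byte(dist, rng):
    """Draw one byte value from the categorical distribution `dist`."""
    values = list(dist)
    return rng.choices(values, weights=[dist[v] for v in values], k=1)[0]

prompt = b"beca"
# Hypothetical leaves: two of them agree on the next byte "u".
leaves = [(b"because", 0.5), (b"becau", 0.1), (b"became", 0.2)]
dist = next_byte_distribution(leaves, len(prompt))
sampled = sample_byte(dist, random.Random(0))
print({chr(b): round(p, 3) for b, p in dist.items()}, chr(sampled))
```

The two leaves that continue with "u" pool their weight before normalization, which is the marginalization over tokenizations that the steps describe.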

All operations are $O(1)$ in time and memory per generated byte because the tree width is bounded by constants reflecting tokenizer vocabulary branching and context limits, resulting in negligible computational overhead relative to standard token-level sampling (Hayase et al., 17 Jun 2025).

4. Faithfulness and Theoretical Guarantees

By construction, the Valid Covering Tree built by ByteSampler tracks all and only tokenizations that decode to the exact observed byte-prefix plus one token. At every step, the algorithm marginalizes over all valid continuations, ensuring that the distribution $\hat{p}$ it samples from satisfies:

$$\hat{p}(b_{m+1} \mid b_1, \ldots, b_m) = p(b_{m+1} \mid b_1, \ldots, b_m).$$

This guarantees exact equivalence to the desired text-level distribution conditioned on the prompt bytes, with no approximations or heuristics. As a result, ByteSampler entirely eliminates the Prompt Boundary Problem for BPE-tokenized models, without altering the autoregressive model or retraining.

5. Model Composition: Ensembling and Proxy-Tuning

5.1 Byte-Level Ensembling

Model composition is often hindered by tokenizer misalignment: token vocabularies $V_1$ and $V_2$ from different models do not match, making direct token-level ensembling intractable. ByteSampler resolves this by exposing byte-level distributions $p_i(b \mid x)$ for each model $i$, always over 256-dimensional byte support. This enables exact inference-time ensembling,

$$p_{\text{ens}}(b \mid x) = \lambda\, p_1(b \mid x) + (1 - \lambda)\, p_2(b \mid x),$$

or product-of-experts (PoE) in logit space,

$$\log p_{\text{PoE}}(b \mid x) = \lambda \log p_1(b \mid x) + (1 - \lambda) \log p_2(b \mid x) - \log Z(x),$$

for any interpolation weight $\lambda \in [0, 1]$, where $Z(x)$ normalizes over the 256 byte values. Such ensembling exactly preserves each model's generated text-level marginal distributions.
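As a minimal sketch of the two combination rules, assuming two models and a single interpolation weight (called `lam` here), the byte-level distributions can be combined as plain length-256 lists. The helper names and the example probabilities are hypothetical.

```python
def byte_dist(probs):
    """Build a length-256 distribution from a {character: probability} dict."""
    dist = [0.0] * 256
    for ch, p in probs.items():
        dist[ord(ch)] = p
    return dist

def mixture(p1, p2, lam):
    """Linear interpolation of two byte-level distributions."""
    return [lam * a + (1 - lam) * b for a, b in zip(p1, p2)]

def product_of_experts(p1, p2, lam):
    """Interpolate in log space, then renormalize over the 256 byte values.

    Raising to powers is equivalent to weighting log-probabilities.
    Raises ZeroDivisionError if the two supports are disjoint.
    """
    unnorm = [(a ** lam) * (b ** (1 - lam)) for a, b in zip(p1, p2)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

# Made-up next-byte distributions from two models with different tokenizers.
p1 = byte_dist({"a": 0.9, "b": 0.1})
p2 = byte_dist({"a": 0.5, "b": 0.5})

mix = mixture(p1, p2, lam=0.5)
poe = product_of_experts(p1, p2, lam=0.5)
print(mix[ord("a")], poe[ord("a")])
```

Because both inputs share the same 256-value byte support, no vocabulary alignment step is needed before combining them.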

Empirical results with three models at roughly the 1B-parameter scale (Qwen3-1.7B, OLMo2-1B, Llama3.2-1B) showed that byte-level ensembles consistently outperformed the average constituent and closely matched the best single model on benchmarks including DROP, LAMBADA, SQuAD, and TriviaQA (Hayase et al., 17 Jun 2025).

5.2 Proxy-Tuning via Logit Differences

Proxy-tuning, as described by Liu et al. (2024), uses smaller models (proxies) as experts/anti-experts to apply "virtual" training to large base models at inference without modifying model weights. ByteSampler is critical here because proxies and base models may use incompatible tokenizers. The proxy mixture is formed as

$$\ell_{\text{tuned}} = \ell_{\text{base}} + \alpha\,(\ell_{\text{expert}} - \ell_{\text{anti}}),$$

with $\ell_{\text{base}}, \ell_{\text{expert}}, \ell_{\text{anti}}$ the byte-level logits from the base, expert, and anti-expert models, respectively; $\alpha$ controls the influence. The resulting distribution is intended to approximate the effect of training the base on the expert's data minus the anti-expert's.
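The logit arithmetic can be sketched in a few lines. This assumes each model's byte-level logits are already available as equal-length lists; the function names, the scaling parameter `alpha`, and the example values are illustrative.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]

def proxy_tune(base, expert, anti, alpha=1.0):
    """Shift base logits by the scaled expert/anti-expert difference."""
    combined = [b + alpha * (e - a) for b, e, a in zip(base, expert, anti)]
    return softmax(combined)

# Made-up logits over three byte values (a real vector would have 256 entries).
base = [0.0, 0.0, 0.0]
expert = [1.0, 0.0, 0.0]
anti = [0.0, 0.0, 0.0]

tuned = proxy_tune(base, expert, anti, alpha=1.0)
untouched = proxy_tune(base, expert, anti, alpha=0.0)
print(tuned, untouched)
```

Setting `alpha` to zero recovers the base model's distribution, and identical expert and anti-expert logits cancel out, so the proxies only matter where they disagree.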

Application to Llama-3.1-8B base with OLMo2-1B proxies resulted in substantial performance improvements (GSM8K EM from ~55% to 76%, MMLU from ~28% to 60%, AlpacaEval2 win rate from 0.88 to 0.715) (Hayase et al., 17 Jun 2025).

6. Empirical Evaluation

ByteSampler achieves practical efficiency and strong downstream performance compared to both naive and heuristic baselines:

  • Token inference overhead: ByteSampler requires 24.24 tokens per generation, adding only ~0.72 token evaluations per byte relative to plain BPE (23.5), and 65% less overhead than next-best exact methods (Byte-Pair Correction: 72.99; with prefix caching: 25.6).
  • Character-level modeling:
    • On English (OLMo2-1B), ByteSampler matches the token-level baseline: 0.81 bits/char vs. 0.80 bits/char (token) and 6.53 bits/char (naive).
    • Next-char accuracy (greedy) is 81.6% with ByteSampler, compared to 29.5% (naive), 71.6% (Backtracking), and 76.3% (Token Alignment).
    • On Chinese (Qwen3-1.7B): 3.23 bits/char (matching token baseline 3.29), with 52.7% next-char accuracy; naive achieves only 32.8%.
  • Ensembling and proxy-tuning:
    • Byte-level ensembles consistently outperformed individual model averages across multiple tasks.
    • Proxy-tuning with ByteSampler provided substantial improvements in instruction-following and knowledge benchmarks.

7. Significance and Extensions

ByteSampler provides an exact, $O(1)$-overhead, inference-time reduction from any BPE-tokenized LM to a byte-level generator. It completely solves the Prompt Boundary Problem for BPE LMs and enables a range of model-composition techniques previously impossible due to tokenizer misalignment, specifically exact ensembling and proxy-tuning, without requiring model retraining or weight access. The open-source implementation demonstrates negligible additional latency and strongly favorable empirical performance on text generation and downstream tasks (Hayase et al., 17 Jun 2025).
