Sampling from Your Language Model One Byte at a Time (2506.14123v1)

Published 17 Jun 2025 in cs.CL, cs.FL, and cs.LG

Abstract: Tokenization is used almost universally by modern LLMs, enabling efficient text representation using multi-byte or multi-character tokens. However, prior work has shown that tokenization can introduce distortion into the model's generations. For example, users are often advised not to end their prompts with a space because it prevents the model from including the space as part of the next token. This Prompt Boundary Problem (PBP) also arises in languages such as Chinese and in code generation, where tokens often do not line up with syntactic boundaries. Additionally, mismatching tokenizers often hinder model composition and interoperability. For example, it is not possible to directly ensemble models with different tokenizers due to their mismatching vocabularies. To address these issues, we present an inference-time method to convert any autoregressive LM with a BPE tokenizer into a character-level or byte-level LM, without changing its generative distribution at the text level. Our method efficiently solves the PBP and is also able to unify the vocabularies of LLMs with different tokenizers, allowing one to ensemble LMs with different tokenizers at inference time as well as transfer the post-training from one model to another using proxy-tuning. We demonstrate in experiments that the ensemble and proxy-tuned models outperform their constituents on downstream evals.

Summary

  • The paper presents ByteSampler, an inference method that converts autoregressive language models with BPE tokenizers into exact byte-level samplers to overcome the prompt boundary problem.
  • It introduces the Valid Covering Tree to efficiently enumerate all valid tokenizations, pruning invalid token pairs with a constant-time check.
  • Experiments demonstrate minimal overhead, improved cross-entropy loss, and enhanced model ensembling and proxy-tuning capabilities.

Modern LLMs widely rely on tokenization, typically using Byte-Pair Encoding (BPE) or similar methods, to represent text efficiently. While effective for compression and computation, tokenization can introduce artifacts, notably the Prompt Boundary Problem (PBP). This occurs when a user's prompt ends on a sequence of characters that could form the beginning of a token, but the standard tokenization process forces the model to condition only on the complete tokens derived from the prompt. This mismatch between the user's character-level intuition and the model's token-level conditioning can lead to unexpected or suboptimal continuations, especially in languages without clear word boundaries (like Chinese), when completing code identifiers, or during constrained generation. Furthermore, differing tokenizers hinder interoperability between models, making tasks like ensembling or transferring training difficult.
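
As a concrete illustration (not taken from the paper), the snippet below uses the tiktoken library to show how a trailing space shifts the token boundary; the encoding name and prompt text are arbitrary choices for this example.

```python
# Illustration of the Prompt Boundary Problem with a real BPE tokenizer.
# Requires `pip install tiktoken`; the prompt text is an arbitrary example.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

full_text = enc.encode("Hello world")   # typically tokenized as ["Hello", " world"]
prompt = enc.encode("Hello ")           # the trailing space ends up in its own token

print(full_text)
print(prompt)

# Because " world" (with a leading space) is usually a single token, a prompt
# that already ends with the space sits in the middle of that token: the model
# is conditioned on a tokenization it rarely saw during training, which is the
# source of the distorted continuations described above.
```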

The paper "Sampling from Your LLM One Byte at a Time" (2506.14123) presents ByteSampler, an inference-time method designed to solve the PBP exactly and enable seamless interaction with BPE-based LLMs at the byte level. The core idea is to convert any autoregressive LM with a BPE tokenizer into a byte-level LM without altering its generative distribution at the text level. This is achieved by carefully considering all possible valid tokenizations of a given byte prefix.

The method relies on constructing and maintaining a data structure called the Valid Covering Tree. Given a byte prefix, this tree represents every valid token sequence that covers the prefix, extending at most one full token beyond its end. The tree is built by starting from candidate tokens that match the prefix and pruning branches that either fail to match it or contain an invalid pair of adjacent tokens.
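
A minimal sketch of the underlying enumeration, assuming a `vocab` mapping token ids to byte strings and an `is_valid_pair` predicate (a sketch of which follows below); it lists the leaves of the covering structure rather than maintaining the tree incrementally as the paper does.

```python
# Hedged sketch: enumerate token sequences that cover a byte prefix and extend
# at most one token past its end. `vocab` maps token id -> bytes;
# `is_valid_pair(u, v)` says whether tokens u and v may appear adjacently.
def covering_sequences(prefix: bytes, vocab: dict, is_valid_pair) -> list:
    leaves = []

    def extend(seq, covered):
        if len(covered) >= len(prefix):
            leaves.append(seq)                       # prefix fully covered
            return
        remaining = prefix[len(covered):]
        for tok, piece in vocab.items():
            # The candidate token must agree with the remaining prefix bytes:
            # either it lies inside the prefix, or it overhangs its end.
            if not (remaining.startswith(piece) or piece.startswith(remaining)):
                continue
            # Prune sequences containing an adjacent pair the tokenizer would
            # never produce (Proposition 1 makes this pruning exact for BPE).
            if seq and not is_valid_pair(seq[-1], tok):
                continue
            extend(seq + [tok], covered + piece)

    extend([], b"")
    return leaves
```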

A key theoretical insight supporting the efficiency of this approach for BPE tokenizers is that a token sequence is valid if and only if all of its adjacent token pairs are valid (Proposition 1). This property, which holds for BPE but not necessarily for other tokenization schemes such as Unigram or WordPiece, allows for efficient pruning of the search space. Checking the validity of a token pair can be done in constant time by analyzing the tokens' merge trajectories within the tokenizer's rules.
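
The constant-time check in the paper analyzes merge trajectories; as a simpler stand-in, the sketch below treats a pair as valid when re-encoding the concatenated bytes reproduces exactly that pair, and memoizes the answer so repeated lookups are cheap. `encode` and `decode` here are assumed tokenizer callbacks, not names from the paper.

```python
from functools import lru_cache

def make_pair_checker(encode, decode):
    """Build a memoized pair-validity predicate.

    encode: bytes -> list[int]   (canonical BPE tokenization)
    decode: list[int] -> bytes
    This re-encoding test is an illustrative stand-in for the paper's
    constant-time merge-trajectory check, not the same procedure.
    """
    @lru_cache(maxsize=None)
    def is_valid_pair(u: int, v: int) -> bool:
        # (u, v) is accepted if the tokenizer itself would keep the two
        # tokens adjacent when given their concatenated bytes.
        return encode(decode([u]) + decode([v])) == [u, v]

    return is_valid_pair
```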

The ByteSampler algorithm processes the input byte stream incrementally. It maintains the Valid Covering Tree, adding potential next tokens as children based on valid pairs and pruning branches that no longer match the growing byte prefix. BPE tokenizers possess a property where any token is fully determined by a constant amount of lookahead (in bytes), which depends only on the tokenizer. This ensures that the Valid Covering Tree has bounded depth and branching factors, allowing the tree to be updated in constant time per new byte (Algorithm 1). As tokens become fully determined by the consumed bytes, they are removed from the tree and output.
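
A rough per-byte driver built on the enumeration above. The paper's Algorithm 1 instead updates the Valid Covering Tree incrementally in constant time per byte; recomputing the leaves from scratch here is only for clarity, and the emission rule is a conservative approximation.

```python
def feed_bytes(data: bytes, vocab: dict, is_valid_pair):
    """Consume bytes one at a time, emitting tokens once they are determined."""
    prefix, emitted = b"", []
    for i in range(len(data)):
        prefix += data[i:i + 1]
        leaves = covering_sequences(prefix, vocab, is_valid_pair)
        # Emit a token when every valid covering sequence starts with it and
        # it lies entirely inside the bytes consumed so far.
        while (leaves
               and all(seq and seq[0] == leaves[0][0] for seq in leaves)
               and len(vocab[leaves[0][0]]) <= len(prefix)):
            tok = leaves[0][0]
            emitted.append(tok)
            prefix = prefix[len(vocab[tok]):]
            leaves = [seq[1:] for seq in leaves]
    return emitted, prefix   # prefix holds bytes whose tokenization is still open
```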

With the Valid Covering Tree representing all valid token sequences for a given byte prefix, ByteSampler can perform various LM tasks at the byte level:

  • Prefix Probability: The probability of a byte prefix is computed by summing the probabilities (obtained from the underlying token LM) of all token sequences corresponding to the leaves of the Valid Covering Tree.
  • Sampling Continuation: To sample a text continuation, one samples a leaf sequence from the Valid Covering Tree based on its probability. Once a sequence is selected, standard token-level sampling from the underlying LM can continue from the end of that sequence. This solves the PBP by ensuring the conditioning is on the full byte prefix, not just its tokenized form.
  • Next Byte Distribution: The probability distribution over the next byte is computed by grouping leaf sequences by the next byte they would produce and summing their probabilities (see the sketch after this list). This enables explicit byte-by-byte generation.
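
A small sketch of both computations, assuming each leaf has already been scored with the underlying token-level LM; the `(tokens, next_byte, logprob)` representation is an assumption of this sketch, and leaves ending exactly at the prefix boundary would first be extended by one more token before the next-byte grouping.

```python
import math
from collections import defaultdict

def prefix_probability(leaves):
    """Sum the probabilities of all valid covering leaves.

    `leaves` is assumed to be a list of (tokens, next_byte, logprob) triples,
    with logprob taken from the underlying token-level LM.
    """
    return sum(math.exp(lp) for _, _, lp in leaves)

def next_byte_distribution(leaves):
    """Group leaves by the next byte they would produce and normalize."""
    mass = defaultdict(float)
    for _, next_byte, lp in leaves:
        if next_byte is not None:          # None: leaf ends exactly at the boundary
            mass[next_byte] += math.exp(lp)
    total = sum(mass.values())
    return {b: p / total for b, p in mass.items()}
```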

The practical utility of ByteSampler is demonstrated through several experiments:

  • Efficiency: ByteSampler shows significantly lower computational overhead compared to other exact methods for solving the PBP (e.g., Byte-Pair Correction), requiring only a marginal number of additional token evaluations per byte compared to plain BPE tokenization (Table 1). This minimal overhead is crucial for practical byte-level sampling.
  • Character-Level Language Modeling: Converting models to predict character-by-character using ByteSampler results in substantially lower cross-entropy loss and higher next-character prediction accuracy compared to naive methods that suffer from the PBP (Tables 2, 4). ByteSampler achieves performance metrics (bits per character) comparable to the original token-level models (Tables 2, 3), confirming that it preserves the underlying text distribution. It also outperforms heuristic backtracking methods in next-character prediction accuracy (Tables 3, 4).
  • Byte-Level Ensemble: ByteSampler allows ensembling models with different tokenizers by operating on the unified byte-level vocabulary. Experiments show that a simple average ensemble of three models with distinct tokenizers consistently outperforms the average performance of individual models on various downstream tasks (Table 5).
  • Byte-Level Proxy-Tuning: ByteSampler enables applying a post-training recipe (represented by an expert/anti-expert pair of models) to a base model even if their tokenizers differ. This "proxy-tuning" transfers the desired capabilities without fine-tuning the base model or accessing its weights. Experiments demonstrate that a proxy-tuned Llama-3.1-8B using Olmo2-1B-Instruct as the expert shows improved performance on instruction-following and reasoning tasks, surpassing the smaller expert model (Table 6). A sketch of the byte-level combination rules follows this list.
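
Once every constituent model exposes a next-byte distribution, combining them is straightforward. The sketch below shows a uniform average and the standard proxy-tuning rule (base + expert - anti-expert in log space), which is how these compositions are usually written; the paper's exact weighting may differ.

```python
import math

def ensemble_next_byte(dists):
    """Uniformly average per-byte distributions from models with different tokenizers."""
    avg = {b: sum(d.get(b, 0.0) for d in dists) / len(dists) for b in range(256)}
    z = sum(avg.values())
    return {b: p / z for b, p in avg.items()}

def proxy_tuned_next_byte(base, expert, anti_expert, eps=1e-12):
    """Apply the usual proxy-tuning combination at the byte level.

    Scores are log p_base + (log p_expert - log p_anti); the final
    renormalization absorbs any additive constants.
    """
    scores = {
        b: math.log(base.get(b, 0.0) + eps)
           + math.log(expert.get(b, 0.0) + eps)
           - math.log(anti_expert.get(b, 0.0) + eps)
        for b in range(256)
    }
    m = max(scores.values())
    weights = {b: math.exp(s - m) for b, s in scores.items()}
    z = sum(weights.values())
    return {b: w / z for b, w in weights.items()}
```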

Implementation considerations discussed include caching token masks, using 4D attention masks and KV-cache garbage collection for efficient inference over the Valid Covering Tree, and batching. The paper also touches on extending the method to character-level BPE (which requires byte-fallback logic) and on the complexities of online pretokenization, outlining a practical approach for the regular expressions commonly used in pretokenizers (Appendix C). A method for converting merge lists not generated by the standard BPE algorithm into a "normal form" suitable for the algorithm is also presented.

In summary, ByteSampler provides an exact, efficient, and practical solution to the Prompt Boundary Problem for BPE-based LMs. By converting token-level models into byte-level models at inference time, it not only corrects sampling artifacts but also unlocks powerful model composition techniques like ensembling and proxy-tuning between models that would otherwise be incompatible due to differing tokenizers. The method's low overhead makes byte-level interaction with large LMs feasible, opening avenues for various downstream applications and research directions.
