Tokenization with Split Trees

Published 21 May 2026 in cs.CL | (2605.22705v1)

Abstract: We introduce Tokenization with Split Trees (ToaST), a subword tokenization method that directly optimizes compression under a new recursive inference procedure. ToaST greedily splits each pretoken into a full binary tree using precomputed byte n-gram counts, independent of any vocabulary. Given a vocabulary, inference recursively descends each split tree and emits the first in-vocabulary node reached on each path. Vocabulary selection is formulated as an Integer Program (IP) that minimizes the total token count over all split trees under this inference procedure. The Linear Programming (LP) relaxation is near-integral in practice, yielding provably near-optimal vocabularies, with training time empirically scaling quadratically in the number of split trees. On English text, ToaST reduces token counts by more than 11% compared to BPE, WordPiece, and UnigramLM at vocabulary sizes of 40,960 and above, reducing the number of inference tokens for models using this tokenizer, thus extending the effective context length. ToaST also uses common single-byte tokens less frequently than these baselines, leading to a substantial improvement in Rényi efficiency. In experiments training 1.5B parameter LLMs, ToaST achieves the highest CORE score, outperforming baselines by 2.6%–7.6%, with significance for two of three, and scoring best on 13 of 22 individual tasks.


Authors (7)

Summary

The paper introduces ToaST, a novel tokenization method that formulates subword vocabulary selection as an integer programming problem to optimize compression.
It builds a vocabulary-independent split tree for each pretoken, deterministically segmenting it with byte n-gram frequency counts; greedy choices are confined to tree construction, while vocabulary selection is optimized globally.
Empirical results show a token count reduction of more than 11% relative to BPE, UnigramLM, and WordPiece on English text, and the highest CORE score among the four tokenizers in 1.5B-parameter LLM training.

Tokenization with Split Trees: Principled Compression-Driven Subword Tokenization

Introduction

"Tokenization with Split Trees" (2605.22705) introduces ToaST, a novel subword tokenization method that frames vocabulary selection as an explicit combinatorial optimization problem under a recursive split tree inference procedure. ToaST is designed to optimize compression (minimize token count) directly and globally, rather than relying on the greedy, merge-based approach used in Byte Pair Encoding (BPE) and the ablative top-down process in UnigramLM. Given the centrality of data compression and the tokenization bottleneck in LLM production and deployment, ToaST aims to reduce the number of inference tokens, extend effective context, and improve token distribution characteristics compared to standard baselines.

Split Tree Construction

ToaST departs from merge-driven methods by constructing, for each unique pretoken (as defined by regular expression segmentation), a full binary split tree using frequency counts of byte $n$-grams computed over the tokenizer training data. At each node, the split is chosen to maximize $\min(c_p, c_{p'})$, where $c_p$ and $c_{p'}$ are the corpus frequencies of the two substrings produced by the split. This favors balanced splits in which both halves are frequent, and therefore useful, candidate tokens.

Figure 1: Example split tree for ␣Kentucky, showing recursive binary splitting based on $n$-gram counts, with single bytes as leaves.

Each node in the split tree corresponds to a candidate token. This construction is vocabulary-independent: the same deterministic trees are used regardless of the final vocabulary, allowing decoupling between statistical partitioning and subsequent token selection.
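The construction described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `Node` class, the function names, the brute-force substring counting, and the leftmost tie-breaking rule are all assumptions made for the example.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    piece: bytes
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def ngram_counts(pretoken_counts: dict[bytes, int]) -> Counter:
    """Count every byte substring of every pretoken, weighted by pretoken frequency."""
    counts: Counter = Counter()
    for pretoken, freq in pretoken_counts.items():
        n = len(pretoken)
        for i in range(n):
            for j in range(i + 1, n + 1):
                counts[pretoken[i:j]] += freq
    return counts


def build_split_tree(piece: bytes, counts: Counter) -> Node:
    """Recursively split `piece` at the position maximizing min(c_p, c_p')."""
    node = Node(piece)
    if len(piece) == 1:
        return node  # single bytes are the leaves
    # Assumption: ties go to the leftmost split (max returns the first maximum).
    best = max(
        range(1, len(piece)),
        key=lambda i: min(counts[piece[:i]], counts[piece[i:]]),
    )
    node.left = build_split_tree(piece[:best], counts)
    node.right = build_split_tree(piece[best:], counts)
    return node


def leaves(node: Node) -> list[bytes]:
    """Left-to-right leaf pieces; concatenated, they spell the pretoken."""
    if node.left is None:
        return [node.piece]
    return leaves(node.left) + leaves(node.right)


# Toy corpus: pretoken -> frequency (made-up numbers).
counts = ngram_counts({b"the": 10, b"then": 3, b"n": 1})
tree = build_split_tree(b"then", counts)
print(tree.left.piece, tree.right.piece)  # b'the' b'n'
```

In the toy corpus, splitting `then` into `the` and `n` wins because `min(13, 4) = 4` beats the other two splits, which both score 3. Nothing in the construction refers to a vocabulary.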

Split Tree Inference Algorithm

Tokenization of text is performed by recursively traversing each pretoken's split tree. For a given vocabulary $V$, the inference process emits a token at the first node encountered on a root-to-leaf path that is present in $V$; if the node is not in $V$, its children are considered recursively. All single-byte tokens representing valid UTF-8 code units are always included in $V$ to guarantee tokenizability.

Figure 2: During tokenization, in-vocabulary nodes (blue) are emitted, while out-of-vocabulary nodes (white) are split recursively; unreachable nodes in descent are light gray.

Figure 3: A token appears in the output if and only if the node is in the vocabulary and none of its ancestors are; an example path from leaf to root highlights this property.

This design achieves several properties:

Changing vocabulary never affects tree structure.
All valid vocabularies yield correct tokenizations (contrast with BPE merge constraints).
Removing a token causes only local refinements without cascading effects: each tree can be efficiently traversed to update tokenizations incrementally.
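The recursive descent and the locality of token removal can be illustrated with a short sketch. The `Node` class and `tokenize` function are hypothetical names chosen for this example, and the tree is built by hand rather than from corpus counts.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    piece: bytes
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def tokenize(node: Node, vocab: set[bytes]) -> list[bytes]:
    """Emit the first in-vocabulary node on each root-to-leaf path."""
    if node.piece in vocab:
        return [node.piece]
    if node.left is None or node.right is None:
        # Cannot happen when all single bytes are in the vocabulary.
        raise ValueError(f"single byte {node.piece!r} missing from vocabulary")
    return tokenize(node.left, vocab) + tokenize(node.right, vocab)


# Hand-built split tree for b"then": then -> (the -> (t, he -> (h, e)), n)
tree = Node(
    b"then",
    Node(b"the", Node(b"t"), Node(b"he", Node(b"h"), Node(b"e"))),
    Node(b"n"),
)

single_bytes = {b"t", b"h", b"e", b"n"}
vocab = single_bytes | {b"the", b"he"}

print(tokenize(tree, vocab))             # [b'the', b'n']
print(tokenize(tree, vocab - {b"the"}))  # [b't', b'he', b'n']
```

With `the` in the vocabulary, `he` is never emitted because its ancestor shadows it. Removing `the` only refines the subtree below that node; the token `n` on the other branch is unaffected.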

Vocabulary Selection as Integer Programming

ToaST formulates the selection of vocabulary tokens as a 0-1 Integer Program (IP): select a subset of candidate tokens of the desired vocabulary size that globally minimizes the total number of training tokens produced under recursive split inference. The constraints ensure complete coverage of every pretoken with mutually exclusive token assignments along each root-to-leaf path. The model has two families of binary variables, representing inclusion of tokens in $V$ and assignment of tree nodes to the tokenization:

A token-inclusion variable for each candidate token, equal to 1 if the token is included in $V$.
A node-usage variable for each node in each split tree, equal to 1 if the node is used in the tokenization.
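One way to write down an integer program consistent with this description is sketched below. The notation is assumed for this sketch and may differ from the paper's: $x_t$ is the inclusion variable for candidate token $t$, $y_{T,n}$ the usage variable for node $n$ of split tree $T$, $w_T$ the training frequency of the pretoken behind $T$, $\tau(n)$ the token string at node $n$, and $K$ the target vocabulary size.

```latex
\begin{aligned}
\min_{x,\,y}\quad & \sum_{T} w_T \sum_{n \in T} y_{T,n} \\
\text{s.t.}\quad
  & \sum_{n \in P} y_{T,n} = 1
    && \text{for every tree } T \text{ and every root-to-leaf path } P \text{ in } T, \\
  & y_{T,n} \le x_{\tau(n)}
    && \text{for every node } n \text{ of every tree } T, \\
  & \sum_{t} x_t = K, \\
  & x_b = 1
    && \text{for every single-byte token } b, \\
  & x_t \in \{0,1\}, \quad y_{T,n} \in \{0,1\}.
\end{aligned}
```

The path constraints force the used nodes of each tree to form a cut that covers every leaf exactly once. Because each tree is a full binary tree, any cut that uses a descendant in place of an available ancestor has strictly more nodes, so at the optimum the used nodes coincide with the first in-vocabulary node on each path, matching the inference procedure.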

Due to the tree structure, the LP relaxation of the IP is almost always integral, with only a small observed relaxation gap, and rounding heuristics suffice for scalability to hundreds of thousands of split trees and vocabulary sizes exceeding 250k.
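On a toy instance, the objective that the IP minimizes can be checked by exhaustive search. The sketch below is a stand-in for illustration only: it enumerates vocabularies directly rather than solving an LP relaxation, and the tuple-based tree encoding, function names, and frequencies are all assumptions.

```python
from itertools import combinations


# A split tree node is a tuple (piece, left, right); leaves have left = right = None.
def leaf(piece):
    return (piece, None, None)


def count_tokens(tree, vocab):
    """Tokens emitted by recursive descent; single-byte leaves are always allowed."""
    piece, left, right = tree
    if left is None or piece in vocab:
        return 1
    return count_tokens(left, vocab) + count_tokens(right, vocab)


def multibyte_candidates(tree, out):
    """Collect the pieces of all internal (multi-byte) nodes into the set `out`."""
    piece, left, right = tree
    if left is not None:
        out.add(piece)
        multibyte_candidates(left, out)
        multibyte_candidates(right, out)
    return out


def best_vocab(weighted_trees, budget):
    """Try every set of `budget` multi-byte tokens; return (total tokens, best set)."""
    candidates = set()
    for tree, _ in weighted_trees:
        multibyte_candidates(tree, candidates)
    best_total, best_set = None, None
    for extra in combinations(sorted(candidates), budget):
        vocab = set(extra)
        total = sum(freq * count_tokens(tree, vocab) for tree, freq in weighted_trees)
        if best_total is None or total < best_total:
            best_total, best_set = total, vocab
    return best_total, best_set


the_tree = (b"the", leaf(b"t"), (b"he", leaf(b"h"), leaf(b"e")))
then_tree = (b"then", the_tree, leaf(b"n"))
weighted_trees = [(the_tree, 10), (then_tree, 3)]  # (tree, pretoken frequency)

print(best_vocab(weighted_trees, 1))  # total 16, achieved by adding b"the"
print(best_vocab(weighted_trees, 2))  # total 13, achieved by adding b"the" and b"then"
```

With a budget of two, the search skips `he` even though it is a frequent substring: once `the` is in the vocabulary, `he` is shadowed in both trees and contributes nothing to the objective. Exhaustive search is exponential in the number of candidates, which is why a real instance needs the LP-based approach.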

Figure 4: Total training time grows quadratically with the number of split trees; scalability enables coverage of roughly 99% of training data.

Intrinsic Compression and Token Distribution Metrics

On English data, ToaST achieves superior compression, reducing token count by more than 11% relative to BPE, WordPiece, and UnigramLM at vocabulary sizes of 40,960 and above. This is a substantial result, given that lower token counts:

Lower inference costs for models run under fixed context limits,
Improve effective contextual coverage per prompt.

Figure 5: Validation bytes per token for ToaST and baselines; a prominent gap is observed from 8k to 262k vocabulary sizes.

ToaST yields a dramatically altered distribution of token types: for moderately large vocabularies (e.g., 65,536), use of single-byte fallback tokens (Leaf nodes) is reduced by a factor of 14–19x compared to BPE and similar baselines.

Figure 6: Stack plot of token categories with ToaST; Root tokens (full pretoken coverage) dominate, whereas Leaf usage is suppressed.

Figure 7: BPE substantially overuses single-byte fallback tokens compared to ToaST.

ToaST also achieves higher Rényi efficiency, reflecting a more uniform token usage distribution and mitigating the issue of excessively frequent tokens that can degrade model capacity and amplify context budget disparities.

Figure 8: ToaST attains higher Rényi efficiency across all vocabulary sizes, indicating superior token utilization uniformity.
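For reference, Rényi efficiency is commonly defined as the Rényi entropy of the unigram token distribution divided by the logarithm of the vocabulary size. The sketch below follows that definition; the function name and the choice of `alpha = 2.5` in the usage lines are assumptions for illustration, not values taken from this summary.

```python
import math
from collections import Counter


def renyi_efficiency(token_counts: Counter, vocab_size: int, alpha: float) -> float:
    """Renyi entropy of order alpha of the token distribution, divided by log(vocab_size).

    Requires vocab_size > 1. Tokens with zero count contribute nothing.
    """
    total = sum(token_counts.values())
    probs = [c / total for c in token_counts.values() if c > 0]
    if alpha == 1.0:
        # Limit alpha -> 1 is the Shannon entropy.
        entropy = -sum(p * math.log(p) for p in probs)
    else:
        entropy = math.log(sum(p ** alpha for p in probs)) / (1.0 - alpha)
    return entropy / math.log(vocab_size)


uniform = Counter({"a": 25, "b": 25, "c": 25, "d": 25})
skewed = Counter({"a": 97, "b": 1, "c": 1, "d": 1})

print(renyi_efficiency(uniform, vocab_size=4, alpha=2.5))  # 1.0 up to rounding
print(renyi_efficiency(skewed, vocab_size=4, alpha=2.5))   # well below 1
```

A perfectly uniform distribution over the whole vocabulary scores 1, and a distribution dominated by a few very frequent tokens scores much lower, which is the sense in which a higher value indicates more even token usage.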

Downstream LLM Performance

In Nanochat LLM training at 1.5B parameters (depth 24), ToaST yields the best model performance on the CORE metric, a composite average over 22 evaluation benchmarks. ToaST's CORE score exceeds that of BPE by 7.6% and that of UnigramLM by 5.3% (both statistically significant), and that of WordPiece by 2.6%. On individual tasks, ToaST achieves the top score on 13 of the 22 benchmarks.

Pretraining is token-matched, so ToaST's improved compression exposes the network to more text for the same number of tokens (an 11% reduction in token count corresponds to roughly 12% more text per token), a direct data-exposure advantage. Notably, the advantages persist in supervised fine-tuning and several reasoning-oriented downstream tasks.
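The link between a token-count reduction and the extra text seen under a fixed token budget is simple arithmetic, sketched here with a hypothetical helper. If the same text needs a fraction `r` fewer tokens, each token covers `1 / (1 - r)` times as much text.

```python
def extra_text_fraction(token_reduction: float) -> float:
    """Relative increase in text covered per token, given a relative token-count reduction."""
    if not 0.0 <= token_reduction < 1.0:
        raise ValueError("token_reduction must be in [0, 1)")
    return 1.0 / (1.0 - token_reduction) - 1.0


print(f"{extra_text_fraction(0.11):.4f}")  # 0.1236
```

So an 11% reduction in token count translates to about 12.4% more text for the same number of training tokens.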

Flexibility and Extensibility

Because split tree construction and vocabulary selection are decoupled, ToaST is highly extensible:

Split tree partitioning can be guided by morpheme boundaries, character boundaries (multi-byte safety), or superword (phrase) statistics as auxiliary criteria.
Linear objective coefficients in the optimization may be reweighted by language or domain, directly addressing known tokenization premiums in multilingual settings and supporting fairer LLMs.

Figure 9: Example of a hierarchical split preference, in which superword, morpheme, character, and byte levels are incorporated into recursive tree construction.
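As one concrete illustration of the character-boundary criterion, split positions can be restricted to places where a new UTF-8 character starts. The helper below is a hypothetical sketch; how ToaST combines such a criterion with its count-based rule is not specified in this summary. It relies only on the UTF-8 property that continuation bytes have the bit pattern `10xxxxxx`.

```python
def char_boundary_splits(piece: bytes) -> list[int]:
    """Split positions i (0 < i < len(piece)) where piece[i] starts a new UTF-8 character."""
    return [i for i in range(1, len(piece)) if (piece[i] & 0xC0) != 0x80]


print(char_boundary_splits("naïve".encode("utf-8")))  # [1, 2, 4, 5]
print(char_boundary_splits("é".encode("utf-8")))      # []
```

Position 3 in `naïve` is excluded because it falls inside the two-byte encoding of `ï`. A piece consisting of a single multi-byte character has no character-level split, so a tree builder using this filter would need to fall back to byte-level splits there, in line with the character-then-byte ordering in the figure.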

Computational Scalability

Warm-started LP solving and the near-integral relaxation of the ToaST IP make it tractable to solve for a wide range of vocabulary sizes and with very large coverage of highly frequent pretokens, capturing roughly 99% of training occurrences with reasonable compute resources.
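Choosing which pretokens receive split trees can be sketched as a frequency cutoff: take the most frequent pretokens until a target share of training occurrences is covered. The function name, the string-keyed counts, and the thresholds below are assumptions for illustration.

```python
from collections import Counter


def pretokens_for_coverage(pretoken_counts: Counter, target: float) -> list[str]:
    """Most frequent pretokens whose occurrences cover at least `target` of the total."""
    total = sum(pretoken_counts.values())
    covered = 0
    selected = []
    for pretoken, count in pretoken_counts.most_common():
        if covered >= target * total:
            break
        selected.append(pretoken)
        covered += count
    return selected


counts = Counter({" the": 90, " cat": 9, " zyx": 1})  # made-up frequencies
print(pretokens_for_coverage(counts, 0.95))  # [' the', ' cat']
```

Because pretoken frequencies are heavily skewed, a high coverage target can be met with far fewer trees than there are distinct pretokens, which matters when training time grows with the number of split trees.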

Conclusion

ToaST implements split tree-based tokenization with global, compression-optimizing vocabulary selection via integer programming. Empirically, ToaST achieves over 11% token count reduction, dramatically fewer fallback tokens, and higher Rényi efficiency compared to BPE, WordPiece, and UnigramLM, with strong downstream LLM effects (7.6% higher CORE vs. BPE). Its generality and extensibility make it relevant for future research into multilingual fairness and linguistically-aware tokenization. ToaST provides a compelling framework for principled tokenizer design, with broad practical and theoretical implications for efficient and effective LLM training and deployment.
