PathPiece Tokenizer: Minimizing Tokens in NLP

Updated 13 November 2025
  • PathPiece Tokenizer is a subword tokenization algorithm that minimizes token count under fixed vocabulary constraints using dynamic programming.
  • It integrates explicit pre-tokenization strategies to enforce linguistic boundaries and improve segmentation structure.
  • Empirical results indicate that reducing token count alone does not enhance model performance, emphasizing the importance of vocabulary initialization and segmentation methods.

PathPiece Tokenizer is a subword tokenization algorithm introduced to test the hypothesis that minimizing the number of tokens in a segmentation—subject to a fixed vocabulary—yields superior downstream performance for transformer-based LLMs. Moving beyond the compression-centric paradigm of Byte-Pair Encoding (BPE), PathPiece provides a dynamic programming approach that, for any input sequence and vocabulary, produces the segmentation with the smallest possible number of tokens, under strict vocabulary and length constraints. The work provides a systematic, empirical evaluation of PathPiece and related tokenization schemes, revealing that minimum-token segmentations do not necessarily translate to improved model performance and exposing the critical importance of pre-tokenization and vocabulary initialization (Schmidt et al., 28 Feb 2024).

1. Formal Specification

Given a fixed vocabulary $V$ containing at least all 256 single-byte tokens and a maximum token length $L$ (set to 16 bytes), PathPiece defines the tokenization of a document $d$ of length $n$ bytes as the decomposition into the minimal possible sequence of tokens $t_1, \ldots, t_K \in V$ such that $t_1 \cdots t_K = d$. The core segmentation objective is:

$$K(d) = |\mathrm{seg}(d)| = \min_{\substack{t_1 \cdots t_K = d \\ t_k \in V}} K$$

Considering an entire corpus $\mathcal{C}$ and vocabulary size $|V| = m$, the corpus token count is defined as:

$$\mathrm{CTC}(V) = \sum_{d \in \mathcal{C}} K(d)$$

The vocabulary construction objective is then:

$$\min_V \; \mathrm{CTC}(V) \quad \text{subject to} \quad |V| = m$$

No token may exceed $L$ bytes, and no out-of-vocabulary tokens occur.
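
To make the objective concrete, here is a minimal brute-force sketch of $K(d)$ and $\mathrm{CTC}$ on a toy vocabulary (the vocabulary, corpus, and names such as `K` and `VOCAB` are illustrative, not from the paper); the dynamic program described in Section 4 computes the same quantity efficiently.

```python
from functools import lru_cache

# Toy setup (illustrative): a real PathPiece vocabulary contains all 256
# single-byte tokens plus ~32k-49k learned multi-byte tokens, with L = 16.
VOCAB = {b"t", b"h", b"e", b" ", b"c", b"a", b"th", b"the", b"cat"}
L_MAX = 16

def K(d: bytes) -> int:
    """K(d): the minimum number of in-vocabulary tokens t_1 ... t_K
    whose concatenation equals d (infinite if no segmentation exists)."""
    @lru_cache(maxsize=None)
    def best(i):
        if i == len(d):
            return 0
        options = [1 + best(j)
                   for j in range(i + 1, min(i + L_MAX, len(d)) + 1)
                   if d[i:j] in VOCAB]
        return min(options, default=float("inf"))
    return best(0)

# Corpus token count: CTC(V) = sum of K(d) over the corpus.
corpus = [b"the cat", b"that cat"]
print([K(d) for d in corpus])        # [3, 5]
print(sum(K(d) for d in corpus))     # CTC over the toy corpus
```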

2. Pre-tokenization Phase

PathPiece supports a range of explicit pre-tokenization rules, which significantly impact the structure and stability of downstream segmentation. The key pre-tokenization strategies evaluated are:

  • FirstSpace: Split text on whitespace and prohibit whitespace in the interior of tokens, following conventional BPE/WordPiece formulations.
  • Space: Assign each space character its own token, as seen in previous work (Gow-Smith et al., 2022).
  • Digit: Assign all ASCII digits as their own single-byte tokens, following usage in models such as LLaMA.

Pre-tokenization is implemented by scanning the input and inserting token boundaries according to the configured rules, restricting where subword boundaries may be formed.
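
The rules above can be realized as simple splitting passes over the raw text. A rough sketch follows (the regular expressions and the `pretokenize` helper are illustrative assumptions, not the paper's reference implementation):

```python
import re

def pretokenize(text: str, scheme: str = "Space+Digit") -> list[str]:
    """Split text into chunks; subword token boundaries may later be placed
    only inside a chunk, never across chunk edges. The patterns below are
    illustrative, not the reference implementation."""
    if scheme == "FirstSpace":
        # A word may carry one leading space, but whitespace never occurs in
        # the interior of a chunk (conventional BPE/WordPiece behaviour).
        return re.findall(r" ?[^ ]+| +", text)
    if scheme == "Space":
        # Every space character becomes its own chunk, hence its own token.
        return [c for c in re.split(r"( )", text) if c]
    if scheme == "Space+Digit":
        # As "Space", but each ASCII digit is also forced into its own chunk.
        return [c for c in re.split(r"( |[0-9])", text) if c]
    return [text]  # "None": no pre-tokenization

print(pretokenize("The valuation is $213M.", "Space+Digit"))
# ['The', ' ', 'valuation', ' ', 'is', ' ', '$', '2', '1', '3', 'M.']
```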

3. Vocabulary Construction Mechanism

PathPiece employs a top-down vocabulary construction procedure, starting from an over-complete vocabulary $V_0$ and iteratively pruning tokens to reach a target size $m$. Initialization schemes include:

  • Frequent n-grams: Select all byte n-grams (up to length $L$) sorted by frequency.
  • BPE: Run standard bottom-up BPE merges until the vocabulary reaches size $|V_0|$.
  • Unigram: Use a large Unigram LM for initial vocabulary.

During pruning, single-byte tokens are always retained, while all other tokens are evaluated for removal by computing $\Delta(t)$, the increase in the corpus token count if token $t$ were deleted. This is estimated efficiently using dynamic programming (DP):

  • A forward DP calculates $p_\ell[i]$: the minimal number of tokens needed to cover positions $1$ through $i$.
  • A backward DP calculates $b_\ell[i]$: the minimal number of tokens needed to cover positions $i$ through $n$.
  • For every occurrence of $t$, alternatives are considered: splitting the occurrence at an internal boundary (increasing the token count as little as possible) or substituting a superset token for it.
  • The resulting change in token count is aggregated into $\Delta(t)$ for each $t$, and the lowest-impact tokens are pruned until the vocabulary size constraint is met.

Each pruning iteration runs in $O(nL^2)$ time, owing to these dynamic programming optimizations.
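
The quantity $\Delta(t)$ can also be written down directly as a difference of corpus token counts. The sketch below spells out one naive pruning round using a hypothetical `segment_len(d, vocab)` that returns $K(d)$ (e.g. the DP of Section 4); the actual algorithm avoids this full re-segmentation by reading $\Delta(t)$ off the forward and backward DP arrays, which is what gives the $O(nL^2)$ cost per iteration.

```python
def corpus_token_count(corpus, vocab, segment_len):
    """CTC(V): total tokens over the corpus under minimum-token segmentation."""
    return sum(segment_len(d, vocab) for d in corpus)

def prune_round(corpus, vocab, segment_len, n_drop):
    """One naive pruning round: drop the n_drop multi-byte tokens whose removal
    increases CTC the least. Single-byte tokens are always kept so that every
    document remains segmentable."""
    base = corpus_token_count(corpus, vocab, segment_len)
    delta = {}
    for t in vocab:
        if len(t) == 1:                      # never prune single-byte tokens
            continue
        delta[t] = corpus_token_count(corpus, vocab - {t}, segment_len) - base
    victims = sorted(delta, key=delta.get)[:n_drop]
    return vocab - set(victims)
```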

4. Segmentation Algorithm

With the vocabulary $V$ finalized, PathPiece segments arbitrary text by constructing a directed acyclic graph (DAG) over positions $0$ to $n$, with edges representing candidate tokens. All edges have unit cost, so the minimum-token tokenization can be computed as a shortest-path problem. The segmentation procedure is as follows:

  • Forward DP:
    • Initialize $p_\ell[0] = 0$.
    • For $e = 1, \ldots, n$:

    $$p_\ell[e] = \min_{1 \leq w \leq L} \{\, p_\ell[e-w] + 1 \;\mid\; d[e-w+1:e] \in V \,\}$$

  • The minimal number of tokens $K(d)$ is recovered as $p_\ell[n]$.

  • Reconstruct the tokenization via a backward pass; ties between equal-length segmentations are resolved either by favoring the longest token (“PathPiece_L”) or randomly (“PathPiece_R”), with the latter introducing subword regularization.
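
A compact sketch of this shortest-path segmentation, including the backward pass, is shown below (the function name, the `random_ties` flag, and the exact tie-break implementation are illustrative; the paper only specifies that PathPiece_L favours the longest token and PathPiece_R breaks ties at random):

```python
import random

def pathpiece_segment(d: bytes, vocab: set[bytes], L: int = 16,
                      random_ties: bool = False) -> list[bytes]:
    """Minimum-token segmentation of d via the forward DP and a backward pass.
    random_ties=False ~ PathPiece_L (prefer the longest token on ties);
    random_ties=True  ~ PathPiece_R (random tie-break, subword regularization)."""
    n = len(d)
    INF = float("inf")
    p = [0] + [INF] * n                      # p[e]: minimal tokens covering d[:e]
    for e in range(1, n + 1):
        for w in range(1, min(L, e) + 1):
            if d[e - w:e] in vocab and p[e - w] + 1 < p[e]:
                p[e] = p[e - w] + 1
    if p[n] == INF:
        raise ValueError("no segmentation: vocabulary must cover all bytes")
    tokens, e = [], n
    while e > 0:                             # walk back along a shortest path
        widths = [w for w in range(1, min(L, e) + 1)
                  if d[e - w:e] in vocab and p[e - w] + 1 == p[e]]
        w = random.choice(widths) if random_ties else max(widths)
        tokens.append(d[e - w:e])
        e -= w
    return tokens[::-1]

vocab = {bytes([b]) for b in range(256)} | {b"th", b"the", b"cat"}
print(pathpiece_segment(b"the cat", vocab))   # [b'the', b' ', b'cat']
```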

5. Empirical Evaluation and Model Training

A large-scale, controlled experimental setup was implemented as follows:

  • LLMs: Decoder-only MPT transformers.

  • Training Corpora: The Pile (825 GB); 6 GB MiniPile subset for tokenizer training.

  • Model Sizes: 54 models at 350M, 6 at 1.3B, and 4 at 2.4B parameters, each pre-trained to ~200B tokens.

  • Vocabularies: Tested at sizes {32,768, 40,960, 49,152} across 18 tokenization variants combining:

    • Vocabulary construction strategies: BPE, Unigram, WordPiece, SaGe, PathPiece_L, PathPiece_R
    • Top-down initialization: n-grams, BPE, Unigram
    • Pre-tokenization: None, FirstSpace, Space+Digit, etc.
    • Segmentation: greedy (longest-match), DP (PathPiece), and Unigram Viterbi

Downstream performance was assessed on 10 multiple-choice tasks (knowledge, commonsense, context) using lm-evaluation-harness; the random baseline was 32% accuracy.

| Tokenizer Variant | Avg. MC Acc. (350M) | Notable Features |
|---|---|---|
| PathPiece_L + BPE-init | 49.4% | Best average; not statistically significant |
| Unigram | 49.0% | Clustered at top |
| BPE | 49.0% | Clustered at top |
| BPE + Greedy | 49.0% | Clustered at top |
| WordPiece | 48.8% | Clustered at top |
| SaGe | 48.6% | Clustered at top |

Vocabulary size in the tested range had only a small effect on accuracy, with results highly correlated across sizes ($R^2 \approx 0.8$), and no monotonic relationship between corpus token count (CTC) and performance was found. The minimum-CTC tokenizer (PathPiece with no pre-tokenization) performed worst among all tested schemes.

6. Analysis of Factors Impacting Tokenization Performance

Key insights from empirical results:

  • Token Count Minimization does not guarantee better downstream accuracy. No consistent improvement is found for PathPiece or any minimum-CTC scheme. In fact, the variant with the absolute fewest tokens (no pre-tokenization) ranked lowest.
  • Pre-tokenization exerts substantial influence: Explicit space-based pre-tokenization (“Space”) outperformed both FirstSpace and no pre-tokenization (statistically significant at $p < 0.01$), supporting the value of hard linguistic boundaries.
  • Vocabulary Initialization is critical: Top-down construction (PathPiece, SaGe) was more successful when initialized from large BPE vocabularies (rather than n-grams or Unigram), indicating that BPE merges encode useful morpheme-like priors ($p < 0.01$).
  • Segmentation Algorithm Matters: For BPE vocabularies, greedy longest-match segmentation was optimal; for Unigram, Viterbi decoding with $-\log p(t)$ weights was preferred; PathPiece vocabularies require the DP segmentation (see the greedy sketch after this list for contrast).
  • Larger Models: Increasing model size shifted performance slightly but maintained the pattern of a statistically indistinguishable cluster at the top.
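
For contrast with the DP segmenter sketched in Section 4, here is a minimal greedy longest-match segmenter of the kind typically paired with BPE/WordPiece vocabularies (illustrative; real implementations also respect pre-tokenization chunks and handle byte fallback):

```python
def greedy_segment(d: bytes, vocab: set[bytes], L: int = 16) -> list[bytes]:
    """Greedy longest-match: at each position take the longest in-vocabulary
    token. Unlike the shortest-path DP, this may produce more than K(d) tokens."""
    tokens, i = [], 0
    while i < len(d):
        for w in range(min(L, len(d) - i), 0, -1):
            if d[i:i + w] in vocab:
                tokens.append(d[i:i + w])
                i += w
                break
        else:  # no token matched, not even a single byte
            raise ValueError("vocabulary does not cover this byte")
    return tokens
```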

An illustrative example: For the sentence “The valuation is estimated to be $213M.”, PathPiece tokenizations under pre-tokenization regimes yielded increasing linguistic coherence when moving from None to FirstSpace to Space.

7. Recommendations and Broader Implications

Empirical analysis supports several recommendations for tokenizer design:

  1. Employ Explicit Pre-tokenization. Use the Space scheme or similar, anchoring subwords at reliable linguistic boundaries.
  2. Initialize Top-down Tokenizer Vocabularies from BPE. BPE-based initialization offers consistently superior priors.
  3. Align Segmentation Algorithms with Vocabulary Construction. BPE vocabularies segment best with greedy, Unigram with Viterbi, and PathPiece with DP.
  4. Do Not Over-optimize Vocabulary Size in the 30K–50K Range. Gains are marginal; prioritize resources elsewhere.
  5. Rethink Compression as the Central Tokenizer Criterion. Linguistic boundary alignment and regularization via pre-tokenization are at least as important as mere token minimization.

All 64 tokenizer variants, associated vocabularies, and pre-trained models (350M, 1.3B, and 2.4B parameters) are released publicly for further study and benchmarking.

These findings reshape understanding of the trade-offs in tokenizer construction, indicating that algorithmic minimality of token count is subordinate to the structure imparted by explicit pre-tokenization and informed vocabulary initialization. The research advocates a more nuanced approach to subword tokenization for NLP model pretraining (Schmidt et al., 28 Feb 2024).
