Continued BPE: Online Vocabulary Refinement
- The paper introduces an online vocabulary refinement mechanism that removes under-trained tokens using the Intersection over Self (IoS) metric.
- Continued BPE improves token compression while maintaining or slightly improving downstream model performance, as evidenced by BLEU and COMET scores compared to vanilla BPE.
- It maintains deterministic, reversible tokenization via event replay, ensuring compatibility with large-scale multilingual and LLM preprocessing pipelines.
Continued BPE training refers to algorithms that augment the classical Byte Pair Encoding (BPE) tokenizer training routine with online vocabulary refinement. This approach addresses structural inefficiencies in vocabulary composition arising from traditional BPE, notably the retention of "intermediate" or under-trained tokens that contribute little to compression or modeling quality. A recent example is the Picky BPE algorithm, which interleaves merge operations with data-driven, frequency-based token removal steps, enabling more efficient, higher-quality vocabularies while maintaining or improving downstream LLM performance (Chizhov et al., 6 Sep 2024).
1. Formal Comparison: Classical vs. Continued BPE
Classical BPE initializes the vocabulary from the set of unique characters or bytes and proceeds with a fixed number of merge steps: at each iteration, the most frequent adjacent token pair $(x_1, x_2)$ in the training corpus is merged to create a new token $x_3 = x_1 x_2$. The process iterates until the target vocabulary size is reached, and no token is ever removed once added.
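For concreteness, a minimal Python sketch of one classical merge iteration, assuming the corpus is already tokenized as a flat list of symbols; the helper names `most_frequent_pair` and `apply_merge` are illustrative, not from any particular library:

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Count adjacent pairs and return the most frequent one."""
    pair_counts = Counter(zip(tokens, tokens[1:]))
    return pair_counts.most_common(1)[0][0]

def apply_merge(tokens, pair):
    """Replace every non-overlapping occurrence of `pair` with its concatenation."""
    x1, x2 = pair
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == x1 and tokens[i + 1] == x2:
            out.append(x1 + x2)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

# One classical BPE iteration: merge the top pair; the new token joins the vocabulary.
tokens = list("low lower lowest")
pair = most_frequent_pair(tokens)   # e.g. ('l', 'o')
tokens = apply_merge(tokens, pair)
```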
Continued BPE (e.g., Picky BPE) introduces a vocabulary refinement mechanism. After each merge of a pair $(x_1, x_2)$ into $x_3$, it evaluates whether $x_1$ or $x_2$ has become effectively redundant using the Intersection over Self (IoS) metric:

$$\mathrm{IoS}(x_i \mid x_1, x_2) = \frac{f_p(x_1, x_2)}{f_t(x_i)}, \quad i \in \{1, 2\},$$

where $f_p(x_1, x_2)$ is the frequency of the pair and $f_t(x_i)$ is the frequency of token $x_i$ prior to merging.
If the IoS meets or exceeds a hyperparameter threshold $T \in (0, 1]$, the corresponding token is removed from the vocabulary. The vanilla BPE procedure is recovered when $T = 1$.
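As an illustration with invented frequencies: suppose the pair ("q", "u") is selected for merging and, before the merge, $f_t(\text{q}) = 1000$ while $f_p(\text{q}, \text{u}) = 990$. Then

$$\mathrm{IoS}(\text{q} \mid \text{q}, \text{u}) = \frac{990}{1000} = 0.99 \geq T = 0.9,$$

so the token "q" would be pruned immediately after the merge; with a higher threshold such as $T = 0.995$ it would be retained.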
2. Metrics, Heuristics, and Hyperparameterization
Three primary criteria underpin continued BPE strategies:
- Vocabulary Efficiency (IoS): The IoS metric determines redundancy at merge time. The only hyperparameter introduced is the threshold $T$, which controls removal aggressiveness.
- Under-trained Token Heuristic: Under-trained tokens are detected post hoc by inspecting embedding norms after NMT model training; tokens with low frequency and low norm are considered under-trained. Picky BPE reduces their presence by replacing rare tokens with more frequent, compositional alternatives (Land and Bartolo, 2024).
- Compression Constraint (Corpus Token Count, CTC): Compression performance is measured using the corpus token count (CTC), the total number of tokens needed to encode the text. CTC is reported relative to vanilla BPE, fixing the baseline at 1.0 (Schmidt et al., 2024); a sketch of this ratio computation follows this list.
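A minimal sketch of how the relative CTC can be computed, assuming `encode_candidate` and `encode_vanilla` are callables that map a string to a list of tokens (the function names are illustrative, not from the paper's code):

```python
def corpus_token_count(encode, corpus):
    """Corpus token count (CTC): total number of tokens needed to encode the corpus."""
    return sum(len(encode(line)) for line in corpus)

def ctc_ratio(encode_candidate, encode_vanilla, corpus):
    """CTC of a candidate tokenizer relative to vanilla BPE (baseline = 1.0; lower is better)."""
    return corpus_token_count(encode_candidate, corpus) / corpus_token_count(encode_vanilla, corpus)
```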
3. Algorithmic Structure and Pseudocode
Training Loop
The training procedure interleaves pair merging and refinement:
```
Input:  C_string, V_target, threshold T in (0, 1]
Output: V_final, event list E

1.  Initialize V = set of unique chars in C_string
2.  Tokenize C into tokens in V
3.  While |V| < V_target:
4.      Count f_p(·,·) and f_t(·)
5.      Pick (x1, x2) maximizing f_p(x1, x2)
6.      x3 = x1 + x2
7.      Add x3 to V; E.append(Merge(x1, x2))
8.      Replace (x1, x2) -> x3 in C
9.      If IoS(x1 | x1, x2) >= T: remove x1 from V; E.append(Remove(x1))
10.     If x2 != x1 and IoS(x2 | x1, x2) >= T: remove x2 from V; E.append(Remove(x2))
11. Return V_final = V, E
```
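Below is a minimal, runnable Python sketch of this loop under simplifying assumptions: it operates on a list of words rather than a single corpus string, and it omits re-splitting the remaining occurrences of a removed token, which a complete implementation would handle via the event list. The name `picky_bpe_train` is illustrative, not the authors' released implementation.

```python
from collections import Counter

def picky_bpe_train(corpus, vocab_target, threshold):
    """Sketch of a Picky-BPE-style training loop with IoS-based online token removal."""
    words = [list(w) for w in corpus]          # character-level start
    vocab = {ch for w in words for ch in w}
    events = []                                # chronological Merge/Remove events

    while len(vocab) < vocab_target:
        pair_freq, tok_freq = Counter(), Counter()
        for w in words:
            tok_freq.update(w)
            pair_freq.update(zip(w, w[1:]))
        if not pair_freq:
            break
        (x1, x2), f_pair = pair_freq.most_common(1)[0]
        x3 = x1 + x2
        vocab.add(x3)
        events.append(("merge", x1, x2))

        # Apply the merge throughout the corpus.
        for i, w in enumerate(words):
            out, j = [], 0
            while j < len(w):
                if j + 1 < len(w) and w[j] == x1 and w[j + 1] == x2:
                    out.append(x3)
                    j += 2
                else:
                    out.append(w[j])
                    j += 1
            words[i] = out

        # Online refinement: remove an operand if (almost) all of its
        # occurrences were consumed by this merge (IoS >= threshold).
        for x in dict.fromkeys((x1, x2)):      # dedupe, handles x1 == x2
            if tok_freq[x] and f_pair / tok_freq[x] >= threshold:
                vocab.discard(x)
                events.append(("remove", x))

    return vocab, events

# Example usage on a toy corpus:
vocab, events = picky_bpe_train(["lower", "lowest", "low"], vocab_target=15, threshold=0.9)
```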
Inference
Tokenization replays the merge and removal events in strict chronological order:
```
Input:  word w, V_final, event list E
Output: sequence W

1. W = split w into chars in V_final
2. M, R = sets of merge, removal events
3. While M ∪ R not empty:
4.     ε = earliest event in E ∩ (M ∪ R)
5.     If ε = Merge(x1, x2): replace [x1, x2] → [x1 + x2] in W
6.     If ε = Remove(x): split x → constituents in W
7.     Update M, R for new W
8. Return W
```
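A simplified Python sketch of event replay, under the assumption that a removed token is split back into the two operands of the merge that created it; unlike the pseudocode above, it replays every recorded event in chronological order rather than tracking the applicable sets M and R:

```python
def picky_bpe_encode(word, events):
    """Replay merge/removal events in training order to tokenize a word."""
    parts = {}          # merged token -> (left, right) operands of its Merge event
    seq = list(word)    # start from a character-level split
    for ev in events:
        if ev[0] == "merge":
            x1, x2 = ev[1], ev[2]
            parts[x1 + x2] = (x1, x2)
            out, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and seq[i] == x1 and seq[i + 1] == x2:
                    out.append(x1 + x2)
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            seq = out
        else:  # ("remove", x): split x back into the operands it was merged from
            x = ev[1]
            if x in parts:
                left, right = parts[x]
                seq = [piece for tok in seq
                       for piece in ((left, right) if tok == x else (tok,))]
    return seq

# Example usage with the events produced by the training sketch above:
# picky_bpe_encode("lowest", events)
```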
4. Token Frequency Dynamics and Online Refinement
Early-stage merges in BPE tend to produce high-frequency tokens; continued BPE leverages this by aggressively removing tokens that become redundant due to merges. If almost every instance of $x_1$ participates in a new merge into $x_3$, then $x_1$ is superfluous, justifying its immediate removal. This dynamic prevents the accumulation of low-utility intermediate tokens and allows removed tokens to be reintroduced later if corpus statistics shift (the "second chance" property).
The value of $T$ tunes removal strictness: lower values make pruning more likely, at the cost of potentially discarding borderline-useful units. Event chronology guarantees reversibility under sequence replay, enabling stable, online vocabulary updates.
5. Empirical Effects on Vocabulary, Compression, and Model Performance
Continued BPE training demonstrates several empirical advantages over vanilla BPE:
- Machine Translation Performance: On standard benchmarks, across multiple language pairs and vocabulary sizes, Picky BPE matches or slightly improves average BLEU and COMET scores. For EN–DE (vocab = 8192), Picky BPE improved COMET from $0.431$ (vanilla BPE) to $0.434$, with comparable BLEU.
- Under-trained Tokens: Visualizing embedding norms against token frequencies shows a reduction in low-frequency, low-norm (truly under-trained) tokens after refinement. Pruned tokens come predominantly from the low-frequency, low-norm cluster, while newly added tokens lie closer to distributional centroids.
- Compression (CTC): Picky BPE consistently matches or improves on vanilla BPE in tokenization compression, with CTC ratios ranging from $0.992$ to $1.000$ relative to the baseline. Alternative techniques such as post hoc vocabulary trimming reportedly degrade compression, increasing sequence length.
- Token Quality: Analysis of word-initial token proportions and mean token length shows that Picky BPE both adds and retains a higher share of linguistically salient (word-initial, underscore-prefixed) tokens than it drops. A larger fraction of added tokens than of dropped tokens were word-initial, and mean token length increased from $5.38$ ($T=1.0$) to $5.50$ ($T=0.6$), suggesting that junk tokens are replaced by longer, more meaningful ones.
- Comparison to Unigram Models: On EN–DE with vocab $32768$, Unigram Segmenters (SentencePiece) worsened CTC ($1.143$, $1.124$), while Picky BPE maintained compression ($0.994$, $0.997$), despite both generating large proportions of word-initial tokens.
6. Implementation and Deployment Considerations
Continued BPE strategies such as Picky BPE require only a single new hyperparameter (the threshold $T$) and strictly preserve the final vocabulary size. Tokenization is deterministic and invertible via event replay, making the approach suitable for large-scale, multilingual, or LLM application settings. It addresses vocabulary bloat and associated failure modes, such as the persistence of under-trained tokens with attendant risks (e.g., hallucinations, model leaks).
In summary, continued/online BPE training, as realized in Picky BPE, provides a principled modification that yields more efficient vocabularies with maintained or improved compression and downstream model quality, while remaining compatible with existing BPE-based preprocessing pipelines (Chizhov et al., 6 Sep 2024).