Byte-Pair Encoding: Efficient Subword Tokenization

Updated 15 December 2025
  • Byte-Pair Encoding (BPE) is a frequency-driven algorithm that iteratively merges the most frequent adjacent symbol pairs to create a compact, open-vocabulary token set.
  • It enhances model performance by reducing out-of-vocabulary (OOV) occurrences and optimizing token counts, benefiting multilingual NLP, machine translation, and speech recognition.
  • Advanced BPE variants incorporate fairness, linguistic alignment, and semantic prioritization, expanding applications to code, symbolic music, and low-resource languages.

Byte-Pair Encoding (BPE) is a frequency-driven, greedy algorithm for subword tokenization that has become foundational in neural language modeling and sequence transduction due to its empirical efficiency, compact vocabularies, and robust handling of out-of-vocabulary (OOV) words and rare subword sequences. BPE was originally motivated by efficient data compression and by the need for open-vocabulary modeling; its core procedure and numerous refinements have been systematically analyzed in both theoretical and practical terms, yielding a suite of advanced tokenization systems that remain central to large-scale multilingual NLP, machine translation, speech recognition, and code or symbolic sequence modeling.

1. Core Algorithmic Principles and Theoretical Foundations

The standard BPE algorithm operates by iteratively merging the most frequent adjacent symbol pairs in a corpus, treating the corpus as a sequence over an initial atomic vocabulary (typically characters or bytes). At each iteration, adjacent bigram frequencies are computed and the most frequent pair $(x^*, y^*)$ is replaced globally with a new symbol $z = x^* \cdot y^*$, which is then added to the vocabulary. The merge process continues until a target vocabulary size $|V|$ is reached, resulting in a merge table and a non-overlapping segmentation regime for both training and inference (Kunchukuttan et al., 2016, Kozma et al., 13 Nov 2024).
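
To make the procedure concrete, the following is a minimal, illustrative Python sketch of the training loop described above (not a production implementation: pair frequencies are counted over all adjacent positions in the raw corpus, and each merge is applied left to right):

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    """Minimal BPE trainer: `corpus` is a list of strings treated as character
    sequences; returns the learned merge table and the re-segmented corpus."""
    seqs = [list(text) for text in corpus]
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the whole corpus.
        pair_counts = Counter()
        for seq in seqs:
            pair_counts.update(zip(seq, seq[1:]))
        if not pair_counts:
            break
        (x, y), _ = pair_counts.most_common(1)[0]   # most frequent pair (x*, y*)
        z = x + y                                   # new symbol z = x*.y*
        merges.append((x, y))
        # Replace every non-overlapping occurrence of (x, y) with z, left to right.
        merged_seqs = []
        for seq in seqs:
            out, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and seq[i] == x and seq[i + 1] == y:
                    out.append(z)
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            merged_seqs.append(out)
        seqs = merged_seqs
    return merges, seqs

# Toy run: three merges compress the 11-symbol string to 5 tokens.
merges, segmented = train_bpe(["aaabdaaabac"], num_merges=3)
print(merges)         # [('a', 'a'), ('aa', 'a'), ('aaa', 'b')] under this tie-breaking
print(segmented[0])   # ['aaab', 'd', 'aaab', 'a', 'c']
```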

Formally, for a string $s$ and merge budget $k$, BPE greedily approximates the optimal pair-encoding problem: find $k$ merges maximizing the compression utility, i.e., the reduction $|s| - |\text{tokenize}_k(s)|$, where $|\cdot|$ denotes sequence length. The underlying optimization problem is APX-complete; that is, it does not admit a polynomial-time approximation scheme. BPE's greedy solution always achieves at least $1/3$ of the optimal compression utility, while worst-case instances limit it to at most $5/8$ of the optimum; on real-world data, empirical ratios typically range from $0.5$ to $0.9$ (Kozma et al., 13 Nov 2024, Zouhar et al., 2023).
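
As a toy illustration of the compression utility (a constructed example, not taken from the cited papers), consider $s = \texttt{aaabdaaabac}$ with a budget of $k = 3$ merges; one greedy run proceeds as

```latex
\begin{align*}
\texttt{aaabdaaabac} &\;\xrightarrow{\,aa \to Z\,}\; \texttt{ZabdZabac} && (\text{length } 9)\\
                     &\;\xrightarrow{\,Za \to Y\,}\; \texttt{YbdYbac}   && (\text{length } 7)\\
                     &\;\xrightarrow{\,Yb \to X\,}\; \texttt{XdXac}     && (\text{length } 5)
\end{align*}
```

yielding a compression utility of $|s| - |\text{tokenize}_3(s)| = 11 - 5 = 6$.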

Computationally, standard BPE's runtime for $M$ merges over an $N$-length corpus is $O(NM)$; this can be improved to $O(N \log M)$ using priority queues and linked-list index management (Zouhar et al., 2023). Brute-force search for optimal merges is exponential in $M$ but can be pruned via memoization and safe permutation orders.
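
The data-structure idea behind the faster bound can be sketched as a max-heap with lazy invalidation: when a pair's count changes, a fresh entry is pushed rather than updating the heap in place, and stale entries are simply discarded on pop. The sketch below shows only this selection step and is a simplification; the linked-list bookkeeping over the corpus that updates neighbouring pair counts after each merge is omitted.

```python
import heapq
from collections import Counter

class LazyPairHeap:
    """Max-heap over pair frequencies with lazy invalidation: the current most
    frequent pair is retrieved without rescanning all pair counts."""

    def __init__(self, initial_counts):
        self.counts = Counter(initial_counts)
        self.heap = [(-c, p) for p, c in self.counts.items()]
        heapq.heapify(self.heap)

    def update(self, pair, delta):
        """Adjust a pair's count (e.g. after a merge changes its neighbours)."""
        self.counts[pair] += delta
        heapq.heappush(self.heap, (-self.counts[pair], pair))

    def pop_most_frequent(self):
        """Return (pair, count) for the currently most frequent pair, or None."""
        while self.heap:
            neg_count, pair = heapq.heappop(self.heap)
            if -neg_count > 0 and self.counts.get(pair, 0) == -neg_count:
                return pair, -neg_count
            # Stale entry: the count changed after this entry was pushed.
        return None

# Usage inside a BPE trainer: after each merge, call update() only for the
# pairs whose counts actually changed, then pop_most_frequent() selects the
# next merge in amortized logarithmic time instead of a full rescan.
```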

2. Extensions and Variants for Practice and Fairness

BPE’s merge-heuristic admits various extensions. Scaffold-BPE dynamically removes low-frequency “scaffold tokens” (tokens used only as components of longer tokens), using a parameter-free decision rule based on token frequency after each merge. Scaffold-BPE thereby increases the average frequency of rare tokens, smooths learning imbalances, and yields consistent improvements in zero/few-shot accuracy on language modeling and translation tasks (Lian et al., 27 Apr 2024).
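
A small sketch of the idea follows. The specific decision rule used here is an assumption for illustration only (the variant is described above only as parameter-free and frequency-based): after merging $(x, y) \to z$, a constituent token whose remaining standalone frequency falls below that of the new token is flagged as a scaffold and excluded from the final vocabulary.

```python
def find_scaffolds(token_freq, merged_pair, new_token):
    """Illustrative scaffold check (assumed rule, not the paper's exact criterion):
    after merging (x, y) -> z, flag x or y as scaffold tokens if they now occur
    less often on their own than the newly created token z does."""
    x, y = merged_pair
    return {t for t in (x, y) if token_freq.get(t, 0) < token_freq.get(new_token, 0)}

# Scaffold tokens remain usable as components of later merges during training,
# but are dropped from the vocabulary that the model ultimately sees.
```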

LBPE (Long-token-first BPE) modifies the encoding procedure to maximize the selection of longest possible tokens: in each pass, the longest valid span found in the vocabulary is prioritized for merging, effectively balancing frequency and semantic richness and mitigating the predominance of short, high-frequency tokens that can bottleneck token learning (Lian et al., 8 Nov 2024).
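
The encoding-side change can be pictured with a greedy longest-match segmenter; this is a simplified sketch of the long-token-first idea, with an illustrative vocabulary and matching rule rather than the paper's exact procedure:

```python
def longest_first_encode(word, vocab, max_token_len=16):
    """Greedy longest-match segmentation: at each position, take the longest
    span (up to max_token_len) that exists in the vocabulary, falling back to
    a single character when nothing longer matches."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(min(len(word), i + max_token_len), i, -1):
            span = word[i:j]
            if span in vocab or j == i + 1:   # single characters are always allowed
                tokens.append(span)
                i = j
                break
    return tokens

vocab = {"un", "believ", "unbeliev", "able", "ab", "le"}
print(longest_first_encode("unbelievable", vocab))   # ['unbeliev', 'able']
```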

Parity-aware BPE explicitly optimizes a min-max objective over languages in multilingual corpora: at each merge, the target language with the lowest current compression gain is selected and merges are computed only from its corpus, trading a small decrease in overall compression (<1%) for large improvements in tokenization parity (Gini coefficient reduction from 0.064 to 0.011). Downstream model performance is unaffected, while token counts and perplexities become more equitable across languages (Foroutan et al., 6 Aug 2025).
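
A minimal sketch of the min-max merge selection follows; measuring per-language compression as the fraction of bytes saved is an assumption made here for concreteness:

```python
from collections import Counter

def parity_aware_merge_step(corpora, byte_lengths):
    """One merge step of a parity-aware trainer (illustrative sketch).
    corpora: dict lang -> list of token sequences under the current segmentation.
    byte_lengths: dict lang -> total byte length of that language's raw text."""
    # 1. Identify the language with the lowest compression gain so far
    #    (here: fraction of bytes saved, an assumed proxy for compression rate).
    def compression_gain(lang):
        n_tokens = sum(len(seq) for seq in corpora[lang])
        return 1.0 - n_tokens / byte_lengths[lang]
    worst_lang = min(corpora, key=compression_gain)

    # 2. Compute the next merge from that language's corpus only.
    pair_counts = Counter()
    for seq in corpora[worst_lang]:
        pair_counts.update(zip(seq, seq[1:]))
    best_pair = pair_counts.most_common(1)[0][0] if pair_counts else None
    return worst_lang, best_pair
```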

Hierarchical BPE introduces a two-level compression architecture: first-level BPE tokens are reinterpreted as “patches” (each augmented with explicit end-of-patch markers), and a second BPE pass is applied to the patches’ byte sequences to enforce a maximum patch length. This approach matches or exceeds baseline methods for rare word handling and multi-lingual flexibility, with language-agnostic applicability (Dolga et al., 17 Oct 2025).

SuperBPE relaxes the traditional constraint that merges respect whitespace boundaries throughout, allowing, after an initial phase of intra-word merges, merging of adjacent tokens that cross whitespace boundaries. This increases the frequency of multi-word expressions as first-class tokens and reduces both token count and inference compute (up to 33% fewer tokens and 27% fewer FLOPs per byte), while providing a +4% average absolute gain in downstream accuracy on multitask benchmarks in high-capacity (8B) models (Liu et al., 17 Mar 2025).
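
The two-stage schedule can be sketched as a change in how candidate pairs are counted; marking word starts with a leading space and switching phases after a fixed number of merges are assumed conventions for this illustration:

```python
from collections import Counter

def count_candidate_pairs(seqs, allow_superwords):
    """Count adjacent token pairs. In the first phase (allow_superwords=False),
    pairs whose right element begins a new word (marked here by a leading space)
    are skipped, so merges stay within word boundaries; in the second phase they
    are counted as well, yielding multi-word 'superword' tokens."""
    pairs = Counter()
    for seq in seqs:
        for x, y in zip(seq, seq[1:]):
            if not allow_superwords and y.startswith(" "):
                continue
            pairs[(x, y)] += 1
    return pairs

# Training schedule: run the first t merges with allow_superwords=False,
# then spend the remaining merge budget with allow_superwords=True.
```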

3. Linguistic and Application-Centric Adaptations

BPE’s purely frequency-driven merges can result in linguistically incongruous tokens in unsegmented scripts (e.g., Chinese, Japanese, Thai) or heavily inflected languages. Entropy-driven pre-tokenization approaches insert boundaries on the basis of pointwise mutual information (PMI) and contextual entropy or via local peaks in next-character predictive entropy from pretrained LLMs. These boundaries are inserted prior to BPE merge training, constraining merges to operate only within statistically coherent or uncertain regions, aligning subword units with morphological/lexical boundaries and significantly improving segmentation metrics (e.g., F1 on PKU Chinese segmentation increased by >8 points) (Hu et al., 18 Jun 2025).
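
A simplified sketch of the boundary-insertion step follows; it illustrates only the local-entropy-peak rule, and obtaining the per-position next-character entropies from a pretrained model is assumed rather than shown:

```python
def entropy_boundaries(text, entropies, margin=0.0):
    """Insert a pre-tokenization boundary after position i whenever the
    next-character predictive entropy at i is a local peak, i.e. higher than
    at both neighbouring positions by at least `margin`."""
    boundaries = set()
    for i in range(1, len(text) - 1):
        if (entropies[i] > entropies[i - 1] + margin
                and entropies[i] > entropies[i + 1] + margin):
            boundaries.add(i)
    return boundaries

def apply_boundaries(text, boundaries):
    """Split the text at the chosen positions; BPE merges are then trained
    inside each chunk and never cross a chunk boundary."""
    chunks, start = [], 0
    for i in sorted(boundaries):
        chunks.append(text[start:i + 1])
        start = i + 1
    chunks.append(text[start:])
    return chunks
```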

For morphologically rich languages such as Bengali, optimal vocabulary sizes are in the 500–1000 token range, ensuring that high-productivity morphemes emerge early and limiting overfitting to rare or OOV forms. Overly large vocabularies (>2000) can degrade out-of-distribution ASR performance due to spurious memorization on rare, unproductive tokens (Samin, 28 Jan 2024). For machine translation, BPE granularity is highly consequential. Asymmetric BPE, with more merges (larger subwords) for the encoder/source and fewer merges (smaller subwords) for the decoder/target, significantly improves low-resource MT performance (e.g., up to +5.8 CHRF++ on EN-HI at 50K sentences), mitigating target-side data sparsity and improving alignment (Yadav et al., 5 Nov 2025).
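
In practice, the asymmetric setup amounts to training two separate BPE models with different merge budgets, one per side. A hedged sketch using the Hugging Face tokenizers library is shown below; the file names and vocabulary sizes are placeholders rather than the settings reported in the cited work:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

def train_bpe_tokenizer(files, vocab_size):
    # Character-level BPE model with whitespace pre-tokenization.
    tok = Tokenizer(BPE(unk_token="[UNK]"))
    tok.pre_tokenizer = Whitespace()
    tok.train(files, BpeTrainer(vocab_size=vocab_size, special_tokens=["[UNK]"]))
    return tok

# Asymmetric granularity: more merges (larger subwords) on the source/encoder
# side, fewer merges (smaller subwords) on the target/decoder side.
src_tok = train_bpe_tokenizer(["train.en"], vocab_size=16_000)  # placeholder paths and sizes
tgt_tok = train_bpe_tokenizer(["train.hi"], vocab_size=4_000)
```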

In music and code modeling, BPE is repurposed to operate over abstract event tokens (e.g., REMI events in music, programming symbols in code). For symbolic music, BPE “supertokens” in polyphonic corpora learn to capture recurring rhythmic, harmonic, or melodic motifs, which directly benefits phrase segmentation. In source code, BPE provides a compact subword vocabulary that eliminates the need for UNK tokens, solving the rare-identifier problem and simplifying model architecture (Le et al., 2 Oct 2024, Arkesteijn et al., 2020).

4. Empirical Properties and Evaluation

BPE enables sequence models to operate on open-vocabulary data by segmenting rare/unseen words into shared subword units. Empirically, BPE dramatically reduces the OOV rate, giving correct translation rates for true OOVs in the ≈55–60% range overall, with performance depending heavily on OOV type (named entities: >75%, inflectional variants: 20–30%) and language similarity (Araabi et al., 2022). The number of merges M controls a trade-off: smaller M yields finer granularity (more resilience to rare words, longer sequences), while larger M gives more compact, frequent tokens but at the risk of overfitting to rare forms.

On machine translation, setting M=20K–50K merges is standard for joint source-target BPE. However, in morphologically complex or low-resource settings, careful tuning (including asymmetry) is essential for best performance (Yadav et al., 5 Nov 2025). In code modeling, BPE with |V| ≈ 8,000 subwords attains a 10×–15× reduction in vocabulary compared to word-level, with only modest increase over pure character-level (Arkesteijn et al., 2020).

In symbolic music, supertokens can encode both structural and motif-level patterns, and BPE’s utility as a phrase segmentation preprocessor is highly contingent on the level of polyphony and merge budget (Le et al., 2 Oct 2024). In speech recognition (ASR), a “sweet spot” emerges for |V| in the 500–1000 range, balancing OOV robustness and generalization (Samin, 28 Jan 2024).

5. Limitations, Open Problems, and Future Directions

Theoretical analyses reveal that while BPE is robust in practice, its greedy approach yields utility that is at best a constant-factor approximation to optimal compression, and on adversarial inputs may vary between $0.333$ and $0.625$ of the possible optimum (Kozma et al., 13 Nov 2024, Zouhar et al., 2023). BPE’s frequency-only merges can be suboptimal with respect to semantics, morphology, or cross-lingual fairness, leading to research into linguistically and parity-aware algorithms (Foroutan et al., 6 Aug 2025, Hu et al., 18 Jun 2025).

Key open directions include: integrating compositional or semantic criteria into the merge objective, adapting BPE-style algorithms to speech/vision/structured domains, optimizing for downstream model objectives beyond mere compression, and principled search over the merge order or hyperparameter space. The combination of computational-efficiency improvements (heap-based, hierarchical, and dynamic BPE algorithms) and fairness- or linguistically-motivated modifications represents an active area of tokenization and subword modeling research.

6. Comparative Summary of BPE Algorithms and Refinements

| BPE Variant | Key Mechanism | Targeted Effects |
|---|---|---|
| Standard (greedy) | Most-frequent pair merges | Compression, OOV handling |
| Scaffold-BPE | Scaffold-token removal | Reduce frequency imbalance, smooth learning |
| LBPE | Long-token priority in encoding | Boost long-token usage, mitigate short-token dominance |
| Parity-aware BPE | Per-language max-min compression rate | Cross-lingual fairness in token counts |
| Entropy/PMI pre-tokenization | Information-theoretic boundary cues | Linguistic alignment in unsegmented languages |
| SuperBPE | Merges across whitespace | Efficient multi-word (“superword”) tokens |
| Hierarchical BPE | Two-level compression over patches | Mixed character/subword flexibility, low parameter count |

Each refinement targets specific limitations or inefficiencies of classic BPE, offering trade-offs between compression rate, language fairness, computational cost, interpretability, and alignment with true linguistic or semantic boundaries.

7. Impact and Broader Implications

Byte-Pair Encoding has established itself as the central algorithmic paradigm for subword tokenization in both monolingual and multilingual neural sequence models. Its theoretical analysis as an APX-complete combinatorial problem and the emergence of efficiency, fairness, and language-agnostic extensions demonstrate both its robustness and its ongoing adaptability to new domains and modeling regimes (Kozma et al., 13 Nov 2024, Foroutan et al., 6 Aug 2025, Dolga et al., 17 Oct 2025). As large-scale pretrained models increasingly serve heterogeneous user bases and diverse linguistic communities, advanced BPE variants such as Scaffold-BPE, LBPE, Parity-aware BPE, and entropy-driven pretokenization offer principled avenues for improving both model accuracy and societal equity in tokenization-driven computational costs and error rates.
