Unigram Tokenization Model in NLP
- The unigram tokenization model is a probabilistic subword segmentation framework that defines a vocabulary with independent token probabilities.
- It employs expectation–maximization and iterative vocabulary pruning to maximize the marginal likelihood of observed text.
- This approach enhances morphological fidelity and segmentation efficiency, outperforming merge-based methods in multilingual scenarios.
A unigram tokenization model is a probabilistic framework for subword segmentation, foundational to modern NLP tokenization pipelines such as those implemented in the SentencePiece toolkit. The model defines a vocabulary of subword units and a probability distribution over these units, segmenting input text into token sequences such that each token in a segmentation is sampled independently according to its assigned probability. Unlike merge-based schemes, the unigram model treats subword selection as a latent variable problem and seeks to maximize the marginal likelihood of the observed text across all valid segmentations, typically via expectation–maximization algorithms with iterative vocabulary pruning. This statistical formulation has shown robust empirical and theoretical benefits, especially for morphologically rich or multilingual data.
1. Mathematical Formulation and Model Objective
Let $V$ denote the token vocabulary, with categorical probability mass function $p(t)$ satisfying $\sum_{t \in V} p(t) = 1$. Given an input string $x$, its possible segmentations $S(x)$ are sequences of tokens from $V$ whose concatenation yields $x$. The joint probability of a segmentation $\mathbf{t} = (t_1, \dots, t_n) \in S(x)$ is

$$P(\mathbf{t}) = \prod_{i=1}^{n} p(t_i).$$

The probability of the string is a sum over all valid segmentations:

$$P(x) = \sum_{\mathbf{t} \in S(x)} P(\mathbf{t}) = \sum_{\mathbf{t} \in S(x)} \prod_{i=1}^{|\mathbf{t}|} p(t_i).$$

The objective is maximum marginal likelihood over a corpus $D$:

$$\mathcal{L} = \sum_{x \in D} \log P(x).$$
Direct optimization is intractable due to the exponential number of segmentations, necessitating approximate training methods (Land et al., 14 Dec 2025, Karthika et al., 21 Jun 2025, Bostrom et al., 2020).
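The marginalization above can be sketched directly for short strings. The following toy example (vocabulary and probability values are illustrative assumptions, not learned parameters) sums the product of token probabilities over every valid segmentation via memoized recursion:

```python
from functools import lru_cache

# Toy unigram vocabulary with illustrative (assumed) probabilities.
VOCAB = {"h": 0.05, "u": 0.05, "g": 0.05, "hu": 0.15,
         "ug": 0.15, "hug": 0.2, "s": 0.1, "gs": 0.05}

def marginal_prob(x: str) -> float:
    """P(x): sum over all segmentations of the product of token probabilities."""
    @lru_cache(maxsize=None)
    def rec(i: int) -> float:
        # rec(i) = total probability of all segmentations of the suffix x[i:].
        if i == len(x):
            return 1.0
        total = 0.0
        for j in range(i + 1, len(x) + 1):
            piece = x[i:j]
            if piece in VOCAB:
                total += VOCAB[piece] * rec(j)
        return total
    return rec(0)
```

For "hug" the sum covers four segmentations (h·u·g, h·ug, hu·g, hug), illustrating why the number of terms grows exponentially with string length and why exact optimization over a full corpus is intractable.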
2. Vocabulary Learning, EM Training, and Pruning
Training employs the Expectation–Maximization (EM) algorithm. For each subword $t \in V$ and corpus $D$:
- E-step: Compute the expected count of $t$ under the posterior over segmentations:

$$c(t) = \sum_{x \in D} \sum_{\mathbf{t} \in S(x)} P(\mathbf{t} \mid x)\, \mathrm{count}(t, \mathbf{t}).$$

- M-step: Normalize the expected counts to update probabilities:

$$p(t) \leftarrow \frac{c(t)}{\sum_{t' \in V} c(t')}.$$
Vocabulary pruning proceeds iteratively: after a set number of EM steps, the lowest-probability tokens (a fixed bottom fraction, or those whose cumulative mass falls below a threshold) are removed and the distribution is re-normalized. This repeats until the target vocabulary size is achieved (e.g., up to 256K) (Land et al., 14 Dec 2025, Karthika et al., 21 Jun 2025). Kudo's SentencePiece implementation uses fast pruning heuristics (e.g., removing the lowest-scoring 5–10% of tokens per pruning round).
Seed vocabularies are generated via suffix arrays and frequency thresholds, and hyperparameters (seed size factor, pruning rate, EM sub-iterations, prune thresholds) have been shown to trade off compression versus likelihood, usually with Pareto-optimal frontiers (Land et al., 14 Dec 2025). Final vocabularies can be tailored for in-domain compression or morphological alignment.
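One EM iteration plus a pruning round can be sketched as follows. This is a minimal illustration under simplifying assumptions: segmentations are enumerated by brute force rather than via the forward–backward recursions used in practice, and the vocabulary, corpus, and pruning fraction are toy values:

```python
import math

def segmentations(x, vocab):
    """Enumerate all segmentations of x into tokens drawn from vocab."""
    if not x:
        yield []
        return
    for j in range(1, len(x) + 1):
        if x[:j] in vocab:
            for rest in segmentations(x[j:], vocab):
                yield [x[:j]] + rest

def em_step(corpus, probs):
    """E-step: posterior-weighted expected token counts. M-step: renormalize."""
    counts = {t: 0.0 for t in probs}
    for x in corpus:
        segs = list(segmentations(x, probs))
        joint = [math.prod(probs[t] for t in seg) for seg in segs]
        z = sum(joint)  # P(x), the marginal over segmentations
        for seg, w in zip(segs, joint):
            for t in seg:
                counts[t] += w / z  # P(seg | x) contributes once per occurrence
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

def prune(probs, frac=0.1):
    """Drop the lowest-probability fraction of tokens, always keeping
    single characters so every string stays coverable; renormalize."""
    ranked = sorted(probs, key=probs.get, reverse=True)
    n_keep = max(1, int(len(ranked) * (1 - frac)))
    kept = set(ranked[:n_keep]) | {t for t in probs if len(t) == 1}
    z = sum(probs[t] for t in kept)
    return {t: probs[t] / z for t in kept}
```

Alternating `em_step` and `prune` until the vocabulary reaches the target size mirrors the iterative schedule described above, at toy scale.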
3. Segmentation Inference and Decoding
Given learned probabilities and vocabulary, segmentation inference on new text entails searching for the segmentation $\mathbf{t}^\star = \arg\max_{\mathbf{t} \in S(x)} P(\mathbf{t})$. This is performed with Viterbi dynamic programming:
- For each index $i$ in $1, \dots, |x|$, compute

$$\delta(i) = \max_{\substack{0 \le j < i \\ x_{j+1:i} \in V}} \bigl[\delta(j) + \log p(x_{j+1:i})\bigr], \qquad \delta(0) = 0.$$
- Backpointer arrays yield the argmax sequence (Bostrom et al., 2020, Land et al., 14 Dec 2025).
Alternate approaches use the forward–backward algorithm for probabilistic segmentation or confidence estimation, but most implementations employ Viterbi for efficiency (Land et al., 14 Dec 2025).
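The Viterbi recursion with backpointers can be sketched as follows (log-probabilities are assumed precomputed from a toy vocabulary; this is an illustration, not a production decoder):

```python
import math

def viterbi_segment(x, log_probs):
    """Max-probability segmentation under a unigram model, via dynamic
    programming over end positions with backpointers."""
    n = len(x)
    best = [-math.inf] * (n + 1)  # best[i]: best log-prob of segmenting x[:i]
    best[0] = 0.0
    back = [0] * (n + 1)          # back[i]: start index of the token ending at i
    for i in range(1, n + 1):
        for j in range(i):
            piece = x[j:i]
            if piece in log_probs and best[j] + log_probs[piece] > best[i]:
                best[i] = best[j] + log_probs[piece]
                back[i] = j
    if best[n] == -math.inf:
        raise ValueError("string not coverable by the vocabulary")
    tokens, i = [], n
    while i > 0:               # follow backpointers to recover the argmax
        tokens.append(x[back[i]:i])
        i = back[i]
    return tokens[::-1]
```

The double loop is O(|x|^2) dictionary lookups; practical implementations bound the inner loop by the maximum token length.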
4. Treatment of Word Boundaries and Special Tokens
Boundary convention critically influences vocabulary structure and segmentation. Two principal schemes are in use (Jacobs et al., 2022):
- Word-initial marking: Prefixes (e.g. “▁”) on tokens that follow a word boundary.
- Word-final marking: Suffixes (e.g. underscore) on tokens that precede a word boundary.
Empirical evidence suggests that when training on pre-tokenized data, word-initial marking yields superior compression, whereas on raw text word-final marking minimizes per-word token counts and perplexity. The two marking schemes result in vocabularies that recover complementary sets of morphemes; for maximum morphological fidelity, unioning the two vocabularies is effective. Recommendations include aligning the boundary marking with input preprocessing and explicitly reporting the convention used in publications (Jacobs et al., 2022).
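The two conventions can be illustrated with a simplified preprocessing sketch. The helper names are hypothetical, and real tokenizers apply the marker during vocabulary learning rather than as a standalone string rewrite; this only shows where the marker attaches:

```python
def to_word_initial(text: str, marker: str = "▁") -> str:
    """Word-initial convention: the marker precedes each word, so learned
    tokens carry the marker when they begin a word (SentencePiece-style)."""
    return marker + text.replace(" ", marker)

def to_word_final(text: str, marker: str = "_") -> str:
    """Word-final convention: the marker follows each word, so learned
    tokens carry the marker when they end a word."""
    return text.replace(" ", marker) + marker
```

Under the first convention a token like "▁un" is necessarily a prefix morph; under the second, "ness_" is necessarily a suffix morph, which is one intuition for why the two vocabularies recover complementary morpheme sets.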
Additionally, treating space symbols as standalone tokens (i.e., forbidding internal spaces within any token, with a single special token "␣" representing the space itself) increases morphological validity and improves prefix-splitting accuracy, with no empirically observed negative impact on downstream general NLU tasks (Gow-Smith et al., 2022). This modification is trivial to implement, requiring only that whitespace be preprocessed into explicit tokens and that vocabulary construction disallow multi-character tokens containing a space. Model performance on prefix-rich datasets (e.g., LADEC, MorphoLex) shows consistent F1 improvements, and downstream RoBERTa models using this tokenization achieve the highest accuracies on complex-word classification (Gow-Smith et al., 2022).
5. Empirical Properties and Intrinsic Evaluation
Unigram tokenization outperforms greedy merge-based algorithms such as BPE on various intrinsic metrics:
- Fertility: average tokens per word, consistently lower for larger vocabularies, with Unigram LM providing slightly lower than or comparable fertility to BPE.
- Character-per-token (CPT): Marginally higher CPT for Unigram LM, increasing with vocabulary size.
- Word Fragmentation Rate (WFR): Unigram LM consistently yields lower WFR by $1$–$2$ points across vocabulary sizes.
- Morphological Alignment: Boundary F1 is considerably improved, with Unigram LM aligning segment boundaries more closely with gold morphological splits (e.g., English CELEX2: F1=30.3% for Unigram LM vs. 19.3% for BPE) (Bostrom et al., 2020, Karthika et al., 21 Jun 2025).
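The metrics above admit simple operationalizations; the definitions below are common ones, though exact formulations vary across papers (input is assumed to be a list of per-word token sequences):

```python
def fertility(token_seqs: list[list[str]]) -> float:
    """Average number of tokens per word."""
    return sum(len(seq) for seq in token_seqs) / len(token_seqs)

def chars_per_token(token_seqs: list[list[str]]) -> float:
    """Average characters per token (CPT) over all produced tokens."""
    tokens = [tok for seq in token_seqs for tok in seq]
    return sum(len(t) for t in tokens) / len(tokens)

def word_fragmentation_rate(token_seqs: list[list[str]]) -> float:
    """Fraction of words split into more than one token (WFR)."""
    return sum(len(seq) > 1 for seq in token_seqs) / len(token_seqs)
```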
In multilingual regimes, particularly for Indic and morphologically complex languages, cluster-based vocabulary training in conjunction with Unigram LM ameliorates high-resource imbalance and improves parity and fairness metrics across languages, enabling strong zero-shot performance on extremely low-resource dialects (Karthika et al., 21 Jun 2025).
6. Downstream Impact and Theory
Empirical comparisons show that transformer models (e.g., RoBERTa-base) pretrained with Unigram LM tokenization never underperform, and often surpass, those using BPE across a range of English and Japanese tasks (e.g., SQuAD, MNLI, TyDi QA), with especially large gains for morphologically rich and non-segmented scripts (Bostrom et al., 2020). In complex-word and derivation-rich classification, tokenizers treating spaces as individual tokens yield the highest task accuracy, suggesting an improved capacity to represent complex word formation (Gow-Smith et al., 2022).
From a theoretical perspective, it has been established that transformers trained at the character level on $k$-th-order Markov sources ($k \geq 1$) without tokenization are limited to learning unigram models at the character level, incapable of modeling higher-order dependencies. With a subword tokenizer (including Unigram LM or BPE), even the simplest transformer architecture becomes able to approximate the source entropy rate near-optimally, and the final model behaves as a unigram model over the induced tokens rather than a suboptimal character unigram (Rajaraman et al., 2024).
7. Extensions, Conditional Models, and Multilingual Considerations
Conditional unigram tokenization extends the standard model by training token probability estimates for target-language tokens conditioned on source-language token sequences in parallel corpora. Formally, co-occurrence tables $c(t, s)$ are constructed over target tokens $t$ and source tokens $s$, and probabilities are estimated as

$$p(t \mid s) = \frac{c(t, s)}{\sum_{t'} c(t', s)}.$$
Training uses expected-count or two-step EM updates. Intrinsic evaluations show improved cross-lingual alignment and lower language modeling perplexity but no consistent advantage (and sometimes notable degradation) in translation quality, attributed to the quadratic parameter scaling and data sparsity bottlenecks. Alternative parameterizations and hybrid schemes are hypothesized as future directions for practical cross-lingual tokenization (Vico et al., 10 Jul 2025).
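The normalized co-occurrence estimate can be sketched as below. The pair-extraction step, which in practice comes from aligned parallel data and expected counts under EM, is abstracted into a flat list of (source, target) token pairs; names are illustrative:

```python
from collections import defaultdict

def conditional_probs(pairs: list[tuple[str, str]]) -> dict[str, dict[str, float]]:
    """Estimate p(target_token | source_token) by normalizing co-occurrence
    counts within each source token's row of the table."""
    counts: dict[str, dict[str, float]] = defaultdict(lambda: defaultdict(float))
    for s, t in pairs:
        counts[s][t] += 1.0
    return {s: {t: c / sum(row.values()) for t, c in row.items()}
            for s, row in counts.items()}
```

Each source token carries its own categorical distribution over target tokens, which makes the parameter count scale with the product of the two vocabulary sizes, the quadratic scaling and sparsity bottleneck noted above.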
For multilingual and low-resource settings, Unigram LM demonstrates strong zero-shot transfer when high-resource languages of related script and family are pooled during training, and cluster-based vocabularies mitigate bias toward the most populous languages (Karthika et al., 21 Jun 2025).
References:
- (Karthika et al., 21 Jun 2025) "Multilingual Tokenization through the Lens of Indian Languages: Challenges and Insights"
- (Land et al., 14 Dec 2025) "Which Pieces Does Unigram Tokenization Really Need?"
- (Bostrom et al., 2020) "Byte Pair Encoding is Suboptimal for Language Model Pretraining"
- (Gow-Smith et al., 2022) "Improving Tokenisation by Alternative Treatment of Spaces"
- (Rajaraman et al., 2024) "Toward a Theory of Tokenization in LLMs"
- (Jacobs et al., 2022) "Lost in Space Marking"
- (Vico et al., 10 Jul 2025) "Conditional Unigram Tokenization with Parallel Data"