
Unigram Tokenization Model in NLP

Updated 3 February 2026
  • The unigram tokenization model is a probabilistic subword segmentation framework that defines a vocabulary with independent token probabilities.
  • It employs expectation–maximization and iterative vocabulary pruning to maximize the marginal likelihood of the observed text.
  • This approach improves morphological fidelity and segmentation efficiency, outperforming merge-based methods in multilingual scenarios.

A unigram tokenization model is a probabilistic framework for subword segmentation, foundational to modern NLP tokenization pipelines such as those implemented in the SentencePiece toolkit. The model defines a vocabulary of subword units and a probability distribution over these units, segmenting input text into token sequences such that each token in a segmentation is sampled independently according to its assigned probability. Unlike merge-based schemes, the unigram model treats subword selection as a latent variable problem and seeks to maximize the marginal likelihood of the observed text across all valid segmentations, typically via expectation–maximization algorithms with iterative vocabulary pruning. This statistical formulation has shown robust empirical and theoretical benefits, especially for morphologically rich or multilingual data.

1. Mathematical Formulation and Model Objective

Let $V = \{v_1, \dots, v_{|V|}\}$ denote the token vocabulary, with categorical probability mass function $P$ satisfying $\sum_{v \in V} P(v) = 1$. Given an input string $x$, its possible segmentations $s = (v_{i_1}, \dots, v_{i_n})$ are sequences of tokens from $V$ whose concatenation yields $x$. The joint probability of a segmentation $s$ is

$$P(s) = \prod_{j=1}^{n} P(v_{i_j}).$$

The probability of the string is a sum over all valid segmentations:

$$P(x) = \sum_{s \in \mathrm{Seg}(x)} P(s).$$

The objective is maximum marginal likelihood over a corpus $C = \{x^{(1)}, \dots, x^{(M)}\}$:

$$L(P; C) = \sum_{x \in C} \log P(x).$$

Direct optimization is intractable due to the exponential number of segmentations, necessitating approximate training methods (Land et al., 14 Dec 2025, Karthika et al., 21 Jun 2025, Bostrom et al., 2020).
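Although the number of segmentations is exponential, the marginal $P(x)$ itself factorizes over prefixes and can be computed exactly with a forward recursion. A minimal sketch, using a hypothetical toy vocabulary (the tokens and probabilities below are illustrative, not from any real model):

```python
# Hypothetical toy vocabulary; probabilities sum to 1.
P = {"un": 0.2, "i": 0.1, "gram": 0.3, "ig": 0.1, "ram": 0.1, "u": 0.1, "n": 0.1}

def marginal_prob(x: str) -> float:
    """P(x) = sum over all valid segmentations of x into tokens of P,
    computed in O(len(x) * max token length) instead of by enumeration."""
    max_len = max(len(v) for v in P)
    # alpha[i] holds the total probability mass of all segmentations of x[:i]
    alpha = [0.0] * (len(x) + 1)
    alpha[0] = 1.0  # the empty prefix has exactly one (empty) segmentation
    for i in range(1, len(x) + 1):
        for l in range(1, min(max_len, i) + 1):
            tok = x[i - l:i]
            if tok in P:
                alpha[i] += alpha[i - l] * P[tok]
    return alpha[len(x)]
```

Here `marginal_prob("unigram")` sums the probabilities of all four segmentations that this toy vocabulary admits (e.g. un·i·gram and un·ig·ram).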

2. Vocabulary Learning, EM Training, and Pruning

Training employs the Expectation–Maximization (EM) algorithm. For each subword $v \in V$ and corpus $C$:

  • E-step: Compute expected counts of $v$:

$$c^{(t)}(v) = \sum_{x \in C} \sum_{s \in \mathrm{Seg}(x)} P^{(t)}(s \mid x) \cdot \mathrm{count}_v(s).$$

  • M-step: Normalize to update probabilities:

$$P^{(t+1)}(v) = \frac{c^{(t)}(v)}{\sum_{u \in V} c^{(t)}(u)}.$$

Vocabulary pruning proceeds iteratively: after a set number of EM steps, tokens with the lowest probabilities (the bottom $K$, or cumulative mass below a threshold $\tau$) are removed, and the distribution is re-normalized. This repeats until the target vocabulary size is reached (e.g., $|V| = 32$K–$256$K) (Land et al., 14 Dec 2025, Karthika et al., 21 Jun 2025). Kudo’s SentencePiece implementation uses fast pruning heuristics (e.g., removing 5–10% of tokens per pruning round, setting $\tau \sim 10^{-8}$).
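A single EM iteration can be sketched as follows. This is a toy-scale illustration only: it enumerates segmentations explicitly, whereas production implementations compute the expected counts with the forward–backward algorithm over a lattice.

```python
import math
from collections import defaultdict

def segmentations(x, vocab):
    """Enumerate every segmentation of x into vocabulary tokens (toy scale only)."""
    if not x:
        yield []
        return
    for l in range(1, len(x) + 1):
        if x[:l] in vocab:
            for rest in segmentations(x[l:], vocab):
                yield [x[:l]] + rest

def em_step(corpus, P):
    """One EM iteration: expected token counts (E-step), then re-normalization (M-step)."""
    counts = defaultdict(float)
    for x in corpus:
        segs = list(segmentations(x, P))
        seg_probs = [math.prod(P[t] for t in s) for s in segs]
        Z = sum(seg_probs)  # = P(x), the marginal over all segmentations
        for s, p in zip(segs, seg_probs):
            for t in s:
                counts[t] += p / Z  # posterior P(s|x) times one occurrence of t
    total = sum(counts.values())
    return {v: counts[v] / total for v in P}
```

Pruning then simply drops the lowest-probability entries of the returned distribution and re-normalizes before the next round of EM.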

Seeds for $V$ are generated via suffix arrays and frequency thresholds, and hyperparameters (seed size factor, pruning rate, EM sub-iterations, prune thresholds) have been shown to trade off compression against likelihood, usually along Pareto-optimal frontiers (Land et al., 14 Dec 2025). Final vocabularies can be tailored for in-domain compression or morphological alignment.

3. Segmentation Inference and Decoding

Given the learned probabilities and vocabulary, segmentation inference on new text $x$ entails searching for the segmentation $s^*$ that maximizes $P(s)$. This is performed with Viterbi dynamic programming:

  • For each index $i$ in $x$, compute:

$$\mathrm{best}[i] = \max_{\substack{t \in V,\ |t| \le i,\\ x[i-|t|+1:i] = t}} \bigl(\mathrm{best}[i - |t|] \cdot P(t)\bigr)$$

Alternate approaches use the forward–backward algorithm for probabilistic segmentation or confidence estimation, but most implementations employ Viterbi for efficiency (Land et al., 14 Dec 2025).
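The Viterbi recursion above can be sketched directly, working in log space for numerical stability and keeping backpointers to recover the argmax segmentation. The toy vocabulary is hypothetical; real unigram vocabularies also include every single character so that any input is segmentable.

```python
import math

# Hypothetical toy vocabulary; probabilities sum to 1.
P = {"un": 0.2, "i": 0.1, "gram": 0.3, "ig": 0.1, "ram": 0.1, "u": 0.1, "n": 0.1}

def viterbi_segment(x, P):
    """Return the most probable segmentation of x under the unigram model."""
    n = len(x)
    max_len = max(len(v) for v in P)
    best = [-math.inf] * (n + 1)  # best[i]: max log-prob over segmentations of x[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last token in x[:i]
    best[0] = 0.0
    for i in range(1, n + 1):
        for l in range(1, min(max_len, i) + 1):
            tok = x[i - l:i]
            if tok in P and best[i - l] > -math.inf:
                score = best[i - l] + math.log(P[tok])
                if score > best[i]:
                    best[i], back[i] = score, i - l
    if best[n] == -math.inf:
        raise ValueError("no segmentation; real vocabularies cover all characters")
    # Walk the backpointers from the end of the string to recover tokens.
    tokens, i = [], n
    while i > 0:
        tokens.append(x[back[i]:i])
        i = back[i]
    return tokens[::-1]
```

With this vocabulary, `viterbi_segment("unigram", P)` prefers un·i·gram over un·ig·ram because the former has the higher product of token probabilities.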

4. Treatment of Word Boundaries and Special Tokens

Boundary convention critically influences vocabulary structure and segmentation. Two principal schemes are in use (Jacobs et al., 2022):

  • Word-initial marking: Prefixes (e.g. “▁”) on tokens that follow a word boundary.
  • Word-final marking: Suffixes (e.g. underscore) on tokens that precede a word boundary.

Empirical evidence suggests that when training on pre-tokenized data, word-initial marking yields superior compression, whereas on raw text word-final marking minimizes per-word token counts and perplexity. The two marking schemes produce vocabularies that recover complementary sets of morphemes; for maximum morphological fidelity, taking the union of the two vocabularies is effective. Recommendations include aligning the boundary marking with the input preprocessing and explicitly reporting the convention used in publications (Jacobs et al., 2022).
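The two conventions can be illustrated with a minimal sketch (the subword pieces below are hypothetical; "▁" follows the SentencePiece convention, while the underscore suffix is one common word-final choice):

```python
MARK_INITIAL = "\u2581"  # "▁": SentencePiece-style word-initial marker
MARK_FINAL = "_"         # underscore suffix: a common word-final convention

def mark(pieces, scheme):
    """Attach a boundary marker to the subword pieces of a single word.
    'initial' marks the piece that follows a word boundary;
    'final' marks the piece that precedes one."""
    if scheme == "initial":
        return [MARK_INITIAL + pieces[0]] + pieces[1:]
    return pieces[:-1] + [pieces[-1] + MARK_FINAL]
```

For the (hypothetical) split of "unhappiness" into un·happi·ness, word-initial marking yields `["▁un", "happi", "ness"]` while word-final marking yields `["un", "happi", "ness_"]`, so the two vocabularies memorize different variants of the same morphemes.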

Additionally, treating space symbols as standalone tokens (i.e., forbidding internal spaces within any token except the single special token “␣”) increases morphological validity and improves prefix-splitting accuracy, with no empirically observed negative impact on downstream general NLU tasks (Gow-Smith et al., 2022). The modification is trivial to implement, requiring only that whitespace be preprocessed into explicit tokens and that vocabulary construction disallow multi-character tokens containing a space. Performance on prefix-rich datasets (e.g., LADEC, MorphoLex) shows consistent F1 improvements (roughly +2.7 points), and downstream RoBERTa models using this tokenization achieve the highest accuracies on complex-word classification (Gow-Smith et al., 2022).
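The two required changes are small enough to sketch directly; the function names here are hypothetical, not from any toolkit:

```python
SPACE = "\u2423"  # "␣": explicit space symbol

def preprocess(text):
    """Replace each raw space with the explicit space symbol, so whitespace
    enters the tokenizer as an ordinary (single-character) candidate token."""
    return text.replace(" ", SPACE)

def is_valid_token(tok):
    """Vocabulary constraint: a multi-character token may not contain the
    space symbol, so the only space-bearing token is the standalone SPACE."""
    return len(tok) == 1 or SPACE not in tok
```

During seed-vocabulary generation and pruning, candidates failing `is_valid_token` are simply discarded, which is all the constraint amounts to.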

5. Empirical Properties and Intrinsic Evaluation

Unigram tokenization outperforms greedy merge-based algorithms such as BPE on various intrinsic metrics:

  • Fertility: Average tokens per word—consistently lower for larger vocabularies, with Unigram LM providing slightly lower or comparable fertility to BPE.
  • Character-per-token (CPT): Marginally higher CPT for Unigram LM, increasing with vocabulary size.
  • Word Fragmentation Rate (WFR): Unigram LM consistently yields lower WFR by 1–2 points across vocabulary sizes.
  • Morphological Alignment: Boundary F1 is considerably improved, with Unigram LM aligning segment boundaries more closely with gold morphological splits (e.g., English CELEX2: F1=30.3% for Unigram LM vs. 19.3% for BPE) (Bostrom et al., 2020, Karthika et al., 21 Jun 2025).
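Two of these metrics are simple enough to state as code. A minimal sketch of fertility and per-word boundary F1 (the exact aggregation conventions vary across papers; this follows the common per-word definitions):

```python
def fertility(tokenized_words):
    """Average number of tokens per word."""
    return sum(len(toks) for toks in tokenized_words) / len(tokenized_words)

def boundary_f1(pred, gold):
    """F1 over the internal split positions of one word's segmentation."""
    def cuts(pieces):
        out, pos = set(), 0
        for p in pieces[:-1]:  # every piece but the last contributes a cut
            pos += len(p)
            out.add(pos)
        return out
    p, g = cuts(pred), cuts(gold)
    if not p or not g:
        return 1.0 if p == g else 0.0  # unsplit vs. unsplit is a perfect match
    tp = len(p & g)
    if tp == 0:
        return 0.0
    prec, rec = tp / len(p), tp / len(g)
    return 2 * prec * rec / (prec + rec)
```

For example, predicting unh·appi·ness against a gold split un·happi·ness shares one of two boundaries on each side, giving F1 = 0.5.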

In multilingual regimes, particularly for Indic and morphologically complex languages, cluster‐based vocabulary training in conjunction with Unigram LM ameliorates high-resource imbalance and improves parity and fairness metrics across languages, enabling strong zero-shot performance on extremely low-resource dialects (Karthika et al., 21 Jun 2025).

6. Downstream Impact and Theory

Empirical comparisons show that transformer models (e.g., RoBERTa-base) pretrained with Unigram LM tokenization never underperform, and often surpass, those using BPE across a range of English and Japanese tasks (e.g., SQuAD, MNLI, TyDi QA), with especially large gains for morphologically rich and non-segmented scripts (Bostrom et al., 2020). In complex-word and derivation-rich classification, tokenizers treating spaces as individual tokens yield the highest task accuracy, suggesting an improved capacity to represent complex word formation (Gow-Smith et al., 2022).

From a theoretical perspective, it has been established that transformers trained at the character level on $k$-order Markov sources ($k > 1$) without tokenization are limited to learning unigram models at the character level, incapable of modeling higher-order dependencies. With a subword tokenizer (including Unigram LM or BPE), even the simplest transformer architecture becomes able to approximate the source entropy rate near-optimally, and the final model behaves as a unigram over the induced tokens rather than as a suboptimal character unigram (Rajaraman et al., 2024).

7. Extensions, Conditional Models, and Multilingual Considerations

Conditional unigram tokenization extends the standard model by estimating token probabilities $p(t \mid S)$ for target-language tokens $t$ conditioned on source-language token sequences $S$ in parallel corpora. Formally, co-occurrence tables $c(t, s)$ are constructed and probabilities estimated as

$$p(t \mid S) \approx \frac{\sum_{s_i \in S} c(t, s_i)}{\sum_{t_j \in \mathrm{tgt}} \sum_{s_k \in S} c(t_j, s_k)}$$

Training uses expected-count or two-step EM updates. Intrinsic evaluations show improved cross‐lingual alignment and lower language modeling perplexity but no consistent advantage (and sometimes notable degradation) in translation quality, attributed to the quadratic parameter scaling and data sparsity bottlenecks. Alternative parameterizations and hybrid schemes are hypothesized as future directions for practical cross-lingual tokenization (Vico et al., 10 Jul 2025).
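The conditional estimate above reduces to two sums over a co-occurrence table. A minimal sketch with entirely hypothetical counts (the tokens and values are illustrative, not from any corpus):

```python
# Hypothetical co-occurrence counts c(t, s) between target tokens t and
# source tokens s, as would be accumulated from a parallel corpus.
c = {("hund", "dog"): 8.0, ("hund", "the"): 2.0,
     ("katze", "cat"): 9.0, ("katze", "the"): 1.0}
TARGET_VOCAB = ["hund", "katze"]

def cond_prob(t, S, counts=c, target_vocab=TARGET_VOCAB):
    """p(t|S) ~= sum_i c(t, s_i) / sum_{t'} sum_i c(t', s_i)."""
    num = sum(counts.get((t, s), 0.0) for s in S)
    den = sum(counts.get((tp, s), 0.0) for tp in target_vocab for s in S)
    return num / den if den else 0.0
```

The quadratic parameter scaling mentioned above is visible here: the table is indexed by (target token, source token) pairs, so its size grows with the product of the two vocabulary sizes, which is the sparsity bottleneck the cited work reports.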

For multilingual and low-resource settings, Unigram LM demonstrates strong zero-shot transfer when high-resource languages of related script and family are pooled during training, and cluster-based vocabularies mitigate bias toward the most populous languages (Karthika et al., 21 Jun 2025).

