
Unigram Tokenization Algorithm

Updated 16 December 2025
  • Unigram Tokenization Algorithm is a probabilistic subword segmentation method that models token likelihoods, ensuring lossless text reconstruction.
  • It employs an EM-based training process with iterative vocabulary pruning to optimize segmentation performance on large corpora.
  • Integrated in SentencePiece, the method enhances neural machine translation and speeds up on-the-fly, language-agnostic text processing.

The Unigram Tokenization Algorithm, often referred to as "UnigramLM," is a probabilistic subword segmentation methodology utilized in neural text processing. Distinguished from deterministic or greedy tokenization schemes, UnigramLM underlies the widely adopted SentencePiece toolkit and has become a standard for language-agnostic, lossless, and robust preprocessing in neural systems, particularly in machine translation and large-scale language modeling (Kudo et al., 2018, Land et al., 14 Dec 2025).

1. Mathematical Framework and Objective

Let $V = \{x_1, \dots, x_{|V|}\}$ denote the vocabulary of subword tokens. For an input string $X$, a segmentation is a sequence $\mathbf{x} = (x_{i_1}, \dots, x_{i_k})$ of tokens from $V$ such that their concatenation exactly reconstructs $X$. The UnigramLM makes a strong conditional independence assumption over subwords, leading to the segmentation probability:

$$P(\mathbf{x}) = \prod_{j=1}^{k} p(x_{i_j})$$

where each $p(x_i) > 0$ and the probabilities sum to one: $\sum_{i=1}^{|V|} p(x_i) = 1$.

Unlike approaches that optimize token sequence likelihoods directly, UnigramLM treats the segmentation as latent and optimizes the marginal likelihood over all valid segmentations $\mathbf{x} \in S(X)$:

$$L(V, p) = - \frac{1}{\sum_{X \in C} |X|} \sum_{X \in C} \log \left[ \sum_{\mathbf{x} \in S(X)} P(\mathbf{x}) \right]$$

where $C$ is the corpus. This negative average per-byte log-likelihood is minimized by adjusting both $V$ and $p$ (Land et al., 14 Dec 2025).
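
The marginal over $S(X)$ can be computed exactly with a forward pass over character positions rather than by enumerating segmentations. The following is a minimal illustrative sketch; the toy vocabulary and its probabilities are invented for the example and are not from the cited papers.

# Toy vocabulary with probabilities summing to 1 (illustrative values only).
vocab = {"h": 0.15, "e": 0.15, "l": 0.2, "o": 0.1, "he": 0.1,
         "ll": 0.1, "lo": 0.1, "hell": 0.05, "hello": 0.05}

def marginal_likelihood(text, vocab):
    """Sum of P(x) over all segmentations of `text` into vocabulary tokens,
    computed by forward dynamic programming over end positions."""
    max_len = max(len(t) for t in vocab)
    alpha = [0.0] * (len(text) + 1)  # alpha[i] = mass of all segmentations of text[:i]
    alpha[0] = 1.0
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in vocab:
                alpha[end] += alpha[start] * vocab[piece]
    return alpha[-1]

print(marginal_likelihood("hello", vocab))  # sums over (he, ll, o), (hello), (h, e, l, l, o), ...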

2. EM-Based Training and Vocabulary Pruning

UnigramLM training proceeds by initial "overgeneration" of the vocabulary through extraction of frequent substrings (seed vocabulary $V^0$), followed by iterative pruning while fitting token probabilities with the Expectation-Maximization (EM) algorithm. SentencePiece and subsequent references formalize the process as follows:

  • Seed Vocabulary Construction: $|V^0| = \psi \cdot n$ (typical seed_ratio $\psi = 10$), where $n$ is the target final vocabulary size. Suffix-array and longest-common-prefix (LCP) interval algorithms efficiently generate substrings up to a maximal length (Land et al., 14 Dec 2025).
  • EM Fitting (per pruning iteration):
    • E-Step: For each sentence, the expected count of each token is computed by forward–backward dynamic programming over all segmentation paths.
    • Early Pruning: Tokens with expected count below a threshold $\tau_e$ are removed ($\tau_e = 0.5$ is effective, and results are robust to changes).
    • M-Step: Probabilities are updated as $p(x_i) \gets g(c_i) / \sum_{j \in V} g(c_j)$, where $c_i$ is the expected count of $x_i$ and $g$ is often the digamma function but can default to the identity without material loss ($<0.01\%$ effect on loss).
  • Pruning: Tokens are ranked by the estimated increase in loss if they were removed (computed by retokenizing the corpus without the candidate token). The top $\alpha \cdot |V|$ ($\alpha = 0.75$) are retained, and the process repeats until the vocabulary size reaches the pre-final overshoot factor $\beta \cdot n$ ($\beta = 1.1$). Final pruning is by probability.

The method scales linearly with corpus size and achieves practical runtimes on large datasets (Land et al., 14 Dec 2025, Kudo et al., 2018).
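
The E-step in this loop can be implemented with the same lattice as in Section 1: a forward pass accumulates prefix mass, a backward pass accumulates suffix mass, and every in-vocabulary substring contributes its posterior mass to an expected count. The sketch below is a simplified illustration under stated assumptions (identity in place of the digamma transform $g$, no early-pruning threshold, no suffix-array machinery); it is not the SentencePiece implementation.

from collections import defaultdict

def em_step(corpus, vocab):
    """One EM iteration for a unigram model given as {token: probability},
    over a list of strings. Returns renormalized expected counts."""
    expected = defaultdict(float)
    max_len = max(len(t) for t in vocab)
    for text in corpus:
        n = len(text)
        # Forward pass: fwd[i] = probability mass of all segmentations of text[:i].
        fwd = [0.0] * (n + 1)
        fwd[0] = 1.0
        for end in range(1, n + 1):
            for start in range(max(0, end - max_len), end):
                piece = text[start:end]
                if piece in vocab:
                    fwd[end] += fwd[start] * vocab[piece]
        # Backward pass: bwd[i] = probability mass of all segmentations of text[i:].
        bwd = [0.0] * (n + 1)
        bwd[n] = 1.0
        for start in range(n - 1, -1, -1):
            for end in range(start + 1, min(n, start + max_len) + 1):
                piece = text[start:end]
                if piece in vocab:
                    bwd[start] += vocab[piece] * bwd[end]
        total = fwd[n]
        if total == 0.0:
            continue  # string not coverable by the current vocabulary
        # E-step: posterior expected count of every token occurrence.
        for start in range(n):
            for end in range(start + 1, min(n, start + max_len) + 1):
                piece = text[start:end]
                if piece in vocab:
                    expected[piece] += fwd[start] * vocab[piece] * bwd[end] / total
    # M-step: renormalize (identity used here in place of the digamma transform).
    z = sum(expected.values())
    return {tok: count / z for tok, count in expected.items()}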

3. Algorithm Variants: Final-Style Pruning

Land & Pinter (2025) introduced the "Final-Style Pruning" (FSP) heuristic variant, which replaces loss-based iterative pruning with repeated pruning by lowest token probability:

  1. Construct seed vocabulary and perform EM fitting as in standard UnigramLM.
  2. After each EM block, remove the lowest-$p(x_i)$ tokens to reduce $|V|$ by the factor $\alpha$.
  3. Repeat until $|V| \leq \beta \cdot n$, then do one final prune by $p(x)$.

Empirical evaluation shows that FSP modestly increases the negative log-likelihood (+0.6–1.2% relative) but gains 0.5–1.3% in compression (i.e., it produces fewer tokens for the same data), with morphological alignment metrics falling between those of standard Unigram and BPE (Land et al., 14 Dec 2025).
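
A minimal sketch of the FSP loop, reusing the em_step function from the Section 2 sketch and holding token probabilities in a plain dictionary (both are illustrative choices, not the authors' implementation):

def final_style_pruning(corpus, seed_probs, target_size,
                        alpha=0.75, beta=1.1, em_steps=2):
    """Shrink a seed vocabulary by repeatedly dropping the lowest-probability
    tokens, instead of ranking tokens by their estimated loss increase."""
    probs = dict(seed_probs)  # {token: probability}, e.g. from frequent substrings
    while len(probs) > beta * target_size:
        for _ in range(em_steps):          # one EM block
            probs = em_step(corpus, probs)
        # Final-style pruning step: keep only the top-probability tokens.
        # (Real implementations also protect single characters so that every
        # string remains segmentable.)
        keep = max(int(alpha * len(probs)), int(beta * target_size))
        kept = sorted(probs, key=probs.get, reverse=True)[:keep]
        probs = {tok: probs[tok] for tok in kept}
    # One last prune by probability down to the target size, then renormalize.
    final = sorted(probs, key=probs.get, reverse=True)[:target_size]
    total = sum(probs[tok] for tok in final)
    return {tok: probs[tok] / total for tok in final}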

4. Key Hyperparameters and Their Effects

Extensive evaluation has determined best practices for robust training, summarized as follows:

| Parameter | Default | Observed Effect |
|---|---|---|
| seed_ratio $\psi$ | 10 | $\psi < 4$ degrades both loss and compression; $\psi \geq 10$ is safe |
| em_steps $\tau$ | 2 | Invariant to variation over 1–5 |
| early_thr $\tau_e$ | 0.5 | No consistent loss/compression effect across 0–10 |
| prune_ratio $\alpha$ | 0.75 | Higher $\alpha$ slows pruning, lowers loss but increases token count |
| pre_final $\beta$ | 1.1 | Lower $\beta$ stops pruning earlier, minimal impact |
| digamma on/off | on | Negligible change ($<0.01\%$) |

Recommended defaults are robust for most settings (Land et al., 14 Dec 2025). Seed vocabularies must be at least $8 \cdot n$ in size for well-posed optimization; inadequate seeding materially harms final compression (Land et al., 14 Dec 2025).
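
Some of these hyperparameters appear to correspond to SentencePiece trainer options: the seed size to seed_sentencepiece_size, the pruning ratio $\alpha$ to shrinking_factor, and the EM steps per iteration to num_sub_iterations. This mapping is an assumption based on matching default values, and $\tau_e$ and $\beta$ are internal to the trainer rather than exposed flags. A sketch of Python-API training under that assumption:

import sentencepiece as spm

# Assumed mapping of the paper's hyperparameters onto SentencePiece trainer
# options (based on matching defaults; tau_e and beta are not exposed).
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="spm_unigram",
    model_type="unigram",
    vocab_size=32000,
    character_coverage=1.0,
    seed_sentencepiece_size=320000,  # psi * n with psi = 10
    shrinking_factor=0.75,           # alpha: fraction of tokens kept per prune
    num_sub_iterations=2,            # tau: EM steps per pruning iteration
)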

5. Integration in SentencePiece and Practical Usage

SentencePiece provides reference implementations and command-line tools as well as C++/Python APIs. End-to-end pipelines—including Unicode NFKC normalization (optionally extensible via FSTs), id-mapping, encoding/decoding, and lossless detokenization—are supported for both BPE and UnigramLM paradigms (Kudo et al., 2018).

Typical Usage

  • Training:

spm_train \
  --input=corpus.txt \
  --model_prefix=spm_unigram \
  --model_type=unigram \
  --vocab_size=32000 \
  --character_coverage=1.0 \
  --seed_sentencepiece_size=320000

  • Encoding/Decoding (Python):

import sentencepiece as spm
sp = spm.SentencePieceProcessor()
sp.Load('spm_unigram.model')                # model produced by spm_train above
tokens = sp.EncodeAsPieces('Sample text.')  # subword strings
ids = sp.EncodeAsIds('Sample text.')        # corresponding vocabulary ids
text = sp.DecodeIds(ids)                    # lossless reconstruction of the input

SentencePiece operates losslessly, preserving spacing via the Unicode marker U+2581 ("▁"). No pre-tokenization is required; whitespace is preserved, and rare Unicode symbols are retained when character_coverage is set to 1.0 (Kudo et al., 2018, Land et al., 14 Dec 2025).
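
The lossless round trip can be checked directly. The snippet below assumes the model trained above and that the default normalization leaves the input unchanged (NFKC-stable text, no redundant whitespace); the exact pieces printed depend on the training corpus.

import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load('spm_unigram.model')       # model produced by the spm_train command above

text = 'Hello world.'
pieces = sp.EncodeAsPieces(text)
print(pieces)                      # e.g. ['▁Hello', '▁world', '.'] (model-dependent)
assert sp.DecodePieces(pieces) == text             # exact reconstruction
assert sp.DecodeIds(sp.EncodeAsIds(text)) == text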

6. Empirical Properties and Impact

Empirical evaluation on neural machine translation benchmarks (e.g., English ↔ Japanese) demonstrates that UnigramLM with SentencePiece delivers BLEU improvements of 1–1.5 points over word-level baselines at a fraction of vocabulary size. Training directly on raw sentences, without external pretokenization or language-specific rules, yields results as good as or superior to pipelines with hand-crafted segmentation (Kudo et al., 2018).

SentencePiece's raw-Japanese tokenization is approximately 380× faster than word-level pre-tokenization approaches, with throughput exceeding 20,000 sentences per second on modern CPUs, suitable for on-the-fly inference (Kudo et al., 2018).

7. Limitations and Implementational Considerations

The principal limitation is algorithmic complexity: standard UnigramLM requires EM inference and loss-based pruning, implemented efficiently only in select systems like SentencePiece. Corpus duplication alters early pruning, and rare Unicode drop-out can occur if character_coverage is set below 1.0 (Land et al., 14 Dec 2025). Final-Style Pruning is not yet exposed as a SentencePiece command-line option, requiring manual modification for experimental use (Land et al., 14 Dec 2025).

A plausible implication is that while UnigramLM achieves rigorous marginal likelihood objectives and adaptivity across languages, production systems may trade off exact objective minimization for engineering simplicity and greater compression using FSP or similar heuristics.


References:

  • SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing (Kudo et al., 2018).
  • Which Pieces Does Unigram Tokenization Really Need? (Land et al., 14 Dec 2025).
