Unigram Tokenization Algorithm
- The Unigram Tokenization Algorithm is a probabilistic subword segmentation method that models token likelihoods while ensuring lossless text reconstruction.
- It employs an EM-based training process with iterative vocabulary pruning to optimize segmentation performance on large corpora.
- Integrated in SentencePiece, the method enhances neural machine translation and speeds up on-the-fly, language-agnostic text processing.
The Unigram Tokenization Algorithm, often referred to as "UnigramLM," is a probabilistic subword segmentation methodology utilized in neural text processing. Distinguished from deterministic or greedy tokenization schemes, UnigramLM underlies the widely adopted SentencePiece toolkit and has become a standard for language-agnostic, lossless, and robust preprocessing in neural systems, particularly in machine translation and large-scale language modeling (Kudo et al., 2018, Land et al., 14 Dec 2025).
1. Mathematical Framework and Objective
Let $V$ denote the vocabulary of subword tokens. For an input string $x$, a segmentation $\mathbf{t} = (t_1, \ldots, t_n)$ is a sequence of tokens from $V$ such that their concatenation exactly reconstructs $x$. The UnigramLM makes a strong conditional independence assumption over subwords, leading to the segmentation probability:

$$P(\mathbf{t}) = \prod_{i=1}^{n} p(t_i),$$

where each $t_i \in V$ and the probabilities sum to one: $\sum_{t \in V} p(t) = 1$.

Unlike approaches that optimize token sequence likelihoods directly, UnigramLM treats the segmentation as latent, optimizing the marginal likelihood over all valid segmentations $S(x)$:

$$\mathcal{L} = -\frac{1}{N_{\mathrm{bytes}}(D)} \sum_{x \in D} \log \sum_{\mathbf{t} \in S(x)} P(\mathbf{t}),$$

where $D$ is the corpus and $N_{\mathrm{bytes}}(D)$ its total size in bytes. This negative average per-byte log-likelihood is minimized by adjusting both $V$ and the probabilities $p(t)$ (Land et al., 14 Dec 2025).
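To make the marginal concrete, the brute-force sketch below enumerates every valid segmentation of a short string under a toy vocabulary with hypothetical probabilities and sums their unigram probabilities; production implementations replace enumeration with dynamic programming.

```python
# Toy vocabulary with hypothetical unigram probabilities (illustrative values only).
VOCAB = {"un": 0.1, "i": 0.05, "gram": 0.1, "unigram": 0.2,
         "u": 0.02, "n": 0.02, "ig": 0.03, "ram": 0.05}

def segmentations(s):
    """Yield every way to split s into pieces that are all in VOCAB."""
    if not s:
        yield []
        return
    for end in range(1, len(s) + 1):
        piece = s[:end]
        if piece in VOCAB:
            for rest in segmentations(s[end:]):
                yield [piece] + rest

def marginal_likelihood(s):
    """P(s) = sum over segmentations t of prod_i p(t_i)."""
    total = 0.0
    for seg in segmentations(s):
        p = 1.0
        for piece in seg:
            p *= VOCAB[piece]
        total += p
    return total

print(marginal_likelihood("unigram"))
# Sums over e.g. ['unigram'], ['un', 'i', 'gram'], ['un', 'ig', 'ram'], ...
```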
2. EM-Based Training and Vocabulary Pruning
UnigramLM training proceeds by initial "overgeneration" of the vocabulary through extracting frequent substrings (the seed vocabulary $V_{\mathrm{seed}}$), followed by iterative pruning while fitting token probabilities with the Expectation-Maximization (EM) algorithm. SentencePiece and subsequent references formalize the process as follows:
- Seed Vocabulary Construction: $|V_{\mathrm{seed}}| = \mathrm{seed\_ratio} \times |V_{\mathrm{final}}|$ (typical seed_ratio of 10), where $|V_{\mathrm{final}}|$ is the target final vocabulary size. Suffix array and longest common prefix (LCP) interval algorithms efficiently generate candidate substrings up to a maximal length (Land et al., 14 Dec 2025).
- EM Fitting (per pruning iteration):
- E-Step: For each sentence, the expected count $c_t$ of each token $t$ is computed by forward–backward dynamic programming over all segmentation paths.
- Early Pruning: Tokens with expected count below a threshold are removed (the default of 0.5 is effective and robust to changes).
- M-Step: Probabilities are updated as $p(t) \propto f(c_t)$, normalized over the vocabulary, where $f$ is often based on the digamma function but can default to the identity without material loss (0.01% effect on loss).
- Pruning: Tokens are ranked by their estimated loss increase if removed (computed by retokenizing the corpus without the candidate token). The top prune_ratio fraction (default 0.75) is retained, and the process repeats until the vocabulary size reaches pre_final $\times |V_{\mathrm{final}}|$ (default pre_final of 1.1). Final pruning down to $|V_{\mathrm{final}}|$ is by probability.
The method scales linearly with corpus size and achieves practical runtimes on large datasets (Land et al., 14 Dec 2025, Kudo et al., 2018).
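As an illustration of the E-step, the sketch below computes expected token counts for a single string under a toy vocabulary (hypothetical probabilities, not trained parameters) via forward and backward dynamic programs; real implementations additionally work in log space and aggregate over the whole corpus.

```python
# Sketch of the E-step expected-count computation for one string.
VOCAB = {"un": 0.1, "i": 0.05, "gram": 0.1, "unigram": 0.2,
         "u": 0.02, "n": 0.02, "ig": 0.03, "ram": 0.05}
MAX_LEN = max(len(t) for t in VOCAB)

def expected_counts(s):
    n = len(s)
    # alpha[i]: total probability of all segmentations of the prefix s[:i]
    alpha = [0.0] * (n + 1)
    alpha[0] = 1.0
    for i in range(1, n + 1):
        for j in range(max(0, i - MAX_LEN), i):
            piece = s[j:i]
            if piece in VOCAB:
                alpha[i] += alpha[j] * VOCAB[piece]
    # beta[i]: total probability of all segmentations of the suffix s[i:]
    beta = [0.0] * (n + 1)
    beta[n] = 1.0
    for i in range(n - 1, -1, -1):
        for j in range(i + 1, min(n, i + MAX_LEN) + 1):
            piece = s[i:j]
            if piece in VOCAB:
                beta[i] += VOCAB[piece] * beta[j]
    # Posterior probability that a piece spanning [j, i) is used:
    # alpha[j] * p(piece) * beta[i] / Z, summed into its expected count.
    Z = alpha[n]
    counts = {}
    for j in range(n):
        for i in range(j + 1, min(n, j + MAX_LEN) + 1):
            piece = s[j:i]
            if piece in VOCAB:
                c = alpha[j] * VOCAB[piece] * beta[i] / Z
                counts[piece] = counts.get(piece, 0.0) + c
    return counts

print(expected_counts("unigram"))
```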
3. Algorithm Variants: Final-Style Pruning
Land & Pinter (2025) introduced the "Final-Style Pruning" (FSP) heuristic variant, which replaces loss-based iterative pruning with repeated pruning by lowest token probability:
- Construct seed vocabulary and perform EM fitting as in standard UnigramLM.
- After each EM block, remove the lowest-probability tokens so that $|V|$ shrinks by the factor prune_ratio.
- Repeat until $|V| \le \mathrm{pre\_final} \times |V_{\mathrm{final}}|$, then perform one final prune by probability down to $|V_{\mathrm{final}}|$.
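The pruning schedule itself can be sketched as follows; the EM re-fits between prunes are elided, and the token probabilities are random stand-ins rather than fitted values.

```python
import random

def fsp_prune(probs, target_size, prune_ratio=0.75, pre_final=1.1):
    """Repeatedly drop the lowest-probability tokens until the vocabulary
    is within pre_final * target_size, then prune by probability to target_size."""
    vocab = dict(probs)
    while len(vocab) > pre_final * target_size:
        # In the full algorithm, an EM re-fit of the probabilities runs here
        # before each prune; it is elided in this sketch.
        keep = int(max(pre_final * target_size, len(vocab) * prune_ratio))
        vocab = dict(sorted(vocab.items(), key=lambda kv: kv[1],
                            reverse=True)[:keep])
    # Final prune by probability down to the exact target size.
    return dict(sorted(vocab.items(), key=lambda kv: kv[1],
                       reverse=True)[:target_size])

# Toy vocabulary with random stand-in probabilities.
toy = {f"tok{i}": random.random() for i in range(1000)}
print(len(fsp_prune(toy, target_size=100)))  # 100
```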
Empirical evaluation shows that FSP modestly increases the negative log-likelihood (0.6–1.2% relative) but gains 0.5–1.3% in compression (i.e., produces fewer tokens for the same data), with morphological alignment metrics falling between those of standard UnigramLM and BPE (Land et al., 14 Dec 2025).
4. Key Hyperparameters and Their Effects
Extensive evaluation has determined best practices for robust training, summarized as follows:
| Parameter | Default | Observed Effect |
|---|---|---|
| seed_ratio | 10 | Smaller values degrade both loss and compression; 10 or higher is safe |
| em_steps | 2 | Results invariant across 1–5 |
| early_thr | 0.5 | No consistent loss/compression effect across 0–10 |
| prune_ratio | 0.75 | Higher values slow pruning and lower loss but increase token count |
| pre_final | 1.1 | Lower values stop pruning earlier; minimal impact |
| digamma on/off | on | Negligible change (0.01% effect on loss) |
Recommended defaults are robust for most settings. Seed vocabularies must be at least roughly ten times the target vocabulary size for well-posed optimization; inadequate seeding materially harms final compression (Land et al., 14 Dec 2025).
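Several of these parameters appear to correspond to SentencePiece training flags (shrinking_factor ≈ prune_ratio, num_sub_iterations ≈ em_steps, seed_sentencepiece_size ≈ seed_ratio × vocab_size); this mapping is an assumption based on flag semantics, and the remaining parameters may not be exposed on the command line. A hedged Python training sketch:

```python
import sentencepiece as spm

# Sketch: train a unigram model with hyperparameters roughly matching the
# defaults above. The flag-to-parameter correspondence is an assumption.
spm.SentencePieceTrainer.Train(
    input="corpus.txt",
    model_prefix="spm_unigram",
    model_type="unigram",
    vocab_size=32000,
    character_coverage=1.0,
    seed_sentencepiece_size=320000,  # seed_ratio = 10 x vocab_size
    shrinking_factor=0.75,           # assumed analogue of prune_ratio
    num_sub_iterations=2,            # assumed analogue of em_steps
)
```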
5. Integration in SentencePiece and Practical Usage
SentencePiece provides reference implementations and command-line tools as well as C++/Python APIs. End-to-end pipelines—including Unicode NFKC normalization (optionally extensible via FSTs), id-mapping, encoding/decoding, and lossless detokenization—are supported for both BPE and UnigramLM paradigms (Kudo et al., 2018).
Typical Usage
- Training:
```bash
spm_train \
  --input=corpus.txt \
  --model_prefix=spm_unigram \
  --model_type=unigram \
  --vocab_size=32000 \
  --character_coverage=1.0 \
  --seed_sentencepiece_size=320000
```
- Encoding/Decoding (Python):
```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load('spm_unigram.model')

tokens = sp.EncodeAsPieces('Sample text.')
ids = sp.EncodeAsIds('Sample text.')
text = sp.DecodeIds(ids)
```
SentencePiece operates losslessly, preserving spacing via the Unicode marker U+2581 ("▁"). No pre-tokenization is required; whitespace is preserved exactly, and rare Unicode symbols are retained or replaced according to the character_coverage setting (Kudo et al., 2018, Land et al., 14 Dec 2025).
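A minimal round-trip sketch of this lossless behavior, assuming the model trained above; the exact piece split shown in the comment is hypothetical and model-dependent.

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load('spm_unigram.model')

text = "Hello world!"
pieces = sp.EncodeAsPieces(text)
# Pieces carry the U+2581 marker in place of the leading space,
# e.g. something like ['▁Hello', '▁world', '!'] depending on the model.
roundtrip = sp.DecodePieces(pieces)
assert roundtrip == text  # lossless detokenization
```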
6. Empirical Properties and Impact
Empirical evaluation on neural machine translation benchmarks (e.g., English ↔ Japanese) demonstrates that UnigramLM with SentencePiece delivers BLEU improvements of 1–1.5 points over word-level baselines at a fraction of vocabulary size. Training directly on raw sentences, without external pretokenization or language-specific rules, yields results as good as or superior to pipelines with hand-crafted segmentation (Kudo et al., 2018).
SentencePiece's raw-Japanese tokenization is approximately 380× faster than word-level pre-tokenization approaches, with throughput exceeding 20,000 sentences per second on modern CPUs, suitable for on-the-fly inference (Kudo et al., 2018).
7. Limitations and Implementational Considerations
The principal limitation is algorithmic complexity: standard UnigramLM requires EM inference and loss-based pruning, implemented efficiently only in select systems like SentencePiece. Corpus duplication alters early pruning, and rare Unicode drop-out can occur if character_coverage is set below 1.0 (Land et al., 14 Dec 2025). Final-Style Pruning is not yet exposed as a SentencePiece command-line option, requiring manual modification for experimental use (Land et al., 14 Dec 2025).
A plausible implication is that while UnigramLM achieves rigorous marginal likelihood objectives and adaptivity across languages, production systems may trade off exact objective minimization for engineering simplicity and greater compression using FSP or similar heuristics.
References:
- SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing (Kudo et al., 2018).
- Which Pieces Does Unigram Tokenization Really Need? (Land et al., 14 Dec 2025).