
Subword Units in Language Processing

Updated 31 December 2025
  • Subword units are linguistic segments shorter than words but longer than characters, designed to reduce sparsity and manage out-of-vocabulary issues.
  • They are constructed using algorithms like Byte-Pair Encoding, Unigram Language Models, and morphological segmenters to balance context capture and vocabulary efficiency.
  • Subword units improve performance in neural tasks such as machine translation, speech recognition, and word representation learning by enabling robust open-vocabulary modeling.

A subword unit is a linguistic or orthographic sequence—shorter than a word but usually longer than a character—used as an atomic symbol in various statistical and neural language technologies. Subword units enable open-vocabulary modeling, efficient parameter sharing across related forms, and mitigation of sparsity and out-of-vocabulary (OOV) problems in speech and language processing. Subword vocabularies are typically constructed via unsupervised, linguistically motivated, or data-compression–oriented algorithms, such as Byte-Pair Encoding (BPE), unigram language models, or model-based morphological segmenters, and are now ubiquitous in state-of-the-art machine translation, speech recognition, and linguistic representation systems.

1. Theoretical Rationale and Types of Subword Units

Subword units address the tradeoff between fully word-based and fully character-based modeling. Word units capture longer context but are limited by fixed vocabularies and heavy sparsity, while character units offer open-vocabulary coverage but greatly increase sequence length and impair the modeling of morphological/semantic information. Subword units—typically character n-grams, phone n-grams, frequent morphemes, or statistically-derived segments—strike an efficient compromise (Sennrich et al., 2015, Vania et al., 2017, Zhu et al., 2019).

Three main types of subword units are prevalent:

  • Character n-grams: All overlapping length-n substrings of a word (e.g., the trigrams of “cats”: “cat”, “ats”), including the full word; a minimal extraction sketch follows this list.
  • Morph-based units: Segments that correspond to actual morphemes or morphological boundaries, usually induced by models like Morfessor (Grönroos et al., 2020) or supervised with gold morphological annotations (Vania et al., 2017, Zhu et al., 2019).
  • Data-driven units (e.g., BPE, Unigram LM): Variable-length segments optimized for frequency, compression, or segmentation likelihood. BPE merges the most frequent adjacent pairs; unigram models prune a large vocabulary via EM (Sennrich et al., 2015, Kudo, 2018, Grönroos et al., 2020). These units often do not align perfectly with true morphemes but yield superior performance in large-scale neural systems (Macháček et al., 2018, Xiao et al., 2018, Wu et al., 2018).
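
As referenced in the character n-gram bullet above, the following is a minimal sketch of overlapping n-gram extraction. The fastText-style boundary markers "<" and ">" and the function name extract_char_ngrams are illustrative assumptions, not prescribed by the cited papers.

```python
def extract_char_ngrams(word: str, n: int = 3, add_word: bool = True) -> list[str]:
    """Return all overlapping character n-grams of a word.

    Boundary markers '<' and '>' (fastText-style) distinguish prefixes and
    suffixes from word-internal n-grams.
    """
    marked = f"<{word}>"
    ngrams = [marked[i:i + n] for i in range(len(marked) - n + 1)]
    if add_word:
        ngrams.append(marked)  # include the full (marked) word as its own unit
    return ngrams

print(extract_char_ngrams("cats"))
# ['<ca', 'cat', 'ats', 'ts>', '<cats>']
```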

Subword vocabularies are typically constructed to ensure that every word can be fully segmented: all single characters are retained, guaranteeing a fallback for unseen words (Sennrich et al., 2015, Zenkel et al., 2017).
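
To illustrate the character-fallback guarantee, here is a minimal sketch of greedy longest-match segmentation (WordPiece-style inference) over a fixed subword vocabulary. The toy vocabulary and the function name greedy_segment are illustrative assumptions; production systems typically replay BPE merges or run a unigram Viterbi search instead.

```python
def greedy_segment(word: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match segmentation over a fixed subword vocabulary.

    If every single character is retained in the vocabulary, any word,
    including an unseen one, can always be fully segmented.
    """
    segments, i = [], 0
    while i < len(word):
        # Try the longest remaining prefix first, then back off to shorter ones.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                segments.append(word[i:j])
                i = j
                break
        else:
            # Reached only if a character is missing from the vocabulary;
            # emit it as a singleton so segmentation never fails.
            segments.append(word[i])
            i += 1
    return segments

vocab = {"un", "believ", "able", "cat"} | set("unbelievablecatz")
print(greedy_segment("unbelievable", vocab))  # ['un', 'believ', 'able']
print(greedy_segment("catz", vocab))          # ['cat', 'z']
```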

2. Algorithms and Segmentation Principles

Two foundational algorithms for subword segmentation are BPE and the unigram language model (Unigram LM).

  • Byte-Pair Encoding (BPE): Each word is decomposed into its sequence of characters, and the most frequent adjacent symbol pairs are iteratively merged. The number of merges is a hyperparameter; typical vocabulary sizes are on the order of 1k–100k units. The merge procedure yields a flexible unit inventory ranging from single characters up to whole-word units (Sennrich et al., 2015, Zenkel et al., 2017, Macháček et al., 2018); a minimal merge-loop sketch follows this list. Extensions introduce context-sensitive or compression-driven merge criteria, such as Accessor Variety (AV) and Description Length Gain (DLG), providing finer control and improved performance in morphologically complex languages (Wu et al., 2018).
  • Unigram Language Model (Unigram LM): Begins with a large candidate subword vocabulary; the EM algorithm estimates a probability for each subword, and low-probability items are pruned until a target vocabulary size is reached. SentencePiece (see, e.g., Kishino et al., 2022) and Morfessor EM+Prune (Grönroos et al., 2020) are prominent implementations, the latter incorporating MDL or Bayesian priors (Kudo, 2018). The Unigram LM can probabilistically sample multiple valid segmentations per word, enabling subword regularization (Kudo, 2018, Lakomkin et al., 2020).
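
As referenced in the BPE bullet above, the following sketch mirrors the merge loop described there: count adjacent symbol pairs over a frequency-weighted word list, merge the most frequent pair, and repeat for a fixed number of merges. The toy corpus and helper names are illustrative; the structure follows the procedure in Sennrich et al. (2015) rather than any particular library's API.

```python
import re
from collections import Counter

def get_pair_counts(vocab: dict[str, int]) -> Counter:
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair: tuple[str, str], vocab: dict[str, int]) -> dict[str, int]:
    """Replace every occurrence of the pair with its concatenation."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Words are stored as space-separated character sequences with an end-of-word marker.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2, "n e w e s t </w>": 6, "w i d e s t </w>": 3}
num_merges = 10  # hyperparameter: number of merge operations
for _ in range(num_merges):
    pairs = get_pair_counts(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)
    vocab = merge_pair(best, vocab)
    print(best)  # the learned merge operations, in order
```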

Linguistically motivated algorithms, such as Morfessor and semi-CRF segmenters, induce morph boundaries based on likelihood or gold labels but are frequently outperformed by data-driven methods in neural systems (Macháček et al., 2018, Vania et al., 2017, Zhu et al., 2019).

3. Subword Units in Neural Architectures and Representation Learning

In neural architectures, subword units are employed as the tokens over which all modeling—embedding, prediction, output—operates.

  • Machine Translation (NMT): Token sequences from BPE or Unigram segmentation replace word tokens as the input and output of neural encoder/decoder models, enabling open vocabulary translation via composition and facilitating parameter sharing for rare word forms (Sennrich et al., 2015, Kudo, 2018, Wu et al., 2018, He et al., 2020). Subword units have been shown to yield BLEU improvements of 0.5–2 points over word-based or character-based models, especially on OOV and rare words (Sennrich et al., 2015, Kudo, 2018).
  • Speech Recognition (ASR): Subword units (characters, grapheme n-grams, or phone-based BPE units) serve as the output labels of CTC, RNN-T, or attention-based models. Appropriately chosen subword vocabularies enable both efficient modeling and high OOV word recall, especially in morphologically rich languages (Xiao et al., 2018, Wang et al., 2020, Singh et al., 2020, Zhou et al., 2021). Phone-based or acoustically matched subwords often outperform character-based subwords for ASR (Wang et al., 2020, Zhou et al., 2021).
  • Word Representation Learning: Word vectors are constructed as (typically normalized) sums of the embeddings of their composing subword units—character n-grams, BPE units, or morphemes (Zhu et al., 2019, Chaudhary et al., 2018); a compositional sketch follows this list. Positional embeddings and attention-based composition functions provide additional capacity for capturing sequential and morphological information (Zhu et al., 2019).
  • Named Entity Recognition and Classification: Neural NER models encode subword sequences (e.g., with character- or phone-level LSTMs) to capture morphological and phonological cues, improving tagging of OOV and low-frequency entities (Abujabal et al., 2018).
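
As referenced in the word representation bullet above, the following is a minimal sketch of composing a word vector as the normalized sum of its character n-gram embeddings. The toy embedding table, dimensionality, and function name are illustrative assumptions rather than any specific system's implementation; in practice the subword embeddings are learned jointly with the downstream task.

```python
import numpy as np

def compose_word_vector(word: str, ngram_embeddings: dict[str, np.ndarray],
                        n: int = 3, dim: int = 4) -> np.ndarray:
    """Compose a word vector as the normalized sum of its subword embeddings.

    Unknown n-grams are skipped, so even an unseen word receives a
    (possibly partial) representation instead of a single OOV vector.
    """
    marked = f"<{word}>"
    ngrams = [marked[i:i + n] for i in range(len(marked) - n + 1)] + [marked]
    vectors = [ngram_embeddings[g] for g in ngrams if g in ngram_embeddings]
    if not vectors:
        return np.zeros(dim)
    summed = np.sum(vectors, axis=0)
    return summed / (np.linalg.norm(summed) + 1e-8)  # length-normalize

# Toy embedding table (illustrative); keys are character trigrams and full words.
rng = np.random.default_rng(0)
table = {g: rng.normal(size=4)
         for g in ["<ca", "cat", "ats", "ts>", "<cats>", "at>", "<cat>"]}
print(compose_word_vector("cats", table))
print(compose_word_vector("cat", table))  # shares '<ca' and 'cat' with "cats"
```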

4. Subword Regularization and Ambiguity

Mapping text to subword units is inherently ambiguous, with many valid segmentations for a given surface form. Subword regularization is the policy of explicitly sampling and training on multiple segmented variants, leveraging this ambiguity to regularize the model and increase its robustness to OOVs and domain shifts (Kudo, 2018, Lakomkin et al., 2020).

  • Mathematics: For Unigram LM segmenters, the probability of a segmentation $y = (y_1, \dots, y_M)$ of a word $w$ is $p(y \mid w) \propto \prod_{i=1}^{M} p(y_i)$; the best segmentation is obtained via dynamic programming (Viterbi), while training can sample alternative segmentations from an N-best list using a temperature parameter α (Lakomkin et al., 2020).
  • Effects: Empirical studies report 2–8% relative WER reduction in ASR (Lakomkin et al., 2020), and 0.5–2.0 BLEU improvement in NMT, especially with low-resource or out-of-domain data (Kudo, 2018). Unseen-word recognition recall is significantly improved by regularization (Lakomkin et al., 2020).

Regularization is realized in practice by sampling a segmentation for each training example on-the-fly, using a smoothed selection from N-best segmentations, and training the model to minimize the expected loss over these segmentations (Kudo, 2018, Lakomkin et al., 2020).
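
Below is a minimal sketch of this on-the-fly sampling, assuming a toy unigram model and brute-force enumeration of segmentations; the vocabulary, probabilities, and function names are illustrative, and real systems (e.g., SentencePiece) sample over a lattice rather than enumerating candidates.

```python
import math
import random

# Toy unigram subword probabilities (illustrative; real models estimate these with EM).
unigram_p = {"un": 0.08, "like": 0.05, "li": 0.03, "ke": 0.03, "ly": 0.04,
             "unlike": 0.02, "u": 0.01, "n": 0.01, "l": 0.01, "i": 0.01,
             "k": 0.01, "e": 0.01, "y": 0.01}

def all_segmentations(word: str):
    """Enumerate every segmentation whose pieces are all in the vocabulary."""
    if not word:
        yield []
        return
    for j in range(1, len(word) + 1):
        piece = word[:j]
        if piece in unigram_p:
            for rest in all_segmentations(word[j:]):
                yield [piece] + rest

def sample_segmentation(word: str, alpha: float = 0.2, nbest: int = 8):
    """Sample one segmentation from the N-best list, smoothed by temperature alpha."""
    candidates = sorted(all_segmentations(word),
                        key=lambda seg: -sum(math.log(unigram_p[s]) for s in seg))[:nbest]
    scores = [sum(math.log(unigram_p[s]) for s in seg) for seg in candidates]
    # Weights are proportional to p(seg)^alpha: alpha -> 0 approaches uniform
    # sampling, large alpha approaches the Viterbi segmentation.
    weights = [math.exp(alpha * s) for s in scores]
    return random.choices(candidates, weights=weights, k=1)[0]

random.seed(0)
for _ in range(3):
    print(sample_segmentation("unlikely"))  # different training epochs may see different splits
```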

5. Empirical Evaluations, Task-Specific Recommendations, and Limitations

Empirical studies establish key best practices and identify important tradeoffs:

  • ASR (Keyword/OOV): Single-character subwords achieve highest OOV recall and maximum term-weighted value in OOV keyword search for Arabic and Finnish (Singh et al., 2020). Interpolating n-gram and RNNLM-derived subword LMs, with variable-order n-gram approximation, yields further improvements (Singh et al., 2020). Acoustically-driven subword discovery better aligns segmentation with phonemic transitions and yields lower WER than BPE in time-synchronous models (Zhou et al., 2021).
  • NMT: BPE with a "zero-suffix" marker on non-final tokens and shared source-target vocabulary recovers most of the BLEU degradation relative to more complex segmenters in morphologically rich pairs (Macháček et al., 2018). Subword regularization via Unigram LM consistently adds 0.5–1.5 BLEU, especially in low-resource and OOD conditions (Kudo, 2018).
  • Representation Learning and Transfer: Combining character n-grams, morphemes, and phone n-grams in compositional embeddings substantially improves NER F1 and rare-word transfer, notably in low-resource settings (Chaudhary et al., 2018, Zhu et al., 2019).
  • Morphology: Subword units based on data-compression–motivated principles (BPE/Unigram) only weakly approximate true morphological boundaries and rarely outperform oracle morphological analysis for language modeling (Vania et al., 2017, Zhu et al., 2019). However, linguistically motivated segmenters show advantages on rare-word and semantic-similarity tasks (Zhu et al., 2019).
  • Unit Vocabulary Size: The optimal vocabulary size balances sequence length against data sparsity; too large a vocabulary yields many singleton (rarely observed) units, while too small a vocabulary lengthens sequences and increases computation, degrading performance (Sennrich et al., 2015, Zenkel et al., 2017). Optimal values are task- and data-size–dependent, so tuning is always recommended (Wu et al., 2018, Macháček et al., 2018, Zenkel et al., 2017).

Limitations arise from over-segmentation (overly fine-grained, redundant units), language-specific orthographic or phonological alternations unrecoverable by frequency-driven merges, and duplication of hypotheses in beam search under probabilistic segmentation (Kudo, 2018, Lakomkin et al., 2020, Macháček et al., 2018).

6. Extensions and Emerging Paradigms

Recent advances include:

  • Dynamic Programming Encoding (DPE): Treats segmentation as a latent variable and marginalizes over all valid segmentations during training using a mixed character–subword transformer, yielding further BLEU gains on top of BPE and regularization (He et al., 2020). The underlying lattice recursion is sketched after this list.
  • Acoustic Data-Driven Subword Modeling (ADSM): Aligns subword discovery directly to acoustic boundaries by iterating supervised alignment, vocabulary refinement, and subword merging; achieves state-of-the-art WER and robust coverage in both time-synchronous and label-synchronous ASR (Zhou et al., 2021).
  • Visually Grounded Discovery: CNNs trained on speech–image paired data discover interpretable diphone events at intermediate layers, which tightly align with phone boundaries without any supervision, suggesting novel unsupervised pathways for subword discovery (Harwath et al., 2019).
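
As referenced in the DPE bullet above, the forward (lattice) recursion below is the standard way to marginalize over all segmentations of a string under a subword vocabulary; the notation is generic and illustrative, not taken verbatim from He et al. (2020), and the same recursion with a max in place of the sum gives the Viterbi segmentation used in Section 4.

```latex
% Forward recursion over all segmentations of y_{1..n} into subwords from
% vocabulary V; p(.) may be a unigram model or a conditional model as in
% DPE-style training (generic, illustrative notation).
\alpha_0 = 1, \qquad
\alpha_j = \sum_{\substack{0 \le i < j \\ y_{i+1..j} \in \mathcal{V}}} \alpha_i \, p\!\left(y_{i+1..j}\right), \quad j = 1, \dots, n,
\qquad
p(y) = \alpha_n = \sum_{s \in \mathcal{S}(y)} \prod_{u \in s} p(u),
```

where $\mathcal{S}(y)$ denotes the set of segmentations of $y$ whose pieces all belong to $\mathcal{V}$.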

Hybrid models (e.g., joint phone-BPE and char-BPE systems in ASR) and hierarchical composition for word representations offer further robustness and phonological or script generalization (Wang et al., 2020, Zhu et al., 2019).


References:

  • (Sennrich et al., 2015): Neural Machine Translation of Rare Words with Subword Units
  • (Vania et al., 2017): From Characters to Words to in Between: Do We Capture Morphology?
  • (Zenkel et al., 2017): Subword and Crossword Units for CTC Acoustic Models
  • (Kudo, 2018): Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates
  • (Macháček et al., 2018): Morphological and Language-Agnostic Word Segmentation for NMT
  • (Xiao et al., 2018): Hybrid CTC-Attention based End-to-End Speech Recognition using Subword Units
  • (Wu et al., 2018): Finding Better Subword Segmentation for Neural Machine Translation
  • (Abujabal et al., 2018): Neural Named Entity Recognition from Subword Units
  • (Chaudhary et al., 2018): Adapting Word Embeddings to New Languages with Morphological and Phonological Subword Representations
  • (Harwath et al., 2019): Towards Visually Grounded Sub-Word Speech Unit Discovery
  • (Zhu et al., 2019): A Systematic Study of Leveraging Subword Information for Learning Word Representations
  • (Grönroos et al., 2020): Morfessor EM+Prune: Improved Subword Segmentation with Expectation Maximization and Pruning
  • (Wang et al., 2020): An investigation of phone-based subword units for end-to-end speech recognition
  • (He et al., 2020): Dynamic Programming Encoding for Subword Segmentation in Neural Machine Translation
  • (Singh et al., 2020): Subword RNNLM Approximations for Out-Of-Vocabulary Keyword Search
  • (Tarján et al., 2020): Deep Transformer based Data Augmentation with Subword Units for Morphologically Rich Online ASR
  • (Lakomkin et al., 2020): Subword Regularization: An Analysis of Scalability and Generalization for End-to-End Automatic Speech Recognition
  • (Zhou et al., 2021): Acoustic Data-Driven Subword Modeling for End-to-End Speech Recognition
  • (Kishino et al., 2022): Extracting linguistic speech patterns of Japanese fictional characters using subword units