Rethinking Tokenization for Rich Morphology: The Dominance of Unigram over BPE and Morphological Alignment

Published 11 Aug 2025 in cs.CL and cs.AI | (2508.08424v3)

Abstract: The relationship between tokenizer algorithm (e.g., Byte-Pair Encoding (BPE), Unigram), morphological alignment, tokenization quality (e.g., compression efficiency), and downstream performance remains largely unclear, particularly for languages with complex morphology. In this paper, we conduct a comprehensive evaluation of tokenizers using small-sized BERT models -- from pre-training through fine-tuning -- for Telugu (agglutinative), along with preliminary evaluation in Hindi (primarily fusional with some agglutination) and English (fusional). To evaluate morphological alignment of tokenizers in Telugu, we create a dataset containing gold morpheme segmentations of 600 derivational and 7000 inflectional word forms. Our experiments reveal two key findings for Telugu. First, the choice of tokenizer algorithm is the most significant factor influencing performance, with Unigram-based tokenizers consistently outperforming BPE across most settings. Second, while better morphological alignment shows a moderate, positive correlation with performance on text classification and structure prediction tasks, its impact is secondary to the tokenizer algorithm. Notably, hybrid approaches that use morphological information for pre-segmentation significantly boost the performance of BPE, though not Unigram. Our results further showcase the need for comprehensive intrinsic evaluation metrics for tokenizers that could explain downstream performance trends consistently.

Summary

  • The paper finds that Unigram tokenizers consistently outperform BPE across most settings for Telugu, an agglutinative language with complex morphology.
  • The study introduces a multi-stage experimental framework using small BERT models to isolate tokenizer effects on text classification and structure prediction tasks.
  • The findings highlight the need for improved intrinsic metrics to better capture tokenization quality and morphological alignment in diverse NLP systems.

Introduction

The paper "Rethinking Tokenization for Rich Morphology: The Dominance of Unigram over BPE and Morphological Alignment" (2508.08424) addresses the relationship between various tokenization algorithms, morphological alignment, tokenization quality, and their impact on downstream performance, especially in languages with complex morphology. The study focuses on the evaluation of tokenizers using small-sized BERT models across Telugu, Hindi, and English, thereby analyzing tokenization strategies tailored for different morphological typologies.

Tokenization Algorithms and Morphological Alignment

The research reveals two main findings for Telugu, an agglutinative language. Firstly, the choice of the tokenizer algorithm significantly impacts performance, with Unigram-based tokenizers outperforming Byte-Pair Encoding (BPE) across most scenarios. Secondly, although there is a moderate positive correlation between morphological alignment and performance on text classification and structure prediction tasks, the influence of morphological alignment is secondary to the choice of tokenizer algorithm.

Notably, hybrid approaches that use morphological information for pre-segmentation significantly improve BPE performance, although similar improvements are not observed for Unigram tokenizers. More broadly, the results point to the need for comprehensive intrinsic evaluation metrics that consistently explain downstream performance trends.

Evaluation Framework

The paper uses a multi-stage experimental framework, from pre-training through fine-tuning, to compare tokenization approaches. Encoder-only BERT models are trained with different tokenizer variants and evaluated on a variety of downstream tasks, with hyperparameters held constant so that performance differences can be attributed to the tokenizer. A dataset of gold morpheme segmentations (600 derivational and 7000 inflectional word forms) was created for Telugu, allowing quantitative evaluation of morphological alignment.

Results and Implications

The detailed results show that naive Unigram tokenizers consistently deliver superior overall performance compared to BPE and other tokenization variants. This suggests that Unigram tokenizers are more adept at handling rich morphological structures and distribution of tokens in agglutinative languages such as Telugu.

While the findings show a positive correlation between morphological alignment scores and downstream performance, they also underscore the need for intrinsic metrics that capture both morphological alignment and tokenization quality and that explain downstream trends consistently.

Furthermore, the analysis has practical implications for NLP systems designed for morphologically rich and low-resource languages. These systems can benefit from adopting Unigram tokenizers, or hybrid approaches that add morphological pre-segmentation to BPE, especially for structure prediction tasks such as part-of-speech tagging and dependency parsing.

Tokenization Quality

In addition to morphological alignment, the paper investigates tokenization quality through two intrinsic metrics, Corpus Token Count (CTC) and Rényi entropy. Contrary to expectations, neither showed a significant correlation with downstream performance for small-sized models. The tokenizer type notably affects performance, yet Rényi entropy fails to fully explain the observed trends, which points to the limitations of current intrinsic evaluation metrics. More comprehensive metrics that consistently account for both efficiency and alignment are therefore needed.
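To make "correlation with downstream performance" concrete, here is a minimal sketch of a Spearman rank correlation between an intrinsic metric and a task score. The numbers are hypothetical and chosen only for illustration; they are not results from the paper, and the paper's own statistical procedure may differ.

```python
def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical values for five imaginary tokenizers (NOT from the paper).
metric_values = [1200, 1100, 1500, 1300, 1250]
task_scores = [0.71, 0.69, 0.70, 0.74, 0.68]
rho = spearman(metric_values, task_scores)
```

A value of rho near zero, as in this toy case, is the kind of outcome that would mean the metric says little about which tokenizer performs better.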

Conclusion

The paper highlights the dominance of Unigram tokenizers over BPE in handling complex morphological structures in small language models, especially for Telugu. Although morphological alignment positively influences language understanding tasks, the choice of tokenizer algorithm remains the primary determinant of performance. Future work should aim to develop better intrinsic evaluation metrics that consider both linguistic alignment and tokenization quality.

The research underscores the complexity of tokenization in morphologically rich languages, paving the way for more effective language modeling strategies in NLP systems. The study's implications are particularly crucial for building equitable NLP frameworks in diverse linguistic settings.

Explain it Like I'm 14

What is this paper about?

This paper looks at a basic but very important step in building language AI: how we split text into pieces the computer can understand, a process called “tokenization.” The authors ask which tokenization method works best for languages with rich word structure (like Telugu and Hindi), and whether matching tokens to real word parts (morphemes) actually helps AI models perform better.

What questions did the researchers ask?

They focused on two simple questions:

  1. Does cutting words in ways that match real word parts (morphemes) make models better?
  2. Do “quality” measures of tokenization (like how compactly text is represented) explain why some models work better than others?

They also compared two popular tokenizers:

  • BPE (Byte-Pair Encoding)
  • Unigram (the Unigram Language Model)

How did they study it?

Think of tokenization like breaking words into puzzle pieces. Some languages build long words by stacking small meaning units (like Lego bricks). The team tested different ways to cut those words into pieces, then trained the same small BERT-style model with each tokenizer and compared results.

They did this for three languages:

  • Telugu (very morphologically rich; many word endings and long words)
  • Hindi (moderately rich)
  • English (less rich)

To make the comparison fair, they kept everything else the same (same training data, same model size, same training steps). Here’s what they tried:

  • Different granularities of pieces: character-level, word-level, and subword-level (BPE and Unigram).
  • “Hybrid” methods: first split words at likely morpheme boundaries (using tools called Morfessor or a morphological analyzer), then apply BPE or Unigram. This tries to align tokens with real word parts.
  • Different “vocabulary sizes”: how many different subword pieces the tokenizer is allowed to use.
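The "hybrid" idea in the list above can be sketched in a few lines: split a word at morpheme boundaries first, then run the subword tokenizer inside each piece so that no token crosses a boundary. Everything here is a stand-in. `toy_pre_segment` is a hypothetical lookup table in place of Morfessor or a morphological analyzer, and `greedy_subword` is a simple longest-match splitter, not BPE or Unigram.

```python
def greedy_subword(piece, vocab):
    """Longest-match-first split of `piece`; single characters are the fallback."""
    tokens, i = [], 0
    while i < len(piece):
        for j in range(len(piece), i, -1):
            # Accept the longest substring found in vocab, or one character.
            if piece[i:j] in vocab or j == i + 1:
                tokens.append(piece[i:j])
                i = j
                break
    return tokens


def hybrid_tokenize(word, pre_segment, vocab):
    """Pre-segment the word, then tokenize each segment independently."""
    tokens = []
    for morph in pre_segment(word):
        tokens.extend(greedy_subword(morph, vocab))
    return tokens


# Hypothetical pre-segmenter: a lookup table standing in for a real tool.
TOY_SPLITS = {"unhelpful": ["un", "help", "ful"]}


def toy_pre_segment(word):
    return TOY_SPLITS.get(word, [word])


vocab = {"unh", "elp", "un", "help", "ful"}
plain = greedy_subword("unhelpful", vocab)
hybrid = hybrid_tokenize("unhelpful", toy_pre_segment, vocab)
```

Without pre-segmentation the toy splitter cuts across the "un" + "help" boundary; with it, the tokens line up with the morphemes.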

They trained small BERT-like models (about 8.5 million parameters) from scratch for each tokenizer and tested them on many tasks, such as:

  • Text classification (like sentiment or intent)
  • Structure prediction (finding parts of speech, names, and sentence grammar)
  • Similarity and inference (are two sentences similar? does one sentence imply another?)

To check how well tokenizers matched real morphemes, they created a new “gold” dataset for Telugu with correct morpheme splits and used a metric called MorphScore (it compares where the tokenizer cuts words to where real morpheme boundaries are).
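As a rough illustration of a boundary-based comparison, the sketch below scores one word by treating each internal cut as a character offset and computing precision, recall, and F1 against the gold cuts. This is an assumption about the general shape of such a metric; the paper's exact MorphScore protocol (for example, which words are excluded) may differ. Offsets here are Python code points, which may not match grapheme clusters in scripts like Telugu.

```python
def boundaries(segments):
    """Character offsets of the internal cuts in a segmented word."""
    cuts, offset = set(), 0
    for segment in segments[:-1]:
        offset += len(segment)
        cuts.add(offset)
    return cuts


def boundary_prf(predicted, gold):
    """Precision, recall and F1 of predicted cuts against gold cuts."""
    pred_cuts, gold_cuts = boundaries(predicted), boundaries(gold)
    hits = len(pred_cuts & gold_cuts)
    precision = hits / len(pred_cuts) if pred_cuts else 0.0
    recall = hits / len(gold_cuts) if gold_cuts else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


gold = ["play", "ing"]
precision, recall, f1 = boundary_prf(["pl", "ay", "ing"], gold)
```

In this example the tokenizer finds the real boundary (recall 1.0) but also makes an extra cut inside "play", so precision drops to 0.5.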

They also tested two “tokenization quality” ideas:

  • Corpus Token Count (CTC): how many tokens it takes to encode the text (fewer is like packing your suitcase more tightly).
  • Rényi entropy: how evenly token pieces are used (like whether some pieces are used all the time and others almost never).
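Both ideas are easy to compute once a corpus is tokenized. The sketch below uses the standard Rényi entropy formula, H_α = log(Σ p_i^α) / (1 − α), over token frequencies. The default `alpha=2.0` is an arbitrary choice for illustration and is not claimed to be the value used in the paper.

```python
import math
from collections import Counter


def corpus_token_count(tokenized_corpus):
    """Total number of tokens across all sentences."""
    return sum(len(sentence) for sentence in tokenized_corpus)


def renyi_entropy(tokenized_corpus, alpha=2.0):
    """Renyi entropy (natural log) of the token frequency distribution."""
    counts = Counter(tok for sentence in tokenized_corpus for tok in sentence)
    total = sum(counts.values())
    probs = [count / total for count in counts.values()]
    if alpha == 1.0:
        # The alpha -> 1 limit is Shannon entropy.
        return -sum(p * math.log(p) for p in probs)
    return math.log(sum(p ** alpha for p in probs)) / (1.0 - alpha)


even = [["a", "b"], ["a", "b"]]   # both tokens used equally often
skewed = [["a", "a", "a", "b"]]   # one token dominates
ctc = corpus_token_count(even)
h_even = renyi_entropy(even)
h_skewed = renyi_entropy(skewed)
```

An evenly used vocabulary gives the highest entropy (log of the vocabulary size), and the skewed corpus scores lower even though both corpora contain four tokens.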

What did they find?

Here are the main takeaways:

  • Unigram beat BPE in most cases
    • Across many settings in Telugu (and similar trends in Hindi/English), Unigram tokenizers consistently led to better model performance than BPE.
  • Matching morphemes helps, but less than the tokenizer choice
    • Tokenizers that aligned better with real morpheme boundaries showed a moderate improvement, especially for structure-related tasks (like parts-of-speech and parsing).
    • But the biggest performance difference came from the algorithm itself: Unigram vs. BPE. Unigram mattered more than perfect morpheme matching.
  • Hybrid “morphology-aware” pre-splitting helps BPE a lot, not Unigram
    • When they pre-split words using morphological tools and then applied BPE, performance improved noticeably.
    • Doing the same before Unigram didn’t help much and sometimes didn’t help at all.
  • Vocabulary size matters differently for each tokenizer
    • BPE worked best with smaller vocabularies.
    • Unigram did best with larger vocabularies.
  • Popular “quality” scores didn’t explain why Unigram wins
    • Compression (CTC) and token usage distribution (Rényi entropy) did not reliably predict which tokenizer would perform better in these small BERT models.
    • In other words, packing the text into fewer tokens or having a “nice” distribution of pieces didn’t necessarily mean better performance.

Why this is important:

  • For languages with complex word building, the right tokenizer can make a big difference.
  • If you must use BPE, adding morphological pre-splitting can rescue a lot of performance.
  • If you can choose, Unigram is often a safer bet for small encoder models.

Why does this matter?

Many world languages have rich morphology (they build long words by adding many endings). These languages are often underrepresented in AI. Picking the right tokenizer is a low-cost, high-impact decision when building smaller, efficient models for such languages. This research suggests:

  • Choose Unigram for small BERT-like models to get better results out of the box.
  • If using BPE, add morphological pre-splitting to narrow the gap.
  • Don’t rely only on compression or token-distribution scores to judge a tokenizer—those didn’t predict performance well here.

What could this lead to next?

  • Better tools for low-resource, morphologically rich languages by simply swapping tokenizers.
  • New ways to judge tokenizers that actually match downstream performance (since current “intrinsic” scores didn’t fully explain results).
  • Follow-up studies on larger models and on generative tasks (like translation or summarization), since this paper focused on small encoder models and understanding tasks.

A few simple definitions

  • Morphology/morphemes: The “building blocks” of words (like play + ing in “playing”). Languages like Telugu often stack many such blocks into one word.
  • Tokenizer: A method to break text into pieces (tokens) a model can handle. Subword tokenizers break words into useful chunks.
  • BPE vs. Unigram:
    • BPE: Merges frequent pairs of characters/chunks step by step—like repeatedly gluing common letter pairs together.
    • Unigram: Starts with a big list of possible chunks and picks a set that works best overall—like choosing the most useful “puzzle pieces” to cover words efficiently.
  • MorphScore: A way to check how well a tokenizer’s cuts match real word-part boundaries.
  • Corpus Token Count (CTC): How many tokens are needed to encode the text—fewer means more “compressed.”
  • Rényi entropy: A measure of how evenly different tokens are used across the text.
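The "gluing common pairs together" description of BPE can be sketched as a short training loop. This is a simplified teaching version: production implementations differ in tie-breaking, pre-tokenization, and byte-level handling, and the toy word counts are made up.

```python
from collections import Counter


def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merge rules from a {word: count} dictionary."""
    # Each word starts as a tuple of single characters.
    corpus = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word, replacing each occurrence of the best pair.
        merged = {}
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            key = tuple(out)
            merged[key] = merged.get(key, 0) + freq
        corpus = merged
    return merges


# Made-up word counts for illustration.
merges = learn_bpe({"aab": 3, "aac": 2}, num_merges=2)
```

Here the pair ("a", "a") is the most frequent, so it is merged first; the next most frequent pair is then ("aa", "b"). Unigram works in the opposite direction, starting from many candidate pieces and keeping the most useful ones.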

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a concise, actionable list of what remains missing, uncertain, or unexplored in the paper, organized to guide future research.

  • Generalizability across model families and scales is untested:
    • Do the findings (Unigram > BPE; modest role of morphological alignment) hold for decoder-only and encoder–decoder models, and at larger scales (e.g., 100M–7B+)?
  • Training budget confounds are unresolved:
    • Were models trained on equal numbers of tokens/FLOPs across tokenizers? If not, does controlling for “tokens seen” (rather than steps) change conclusions, given different CTCs?
  • Sequence length and batching details are unspecified:
    • How do max sequence length and token-level batching interact with tokenizer-induced length differences, and do they bias model updates or truncation effects?
  • Mechanism behind Unigram’s dominance is unknown:
    • What properties of Unigram (e.g., probabilistic segmentation, token stability, subword fertility, token consistency across contexts) causally drive its gains over BPE?
  • Hybrid pre-segmentation helps BPE but not Unigram—why?
    • Is the lack of benefit to Unigram due to its training objective subsuming morphological cues, suboptimal pre-segmentation quality, or training hyperparameters? Ablate segmentation quality, granularity, and noise.
  • Intrinsic metrics inadequacy is not resolved:
    • Beyond CTC and Rényi entropy, which intrinsic measures (e.g., morpheme-boundary mutual information, subword fertility, token consistency, masked-LM per-token loss, alignment to aksharas/graphemes) best predict downstream performance?
  • Tokenizer variants not covered:
    • How do WordPiece, byte-level BPE, BPE-dropout, Unigram with subword regularization (sampling), or vocabulary refinement methods compare under the same setup?
  • OOV handling biases comparisons:
    • Word- and morphemic-level tokenizers used [UNK]; would open-vocabulary schemes (byte fallback, mixed char–subword backoff) alter results and fairness across tokenizers?
  • Language coverage is limited:
    • Do results extend to typologically different morphologies (e.g., Semitic root-and-pattern, polysynthetic languages, highly compounding languages like Finnish/Turkish, isolating languages)?
  • Hindi and English analyses lack statistical power:
    • More tokenizer–pretokenizer variants and runs are needed to perform the same ANCOVA/fixed-effects tests and validate whether Telugu trends generalize.
  • Scope limited to NLU tasks:
    • How do tokenizers interact with generative tasks (MT, summarization), morphology-heavy generation (inflection/lemmatization), and ASR text normalization?
  • MorphScore dataset limitations (Telugu) may bias alignment estimates:
    • Gold segmentations are analyzer-derived, filtered paradigm forms, and exclude single-morpheme words; coverage of compounding, sandhi, clitics, prefixes, and free-text variation remains unclear.
  • Script-sensitive boundary definition is unaddressed:
    • For abugida scripts, character-level boundaries may misalign with aksharas/grapheme clusters; do alignment scores change with more appropriate orthographic units?
  • Domain effects are untested:
    • Does tokenizer ranking change with different pretraining domains (news vs conversational vs web) or with domain-adaptive pretraining?
  • Vocabulary size–capacity trade-offs are not fully controlled:
    • Varying vocabulary size changes embedding parameters and token lengths; do findings hold when controlling for effective capacity (e.g., vocab-adaptive embeddings) or sweeping more sizes?
  • Positional encoding interactions are unexplored:
    • Do tokenizers that produce longer sequences differentially interact with absolute vs relative positional encodings or with longer max lengths?
  • Robustness to noise and variation is untested:
    • How resilient are tokenizers to spelling variation, code-mixing, transliteration, and informal orthography common in Indic languages?
  • Downstream error attribution is missing:
    • Which specific tokenization errors (e.g., morpheme boundary violations, over-/under-segmentation) propagate to task errors in POS, NER, and parsing?
  • Preprocessing/normalization details may affect outcomes:
    • Are Unicode normalization, diacritics handling, punctuation/token splitting, and numeral tokenization standardized across tokenizers and languages?
  • Interaction with MLM objective is not probed:
    • Would alternative objectives (span masking, whole-word/morpheme masking, RTD, ELECTRA-style pretraining) change tokenizer rankings?
  • Multilingual/shared vocabulary settings are untested:
    • In multilingual models with shared vocabularies, does Unigram remain superior, and how does subword sharing across languages affect morphology alignment and performance?
  • Evaluation of morphological tasks is limited:
    • Adding explicit morphological tagging, segmentation, and lemmatization tasks could more sensitively detect benefits of morphological alignment.
  • Quality and variability of pre-segmentation are under-explored:
    • How do different morphological analyzers, supervised segmenters, or semi-supervised methods (with error rates quantified) affect hybrid tokenizers?
  • Rényi entropy parameterization and alternatives are narrow:
    • Do conclusions hold across other α values, corpus slices, or when using conditional variants (e.g., entropy conditioned on context length) more aligned with MLM training?
  • Reproducibility of tokenizer training choices is under-specified:
    • Settings like character coverage, normalizers, training heuristics, and split rules for SentencePiece/BPE can sway outcomes; more detailed reporting and ablations are needed.
  • Compute-efficient evaluation of tokenizers is missing:
    • Given the high cost (72 models), can reliable proxy evaluations be devised to screen tokenizers before full pretraining (e.g., few-step MLM loss curves, small-probe tasks)?
  • Open question on aligning intrinsic and extrinsic metrics:
    • What composite intrinsic metric (potentially multi-objective over compression, morpheme alignment, subword fertility, and token stability) best predicts task performance across languages and tasks?
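Several items above mention subword fertility without defining it. It is commonly defined as the average number of tokens a tokenizer produces per word, and a minimal sketch under that definition follows. The two tokenizers are hypothetical stand-ins marking the extremes, not tokenizers from the paper.

```python
def fertility(words, tokenize):
    """Average number of tokens produced per word."""
    total_tokens = sum(len(tokenize(word)) for word in words)
    return total_tokens / len(words)


# Hypothetical stand-in tokenizers marking the two extremes.
def char_tokenize(word):
    return list(word)


def whole_word_tokenize(word):
    return [word]


words = ["ab", "abcd"]
char_fertility = fertility(words, char_tokenize)
word_fertility = fertility(words, whole_word_tokenize)
```

A fertility of 1.0 means every word stays whole; higher values mean words are split into more pieces, which also lengthens the sequences a model has to process.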

Practical Applications

Immediate Applications

The following applications can be deployed now using the paper’s findings, methods, and released resources. Each item includes target sectors, suggested tools/workflows, and feasibility notes.

  • Deploy Unigram tokenizers as the default for small NLU models in morphologically rich languages
    • Sectors: Software, Telecom, E-commerce, Public-sector IT
    • Tools/workflows: Switch to SentencePiece Unigram in model training recipes for Telugu (and likely Hindi); retrain NLU models (intent classification, NER, POS) with the same hyperparameters
    • Assumptions/dependencies: Evidence strongest for small encoder-only BERT (~8.5M params) and NLU tasks; performance gains may vary by domain and language
  • Retrofit BPE-based production pipelines with morphological pre-segmentation
    • Sectors: Enterprise NLP, Government e-services, Content platforms
    • Tools/workflows: Add Morfessor pre-segmentation before BPE (“Morph-aware BPE Booster”); retrain models to recover/improve performance without changing architecture
    • Assumptions/dependencies: Benefits shown for BPE; minimal gains for Unigram; need language-appropriate Morfessor models and QA
  • Tokenizer audit and selection for Indic-language deployments
    • Sectors: MLOps, Consulting, Platform engineering
    • Tools/workflows: “Tokenizer Decision Assistant” that benchmarks Unigram vs BPE (with/without pre-segmentation) on your corpus; reports MorphScore and downstream task deltas
    • Assumptions/dependencies: Requires representative in-domain evaluation sets; results best-aligned to NLU tasks
  • Improve on-device/edge NLU for Telugu/Hindi with compact Unigram-based encoders
    • Sectors: Mobile OEMs, Keyboard apps, Call centers (IVR), Smart devices
    • Tools/workflows: Package small Unigram-BERTs for offline inference: intent detection, slot filling, contact/entity extraction on-device
    • Assumptions/dependencies: Model compression/quantization pipelines; domain adaptation datasets
  • Boost accuracy in document understanding and information extraction in local languages
    • Sectors: Finance (KYC/AML), Healthcare (forms triage), E-governance (applications)
    • Tools/workflows: Replace word/BPE tokenizers with Unigram in entity extraction and POS/parsing models; expect better structure prediction (POS/NER/LAS) per paper’s findings
    • Assumptions/dependencies: Requires retraining and validation on confidential corpora; regulatory compliance for data access
  • Strengthen sentiment analysis and content moderation in Telugu/Hindi
    • Sectors: Media/Social platforms, Brand monitoring
    • Tools/workflows: Retrain sentiment and toxicity classifiers with Unigram; expect gains especially in text classification tasks
    • Assumptions/dependencies: Domain shift may require additional fine-tuning; performance improvements are moderate but consistent
  • Search and query understanding for morphologically rich languages
    • Sectors: E-commerce, Government portals, Support search
    • Tools/workflows: Use Unigram in query parsing and intent classification pipelines; improved handling of OOV and agglutinated forms
    • Assumptions/dependencies: IR ranking may still depend on other features; integration must preserve latency budgets
  • Update evaluation practices to include MorphScore and tokenizer disclosure
    • Sectors: Academia, Open-source model hubs, Corporate AI governance
    • Tools/workflows: Report MorphScore (precision/recall/F1) alongside downstream metrics in model cards; add tokenizer type and vocabulary size as audit fields
    • Assumptions/dependencies: Gold or proxy morphological segmentations are needed for MorphScore; apply exclusions per paper’s protocol
  • Create and extend morphological segmentation datasets using the Telugu benchmark
    • Sectors: Data providers, Academic labs, Community initiatives
    • Tools/workflows: Use the released Telugu gold-standard morpheme segmentations as a pattern for building Hindi/other language datasets; adopt consistent boundary annotation conventions
    • Assumptions/dependencies: Requires linguist involvement; availability/quality of morphological analyzers varies by language
  • Provide ready-to-use Unigram vocabularies and training recipes for Indic languages
    • Sectors: Developer tools, Open-source maintainers
    • Tools/workflows: Publish SentencePiece Unigram models with recommended vocabulary sizes (larger for Unigram, smaller for BPE if used), plus training configs and CI examples
    • Assumptions/dependencies: Corpus licensing; ongoing maintenance and community validation

Long-Term Applications

These opportunities require further research, scaling, or ecosystem development before widespread deployment.

  • Extend findings to larger models and generative tasks (MT, summarization, dialogue)
    • Sectors: Foundation model labs, MT providers, Conversational AI
    • Tools/products: Comparative studies for encoder-decoder and decoder-only LLMs; tokenizer tuning for generation fidelity and hallucination behavior
    • Assumptions/dependencies: Paper’s results are for small encoder-only BERTs and NLU tasks; generalization is unknown
  • Design next-generation tokenizers that reconcile statistical efficiency and morphological alignment
    • Sectors: Core NLP R&D, Standards
    • Tools/products: New tokenization algorithms combining Unigram’s strengths with morphology-aware priors; learnable segmentation with task-aware objectives
    • Assumptions/dependencies: Need intrinsic metrics beyond CTC/Rényi entropy; multi-language validation
  • AutoML for tokenization: data-driven tokenizer selector and vocabulary size tuner
    • Sectors: MLOps platforms, AutoNLP
    • Tools/products: Pipeline that samples tokenizers (Unigram/BPE + pre-segmentation), estimates MorphScore, predicts downstream performance and latency, and auto-selects a configuration
    • Assumptions/dependencies: Requires meta-learning across many languages/tasks; compute for exploratory trials
  • National/organizational policy for inclusive NLP procurement and benchmarks
    • Sectors: Government, Regulators, Standards bodies
    • Tools/products: Procurement guidelines requiring tokenizer evaluation (MorphScore + task metrics) for public-facing NLP systems; funding for morphological datasets/analyzers in under-resourced languages
    • Assumptions/dependencies: Policy adoption cycles; coordination with research/community stakeholders
  • Comprehensive tokenizer evaluation suites and benchmarks
    • Sectors: Academia, Open-source consortia
    • Tools/products: A public “IndicMorphBench” with morphologically diverse languages, gold segmentations, and standardized NLU tasks; leaderboards reporting both intrinsic and extrinsic metrics
    • Assumptions/dependencies: Data licensing, community annotation, agreement on protocols
  • Cross-family expansion to typologically diverse languages (e.g., Turkic, Uralic, Bantu)
    • Sectors: Global NLP providers, Localization
    • Tools/products: Morphological segmentation datasets and Unigram-optimized NLU models for new language families; case studies by typology
    • Assumptions/dependencies: Availability of corpora and morphological tools; local expertise and annotation capacity
  • Integrated ASR+NLU stacks optimized for morphological complexity
    • Sectors: Voice assistants, IVR, Automotive
    • Tools/products: Use Unigram-tokenized NLU for post-ASR understanding (intent/slots/NER) in highly inflected languages; explore subword consistency between ASR lexicons and NLU tokenization
    • Assumptions/dependencies: High-quality ASR models and lexicons; alignment between acoustic units and tokenization strategy
  • Domain-specific morph-aware IE in healthcare and legal
    • Sectors: Healthcare IT, Legal tech
    • Tools/products: Advanced NER and relation extraction leveraging Unigram and morphology-aware features to capture inflected entities and case markers
    • Assumptions/dependencies: Access to labeled domain corpora; compliance and privacy constraints
  • Educational tools leveraging morphological segmentation for literacy and language learning
    • Sectors: EdTech, Public education
    • Tools/products: Apps that visualize word formation (stems/affixes), improve spelling/grammar checking, and scaffold reading comprehension in agglutinative languages
    • Assumptions/dependencies: UX research with learners/teachers; validated pedagogy; content localization
  • Dynamic or task-adaptive tokenization at inference/training time
    • Sectors: Platform AI, Research
    • Tools/products: Systems that adapt tokenization granularity by task (e.g., finer for parsing, coarser for sentiment) while maintaining shared embeddings
    • Assumptions/dependencies: New training schemes, caching, and serving infrastructure; stability and reproducibility guarantees

Notes on Assumptions and Dependencies (Global)

  • Scope of evidence: Results are demonstrated on Telugu with preliminary Hindi/English evidence, using small (~8.5M) encoder-only BERT models and NLU tasks; generalization to larger models, other architectures, and generative tasks remains to be proven.
  • Tokenizer effects: Unigram consistently outperforms BPE; morphology-aware pre-segmentation substantially helps BPE but not Unigram; morphological alignment moderately correlates with performance (especially structure prediction), but is secondary to tokenizer choice.
  • Metrics: Corpus Token Count and Rényi entropy did not explain performance differences in this setup; relying solely on these metrics to choose tokenizers is risky.
  • Resource needs: High-quality morphological analyzers/segmenters, gold segmentation datasets, in-domain evaluation data, and compute for retraining are key dependencies.

Glossary

  • Agglutinative: A morphological type where words are formed by concatenating many morphemes, each carrying a single grammatical function. "Telugu (agglutinative)"
  • ANCOVA (Analysis of Covariance): A statistical test that compares group means while controlling for covariates. "When ANCOVA test introduced the F1-score (from MorphScores) as a covariate,"
  • ANOVA (Analysis of Variance): A statistical test for detecting differences among group means. "We performed ANOVA (Analysis of Variance) and ANCOVA (Analysis of Covariance) tests"
  • BERT (Bidirectional Encoder Representations from Transformers): A transformer-based encoder-only language model architecture pre-trained with masked language modeling. "with standard BERT (Bidirectional Encoder Representations from Transformers) (Devlin et al., 2019) architecture"
  • Byte-Pair Encoding (BPE): A subword tokenization algorithm that iteratively merges frequent symbol pairs to form a vocabulary. "Byte-Pair Encoding (BPE)"
  • Corpus Token Count (CTC): An intrinsic metric measuring the total number of tokens needed to represent a corpus under a tokenizer. "Corpus Token Count (CTC) (Schmidt et al., 2024) is defined as the number of tokens required to encode a given text."
  • Derivational (morphology): Processes forming new words (often changing part of speech/meaning) by adding affixes. "600 derivational and 7000 inflectional word forms."
  • Encoder-only transformer: A transformer architecture that uses only the encoder stack (no decoder), typical of BERT-style models. "We choose encoder-only transformer (Vaswani et al., 2023) model"
  • Fixed effects model: A regression model controlling for unobserved heterogeneity by including group-specific constants. "We also tested the correlation using fixed effects model"
  • Fusional: A morphological type where single affixes encode multiple grammatical categories simultaneously. "English (fusional)"
  • Gold morpheme segmentations: Manually validated ground-truth splits of words into morphemes for evaluation. "gold morpheme segmentations"
  • Hybrid approaches (tokenization): Methods combining linguistic pre-segmentation (e.g., morphology) with subword tokenizers. "hybrid approaches that use morphological information for pre-segmentation"
  • Labeled Attachment Score (LAS): Dependency parsing metric measuring the percentage of tokens with both correct head and dependency label. "the labeled attachment score for dependency parsing"
  • Morfessor: An unsupervised algorithm/tool for morphological segmentation of words into morphemes. "Morfessor (Creutz and Lagus, 2007; Smit et al., 2014)"
  • Morphological alignment: The degree to which tokenizer boundaries correspond to true morpheme boundaries. "To evaluate morphological alignment of tokenizers in Telugu,"
  • Morphological analyzer: A tool that analyzes word forms into stems and affixes with morphological features. "morphological analyzer (Rao et al., 2011)"
  • MorphScore: A boundary-based metric assessing how well tokenization matches gold morpheme boundaries (precision/recall/F1). "MorphScore assesses how well segmentations from tokenizer correspond to ground-truth morphological boundaries."
  • Out-of-vocabulary (OOV): Tokens or words not present in a model’s vocabulary. "Out-of-vocabulary (OOV) words were handled"
  • Pre-segmentation: Segmenting text using linguistic resources (e.g., morphology) before training/applying a subword tokenizer. "pre-segmentation"
  • Pre-tokenizer: A preprocessing step that normalizes or segments input before the main tokenizer. "the pre-tokenizer showed no significant effect on performance."
  • Rényi entropy: An information-theoretic measure used here to characterize the distribution of token frequencies. "Rényi entropy"
  • Spearman correlation: A non-parametric measure of rank correlation capturing monotonic relationships. "Spearman correlation, on the other hand, which accounts for monotonic relationship, revealed a stronger and more significant relationship."
  • Structure prediction: Tasks that predict linguistic structure (e.g., POS, NER, dependencies) rather than simple labels. "Structure prediction includes the average F1-scores for part-of-speech tagging, named entity recognition, and the labeled attachment score for dependency parsing."
  • Tokenization Quality: A hypothesized property of tokenizers related to compression and token frequency distribution that may affect performance. "Tokenization Quality"
  • Type-to-token ratio: A lexical diversity measure comparing the number of unique word types to total tokens. "type-to-token ratio"
  • Unigram Language Model (ULM): A subword tokenization approach that selects a token vocabulary by maximizing a corpus likelihood under a unigram model. "the Unigram Language Model (ULM) (Kudo, 2018)"
  • Unknown token ([UNK]): A special placeholder token used when encountering unseen words or symbols. "unknown token ([UNK])"
  • WordPiece: A subword tokenization algorithm similar to BPE but using likelihood-based criteria for merges. "WordPiece (Schuster and Nakajima, 2012)"
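To illustrate the Unigram Language Model entry, the sketch below shows how a trained unigram model segments a word: it picks the split whose pieces have the highest total log-probability, found with Viterbi dynamic programming. It covers segmentation only, not the vocabulary training and pruning procedure, and the piece probabilities are hypothetical.

```python
import math


def viterbi_segment(word, log_probs):
    """Most probable segmentation of `word` under a unigram model over pieces."""
    n = len(word)
    best = [-math.inf] * (n + 1)  # best[i]: best score for word[:i]
    back = [0] * (n + 1)          # back[i]: start of the last piece in that split
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in log_probs and best[start] + log_probs[piece] > best[end]:
                best[end] = best[start] + log_probs[piece]
                back[end] = start
    if best[n] == -math.inf:
        return None  # the word cannot be built from known pieces
    pieces, end = [], n
    while end > 0:
        pieces.append(word[back[end]:end])
        end = back[end]
    return pieces[::-1]


# Hypothetical piece probabilities (they sum to 1.0), for illustration only.
probs = {"play": 0.2, "ing": 0.2, "pl": 0.05, "ay": 0.05, "p": 0.1, "l": 0.1,
         "a": 0.05, "y": 0.1, "i": 0.05, "n": 0.05, "g": 0.05}
log_probs = {piece: math.log(p) for piece, p in probs.items()}
segmentation = viterbi_segment("playing", log_probs)
```

Because the score is computed over the whole word rather than one merge at a time, the model prefers "play" + "ing" over splits built from smaller, individually less probable pieces.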
