How Tokenization Limits Phonological Knowledge Representation in Language Models and How to Improve Them

Published 18 Apr 2026 in cs.CL | (2604.17105v1)

Abstract: Tokenization is the first step in every language model (LM), yet it never takes the sounds of words into account. We investigate how tokenization influences text-only LMs' ability to represent phonological knowledge. Through a series of probing experiments, we show that subword-based tokenization systematically weakens the encoding of both local (e.g., rhyme) and global (e.g., syllabification) phonological features. To quantify this effect, we introduce the syllabification-tokenization alignment distance (STAD), a metric that measures the misalignment between a model's tokenization and the natural syllable boundaries of words, and find that higher misalignment correlates with poorer phonological representations, providing a simple diagnostic for phonology-aware tokenization. To address these limitations, we propose a lightweight IPA-based fine-tuning method that infuses phonological awareness into LMs, leading to consistent improvements across three phonology-related tasks while largely preserving math and general reasoning ability, with 1.1% and 0.9% drops on GSM8K and MMLU, respectively.


Authors (2)

Summary

The paper demonstrates that tokenization misalignment systematically degrades performance on phonological tasks such as rhyming awareness, grapheme-to-phoneme conversion (G2P), and syllable counting.
It employs linear probing across multiple LM architectures and introduces the STAD metric to quantify token-syllable alignment.
IPA-based fine-tuning yields consistent improvements in phonological competence, at a cost of 1.1% on GSM8K and 0.9% on MMLU.

Tokenization Constraints on Phonological Knowledge Representation in LLMs

Motivation and Research Questions

The paper investigates the impact of tokenization on the capacity of text-only LMs to encode and reason about phonological information—such as rhyming, syllabification, and pronunciation—without access to direct speech input (2604.17105). The central focus is on subword-based tokenizers (e.g., BPE, SentencePiece), which are ubiquitous in NLP but do not consider phonological structure during segmentation. The study formulates three core research questions:

To what extent do LMs encode latent phonological knowledge in their internal states?
How does tokenization influence both local and global phonological features?
Can post-hoc fine-tuning with phonological signals improve phonological competence without degrading general reasoning performance?

Methodological Overview

Phonological competence was evaluated via linear probing across three representative tasks:

Rhyming awareness: Classification of rhyming pairs irrespective of orthographic similarity.
Grapheme-to-phoneme (G2P): Regression to phonemic representations (ARPAbet).
Syllable counting: Regression to syllable counts.

Hidden states from various architectures (BERT, GPT-2, GPT-Neo, Llama 3 variants, Mistral) were probed layer by layer to establish how phonological information is distributed across model depth. The analysis included control probes trained on random embeddings to distinguish genuine representations from probe memorization.
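
The probe-versus-control comparison described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes NumPy, uses a least-squares linear classifier, and substitutes synthetic Gaussian vectors for real hidden states. The function names (`fit_linear_probe`, `probe_accuracy`, `run_probe_with_control`) and all sizes are hypothetical.

```python
import numpy as np


def fit_linear_probe(features, labels):
    """Fit a least-squares linear probe (one weight per dimension plus a bias)."""
    design = np.hstack([features, np.ones((len(features), 1))])
    weights, *_ = np.linalg.lstsq(design, labels, rcond=None)
    return weights


def probe_accuracy(weights, features, labels):
    """Classify by the sign of the linear output and compare with +/-1 labels."""
    design = np.hstack([features, np.ones((len(features), 1))])
    predictions = np.sign(design @ weights)
    return float(np.mean(predictions == labels))


def run_probe_with_control(hidden, labels, seed=0, train_fraction=0.5):
    """Train one probe on the given states and one on random embeddings."""
    rng = np.random.default_rng(seed)
    control = rng.standard_normal(hidden.shape)  # carries no label information
    split = int(len(hidden) * train_fraction)
    results = {}
    for name, feats in (("probe", hidden), ("control", control)):
        weights = fit_linear_probe(feats[:split], labels[:split])
        results[name] = probe_accuracy(weights, feats[split:], labels[split:])
    return results


# Synthetic stand-in for hidden states: the label is linearly encoded
# along one hypothetical direction of a 16-dimensional space.
rng = np.random.default_rng(1)
hidden = rng.standard_normal((1000, 16))
direction = rng.standard_normal(16)
labels = np.sign(hidden @ direction)

scores = run_probe_with_control(hidden, labels)
print(scores)
```

A probe that scores well above its random-embedding control on held-out items suggests the information sits in the representation rather than being memorized by the probe.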

A novel metric—syllabification-tokenization alignment distance (STAD)—was introduced to quantify the degree of mismatch between token boundaries and natural syllable boundaries, enabling a diagnostic evaluation of tokenization quality with respect to phonological structures.
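
The summary does not reproduce the formal definition of STAD, so the sketch below shows only one plausible way to score token-syllable misalignment. It is an assumption, not the paper's formula: it compares the sets of internal boundary offsets of the two segmentations and returns the fraction of boundaries they do not share. The names `boundaries` and `alignment_distance` are hypothetical, and the syllabifications are supplied by hand.

```python
def boundaries(pieces):
    """Return the internal character offsets at which a segmentation splits a word."""
    offsets, position = set(), 0
    for piece in pieces[:-1]:  # the end of the word is not an internal boundary
        position += len(piece)
        offsets.add(position)
    return offsets


def alignment_distance(syllables, tokens):
    """Hypothetical STAD-like score: 0.0 = identical boundaries, 1.0 = none shared."""
    if "".join(syllables) != "".join(tokens):
        raise ValueError("both segmentations must spell the same word")
    syllable_cuts = boundaries(syllables)
    token_cuts = boundaries(tokens)
    union = syllable_cuts | token_cuts
    if not union:  # one syllable and one token: trivially aligned
        return 0.0
    return len(syllable_cuts ^ token_cuts) / len(union)


print(alignment_distance(["to", "ken", "i", "za", "tion"], ["token", "ization"]))
print(alignment_distance(["pho", "nol", "o", "gy"], ["pho", "nol", "o", "gy"]))
```

Under this formulation, the split "token" + "ization" shares one of four boundaries with the syllabification to-ken-i-za-tion and scores 0.75, while a segmentation identical to the syllabification scores 0.0.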

Tokenization granularity was manipulated by inserting character-level delimiters, and performance was compared to byte-level tokenization models (ByT5). The relationship between misaligned tokenization and linguistic phenomena such as cognates and loanwords was empirically tested using the CogNet database.
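
The delimiter manipulation can be illustrated with a short preprocessing step. The delimiter character used in the paper is not stated in this summary, so the hyphen below is an assumption, as is the helper name `insert_char_delimiters`.

```python
def insert_char_delimiters(text, delimiter="-"):
    """Put a delimiter between the characters of each whitespace-separated word.

    The intent is to push a subword tokenizer toward character-level pieces;
    the default delimiter is a hypothetical choice.
    """
    return " ".join(delimiter.join(word) for word in text.split())


print(insert_char_delimiters("night knight"))
```

Word boundaries are kept as spaces, so only the within-word granularity changes.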

To test whether phonological competence can be improved after pretraining, models were fine-tuned on IPA-augmented QA data with LoRA adaptation, and performance on GSM8K and MMLU was monitored to detect catastrophic forgetting.
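
The format of the IPA-augmented QA data is not given in this summary. As a rough sketch under that caveat, the snippet below builds question-answer records from word/IPA pairs and shuffles them together with general instruction data. The record fields, prompt wording, helper names (`make_ipa_qa`, `mix_datasets`), and the general-purpose example are all hypothetical; the LoRA training step itself is omitted.

```python
import random


def make_ipa_qa(word, ipa):
    """Build one hypothetical instruction/response record for a word and its IPA."""
    return {
        "instruction": f"Give the IPA transcription of the word '{word}'.",
        "response": f"/{ipa}/",
    }


def mix_datasets(phonology, general, seed=0):
    """Shuffle phonological and general records into one training list."""
    mixed = list(phonology) + list(general)
    random.Random(seed).shuffle(mixed)
    return mixed


phonology_records = [
    make_ipa_qa("cat", "kæt"),
    make_ipa_qa("night", "naɪt"),
    make_ipa_qa("knight", "naɪt"),
]
general_records = [
    {"instruction": "What is 12 + 30?", "response": "42"},  # placeholder example
]

training_set = mix_datasets(phonology_records, general_records)
print(len(training_set))
```

Keeping general instruction data in the mix is one common way to limit forgetting during adaptation; the mixing ratio here is arbitrary.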

Empirical Findings

Phonological Encodings in LMs

All assessed LMs encode detectable phonological information in their hidden states beyond chance, particularly in middle layers (20–60% depth). Larger models often, but not universally, exhibit stronger latent phonological features. Linear probes consistently outperformed random controls.

Tokenization as a Limiting Factor

Subword-based tokenizers systematically impair the representation of both local (e.g., rhyme) and global (e.g., syllabification) phonological features. Coarse granularity prevents adequate encoding of character-level sound patterns. Inserting character-level delimiters, which forces a finer-grained segmentation, produces significant gains in probing accuracy on the rhyming task across architectures, supporting the claim that tokenization granularity directly influences phonological representation.

The STAD metric revealed that higher misalignment between token and syllable boundaries correlates with degraded performance on the G2P and syllable counting tasks. Models performed substantially better on words whose tokenization aligned with their syllabification, which supports STAD as a diagnostic tool. The effect is consistent across diverse LM architectures, with the exception of bidirectional attention models (e.g., BERT), which warrant further investigation.
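
A correlation of this kind can be checked with a plain Pearson coefficient between per-word misalignment scores and per-word probe correctness. The sketch below is illustrative only: the five input values are toy placeholders, not results from the paper, and the paper's own statistical test is not specified in this summary.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / math.sqrt(spread_x * spread_y)


# Toy placeholder values (NOT data from the paper): a misalignment score per
# word and whether a probe got that word right (1) or wrong (0).
misalignment = [0.0, 0.25, 0.5, 0.75, 1.0]
probe_correct = [1, 1, 1, 0, 0]

r = pearson(misalignment, probe_correct)
print(round(r, 3))
```

A negative coefficient means that words with more misaligned tokenization tend to be handled worse; on real data, a rank-based statistic may be more appropriate for binary outcomes.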

Words with high cross-linguistic relatedness, evidenced by numerous CogNet cognate/loanword entries, are more vulnerable to syllabification–tokenization misalignment, likely because their spellings vary more across training corpora and the corresponding $n$-grams therefore occur less frequently.

IPA-Aware Fine-Tuning

Lightweight IPA-based fine-tuning, interleaving phonological information with general instruction data, yielded consistent improvements across phonology tasks. The method achieved accuracy gains on rhyming, G2P, and syllable counting while inducing only marginal drops on GSM8K (1.1%) and MMLU (0.9%). Importantly, catastrophic forgetting was not observed, facilitating effective phonological adaptation with minimal trade-offs in general reasoning.

At inference time, however, inserting character-level delimiters into raw prompts did not reliably improve performance on rhyming awareness tasks, indicating that post-hoc tokenization modifications are insufficient without direct phonological constraint during training.

Implications and Future Directions

These results demonstrate that tokenization is a critical—and often overlooked—component in the emergence of phonological competence in LMs. Practical and theoretical implications include:

Tokenizer Design: STAD provides a framework for evaluating tokenizers by phonological alignment, incentivizing development of syllable-aware or phonology-informed tokenization algorithms, especially for speech-grounded or language-learning applications.
Fine-Tuning Strategies: IPA-augmented fine-tuning is an effective, minimally invasive retrofit for phonological adaptation, suggesting general utility for domain-specific reasoning enhancement in LMs.
Model Limitations: Phonological deficits in LMs are correlated with tokenization misalignment rather than absence of underlying knowledge, implying that architectural and training choices at the tokenization interface are pivotal.
Multilingual NLP: Cognate and loanword analysis highlights unique challenges in multilingual settings, especially in tokenization for languages with rich etymological histories or cross-linguistic overlap.
Speech-Text Integration: Findings motivate future work on joint speech-text models, emphasizing that tokenization strategies mediating the text-speech interface should respect phonological boundaries.

Further research is warranted to generalize findings beyond English, assess causal relationships between tokenizer misalignment and phonological failures, and deploy advanced continual learning protocols to mitigate trade-offs in domain adaptation.

Conclusion

Tokenization introduces systematic biases that constrain phonological knowledge representation in LMs. The paper provides quantitative evidence that token-syllable misalignment degrades both local and global phonological feature encoding, and that alignment metrics such as STAD are effective diagnostics. Lightweight IPA-based fine-tuning improves phonological task performance with only small drops on general reasoning benchmarks (1.1% on GSM8K, 0.9% on MMLU). The results advocate for phonology-aware tokenizer design and targeted post-training methodologies to enhance LMs' phonological reasoning without compromising broader capabilities.
