
Phoneme-Level Pre-Trained LMs

Updated 7 September 2025
  • Phoneme-level PLMs are models that use phonemic tokens (e.g., IPA) to capture pronunciation patterns and minimize orthography-driven artifacts.
  • They employ architectures like LSTMs and Transformers with specialized masking strategies to support multilingual, low-resource, and spoken language tasks.
  • These models facilitate efficient cross-lingual parameter sharing and drive improvements in ASR, SLU, and TTS through robust decoding and transfer learning techniques.

Phoneme-level pre-trained language models (PLMs) are neural models whose input and sometimes output units are phonemic—usually using the International Phonetic Alphabet (IPA) or derivatives thereof—rather than graphemes, subwords, or words. By focusing on the phonemic level, these models aim to capture cross-lingual pronunciation patterns, minimize orthography-driven artifacts, and facilitate efficient parameter sharing for multilingual, low-resource, and spoken language tasks. This approach has influenced automatic speech recognition (ASR), spoken language understanding (SLU), text-to-speech (TTS), and linguistic probing applications.

1. Model Architectures and Training Regimes

Phoneme-level PLMs employ a range of architectures, but most combine a discrete phoneme embedding layer with sequence modeling components such as LSTMs or Transformer encoders. A canonical architecture is an embedding layer (mapping each phoneme symbol to a learned vector), followed by an LSTM for context modeling and a softmax output over possible next phonemes:

$$p(c_t \mid c_1, \ldots, c_{t-1}) = \mathrm{softmax}\left(W_\mathrm{out} \cdot \mathrm{LSTM}(\mathrm{Emb}(c_1, \ldots, c_{t-1})) + b_\mathrm{out}\right)$$

where $c_t$ is the phoneme at time $t$ (Dalmia et al., 2019).
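
A minimal PyTorch sketch of this canonical architecture is shown below; the class name and layer sizes are illustrative rather than taken from the cited work.

```python
import torch
import torch.nn as nn

class PhonemeLSTMLM(nn.Module):
    """Next-phoneme language model: embedding -> LSTM -> softmax over phonemes."""

    def __init__(self, vocab_size: int, embed_dim: int = 128, hidden_dim: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)       # Emb(.)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)            # W_out, b_out

    def forward(self, phoneme_ids: torch.Tensor) -> torch.Tensor:
        # phoneme_ids: (batch, seq_len) integer IDs for c_1 .. c_{t-1}
        h, _ = self.lstm(self.embed(phoneme_ids))
        return self.out(h)  # logits; softmax gives p(c_t | c_1, ..., c_{t-1})

# Training uses standard next-token cross-entropy on shifted sequences, e.g.:
# loss = nn.CrossEntropyLoss()(logits[:, :-1].reshape(-1, V), phoneme_ids[:, 1:].reshape(-1))
```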

For large-scale pre-training, Transformer encoders—often initialized from BERT or RoBERTa—are commonly used, with modifications to ingest sequences entirely composed of phonemic tokens or hybrid sequences that blend phonemes with “sup-phoneme” units derived via byte pair encoding (BPE) over phoneme streams (e.g., Mixed-Phoneme BERT, XPhoneBERT) (Zhang et al., 2022, Nguyen et al., 2023). Pre-training objectives typically adapt masked language modeling (MLM) or span prediction to the phoneme domain, sometimes with auxiliary tasks such as phoneme-to-grapheme prediction (Li et al., 2023).

Multilingual pre-training leverages language-agnostic inventories (e.g., IPA) to create a shared phonemic vocabulary spanning dozens of languages. Model input units are then the union of phoneme sets across all languages, potentially augmented by task-specific tokens (e.g., <space>, <sos>) or word boundary markers. Cross-lingual parameter sharing is realized by masked training, applying loss only over the subset of relevant phonemes for each language during batch updates (Dalmia et al., 2019).
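
One way such per-language loss restriction can be realized is sketched below, assuming a globally indexed shared IPA vocabulary; this is an illustration of the idea, not the exact implementation of the cited work.

```python
import torch
import torch.nn.functional as F

def language_masked_loss(logits, targets, lang_phoneme_ids):
    """Cross-entropy computed only over the current language's phoneme subset.

    logits: (batch, seq_len, universal_vocab) scores over the shared IPA vocabulary
    targets: (batch, seq_len) gold next-phoneme IDs, all within this language's inventory
    lang_phoneme_ids: 1-D LongTensor of universal-vocabulary indices valid for this language
    """
    # Suppress phonemes outside this language's inventory so the softmax
    # normalizes only over the relevant subset during this batch update.
    mask = torch.full((logits.size(-1),), -1e9, device=logits.device)
    mask[lang_phoneme_ids] = 0.0
    restricted = logits + mask
    return F.cross_entropy(restricted.reshape(-1, restricted.size(-1)), targets.reshape(-1))
```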

2. Multilinguality, Crosslingual Adaptation, and Efficiency

A principal benefit of phoneme-level pre-training is efficient multilingual parameter sharing. By modeling languages in a universal IPA-based space, a single model can serve multiple languages, drastically reducing the memory and computational footprint versus running separate monolingual models (Dalmia et al., 2019, Nguyen et al., 2023). For example, a single multi-PLM covering six languages matches the phoneme-level perplexity of its monolingual counterparts while using ~6× fewer parameters.

During cross-lingual adaptation, multilingual pre-trained PLMs are fine-tuned using a small amount of data from a new target language. The “masked-training” approach ensures faster convergence and lower perplexity in low-resource regimes (5–20% of target data) compared to monolingual-from-scratch training (Dalmia et al., 2019). As the amount of target-specific data increases (e.g., 50% coverage), performance between adapted and monolingual PLMs becomes similar, or the monolingual model may slightly outperform.

Efficient transfer also underpins zero-shot and few-shot recognition for extremely low-resource languages. In phoneme recognition, transfer from a multilingual phoneme model like Allosaurus or XLSR-53 is enhanced via mechanisms such as language-specific allophone layers (mapping universal phone predictions to target inventory via max-pooling and regularization) and articulatory feature–based mapping (using Hamming distances in articulatory vectors to relate source and target phoneme sets) (Siminyu et al., 2021, Xu et al., 2021).
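
A simplified sketch of the allophone-layer idea appears below: universal phone scores are max-pooled onto a target-language inventory. The mapping format is hypothetical, and the regularization used in the cited systems is omitted.

```python
import torch

def allophone_layer(universal_logits, allophone_map):
    """Collapse universal phone scores onto a target-language phoneme inventory
    by max-pooling over each target phoneme's allophones.

    universal_logits: (..., n_universal_phones) scores from the multilingual model
    allophone_map: list where entry i holds the universal phone indices that
                   realize target phoneme i (e.g., [[3, 17], [5], ...])
    """
    pooled = [universal_logits[..., idxs].max(dim=-1).values for idxs in allophone_map]
    return torch.stack(pooled, dim=-1)  # (..., n_target_phonemes)
```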

3. Masking Strategies and Pretraining Objectives

Phoneme-level masking is an important development for self-supervised learning of speech representations. Unlike fixed-length or random masking, phoneme-level masking leverages forced-aligned phoneme boundaries to mask all acoustic frames corresponding to a phoneme. This masking strategy increases task complexity—since recovery requires reconstructing entire linguistic units—and is found to improve the sharpness and discriminative quality of learned representations (higher framewise phoneme and speaker classification accuracy, reduced local smoothness artifacts in spectrograms) (Zhang et al., 2022).
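
A sketch of how such a mask can be built from forced-alignment output is given below; the frame-indexing convention and the masking ratio are illustrative.

```python
import torch

def phoneme_level_mask(num_frames, phoneme_boundaries, mask_prob=0.15):
    """Build a frame-level mask that always covers whole phonemes.

    num_frames: total number of acoustic frames in the utterance
    phoneme_boundaries: list of (start_frame, end_frame) pairs from forced alignment
    mask_prob: fraction of phonemes to mask (illustrative value)
    """
    mask = torch.zeros(num_frames, dtype=torch.bool)
    for start, end in phoneme_boundaries:
        if torch.rand(1).item() < mask_prob:
            mask[start:end] = True  # mask every frame of the selected phoneme
    return mask
```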

Masked language modeling (MLM) objectives are adapted to the phoneme or mixed-phoneme domain in several ways:

  • Masked Phoneme Prediction: Random subsets of phoneme tokens are replaced by a mask or random phoneme; the model must reconstruct the original sequence (Li et al., 2023, Nguyen et al., 2023).
  • Consistent Dual-Modality Masking: In mixed phoneme–sup-phoneme BERT, if a sup-phoneme is masked, all underlying phoneme tokens are masked as well, preventing information leakage (Zhang et al., 2022).
  • Cross-modal (text–phoneme) masking tasks: Objectives such as conditional MLM and cross-modal MLM force models to infer masked phoneme or text tokens from the corresponding unmasked sequence and the paired modality (Kim et al., 2020, Chen et al., 2021).
  • Span Reconstruction: T5lephone applies a span-level masking strategy with Poisson-distributed span length; 15% of phonemic tokens in each sequence are masked for denoising pretraining (Hsu et al., 2022).

Such strategies encourage models to learn context-sensitive, segment-level representations and facilitate generalization to downstream tasks (ASR, SLU, TTS).
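
As an illustration of the first strategy above, the following sketch applies BERT-style corruption to a phoneme ID sequence; the 15% rate and 80/10/10 split follow common MLM practice and are not the exact recipe of any cited model.

```python
import torch

def corrupt_phonemes(phoneme_ids, mask_id, vocab_size, mask_prob=0.15):
    """Masked phoneme prediction corruption (illustrative ratios).

    Selects ~mask_prob of positions; 80% become the mask token, 10% a random
    phoneme, 10% are left unchanged. Returns corrupted inputs and MLM labels
    (-100 at unselected positions so the loss ignores them).
    """
    labels = phoneme_ids.clone()
    selected = torch.rand_like(phoneme_ids, dtype=torch.float) < mask_prob
    labels[~selected] = -100  # predict only at selected positions

    corrupted = phoneme_ids.clone()
    roll = torch.rand_like(phoneme_ids, dtype=torch.float)
    corrupted[selected & (roll < 0.8)] = mask_id
    random_ids = torch.randint_like(phoneme_ids, vocab_size)
    replace = selected & (roll >= 0.8) & (roll < 0.9)
    corrupted[replace] = random_ids[replace]
    return corrupted, labels
```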

4. Decoding, Integration with Speech Systems, and Downstream Applications

Phoneme-level PLMs are deployed most effectively as language models for CTC-based ASR, as robust decoders, or as encoders in TTS. In low-resource ASR, PLMs are integrated into prefix-tree beam search decoders that restrict the search to valid sequences in a lexicon, yielding word error rates (WER) competitive with, and often superior to, standard WFST-based decoders under domain or data mismatches (Dalmia et al., 2019, Ma et al., 5 Jun 2025).
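
As a sketch of the lexicon-constraint idea (not the decoder implementation of the cited papers), a prefix tree over phoneme pronunciations can be queried at each beam-search step to find which phonemes may legally extend the current hypothesis:

```python
class LexiconTrie:
    """Prefix tree over the phoneme sequences of lexicon words; beam search
    only expands hypotheses whose phoneme prefix matches a lexicon entry."""

    def __init__(self):
        self.children = {}   # phoneme -> LexiconTrie
        self.is_word = False

    def add(self, phonemes):
        node = self
        for p in phonemes:
            node = node.children.setdefault(p, LexiconTrie())
        node.is_word = True

    def allowed_next(self, prefix):
        """Phonemes that can legally extend `prefix` within the lexicon."""
        node = self
        for p in prefix:
            node = node.children.get(p)
            if node is None:
                return set()  # prefix not in lexicon: prune this hypothesis
        return set(node.children)

# Example with two hypothetical entries:
trie = LexiconTrie()
trie.add(["k", "ae", "t"])   # "cat"
trie.add(["k", "aa", "r"])   # "car"
assert trie.allowed_next(["k"]) == {"ae", "aa"}
```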

A prominent recent trend is the decoupled speech-to-phoneme (S2P) and phoneme-to-grapheme (P2G) decoding pipeline. Here, crosslingual pre-trained acoustic models (e.g., Whistle-S2P) generate phoneme sequences from raw speech, which are converted to grapheme or subword units via language models trained for P2G (e.g., mT5-based) (Ma et al., 5 Jun 2025). To counteract error propagation from noisy S2P outputs, two techniques are deployed:

  • Data Augmentation with Noisy Phonemes (DANP): The P2G stage is exposed to a wide range of error-prone phoneme hypotheses during training.
  • Randomized Top-K Marginalized (TKM) Decoding: Instead of relying on the 1-best S2P output, marginalization over multiple top-K hypotheses reduces cumulative information loss.

This dual-stage design makes it possible to leverage large language models for grapheme decoding and achieves statistically significant relative WER improvements (for instance, 3.6%/6.9% over WFST baselines for Polish/German) (Ma et al., 5 Jun 2025).
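
The following conceptual sketch shows the marginalization at the heart of TKM decoding; the randomized hypothesis selection of the cited method is omitted, and `p2g_log_prob` is a hypothetical scoring callback rather than a real API.

```python
import math

def topk_marginalized_score(grapheme_candidate, topk_phoneme_hyps, p2g_log_prob):
    """Score a grapheme hypothesis by marginalizing over top-K S2P hypotheses.

    topk_phoneme_hyps: list of (phoneme_seq, s2p_log_prob) pairs from the S2P model
    p2g_log_prob: callable (grapheme_candidate, phoneme_seq) -> log P(graphemes | phonemes)
    Returns log sum_k P(phonemes_k | speech) * P(graphemes | phonemes_k).
    """
    log_terms = [s2p_lp + p2g_log_prob(grapheme_candidate, ph)
                 for ph, s2p_lp in topk_phoneme_hyps]
    m = max(log_terms)
    return m + math.log(sum(math.exp(t - m) for t in log_terms))  # log-sum-exp
```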

In TTS, phoneme-level PLMs (e.g., Mixed-Phoneme BERT, PL-BERT, XPhoneBERT) replace or augment traditional text encoders. By providing contextually enriched but parameter-efficient phonemic representations, they boost prosodic naturalness and data efficiency, especially in multi-speaker or low-resource regimes (Zhang et al., 2022, Li et al., 2023, Nguyen et al., 2023, Yang et al., 31 Aug 2025). Enhanced phrase break prediction accuracy and mean opinion scores (MOS) are empirically validated, notably when using speaker-conditioned models at the phoneme level (Yang et al., 31 Aug 2025).

Automatic word segmentation and phonological probing tasks using phoneme-level LMs have been developed to study model-internal sensitivity to linguistic structure, including the ability to recover acoustic word boundaries in the absence of orthographic cues (Goriely et al., 4 Apr 2025). Prediction-error signatures—such as entropy peaks at likely word onsets—are effective for unsupervised segmentation and motivate linguistically informed subword tokenizers.
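
A simplified sketch of the entropy-peak cue is given below; real systems combine several such cues, and the peak-picking rule here is deliberately minimal.

```python
import torch

def entropy_peak_boundaries(logits):
    """Propose word-boundary candidates at local peaks of next-phoneme entropy,
    a simplified version of the prediction-error segmentation cue.

    logits: (seq_len, vocab) next-phoneme scores from a phoneme-level LM
    Returns indices whose predictive entropy exceeds both neighbours.
    """
    probs = torch.softmax(logits, dim=-1)
    entropy = (-(probs * torch.log(probs + 1e-12)).sum(dim=-1)).tolist()
    return [t for t in range(1, len(entropy) - 1)
            if entropy[t] > entropy[t - 1] and entropy[t] > entropy[t + 1]]
```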

5. Robustness, Adaptation, and Data Efficiency

Phoneme-level PLMs are particularly well-suited to low-resource and cross-domain adaptation tasks. Their performance advantage is realized by:

  • Parameter sharing across languages due to the use of a universal phonemic inventory.
  • Improved crosslingual transfer, where multilingual pre-training on unrelated or non-tonal languages can still enable accurate recognition of typologically distinct target languages, as demonstrated on Iu Mien and Luhya varieties (Dong et al., 18 Jul 2024, Siminyu et al., 2021).
  • The incorporation of allophone layers and articulatory-aware phoneme mapping, which align model outputs with language-specific phoneme inventories and pronunciation systems.

Even minimal adaptation (few-shot fine-tuning with as little as 10–100 utterances) can produce substantial reductions in phoneme or word error rates, reflecting the high degree of phonetic structure encoded by the pre-trained models (Siminyu et al., 2021, Dong et al., 18 Jul 2024). In spoken language understanding (SLU), cross-modal pre-training (e.g., ST-BERT) and joint textual-phonetic objectives improve data efficiency and robustness to ASR errors (Kim et al., 2020, Chen et al., 2021).

6. Evaluation Metrics and Analytical Benchmarks

Performance assessment for phoneme-level PLMs includes:

  • Phoneme-level perplexity (PPL), phone error rate (PER), and word error rate (WER) for ASR/CTC pipelines (Dalmia et al., 2019, Dong et al., 18 Jul 2024).
  • Subjective metrics such as mean opinion score (MOS) and comparison MOS (CMOS) in TTS applications, capturing naturalness and expressive prosody (Zhang et al., 2022, Li et al., 2023, Nguyen et al., 2023).
  • Frame-level articulatory feature (AF) probing, correlating the ability to capture detailed phone distinctions (manner, place, etc.) with downstream recognition performance. Higher macro-averaged F1 on AF tasks predicts lower PER (Ji et al., 2022).
  • Lexical and syntactic metrics (e.g., sWUGGY, sBLIMP, BLiMP, and GLUE benchmarks) to probe the semantic and grammatical knowledge of phoneme-level models, especially those operating on continuous phonemic streams (Goriely et al., 30 Oct 2024, Goriely et al., 4 Apr 2025). For language acquisition simulation and word segmentation tasks, cues derived from entropy, cross-entropy loss, and utterance boundary probability are evaluated for their ability to signal segmentation boundaries (Goriely et al., 4 Apr 2025).
  • Information-theoretic analyses, measuring mutual information and conditional entropy between phoneme-level representations and prosodic targets, further support their application in linguistically sensitive tasks such as phrase break prediction (Yang et al., 31 Aug 2025).

7. Implications, Challenges, and Future Directions

Phoneme-level pre-trained language modeling offers unique opportunities for resource efficiency, robustness, and cross-modality alignment in speech and language technologies. The empirical successes in ASR, SLU, TTS, and phonological probing highlight their ability to circumvent orthographic mismatch, model fine-grained pronunciation, and enable new forms of distributional linguistic analysis.

Key open challenges include:

  • Optimizing model architectures, pretraining, and masking strategies for longer phoneme sequences, prosodic features, and speaker variation.
  • Improving or adapting phonemization pipelines for typologically diverse, tonal, or underdocumented languages to avoid representation bias.
  • Balancing trade-offs between semantic richness and acoustic fidelity, particularly for tasks that require both lexical comprehension and expressive resynthesis (Poli et al., 16 Sep 2024).
  • Scaling to truly universal, cross-lingual, and multimodal phonological modeling—where phoneme-level units can serve as a bridge between speech understanding and text-based NLP.

A plausible future direction involves the systematic integration of unsupervised boundary cues into adaptive tokenization for pre-training, the design of cognitively plausible models of speech acquisition, and further harmonization with end-to-end neural frameworks for speech-to-speech and multimodal generation.
