Phoneme-Level Pre-Trained Language Models

Updated 20 May 2026

Phoneme-level PLMs are neural models trained on phonemic sequences to capture fine-grained articulatory and linguistic patterns.
They integrate both acoustic-based and discrete symbol-based architectures, enabling robust speech recognition, synthesis, and multilingual adaptation.
Evaluations using metrics like PER and boundary accuracy demonstrate their superior performance in unsupervised segmentation and cross-lingual transfer.

Phoneme-level pre-trained language models (PLMs) are a class of neural architectures trained directly on sequences of phonemes, rather than graphemes, words, or subwords, with objectives tailored to the structure and statistical properties of phonological data. By exposing models to phonemic representations, these PLMs improve the handling of cross-linguistic phonology, acquisition-inspired benchmarks, and speech application pipelines where word and text boundaries are not always explicit or available. They encompass both text-derived (grapheme-to-phoneme) and acoustic signal-based approaches and underpin numerous innovations in speech recognition, synthesis, low-resource adaptation, and child-language modeling.

1. Model Architectures and Pre-training Objectives

Phoneme-level PLMs can be categorized by the modality of their inputs and the class of network architectures they employ.

1.1 Acoustic-based Architectures

Self-supervised learning (SSL) models such as CPC, wav2vec 2.0, and HuBERT operate on raw waveforms, learning representations that encode phonological and articulatory structure:

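Contrastive Predictive Coding (CPC) adopts a two-stage architecture with a convolutional encoder and a GRU context network. Pre-training uses a contrastive loss to predict future latent features from context vectors, optimizing:

$\mathcal{L}_{\rm CPC} = -\frac{1}{K} \sum_{k=1}^K \log \frac{\exp(z_{t+k}^\top W_k c_t)}{\sum_{z_j \in Z} \exp(z_j^\top W_k c_t)}$

Here $z_{t+k}$ is the encoder latent $k$ steps ahead of time $t$, $c_t$ is the context vector, $W_k$ is a step-specific projection, and $Z$ is the candidate set containing $z_{t+k}$ and the negative samples.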

wav2vec 2.0 extends this to a Transformer-based context network with a product quantizer, optimizing a masked contrastive loss and codebook diversity regularization.
HuBERT builds on wav2vec 2.0 by using acoustic units discovered offline (k-means on MFCCs, followed by iterative re-clustering), with a masked prediction objective over the resulting clusters. The architecture matches wav2vec 2.0 (12 Transformer layers, 768-dim hidden), and the model exhibits superior articulatory feature encoding and cross-lingual phoneme recognition performance (Ji et al., 2022).
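
As a rough illustration of the CPC objective above, the following pure-Python sketch evaluates the loss at a single time step. It assumes the candidate set $Z$ consists of the true future latent plus a shared list of negative samples. The function names and toy vectors are illustrative only and are not taken from any CPC implementation.

```python
import math

def bilinear_score(z, W, c):
    """Compute z^T W c for plain Python lists."""
    Wc = [sum(w_ij * c_j for w_ij, c_j in zip(row, c)) for row in W]
    return sum(z_i * wc_i for z_i, wc_i in zip(z, Wc))

def info_nce(positive, negatives, W, c):
    """Negative log-probability of the positive among all candidates."""
    scores = [bilinear_score(positive, W, c)]
    scores += [bilinear_score(z, W, c) for z in negatives]
    m = max(scores)  # subtract the max for numerical stability
    log_denom = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_denom - scores[0]

def cpc_loss(positives, negatives, Ws, c):
    """Average the per-step loss over K prediction steps.

    positives[k] is the true latent k+1 steps ahead and Ws[k] is the
    matching step-specific projection (assumed layout for this sketch).
    """
    K = len(positives)
    return sum(info_nce(positives[k], negatives, Ws[k], c) for k in range(K)) / K

identity = [[1.0, 0.0], [0.0, 1.0]]
c_t = [1.0, 0.0]                          # toy context vector
future = [[2.0, 0.0], [1.5, 0.0]]         # K = 2 true future latents
distractors = [[0.0, 1.0], [-1.0, 0.0]]   # toy negative samples
loss = cpc_loss(future, distractors, [identity, identity], c_t)
```

With an all-zero projection every candidate scores the same, so the loss equals the log of the number of candidates. A projection that aligns the context with the true future latents drives it below that value.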

1.2 Discrete Symbol-based Architectures

LSTM/Transformer LM on phonemes: These models explicitly operate over token sequences of IPA phonemes, typically with embedding layers, LSTM/Transformer cores, and (in multilingual settings) language-conditioned output masking (Dalmia et al., 2019, Goriely et al., 2024).
Phoneme-based BERT/ALBERT Variants: Models like MP BERT and PL BERT recast BERT-style training (masked language modeling) on phoneme or sup-phoneme (BPE on phonemes) tokens (Yang et al., 31 Aug 2025).
T5lephone: A T5 model pretrained on byte-encoded phonemes from phonemized Wikipedia, optimizing the span-denoising objective of canonical T5, with minimal architectural change to leverage cross-modal weights (Hsu et al., 2022).
Autoregressive Transformer LMs: GPT-2–style decoder-only models with learned phoneme embeddings, trained with standard next-token prediction to model child-directed and multilingual phonemic corpora (Goriely et al., 4 Apr 2025, Goriely et al., 2024).
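
The language-conditioned output masking mentioned for multilingual phoneme LMs can be sketched as a softmax restricted to one language's phoneme inventory. This is a minimal illustration under assumed names (`language_masked_softmax`, the toy vocabulary and inventory). It is not the implementation used in the cited work.

```python
import math

def language_masked_softmax(logits, vocab, inventory):
    """Softmax over a shared vocabulary, restricted to one language's inventory."""
    if not any(p in inventory for p in vocab):
        raise ValueError("inventory shares no symbols with the vocabulary")
    # Phonemes outside the inventory get a logit of -inf, hence probability 0.
    masked = [x if p in inventory else float("-inf") for x, p in zip(logits, vocab)]
    m = max(masked)
    exps = [math.exp(x - m) for x in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return {p: e / total for p, e in zip(vocab, exps)}

shared_vocab = ["a", "k", "ʃ", "ŋ", "y"]   # toy shared IPA vocabulary
toy_inventory = {"a", "k", "ʃ"}            # toy inventory for one language
probs = language_masked_softmax([1.0, 1.0, 1.0, 4.0, 4.0], shared_vocab, toy_inventory)
```

Even though the two out-of-inventory phonemes carry the largest logits here, they receive zero probability and the mass is shared among the three permitted symbols.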

2. Phoneme Representation, Tokenization, and Data Flow

PLMs at the phoneme level require precise mapping from audio or text to phoneme inventories, attention to word boundary and utterance demarcation, and often explicit strategies for managing cross-linguistic diversity.

Phonemization: Grapheme-to-phoneme (G2P) conversion using tools like espeak-ng or lookup tables yields IPA or language-specific phoneme sequences (Goriely et al., 2024, Hsu et al., 2022).
Phoneme Vocabulary: Typically 40–200 tokens per language; broader sets (IPA-wide) for multilingual models. Multilingual PLMs use per-language start-of-sequence and space symbols (Dalmia et al., 2019).
Continuous Stream vs. Word-Boundaries: Models may process continuous streams of phoneme tokens with or without explicit boundaries, affecting downstream parsing and learning (Goriely et al., 2024, Goriely et al., 4 Apr 2025).
Embedding Strategies: Models employ learned embedding layers (e.g., 64–768d), sometimes with language or sup-phoneme indices, or byte-level encodings for compatibility with text PLMs.
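
The data flow described in this section (phonemization, optional boundary symbols, vocabulary lookup) can be sketched with a toy lookup-table G2P step. The lexicon, the `<utt>` and `<wb>` symbols, and the function names are hypothetical placeholders chosen for this sketch. A real pipeline would use a tool such as espeak-ng or a full pronunciation dictionary.

```python
UTT_START = "<utt>"      # hypothetical utterance-start symbol
WORD_BOUNDARY = "<wb>"   # hypothetical word-boundary symbol

def phonemize(utterance, lexicon, keep_boundaries=True):
    """Map a text utterance to phoneme tokens via a lookup table."""
    tokens = [UTT_START]
    for word in utterance.lower().split():
        tokens.extend(lexicon[word])  # raises KeyError for out-of-lexicon words
        if keep_boundaries:
            tokens.append(WORD_BOUNDARY)
    return tokens

def build_vocab(token_sequences):
    """Assign integer ids to tokens in order of first appearance."""
    vocab = {}
    for seq in token_sequences:
        for tok in seq:
            vocab.setdefault(tok, len(vocab))
    return vocab

toy_lexicon = {"the": ["ð", "ə"], "cat": ["k", "æ", "t"]}
with_wb = phonemize("The cat", toy_lexicon)
without_wb = phonemize("The cat", toy_lexicon, keep_boundaries=False)
vocab = build_vocab([with_wb])
ids = [vocab[t] for t in with_wb]
```

Setting `keep_boundaries=False` yields the continuous-stream variant, in which the model sees no explicit word boundaries.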

3. Evaluation Benchmarks and Probing Methodologies

Phoneme-level PLMs are evaluated via a variety of metrics, probing tasks, and low-resource benchmarks.

Task/Metric Area	Example Metric	Key Findings/Observations
Articulatory Feature Probing	Macro-F1, AF-score	HuBERT yields +34.4% (within) and +26.7% (cross) over MFCC in frame-level AF-score (Ji et al., 2022)
Phoneme Recognition	Phone Error Rate (PER)	HuBERT: 10.2% (within), 23.0% (cross), lowest among compared (CPC, wav2vec 2.0) (Ji et al., 2022)
Word Segmentation	F1, boundary accuracy	Linear probe: 70–90% accuracy, UBP cue best for cross-lingual segmentation (Goriely et al., 4 Apr 2025)
Spoken LM Comprehension	Zero-shot sWUGGY/sBLIMP	Phoneme-fine-tuned HuBERT matches large text LMs with 150× less data (Poli et al., 2024)
Downstream Speech Tasks	WER, MOS, BLEU, F1	PL BERT/MP BERT outperform subword LMs in TTS phrasing (F0.5 +0.12), T5lephone improves SQA/ST (Yang et al., 31 Aug 2025, Hsu et al., 2022)
Low-resource ASR	PPL, WER	Multi-PLM matches/betters monolingual LMs at ≤50% data, robust to domain shift (Dalmia et al., 2019)

Quantitative evidence indicates that phoneme-level PLMs encode finer-grained acoustic-articulatory information, outperform text-PLMs in sound-based or low-resource tasks, and enable unsupervised or minimally supervised word segmentation and phonological transfer.
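
The phone error rate (PER) in the table is conventionally the Levenshtein distance between the reference and hypothesized phoneme sequences, divided by the reference length. The sketch below follows that convention. The function names are illustrative, and the cited papers may differ in normalization details.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def phone_error_rate(ref, hyp):
    """(substitutions + deletions + insertions) / reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)

reference = ["k", "æ", "t"]
per_sub = phone_error_rate(reference, ["k", "a", "t"])   # one substitution
per_del = phone_error_rate(reference, ["k", "æ"])        # one deletion
```

Because insertions count as errors, PER can exceed 1.0 when the hypothesis is much longer than the reference.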

4. Cross-lingual, Multilingual, and Low-resource Modeling

Phoneme-level PLMs are especially suited for multilingual, cross-lingual, and low-resource settings due to the universal nature of phoneme sets and the prevalence of audio-only or non-standard text data.

Universal Phoneme Vocabularies: Models share IPA-based sets across languages and employ masking for language-specific outputs (Dalmia et al., 2019).
Cross-lingual Adaptation: Fine-tuning multilingual PLMs on small fractions (as little as 5%) of target language data yields strong PPL/WER, outperforming Weighted Finite-State Transducer (WFST) decoders under domain shift (Dalmia et al., 2019).
Child-directed and Multilingual Data: Models trained on phonemized CHILDES data recover word structure and segmentation patterns in a fully unsupervised, cross-lingual fashion (Goriely et al., 4 Apr 2025).
Low-resource ASR: Shallow fusion of phoneme-level LMs with CTC decoders outperforms open-vocab CLM and matches or beats WFST across Babel languages (Dalmia et al., 2019).
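
Shallow fusion is usually formulated as a log-linear combination of the acoustic (here CTC) score and the language-model score. The sketch below applies it as n-best rescoring for simplicity. In practice the combination is typically applied inside beam search. The hypothesis scores and the `lm_weight` value are made-up toy numbers.

```python
def shallow_fusion_rescore(hypotheses, lm_weight=0.5):
    """Pick the hypothesis maximizing acoustic_logprob + lm_weight * lm_logprob.

    hypotheses: list of (phoneme_tuple, acoustic_logprob, lm_logprob).
    """
    def fused_score(h):
        _, acoustic_logprob, lm_logprob = h
        return acoustic_logprob + lm_weight * lm_logprob
    return max(hypotheses, key=fused_score)[0]

n_best = [
    (("k", "æ", "t"), -2.3, -1.0),   # slightly worse acoustically, likely under the LM
    (("k", "æ", "d"), -2.1, -4.0),   # slightly better acoustically, unlikely under the LM
]
acoustic_only = shallow_fusion_rescore(n_best, lm_weight=0.0)
fused = shallow_fusion_rescore(n_best, lm_weight=0.5)
```

With `lm_weight=0.0` the acoustically preferred hypothesis wins. With a positive weight the phoneme LM can overturn that choice.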

5. Phoneme-level PLMs in End-to-End Speech and Language Applications

Phoneme-level PLMs have been integrated flexibly into end-to-end pipelines spanning speech recognition, understanding, synthesis, and translation.

ASR and Resynthesis: Freezing or fine-tuning SSL speech representations with phoneme-classification heads yields highly context-invariant, abstract representations; language models over these units achieve lexical/syntactic comprehension rivaling text-PLMs trained on orders of magnitude more data. However, there is a trade-off: increasing abstraction reduces expressive resynthesis quality (higher WER/MCD for generated speech) (Poli et al., 2024).
Spoken Language Understanding (SLU): BERT-PLM consumes sequences of phoneme-posterior distributions; permutation-based masking objectives adapted for continuous posteriors enable full bi-directional context modeling, achieving ≥12.5% error reduction on intent detection benchmarks (Wang et al., 2019).
Text-to-Speech (TTS) Front-ends: Phoneme-level MP BERT and PL BERT power phrase break prediction, leveraging parallel phoneme/sup-phoneme embeddings and outperforming subword-level LMs with an F0.5 improvement of +0.12 and MOS gains of 0.15–0.20 (Yang et al., 31 Aug 2025).
Multimodal and End-to-End SLU/QA: T5lephone, pretrained on phonemized Wikipedia, bridges speech and text SSL by aligning acoustic representations with phoneme-tokenized text LMs. It demonstrates increased robustness to ASR error (+12 AOS) and improved BLEU for speech translation (Hsu et al., 2022).
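
For reference, and assuming that F0.5 in the phrase break results denotes the standard $F_\beta$ measure with $\beta = 0.5$, the metric weights precision more heavily than recall. Writing $P$ for precision and $R$ for recall (symbols introduced here, not in the cited work):

$F_\beta = \frac{(1 + \beta^2)\, P R}{\beta^2 P + R}, \qquad F_{0.5} = \frac{1.25\, P R}{0.25\, P + R}$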

6. Analytical and Practical Implications

Probing Phonological Structure: Phoneme-level PLMs provide models of phonological class learning, segmental distribution, and acquisition-inspired benchmarks (e.g., BabySLM, word segmentation) that textual PLMs cannot access (Goriely et al., 2024, Goriely et al., 4 Apr 2025).
Subword Tokenizer Design: Distributional phonological cues (e.g., boundary-token probabilities) from phoneme LMs motivate alternative subword tokenization algorithms that may outperform BPE, with empirical improvements in syntactic and lexical minimal-pair evaluations (Goriely et al., 4 Apr 2025).
Input Representation Sensitivity: Phonemic input induces a slight but statistically significant drop on conventional text-based understanding (BLiMP, GLUE) due to lack of punctuation/prosody information; explicit boundary tokens partially ameliorate this (Goriely et al., 2024).
Practical Recommendations: Use character-level phoneme tokenization with explicit boundary demarcation; retain flexibility for multilingual transfer and low-resource adaptation; consider hybrid/multimodal objectives for optimization (Goriely et al., 2024, Dalmia et al., 2019).
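
The use of boundary-token probabilities as a segmentation cue can be illustrated with a simple thresholding rule. The sketch assumes that `boundary_probs[i]` is a model's probability that a boundary token follows phoneme `i`. The probabilities below are invented, and the cited work may use different cues and decision rules.

```python
def segment_by_boundary_prob(phonemes, boundary_probs, threshold=0.5):
    """Split a phoneme stream wherever the boundary probability reaches the threshold."""
    if len(phonemes) != len(boundary_probs):
        raise ValueError("need one boundary probability per phoneme")
    words, current = [], []
    for phoneme, prob in zip(phonemes, boundary_probs):
        current.append(phoneme)
        if prob >= threshold:
            words.append(current)   # close the current word after this phoneme
            current = []
    if current:
        words.append(current)       # keep any trailing material as a final word
    return words

stream = ["ð", "ə", "k", "æ", "t"]
toy_probs = [0.1, 0.8, 0.1, 0.2, 0.9]   # invented boundary probabilities
segments = segment_by_boundary_prob(stream, toy_probs)
```

The same boundaries could in principle serve as merge points for a subword tokenizer, which is the design idea raised in this section.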

7. Limitations and Future Directions

Expressivity-Accuracy Trade-off: Increased phoneme abstraction benefits language modeling accuracy but impairs expressive resynthesis, particularly for paralinguistic styles and prosodic detail (Poli et al., 2024).
Boundary and Prosody Cues: Loss of orthographic or prosodic cues may harm performance on tasks relying on punctuation or intonation; future architectures should integrate prosodic/phonotactic embeddings and explicit end-of-utterance markers (Goriely et al., 2024).
Extending Beyond English: There is ongoing work on universal, multilingual phoneme corpora (31 CHILDES languages) and adaptation of phoneme-level PLMs driven by models like T5lephone and BabyLM (Goriely et al., 2024, Hsu et al., 2022, Goriely et al., 4 Apr 2025).
Multitask/Multimodal Pretraining: Joint optimization of SSL, phoneme classification, and speech-generation objectives, as well as integration with text PLMs, is identified as a promising direction (Poli et al., 2024, Hsu et al., 2022).
Phonemic vs. Subword Representation: Analytical work demonstrates the value of phoneme-level PLMs for cognitive, linguistic, and speech settings where grapheme-based LMs are suboptimal or unavailable.

Phoneme-level PLMs constitute a foundational technology for robust, transparent, and cross-linguistically competent speech and language systems, with continuing research refining both their analytical reach and practical impact.