LLM-P2G: Neural Phoneme-to-Grapheme Conversion

Updated 23 December 2025
  • LLM-P2G is a two-stage model that converts phoneme sequences into graphemes using pretrained neural language models, enhancing ASR efficiency.
  • It replaces traditional WFST pipelines with a flexible neural framework by decoupling acoustic modeling from language generation for improved cross-lingual and low-resource tasks.
  • Innovative training strategies like DANP, TKM, and SKM mitigate error propagation and accelerate convergence, leading to lower word error rates and better scalability.

LLM-based Phoneme-to-Grapheme (LLM-P2G) methods address the conversion of phoneme sequences to text (graphemes) within the context of automatic speech recognition (ASR), multilingual speech technologies, and related phonetic tasks. By decoupling acoustic and language modeling via a two-stage architecture, LLM-P2G leverages the generative capabilities of pretrained large language models (LLMs) to replace traditional, lexicon-driven pipelines such as WFST decoding. Recent advances have demonstrated that LLM-P2G systems yield improved recognition accuracy, system scalability, and linguistic coverage, particularly in cross-lingual and low-resource scenarios (Ma et al., 5 Jun 2025, Ma et al., 20 Dec 2025, Li et al., 28 Oct 2025). This article details the fundamental pipeline, model architectures, training objectives, empirical benchmarks, and methodological variants underpinning modern LLM-P2G systems.

1. Architectural Framework and Pipeline

LLM-P2G systems operate through a modular two-stage process:

  • Stage 1 – Speech-to-Phoneme (S2P): An acoustic encoder, typically a Conformer or related sequence model trained with CTC loss, predicts a phoneme sequence $h = h_1, \dots, h_M$ from input acoustics $x = x_1, \dots, x_T$. The model computes $p(h|x)$, readily producing either 1-best or top-$K$ phoneme hypotheses for downstream use.
  • Stage 2 – Phoneme-to-Grapheme (P2G): An LLM, such as mT5-base or a transformer-based decoder, is fine-tuned or prompted with $h$ to generate the target grapheme (subword/word) sequence $y = y_1, \dots, y_L$, modeling $p(y|h)$. This stage exploits the LLM's transliterative and contextual capabilities, allowing broader adaptation than fixed pronunciation lexicons or n-gram word LMs. A minimal two-stage sketch follows this list.
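
As a concrete illustration of this decomposition, the following minimal sketch wires a hypothetical S2P front end to an mT5-based P2G model via Hugging Face Transformers; the checkpoint name and the `s2p_topk` helper are placeholders, not components of the cited systems.

```python
# Minimal two-stage LLM-P2G sketch (illustrative; checkpoint and helper names are placeholders).
from transformers import AutoTokenizer, MT5ForConditionalGeneration

P2G_CKPT = "your-org/mt5-base-p2g"  # hypothetical mT5 checkpoint fine-tuned on phoneme->grapheme pairs
tokenizer = AutoTokenizer.from_pretrained(P2G_CKPT)
p2g_model = MT5ForConditionalGeneration.from_pretrained(P2G_CKPT)

def s2p_topk(audio, k: int = 4):
    """Stage 1 (assumed): a CTC acoustic model returning the top-k phoneme strings
    and their posteriors p(h|x). Implementation is outside the scope of this sketch."""
    raise NotImplementedError

def p2g(phoneme_seq: str, num_beams: int = 4) -> str:
    """Stage 2: generate graphemes y from one phoneme hypothesis h with the P2G LLM."""
    inputs = tokenizer(phoneme_seq, return_tensors="pt")
    outputs = p2g_model.generate(**inputs, num_beams=num_beams, max_new_tokens=128)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Usage (1-best): hyps = s2p_topk(audio); text = p2g(hyps[0][0])
```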

Advantage Over WFST: The classic WFST pipeline fuses the acoustic model, pronunciation lexicon, and n-gram LM into a finite-state decoding graph, which is inflexible and difficult to scale for new LMs or languages. LLM-P2G achieves a conceptually simpler, fully neural, and more extensible framework that directly leverages state-of-the-art LLMs (Ma et al., 5 Jun 2025, Ma et al., 20 Dec 2025).

2. Model Structure and Mathematical Formulation

2.1 LLM-based P2G Design

LLM-P2G utilizes encoder–decoder transformers (e.g., mT5-base with $12$ layers, hidden size $768$) as the core for grapheme generation. Phoneme tokens are embedded and positionally encoded, then processed by the encoder. The autoregressive decoder generates graphemes, using attention over both encoder representations and previous decoder outputs.

The generative process is formalized as $p(y|h) = \prod_{i=1}^{L} p(y_i \mid h, y_1, \dots, y_{i-1})$, with the overall pipeline, marginalizing over candidate phoneme sequences, given by

$$p(y|x) = \sum_{h} p(h|x)\, p(y|h).$$

In practice, this sum is approximated over a set of top-$K$ S2P hypotheses.
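
As a worked illustration of this truncated marginalization, the sketch below combines S2P posteriors and P2G scores for one candidate transcript in log space; all probability values are placeholders.

```python
import numpy as np

# Placeholder values for K = 3 phoneme hypotheses h^(1..3) and one candidate transcript y.
log_p_h_given_x = np.log([0.60, 0.25, 0.15])   # log p(h^(k) | x) from the S2P model
log_p_y_given_h = np.log([0.40, 0.35, 0.05])   # log p(y | h^(k)) from the P2G model

# p(y|x) ~= sum_k p(h^(k)|x) * p(y|h^(k)), computed stably in log space.
log_p_y_given_x = np.logaddexp.reduce(log_p_h_given_x + log_p_y_given_h)
print(np.exp(log_p_y_given_x))  # 0.335 = 0.6*0.40 + 0.25*0.35 + 0.15*0.05
```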

2.2 Hybrid S2P–P2G Models

More recent unified models (e.g., POWSM) adopt a joint encoder–decoder backbone (e.g., E-Branchformer + Transformer decoder), with multitask capability for ASR, G2P, P2G, and phone recognition—all parameter-shared except for task-specific prompts or targets. Input prompts for P2G combine a task token, language token, and a phoneme sequence, facilitating cross-task interoperability (Li et al., 28 Oct 2025).

3. Training Objectives and Strategies

LLM-P2G models address the inherent information loss in S2P–P2G cascades with robust training techniques:

  • Data Augmentation with Noisy Phonemes (DANP): The P2G LLM is trained on pairs $(\tilde{h}, y)$ constructed by injecting realistic phoneme errors, either via S2P beam search or stochastic sampling, to mitigate train-test mismatches and increase robustness (Ma et al., 5 Jun 2025).
  • Top-K Marginalized (TKM) Training: Rather than conditioning on a single S2P hypothesis, TKM marginalizes over the top-$K$ phoneme sequences: $\mathcal{L}_{\text{TKM}} = -\sum_{(x,y)} \log\big[\sum_{k=1}^{K} p(h^{(k)}|x)\, p(y|h^{(k)})\big]$. This approach distributes gradient signals across multiple plausible phoneme paths (Ma et al., 5 Jun 2025, Ma et al., 20 Dec 2025).
  • Randomized TKM: For computational efficiency, each batch samples a random subset of $n < K$ hypotheses from the top-$K$ phoneme candidates to estimate the marginalization.
  • Sampling-K Marginalized (SKM) Training: SKM generalizes TKM by replacing beam search with random sampling of candidate phoneme sequences: $\mathcal{L}_{\text{SKM}} \approx -\log\big(\frac{1}{K}\sum_{k=1}^{K} p(h^{(k)}|x)\, p(y|h^{(k)})\big)$, where $\{h^{(k)}\}_{k=1}^{K}$ are sampled independently from the S2P model. This increases phoneme path diversity, accelerates convergence, and delivers better generalization (Ma et al., 20 Dec 2025). A training-loss sketch follows this list.
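
A minimal PyTorch sketch of the SKM objective for a single utterance is given below; it assumes an S2P sampler that returns $K$ phoneme sequences with their log-probabilities and a P2G scorer that returns $\log p(y|h^{(k)})$. Both helper names are illustrative rather than taken from the cited papers.

```python
import math
import torch

def skm_loss(s2p_sample_k, p2g_loglik, audio, target_ids, K: int = 8) -> torch.Tensor:
    """Sampling-K Marginalized (SKM) loss for one utterance (illustrative sketch).

    s2p_sample_k(audio, K) -> (list of K phoneme sequences, tensor of K values log p(h^(k)|x))
    p2g_loglik(phonemes, target_ids) -> scalar tensor log p(y | h^(k)) from the P2G LLM
    """
    phoneme_hyps, log_p_h = s2p_sample_k(audio, K)                        # K sampled phoneme paths
    log_p_y = torch.stack([p2g_loglik(h, target_ids) for h in phoneme_hyps])
    joint = log_p_h + log_p_y                                             # log [ p(h^(k)|x) p(y|h^(k)) ]
    # L_SKM ~ -log( (1/K) * sum_k p(h^(k)|x) * p(y|h^(k)) ), computed in log space.
    return -(torch.logsumexp(joint, dim=0) - math.log(K))
```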

4. Decoding, Inference, and Resource Considerations

At inference, LLM-P2G decoding pipelines typically involve:

  1. S2P beam search (or SKM-style sampling) produces $K$ candidate phoneme sequences $\{h^{(k)}\}$ along with associated probabilities $\alpha_k = p(h^{(k)}|x)$.
  2. For each $h^{(k)}$, P2G beam search generates candidate grapheme sequences $y$ with probabilities $p(y|h^{(k)})$.
  3. Final scoring of each $y$ is performed via truncated marginalization:

$$\text{score}(y) = \sum_{k=1}^{K} \alpha_k \, p(y|h^{(k)}).$$

  4. Optionally, external LM rescoring may be applied: $\text{score}'(y) = \text{score}(y) + \lambda \log p_{\text{LM}}(y)$.
  5. The top-1 or top-$S$ sequences $y$ are output based on these scores. A decoding sketch follows this list.
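
The decoding recipe above can be sketched as follows; the scoring mirrors the formulas in this section, but the helper functions and the rescoring weight `lam` are illustrative.

```python
from collections import defaultdict

def decode(audio, s2p_topk, p2g_nbest, lm_logprob=None, K: int = 8, lam: float = 0.3) -> str:
    """Truncated-marginalization decoding with optional LM rescoring (illustrative sketch).

    s2p_topk(audio, K)  -> [(phoneme_seq, alpha_k = p(h^(k)|x)), ...]
    p2g_nbest(phonemes) -> [(text, p(y|h^(k))), ...] from P2G beam search
    lm_logprob(text)    -> optional external LM log-probability log p_LM(y)
    """
    scores = defaultdict(float)
    for phonemes, alpha_k in s2p_topk(audio, K):
        for text, p_y_given_h in p2g_nbest(phonemes):
            scores[text] += alpha_k * p_y_given_h            # score(y) = sum_k alpha_k * p(y|h^(k))
    if lm_logprob is not None:
        scores = {y: s + lam * lm_logprob(y) for y, s in scores.items()}  # score'(y)
    return max(scores, key=scores.get)                       # top-1 hypothesis
```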

Computationally, SKM reduces the training overhead compared to beam-based TKM, while test-time cost remains dominated by the necessity of scoring multiple S2P–P2G combinations. Empirical resource profiles (SKM with $K=8$) report real-time factors (RTF) around $0.10$ with moderate GPU RAM use, comparable to traditional WFST-based methods (Ma et al., 20 Dec 2025).

5. Comparative Benchmarks and Empirical Results

Experimental validation occurs on datasets such as Common Voice v11.0 for Polish and German, and the multilingual IPAPack++ for transfer and low-resource evaluations:

Model                        Polish WER (%)    German WER (%)
Whistle Phoneme + WFST       4.30              15.73
Whistle Subword              3.82              14.01
LLM-P2G + TKM                3.80              13.18
LLM-P2G + randomized TKM     3.68              13.03
LLM-P2G + SKM (proposed)     3.61              12.94

SKM (Sampling-K Marginalized training) yields the most robust gains, achieving 5.3% and 7.7% relative WER reductions over subword-based models on Polish and German respectively, with rapid convergence (~0.027 loss after 3k steps) and practical inference resource use (Ma et al., 20 Dec 2025).

In low-resource FLEURS languages with POWSM, direct gold-phoneme prompting for P2G reduces WER by substantial margins compared to ASR without phone prompts—e.g., Azerbaijani 67.5% → 37.1%—underscoring the effectiveness of decoupling pronunciation and orthography via the phoneme interface (Li et al., 28 Oct 2025).

6. Methodological Variants and Related Approaches

  • Auxiliary P2G Predictors in TTS: PL-BERT introduces a phoneme-level BERT with a per-position grapheme prediction head as an auxiliary pretext task. During pre-training, a cross-entropy loss is defined over word-level grapheme labels per phoneme, shaping the encoder to be sensitive to orthographic structure without explicit grapheme generation at run time. This strategy yields significantly improved naturalness and prosodic richness in TTS, particularly on out-of-distribution text; removing the P2G head sharply reduces grapheme prediction accuracy (Li et al., 2023). A sketch of such an auxiliary head follows this list.
  • Unified Multitask Models: POWSM demonstrates joint training of ASR, G2P, P2G, and phoneme recognition in a single model with shared attention-based architecture and shared hybrid CTC/attention loss, simplifying deployment and supporting rapid adaptation to new phonetic tasks and languages (Li et al., 28 Oct 2025).
  • Comparison to Adapter-based SpeechLLM: LLM-P2G with SKM marginalization achieves lower WER than adapter-based models (e.g., Whistle + pooling-adapter + mT5) while remaining structurally simpler, requiring no additional projector layers (Ma et al., 20 Dec 2025).
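
A minimal sketch of a PL-BERT-style auxiliary grapheme-prediction head is shown below; it assumes phoneme-level encoder states and word-level grapheme labels aligned to each phoneme position, and the dimensions and class name are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AuxGraphemeHead(nn.Module):
    """Auxiliary P2G head: predict a word-level grapheme label at every phoneme position."""

    def __init__(self, hidden_size: int = 768, grapheme_vocab: int = 10000):
        super().__init__()
        self.proj = nn.Linear(hidden_size, grapheme_vocab)

    def forward(self, encoder_states: torch.Tensor, grapheme_labels: torch.Tensor) -> torch.Tensor:
        # encoder_states: (batch, num_phonemes, hidden_size) from the phoneme encoder
        # grapheme_labels: (batch, num_phonemes) grapheme id aligned to each phoneme (-100 = ignore)
        logits = self.proj(encoder_states)
        return F.cross_entropy(logits.view(-1, logits.size(-1)),
                               grapheme_labels.view(-1), ignore_index=-100)
```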

7. Design Challenges and Future Research Directions

Several challenges and open questions remain for LLM-P2G:

  • Information Loss in S2P–P2G Cascade: Phoneme recognition errors propagate to P2G, motivating marginalization and robust training. DANP and TKM/SKM methods mitigate, but do not fully resolve, the error compounding.
  • Extension to End-to-End Training: Current methods train S2P and P2G models separately or with frozen acoustic encoders. End-to-end optimization under the marginalization objective could further reduce the interface gap (Ma et al., 5 Jun 2025).
  • Language Conditioning and Adaptation: The efficacy of P2G depends on accurate language identification; decoder language tokens capture cross-linguistic phonotactics but require consistent labeling and adaptation (Li et al., 28 Oct 2025).
  • Scaling and Generalization: SKM training is language- and model-size agnostic, promoting application to new scripts and low-resource scenarios. Broader exploration of input representations, continuous phoneme embeddings, and richer S2P output structures (e.g., lattices) remains a promising direction. Unified approaches can support transfer learning and multitask expansion.
  • Auxiliary Losses vs. Direct Generation: In TTS and related tasks, auxiliary P2G objectives act solely during training to enrich representations rather than as direct inference modules. A plausible implication is that direct, lightweight P2G decoders on phoneme embeddings might be required for applications needing on-the-fly orthographic output in low-resource settings (Li et al., 2023).

LLM-P2G methods have transformed the phoneme-to-grapheme mapping paradigm in speech processing. Recent empirical advances demonstrate they outpace classical WFST baselines in key benchmarks, simplify system design, and open new opportunities for efficient, accurate, and scalable speech recognition across languages (Ma et al., 5 Jun 2025, Ma et al., 20 Dec 2025, Li et al., 28 Oct 2025).
