G2P-Aided TTS Systems

Updated 11 December 2025
  • G2P-aided TTS systems are frameworks that integrate explicit grapheme-to-phoneme conversion to resolve pronunciation ambiguities and provide accurate phoneme supervision in speech synthesis.
  • They employ diverse architectures, including aligner-based models, neural sequence frameworks, and context-aware pipelines to handle heteronyms, polyphones, and cross-lingual challenges.
  • These systems improve performance on metrics such as Phoneme Error Rate (PER) and Mean Opinion Score (MOS), enabling robust and natural speech output even in low-resource or noisy linguistic settings.

Grapheme-to-Phoneme (G2P)-aided Text-to-Speech (TTS) systems integrate explicit grapheme-to-phoneme conversion modules into the TTS front end or architecture, providing high-quality phoneme-level supervision for subsequent acoustic modeling. These systems are critical for handling script-phonology ambiguities, such as heteronyms (words with multiple pronunciations), polyphonic characters in logographic languages, cross-lingual phoneme mapping, and OOV (out-of-vocabulary) word pronunciation. Recent work encompasses aligner-based candidate scoring, neural context-aware G2P, lexicon-free approaches, multilingual models, and service-oriented pipelines.

1. G2P-Aided TTS: Role, Motivation, and Challenges

G2P-aided TTS systems explicitly incorporate a grapheme-to-phoneme conversion step, either as a pre-processing module or as an intrinsic component of the model architecture. This layer resolves the fundamental challenge stemming from the non-bijective mapping between orthography and pronunciation, especially in languages with substantial homography or polyphony (e.g., English heteronyms, Mandarin polyphones, Hebrew diacritics, or code-switching contexts). Direct mapping from text to speech without resolving G2P ambiguities leads to reduced naturalness, intelligibility, or errors in pitch and duration prediction. G2P modules thus serve both as alignment scaffolds for attention-based TTS, and as disambiguators for context-dependent pronunciation phenomena (Huang et al., 2023, Li, 2023, Kolani et al., 14 Jun 2025).

Key motivations for G2P-aided approaches include:

  • Disambiguation of context-dependent pronunciations (heteronyms, polyphones) where graphemic form underspecifies the phoneme sequence.
  • Stabilization of TTS model alignment, especially for sequence-to-sequence or non-autoregressive architectures, via monotonic phoneme supervision.
  • Extension to novel or low-resource languages by leveraging multilingual G2P or cross-lingual transfer (Comini et al., 2023, Chary et al., 3 Sep 2025, Do et al., 2023, Peters et al., 2017).
  • Compatibility with domain-specific phonetic conventions or phoneme/IPA inventories.

2. G2P-Aided TTS Architectures

A diverse set of architectures integrate G2P conversion into TTS pipelines, differing in representation, context modeling, and coupling with downstream modules:

  • Aligner-based Disambiguation: The RAD-TTS aligner employs dual text/audio encoders and a token–audio frame alignment mechanism to select between candidate phonemic realizations of heteronyms. Candidate pronunciations are scored by minimum L₂-alignment cost, enabling data-driven labeling without human annotation (Huang et al., 2023).
  • Neural Sequence Models: Transformer-based and LSTM-based encoders (sometimes multi-task with prosody prediction) are used for G2P conversion. These models embed graphemes and context, outputting phoneme sequences fed directly to TTS acoustic models (e.g., Tacotron, FastSpeech). Recent advancements include deep, multilingual front-ends and architectures combining BERT-derived features for polyphone disambiguation and prosodic boundary location (Zhang et al., 2020, Comini et al., 2023, Park et al., 2020, Chary et al., 3 Sep 2025).
  • Context-Aware and Retrieval-Augmented Models: Contextual G2P utilizing LLM in-context knowledge retrieval (ICKR) or prompt-engineered generation pipelines (e.g., for Mandarin polyphone disambiguation with external multi-level lexica) have demonstrated superior accuracy by leveraging large-scale semantic knowledge and context retrieval (Li, 2023, Han et al., 12 Nov 2024).
  • CTC and Parallel Models: LiteG2P uses a CNN–BiGRU–CTC pipeline with dictionary-based phoneme masks, enabling highly efficient, parallel inference suitable for embedded and real-time applications. These small-footprint models run at sub-millisecond latency and match or nearly match the accuracy of Transformer-based architectures (Wang et al., 2023); a minimal model sketch appears after this list.
  • Service-Oriented Systems: Architectures like “Piper + LCA-G2P” deploy heterogeneous context-free and context-aware phonemizers as independent services communicating via pipes. This enables decoupling of heavy disambiguation (homographs, morphological markers) from the TTS engine and supports real-time fallback strategies (Fetrat et al., 8 Dec 2025).
  • Self-Supervised, Lexicon-Free Pipelines: Recent work demonstrates TTS systems trained from data-driven pseudo-phoneme (HuBERT cluster) representations, eliminating the need for expert lexica altogether. Such models achieve MOS scores on par with state-of-the-art lexicon-based systems (Garg et al., 19 Jan 2024).
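
As a concrete illustration of the CTC-based design above, the following is a minimal PyTorch sketch of a CNN–BiGRU–CTC grapheme-to-phoneme model with a dictionary-derived output mask. Layer sizes, the masking step, and all identifiers are illustrative assumptions, not the published LiteG2P configuration.

```python
# Minimal sketch of a CTC-style parallel G2P model in the spirit of LiteG2P.
# Dimensions and the dictionary-mask mechanics are illustrative assumptions.
import torch
import torch.nn as nn

class CTCG2P(nn.Module):
    def __init__(self, n_graphemes: int, n_phonemes: int, dim: int = 256):
        super().__init__()
        self.embed = nn.Embedding(n_graphemes, dim)
        # 1-D convolutions capture local grapheme context fully in parallel.
        self.conv = nn.Sequential(
            nn.Conv1d(dim, dim, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(dim, dim, kernel_size=3, padding=1), nn.ReLU(),
        )
        # A bidirectional GRU adds sentence-level context.
        self.bigru = nn.GRU(dim, dim // 2, bidirectional=True, batch_first=True)
        self.out = nn.Linear(dim, n_phonemes + 1)  # +1 for the CTC blank

    def forward(self, graphemes: torch.Tensor, phoneme_mask: torch.Tensor = None):
        x = self.embed(graphemes)                         # (B, T, D)
        x = self.conv(x.transpose(1, 2)).transpose(1, 2)  # conv over time axis
        x, _ = self.bigru(x)
        logits = self.out(x)                              # (B, T, P + 1)
        if phoneme_mask is not None:
            # Dictionary fusion: rule out phonemes the lexicon disallows
            # for this input (boolean mask broadcastable to the logits).
            logits = logits.masked_fill(~phoneme_mask, float("-inf"))
        # Log-probabilities for nn.CTCLoss (which expects (T, B, C) inputs).
        return logits.log_softmax(dim=-1)
```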

3. Methodologies for Disambiguation and Robustness

Disambiguation of context-sensitive pronunciations in G2P-aided TTS is approached via several algorithmic methodologies:

  • Candidate Generation and Scoring: For each ambiguous word or character, all possible phonemic sequences from a pronunciation lexicon are substituted into the text, and each candidate is scored against the paired audio using alignment cost (e.g., L₂ distance via Viterbi path extraction). A confidence metric is computed as the normalized gap between the best and worst aligned candidates, enabling pruning of low-confidence assignments. This pipeline can automatically annotate large speech corpora to bootstrap robust G2P and heteronym classifiers, as well as direct G2P↔TTS supervision files (Huang et al., 2023); a scoring sketch appears after this list.
  • Context Encoding: Contextual cues are injected in one of three ways: (1) self-attention over long character/token windows (BERT-based encoders or deep Transformers, often with explicit left/right context tokens, as in r-G2P (Zhao et al., 2022)); (2) explicit context tags (language-ID tokens, part-of-speech, semantic roles) (Comini et al., 2023); or (3) prompt conditioning when using LLMs for case retrieval or generation (Li, 2023, Han et al., 12 Nov 2024).
  • External Knowledge Augmentation: Systems like Dict-TTS and retrieval-augmented LLMs utilize online dictionaries, multi-level lexica, or semantic glossaries to augment G2P classification or generation. Attention over external memory (e.g., semantics-to-pronunciation attention, S2PA) or case-matching with LLMs tightly couples prior knowledge to sentence context, yielding improvements in polyphone accuracy (Li, 2023, Jiang et al., 2022, Han et al., 12 Nov 2024).
  • Robustness to Noise: Controlled synthesis of orthographic noise (insertions, deletions, substitutions, adversarial perturbation) is used to regularize G2P models and enhance their tolerance to real-world input, lowering WER/PER and improving downstream TTS robustness (Zhao et al., 2022); an illustrative noise-injection sketch appears after this list.
  • Efficient Inference and Deployment: CTC-based modeling, model size reduction via knowledge distillation (TinyBERT) (Zhang et al., 2020), service-oriented decoupling (Fetrat et al., 8 Dec 2025), and on-device optimization (LiteG2P, Phonikud) permit sub-real-time performance even for contextually complex pipelines.
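
The candidate generation and scoring step from the first bullet can be sketched as below. Here alignment_cost is a stand-in for the aligner's L₂ Viterbi alignment cost, and the exact confidence formula (a normalized best-worst gap) is an assumption about the form described above.

```python
# Sketch of aligner-based candidate scoring; alignment_cost is a stand-in
# for the RAD-TTS aligner's L2 Viterbi cost, and the confidence formula is
# an assumed instantiation of the "normalized best-worst gap" described.
from typing import Callable, Optional, Sequence, Tuple

def score_candidates(
    candidates: Sequence[Sequence[str]],       # candidate phoneme sequences
    audio_features: object,                    # paired utterance features
    alignment_cost: Callable[[Sequence[str], object], float],
    min_confidence: float = 0.1,
) -> Optional[Tuple[Sequence[str], float]]:
    # Score every candidate pronunciation against the paired audio.
    costs = [alignment_cost(cand, audio_features) for cand in candidates]
    best, worst = min(costs), max(costs)
    if worst <= 0:
        return None                            # degenerate alignment costs
    confidence = (worst - best) / worst        # normalized best-worst gap
    if confidence < min_confidence:
        return None                            # prune low-confidence labels
    return candidates[costs.index(best)], confidence
```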
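
The noise-robustness bullet above suggests a simple augmentation routine like the following; the operation mix and rate p are illustrative choices rather than the exact recipe of r-G2P (Zhao et al., 2022).

```python
# Illustrative orthographic-noise injection for G2P training data.
# The per-character rates for each operation are assumptions.
import random
import string

def perturb(word: str, p: float = 0.1) -> str:
    out = []
    for ch in word:
        r = random.random()
        if r < p / 3:
            continue                                            # deletion
        if r < 2 * p / 3:
            out.append(random.choice(string.ascii_lowercase))   # substitution
            continue
        out.append(ch)
        if r < p:
            out.append(random.choice(string.ascii_lowercase))   # insertion
    return "".join(out)

# e.g. perturb("pronunciation") might yield "pronunciaton" or "pronunceiation"
```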

4. Evaluation Metrics and Empirical Results

Key performance indicators and evaluation metrics for G2P-aided TTS include:

  • Phoneme Error Rate (PER) and Word Error Rate (WER) of the G2P module.
  • Homograph/polyphone disambiguation accuracy at the character and sentence level.
  • Mean Opinion Score (MOS) for the naturalness of synthesized speech.
  • Real-time factor (RTF) and model footprint for deployment.

Empirically, integration of G2P-aided pipelines has shown:

  • Phoneme disambiguation accuracy gains of 4.4–6.1 pp on rare/“hard” heteronym test sets (Huang et al., 2023).
  • Contextual LLM retrieval reduces homograph error rates by ~3.5% absolute and PER by 2.0% (29% relative) versus strong acoustic baselines (Han et al., 12 Nov 2024).
  • Unified BERT-based Mandarin front-ends outperform classic pipelines (TinyBERT-MTL: 96.79% character ACC, 91.99% sentence ACC) while reducing disk footprint by 75% (Zhang et al., 2020).
  • Service-oriented pipelines (Piper + LCA-G2P) reduce PER by 24%, homograph error by 34 pp, and maintain real-time processing (RTF=0.17) (Fetrat et al., 8 Dec 2025).
  • Lexicon-free, self-supervised G2P modules yield TTS MOS of 3.96, slightly exceeding standard neural G2P and rule-based front-ends (Garg et al., 19 Jan 2024).
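
For reference, the PER figures quoted above follow the standard definition: Levenshtein edit distance between the predicted and reference phoneme sequences, normalized by reference length. A self-contained sketch:

```python
# Phoneme Error Rate: edit distance between predicted and reference
# phoneme sequences, normalized by the reference length.
def phoneme_error_rate(ref: list, hyp: list) -> float:
    # Standard dynamic-programming Levenshtein distance.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# phoneme_error_rate(["HH","EH","L","OW"], ["HH","AH","L","OW"]) -> 0.25
```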

5. Integration Patterns, Multilinguality, and Deployment

The integration of G2P into TTS spans a spectrum:

  • Multi-stage Pipelines: the classical chain of text normalization → G2P → phoneme embedding → acoustic model → vocoder (Huang et al., 2023, Wang et al., 2023, Zhang et al., 2020); a pipeline skeleton appears after this list.
  • Unified Multilingual Front-ends: A single Transformer model, with language ID and context-awareness, replaces per-language rule-based and post-lexical modules, supporting G2P, polyphone/homograph disambiguation, post-lexical rules, and diacritization within a single architecture (Comini et al., 2023, Chary et al., 3 Sep 2025, Peters et al., 2017).
  • Contextual Generation and Retrieval: For high-ambiguity or low-resource languages, external knowledge–augmented models (LLM retrieval with case matching, external dictionaries for S2PA) can supply contextually correct phonemes for TTS input, even with minimal direct supervision (Li, 2023, Han et al., 12 Nov 2024, Jiang et al., 2022).
  • Low-Resource and Code-Switching Scenarios: Massively multilingual G2P models allow rapid extension to unseen or underrepresented languages/dialects; code-switching is enabled by segmental language-ID tokens in the G2P front-end (Chary et al., 3 Sep 2025, Peters et al., 2017, Do et al., 2023).
  • On-device/Real-time TTS: Small, parallel, dictionary-fused G2P modules (LiteG2P, Phonikud) enable high-accuracy, low-latency TTS on edge/mobile devices (Wang et al., 2023, Kolani et al., 14 Jun 2025).
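
The classical multi-stage chain from the first bullet can be sketched as a plain composition of components; all component and method names here (normalize, convert, infer) are hypothetical placeholders, not a real API.

```python
# Skeleton of the classical multi-stage pipeline:
# text normalization -> G2P -> acoustic model -> vocoder.
# Every name below is a hypothetical placeholder for illustration.
from dataclasses import dataclass
from typing import Any

@dataclass
class TTSPipeline:
    normalizer: Any       # expands numbers, abbreviations, symbols
    g2p: Any              # lexicon lookup plus neural fallback for OOVs
    acoustic_model: Any   # phonemes -> mel-spectrogram (FastSpeech-style)
    vocoder: Any          # mel-spectrogram -> waveform

    def synthesize(self, text: str):
        text = self.normalizer.normalize(text)     # text normalization
        phonemes = self.g2p.convert(text)          # context-aware G2P
        mel = self.acoustic_model.infer(phonemes)  # acoustic modeling
        return self.vocoder.infer(mel)             # waveform generation
```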

6. Extensions, Limitations, and Open Problems

Despite recent progress, G2P-aided TTS faces several challenges:

  • Data Dependence: Supervised disambiguation requires annotated corpora; aligner-based labeling partially mitigates this via self-labeling from paired speech, but novel heteronyms, low-frequency senses, or new dialects may not be covered (Huang et al., 2023).
  • Generalization: Even with multilingual training, language- or script-specific idiosyncrasies (e.g., Mandarin tone, French liaison, Arabic diacritics) may be underfit in a universal front-end. There exist trade-offs between monolingual and multilingual solutions, particularly in coverage of rare/complex phonological rules (Comini et al., 2023).
  • Robustness: Models remain sensitive to noisy or out-of-domain orthography. Explicit noise augmentation and adversarial training can improve this, but higher-order spelling errors or semantic context failures are still open problems (Zhao et al., 2022).
  • Latency–Quality Trade-off: Larger, contextually precise G2P pipelines sometimes incur unacceptable latency unless architecturally decoupled from the TTS engine (service-oriented) or heavily compressed/distilled (Fetrat et al., 8 Dec 2025, Zhang et al., 2020).
  • Interpretability: Lexicon-free/self-supervised phoneme representations boost generalizability, but lack human-interpretable labels, complicating downstream error diagnosis or linguistic analysis (Garg et al., 19 Jan 2024).
  • OOV Handling and Lexicon Expansion: OOV words not present in the lexicon fall back to neural G2P, which may be less accurate; hybrid approaches (fallback models, semi-supervised dictionary extension) remain an active area of research. A minimal fallback sketch appears below.
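
A minimal sketch of the lookup-with-fallback pattern described in the last bullet, with hypothetical names:

```python
from typing import Callable, Dict, List

def to_phonemes(word: str,
                lexicon: Dict[str, List[str]],
                neural_g2p: Callable[[str], List[str]]) -> List[str]:
    # Trust the curated lexicon first; fall back to neural G2P for OOVs.
    if word in lexicon:
        return lexicon[word]
    return neural_g2p(word)  # possibly less accurate than lexicon entries
```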

7. Impact and Outlook

G2P-aided TTS architectures, via explicit phoneme supervision, context-aware disambiguation, robust modeling strategies, and multilingual generalization, consistently raise the production quality of speech synthesis. Quantitative gains are observed in PER, downstream MOS, and reduction of user-facing mispronunciations. At the same time, advances in lexicon-free cluster-based modeling and service-oriented integration patterns highlight promising future directions in efficiency, scalability, and adaptability.

Active research continues on joint training with acoustic models (end-to-end G2P-TTS), further acceleration (quantization, pruning, NPU deployment), integration of richer syntax/semantic or multimodal context for rare cases, and the development of universal, interpretable, and robust G2P front-ends for global TTS applications (Zhang et al., 2020, Comini et al., 2023, Chary et al., 3 Sep 2025, Fetrat et al., 8 Dec 2025, Zhao et al., 2022).

References (15)
