Bridging the Gap: An Intermediate Language for Enhanced and Cost-Effective Grapheme-to-Phoneme Conversion with Homographs with Multiple Pronunciations Disambiguation

Published 10 May 2025 in cs.CL | (2505.06599v1)

Abstract: Grapheme-to-phoneme (G2P) conversion for Persian presents unique challenges due to its complex phonological features, particularly homographs and Ezafe, which exist in formal and informal language contexts. This paper introduces an intermediate language specifically designed for Persian language processing that addresses these challenges through a multi-faceted approach. Our methodology combines two key components: LLM prompting techniques and a specialized sequence-to-sequence machine transliteration architecture. We developed and implemented a systematic approach for constructing a comprehensive lexical database for disambiguating homographs with multiple pronunciations, often termed polyphones, utilizing formal concept analysis for semantic differentiation. We train our model using two distinct datasets: the LLM-generated dataset for formal and informal Persian and the B-Plus podcasts for informal language variants. The experimental results demonstrate superior performance compared to existing state-of-the-art approaches, particularly in handling the complexities of Persian phoneme conversion. Our model significantly improves Phoneme Error Rate (PER) metrics, establishing a new benchmark for Persian G2P conversion accuracy. This work contributes to the growing research in low-resource language processing, provides a robust solution for Persian text-to-speech systems, and demonstrates applicability beyond Persian. Specifically, the approach can extend to languages with rich homographic phenomena such as Chinese and Arabic.


Authors (6)

Summary

Grapheme-to-Phoneme Conversion for Persian Language Processing

The paper "Bridging the Gap: An Intermediate Language for Enhanced and Cost-Effective Grapheme-to-Phoneme Conversion with Homographs with Multiple Pronunciations Disambiguation" presents an approach to Persian grapheme-to-phoneme (G2P) conversion. The difficulty comes mainly from two features of the language: homographs with multiple pronunciations (polyphones), and the Ezafe phenomenon, which is hard to process computationally because it is absent from the standard script and has to be inferred from context.
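
To make the two ambiguities concrete, the following toy sketch shows one written form with several readings, and an Ezafe linker that is spoken but not written. The romanizations use an ad hoc scheme chosen for illustration; they are not the paper's intermediate language, and the helper names are hypothetical.

```python
# Toy illustration only: ad hoc romanization, NOT the paper's intermediate language.
# The written form "کرم" has several readings because short vowels are unwritten.
HOMOGRAPHS = {
    "کرم": {
        "worm": "kerm",
        "generosity": "karam",
        "cream": "kerem",
    },
}


def candidate_pronunciations(word):
    """Return every known reading of a written word, sorted alphabetically."""
    return sorted(HOMOGRAPHS.get(word, {}).values())


def join_with_ezafe(head, modifier):
    """Insert the spoken Ezafe linker after a consonant-final head word."""
    return f"{head}-e {modifier}"


if __name__ == "__main__":
    print(candidate_pronunciations("کرم"))
    # "کتاب من" (my book) is written without the linking vowel.
    print(join_with_ezafe("ketab", "man"))
```

A G2P system has to pick one reading from the candidate list and decide where a linker belongs, using only the surrounding sentence.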

Methodological Advancements

The authors propose a dual-component methodology that combines Large Language Model (LLM) prompting techniques with a sequence-to-sequence machine transliteration framework designed for Persian's phonetic complexities. The approach introduces a novel intermediate language representation, termed "Pinglish", to facilitate grapheme-to-phoneme mapping, and the authors report that it significantly improves phoneme error rate (PER).

Dataset Construction

A notable strength of this study is its dataset construction process, which addresses both the scarcity of annotated data and the intricacies of homographs and Ezafe. The dataset combines raw Persian sentences generated by LLMs with informal Persian text sourced from B-Plus podcasts. The intermediate language representation allows for consistent and unambiguous phoneme mapping, in contrast to traditional IPA-based methods, which the summary characterizes as often excessively complex and redundant.

Tokenization and Model Architecture

A custom byte-pair encoding (BPE) tokenizer handles the cross-linguistic elements within the dataset while keeping tokenization consistent. The EncoderDecoder model architecture, optimized for transliteration tasks, is intended to improve computational efficiency and contextual understanding, both of which are essential for real-time text-to-speech applications.
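
For readers unfamiliar with BPE, the sketch below shows the general merge-learning idea on a toy corpus. It is a minimal pure-Python illustration of the standard algorithm, not the paper's tokenizer; the function names and the romanized sample words are assumptions made for the example.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace each adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merges by repeatedly fusing the most frequent pair."""
    vocab = Counter(tuple(w) for w in words)  # each word starts as characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # nothing left to merge
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def apply_bpe(word, merges):
    """Tokenize a word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


if __name__ == "__main__":
    corpus = ["ketab", "ketab", "ketabe", "man"]  # hypothetical romanized words
    merges = learn_bpe(corpus, num_merges=2)
    print(merges, apply_bpe("ketab", merges))
```

Because merges are learned from the training text itself, one vocabulary can cover both the source script and the intermediate representation, which is presumably what keeps tokenization consistent across the two sides.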

Key Results

Experimental results show a BLEU score of 94.6 and a PER of 0.0196, which the authors present as superior to existing state-of-the-art methods. The authors also report strong precision, recall, and F1 scores for Ezafe detection and high accuracy in homograph disambiguation. These outcomes support the claim that the intermediate language framework and the sequence-to-sequence transliteration architecture capture the nuances of Persian phonology.
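
As background on the PER metric, the sketch below computes it in the usual way: total edit distance between predicted and reference phoneme sequences, divided by the total number of reference phonemes. The paper's exact evaluation script is not shown in this summary, so the details here are assumptions, including treating each character of a romanized string as one phoneme. If the reported PER is a fraction, 0.0196 would correspond to roughly two errors per hundred reference phonemes.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion of r
                cur[j - 1] + 1,          # insertion of h
                prev[j - 1] + (r != h),  # substitution, or match at cost 0
            ))
        prev = cur
    return prev[-1]


def phoneme_error_rate(references, hypotheses):
    """Corpus-level PER: summed edit distance over summed reference length."""
    errors = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total = sum(len(r) for r in references)
    return errors / total


if __name__ == "__main__":
    # Assumption: one character stands for one phoneme in this toy data.
    refs = [list("ketab"), list("man")]
    hyps = [list("kitab"), list("man")]
    print(phoneme_error_rate(refs, hyps))  # one substitution over eight phonemes
```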

Implications and Future Directions

The implications of this research extend beyond Persian language processing, offering potential applications in other languages with rich homographic phenomena, such as Chinese and Arabic. The intermediate language system contributes a scalable and adaptable solution, promising advancements in low-resource language processing and text-to-speech systems.

Future research should focus on extending this framework to other morphologically rich languages, leveraging transformer-based architectures for enhanced disambiguation and semantic-aware processing. Further investigation into multilingual models through cross-lingual transfer learning represents a promising avenue to expand the applicability of this approach.

In summary, the paper offers a technically detailed solution to Persian grapheme-to-phoneme conversion and provides a foundation for further work in computational linguistics and speech processing, particularly in low-resource settings.