L1-aware Multilingual Mispronunciation Detection Framework (2309.07719v2)
Abstract: The phonological discrepancies between a speaker's native language (L1) and the non-native language (L2) serve as a major factor in mispronunciation. This paper introduces a novel multilingual mispronunciation detection and diagnosis (MDD) architecture, L1-MultiMDD, enriched with L1-aware speech representation. An end-to-end speech encoder is trained on the input signal and its corresponding reference phoneme sequence. First, an attention mechanism is deployed to align the input audio with the reference phoneme sequence. Afterwards, the L1-L2 speech embeddings are extracted from an auxiliary model, pretrained in a multi-task setup to identify the L1 and L2 languages, and are infused into the primary network. Finally, L1-MultiMDD is optimized for a unified multilingual phoneme recognition task using connectionist temporal classification (CTC) loss for the target languages: English, Arabic, and Mandarin. Our experiments demonstrate the effectiveness of the proposed L1-MultiMDD framework on both seen datasets (L2-ARCTIC, LATIC, and AraVoiceL2v2) and unseen datasets (EpaDB and Speechocean762). The consistent gains in phoneme error rate (PER) and false rejection rate (FRR) across all target languages confirm our approach's robustness, efficacy, and generalizability.
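The abstract reports results in PER and FRR without defining them. The sketch below shows how these two metrics are commonly computed in MDD work; it is not taken from the paper. It assumes PER is the Levenshtein distance between reference and recognized phoneme sequences divided by the reference length, and FRR is the share of correctly pronounced phones that the system flags as mispronounced. The function names, the ARPAbet-style example phones, and the requirement that the three sequences passed to `false_rejection_rate` are already aligned phone by phone are all assumptions made for illustration.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1]


def phoneme_error_rate(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """PER = (substitutions + deletions + insertions) / reference length."""
    if not ref:
        raise ValueError("reference sequence must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def false_rejection_rate(canonical: Sequence[str],
                         annotated: Sequence[str],
                         predicted: Sequence[str]) -> float:
    """FRR over pre-aligned phones (assumption: equal-length sequences).

    canonical: phones the learner was supposed to say
    annotated: phones a human annotator heard
    predicted: phones the model recognized
    """
    if not (len(canonical) == len(annotated) == len(predicted)):
        raise ValueError("sequences must be aligned to the same length")
    # Keep only phones the annotator judged as correctly pronounced.
    correct = [(c, p) for c, a, p in zip(canonical, annotated, predicted)
               if a == c]
    if not correct:
        raise ValueError("no correctly pronounced phones to evaluate")
    # A false rejection: the model disagrees with a correct pronunciation.
    rejected = sum(1 for c, p in correct if p != c)
    return rejected / len(correct)


# Hypothetical example: the learner says "D" instead of "DH".
canonical = ["DH", "AH", "K", "AE", "T"]
annotated = ["D", "AH", "K", "AE", "T"]
predicted = ["D", "AH", "K", "EH", "T"]

per = phoneme_error_rate(annotated, predicted)
frr = false_rejection_rate(canonical, annotated, predicted)
```

In this example the model gets the real error ("D" for "DH") right but wrongly rejects the correctly pronounced "AE", so one of the four correct phones counts as a false rejection. Lower is better for both metrics.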