MUST&P-SRL: Multi-lingual and Unified Syllabification in Text and Phonetic Domains for Speech Representation Learning (2310.11541v1)
Abstract: We present a methodology for linguistic feature extraction that focuses on automatically syllabifying words in multiple languages and is designed to be compatible with a forced-alignment tool, the Montreal Forced Aligner (MFA). The method extracts phonetic transcriptions from text, stress marks, and a unified automatic syllabification that is consistent across the text and phonetic domains. The system is built with open-source components and resources. Through an ablation study, we demonstrate the efficacy of our approach in automatically syllabifying words from several languages (English, French and Spanish). Additionally, we apply the technique to the transcriptions of the CMU ARCTIC dataset, generating annotations that are available online (https://github.com/noetits/MUST_P-SRL) and are well suited to speech representation learning, speech unit discovery, and disentanglement of speech factors in several speech-related fields.