MUST&P-SRL: Multi-lingual and Unified Syllabification in Text and Phonetic Domains for Speech Representation Learning
Abstract: We present a methodology for linguistic feature extraction that focuses on automatically syllabifying words in multiple languages and is designed to be compatible with a forced-alignment tool, the Montreal Forced Aligner (MFA). The method extracts phonetic transcriptions from text, stress marks, and a unified automatic syllabification in both the text and phonetic domains. The system is built from open-source components and resources. Through an ablation study, we demonstrate the efficacy of our approach in automatically syllabifying words from several languages (English, French and Spanish). We also apply the technique to the transcriptions of the CMU ARCTIC dataset and make the resulting annotations available online (https://github.com/noetits/MUST_P-SRL). These annotations are intended for speech representation learning, speech unit discovery, and disentanglement of speech factors in several speech-related fields.
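As a rough illustration of what syllabification in the phonetic domain involves, the sketch below applies the maximal onset principle to an ARPAbet phone sequence with stress digits. It is not the method of this paper: the vowel set, the small `LEGAL_ONSETS` sample, and the function names `strip_stress` and `syllabify` are assumptions made for the example, and a real system would need a full, language-specific onset inventory.

```python
# Illustrative sketch only; inventories and names are hypothetical.
VOWELS = {"AA", "AE", "AH", "AO", "AW", "AY", "EH", "ER", "EY",
          "IH", "IY", "OW", "OY", "UH", "UW"}

# A small sample of legal English onset clusters (assumption, not exhaustive).
LEGAL_ONSETS = {
    ("S", "T"), ("S", "P"), ("S", "K"), ("S", "T", "R"), ("S", "P", "R"),
    ("T", "R"), ("P", "R"), ("K", "R"), ("B", "R"), ("D", "R"), ("G", "R"),
    ("P", "L"), ("K", "L"), ("B", "L"), ("G", "L"), ("F", "L"), ("F", "R"),
}


def strip_stress(phone):
    """Remove a trailing ARPAbet stress digit, e.g. 'AH0' -> 'AH'."""
    return phone.rstrip("012")


def syllabify(phones):
    """Split a list of ARPAbet phones into syllables (maximal onset)."""
    nuclei = [i for i, p in enumerate(phones) if strip_stress(p) in VOWELS]
    if not nuclei:
        return [list(phones)]

    boundaries = [0]  # start index of each syllable
    for prev, nxt in zip(nuclei, nuclei[1:]):
        cluster = [strip_stress(p) for p in phones[prev + 1:nxt]]
        split = len(cluster)
        # Find the longest suffix of the cluster that is a legal onset.
        # An empty or single-consonant onset is always accepted here.
        for k in range(len(cluster) + 1):
            onset = tuple(cluster[k:])
            if len(onset) <= 1 or onset in LEGAL_ONSETS:
                split = k
                break
        boundaries.append(prev + 1 + split)
    boundaries.append(len(phones))

    return [list(phones[a:b]) for a, b in zip(boundaries, boundaries[1:])]


if __name__ == "__main__":
    # "extra": the cluster K S T R is split so that S T R forms the onset.
    print(syllabify(["EH1", "K", "S", "T", "R", "AH0"]))
    # "water": the single consonant T goes to the second syllable.
    print(syllabify(["W", "AO1", "T", "ER0"]))
```

Stress digits stay attached to the vowels in the output, so the stress mark of each syllable can be read directly from its nucleus.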