Extending Multilingual Speech Synthesis to Languages Beyond Transcribed Data
Introduction
The development of text-to-speech (TTS) systems typically favors languages with abundant high-quality transcribed audio. This is a significant limitation: of the nearly 6,000 languages worldwide, most are low-resource precisely because such data is scarce. This paper introduces a framework that expands TTS coverage to over 100 languages by applying unsupervised learning to untranscribed found data. The approach builds on a pretrained self-supervised multilingual speech foundation model for joint speech-text representation learning, and it can generate intelligible speech in languages for which no transcribed speech data was previously available.
Related Work
Past efforts in multilingual TTS have been constrained by the availability of high-quality paired speech-text data, limiting how far TTS systems can scale across the world's languages. Strategies that reduce data requirements through unpaired or synthetic training material have so far yielded models with limited language coverage or degraded performance. By combining unsupervised learning with recent advances in self-supervised speech pretraining and joint speech-text pretraining, this paper pushes TTS toward far broader language coverage.
Proposed Framework
At the heart of the proposed solution is a joint multilingual speech-text model comprising several components that support both supervised and unsupervised learning across languages. The framework pairs a pretrained speech-to-feature (S2F) block with a feature-to-speech (F2S) block, alongside training objectives designed for TTS language expansion. Crucially, the model can learn from found data, that is, speech-text paired data, untranscribed speech, and unspoken text, bypassing the need for carefully curated datasets.
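To make the component roles concrete, the following is a minimal sketch of such a joint architecture in PyTorch. All module names, sizes, and layer choices are illustrative assumptions, not the paper's actual configuration; the point is only that speech and text are embedded into a shared space and decoded back to acoustic features, which an F2S vocoder (not shown) would render as audio.

```python
import torch
import torch.nn as nn

class JointSpeechTextModel(nn.Module):
    """Illustrative sketch of a joint speech-text model: a speech
    encoder (the S2F path) and a text encoder feed a shared encoder,
    whose output a feature decoder maps back to acoustic features.
    Shapes and module choices are assumptions for illustration."""

    def __init__(self, vocab_size=256, d_model=512, n_feats=80):
        super().__init__()
        # S2F path: maps speech features into the shared embedding space.
        self.speech_encoder = nn.Sequential(
            nn.Linear(n_feats, d_model), nn.ReLU(), nn.Linear(d_model, d_model)
        )
        # Text path: embeds phoneme/grapheme tokens into the same space.
        self.text_encoder = nn.Embedding(vocab_size, d_model)
        # Shared encoder consumes either modality's embeddings.
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.shared_encoder = nn.TransformerEncoder(layer, num_layers=4)
        # Predicts acoustic features; an F2S vocoder would turn these
        # into a waveform.
        self.feature_decoder = nn.Linear(d_model, n_feats)

    def forward(self, speech=None, text=None):
        assert (speech is None) != (text is None), "pass exactly one modality"
        emb = self.speech_encoder(speech) if speech is not None else self.text_encoder(text)
        return self.feature_decoder(self.shared_encoder(emb))
```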
Training Objectives
The model is trained on a mixture of transcribed (paired) speech, untranscribed speech, and unspoken text, so it can learn from each data regime. Training leverages RNN-T decoder alignments, a feature loss, and duration prediction to optimize performance across languages. A key innovation is the use of pseudo-labeling for untranscribed speech and aligned text MLM for unspoken text, which enables effective learning even when no transcribed speech exists for a language.
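The sketch below shows how the three data regimes might be combined in a single training step. The callables, loss weights, and 15% masking rate are assumptions for illustration; the paper's actual objectives (RNN-T alignment losses, duration prediction, the aligned text MLM formulation) are more involved.

```python
import torch
import torch.nn.functional as F

def training_step(tts, asr_pseudo_labeler, token_head, batch, w=(1.0, 0.5, 0.3)):
    """Hedged sketch of mixing three data regimes in one step.
    All callables and the weights `w` are illustrative assumptions:
      tts(text) -> predicted acoustic features
      asr_pseudo_labeler(speech) -> pseudo-transcript token ids
      token_head(text, mask) -> logits over the token vocabulary
    """
    # 1) Transcribed (paired) speech: supervised feature reconstruction.
    text, feats = batch["paired"]
    total = w[0] * F.l1_loss(tts(text), feats)

    # 2) Untranscribed speech: pseudo-label it, then train TTS to
    #    reconstruct the original features from the pseudo-transcript.
    speech = batch["speech_only"]
    with torch.no_grad():
        pseudo = asr_pseudo_labeler(speech)
    total = total + w[1] * F.l1_loss(tts(pseudo), speech)

    # 3) Unspoken text: masked prediction over text tokens (a stand-in
    #    for the paper's aligned text MLM objective).
    utext = batch["text_only"]
    mask = torch.rand(utext.shape) < 0.15           # assumed masking rate
    logits = token_head(utext.masked_fill(mask, 0), mask)
    total = total + w[2] * F.cross_entropy(logits[mask], utext[mask])
    return total
```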
Curriculum Training Procedures
The model employs a stage-wise training approach: first the speech and shared encoders are pretrained, then the shared encoder and the RNN-T decoder are trained on paired data, and a final stage jointly trains on all data types, integrating the supervised and unsupervised objectives to improve generalization across languages.
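One common way to implement such a schedule is to freeze and unfreeze submodules per stage, as in the sketch below. The stage contents, step ordering, and module names (taken from the illustrative JointSpeechTextModel sketch above) are assumptions, not the paper's exact recipe.

```python
def set_trainable(module, flag):
    """Toggle gradient flow for one stage of the curriculum."""
    for p in module.parameters():
        p.requires_grad = flag

def run_curriculum(model, stages):
    """Hedged sketch of a stage-wise schedule. `stages` maps a stage
    name to the submodules left trainable in that stage; each stage
    would run its own optimizer loop (elided here)."""
    for name, trainable in stages.items():
        set_trainable(model, False)            # freeze everything...
        for sub in trainable:                  # ...then unfreeze this stage's parts
            set_trainable(getattr(model, sub), True)
        print(f"{name}: training {trainable}")
        # ... run this stage's optimizer loop here ...

# Example schedule mirroring the three stages described above:
# run_curriculum(model, {
#     "stage1": ["speech_encoder", "shared_encoder"],     # self-supervised pretraining
#     "stage2": ["shared_encoder"],                       # plus an RNN-T decoder head
#     "stage3": ["speech_encoder", "text_encoder",
#                "shared_encoder", "feature_decoder"],    # joint training on all data
# })
```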
Experimental Setting
Experiments span a broad array of languages: the model is trained on datasets covering more than 100 languages, drawn from both public corpora and proprietary sources, demonstrating improvements in TTS quality and showcasing the model's robustness and versatility.
Results
Evaluation shows promising outcomes, most notably intelligible speech generated from untranscribed data alone in more than 30 languages. When a small amount of supervised data (around 15 minutes of transcribed found data) is added, intelligibility and naturalness scores closely match those of ground-truth recordings in several languages. This illustrates the model's capacity to significantly narrow the gap between high-resource and low-resource languages in TTS.
Conclusion
This paper introduces an approach to multilingual TTS that dramatically increases language coverage without relying on extensively curated datasets. By harnessing unsupervised learning alongside a joint speech-text model, the framework enables high-quality speech generation across a vast array of languages. The implications for global communication and access to information are substantial, pointing toward more inclusive and equitable speech technology worldwide. Future work may explore further optimizations and applications, building on this step forward in TTS research.