- The paper presents an audio-conditioned model that generates high-quality phonemic and prosodic labels from unlabeled speech data.
- It employs a Transformer-based architecture with pseudo data augmentation to overcome challenges from limited labeled data.
- Experiments demonstrate enhanced TTS performance with lower error rates and improved naturalness in speech synthesis.
Audio-conditioned Phonemic and Prosodic Annotation for Building Text-to-Speech Models from Unlabeled Speech Data
The paper "Audio-conditioned Phonemic and Prosodic Annotation for Building Text-to-Speech Models from Unlabeled Speech Data" presents a methodology for building text-to-speech (TTS) datasets from unlabeled speech. Its centerpiece is an audio-conditioned annotation model that predicts phonemic and prosodic labels directly from speech samples. This addresses a central challenge in TTS model training: obtaining accurately labeled speech data.
Methodology
The authors introduce an annotation model fine-tuned from a pre-trained automatic speech recognition (ASR) model on a limited amount of manually labeled data, enabling it to predict phonemic and prosodic labels. The shortage of label-speech paired data is mitigated by an augmentation method built on an auxiliary TTS model: trained on the existing labeled data, this model synthesizes pseudo label-speech pairs from text-only corpora.
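The augmentation pipeline above can be sketched as follows. This is a toy illustration under assumed interfaces: `train_tts` and `synthesize` are hypothetical stand-ins for the paper's actual TTS training and synthesis routines, reduced here to trivial memorization so the flow of data is visible.

```python
def train_tts(labeled_pairs):
    # Hypothetical stand-in: "train" an auxiliary TTS model by
    # memorizing the label -> speech mapping from the labeled set.
    return dict(labeled_pairs)

def synthesize(tts_model, labels):
    # Hypothetical stand-in: produce (pseudo) speech for a label sequence.
    return tts_model.get(labels, "synthetic_" + labels)

def build_augmented_training_set(labeled_pairs, text_only_corpus):
    """Combine real label-speech pairs with pseudo pairs synthesized
    from text-only data, as in the paper's augmentation method.

    labeled_pairs: list of (label_sequence, speech) tuples.
    text_only_corpus: label sequences without matching audio.
    """
    # 1. Train the auxiliary TTS model on the small labeled set.
    aux_tts = train_tts(labeled_pairs)
    # 2. Synthesize speech for text-only label sequences,
    #    yielding pseudo label-speech pairs.
    pseudo_pairs = [(labels, synthesize(aux_tts, labels))
                    for labels in text_only_corpus]
    # 3. The annotation model is then fine-tuned on real + pseudo pairs.
    return labeled_pairs + pseudo_pairs
```

In the real system both helpers would be neural models; the point of the sketch is the data flow, where text-only corpora are converted into paired training material for the annotator.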
Fundamentally, the annotation model is built on the Transformer architecture, noted for its efficacy in sequence-to-sequence learning. The model takes a raw speech sequence as input and generates the corresponding TTS labels auto-regressively. Applied to unlabeled speech, it produces the high-quality labeled datasets needed to train robust TTS models.
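A minimal sketch of such an auto-regressive labeling loop, with `score_next` as a hypothetical stand-in for the Transformer decoder (here it simply echoes pre-computed features, which is not how the real model works, but it shows the token-by-token generation pattern):

```python
def score_next(audio_features, prefix):
    # Hypothetical stand-in for the Transformer decoder: in the real
    # model this would attend over the audio and the label prefix.
    if len(prefix) < len(audio_features):
        return audio_features[len(prefix)]
    return "<eos>"

def annotate(audio_features, max_len=100):
    """Greedy auto-regressive decoding: emit one TTS label at a time,
    each conditioned on the audio and the labels generated so far."""
    labels = []
    for _ in range(max_len):
        token = score_next(audio_features, labels)
        if token == "<eos>":
            break
        labels.append(token)
    return labels
```

The `max_len` cap and the end-of-sequence token are standard guards in auto-regressive decoding, ensuring the loop terminates even if the model never emits `<eos>` on its own.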
Experimental Framework
The paper delineates a multi-faceted experimental setup for evaluating the model. The experiments compare the proposed method against baseline models under varying data conditions, including scenarios with limited labeled data.
- Datasets: Two principal datasets were employed—JSUT, containing single-speaker Japanese speech, and a proprietary LARGE dataset comprising multi-speaker Japanese speech samples. The JSUT dataset was used to test the performance of models trained on smaller datasets, while the LARGE dataset provided a more extensive training base.
- Metrics: The efficacy of the annotation model was evaluated using character error rate (CER) for phonemic labels and the F1 score for prosodic labels.
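Both metrics are standard; since the paper's exact scoring scripts are not given, the following is an illustrative, self-contained implementation. CER is the Levenshtein edit distance between reference and predicted label strings, normalized by reference length; F1 here is computed over sets of predicted prosodic label positions (an assumed formulation for the sketch).

```python
def cer(reference, hypothesis):
    """Character error rate: edit distance / reference length."""
    m, n = len(reference), len(hypothesis)
    if m == 0:
        return 0.0 if n == 0 else float("inf")
    # Single-row dynamic-programming Levenshtein distance.
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,       # deletion
                        dp[j - 1] + 1,   # insertion
                        prev + (reference[i - 1] != hypothesis[j - 1]))
            prev = cur
    return dp[n] / m

def f1(reference_set, predicted_set):
    """F1 over predicted label positions (assumed formulation)."""
    tp = len(reference_set & predicted_set)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted_set)
    recall = tp / len(reference_set)
    return 2 * precision * recall / (precision + recall)
```

For example, a single substitution in a three-character phoneme string gives a CER of 1/3, and a predictor that recovers one of two reference prosodic marks while emitting one spurious mark scores F1 = 0.5.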
The proposed model demonstrated substantial improvements over baseline models. For instance, the model trained on the LARGE dataset achieved a CER of 0.54% and a prosodic F1 score of 98.84%, outperforming baseline models even when ground truth grapheme sequences were used (CER of 2.53%, prosodic F1 of 73.43%).
TTS Model Evaluation
The scope of the research extends to examining the impact of the annotated data on TTS model performance. Employing the Period VITS TTS architecture, subjective listening tests were conducted to gauge the naturalness of the generated speech. The mean opinion score (MOS) collected from native Japanese speakers indicated that the TTS models trained using data from the proposed annotation model achieved MOS scores comparable to or exceeding those trained on fully labeled datasets.
For instance, on the JSUT dataset, the model trained with the proposed method's annotations achieved a MOS of 4.11, closely matching the MOS of 4.15 for the model trained on manually annotated data. This demonstrates the robustness and effectiveness of the proposed method in real-world applications of TTS systems.
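A MOS is simply the mean of listener ratings. A small sketch, assuming ratings are collected as a flat list of scores; the 1.96 factor gives a normal-approximation 95% confidence interval, a common reporting convention rather than something specified in the paper:

```python
import statistics

def mos(scores):
    """Mean opinion score with a normal-approximation 95% CI half-width."""
    mean = statistics.mean(scores)
    half_width = 1.96 * statistics.stdev(scores) / len(scores) ** 0.5
    return mean, half_width
```

Reporting the interval alongside the mean matters when comparing systems: MOS values as close as 4.11 and 4.15 are typically distinguishable only if their confidence intervals do not overlap.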
Practical and Theoretical Implications
The practical implications of this research are profound. By enabling efficient utilization of unlabeled speech data for TTS model training, the proposed method significantly reduces the dependency on extensive manual labeling, which is labor-intensive and costly. This advancement has the potential to democratize TTS technology, making it accessible for languages and dialects with limited labeled datasets.
Theoretically, the research enriches the understanding of integrating ASR models with TTS systems, especially in the domain of prosodic annotation. The success of the data augmentation strategy using an auxiliary TTS model also offers a novel approach for generating training data in other machine learning contexts where labeled data is scarce.
Future Directions
Looking forward, the extension of this approach to more complex speech samples, including those with emotional variations and distinct dialects, presents an intriguing area for future research. Moreover, the scalability of this annotation model across different languages and its performance in real-time applications warrant further investigation.
In summary, this paper articulates a sophisticated approach to annotating unlabeled speech data for TTS applications, showcasing both the methodological innovation and practical utility in advancing TTS technology.