- The paper presents an audio-conditioned model that generates high-quality phonemic and prosodic labels from unlabeled speech data.
- It employs a Transformer-based architecture with pseudo data augmentation to overcome challenges from limited labeled data.
- Experiments demonstrate enhanced TTS performance with lower error rates and improved naturalness in speech synthesis.
Audio-conditioned Phonemic and Prosodic Annotation for Building Text-to-Speech Models from Unlabeled Speech Data
The paper "Audio-conditioned Phonemic and Prosodic Annotation for Building Text-to-Speech Models from Unlabeled Speech Data" presents a methodology for building text-to-speech (TTS) datasets from unlabeled speech. Its centerpiece is an audio-conditioned annotation model that predicts phonemic and prosodic labels directly from speech samples. This addresses a central challenge in TTS model training: obtaining accurately labeled speech data.
Methodology
The authors introduce an annotation model fine-tuned from a pre-trained automatic speech recognition (ASR) model on a limited amount of manually labeled data, enabling it to predict phonemic and prosodic labels. The shortage of label-speech paired data is mitigated by an augmentation method built on an auxiliary TTS model: trained on the existing labeled data, this model synthesizes pseudo label-speech pairs from text-only corpora.
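The augmentation pipeline above can be sketched as follows. This is a toy illustration under assumed interfaces: `train_tts` and `synthesize` are hypothetical stand-ins for the paper's actual TTS training and synthesis routines, reduced here to trivial memorization so the flow of data is visible.

```python
def train_tts(labeled_pairs):
    # Hypothetical stand-in: "train" an auxiliary TTS model by
    # memorizing the label -> speech mapping from the labeled set.
    return dict(labeled_pairs)

def synthesize(tts_model, labels):
    # Hypothetical stand-in: produce (pseudo) speech for a label sequence.
    return tts_model.get(labels, "synthetic_" + labels)

def build_augmented_training_set(labeled_pairs, text_only_corpus):
    """Combine real label-speech pairs with pseudo pairs synthesized
    from text-only data, as in the paper's augmentation method.

    labeled_pairs: list of (label_sequence, speech) tuples.
    text_only_corpus: label sequences without matching audio.
    """
    # 1. Train the auxiliary TTS model on the small labeled set.
    aux_tts = train_tts(labeled_pairs)
    # 2. Synthesize speech for text-only label sequences,
    #    yielding pseudo label-speech pairs.
    pseudo_pairs = [(labels, synthesize(aux_tts, labels))
                    for labels in text_only_corpus]
    # 3. The annotation model is then fine-tuned on real + pseudo pairs.
    return labeled_pairs + pseudo_pairs
```

In the real system both helpers would be neural models; the point of the sketch is the data flow, where text-only corpora are converted into paired training material for the annotator.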
Fundamentally, the annotation model is built on the Transformer architecture, noted for its efficacy in sequence-to-sequence learning. The model takes a raw speech sequence as input and generates the corresponding TTS labels auto-regressively. Applied to unlabeled speech, it produces the high-quality labeled datasets needed to train robust TTS models.
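A minimal sketch of such an auto-regressive labeling loop, with `score_next` as a hypothetical stand-in for the Transformer decoder (here it simply echoes pre-computed features, which is not how the real model works, but it shows the token-by-token generation pattern):

```python
def score_next(audio_features, prefix):
    # Hypothetical stand-in for the Transformer decoder: in the real
    # model this would attend over the audio and the label prefix.
    if len(prefix) < len(audio_features):
        return audio_features[len(prefix)]
    return "<eos>"

def annotate(audio_features, max_len=100):
    """Greedy auto-regressive decoding: emit one TTS label at a time,
    each conditioned on the audio and the labels generated so far."""
    labels = []
    for _ in range(max_len):
        token = score_next(audio_features, labels)
        if token == "<eos>":
            break
        labels.append(token)
    return labels
```

The `max_len` cap and the end-of-sequence token are standard guards in auto-regressive decoding, ensuring the loop terminates even if the model never emits `<eos>` on its own.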
Experimental Framework
The paper delineates a multi-faceted experimental setup for evaluating the model. The experiments compare the proposed method against baseline models under varying data conditions, including scenarios with limited labeled data.
- Datasets: Two principal datasets were employed—JSUT, containing single-speaker Japanese speech, and a proprietary LARGE dataset comprising multi-speaker Japanese speech samples. The JSUT dataset was used to test the performance of models trained on smaller datasets, while the LARGE dataset provided a more extensive training base.
- Metrics: The efficacy of the annotation model was evaluated using character error rate (CER) for phonemic labels and the F1 score for prosodic labels.
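Both metrics are standard; since the paper's exact scoring scripts are not given, the following is an illustrative, self-contained implementation. CER is the Levenshtein edit distance between reference and predicted label strings, normalized by reference length; F1 here is computed over sets of predicted prosodic label positions (an assumed formulation for the sketch).

```python
def cer(reference, hypothesis):
    """Character error rate: edit distance / reference length."""
    m, n = len(reference), len(hypothesis)
    if m == 0:
        return 0.0 if n == 0 else float("inf")
    # Single-row dynamic-programming Levenshtein distance.
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,       # deletion
                        dp[j - 1] + 1,   # insertion
                        prev + (reference[i - 1] != hypothesis[j - 1]))
            prev = cur
    return dp[n] / m

def f1(reference_set, predicted_set):
    """F1 over predicted label positions (assumed formulation)."""
    tp = len(reference_set & predicted_set)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted_set)
    recall = tp / len(reference_set)
    return 2 * precision * recall / (precision + recall)
```

For example, a single substitution in a three-character phoneme string gives a CER of 1/3, and a predictor that recovers one of two reference prosodic marks while emitting one spurious mark scores F1 = 0.5.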
The proposed model demonstrated substantial improvements over baseline models. For instance, the model trained on the LARGE dataset achieved a CER of 0.54% and a prosodic F1 score of 98.84%, outperforming baseline models even when ground truth grapheme sequences were used (CER of 2.53%, prosodic F1 of 73.43%).
TTS Model Evaluation
The scope of the research extends to examining the impact of the annotated data on TTS model performance. Employing the Period VITS TTS architecture, subjective listening tests were conducted to gauge the naturalness of the generated speech. The mean opinion score (MOS) collected from native Japanese speakers indicated that the TTS models trained using data from the proposed annotation model achieved MOS scores comparable to or exceeding those trained on fully labeled datasets.
For instance, on the JSUT dataset, the model trained with the proposed method's annotations achieved a MOS of 4.11, closely matching the MOS of 4.15 for the model trained on manually annotated data. This demonstrates the robustness and effectiveness of the proposed method in real-world applications of TTS systems.
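A MOS is simply the mean of listener ratings. A small sketch, assuming ratings are collected as a flat list of scores; the 1.96 factor gives a normal-approximation 95% confidence interval, a common reporting convention rather than something specified in the paper:

```python
import statistics

def mos(scores):
    """Mean opinion score with a normal-approximation 95% CI half-width."""
    mean = statistics.mean(scores)
    half_width = 1.96 * statistics.stdev(scores) / len(scores) ** 0.5
    return mean, half_width
```

Reporting the interval alongside the mean matters when comparing systems: MOS values as close as 4.11 and 4.15 are typically distinguishable only if their confidence intervals do not overlap.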
Practical and Theoretical Implications
The practical implications of this research are profound. By enabling efficient utilization of unlabeled speech data for TTS model training, the proposed method significantly reduces the dependency on extensive manual labeling, which is labor-intensive and costly. This advancement has the potential to democratize TTS technology, making it accessible for languages and dialects with limited labeled datasets.
Theoretically, the research enriches the understanding of integrating ASR models with TTS systems, especially in the domain of prosodic annotation. The success of the data augmentation strategy using an auxiliary TTS model also offers a novel approach for generating training data in other machine learning contexts where labeled data is scarce.
Future Directions
Looking forward, the extension of this approach to more complex speech samples, including those with emotional variations and distinct dialects, presents an intriguing area for future research. Moreover, the scalability of this annotation model across different languages and its performance in real-time applications warrant further investigation.
In summary, this paper articulates a sophisticated approach to annotating unlabeled speech data for TTS applications, showcasing both the methodological innovation and practical utility in advancing TTS technology.