
Utilizing TTS Synthesized Data for Efficient Development of Keyword Spotting Model (2407.18879v1)

Published 26 Jul 2024 in cs.SD, cs.LG, and eess.AS

Abstract: This paper explores the use of TTS synthesized training data for the KWS (keyword spotting) task while minimizing development cost and time. Keyword spotting models require a huge amount of training data to be accurate, and obtaining such training data can be costly. In the current state of the art, TTS models can generate large amounts of natural-sounding data, which can help reduce the cost and time of KWS model development. Still, TTS-generated data can lack diversity compared to real data. To maximize KWS model accuracy under the constraints of limited resources and current TTS capability, we explored various strategies for mixing TTS data and real human speech data, with a focus on minimizing real data use and maximizing the diversity of TTS output. Our experimental results indicate that relatively small amounts of real audio data with speaker diversity (100 speakers, 2k utterances) and large amounts of TTS synthesized data can achieve reasonably high accuracy (within 3x the error rate of the baseline, which was trained with 3.8M real positive utterances).

Citations (1)

Summary

  • The paper demonstrates that integrating TTS data with as few as 2k real utterances significantly reduces error rates in keyword spotting models.
  • It utilizes tailored text generation and advanced TTS systems like Virtuoso and AudioLM to enhance data diversity and model robustness.
  • The optimal mixing strategy achieved an FRR of 9.94%, offering a scalable, cost-efficient alternative to large-scale real-data collection.

Utilizing TTS Synthesized Data for Efficient Development of Keyword Spotting Model

The paper, "Utilizing TTS Synthesized Data for Efficient Development of Keyword Spotting Model," addresses the critical issue of data scarcity in the development of keyword spotting (KWS) models. The principal challenge in KWS model development is the acquisition of extensive, high-quality labeled data that encapsulates the diversity of pronunciations and acoustic environments necessary for robust performance. Traditional methods of gathering such data are both time-intensive and costly. Recent advancements in Text-to-Speech (TTS) technologies offer a potential solution by generating synthetic speech data that can supplement real data in model training.

Methodology

The authors propose a framework leveraging TTS synthesized data to train KWS models with reduced dependence on real data. The overarching goal is to enhance model accuracy while minimizing resource expenditure. The approach hinges on three primary strategies:

  1. Text Generation Tailored for TTS: The researchers developed a text generator aimed at producing diverse and relevant text phrases for KWS training. Text generation is optimized to enhance the variability of the synthesized speech output by incorporating various prosody control symbols.
  2. Advanced TTS Models Utilization: The paper employs state-of-the-art TTS systems—Virtuoso and an AudioLM-based TTS model. Virtuoso is a multilingual model capable of producing speech across numerous languages and speaker profiles, while the AudioLM-based model excels in retaining the speaker's characteristics and prosodic features from input audio samples.
  3. Data Mixing Strategies: The experimentation involved various strategies for mixing TTS-generated data with a minimal amount of real human speech, aiming to find an optimal balance that ensures high model accuracy at a lower data cost.
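
The third strategy above can be illustrated with a minimal sketch. The paper does not publish its sampling code; the function below, its name `build_training_mix`, and the `real_fraction` parameter are illustrative assumptions, showing one simple way to combine a small real-speech pool with a large TTS pool per training epoch:

```python
import random

def build_training_mix(real_utts, tts_utts, real_fraction=0.1,
                       total=10_000, seed=0):
    """Sample a training set mixing a small pool of real utterances
    with a large pool of TTS-synthesized ones.

    real_fraction controls the share of real audio in the mix. Real
    utterances are drawn with replacement because the real pool is
    deliberately small (e.g. 2k utterances from 100 speakers); TTS
    utterances are drawn without replacement from a large pool.
    NOTE: hypothetical sketch, not the authors' actual pipeline.
    """
    rng = random.Random(seed)
    n_real = int(total * real_fraction)
    n_tts = total - n_real
    mix = [rng.choice(real_utts) for _ in range(n_real)]
    mix += rng.sample(tts_utts, n_tts)
    rng.shuffle(mix)
    return mix
```

In a real training setup the same idea is usually expressed as per-source sampling weights in the data loader rather than a materialized list, but the ratio being tuned is the same.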

Experimental Setup

The researchers conducted a series of experiments to evaluate the impact of different data mixing strategies on the KWS model's performance. The models were tested using a combination of real and synthesized data, with the synthesized data generated by the Virtuoso and AudioLM systems. The experiments varied the quantity of real positive utterances, the number of speakers, and the amount of data per speaker.

Results

The results indicate that mixing a relatively small amount of real data with large volumes of TTS-generated data can yield high accuracy in KWS models. Specifically, a model trained with a combination of 2k real positive utterances from 100 speakers and extensive TTS data achieved performance within three times the error rate of a baseline model trained with 3.8 million real positive utterances. Key numerical outcomes are summarized as follows:

  • Baseline Model Performance: Real data only baseline achieved an FRR of 3.17%.
  • TTS Data Only: A model trained exclusively on TTS data showed an FRR of 46.47%.
  • Optimal Mixing: Integrating real data with TTS data reduced the FRR to 9.94% when using as few as 2k real utterances from 100 speakers.
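
The FRR (false reject rate) figures above measure how often the model fails to fire on an utterance that actually contains the keyword. A minimal sketch of the metric, assuming score-thresholded detection (the function name and threshold convention here are illustrative, not taken from the paper):

```python
def false_reject_rate(scores, labels, threshold):
    """FRR: fraction of positive (keyword-containing) utterances whose
    detection score falls below the acceptance threshold.

    scores: per-utterance detection scores in [0, 1]
    labels: 1 if the utterance contains the keyword, else 0
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    rejected = sum(1 for s in positives if s < threshold)
    return rejected / len(positives)
```

In practice FRR is reported at an operating point fixed by a target false accept rate on negative data, so comparisons like 3.17% vs. 9.94% are made at a matched false-accept level.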

Implications

The implications of this research are multifaceted:

  • Cost Efficiency: The ability to train effective KWS models with significantly less real data can lead to substantial cost savings in data collection and annotation processes.
  • Data Diversity: By leveraging TTS-generated data with prosody variations and multi-speaker capabilities, models can better handle diverse real-world scenarios, enhancing their robustness and reliability.
  • Future Development: This approach can be further refined by continuously improving TTS models, making synthesized data even more representative of real-world distributions.

In conclusion, the paper highlights how the innovative use of TTS-generated data can revolutionize the development process for keyword spotting models, offering a scalable and cost-effective alternative to traditional data collection methods. Future research could expand on this foundation by exploring more advanced TTS models and further optimizing the training data mixing strategies. The implications extend beyond KWS to broader ASR and NLP domains, potentially transforming how synthetic data is integrated into machine learning workflows.
