- The paper introduces Hard-Synth, a novel ASR augmentation method that integrates zero-shot TTS with LLM-based text rewriting to produce diverse hard samples.
- It employs weak ASR models for hard prompt selection and a rigorous filtering process to ensure high-quality synthetic speech.
- Experimental results on LibriSpeech demonstrate significant WER reductions and potential benefits for bias mitigation and low-resource applications.
Overview of Hard-Synth: Synthesizing Diverse Hard Samples for ASR using Zero-Shot TTS and LLM
The paper "Hard-Synth: Synthesizing Diverse Hard Samples for ASR using Zero-Shot TTS and LLM" presents a sophisticated approach to augment automatic speech recognition (ASR) data leveraging advanced text-to-speech (TTS) and LLMs. The primary focus of this paper is the novel ASR data augmentation technique, Hard-Synth, aimed at addressing challenges associated with the training of ASR models by generating more challenging and diverse speech samples.
Key Innovations and Methodologies
The Hard-Synth methodology makes notable contributions to the field in several ways:
- Integration of TTS and LLMs: Whereas conventional TTS-based augmentation for ASR typically relies on additional text corpora, Hard-Synth combines zero-shot TTS with LLMs: the LLMs rewrite and diversify the transcripts already present in the ASR training set, adding semantic variation without requiring additional data sources (a prompt-rewriting sketch follows this list).
- Hard Prompt Selection: A defining characteristic of Hard-Synth is how it selects audio prompts for zero-shot TTS. A weak ASR model is used to identify training utterances that are difficult to recognize, referred to as "hard prompts," so that the audio synthesized from them reproduces acoustically challenging conditions for ASR systems (see the selection sketch after this list).
- Data Filtering and Efficiency: The method includes a filtering step to control the quality of the synthetic speech: an ASR model transcribes each synthetic utterance, and samples whose transcription deviates from the target text beyond a set threshold are discarded (see the filtering sketch after this list). Hard-Synth is also efficient, achieving significant ASR improvements with only 16.15 hours of synthetic speech data, demonstrating cost-effective scalability.
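The paper's exact rewriting prompt is not reproduced here; below is a minimal Python sketch of how LLM-based transcript rewriting could be wired up. The `generate` callable stands in for any chat-completion API, and the instruction text is an assumed, illustrative wording rather than the authors' prompt.

```python
# Sketch of LLM-based transcript rewriting (illustrative; not the paper's prompt).
from typing import Callable, List

# Assumed instruction wording -- the actual Hard-Synth prompt may differ.
REWRITE_INSTRUCTION = (
    "Rewrite the following ASR transcript so that it keeps a natural, spoken style "
    "but varies the wording, sentence structure, and named entities. "
    "Return only the rewritten sentence."
)

def rewrite_transcripts(transcripts: List[str], generate: Callable[[str], str]) -> List[str]:
    """Produce semantically varied text for TTS synthesis from existing transcripts."""
    rewritten = []
    for text in transcripts:
        prompt = f"{REWRITE_INSTRUCTION}\n\nTranscript: {text}"
        rewritten.append(generate(prompt).strip())
    return rewritten
```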
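Hard prompt selection can be read as ranking training utterances by the weak ASR model's per-utterance word error rate and keeping the hardest ones as zero-shot TTS audio prompts. The sketch below assumes the weak model's hypotheses are already available; the `Utterance` container, the use of `jiwer` for scoring, and the top-k cutoff are illustrative choices, not the paper's exact procedure.

```python
# Sketch of hard-prompt selection from a weak ASR model's hypotheses (assumed setup).
from dataclasses import dataclass
from typing import List

import jiwer  # standard WER implementation, used here only for scoring

@dataclass
class Utterance:
    audio_path: str
    reference: str   # ground-truth transcript
    hypothesis: str  # transcript produced by the weak ASR model

def select_hard_prompts(utts: List[Utterance], top_k: int) -> List[Utterance]:
    """Rank utterances by the weak model's WER and keep the hardest as audio prompts."""
    scored = [(jiwer.wer(u.reference, u.hypothesis), u) for u in utts]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # highest WER first
    return [u for _, u in scored[:top_k]]
```

An equally plausible variant keeps every utterance whose WER exceeds a fixed threshold instead of a fixed top-k; either way, the selected clips serve as the voice and style prompts for the zero-shot TTS model.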
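The filtering step can similarly be sketched as re-transcribing each synthetic clip and discarding it when the transcript drifts too far from the text it was synthesized from. The `transcribe` callable is a placeholder for any ASR decode call, and the 0.1 WER threshold is an assumed value rather than the paper's setting.

```python
# Sketch of the synthetic-speech quality filter (threshold and decoder are placeholders).
from typing import Callable, List, Tuple

import jiwer

def filter_synthetic(
    samples: List[Tuple[str, str]],       # (synthetic_audio_path, target_text) pairs
    transcribe: Callable[[str], str],     # ASR decoder: audio path -> hypothesis text
    max_wer: float = 0.1,                 # assumed threshold, not the paper's value
) -> List[Tuple[str, str]]:
    """Keep only synthetic clips whose ASR transcript stays close to the target text."""
    kept = []
    for audio_path, target_text in samples:
        if jiwer.wer(target_text, transcribe(audio_path)) <= max_wer:
            kept.append((audio_path, target_text))
    return kept
```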
Experimental Results
Experiments conducted using the LibriSpeech dataset reveal the potential of Hard-Synth:
- For the Conformer model, the method achieves relative word error rate (WER) reductions of 6.5% and 4.4% on the dev-other and test-other subsets, respectively; these reductions are relative to the baseline WER, as illustrated after this list.
- The experiments show that pairing zero-shot TTS with LLM-rewritten text substantially improves performance, with the largest gains on the more challenging "other" subsets rather than on the cleaner ones.
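For readers unfamiliar with the metric, these reductions are measured relative to the baseline WER rather than in absolute percentage points; a quick illustration with made-up baseline numbers:

```python
# Relative WER reduction (illustrative numbers, not the paper's actual WERs).
baseline_wer, new_wer = 0.0800, 0.0748                       # e.g. 8.00% -> 7.48% WER
relative_reduction = (baseline_wer - new_wer) / baseline_wer
print(f"{relative_reduction:.1%}")                           # -> 6.5%
```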
Implications and Future Directions
The practical implications of Hard-Synth are manifold:
- Bias Mitigation: The paper reports reductions in ASR biases such as gender-based performance gaps and variance across speakers, suggesting potential for building more equitable ASR systems across demographic groups.
- Application to Low-Resource Scenarios: Given its efficiency, the method holds promise for low-resource ASR tasks and applications, including minority languages and impaired speech, where collecting training data is traditionally difficult.
Looking forward, the work opens avenues for further exploration, particularly in refining TTS models to more faithfully reproduce the complex acoustic conditions found in hard samples. Extending the methodology to adapt pre-trained ASR models to specific domains is another promising research direction.
In summary, Hard-Synth advances ASR data augmentation by combining zero-shot TTS and LLMs to generate challenging, diverse speech samples that significantly boost ASR model performance while remaining efficient in both computation and data.