From Tens of Hours to Tens of Thousands: Scaling Back-Translation for Speech Recognition (2505.16972v1)

Published 22 May 2025 in cs.CL, cs.SD, and eess.AS

Abstract: Recent advances in Automatic Speech Recognition (ASR) have been largely fueled by massive speech corpora. However, extending coverage to diverse languages with limited resources remains a formidable challenge. This paper introduces Speech Back-Translation, a scalable pipeline that improves multilingual ASR models by converting large-scale text corpora into synthetic speech via off-the-shelf text-to-speech (TTS) models. We demonstrate that just tens of hours of real transcribed speech can effectively train TTS models to generate synthetic speech at hundreds of times the original volume while maintaining high quality. To evaluate synthetic speech quality, we develop an intelligibility-based assessment framework and establish clear thresholds for when synthetic data benefits ASR training. Using Speech Back-Translation, we generate more than 500,000 hours of synthetic speech in ten languages and continue pre-training Whisper-large-v3, achieving average transcription error reductions of over 30%. These results highlight the scalability and effectiveness of Speech Back-Translation for enhancing multilingual ASR systems.

Authors (4)
  1. Tianduo Wang (5 papers)
  2. Lu Xu (68 papers)
  3. Wei Lu (325 papers)
  4. Shanbo Cheng (23 papers)

Summary

Scaling Back-Translation for Speech Recognition

The paper "From Tens of Hours to Tens of Thousands: Scaling Back-Translation for Speech Recognition" introduces an innovative approach to enhancing Automatic Speech Recognition (ASR) systems, especially for multilingual setups constrained by limited resources. The authors propose utilizing Speech Back-Translation, which leverages Text-to-Speech (TTS) models to generate synthetic speech from large-scale text corpora, thereby addressing the data scarcity issue prevalent in low-resource languages.

Key Contributions

  1. Efficient Training with Limited Data: The paper demonstrates that only tens of hours of real transcribed speech are sufficient to train TTS models capable of generating synthetic speech at a scale several hundred times larger. This synthetic data can substantially improve the performance of ASR systems across multiple languages.
  2. Intelligibility-Based Evaluation: The authors introduce a novel metric, Normalized Intelligibility, to evaluate the quality of synthetic speech. This metric provides a principled, quantitative basis for determining when synthetic data can reliably benefit ASR training; a hedged sketch of one possible formulation follows this list.
  3. Large-Scale Synthetic Data Generation: The research scales synthetic speech generation to more than 500,000 hours across ten languages. This data is used to continue pre-training Whisper-large-v3, yielding average transcription error reductions of over 30% and showcasing the method's efficacy and scalability.
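
The paper's exact definition of Normalized Intelligibility is not reproduced here; the sketch below shows one plausible formulation, assuming intelligibility is measured as 1 minus the word error rate (WER) that a reference ASR model achieves when transcribing the audio, normalized by the same model's score on real speech in the same language so values are comparable across languages. `jiwer` is a real WER library; the normalization scheme itself is an assumption.

```python
import jiwer  # standard WER library (pip install jiwer)

def intelligibility(hypotheses, references):
    """Intelligibility as 1 - WER of a reference ASR model on the audio."""
    return 1.0 - jiwer.wer(references, hypotheses)

def normalized_intelligibility(syn_hyp, syn_ref, real_hyp, real_ref):
    """Hypothetical formulation: synthetic-speech intelligibility divided
    by real-speech intelligibility in the same language. A score near 1.0
    means synthetic audio is transcribed about as well as real audio."""
    return intelligibility(syn_hyp, syn_ref) / intelligibility(real_hyp, real_ref)
```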

Experimental Insights

The paper evaluates the proposed method's impact on ASR performance across low-, mid-, and high-resource languages. A key experimental finding is the substantial improvement observed when synthetic data supplements real transcribed speech, particularly for less-resourced languages such as Vietnamese, Czech, and Hungarian. For low-resource languages, models trained with synthetic data showed a marked reduction in transcription error rates compared to models trained on limited real data alone.
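
As a rough illustration of what continued pre-training on synthetic (audio, transcript) pairs might look like, the sketch below uses the Hugging Face `transformers` Whisper API. It is a bare seq2seq training step, not the paper's training recipe; batching, learning-rate scheduling, and multilingual data mixing are omitted.

```python
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-large-v3")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def training_step(waveform, transcript):
    """One continued pre-training step on a synthetic (audio, text) pair.
    `waveform` is a 16 kHz mono numpy array; `transcript` is its source text."""
    inputs = processor(waveform, sampling_rate=16000, return_tensors="pt")
    labels = processor.tokenizer(transcript, return_tensors="pt").input_ids
    loss = model(input_features=inputs.input_features, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```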

Furthermore, the approach exhibits strong generalization capabilities. Models pre-trained with Speech Back-Translation demonstrate enhanced robustness on out-of-domain datasets, indicating the method's versatility beyond in-domain scenarios.

Theoretical and Practical Implications

Theoretically, the paper provides a promising avenue for reducing dependence on vast amounts of labeled training data, challenging traditional approaches that rely heavily on manual data collection. The introduction of intelligibility thresholds offers a quantitative basis for assessing when synthetic speech benefits ASR training, contributing to broader discussions on data augmentation and model evaluation standards.
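
In practice, such a threshold acts as a simple data filter. The sketch below assumes each synthetic utterance carries a precomputed normalized-intelligibility score (as in the earlier sketch); the threshold value is hypothetical, since the paper's empirically derived cutoffs are not quoted here.

```python
# Hypothetical cutoff; the paper derives its thresholds empirically.
THRESHOLD = 0.9

def filter_synthetic_data(scored_manifest, threshold=THRESHOLD):
    """Keep only synthetic utterances whose normalized intelligibility
    clears the threshold, i.e. data expected to help rather than hurt ASR."""
    return [ex for ex in scored_manifest if ex["norm_intel"] >= threshold]
```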

Practically, the research holds the potential to vastly improve ASR system accessibility across a wider array of languages, with implications for commercial applications, educational technologies, and international accessibility tools where multilingual capabilities are paramount.

Future Directions

Future enhancements could include extending the methodology to support even lower-resource languages by improving TTS model architectures and exploring additional speech synthesis techniques. Integration with other data augmentation strategies, such as using generative adversarial networks or combining synthetic data with multimodal learning approaches, could further amplify the benefits demonstrated in this paper. Moreover, exploring the application of Speech Back-Translation to speech translation tasks presents an intriguing avenue for subsequent research.

In conclusion, the paper presents a well-founded strategy to tackle data scarcity in multilingual ASR through the innovative use of TTS and synthetic data, paving the way for further advancements in speech recognition technology.
