
Self-training and Pre-training are Complementary for Speech Recognition (2010.11430v1)

Published 22 Oct 2020 in cs.LG, cs.SD, and eess.AS

Abstract: Self-training and unsupervised pre-training have emerged as effective approaches to improve speech recognition systems using unlabeled data. However, it is not clear whether they learn similar patterns or if they can be effectively combined. In this paper, we show that pseudo-labeling and pre-training with wav2vec 2.0 are complementary in a variety of labeled data setups. Using just 10 minutes of labeled data from Libri-light as well as 53k hours of unlabeled data from LibriVox achieves WERs of 3.0%/5.2% on the clean and other test sets of Librispeech - rivaling the best published systems trained on 960 hours of labeled data only a year ago. Training on all labeled data of Librispeech achieves WERs of 1.5%/3.1%.

Authors (8)
  1. Qiantong Xu (26 papers)
  2. Alexei Baevski (39 papers)
  3. Tatiana Likhomanenko (41 papers)
  4. Paden Tomasello (17 papers)
  5. Alexis Conneau (33 papers)
  6. Ronan Collobert (55 papers)
  7. Gabriel Synnaeve (97 papers)
  8. Michael Auli (73 papers)
Citations (167)

Summary

Self-Training and Pre-Training Synergy in Speech Recognition

The paper "Self-training and Pre-training are Complementary for Speech Recognition" investigates combining self-training and unsupervised pre-training to improve speech recognition systems. Its central claim is that pseudo-labeling and pre-training with wav2vec 2.0 are complementary across setups with varying amounts of labeled and unlabeled data.

Key Contributions

The authors demonstrate that integrating self-training and pre-training substantially improves speech recognition accuracy, yielding significant reductions in Word Error Rate (WER) across several labeled-data setups. Notably, the combination rivals state-of-the-art systems even when only minimal labeled data is available.
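For reference, WER is the metric used throughout the paper; it counts the word-level substitutions, deletions, and insertions needed to align a hypothesis transcript with the reference:

```latex
\mathrm{WER} = \frac{S + D + I}{N}
```

Here S, D, and I are the numbers of substituted, deleted, and inserted words, and N is the number of words in the reference transcript. The reported figures such as 3.0%/5.2% are WERs on the Librispeech clean and other test sets.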

Experimental Results

The experiments conducted used both low-resource and high-resource labeled data setups.

  • Low-Resource Scenarios: With only 10 minutes of labeled data from Libri-light and 53k hours of unlabeled data from LibriVox, the proposed method achieved WERs of 3.0%/5.2% on the Librispeech test-clean/test-other sets, a substantial improvement over pre-training alone and over previous pseudo-labeling methods.
  • High-Resource Scenarios: With the full 960 hours of labeled Librispeech data, combining the two methods yielded further gains, reaching WERs of 1.5%/3.1% and demonstrating that the approach remains effective when substantial labeled data is available.

Methodological Approach

The paper employs wav2vec 2.0 for unsupervised pre-training, which learns speech representations directly from raw audio. The pre-trained model is fine-tuned on the available labeled data and combined with a self-training approach based on pseudo-labeling: the fine-tuned acoustic model transcribes large volumes of unlabeled speech. The pseudo-labeled data is then used to train final models, either from scratch or by further fine-tuning, demonstrating the flexibility of the combined approach. A rough sketch of the pseudo-labeling step follows.
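As an illustration only (not the authors' exact pipeline), the pseudo-labeling step can be sketched with the Hugging Face wav2vec 2.0 implementation: a model already fine-tuned on the small labeled set transcribes unlabeled audio, and the resulting (audio, pseudo-transcript) pairs become additional training data for the final model. The checkpoint path and audio directory below are placeholders.

```python
# Sketch of the pseudo-labeling step, assuming a wav2vec 2.0 model already
# fine-tuned on the small labeled set (checkpoint path is a placeholder).
import glob

import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

PSEUDO_LABELER = "path/to/wav2vec2-finetuned-10min"  # hypothetical checkpoint

processor = Wav2Vec2Processor.from_pretrained(PSEUDO_LABELER)
model = Wav2Vec2ForCTC.from_pretrained(PSEUDO_LABELER).eval()

pseudo_labeled = []
for path in glob.glob("unlabeled_audio/*.flac"):  # e.g. LibriVox utterances
    waveform, sample_rate = torchaudio.load(path)
    if sample_rate != 16_000:  # wav2vec 2.0 expects 16 kHz mono audio
        waveform = torchaudio.functional.resample(waveform, sample_rate, 16_000)
    inputs = processor(waveform.squeeze(0).numpy(),
                       sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    # Greedy CTC decoding; the paper's pipeline additionally uses beam-search
    # decoding with a language model, omitted here for brevity.
    ids = torch.argmax(logits, dim=-1)
    transcript = processor.batch_decode(ids)[0]
    pseudo_labeled.append((path, transcript))

# pseudo_labeled now pairs each unlabeled utterance with a pseudo-transcript,
# ready to be mixed with the labeled data to train or fine-tune the final model.
```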

Implications

The findings have significant ramifications both practically and theoretically:

  • Practical Implications: In real-world applications, the ability to train speech recognition systems with minimal labeled data could facilitate the development of efficient models for languages with scarce resources. This enhances accessibility and scalability in deploying speech technologies globally.
  • Theoretical Implications: The paper underscores the synergistic effects of combining different learning paradigms. This complementarity highlights potential pathways for further research in multimodal learning and the integration of heterogeneous training techniques.

Speculation on Future Developments

Moving forward, the combination of these techniques could be strengthened by exploring additional distillation and optimization methods. Future research may examine language-model integration during the pseudo-labeling phase in finer detail, or refine the quantization of latent speech representations into more informative units. Extending these methods to a broader range of languages and dialects could yield further advances in global speech recognition capabilities.

In summary, the paper makes a compelling case for the complementarity of self-training and pre-training in speech recognition. Through rigorous experiments across labeled-data regimes, the findings outline a promising approach for improving accuracy and efficiency, particularly in low-resource settings. This research is a substantial contribution to the speech recognition field and opens opportunities for continued exploration in AI-driven language technologies.