Improved Noisy Student Training for Automatic Speech Recognition (2005.09629v2)

Published 19 May 2020 in eess.AS and cs.LG

Abstract: Recently, a semi-supervised learning method known as "noisy student training" has been shown to improve image classification performance of deep networks significantly. Noisy student training is an iterative self-training method that leverages augmentation to improve network performance. In this work, we adapt and improve noisy student training for automatic speech recognition, employing (adaptive) SpecAugment as the augmentation method. We find effective methods to filter, balance and augment the data generated in between self-training iterations. By doing so, we are able to obtain word error rates (WERs) 4.2%/8.6% on the clean/noisy LibriSpeech test sets by only using the clean 100h subset of LibriSpeech as the supervised set and the rest (860h) as the unlabeled set. Furthermore, we are able to achieve WERs 1.7%/3.4% on the clean/noisy LibriSpeech test sets by using the unlab-60k subset of LibriLight as the unlabeled set for LibriSpeech 960h. We are thus able to improve upon the previous state-of-the-art clean/noisy test WERs achieved on LibriSpeech 100h (4.74%/12.20%) and LibriSpeech (1.9%/4.1%).

Improved Noisy Student Training for Automatic Speech Recognition

The paper "Improved Noisy Student Training for Automatic Speech Recognition" presents a novel adaptation of the noisy student training (NST) method originally designed for image classification to enhance automatic speech recognition (ASR) systems. The authors introduce several innovations in the NST pipeline to improve its efficacy for ASR tasks, employing techniques such as adaptive SpecAugment for data augmentation, shallow fusion with LLMs, and gradational methods for filtering and augmentation.

The core contribution of the paper is the adaptation and enhancement of the NST framework for the ASR setting. NST is an iterative self-training method in which each generation of models is trained on both labeled and unlabeled data, with the model from the previous iteration serving as a teacher that generates pseudo-labels for training the next student model. A key ingredient is adaptive SpecAugment, which scales the augmentation, in particular the time masking, with the length of the input utterance. Because SpecAugment operates directly on the audio spectrogram, it encourages the model to learn representations that remain stable under such deformations, yielding more resilient ASR models.
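To make the augmentation concrete, here is a minimal sketch of SpecAugment-style frequency and time masking on a (time, frequency) log-mel spectrogram. The mask counts and widths are illustrative placeholders, and the only "adaptive" element shown, capping the time-mask width at a fraction of the utterance length, is an assumption standing in for the paper's full adaptive schedule.

```python
import numpy as np

def spec_augment(log_mel, num_freq_masks=2, max_freq_width=27,
                 num_time_masks=2, max_time_frac=0.05, rng=None):
    """Apply frequency and time masking to a (time, freq) log-mel spectrogram.

    Mask counts and widths here are illustrative; adaptive SpecAugment scales
    the time masking with utterance length rather than using fixed values.
    """
    rng = rng or np.random.default_rng()
    x = log_mel.copy()
    num_frames, num_bins = x.shape

    # Frequency masking: zero out contiguous bands of mel bins.
    for _ in range(num_freq_masks):
        width = int(rng.integers(0, max_freq_width + 1))
        start = int(rng.integers(0, max(1, num_bins - width)))
        x[:, start:start + width] = 0.0

    # Time masking: zero out contiguous spans of frames, with the maximum
    # width capped at a fraction of the utterance length (the adaptive part).
    max_time_width = max(1, int(max_time_frac * num_frames))
    for _ in range(num_time_masks):
        width = int(rng.integers(0, max_time_width + 1))
        start = int(rng.integers(0, max(1, num_frames - width)))
        x[start:start + width, :] = 0.0
    return x
```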

Applying this modified NST process, the authors achieve strong word error rates (WERs) on two ASR tasks: LibriSpeech 100-860 and LibriSpeech-LibriLight. For the LibriSpeech 100-860 task, they report WERs of 4.2% on the clean test set and 8.6% on the noisy test set, substantially outperforming previous state-of-the-art results. This improvement is achieved using only the clean 100-hour subset of LibriSpeech as labeled data and the remaining 860 hours as the unlabeled pool. Furthermore, for the LibriSpeech-LibriLight task, which adds the roughly 60,000-hour unlab-60k subset of LibriLight as the unlabeled set, the proposed approach reduces the clean test WER to 1.7% and the noisy test WER to 3.4%, setting new benchmarks for ASR performance.

The paper introduces several novel strategies to optimize the NST method for ASR:

  • Adaptive SpecAugment: Augmentation whose time-masking parameters scale with the length of the utterance, maintaining robustness across different speech patterns and noise levels.
  • Shallow Fusion: Incorporating a language model when generating pseudo-labels mitigates transcription errors, resulting in higher-quality training targets for the student.
  • Normalized Filtering Score: A scoring mechanism for selecting which pseudo-labeled utterances to retain, based on model confidence normalized with respect to transcript length; the threshold is systematically relaxed over iterations to grow the training set effectively (see the sketch after this list).
  • Gradational Methods: Progressively relaxing the filtering threshold and strengthening augmentation across generations, so that each student model gradually learns from more, and more heavily augmented, data.
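As a rough illustration of the normalized filtering score, the sketch below normalizes a pseudo-label's log-probability by mean and standard-deviation statistics fit as functions of transcript length. The linear fits, helper names, and thresholding interface are assumptions made for illustration, not the paper's exact procedure.

```python
import numpy as np

def fit_length_statistics(log_probs, lengths):
    """Fit linear models mu(l) and sigma(l) of transcript log-probability
    against token length l on held-out utterances. (Illustrative: the
    paper's precise fitting procedure may differ.)"""
    lengths = np.asarray(lengths, dtype=float)
    log_probs = np.asarray(log_probs, dtype=float)
    mu_coef = np.polyfit(lengths, log_probs, deg=1)
    resid = log_probs - np.polyval(mu_coef, lengths)
    sigma_coef = np.polyfit(lengths, np.abs(resid), deg=1)
    return mu_coef, sigma_coef

def normalized_score(log_prob, length, mu_coef, sigma_coef):
    """Length-normalized confidence score for one pseudo-labeled utterance."""
    mu = np.polyval(mu_coef, length)
    sigma = max(np.polyval(sigma_coef, length), 1e-6)
    return (log_prob - mu) / sigma

def filter_pseudo_labels(batch, mu_coef, sigma_coef, threshold):
    """Keep utterances whose normalized score exceeds the threshold."""
    return [u for u in batch
            if normalized_score(u["log_prob"], u["num_tokens"],
                                mu_coef, sigma_coef) >= threshold]
```

Lowering the threshold between NST generations retains a larger pseudo-labeled pool, which is the relaxation the gradational methods refer to.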

These strategies address the systematic challenges posed by semi-supervised learning in ASR systems. By ensuring the robustness of pseudo-labels and gradually increasing dataset size and complexity, they enable a steady improvement in ASR accuracy.
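Putting the pieces together, the overall pipeline can be summarized as a short schematic loop. All callables below (train_asr, pseudo_label, filter_and_balance) are hypothetical placeholders supplied by the caller, standing in for the paper's model training, LM-fused decoding, and filtering/balancing steps rather than an actual API.

```python
def noisy_student_training(labeled, unlabeled, train_asr, pseudo_label,
                           filter_and_balance, thresholds, augment_strengths):
    """Schematic NST loop: each generation's model teaches the next.

    Placeholder callables (supplied by the caller):
      train_asr(data, augment)              -> trained ASR model
      pseudo_label(model, unlabeled)        -> pseudo-labeled utterances
                                               (decoding with LM shallow fusion)
      filter_and_balance(pseudo, threshold) -> retained utterances
    `thresholds` and `augment_strengths` are per-generation schedules that
    relax filtering and strengthen augmentation as generations progress.
    """
    # Generation 0: train on the supervised set only.
    model = train_asr(labeled, augment=augment_strengths[0])

    for gen in range(1, len(thresholds)):
        # Teacher step: transcribe the unlabeled pool with the current model.
        pseudo = pseudo_label(model, unlabeled)

        # Keep only confident pseudo-labels, relaxing the threshold over time.
        kept = filter_and_balance(pseudo, threshold=thresholds[gen])

        # Student step: train a fresh model on labeled + retained data,
        # with (adaptive) SpecAugment applied to its inputs.
        model = train_asr(labeled + kept, augment=augment_strengths[gen])

    return model
```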

The implications of this work are significant for both theoretical and practical ASR developments. The theoretical framework extends the utility of self-training beyond its traditional domains, demonstrating that carefully crafted augmentations and iterative refinements can substantially close the gap between supervised and semi-supervised learning performance. Practically, the improved NST method provides a scalable means of leveraging large volumes of unlabeled audio data effectively, which is crucial given the vast availability of such data compared to labeled datasets.

Future directions for research might involve further refining augmentation techniques, exploring different filtering metrics, and extending this framework to multilingual or multi-dialect ASR systems. Other avenues could examine the integration of additional external knowledge sources during the pseudo-labeling phase to enhance the generalizability of the models.

In conclusion, this paper exemplifies a detailed application of self-training methodologies to real-world ASR problems, presenting quantifiable improvements that reaffirm the potential of NST in advancing ASR capabilities.

Authors (8)
  1. Daniel S. Park
  2. Yu Zhang
  3. Ye Jia
  4. Wei Han
  5. Chung-Cheng Chiu
  6. Bo Li
  7. Yonghui Wu
  8. Quoc V. Le
Citations (230)