SpecAugment: A Data Augmentation Technique for Speech Recognition
The paper "SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition" by Daniel S. Park et al. introduces a simple but effective approach for augmenting data in automatic speech recognition (ASR) systems. The technique, named SpecAugment, is applied directly to the feature inputs of a neural network, the log mel spectrogram, rather than to the raw audio waveform. It consists of three operations on the spectrogram: time warping, frequency masking, and time masking, which together significantly improve the performance of end-to-end ASR systems.
Methodology
SpecAugment stands out due to its simplicity and efficacy, particularly the three types of augmentations:
- Time Warping: A random point along the time axis is displaced left or right by a random distance and the spectrogram is warped accordingly, simulating temporal distortions.
- Frequency Masking: This technique involves randomly masking blocks of consecutive mel frequency channels to emulate a loss of spectral information.
- Time Masking: Similar to frequency masking, time masking involves blocking consecutive time steps to mimic temporal dropout.
The paper also discusses combining these operations into augmentation policies that apply multiple frequency and time masks per utterance, enhancing the robustness of networks trained on the augmented data.
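The masking operations are straightforward to sketch. Below is a minimal NumPy illustration: `freq_time_mask` mirrors the paper's mask-width and mask-count parameters (the defaults shown are illustrative, not the paper's tuned policies), and `time_warp` approximates the paper's sparse-image-warping step with a simpler 1-D piecewise-linear remap:

```python
import numpy as np

def freq_time_mask(spec, F=27, mF=2, T=100, mT=2, rng=None):
    """Frequency and time masking on a log mel spectrogram of shape
    (num_mel_bins, num_frames). F/mF and T/mT play the role of the
    paper's mask-width and mask-count parameters."""
    rng = rng or np.random.default_rng()
    out = spec.copy()
    num_bins, num_frames = out.shape
    fill = out.mean()  # the paper normalizes features so the mean maps to zero

    for _ in range(mF):  # frequency masking: blank a band of mel channels
        f = int(rng.integers(0, F + 1))
        f0 = int(rng.integers(0, max(1, num_bins - f + 1)))
        out[f0:f0 + f, :] = fill

    for _ in range(mT):  # time masking: blank a run of consecutive frames
        t = int(rng.integers(0, min(T, num_frames) + 1))
        t0 = int(rng.integers(0, max(1, num_frames - t + 1)))
        out[:, t0:t0 + t] = fill
    return out

def time_warp(spec, W=40, rng=None):
    """Simplified time warp: displace a random interior frame by up to W
    frames and linearly stretch the two sides. (The paper uses sparse
    image warping; this 1-D remap only approximates that behavior.)"""
    rng = rng or np.random.default_rng()
    num_bins, num_frames = spec.shape
    if num_frames <= 2 * W:
        return spec.copy()
    center = int(rng.integers(W, num_frames - W))
    w = int(rng.integers(1 - W, W))  # keep the warped point strictly interior
    # map each output frame back to a (fractional) input frame position
    src = np.interp(np.arange(num_frames),
                    [0, center + w, num_frames - 1],
                    [0, center, num_frames - 1])
    return np.stack([np.interp(src, np.arange(num_frames), row)
                     for row in spec])
```

Because both functions operate on the feature array alone, they can be dropped into any feature pipeline without touching the waveform or the labels.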
Experimental Validation
The proposed approach was validated using the Listen, Attend, and Spell (LAS) end-to-end ASR model on two well-known datasets: LibriSpeech 960h and Switchboard 300h. Notably, the integration of SpecAugment resulted in state-of-the-art performance:
- LibriSpeech: Without a language model, the model achieved 2.8% WER on the test-clean set and 6.8% WER on the test-other set. With shallow fusion of a language model, performance improved to 2.5% WER on test-clean and 5.8% WER on test-other, surpassing the previous state-of-the-art hybrid system, which reported 7.5% WER on test-other.
- Switchboard: Without a language model, the model achieved 7.2% WER on the Switchboard portion and 14.6% WER on the CallHome portion of the Hub5'00 test set. With shallow fusion of a language model, the results improved to 6.8% and 14.1%, respectively, compared with 8.3% and 17.3% WER for the previous state-of-the-art hybrid systems.
Practical and Theoretical Implications
The implications of SpecAugment are extensive:
- Practical: The augmentation technique provides a significant performance boost while being computationally cheap and easy to implement. It improves generalization through more robust training and can be applied online, on the fly during training, requiring no additional data collection.
- Theoretical: By converting an overfitting ASR problem into an underfitting one, SpecAugment highlights the importance of balancing model complexity with data variability. The method showcases how simple manipulations in the feature space can result in substantial improvements, potentially guiding future research in data augmentation strategies.
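The online aspect noted above can be made concrete with a small sketch: a generator that wraps a dataset iterator so every spectrogram is freshly augmented each epoch, with nothing extra stored on disk. The function and argument names here are illustrative, not from the paper:

```python
import numpy as np

def augmented_batches(dataset, augment, seed=0):
    """Yield (augmented_spectrogram, transcript) pairs on the fly.

    dataset: iterable of (spectrogram, transcript) pairs.
    augment: any SpecAugment-style transform taking (spec, rng) and
             returning a new array of the same shape.
    """
    rng = np.random.default_rng(seed)
    for spec, transcript in dataset:
        # labels pass through untouched; only the features are perturbed
        yield augment(spec, rng), transcript
```

Since a different random mask is drawn every time an utterance is visited, the model effectively sees a new variant of each example on every epoch.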
Future Developments
SpecAugment sets a precedent for data augmentation in ASR and potentially other domains like image and text recognition. Future work could explore:
- Extending Augmentation Strategies: Combining SpecAugment with other augmentation techniques to explore further gains.
- Optimization for Different Models: Adapting and fine-tuning the parameters to optimize performance across various neural network architectures and datasets.
- Real-Time Applications: Implementing SpecAugment in real-time speech recognition systems for enhanced robustness in diverse acoustic environments.
In conclusion, SpecAugment provides a powerful augmentation methodology that enhances the performance of ASR systems significantly. The approach's simplicity, combined with the substantial performance improvements observed, underscores its potential value in advancing ASR technology and contributing to broader applications in machine learning and artificial intelligence.