Overview of wav2vec 2.0 Framework for Self-Supervised Learning of Speech Representations
wav2vec 2.0 represents a significant contribution to the field of speech recognition by demonstrating that effective speech representations can be learned with self-supervised learning on raw audio data, outperforming existing semi-supervised techniques while remaining conceptually simpler. The paper, authored by Baevski, Zhou, Mohamed, and Auli from Facebook AI, introduces a self-supervised framework in which the speech input is masked in the latent space and a contrastive task is solved over quantized latent representations that are learned jointly with the rest of the model.
Model Architecture
The core architecture of wav2vec 2.0 is divided into three main components: a multi-layer convolutional feature encoder, a Transformer network, and a quantization module. The feature encoder processes raw audio to produce latent speech representations. These representations are then masked and fed into the Transformer network to generate contextualized representations. The quantization module converts these latent representations into discrete units, which serve as targets in a contrastive learning task.
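The sketch below illustrates, under stated assumptions, how these three components might be wired together during pre-training. The class and argument names, the mask_embedding parameter, and the tensor shapes are illustrative placeholders rather than the authors' implementation.

```python
# Minimal sketch of the pre-training forward pass: encode raw audio, quantize
# the latent frames to obtain targets, mask a subset of frames, and contextualize
# the masked sequence with the Transformer. All names here are placeholders.
import torch
import torch.nn as nn

class Wav2Vec2Sketch(nn.Module):
    def __init__(self, feature_encoder, transformer, quantizer, dim=512):
        super().__init__()
        self.feature_encoder = feature_encoder   # raw audio -> latent frames z
        self.transformer = transformer           # masked z -> context vectors c
        self.quantizer = quantizer               # z -> discrete targets q
        # learned vector that replaces masked latent frames
        self.mask_embedding = nn.Parameter(torch.randn(dim))

    def forward(self, waveform, mask_indices):
        z = self.feature_encoder(waveform)       # (batch, frames, dim)
        q = self.quantizer(z)                    # quantized targets
        z_masked = z.clone()
        z_masked[mask_indices] = self.mask_embedding
        c = self.transformer(z_masked)           # contextualized representations
        return c, q                              # consumed by the contrastive loss
```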
Feature Encoder
The feature encoder comprises several blocks of temporal convolutions (with normalization and GELU activations) that map the normalized raw waveform to latent speech representations, producing one latent vector for roughly every 20 ms of audio. This stage compresses the audio signal into a much shorter sequence of latent frames, which are then used in subsequent stages of the model.
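A minimal sketch of such an encoder is shown below. The kernel widths and strides follow the configuration reported in the paper (total stride of 320 samples, i.e. about 20 ms per frame at 16 kHz), but details such as normalization placement are simplified, and the class name is an illustrative choice.

```python
# Convolutional feature encoder sketch: a stack of strided temporal convolutions
# that turns a normalized waveform into a sequence of latent frames.
import torch
import torch.nn as nn

class ConvFeatureEncoder(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        # (kernel width, stride) per block, as reported in the paper
        conv_cfg = [(10, 5), (3, 2), (3, 2), (3, 2), (3, 2), (2, 2), (2, 2)]
        layers, in_ch = [], 1
        for k, s in conv_cfg:
            layers += [
                nn.Conv1d(in_ch, dim, kernel_size=k, stride=s, bias=False),
                nn.GroupNorm(1, dim),   # normalization; placement simplified here
                nn.GELU(),
            ]
            in_ch = dim
        self.conv = nn.Sequential(*layers)

    def forward(self, waveform):
        # waveform: (batch, samples), assumed normalized to zero mean / unit variance
        z = self.conv(waveform.unsqueeze(1))     # (batch, dim, frames)
        return z.transpose(1, 2)                 # (batch, frames, dim)
```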
Transformer Network
The Transformer's role is to build contextualized representations from the latent features produced by the encoder. It uses self-attention to capture dependencies over the entire sequence, analogous to masked language modeling in BERT but tailored to continuous speech input.
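As a rough illustration, the sketch below approximates the context network with PyTorch's built-in TransformerEncoder. The convolutional relative positional embedding and the BASE-size hyperparameters (12 blocks, model dimension 768, 8 heads) follow the paper, but dropout, normalization placement, and other layer details are simplified.

```python
# Context network sketch: project latent frames, add a convolutional relative
# positional embedding, then apply a stack of Transformer encoder blocks.
import torch
import torch.nn as nn

class ContextTransformer(nn.Module):
    def __init__(self, in_dim=512, dim=768, heads=8, layers=12):
        super().__init__()
        self.proj = nn.Linear(in_dim, dim)
        # grouped temporal convolution acting as a relative positional embedding
        self.pos_conv = nn.Conv1d(dim, dim, kernel_size=128, padding=64, groups=16)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=3072,
            activation="gelu", batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=layers)

    def forward(self, z):
        x = self.proj(z)                                    # (batch, frames, dim)
        pos = self.pos_conv(x.transpose(1, 2)).transpose(1, 2)
        x = x + pos[:, : x.size(1)]                         # add positional information
        return self.encoder(x)                              # contextualized vectors c
```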
Quantization Module
The quantization module employs product quantization to represent the latent features as discrete units: entries are selected from multiple codebooks via a Gumbel softmax, which keeps the selection differentiable, and the chosen entries are concatenated. Crucially, the model learns these discrete representations jointly with the contextualization process, enhancing the robustness and quality of the speech representations.
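A minimal sketch of such a quantizer is given below, assuming G codebooks of V entries each whose selected entries are concatenated. The class name, the fixed temperature (annealed in practice), and the omission of a final linear projection are simplifications for illustration.

```python
# Product quantization sketch: per codebook, pick one entry via a straight-through
# Gumbel softmax and concatenate the chosen entries into the quantized target q.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GumbelProductQuantizer(nn.Module):
    def __init__(self, in_dim=512, groups=2, entries=320, out_dim=256):
        super().__init__()
        self.groups, self.entries = groups, entries
        self.logits = nn.Linear(in_dim, groups * entries)       # selection logits
        self.codebooks = nn.Parameter(                           # (G, V, out_dim / G)
            torch.randn(groups, entries, out_dim // groups))
        self.temperature = 2.0                                    # annealed during training

    def forward(self, z):
        b, t, _ = z.shape
        logits = self.logits(z).view(b, t, self.groups, self.entries)
        # hard one-hot selection with straight-through gradients
        probs = F.gumbel_softmax(logits, tau=self.temperature, hard=True, dim=-1)
        # pick one entry per codebook and concatenate the chosen entries
        q = torch.einsum("btgv,gvd->btgd", probs, self.codebooks)
        return q.reshape(b, t, -1)                                # (batch, frames, out_dim)
```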
Training and Optimization
Training the wav2vec 2.0 model involves a self-supervised pre-training phase followed by fine-tuning on labeled data. During pre-training, a proportion of the feature encoder outputs is masked, and the model must identify the correct quantized representation for each masked step among a set of distractors. The training objective combines a contrastive loss for this task with a codebook diversity loss that encourages uniform use of the codebook entries.
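The two loss terms can be sketched as follows. The cosine-similarity contrastive loss with K distractors sampled from other masked steps reflects the objective described above, while the diversity term uses a perplexity-based formulation, one common way to encourage uniform codebook use; function names, shapes, and the temperature value are illustrative.

```python
# Contrastive and diversity loss sketches for pre-training.
import torch
import torch.nn.functional as F

def contrastive_loss(c_masked, q_true, q_distractors, temperature=0.1):
    """c_masked: (N, D) context vectors at masked steps
    q_true: (N, D) quantized targets for the same steps
    q_distractors: (N, K, D) negatives drawn from other masked steps."""
    candidates = torch.cat([q_true.unsqueeze(1), q_distractors], dim=1)   # (N, K+1, D)
    sims = F.cosine_similarity(c_masked.unsqueeze(1), candidates, dim=-1)
    logits = sims / temperature
    # the true quantized target always sits at index 0 of the candidate set
    targets = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, targets)

def diversity_loss(codebook_probs):
    """codebook_probs: (G, V) softmax distribution over entries, averaged over a
    batch of frames per codebook. Penalizes low perplexity, i.e. encourages
    all V entries of every codebook to be used."""
    entropy = -(codebook_probs * (codebook_probs + 1e-7).log()).sum(dim=-1)  # (G,)
    g, v = codebook_probs.shape
    return (g * v - entropy.exp().sum()) / (g * v)
```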
Masking Strategy
The model employs a span-based masking strategy: starting positions are sampled at random and a fixed number of consecutive time steps is masked from each start, with spans allowed to overlap. Forcing the model to infer the masked content from surrounding context yields more robust representations.
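A minimal sketch of this sampling scheme is shown below. The values p = 0.065 (proportion of frames chosen as span starts) and a span length of 10 follow the paper; the function itself is an illustrative implementation, not the authors' code.

```python
# Span masking sketch: sample a proportion of frames as span starts and mask a
# fixed number of consecutive frames from each start; spans may overlap.
import numpy as np

def sample_span_mask(num_frames, p=0.065, span=10, rng=None):
    rng = rng or np.random.default_rng()
    mask = np.zeros(num_frames, dtype=bool)
    num_starts = int(round(num_frames * p))
    starts = rng.choice(num_frames - span, size=num_starts, replace=False)
    for s in starts:
        mask[s : s + span] = True
    return mask   # boolean mask over frames; True = masked

# Example: on a 500-frame utterance, roughly half of the frames end up masked
# because overlapping spans are merged.
m = sample_span_mask(500)
print(m.mean())
```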
Fine-Tuning
For fine-tuning, a randomly initialized linear projection is added on top of the context network and the model is trained with a Connectionist Temporal Classification (CTC) loss, together with a modified SpecAugment-style masking of time steps and feature channels to delay overfitting. In this way the pre-trained representations are adapted to labeled datasets for downstream speech recognition tasks.
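As an illustration, a single fine-tuning step with PyTorch's nn.CTCLoss might look like the sketch below; the vocabulary size, the `model` call signature, and the helper name fine_tune_step are assumptions rather than the paper's code.

```python
# CTC fine-tuning sketch: a linear classifier over context vectors trained with
# the CTC loss against character targets.
import torch
import torch.nn as nn

vocab_size = 32                                   # e.g. characters plus the CTC blank
proj = nn.Linear(768, vocab_size)                 # classifier over context vectors
ctc = nn.CTCLoss(blank=0, zero_infinity=True)

def fine_tune_step(model, waveform, targets, target_lengths, optimizer):
    c = model(waveform)                           # (batch, frames, 768) context vectors
    log_probs = proj(c).log_softmax(dim=-1)       # (batch, frames, vocab)
    input_lengths = torch.full((c.size(0),), c.size(1), dtype=torch.long)
    loss = ctc(log_probs.transpose(0, 1),         # CTCLoss expects (frames, batch, vocab)
               targets, input_lengths, target_lengths)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Note that in the paper the convolutional feature encoder is kept frozen during fine-tuning; only the Transformer and the newly added output layer are updated.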
Experimental Results
The experimental validation of wav2vec 2.0 includes tests on the Librispeech dataset across various data availability scenarios:
- Low-Resource Settings: With only 10 minutes of labeled data, wav2vec 2.0 achieved a word error rate (WER) of 4.8/8.2 on Librispeech's clean/other test sets. This performance is a significant improvement over prior methods, highlighting the model's capacity to function effectively with minimal labeled data.
- High-Resource Settings: When utilizing the full 960 hours of labeled Librispeech data, the model achieved a WER of 1.8/3.3, marking a competitive result even compared with state-of-the-art methods that employ more complex training strategies and model architectures.
The results also established new benchmarks on TIMIT phoneme recognition, reducing the phoneme error rate (PER) by a substantial margin compared to previous state-of-the-art models.
Implications and Future Directions
The implications of this research are multifaceted:
- Scalability: wav2vec 2.0’s ability to leverage vast amounts of unlabeled data with minimal labeled data requirements makes it particularly suited for languages lacking extensive annotated datasets.
- Generalization: The robust self-supervised learning approach simplifies model training pipelines and enhances generalization capabilities across different speech tasks and datasets.
- Theoretical Advancements: The innovative use of discrete units learned jointly with contextualized representations marks a progression in how speech models can be designed and optimized.
Speculating on future advancements, combining wav2vec 2.0 with sequence-to-sequence models and experimenting with end-to-end trained language models could yield even lower error rates and wider applicability. Furthermore, integrating wav2vec 2.0 with multilingual datasets could open up new avenues in cross-lingual speech recognition.
In summary, wav2vec 2.0 presented by Baevski et al. offers a robust framework for self-supervised learning of speech representations and sets new performance standards for speech recognition, particularly in low-resource scenarios. Its adoption could significantly broaden the reach and efficacy of speech recognition technologies globally.