Self-supervised Learning with Random-projection Quantizer for Speech Recognition (2202.01855v2)

Published 3 Feb 2022 in cs.CL, cs.SD, and eess.AS

Abstract: We present a simple and effective self-supervised learning approach for speech recognition. The approach learns a model to predict the masked speech signals, in the form of discrete labels generated with a random-projection quantizer. In particular the quantizer projects speech inputs with a randomly initialized matrix, and does a nearest-neighbor lookup in a randomly-initialized codebook. Neither the matrix nor the codebook is updated during self-supervised learning. Since the random-projection quantizer is not trained and is separated from the speech recognition model, the design makes the approach flexible and is compatible with universal speech recognition architecture. On LibriSpeech our approach achieves similar word-error-rates as previous work using self-supervised learning with non-streaming models, and provides lower word-error-rates and latency than wav2vec 2.0 and w2v-BERT with streaming models. On multilingual tasks the approach also provides significant improvement over wav2vec 2.0 and w2v-BERT.

Authors (5)
  1. Chung-Cheng Chiu (48 papers)
  2. James Qin (20 papers)
  3. Yu Zhang (1400 papers)
  4. Jiahui Yu (65 papers)
  5. Yonghui Wu (115 papers)
Citations (144)

Summary

Self-supervised Learning with Random-projection Quantizer for Speech Recognition

The paper introduces an innovative self-supervised learning approach for speech recognition, termed BEST-RQ (BERT-based Speech pre-Training with Random-projection Quantizer). The method diverges from conventional approaches by employing a random-projection quantizer to generate discrete labels from speech signals, which then serve as the prediction targets for masked frames during the self-supervised learning phase. In speech recognition, self-supervised learning has emerged as a powerful paradigm, improving model performance especially when labeled data is scarce. The paper simplifies the learning framework by separating the quantizer from the model architecture and steering away from explicit representation learning.
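To make the training objective concrete, here is a minimal NumPy sketch of a masked-prediction step in the spirit of BEST-RQ: masked frames are replaced with noise, and cross-entropy over the codebook vocabulary is computed only at those positions. The mask ratio, the stand-in single-matrix "encoder", and all dimensions are illustrative assumptions, not the paper's configuration.

```python
# Sketch of the masked-prediction objective (NumPy): recover the quantizer's
# discrete labels at masked frames only. All sizes are illustrative.
import numpy as np

rng = np.random.default_rng(0)
T, feature_dim, codebook_size = 200, 80, 8192

features = rng.normal(size=(T, feature_dim))        # speech features, one row per frame
labels   = rng.integers(0, codebook_size, size=T)   # discrete targets from the frozen quantizer

# Replace a random subset of frames with noise; only these positions are scored.
mask = rng.random(T) < 0.4
masked = np.where(mask[:, None], rng.normal(size=(T, feature_dim)), features)

# Stand-in for the speech encoder plus softmax head (a single random linear map here).
logits = masked @ rng.normal(size=(feature_dim, codebook_size))

def log_softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

# Cross-entropy over the codebook vocabulary, averaged over masked frames only.
loss = -log_softmax(logits)[mask, labels[mask]].mean()
print(f"masked-prediction loss: {loss:.3f}")
```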

Key Contributions

The authors propose a self-supervised framework that does not train the quantizer at all, significantly reducing the complexity traditionally associated with jointly pursuing representation learning and a self-supervised objective. Instead, the random-projection quantizer uses a randomly initialized projection matrix and codebook, both of which remain fixed throughout pre-training. Decoupling the quantizer from the model in this way permits flexibility, allowing integration with various model architectures, including streaming setups.
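As a rough illustration of this fixed quantizer, the NumPy sketch below projects per-frame features with a frozen random matrix and assigns each frame the index of its nearest randomly initialized codebook vector. The dimensions, codebook size, and the normalized (cosine-style) nearest-neighbor match are assumptions made for the example rather than the paper's exact settings.

```python
# Sketch of a BEST-RQ-style random-projection quantizer (NumPy).
import numpy as np

rng = np.random.default_rng(0)

feature_dim   = 80    # e.g. log-mel filterbank dimension per frame (illustrative)
code_dim      = 16    # dimension of the projected space (illustrative)
codebook_size = 8192  # number of discrete labels (illustrative)

# Both the projection matrix and the codebook are randomly initialized
# and kept frozen for the entire pre-training run.
projection = rng.normal(size=(feature_dim, code_dim))
codebook   = rng.normal(size=(codebook_size, code_dim))

def quantize(features: np.ndarray) -> np.ndarray:
    """Map speech features of shape (T, feature_dim) to discrete labels of shape (T,)."""
    projected = features @ projection                       # (T, code_dim)
    # Nearest-neighbor lookup; normalizing both sides makes this a cosine-similarity match.
    p = projected / np.linalg.norm(projected, axis=-1, keepdims=True)
    c = codebook  / np.linalg.norm(codebook,  axis=-1, keepdims=True)
    return np.argmax(p @ c.T, axis=-1)                      # one label per frame

# Example: 200 frames of synthetic features become 200 target labels.
labels = quantize(rng.normal(size=(200, feature_dim)))
print(labels.shape, labels[:10])
```

The resulting integer labels are the prediction targets used in the masked-prediction objective sketched earlier; because the quantizer receives no gradients, it adds essentially no training cost.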

Experiments show that BEST-RQ achieves word error rates (WERs) comparable to state-of-the-art self-supervised non-streaming models on the LibriSpeech dataset, while outperforming wav2vec 2.0 and w2v-BERT when applied to streaming models. Its gains on multilingual tasks further indicate that the approach scales effectively across diverse linguistic contexts.

Numerical Results and Analysis

On the LibriSpeech benchmark, BEST-RQ showed a marked reduction in latency compared to wav2vec 2.0 and w2v-BERT in streaming scenarios, indicating its suitability for real-time applications. When assessed on the Multilingual LibriSpeech (MLS) dataset, the approach delivered significant WER improvements across languages compared to previously published methods, especially in low-resource settings such as the MLS-10hrs subset.

The authors also examine the relationship between quantizer quality and self-supervised learning efficacy. They observe that while conventional measures of quantizer quality (such as ASR performance when training directly on the quantized tokens) favor VQ-VAE-based quantizers, this advantage dissipates as the pre-training data scale increases. This suggests that, given ample data, networks trained with self-supervised learning can overcome inherent limitations in quantizer quality.

Implications for Future Research

Decoupling model design from quantizer selection opens pathways to further reductions in computational overhead and complexity. As the paper highlights, codebook utilization, that is, the fraction of codes actively used during training, emerges as a critical factor in pre-training success. Tuning the codebook configuration or adopting alternative initialization strategies may therefore lead to more consistent and reliable performance.
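One simple way to monitor this quantity is to track the fraction of codebook entries that appear at least once among the quantized labels. The sketch below does this with a hypothetical codebook_utilization helper over simulated label batches; it illustrates the idea rather than reproducing the paper's exact metric.

```python
# Sketch of tracking codebook utilization during pre-training (NumPy).
import numpy as np

def codebook_utilization(labels: np.ndarray, codebook_size: int) -> float:
    """Fraction of codebook entries that appear at least once in `labels`."""
    used = np.unique(labels)
    return used.size / codebook_size

rng = np.random.default_rng(0)
codebook_size = 8192
# Simulated quantizer outputs for a few batches; in practice these would be the
# discrete labels produced by the frozen random-projection quantizer.
labels_per_batch = rng.integers(0, codebook_size, size=(8, 200))
print(f"utilization: {codebook_utilization(labels_per_batch, codebook_size):.2%}")
```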

For future research, expanding the amount of unlabeled training data, paired with the low-complexity quantization approach presented here, could help address the challenges of building and deploying speech recognition systems in resource-constrained and multilingual settings.

Ultimately, the paper demonstrates that self-supervised learning for speech recognition can be both effective and simple, challenging the necessity of the complex representation learning frameworks traditionally employed in the field. This opens opportunities for more agile and adaptable systems, with the potential to reshape speech processing technologies.
