HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units (2106.07447v1)

Published 14 Jun 2021 in cs.CL, cs.AI, cs.LG, and eess.AS

Abstract: Self-supervised approaches for speech representation learning are challenged by three unique problems: (1) there are multiple sound units in each input utterance, (2) there is no lexicon of input sound units during the pre-training phase, and (3) sound units have variable lengths with no explicit segmentation. To deal with these three problems, we propose the Hidden-Unit BERT (HuBERT) approach for self-supervised speech representation learning, which utilizes an offline clustering step to provide aligned target labels for a BERT-like prediction loss. A key ingredient of our approach is applying the prediction loss over the masked regions only, which forces the model to learn a combined acoustic and language model over the continuous inputs. HuBERT relies primarily on the consistency of the unsupervised clustering step rather than the intrinsic quality of the assigned cluster labels. Starting with a simple k-means teacher of 100 clusters, and using two iterations of clustering, the HuBERT model either matches or improves upon the state-of-the-art wav2vec 2.0 performance on the Librispeech (960h) and Libri-light (60,000h) benchmarks with 10min, 1h, 10h, 100h, and 960h fine-tuning subsets. Using a 1B parameter model, HuBERT shows up to 19% and 13% relative WER reduction on the more challenging dev-other and test-other evaluation subsets.

HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units

The paper "HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units" proposes a novel approach to self-supervised speech representation learning, addressing several unique challenges inherent to the domain of speech signals. Major hurdles include the presence of multiple sound units per utterance, the absence of a lexicon during pre-training, and the variable lengths and unsegmented nature of sound units. The authors introduce Hidden-Unit BERT (HuBERT), a method that employs offline clustering to create labels for masked prediction tasks, fostering the learning of robust speech representations.

Methodology

HuBERT combines BERT-style masked prediction with innovations tailored to continuous speech input. Its critical components are:

  1. Offline Clustering: Speech frames are clustered with k-means, and the resulting cluster assignments serve as pseudo-labels for the pre-training task.
  2. Masked Prediction Loss: The model is trained to predict the pseudo-labels of masked regions in the input, compelling it to glean acoustic and linguistic cues from the surrounding unmasked frames (see the sketch after this list).
  3. Iterative Refinement: The clustering step is repeated, using improved latent features from an initial HuBERT model to create better cluster targets for subsequent training phases.
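
A minimal sketch of the first two components appears below; it is not the authors' implementation. The features are synthetic stand-ins for MFCCs (or, in later iterations, hidden states of a previously trained model), the encoder is a tiny PyTorch Transformer, and the masking rate, dimensions, and the 100-cluster k-means codebook are illustrative choices.

```python
# Sketch of HuBERT-style pseudo-labeling and masked prediction (illustrative only).
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from sklearn.cluster import KMeans

# --- Step 1: offline clustering ---------------------------------------------
# Iteration 1 would cluster MFCCs; later iterations cluster hidden states of a
# previously trained HuBERT model. Random features stand in for both here.
num_frames, feat_dim, num_clusters = 2000, 39, 100
features = np.random.randn(num_frames, feat_dim).astype(np.float32)
kmeans = KMeans(n_clusters=num_clusters, n_init=10, random_state=0).fit(features)
pseudo_labels = torch.from_numpy(kmeans.labels_).long()          # one cluster ID per frame

# --- Step 2: masked prediction ----------------------------------------------
class TinyEncoder(nn.Module):
    """Stand-in for the convolutional front end plus Transformer encoder."""
    def __init__(self, feat_dim, hidden_dim, num_clusters):
        super().__init__()
        self.proj = nn.Linear(feat_dim, hidden_dim)
        layer = nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(hidden_dim, num_clusters)

    def forward(self, x):                                         # x: (batch, frames, feat_dim)
        return self.head(self.encoder(self.proj(x)))              # logits over cluster IDs

model = TinyEncoder(feat_dim, hidden_dim=256, num_clusters=num_clusters)

x = torch.from_numpy(features).unsqueeze(0)                       # (1, frames, feat_dim)
mask = torch.rand(1, num_frames) < 0.08                           # mask ~8% of frames (the paper masks spans)
x_masked = x.clone()
x_masked[mask] = 0.0                                              # the paper uses a learned mask embedding instead

logits = model(x_masked)
targets = pseudo_labels.unsqueeze(0)                              # (1, frames)
# Key ingredient: the loss is computed over the masked positions only.
loss = F.cross_entropy(logits[mask], targets[mask])
loss.backward()
```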

Experimental Setup

The authors evaluate HuBERT on two pre-training corpora: Librispeech (960 hours) and Libri-light (60,000 hours). The model is fine-tuned on labeled subsets ranging from 10 minutes to 960 hours, and results are reported for three model sizes: Base (90M parameters), Large (300M), and X-Large (1B); a rough check of how these sizes scale is sketched below.
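
As a rough sanity check on how these parameter counts arise, the snippet below estimates the Transformer encoder size from assumed depth and width values (12x768, 24x1024, 48x1280); these layer/width figures and the 4x feed-forward multiplier are assumptions for illustration, not quoted from the paper's configuration tables.

```python
def transformer_params(num_layers: int, d_model: int, ffn_mult: int = 4) -> int:
    """Rough per-layer cost: ~4*d^2 for attention plus ~2*ffn_mult*d^2 for the feed-forward block."""
    return num_layers * (4 * d_model ** 2 + 2 * ffn_mult * d_model ** 2)

for name, layers, width in [("Base", 12, 768), ("Large", 24, 1024), ("X-Large", 48, 1280)]:
    print(f"{name}: ~{transformer_params(layers, width) / 1e6:.0f}M encoder parameters")
# Roughly 85M / 302M / 944M, consistent with the quoted ~90M / 300M / 1B totals
# once embeddings and the convolutional front end are included.
```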

Empirical Results

The HuBERT model matches or surpasses the state-of-the-art wav2vec 2.0, with the following results of particular note:

  • WER Reduction: The X-Large HuBERT model shows up to a 19% and 13% relative reduction in WER on the more challenging dev-other and test-other subsets, respectively (relative reduction is illustrated after this list).
  • Low-Resource Fine-Tuning: With just 10 minutes of labeled data, HuBERT Large achieves 0.1% and 0.6% lower WER than wav2vec 2.0 Large on test-clean and test-other, respectively.
  • Scalability: HuBERT scales effectively from Base to X-Large, providing consistent improvements across the fine-tuning subsets.
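
For reference, "relative reduction" here means the drop in WER expressed as a fraction of the baseline WER; the numbers in the example below are hypothetical and chosen only to reproduce a 19% reduction.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction = (baseline - new) / baseline."""
    return (baseline_wer - new_wer) / baseline_wer

# Hypothetical WERs, not figures from the paper:
print(f"{relative_wer_reduction(10.0, 8.1):.0%}")   # -> 19%
```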

Theoretical and Practical Implications

Applying the prediction loss only over masked frames is critical to HuBERT's resilience to imperfect cluster targets. The substantial WER improvements indicate that HuBERT can be instrumental in low-resource settings, reducing the need for extensive labeled datasets. This positions HuBERT as a prominent model for industrial applications, particularly for rapid deployment across diverse languages and dialects lacking substantial linguistic resources.

Moreover, the iterative refinement process underscores the value of using representations from a previous model iteration to improve pseudo-label quality, in line with broader trends in semi-supervised learning; a schematic outline of the loop follows.
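
The sketch below is a toy outline, not the authors' training code: the pre-train-and-extract step is a stand-in that merely perturbs the features, and only the cluster-then-retrain structure mirrors the paper.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
frames = rng.standard_normal((2000, 39)).astype(np.float32)       # stand-in acoustic features (e.g. MFCCs)

def pretrain_and_extract(features, pseudo_labels):
    """Stand-in for pre-training a HuBERT model on `pseudo_labels` and returning
    intermediate-layer representations of the same frames."""
    # A real run would train the masked-prediction model sketched earlier and
    # read hidden states from one of its Transformer layers.
    return features + 0.1 * rng.standard_normal(features.shape).astype(np.float32)

reps = frames
for iteration in range(2):                                        # the paper uses two clustering iterations
    labels = KMeans(n_clusters=100, n_init=10, random_state=0).fit_predict(reps)
    reps = pretrain_and_extract(reps, labels)                      # better features feed the next clustering round
```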

Future Prospects

Given its design and promising results, HuBERT opens several avenues for future exploration:

  • Single-Phase Training: The authors note the potential to improve training efficiency by consolidating offline clustering and model training into a single phase.
  • Broader Applications: Beyond ASR, the high-quality representations gleaned from HuBERT could be extended to a variety of recognition and generation tasks in speech technology.

Conclusion

Overall, the paper presents a well-founded and methodologically sound approach to enhancing self-supervised speech representation learning. HuBERT's innovative use of masked prediction, coupled with robust iterative refinement strategies, cements its place as a formidable successor to existing methods like wav2vec 2.0. The promising empirical results and scalable architecture underscore its practical relevance and potential impact on the field of speech recognition and beyond.

Authors (6)
  1. Wei-Ning Hsu (76 papers)
  2. Benjamin Bolte (5 papers)
  3. Yao-Hung Hubert Tsai (41 papers)
  4. Kushal Lakhotia (15 papers)
  5. Ruslan Salakhutdinov (248 papers)
  6. Abdelrahman Mohamed (59 papers)
Citations (2,456)